=Paper= {{Paper |id=Vol-3611/paper5 |storemode=property |title=Application-based spam detection with machine learning algorithms |pdfUrl=https://ceur-ws.org/Vol-3611/paper5.pdf |volume=Vol-3611 |authors=Ali Erbey,Necaattin Barişçi |dblpUrl=https://dblp.org/rec/conf/ivus/ErbeyB22 }} ==Application-based spam detection with machine learning algorithms== https://ceur-ws.org/Vol-3611/paper5.pdf
Application-based spam detection with machine learning algorithms

Ali Erbey 1, Necaattin Barışçı 2
1 Uşak University, Distance Education Vocational School, Department of Computer Programming, Uşak, Turkey
2 Gazi University, Faculty of Technology, Department of Computer Engineering, Ankara, Turkey




Abstract
Today, the use of social media sites such as Facebook, Instagram and Twitter is increasing day by day. Posts shared on social media can even change the agenda by reaching huge masses. On Twitter, a social media site, the agenda topics can be followed through the section called Trend-Topic. This Trend-Topic section may be manipulated by spammers from time to time. In order to avoid such unwanted situations, it is necessary to determine whether a user is spam or not. Machine learning algorithms can classify whether a user is spam or not. Machine learning algorithms also give successful results in areas such as image processing, speech and voice recognition, and malware detection. In this study, the machine learning algorithms Naive Bayes, K Nearest Neighbors, Random Forest, j48 and Multilayer Perceptron were used to classify users. As a result of the evaluations, the Random Forest algorithm made the most successful classification, with an accuracy rate of 88%.

Keywords
Twitter, spam detection, machine learning


IVUS 2022: 27th International Conference on Information Technology
EMAIL: alierbey@gmail.com (A. Erbey); nbarisci@gazi.edu.tr (N. Barışçı)
ORCID: 0000-0002-0930-4081 (A. Erbey); 0000-0002-8762-5091 (N. Barışçı)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

Today, with the widespread use of the internet and the growing use of mobile devices, online social networking sites such as Facebook, Twitter and LinkedIn are becoming more and more popular [1]. These sites are followed by millions of people. In addition to being places where friends, family or acquaintances can be contacted, they are also used as microblogging services, recommendation services, real-time news sources and content-sharing platforms [2]. On Twitter, a popular microblogging site, users share by creating status messages, which are called tweets [3]. The tweets sent by users form the Trend Topic section, which reflects the current agenda topics.

From time to time, the Trend Topic section can be steered in an undesirable direction by messages sent for improper purposes. The heavy use of social media has made it easier for malicious people to abuse these environments [4]. The desire to change the agenda is likewise something that malicious people can exploit. To prevent such abuse, many studies have been carried out in areas such as natural language processing and data mining with data collected from Twitter [5].

A look at the existing studies on detecting spam in Twitter posts shows that there is a large body of work, clustered in certain areas. Twitter spam detection studies fall mainly into three groups: a) those that examine only the tweets, by text mining, b) those that analyze the tweet text in association with the user who sent the tweet, c) those that examine the relations of users with spam.

Text mining-based research mostly focuses on the tweet text. In these studies, researchers first extract features and then classify them with algorithms such as Naive Bayes and j48. Feature
extraction is sometimes followed by feature selection to improve classification accuracy and reduce training time. Gupta and Kumar [6] used multiple linear regression to select important features. Other features, such as the tweet's character count, word count, and like or retweet count, are used in most research.

Some research takes user characteristics into account when deciding whether a user is spam. These features can include account age, the number of followers and followings, the follower/following ratio, and the format of the profile page. However, since these features can easily be changed by the user, they are considered less reliable.

Because user characteristics can change easily, some researchers have studied the relationships between spammers and real users. By examining follower/following relationships, they created a network for each user. Setting up these networks can be costly in terms of computation time, power and data collection time.

The following sections of this study cover obtaining the spam dataset, classification, detection of spam users and feature selection. In the last section, the results obtained are evaluated.

2. Material and method

The dataset is collected before classification. The data collection process plays an important role in the classification process.

2.1. Spamming Twitter dataset

In this study, spam was classified by evaluating user characteristics together with the attributes of users' tweets. This method was chosen because it is less costly in terms of data collection and, since it depends on user characteristics, it is seen to give better results regardless of the tweet content.

For training, a topic was selected from the Turkey Trend Topic list, since a dataset containing spam users was needed. 15000 tweets were collected from this trending topic with the Twitter public API. Then repetitive data, news content and tweets containing only a URL were removed from this dataset, and the remaining 3798 tweets were labeled as spam or not spam. As a result, the 3798 tweets yielded 1666 spam and 2132 non-spam users.

To evaluate which users may send spam tweets, after this labeling up to the last 100 tweets of each user were collected. Since some users did not have 100 tweets, a total of 305,604 tweets was reached. These tweets were then processed for each user, and unnecessary data such as smileys were removed from the tweet text. Because tweets are informal texts, autocorrect libraries were used to correct typos. Then, as a more complex step, new features were derived from this tweet data, such as how often the user tweets, the average number of characters in the user's tweets, or the unique words used in those 100 tweets, and these were represented as columns in the final dataset.

As a result of the data collection and preprocessing stage, a dataset was obtained consisting of 3798 rows, each representing a user, and 22 columns in total, including features such as the age of the account, whether the account is verified, and whether a URL is given on the profile page.

2.2. Classification

The next step after data collection is to determine whether a user is a spam user with different machine learning algorithms.

Weka software was used to classify the obtained data as spam or not spam. Weka is a program developed for machine learning and text mining, intended to assist in the application of machine learning techniques [7].

In this study, the Naive Bayes, k Nearest Neighbor, Random Forest, j48 and Multilayer Perceptron classification algorithms were used in Weka.

The algorithms used in other studies in the literature are shown in Table 1.

Table 1
Algorithms used in other studies
  Authors              Algorithms
  Diale et al. [8]     SVM, RF, c4.5
  Wang [1]             DT, NN, SVM, NB
  Aydın et al. [4]     DT, LR, SVM
  McCord et al. [9]    RF, SVM, NB, KNN

The reason for choosing the NB, KNN, RF, j48 and MLP algorithms used in this study is to create a combination of algorithms that are widely used in the literature and, in addition to them, less used algorithms. The reason for
choosing the most used algorithms is to make comparisons with previous studies. The reason for choosing the less used algorithms is to create an alternative to the frequently used algorithms.

2.2.1. Naive Bayes

The Naive Bayes (NB) algorithm is a simple probabilistic classifier that calculates a set of probabilities by counting the frequency and combinations of values in a given data set. It classifies data according to these probabilities. Naive Bayes is a popular algorithm used in both commercial and open-source email spam filters [11].

2.2.2. k Nearest Neighbors

In 1968, Cover and Hart proposed the k Nearest Neighbor (KNN) algorithm, which they had been working on for a long time [12]. The intuition underlying k Nearest Neighbor classification is quite simple: samples are classified according to the class of their nearest neighbors [13]. Having an efficient algorithm for performing nearest neighbor operations on large datasets can provide rapid improvements for many applications [14]. KNN is one of the useful algorithms in terms of speed.

2.2.3. Random Forest

Breiman [15] developed the Random Forest (RF) method as an extension of classification trees. In the RF algorithm, a random subset of features is considered at each node [16]. It aims to improve classification performance by using more than one decision tree.

2.2.4. j48

The purpose of the decision tree algorithm is to determine how the feature vector behaves for a few samples [17]. In the WEKA data mining tool, J48 is an open-source Java implementation of the C4.5 algorithm [17].

2.2.5. Multilayer Perceptron

Multilayer perceptron (MLP) is an algorithm that can be used effectively for classification and has been used a lot recently. In general, the backpropagation learning technique, which is based on gradient descent, is used in MLP. With this technique, the error between the desired output and the produced output is minimized [18].

2.3. Determination of spam users

The labeled data was divided into 80% training data and 20% test data and evaluated with the NB, KNN, RF, j48 and MLP algorithms. In the dataset, 1666 of the 3798 users are known to be spam users and 2132 are non-spam users. The number of users in the test set is 760, which is 20% of the data. The Naive Bayes algorithm classified users with an accuracy rate of 76%.

The confusion matrix of the NB algorithm is shown in Table 2.

Table 2
NB Confusion Matrix
  True Class    Predicted Positive    Predicted Negative
  Positive      382                   60
  Negative      122                   196

According to the confusion matrix in Table 2, of the 760 people in the 20% test data, 382 people who were spam were classified as spam, and 60 people who were spam were classified as non-spam. 122 non-spam people were classified as spam, while 196 non-spam people were classified as non-spam. The KNN algorithm classified users with an accuracy rate of 74%.

The confusion matrix of the KNN algorithm is shown in Table 3.

Table 3
KNN Confusion Matrix
  True Class    Predicted Positive    Predicted Negative
  Positive      357                   85
  Negative      107                   211

According to the confusion matrix of the KNN algorithm in Table 3, of the 760 people in
the 20% test data, 357 people who were spam were classified as spam, and 85 people who were spam were classified as non-spam. 107 non-spam people were classified as spam, while 211 non-spam people were classified as non-spam.

When the Random Forest algorithm was used, an accuracy rate of 88% was achieved.

Table 4
RF Confusion Matrix
  True Class    Predicted Positive    Predicted Negative
  Positive      406                   36
  Negative      53                    265

According to the confusion matrix of the RF algorithm in Table 4, of the 760 people in the 20% test data, 406 people who were spam were classified as spam, and 36 people who were spam were classified as non-spam. 53 non-spam people were classified as spam, while 265 non-spam people were classified as non-spam.

An accuracy rate of 85% was observed with the J48 algorithm. The confusion matrix of the J48 algorithm is shown in Table 5.

Table 5
J48 Confusion Matrix
  True Class    Predicted Positive    Predicted Negative
  Positive      382                   60
  Negative      50                    268

According to the confusion matrix of the j48 algorithm in Table 5, of the 760 people in the 20% test data, 382 people who were spam were classified as spam, and 60 people who were spam were classified as non-spam. 50 non-spam people were classified as spam, while 268 non-spam people were classified as non-spam.

The MLP algorithm reached an accuracy rate of 82%. The confusion matrix of the MLP algorithm is shown in Table 6.

Table 6
MLP Confusion Matrix
  True Class    Predicted Positive    Predicted Negative
  Positive      390                   52
  Negative      82                    236

According to the confusion matrix of the MLP algorithm in Table 6, of the 760 people in the 20% test data, 390 people who were spam were classified as spam, and 52 people who were spam were classified as non-spam. 82 non-spam people were classified as spam, while 236 non-spam people were classified as non-spam.

2.4. Feature selection

Feature selection is one of the important steps of pattern recognition, machine learning and data mining. Its purpose is to eliminate irrelevant and redundant variables in order to understand the data, reduce the computational requirement, reduce the effect of dimensionality, and improve the performance of the predictor [19]. The columns retained by the Weka software after feature selection are shown in Table 7.

Table 7
Remaining columns after feature selection
  Column Name
  Friend_count      Favourite_freq
  Account_age       Reply_freq
  Tweet_freq        Unique_freq
  Hashtag_freg      Spam

When the data is re-evaluated after feature selection, the performances of the algorithms are as shown in Figure 1.

Figure 1: Performance of algorithm

As shown in Figure 1, similar results were obtained with and without feature selection.

3. Conclusion and discussion

In this study, an attempt was made to detect spam users on Twitter. The data set consists of information on 3798 users, and different machine learning algorithms are used to classify
users. The Random Forest algorithm achieved the highest accuracy rate of 88%. The NB algorithm achieved 76%, KNN 74%, j48 85% and MLP 82% correct classification rates. When the algorithms were applied again after feature selection, it was observed that the accuracy rate decreased for all algorithms except NB.

In addition to the accuracy rates, the number of users whose real class is negative but who are classified as positive in the confusion matrix is also important. Since these users are classified as spam even though they are not spam, they will suffer if the algorithm is trusted, which is undesirable. The results show that the j48 algorithm gives the lowest number, with 50 users.

As a suggestion for future research, different feature extraction optimizations and different machine learning algorithms can be tried on the collected data, and more successful classification results may be achieved.

4. References

[1] A. H. Wang, "Detecting spam bots in online social networking sites: a machine learning approach," in IFIP Annual Conference on Data and Applications Security and Privacy. Springer, Berlin, Heidelberg, 2010. pp. 335-342. doi:10.1007/978-3-642-13739-6_25
[2] M. Pennacchiotti and A.M. Popescu, A machine learning approach to twitter user classification. in Fifth International AAAI Conference on Weblogs and Social Media. 2011.
[3] A. Go, R. Bhayani and L. Huang, Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 2009. 1(12): pp. 2009.
[4] İ. Aydın, M. Sevi and M.U. Salur, "Detection of Fake Twitter Accounts with Machine Learning Algorithms," 2018 International Conference on Artificial Intelligence and Data Processing (IDAP), 2018, pp. 1-4, doi: 10.1109/IDAP.2018.8620830.
[5] E.S. Akgül, C. Ertano and B. Diri, Twitter verileri ile duygu analizi. Pamukkale University Journal of Engineering Sciences. 2016, Vol. 22 Issue 2, pp. 106-110.
[6] D.K. Gupta and A. Kumar, Spam And Sentiment Analysis Model For Twitter Data Using Statistical Learning. in Proceedings of the Third International Symposium on Computer Vision and the Internet. 2016. ACM.
[7] G. Holmes, A. Donkin and I.H. Witten, Weka: A machine learning workbench. 1994.
[8] M. Diale, T. Celik, and C. Van Der Walt, Unsupervised feature learning for spam email filtering. Computers & Electrical Engineering, 2019. 74: pp. 89-104.
[9] M. McCord and M. Chuah, Spam detection on twitter using traditional classifiers. in International Conference on Autonomic and Trusted Computing. 2011. Springer.
[10] T.R. Patil and S. Sherekar, Performance analysis of Naive Bayes and J48 classification algorithm for data classification. International Journal of Computer Science and Applications, 2013. 6(2): pp. 256-261.
[11] V. Metsis, I. Androutsopoulos, and G. Paliouras, Spam filtering with naive bayes - which naive bayes? in CEAS. 2006. Mountain View, CA.
[12] A. Kataria and M. Singh, A review of data classification using k-nearest neighbour algorithm. International Journal of Emerging Technology and Advanced Engineering, 2013. 3(6): pp. 354-360.
[13] P. Cunningham and S.J. Delany, k-Nearest neighbour classifiers. Multiple Classifier Systems, 2007. 34(8): pp. 1-17.
[14] M. Muja and D.G. Lowe, Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014. 36(11): pp. 2227-2240.
[15] L. Breiman, Random forests. Machine Learning, 2001. 45(1): pp. 5-32.
[16] K.J. Archer and R.V. Kimes, Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 2008. 52(4): pp. 2249-2260.
[17] G. Kaur and A. Chhabra, Improved J48 classification algorithm for the prediction of diabetes. International Journal of Computer Applications, 2014. 98(22).
[18] A.R. Yılmaz, O. Yavuz, and B. Erkmen, Training multilayer perceptron using differential evolution algorithm for signature recognition application. in 2013 21st Signal Processing and Communications Applications Conference (SIU). 2013. IEEE.
[19] F. Asdaghi and A. Soleimani, An effective feature selection method for web spam detection. Knowledge-Based Systems, 2019. 166: pp. 198-206.
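
Supplementary sketch: recomputing the reported metrics

The accuracy rates in Section 2.3 and the false-positive comparison in Section 3 can be recomputed from the four cells of each confusion matrix. The Python sketch below is an illustration only and is not the Weka workflow used in the study. The names `ConfusionMatrix`, `REPORTED`, `best_by_accuracy` and `fewest_false_positives` are hypothetical, and the J48 entry uses the counts given in the running text (50 non-spam users classified as spam, 268 classified as non-spam). Accuracy is taken as (TP + TN) / total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Cell counts of a binary confusion matrix (positive class = spam)."""
    tp: int  # spam classified as spam
    fn: int  # spam classified as non-spam
    fp: int  # non-spam classified as spam
    tn: int  # non-spam classified as non-spam

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn

    @property
    def accuracy(self) -> float:
        # Share of all test users that were classified correctly.
        return (self.tp + self.tn) / self.total


# Counts as reported for the 760-user test set (assumption: J48 row
# follows the running text rather than any other source).
REPORTED = {
    "NB": ConfusionMatrix(tp=382, fn=60, fp=122, tn=196),
    "KNN": ConfusionMatrix(tp=357, fn=85, fp=107, tn=211),
    "RF": ConfusionMatrix(tp=406, fn=36, fp=53, tn=265),
    "J48": ConfusionMatrix(tp=382, fn=60, fp=50, tn=268),
    "MLP": ConfusionMatrix(tp=390, fn=52, fp=82, tn=236),
}


def best_by_accuracy(matrices: dict) -> str:
    """Return the name of the classifier with the highest accuracy."""
    return max(matrices, key=lambda name: matrices[name].accuracy)


def fewest_false_positives(matrices: dict) -> str:
    """Return the classifier that flags the fewest non-spam users as spam."""
    return min(matrices, key=lambda name: matrices[name].fp)


for name, cm in REPORTED.items():
    print(f"{name:>3}: accuracy={cm.accuracy:.1%}, false positives={cm.fp}")
print("highest accuracy:", best_by_accuracy(REPORTED))
print("fewest false positives:", fewest_false_positives(REPORTED))
```

With these counts the accuracies come out at roughly 76.1% (NB), 74.7% (KNN), 88.3% (RF), 85.5% (J48) and 82.4% (MLP). Random Forest has the highest accuracy, while J48 misclassifies the fewest non-spam users as spam.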