=Paper=
{{Paper
|id=Vol-3713/paper-5
|storemode=property
|title=Classification benchmarking of fake account datasets using machine learning models and feature selection strategies
|pdfUrl=https://ceur-ws.org/Vol-3713/paper_5.pdf
|volume=Vol-3713
|authors=Danilo Caivano,Mary Cerullo,Domenico Desiato,Giuseppe Polese
|dblpUrl=https://dblp.org/rec/conf/damocles/CaivanoCDP24
}}
==Classification benchmarking of fake account datasets using machine learning models and feature selection strategies==
Danilo Caivano², Mary Cerullo¹, Domenico Desiato²·* and Giuseppe Polese¹
¹ Department of Computer Science, University of Salerno, via Giovanni Paolo II n.132, 84084 Fisciano (SA), Italy
² Department of Computer Science, University of Bari Aldo Moro, via Edoardo Orabona n.4, 70125 Bari (BA), Italy
Abstract
Social network platforms are widely used for social interactions, and due to their increasing number of registered users, it is crucial to verify the authenticity of accounts and the data they generate. In particular, the phenomenon of malicious accounts is a crucial issue that social network platforms have to deal with, and new methodologies and strategies are needed to discriminate malicious accounts automatically. To this end, data from social network platforms play a crucial role in defining analytical activities devoted to fake account discrimination. In this proposal, we organized and cleaned fake account datasets collected from online sources and provide classification results obtained by employing machine learning models and feature selection strategies. Moreover, we extend the classification results by using a newly proposed fake account dataset collected through a data crawling activity. Experimental results produced by employing several machine learning models and feature selection techniques on the fake account datasets reveal discrimination improvements when feature selection strategies are exploited. Our proposal aims to support stakeholders, data analysts, and researchers by providing them with fake account datasets cleaned and organized for analytical activities, together with statistical classification results obtained using machine learning models and feature selection strategies.
Keywords
Data Analytics, Fake accounts, Machine Learning, Feature selection, Social networks
1. Introduction
Social networks have simplified how people communicate and exchange information globally. In
particular, social interaction platforms such as Instagram, Twitter, Tumblr, etc., have improved
the dynamics of human interaction, impacting the daily lives of their users and the entire society.
In social network platforms, it is crucial to monitor the popularity of a profile. In particular,
the number of friends or followers significantly determines the profile’s influence and reputation.
Social network profiles with a large following are considered more influential and attractive
for better-paid advertisements. A common practice of several social network users is to buy fake followers to appear more influential. Users are also stimulated by the meagre price: a few dollars buy hundreds of fake followers. Of course, such a practice might be considered harmless if used to support individual vanity, but dangerous if used to make an account seem more reliable and influential. For example, spammers can use fake followers to promote products, trends, and fashions, compromising the integrity of the social network platforms [1]. Moreover, phishing campaigns are commonly spread by exploiting fake accounts created ad hoc with many followers and followings to appear trustworthy, and in most cases, the users attracted by such numbers are defrauded [2].

AVI 2024: 17th International Conference on Advanced Visual Interfaces, 3–7 June 2024, Arenzano, Italy
* Corresponding author.
danilo.caivano@uniba.it (D. Caivano); m.cerullo13@studenti.unisa.it (M. Cerullo); domenico.desiato@uniba.it (D. Desiato); gpolese@unisa.it (G. Polese)
ORCID: 0000-0001-5719-7447 (D. Caivano); 0000-0002-6327-459X (D. Desiato); 0000-0002-8496-2658 (G. Polese)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
Many anomalous accounts, such as spammers, bots, cyborgs, and trolls, are disseminated on social network platforms. In particular, spammers try to share malicious content or dangerous links. Bots are automated accounts that simulate human behaviour, trying to perform typical human actions automatically. In contrast, cyborgs are partly operated by humans and are not necessarily malicious.
In this proposal, we focus on the discrimination of fake accounts on social network platforms.
In particular, we present an extensive empirical evaluation of fake account datasets available
in online sources. In detail, we employ several machine learning models and evaluate their
capabilities in terms of fake account discrimination. Further, we exploit different feature selection techniques to enhance the models' discrimination performance.
The general idea of our proposal is to offer stakeholders, data analysts, and researchers
who work to define new methodologies for malicious account discrimination the possibility of
quickly accessing fake account datasets that have been cleaned and organized for analytical
activities. Moreover, we offer comparative results in terms of fake account discrimination.
To this end, the usage of machine learning models is motivated by the fact that we want to
provide a baseline for classification performances. In contrast, feature selection techniques
are employed to improve the computed baseline. Moreover, we highlight the combinations of
feature selection and model to adopt on the specific dataset to achieve the best classification
results. Additionally, we extend our analysis by using a new fake account dataset collected by
exploiting data crawling activity.
In summary, the main contributions of our proposal are: i) fake account datasets collected from online sources, cleaned and organized for analytical activities; ii) baseline and improved classification performance results obtained using machine learning models and feature selection techniques; and iii) a new fake account dataset collected through a data crawling activity.
The remainder of the paper is organized as follows. Section 2 reports relevant works con-
cerning fake account discrimination, whereas Section 3 presents our methodology. Section 4
shows results, and conclusions and future directions are provided in Section 5.
2. Related work
A key factor to be monitored for social network platforms is the identification of malicious
accounts. The automatic collection of social network accounts has been addressed in [3]. In
particular, the authors developed an ad-hoc web crawler to automatically collect and filter
public Twitter accounts and organize the data into training and testing datasets. Moreover, a multi-layer perceptron neural network has been modelled and trained on nine features characterizing a fake account. Another machine learning approach is provided in [4]. In
particular, the authors propose DeepProfile, which performs account classification through a
dynamic CNN to train a learning model, which exploits a novel pooling layer to optimize the
neural network performance in the training process. Moreover, in [5], content and metadata
at the tweet level have been exploited for recognizing bots employing a deep neural network
based on contextual long short-term memory (LSTM). In particular, this approach extrapolates
contextual features from user metadata and uses the LSTM deep nets to process the tweet text,
yielding a model capable of obtaining high classification accuracy with little data.
Statistical text analysis is exploited in a novel general framework to discover compromised
accounts [6]. The framework relies on the consideration that an account’s owner uses his/her
profile in a way that is entirely different from the same account when it is hacked, enabling a
syntactic analyzer to identify the features used by hackers (or spammers) when they compromise
a genuine account. Thus, a language modelling algorithm is used to extrapolate the similarities
between language models of genuine users and those of hackers/spammers to characterize
hackers’ features and use them in supervised machine learning approaches.
Further approaches devoted to fake account discrimination also considered feature engi-
neering and/or selection issues [7, 8]. Nevertheless, most of the proposals, including a feature
engineering process, rely on domain experts or include manual work for characterizing mean-
ingful features that permit a classifier to work with high accuracy. For instance, in [9], the
authors have enumerated the main characteristics to discriminate a fake account from a genuine
one. In particular, by manually examining different types of accounts, they extracted a set of
features to highlight the characteristics of malicious accounts. Moreover, they analyzed the
liking behaviour of each account to build an automated mechanism to detect fake likes on
Instagram. Furthermore, in [10], the authors propose a novel technique to discriminate real
accounts on social networks from fake ones. Their technique exploits knowledge automatically
extracted from big data to characterize typical patterns of fake accounts. Additionally, in [11],
the authors extend the previous work by exploiting data correlations to develop a new feature
engineering strategy to augment the social network account dataset with additional features,
aiming to enhance the capability of existing machine learning strategies to discriminate fake
accounts.
Compared to the fake account discrimination approaches described above, in this proposal, we
release cleaned and organized fake account datasets collected from online sources for analytical
activities. To this end, we exploit machine learning models and feature selection techniques
to compute baseline and improved classification performance results. Finally, we extend our
analysis by exploiting a new fake account dataset collected by data crawling activity.
In what follows, we introduce our methodology steps to compute classification results.
3. Methodology
This section presents our approach for computing data utility metrics over malicious account datasets. In particular, in the first phase of the approach, we employ machine learning models to classify the malicious account datasets and compute data utility metrics over them. Then, in the second phase, we employ feature selection strategies over the malicious account datasets to improve the discrimination performance of the machine learning models.
Figure 1: Classifier working on malicious account datasets (each dataset Dataset1 … Datasetn is fed to the classifier, which outputs a fake/real (F/R) label).
The first phase of our approach (represented in Figure 1) aims to compute data utility metrics over the malicious account datasets. In particular, we exploit machine learning models to compute classification metrics over each dataset in order to yield a first baseline that describes all datasets in terms of data utility.
The second phase of our approach (represented in Figure 2) aims to improve the data utility metrics obtained in the previous step. In particular, we employ feature selection strategies over each dataset to improve the discrimination performance of the machine learning models.
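As a minimal sketch of the two phases (assuming scikit-learn and synthetic data; the specific model, selector, and dataset here are placeholders, not the authors' actual pipeline):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Stand-in for one fake/real account dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Phase 1: baseline data utility, i.e. cross-validated classification accuracy.
baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

# Phase 2: apply a feature selection strategy first, aiming to beat the baseline.
X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)
improved = cross_val_score(RandomForestClassifier(random_state=0), X_sel, y, cv=5).mean()

print(f"baseline={baseline:.3f} with-selection={improved:.3f}")
```

In the full study, this loop is repeated for every dataset, model, and feature selection strategy described below.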
In what follows, we introduce malicious accounts datasets, machine learning models, feature
selection strategies employed, and results obtained.
Figure 2: Feature selection and classification (Dataset → Feature Selection → Classifier → F/R).
4. Experimental Evaluation
In this section, we describe the datasets employed in our study and the data preparation
techniques exploited to organize and clean them for data analytics activities. Next, we provide
details concerning the machine learning models, feature selection strategies, and the proposed
fake account dataset. Lastly, we present classification results achieved by using machine learning
models and feature selection techniques over the collected datasets.
4.1. Data preparation and dataset descriptions
In this section, we describe the collected datasets and the data preparation activities used to clean and organize them for data analysis. Overall, our study involves ten different datasets, four related to Instagram accounts (one of which is the proposed dataset described in Section 4.4) and six related to X (formerly Twitter) accounts. In what follows, we describe the datasets collected from online sources and provide their data sources.
• IG_1: The dataset consists of 1194 Instagram accounts, of which 994 real and 200 fake,
and contains 10 features. The dataset source can be found at the following link: https:
//github.com/Blacjar/instafake-dataset#fake-account-detection
• IG_2: The dataset consists of 785 Instagram accounts, of which 93 real and 692 fake, and
contains 13 features. The dataset source can be found at the following link: https://www.
kaggle.com/datasets/rezaunderfit/instagram-fake-and-real-accounts-dataset/data
• IG_4: The dataset consists of 576 Instagram accounts, of which 288 real and 288
fake, and contains 12 features. The dataset source can be found at the following link:
https://www.kaggle.com/datasets/jasvindernotra/instagram-detecting-fake-accounts/
data?select=instagram.csv
• TW_3: The dataset consists of 2818 Twitter accounts, of which 1481 real and 1337
fake, and contains 34 features. The dataset source can be found at the following link:
https://www.kaggle.com/datasets/whoseaspects/genuinefake-user-profile-dataset/data
• TW_7: The dataset consists of 9019 Twitter accounts, of which 5706 real and 3313 fake,
and contains 16 features. The dataset source can be found in [12].
The datasets mentioned below can all be found at the following link: https://botometer.osome.
iu.edu/bot-repository/datasets.html
• TW_8 (gilani2017): The dataset consists of 2503 Twitter accounts, of which 1413 real and 1090 fake, and contains 41 features.
• TW_9_LP (caverlee11): The dataset consists of 41499 Twitter accounts, of which 19276
real and 22223 fake, and contains 8 features.
• TW_9_M (midterm2018): The dataset consists of 50538 Twitter accounts, of which 8092
real and 42446 fake, and contains 18 features.
• TW_9_10: The dataset consists of 15,810 Twitter accounts, of which 7905 real and 7905 fake, and contains 43 features. In particular, this dataset was obtained by concatenating multiple datasets: verified-2019 and celebrity-2019 (real accounts), and pronbots-2019, botwiki-2019, political-bots-2019, and vendor-purchased-2019 (fake accounts).
Concerning data preparation activities, we replaced null values with zero and adopted an encoding strategy for non-numeric features so that machine learning models can be applied. In particular, the encoding strategy exploits a dictionary to map the values of categorical columns into integers. This strategy allowed us to maintain the uniqueness of the tuples while avoiding altering the original data.
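The preparation step described above can be sketched as follows (a hedged illustration assuming pandas; the `prepare` function and the toy columns are hypothetical names for illustration, not part of the released code):

```python
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a fake-account dataset as described in the text:
    null values -> 0, categorical columns mapped to integers
    via a per-column dictionary."""
    df = df.fillna(0)
    for col in df.select_dtypes(include="object").columns:
        # One integer per distinct value keeps tuples unique
        # without otherwise altering the original data.
        mapping = {v: i for i, v in enumerate(pd.unique(df[col]))}
        df[col] = df[col].map(mapping)
    return df

# Toy example: a null follower count and a categorical language column.
raw = pd.DataFrame({"followers": [10, None, 300],
                    "lang": ["en", "it", "en"]})
clean = prepare(raw)
```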
In the next section, we describe the machine learning models adopted and the parameter tuning used.
4.2. Adopted machine learning models and parameter tuning
This section describes the machine learning models employed in our study and the parameter tuning used for each model. In particular, we involved Decision Tree (DT) [13], K-Nearest Neighbors (KNN) [14], Logistic Regression (LR) [15], Gaussian Naïve Bayes (NB) [16], Random Forest (RF) [17], and Support Vector Classifier (SVC) [20] by considering their versions available in the Scikit-learn¹ Python library. Moreover, for each model, we performed hyperparameter tuning using GridSearchCV with 5-fold cross-validation [18], aiming to identify the best combination of hyperparameters for the predictive models based on accuracy scores. Details concerning the machine learning models and parameter tuning are reported below.
The Decision Tree (DT) model is a supervised learning model that, given a labelled dataset, recursively defines a tree structure where, at each level, local decisions are associated with a feature. After constructing the tree, each path from the root to a leaf node represents a classification pattern [13]. In detail, the hyperparameters tuned for the DT model are max_leaf_nodes in a range from 2 to 100 and [2, 3, 4] for min_samples_split.
The K-Nearest Neighbors (KNN) algorithm is an instance-based technique that operates under the assumption that new instances are similar to those already provided with a class label. In this algorithm, all instances are treated as points in an n-dimensional space and are classified based on their similarity to other instances. In detail, the hyperparameters tuned for the KNN model are a range from 1 to 25 for the n_neighbors parameter, ['uniform', 'distance'] for the weights parameter, and [1, 2] for the p parameter.
Logistic Regression (LR) is a supervised learning approach capable of inferring a vector of weights whose elements are associated with each feature. In particular, a weight specifies the relevance of a feature with respect to the classification task [15]. In detail, the hyperparameters tuned for the LR model are 'l2' for penalty and [0.001, 0.01, 0.1, 1, 10, 100, 1000] for the C parameter.
Gaussian Naïve Bayes (NB) is a supervised learning method based on the application of Bayes' theorem with the assumption of conditional independence between each pair of variables. In detail, the hyperparameter tuned for the NB model is var_smoothing, ranging over powers of ten with exponents from 0 to -9.
The Random Forest (RF) model is an approach based on the ensemble concept [19], i.e., it exploits a set of DTs to derive a global model that performs better than the single DTs composing the ensemble. In detail, the hyperparameters tuned for the RF model are the bootstrap parameter set to true, [10, 20, 30, 100] for max_depth, [2, 3] for max_features, [3, 4, 5] for min_samples_leaf, [8, 10, 12] for min_samples_split, [10, 20, 30, 100] for n_estimators, and 'gini' or 'entropy' for the criterion.
Support Vector Classification (SVC) is a model in which the training instances are mapped to points in a space and organized into separated groups. The SVC tries to achieve the optimal separation hyperplane by computing the largest margins of separation between the different classes [20]. In detail, the hyperparameters tuned for the SVC model are [0.1, 1, 10, 100, 1000] for the C parameter, [1, 0.1, 0.01, 0.001, 0.0001] for gamma, and 'rbf' as the kernel.
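The tuning procedure can be sketched as follows (an illustrative example assuming scikit-learn and synthetic data; it reuses the DT grid reported above but is not the authors' actual experiment script):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Stand-in for one prepared fake/real account dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# DT grid as described in the text: max_leaf_nodes in 2..100,
# min_samples_split in [2, 3, 4]; accuracy-driven 5-fold grid search.
param_grid = {"max_leaf_nodes": list(range(2, 101)),
              "min_samples_split": [2, 3, 4]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the other five models, each with its own parameter grid.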
In the next section, we describe the feature selection strategies adopted and their settings.
¹ https://scikit-learn.org/
4.3. Adopted feature selection strategies
This section describes the feature selection strategies employed and their settings. In particular, we involved Spearman's Correlation (SC) [21], Information Gain (IG) [22], Recursive Feature Elimination with Cross-Validation (RFECV) [23], Minimum Redundancy Maximum Relevance (MRMR) [24], and Principal Component Analysis (PCA) [21]. Details of the feature selection strategies are provided below.
Spearman's Correlation [21] is a technique that measures the correlation between two variables. In particular, a positive value indicates that the variables have a positive relationship, whereas a negative value indicates a negative relationship. Moreover, if the value is zero, no relationship between the two variables is defined. For this reason, variables with a higher absolute correlation value may be considered more relevant and retained.
Information Gain [22] is a non-parametric, entropy-based technique that measures the dependence between two variables. In particular, if such dependence is equal to zero, the two variables are independent, while a higher value indicates greater dependence. Selected features are those with the highest scores, while discarded features are those with scores closest to zero.
Recursive Feature Elimination with Cross-Validation (RFECV) [23] is a technique for extracting the most significant features by iteratively removing the weakest one. A Logistic Regression estimator provides information on the importance of features, and the best subset of features is then selected using accuracy scores in combination with cross-validation.
Minimum Redundancy Maximum Relevance (MRMR) [24] selects the features with the highest relevance and the least redundancy through mutual information. In this case, the number of features to be selected is chosen based on the k returned by the RFECV method.
Finally, Principal Component Analysis (PCA) [21] is a dimensionality reduction method that identifies principal components along the directions that preserve most of the variance in the original data. To this end, we use PCA by selecting the number of components that preserves 95% of the data variance, considering all features.
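As a rough illustration of two of these strategies (assuming scikit-learn and synthetic data; a sketch, not the authors' implementation), RFECV with a Logistic Regression estimator and variance-preserving PCA can be run as:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from sklearn.decomposition import PCA
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=15,
                           n_informative=5, random_state=0)

# RFECV: the weakest features are removed recursively, and the best
# subset is chosen by cross-validated accuracy.
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=5, scoring="accuracy")
rfecv.fit(X, y)
k = rfecv.n_features_  # in the text, this k is reused as the MRMR target

# PCA: keep enough components to preserve 95% of the data variance.
X_pca = PCA(n_components=0.95).fit_transform(X)

print(f"RFECV kept {k} features; PCA kept {X_pca.shape[1]} components")
```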
In the next section, we provide details concerning the proposed fake accounts dataset.
4.4. Proposed fake accounts dataset
In this section, we describe the proposed fake accounts dataset. In particular, we collected the Proposed Dataset (PD) from the Instagram social platform. The dataset consists of 1937 accounts, divided into 944 real accounts and 993 fake accounts, and contains 11 features. In detail, we collected Instagram accounts by exploiting a data crawling technique implemented with an ad hoc script to extract the required features. More specifically, fake accounts were purchased using the online service https://serviceiggrowthstar.it, whereas real accounts were collected by involving real users and verified manually. All the above-described datasets are organized, cleaned, and made accessible at the following GitHub repository: https://github.com/Macerul/FakeAccountDatasets.git. Additionally, we provide in-depth statistics concerning additional metrics computed using machine learning models over each dataset, i.e., Precision, Recall, F1, and Accuracy, at the following link: https://github.com/Macerul/FakeAccountDatasets_Scores.git
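The per-dataset scores published in the repository (Precision, Recall, F1, and Accuracy) can be computed with standard scikit-learn metrics; the fake(1)/real(0) labels below are illustrative only:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Illustrative ground-truth and predicted labels: fake = 1, real = 0.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

scores = {"Accuracy":  accuracy_score(y_true, y_pred),
          "Precision": precision_score(y_true, y_pred),
          "Recall":    recall_score(y_true, y_pred),
          "F1":        f1_score(y_true, y_pred)}
print(scores)
```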
In the next section, we describe experimental results achieved by employing machine learning
models and feature selection techniques over each dataset.
Figure 3: Accuracy results of the KNN model over the collected fake account datasets.
Figure 4: Accuracy results of the RF model over the collected fake account datasets.
(Each figure plots, for every dataset on the x-axis, the accuracy obtained using only the model (B) and with PCA, SC, IG, RFECV, and MRMR as feature selection strategies.)
4.5. Results
In order to compute classification results, we ran several experimental sessions in which different classification models were trained over the datasets described in Section 4.1. In particular, we discuss the performance achieved by the employed predictive models over each dataset and evaluate the feature selection strategies described in Section 4.3 in terms of classification improvements. In detail, Figures 3 to 8 report the results obtained by each classification model over each dataset (x-axis), with accuracy on the y-axis. Moreover, each figure presents the results obtained by using only the model (B), and with Principal Component Analysis (PCA), Spearman's Correlation (SC), Information Gain (IG), Recursive Feature Elimination with Cross-Validation (RFECV), and Minimum Redundancy Maximum Relevance (MRMR) as feature selection strategies.
Figure 3 reports classification results achieved by employing the KNN model. In particular, it is possible to notice that the classification results computed over the IG_1, IG_4, TW_8, TW_9_10, and PD datasets by exploiting SC, IG, and RFECV overcome the baseline results computed by using only the KNN model, whereas those computed over the remaining datasets preserve the baseline. Moreover, the best classification results are obtained over the TW_3, TW_7, TW_9_LP, and TW_9_M datasets, whereas the worst are obtained over the TW_8 dataset. In general, RFECV is the feature selection strategy offering the most improvement in terms of fake account discrimination when combined with the KNN model.
Figure 4 reports classification results achieved by employing the RF model. In particular, the classification results computed over the IG_2 dataset by exploiting IG and RFECV overcome the baseline results computed by using only the RF model, whereas those computed over the remaining datasets preserve the baseline. Moreover, the best classification results are obtained over the TW_3, TW_7, TW_9_LP, TW_9_M, and TW_9_10 datasets, whereas the worst classification results are obtained over the TW_8 dataset. In general, RFECV and IG are the feature selection strategies offering the most improvement in terms of fake account discrimination when combined with the RF model.
Figure 5: Accuracy results of the DT model over the collected fake account datasets.
Figure 6: Accuracy results of the NB model over the collected fake account datasets.
Figure 5 reports classification results achieved by employing the DT model. In particular,
classification results computed over the IG_1 and IG_2 datasets by exploiting SC and IG overcome
the baseline results computed by using only the DT model, whereas those computed over
the remaining datasets preserve the baseline. Moreover, the best classification results are
obtained over the TW_3, TW_7, TW_9_LP, TW_9_M, and TW_9_10 datasets, whereas the
worst classification results are obtained over the TW_8 dataset. In general, SC and IG are the feature selection strategies offering the most improvement in terms of fake account discrimination when combined with the DT model.
Figure 6 reports classification results achieved by employing the NB model. In particular, it is possible to notice that the classification results computed over the IG_1, IG_2, IG_4, TW_7, TW_8, TW_9, and TW_9_10 datasets by exploiting IG, RFECV, and MRMR overcome the baseline results computed by using only the NB model, whereas those computed over the remaining datasets preserve the baseline. Moreover, the best classification results are obtained over the TW_3 and TW_9 datasets, whereas the worst are obtained over the TW_8 dataset. In general, IG and RFECV are the feature selection strategies offering the most improvement in terms of fake account discrimination when combined with the NB model.
Figure 7 reports classification results achieved by employing the SVC model. In particular, it is possible to notice that the classification results computed over the IG_2, IG_4, and TW_8 datasets by exploiting MRMR and RFECV overcome the baseline results computed by using only the SVC model, whereas those computed over the remaining datasets preserve the baseline. Moreover, the best classification results are obtained over the TW_3, TW_7, TW_9_LP, TW_9_M, and TW_9_10 datasets, whereas the worst classification results are obtained over the TW_8 dataset. In general, RFECV is the feature selection strategy offering the most improvement in terms of fake account discrimination when combined with the SVC model.
Figure 7: Accuracy results of the SVC model over the collected fake account datasets.
Figure 8: Accuracy results of the LR model over the collected fake account datasets.
Figure 8 reports classification results achieved by employing the LR model. In particular, it
is possible to notice that classification results computed over the IG_1 and TW_8 datasets by
exploiting SC and RFECV overcome the baseline results computed by using only the LR model,
whereas those computed over the remaining datasets preserve the baseline. Moreover, the best
classification results are obtained over TW_3, TW_7, TW_9_LP, and TW_9_M datasets, whereas
the worst classification results are obtained over the TW_8 dataset. In general, RFECV is the feature selection strategy offering the most improvement in terms of fake account discrimination when combined with the LR model.
In order to evaluate the number of features selected by each feature selection technique over each collected dataset, we report the obtained results in Figure 9. In particular, the x-axis represents the analyzed datasets, whereas the y-axis represents the number of features selected for each dataset. Each line in Figure 9 corresponds to one of Spearman's Correlation (SC), Information Gain (IG), and Recursive Feature Elimination with Cross-Validation (RFECV). We do not report results concerning Principal Component Analysis (PCA) and Minimum Redundancy Maximum Relevance (MRMR), since PCA does not yield a number of selected features and MRMR yields the same number of features as RFECV (see Section 4.3). In general, as Figure 9 shows, the IG strategy selects the minimum number of features for almost all datasets, whereas RFECV yields the maximum for all datasets except TW_3. Concerning SC, it exceeds the RFECV results only over the TW_3 dataset and yields fewer features than IG only over the IG_1, IG_4, and PD datasets.
Figure 9: Number of selected features by each feature selection strategy w.r.t. the analyzed datasets (one line per strategy: SC, IG, RFECV).
As illustrated by our results, feature selection strategies improve the classification results of machine learning models. In particular, the model achieving the best results is RF, whereas, for almost all models, RFECV is the feature selection strategy offering the most improvement in fake account discrimination, even though it selects a large number of features.
5. Conclusions
The increasingly widespread use of malicious accounts can compromise the trustworthiness of social network platforms. In this context, the number of techniques for detecting fake accounts has grown proportionally to the number of new algorithms developed for harmful purposes, and it is necessary to collect and organize data associated with social network accounts to improve discrimination capabilities. Under this view, we cleaned and organized fake account
datasets from online sources for analytical activities. Evaluation results achieved over different
machine learning models demonstrated that feature selection strategies improve classification
performances. The main objective of our proposal is to support stakeholders, data analysts, and
researchers by offering the possibility of quickly accessing fake account datasets together with
machine learning classification results.
In the future, we would like to collect more data, including from additional social network platforms, to improve the proposed analysis. Moreover, we would like to analyze the impact of the features removed by each feature selection strategy on training times and classification performance.
Acknowledgments
This publication was produced with the co-funding of the European Union - Next Generation EU:
NRRP Initiative, Mission 4, Component 2, Investment 1.3 – Partnerships extended to universities,
research centers, companies and research D.D. MUR n. 341 del 5.03.2022 – Next Generation EU
(PE0000014 - "Security and Rights In the CyberSpace - SERICS" - CUP: H93C22000620001).
References
[1] C. Castillo, M. Mendoza, B. Poblete, Information credibility on twitter, in: Proceedings of
the 20th international conference on World Wide Web, ACM, Hyderabad, India, 2011, pp.
675–684.
[2] Z. Alkhalil, C. Hewage, L. Nawaf, I. Khan, Phishing attacks: A recent comprehensive study
and a new anatomy, Frontiers in Computer Science 3 (2021) 563060.
[3] C. Braker, S. Shiaeles, G. Bendiab, N. Savage, K. Limniotis, Botspot: Deep learning
classification of bot accounts within twitter, in: Internet of Things, Smart Spaces, and
Next Generation Networks and Systems, Springer, St. Petersburg, Russia, 2020, pp. 165–175.
[4] P. Wanda, H. J. Jie, Deepprofile: Finding fake profile in online social network using dynamic
cnn, Journal of Information Security and Applications 52 (2020) 102465.
[5] S. Kudugunta, E. Ferrara, Deep neural networks for bot detection, Information Sciences
467 (2018) 312–322.
[6] D. Seyler, L. Li, C. Zhai, Identifying compromised accounts on social media
using statistical text analysis, Computing Research Repository abs/1804.07247 (2018) 1–8.
[7] S. Kodati, P. R. Kumbala, S. Mekala, P. S. Murthy, P. C. S. Reddy, Detection of fake profiles
on twitter using hybrid svm algorithm, in: E3S Web of Conferences, volume 309, EDP
Sciences, Ternate, Indonesia, 2021, p. 01046.
[8] K. K. Bharti, S. Pandey, Fake account detection in twitter using logistic regression with
particle swarm optimization, Soft Computing 25 (2021) 11333–11345.
[9] I. Sen, A. Aggarwal, S. Mian, S. Singh, P. Kumaraguru, A. Datta, Worth its weight in likes:
Towards detecting fake likes on instagram, in: Proceedings of the 10th ACM Conference
on Web Science, ACM, Amsterdam, Netherlands, 2018, pp. 205–209.
[10] L. Caruccio, D. Desiato, G. Polese, Fake account identification in social networks, in: 2018
IEEE international conference on big data (big data), IEEE, 2018, pp. 5078–5085.
[11] L. Caruccio, G. Cimino, S. Cirillo, D. Desiato, G. Polese, G. Tortora, Malicious account
identification in social network platforms, ACM Transactions on Internet Technology 23
(2023) 1–25.
[12] S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, M. Tesconi, Social fingerprinting:
detection of spambot groups through dna-inspired behavioral modeling, IEEE Transactions
on Dependable and Secure Computing 15 (2018) 561–576.
[13] P. H. Swain, H. Hauska, The decision tree classifier: Design and potential, IEEE Transac-
tions on Geoscience Electronics 15 (1977) 142–147.
[14] O. Kramer, K-nearest neighbors, in: Dimensionality Reduction with Unsupervised
Nearest Neighbors, Springer, 2013, pp. 13–23.
[15] F. O. Redelico, F. Traversaro, M. d. C. García, W. Silva, O. A. Rosso, M. Risk, Classification
of normal and pre-ictal eeg signals using permutation entropies and a generalized linear
model as a classifier, Entropy 19 (2017) 72.
[16] S. Xu, Bayesian naïve bayes classifiers to text classification, Journal of Information Science
44 (2018) 48–59.
[17] M. Pal, Random forest classifier for remote sensing classification, International journal of
remote sensing 26 (2005) 217–222.
[18] D. Kartini, D. T. Nugrahadi, A. Farmadi, et al., Hyperparameter tuning using gridsearchcv
on the comparison of the activation function of the elm method to the classification of
pneumonia in toddlers, in: 2021 4th International Conference of Computer and Informatics
Engineering (IC2IE), IEEE, Jakarta, Indonesia, 2021, pp. 390–395.
[19] J. C.-W. Chan, D. Paelinckx, Evaluation of random forest and adaboost tree-based ensemble
classification and spectral band selection for ecotope mapping using airborne hyperspectral
imagery, Remote Sensing of Environment 112 (2008) 2999–3011.
[20] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, K. R. Murthy, A fast iterative nearest
point algorithm for support vector machine classifier design, IEEE transactions on neural
networks 11 (2000) 124–136.
[21] A. Homsi, J. Al-Nemri, N. Naimat, H. Kareem, M. Al-Fayoumi, M. Abu Snober, Detecting
twitter fake accounts using machine learning and data reduction techniques, 2021, pp.
88–95. doi:10.5220/0010604300880095.
[22] S. Lei, A feature selection method based on information gain and genetic algorithm,
in: 2012 International Conference on Computer Science and Electronics Engineering,
volume 2, 2012, pp. 355–358. doi:10.1109/ICCSEE.2012.97.
[23] A. Z. Mustaqim, S. Adi, Y. Pristyanto, Y. Astuti, The effect of recursive feature elim-
ination with cross-validation (rfecv) feature selection algorithm toward classifier per-
formance on credit card fraud detection, in: 2021 International Conference on Ar-
tificial Intelligence and Computer Science Technology (ICAICST), 2021, pp. 270–275.
doi:10.1109/ICAICST53116.2021.9497842.
[24] Y. Jiang, C. Li, mrmr-based feature selection for classification of cotton foreign matter
using hyperspectral imaging, Computers and Electronics in Agriculture 119 (2015) 191–200.
URL: https://www.sciencedirect.com/science/article/pii/S0168169915003300.
doi:10.1016/j.compag.2015.10.017.