<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Information Science</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5220/0010604300880095</article-id>
      <title-group>
        <article-title>Classification benchmarking of fake account datasets using machine learning models and feature selection strategies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Danilo Caivano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mary Cerullo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Desiato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Polese</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <addr-line>via Edoardo Orabona n.4, 70125 Bari (BA)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Salerno</institution>
          ,
          <addr-line>via Giovanni Paolo II n.132, 84084 Fisciano (SA)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2</volume>
      <fpage>48</fpage>
      <lpage>59</lpage>
      <abstract>
        <p>Social network platforms are heavily used for social interactions, and due to their increasing number of registered users, it is crucial to verify the authenticity of accounts and the data they generate. In particular, the phenomenon of malicious accounts represents a critical aspect that social network platforms have to deal with, and it is necessary to develop new methodologies and strategies to discriminate malicious accounts automatically. To this end, data from social network platforms plays a crucial role in defining analytical activities devoted to fake account discrimination. In this proposal, we organized and cleaned fake account datasets collected from online sources and provide classification results obtained by employing machine learning models and feature selection strategies. Moreover, we extend the classification results by using a newly proposed fake account dataset collected through a data crawling activity. Experimental results produced by employing several machine learning models and feature selection techniques on the fake account datasets reveal discrimination improvements when feature selection strategies are exploited. Our proposal aims to support stakeholders, data analysts, and researchers by providing them with fake account datasets cleaned and organized for analytical activities, together with statistical classification results obtained using machine learning models and feature selection strategies.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Analytics</kwd>
        <kwd>Fake accounts</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Feature selection</kwd>
        <kwd>Social networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Social networks have simplified how people communicate and exchange information globally. In
particular, social interaction platforms such as Instagram, Twitter, Tumblr, etc., have improved
the dynamics of human interaction, impacting the daily lives of their users and the entire society.</p>
      <p>
        In social network platforms, it is crucial to monitor the popularity of a profile. In particular,
the number of friends or followers significantly determines the profile’s influence and reputation.
Social network profiles with a large following are considered more influential and more attractive
for better-paid advertisements. A common practice among social network users is to buy
fake followers to appear more influential, a practice encouraged by the meagre price at which
fake followers can be bought (a few dollars for hundreds of them). Such a practice
might be considered harmless if used to support individual vanity, but dangerous if used to
make an account appear more reliable and influential. For example, spammers can use fake followers
to promote products, trends, and fashions by compromising the integrity of the social network
platforms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Moreover, phishing campaigns are commonly spread through fake accounts
created ad hoc with many followers and followings to appear trustworthy, and in most cases, the
users attracted by such numbers are defrauded [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Many anomalous accounts, such as Spammers, Bots, Cyborgs, and Trolls, are disseminated on
social network platforms. In particular, spammers try to share malicious content or dangerous
links. Bots are automated accounts that simulate human behaviour, trying to perform typical human
actions automatically. In contrast, Cyborgs combine human control with automation and are not necessarily
malicious.</p>
      <p>In this proposal, we focus on the discrimination of fake accounts on social network platforms.
In particular, we present an extensive empirical evaluation of fake account datasets available
in online sources. In detail, we employ several machine learning models and evaluate their
capabilities in terms of fake account discrimination. Further, we exploit different feature
selection techniques to enhance the models’ discrimination performance.</p>
      <p>The general idea of our proposal is to offer stakeholders, data analysts, and researchers
who work to define new methodologies for malicious account discrimination the possibility of
quickly accessing fake account datasets that have been cleaned and organized for analytical
activities. Moreover, we offer comparative results in terms of fake account discrimination.
To this end, the usage of machine learning models is motivated by the fact that we want to
provide a baseline for classification performances. In contrast, feature selection techniques
are employed to improve the computed baseline. Moreover, we highlight which combination of
feature selection strategy and model to adopt on each dataset to achieve the best classification
results. Additionally, we extend our analysis with a new fake account dataset collected
through a data crawling activity.</p>
      <p>In summary, the main contributions of our proposal are i) fake account datasets collected from
online sources, cleaned and organized for analytical activities, ii) baseline and improved results
of classification performances using machine learning models and feature selection techniques,
and iii) a new fake account dataset collected through data crawling activity.</p>
      <p>The remainder of the paper is organized as follows. Section 2 reports relevant works
concerning fake account discrimination, whereas Section 3 presents our methodology. Section 4
shows results, and conclusions and future directions are provided in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        A key factor to be monitored for social network platforms is the identification of malicious
accounts. The automatic collection of social network accounts has been addressed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In
particular, the authors developed an ad-hoc web crawler to automatically collect and filter
public Twitter accounts and organize the data in testing and training datasets. Moreover, a
multi-layer perceptron neural network has been modelled and trained over nine features
characterizing a fake account. Another machine learning approach is provided in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In
particular, the authors propose DeepProfile, which performs account classification through a
dynamic CNN to train a learning model, which exploits a novel pooling layer to optimize the
neural network performance in the training process. Moreover, in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], content and metadata
at the tweet level have been exploited for recognizing bots employing a deep neural network
based on contextual long short-term memory (LSTM). In particular, this approach extrapolates
contextual features from user metadata and uses the LSTM deep nets to process the tweet text,
yielding a model capable of obtaining high classification accuracy with little data.
      </p>
      <p>
        Statistical text analysis is exploited in a novel general framework to discover compromised
accounts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The framework relies on the observation that an account’s owner uses his/her
profile in a way that is entirely different from how the same account is used when it is hacked, enabling a
syntactic analyzer to identify the features used by hackers (or spammers) when they compromise
a genuine account. Thus, a language modelling algorithm is used to extrapolate the similarities
between language models of genuine users and those of hackers/spammers to characterize
hackers’ features and use them in supervised machine learning approaches.
      </p>
      <p>
        Further approaches devoted to fake account discrimination also considered feature
engineering and/or selection issues [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Nevertheless, most of the proposals, including a feature
engineering process, rely on domain experts or include manual work for characterizing
meaningful features that permit a classifier to work with high accuracy. For instance, in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the
authors have enumerated the main characteristics to discriminate a fake account from a genuine
one. In particular, by manually examining different types of accounts, they extracted a set of
features to highlight the characteristics of malicious accounts. Moreover, they analyzed the
liking behaviour of each account to build an automated mechanism to detect fake likes on
Instagram. Furthermore, in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the authors propose a novel technique to discriminate real
accounts on social networks from fake ones. Their technique exploits knowledge automatically
extracted from big data to characterize typical patterns of fake accounts. Additionally, in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
the authors extend the previous work by exploiting data correlations to develop a new feature
engineering strategy to augment the social network account dataset with additional features,
aiming to enhance the capability of existing machine learning strategies to discriminate fake
accounts.
      </p>
      <p>Compared to the fake account discrimination approaches described above, in this proposal, we
release cleaned and organized fake account datasets collected from online sources for analytical
activities. To this end, we exploit machine learning models and feature selection techniques
to compute baseline and improved classification performance results. Finally, we extend our
analysis by exploiting a new fake account dataset collected by data crawling activity.</p>
      <p>In what follows, we introduce our methodology steps to compute classification results.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section presents our approach for computing data utility metrics over malicious account
datasets. In particular, in the first phase of the approach, we employ machine learning models
to classify malicious datasets and compute data utility metrics over them. Moreover, in the
second step, we employ feature selection strategies over malicious account datasets to improve
the discrimination performances of machine learning models.</p>
      <p>[Figures 1 and 2: pipeline diagrams showing the datasets (Dataset 1 … Dataset n) flowing through a Feature Selection step into a Classifier that outputs an F/R (fake/real) label.]</p>
      <p>The first phase of our approach (represented by Figure 1) is targeted at computing data utility
metrics over malicious account datasets. In particular, we exploit machine learning models to
compute classification metrics over each dataset in order to yield a first baseline that describes
all datasets in terms of data utility.</p>
      <p>The second phase of our approach (represented by Figure 2) is targeted at improving the data utility
metrics obtained in the previous step. In particular, we employ feature selection strategies over
each dataset to improve the discrimination performances of machine learning models.</p>
      <p>In what follows, we introduce the malicious account datasets, the machine learning models, the feature
selection strategies employed, and the results obtained.</p>
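      <p>The two phases described above can be sketched as follows (a minimal illustration on synthetic data, assuming the scikit-learn library; SelectKBest with mutual information stands in for the feature selection strategies detailed in Section 4.3, and cross-validated accuracy serves as the data utility metric):</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for one malicious-account dataset.
X, y = make_classification(n_samples=400, n_features=20,
                           n_informative=5, random_state=0)

model = RandomForestClassifier(random_state=0)

# Phase 1: baseline data utility, i.e., cross-validated classification
# accuracy of the model over the raw dataset.
baseline = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

# Phase 2: the same model preceded by a feature selection step, aiming to
# improve the baseline computed in the first phase.
selected = make_pipeline(SelectKBest(mutual_info_classif, k=8), model)
improved = cross_val_score(selected, X, y, cv=5, scoring="accuracy").mean()
```

      <p>Each real dataset would replace the synthetic one, and each model/strategy pair would be evaluated in the same way.</p>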
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>In this section, we describe the datasets employed in our study and the data preparation
techniques exploited to organize and clean them for data analytics activities. Next, we provide
details concerning the machine learning models, feature selection strategies, and the proposed
fake account dataset. Lastly, we present classification results achieved by using machine learning
models and feature selection techniques over the collected datasets.</p>
      <sec id="sec-4-1">
        <title>4.1. Data preparation and dataset descriptions</title>
        <p>In this section, we describe the collected datasets and the data preparation activities used to clean
and organize them for data analysis activities. In particular, we collected ten different datasets,
four related to Instagram accounts and six related to X (formerly Twitter) accounts. In what
follows, we describe the datasets and provide their data sources.</p>
        <p>
          • IG_1: The dataset consists of 1194 Instagram accounts, of which 994 real and 200 fake,
and contains 10 features. The dataset source can be found at the following link:
https://github.com/Blacjar/instafake-dataset#fake-account-detection
          • IG_2: The dataset consists of 785 Instagram accounts, of which 93 real and 692 fake, and
contains 13 features. The dataset source can be found at the following link:
https://www.kaggle.com/datasets/rezaunderfit/instagram-fake-and-real-accounts-dataset/data
          • IG_4: The dataset consists of 576 Instagram accounts, of which 288 real and 288
fake, and contains 12 features. The dataset source can be found at the following link:
https://www.kaggle.com/datasets/jasvindernotra/instagram-detecting-fake-accounts/data?select=instagram.csv
          • TW_3: The dataset consists of 2818 Twitter accounts, of which 1481 real and 1337
fake, and contains 34 features. The dataset source can be found at the following link:
https://www.kaggle.com/datasets/whoseaspects/genuinefake-user-profile-dataset/data
          • TW_7: The dataset consists of 9019 Twitter accounts, of which 5706 real and 3313 fake,
and contains 16 features. The dataset source can be found in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>The datasets mentioned below can all be found at the following link:
https://botometer.osome.iu.edu/bot-repository/datasets.html
• TW_8 (gilani2017): The dataset consists of 2503 Twitter accounts, of which 1413 real and
1090 fake, and contains 41 features.
• TW_9_LP (caverlee11): The dataset consists of 41499 Twitter accounts, of which 19276
real and 22223 fake, and contains 8 features.
• TW_9_M (midterm2018): The dataset consists of 50538 Twitter accounts, of which 8092
real and 42446 fake, and contains 18 features.
• TW_9_10: The dataset consists of 15810 Twitter accounts, of which 7905 real and 7905
fake, and contains 43 features. In particular, this dataset was obtained by
concatenating multiple datasets: verified-2019 and celebrity-2019 (real accounts),
and pronbots-2019, botwiki-2019, political-bots-2019, and vendor-purchased-2019
(fake accounts).</p>
        <p>Concerning data preparation activities, we replaced null values with zero and adopted an
encoding strategy for non-numeric features to use machine learning models. In particular, the
encoding strategy exploits a dictionary to map values of categorical columns into integers. This
strategy allowed us to maintain the uniqueness of the tuples by avoiding altering the original
data.</p>
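        <p>The cleaning steps just described can be sketched as follows (a minimal illustration assuming the pandas library; the column names are hypothetical and not taken from the released datasets):</p>

```python
import pandas as pd

def prepare(df):
    """Replace null values with zero and dictionary-encode non-numeric
    features, keeping tuples distinguishable without altering numeric data."""
    df = df.fillna(0)
    for col in df.select_dtypes(exclude="number").columns:
        # Dictionary mapping each distinct categorical value to an integer.
        mapping = {v: i for i, v in enumerate(sorted(df[col].astype(str).unique()))}
        df[col] = df[col].astype(str).map(mapping)
    return df

accounts = pd.DataFrame({
    "followers": [120, None, 54],
    "has_profile_pic": ["yes", "no", "yes"],
})
clean = prepare(accounts)
```

        <p>Sorting the distinct values before building the dictionary makes the encoding deterministic across runs.</p>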
        <p>In the next section, we describe the machine learning models adopted and the parameter
tuning used.</p>
      </sec>
      <sec id="sec-4-1-1">
        <title>4.2. Adopted machine learning models and parameter tuning</title>
        <p>
          This section describes the machine learning models employed in our study and the parameter
tuning used for each model. In particular, we involved Decision Tree (DT) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], K-Nearest
Neighbors (KNN) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], Logistic Regression (LR) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], Gaussian Naïve Bayes (NB) [16], Random
Forest (RF) [17] and Support Vector Classifier (SVC) [20] by considering their versions available
in the Scikit-learn Python library. Moreover, for each model, we performed hyperparameter
tuning using the GridSearchCV with 5-fold [18], aiming to identify the best combination of
hyperparameters for the predictive models based on the accuracy scores. Details concerning
machine learning models and parameters tuning are reported below. The decision tree (DT)
model is a supervised learning model that, given a labelled dataset, recursively defines a tree
structure where, at each level, local decisions are associated with a feature. After constructing
the tree, each path from the root to a leaf node represents a classification pattern [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. In
detail, the hyperparameters utilized for the DT model are max_leaf_nodes in a range from
2 to 100 and [2, 3, 4] for min_samples_split. The k-Nearest Neighbor (KNN) algorithm is an
instance-based technique that operates under the assumption that new instances are similar to
those already provided with a class label. In this algorithm, all instances are treated as points
in an n-dimensional space and are classified based on their similarity to other instances. In
detail, the hyperparameters utilized for the KNN model are in the range of 1 to 25 for the
n_neighbors param, [’uniform’, ’distance’] for the weights param and [1, 2] for the p param.
Logistic regression (LR) is a supervised learning approach capable of inferring a vector of weights
whose elements are associated with each feature. In particular, a weight specifies the relevance
of a feature with respect to the classification task [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. In detail, the hyperparameters utilized
for the LR model are ’l2’ for penalty and [0.001, 0.01, 0.1, 1, 10, 100, 1000] for the C parameter.
Gaussian Naïve Bayes (NB) is a supervised learning method based on the application of Bayes’
theorem with the assumption of conditional independence between each pair of variables. In
detail, the hyperparameter utilized for the NB model is the var_smoothing parameter, ranging
over powers of ten from 10^0 to 10^-9. The random forest (RF) model is an approach based on the ensemble concept
[19], i.e., exploiting a set of DTs to derive a global model that performs better than the single
DTs composing the ensemble. In detail, the hyperparameters utilized for the RF model are
the bootstrap parameter set to true, [10, 20, 30, 100] for max_depth, [2, 3] for max_features,
[3, 4, 5] for the min_samples_leaf parameter, [8, 10, 12] for min_samples_split, [10, 20, 30, 100]
for n_estimators and, finally, ’gini’ or ’entropy’ for the criterion. Support vector classification
(SVC) is a model in which the training instances are represented as points in a space and
organized into separate groups. The SVC tries to find the optimal separating
hyperplane by maximizing the margins of separation between different classes
[20]. In detail, the hyperparameters utilized for the SVC model are [0.1, 1, 10, 100, 1000] for the
C parameter, [1, 0.1, 0.01, 0.001, 0.0001] for gamma and, finally, ’rbf’ as the kernel.
        </p>
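        <p>As an illustration of the tuning procedure described above (shown here for the DT grid; synthetic data stands in for an account dataset, and the grid is the one reported in the text):</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an account dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# DT hyperparameter grid: max_leaf_nodes in a range from 2 to 100 and
# [2, 3, 4] for min_samples_split.
param_grid = {
    "max_leaf_nodes": list(range(2, 101)),
    "min_samples_split": [2, 3, 4],
}

# 5-fold grid search selecting the best combination by accuracy score.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
best = search.best_params_
```

        <p>The same pattern applies to the KNN, LR, NB, RF, and SVC grids listed above.</p>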
        <p>In the next section, we describe the feature selection strategies adopted and their settings.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Adopted feature selection strategies</title>
        <p>This section describes the feature selection strategies employed and their settings. In particular,
we involved Spearman’s Correlation (SC) [21], Information Gain (IG) [22], Recursive Feature
Elimination with Cross-Validation (RFECV) [23], Minimum Redundancy Maximum Relevance
(MRMR) [24] and Principal Component Analysis (PCA) [21]. Details of the feature selection
strategies are provided below. Spearman’s Correlation [21] is a technique that measures the correlation
between two variables. In particular, a positive value indicates that the variables have a positive
relationship, whereas a negative value indicates a negative relationship. Moreover, if the value
is zero, no relationship between the two variables is defined. For this reason, variables with a
higher absolute correlation value may be considered more relevant and retained. Information
Gain [22] consists of a non-parametric entropy-based technique that measures the dependence
between two variables. In particular, if such dependence is equal to zero, the two variables
are independent, while a higher value indicates greater dependence. Selected features are
those with the highest score, while discarded features are those with the score closest to
zero. Recursive Feature Elimination with Cross-Validation (RFECV) [23] is a technique for
extracting the most significant features by removing the weakest feature. The Logistic Regression
estimator provides information on the importance of features. The best subset of features is then
selected using the accuracy scores in combination with cross-validation. Minimum Redundancy
Maximum Relevance (MRMR) [24] involves selecting features with the highest relevance and
least redundancy through mutual information. In this case, the number of features to be selected
is chosen based on the k returned by the RFECV method. Finally, Principal Component Analysis
(PCA) [21] is a dimensionality reduction method that identifies principal components along the
directions that preserve most of the variance in the original data. To this end, we use PCA
by selecting the number of components that preserves 95% of the data variance considering all
features.</p>
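        <p>A minimal sketch of how three of the strategies above are typically invoked in scikit-learn (Information Gain via mutual information, RFECV with a Logistic Regression estimator, and PCA preserving 95% of the variance); the data is synthetic:</p>

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=12,
                           n_informative=4, random_state=0)

# Information Gain: score features by mutual information with the label;
# features scoring closest to zero would be discarded.
ig_scores = mutual_info_classif(X, y, random_state=0)

# RFECV: recursively remove the weakest feature, choosing the best subset
# size by cross-validated accuracy with a Logistic Regression estimator.
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=5, scoring="accuracy")
rfecv.fit(X, y)
k = rfecv.n_features_  # MRMR then selects this same number of features

# PCA: keep the number of components that preserves 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```

        <p>Spearman’s Correlation can be computed analogously, e.g., with scipy.stats.spearmanr, retaining the features with the highest absolute correlation.</p>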
        <p>In the next section, we provide details concerning the proposed fake accounts dataset.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.4. Proposed fake accounts dataset</title>
        <p>In this section, we describe the proposed fake accounts dataset. In particular, we collected the
Proposed Dataset (PD) from Instagram’s social platform. The dataset consists of 1937 accounts,
divided into 944 real accounts and 993 fake accounts, and contains 11 features. In detail, we
collected Instagram accounts by exploiting a data crawling technique implemented with an
ad hoc script to extract the features needed. More specifically, fake accounts are purchased
using the online service: https://serviceiggrowthstar.it, whereas real accounts are collected
by involving real users and verified manually. All the above-described datasets are organized,
cleaned, and made accessible at the following GitHub repository: https://github.com/Macerul/
FakeAccountDatasets.git. Additionally, we provide in-depth statistics concerning additional
metrics computed using machine learning models over each dataset, i.e., Precision, Recall, F1 and
Accuracy, at the following link: https://github.com/Macerul/FakeAccountDatasets_Scores.git</p>
        <p>In the next section, we describe the experimental results achieved by employing machine learning
models and feature selection techniques over each dataset.</p>
        <p>[Figure: accuracy (0.70–1.00, y-axis) of the RF model over the datasets IG_1, IG_2, IG_4, TW_3, TW_7, TW_8, TW_9_LP, TW_9_M, TW_9_10, and PD (x-axis).]</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.5. Results</title>
        <p>In order to compute classification results, we run several experimental sessions in which different
classification models are trained over the datasets described in Section 4.1. In particular, we
discuss the performances achieved with the employed predictive models over each dataset
and evaluate the feature selection strategies described in Section 4.3 regarding classification
improvements. In detail, Figures 3 to 8 highlight the results obtained by employing each classification
model over each dataset (x-axis), with accuracy on the y-axis. Moreover, each figure presents results obtained by
using only the model (B), Principal Component Analysis (PCA), Spearman’s Correlation (SC),
Information Gain (IG), Recursive Feature Elimination with Cross-Validation (RFECV), and
Minimum Redundancy Maximum Relevance (MRMR) as feature selection strategies.</p>
        <p>Figure 3 reports classification results achieved by employing the KNN model. In particular, it
is possible to notice that classification results computed over the IG_1, IG_4, TW_8, TW_9_10
and PD datasets by exploiting SC, IG and RFECV overcome the baseline results computed by
using only the KNN model, whereas those computed over the remaining datasets preserve the
baseline. Moreover, the best classification results are obtained over the TW_3, TW_7, TW_9_LP,
and TW_9_M datasets, whereas the worst are obtained over the TW_8 dataset. In general,
RFECV is the feature selection strategy offering the most improvement in terms of fake account
discrimination when combined with the KNN model.</p>
        <p>Figure 4 reports classification results achieved by employing the RF model. In particular,
classification results computed over the IG_2 dataset by exploiting IG and RFECV overcome
the baseline results computed by using only the RF model, whereas those computed over
the remaining datasets preserve the baseline. Moreover, the best classification results are
obtained over the TW_3, TW_7, TW_9_LP, TW_9_M, and TW_9_10 datasets, whereas the
worst classification results are obtained over the TW_8 dataset. In general, RFECV and IG are
the feature selection strategies offering the most improvement in terms of fake account discrimination
when combined with the RF model.</p>
        <p>Figure 5 reports classification results achieved by employing the DT model. In particular,
classification results computed over the IG_1 and IG_2 datasets by exploiting SC and IG overcome
the baseline results computed by using only the DT model, whereas those computed over
the remaining datasets preserve the baseline. Moreover, the best classification results are
obtained over the TW_3, TW_7, TW_9_LP, TW_9_M, and TW_9_10 datasets, whereas the
worst classification results are obtained over the TW_8 dataset. In general, SC and IG are the feature
selection strategies offering the most improvement in terms of fake account discrimination when
combined with the DT model.</p>
        <p>Figure 6 reports classification results achieved by employing the NB model. In particular, it is
possible to notice that classification results computed over the IG_1, IG_2, IG_4, TW_7, TW_8,
TW_9, and TW_9_10 datasets by exploiting IG, RFECV, and MRMR overcome the baseline
results computed by using only NB model, whereas those computed over the remaining datasets
preserve the baseline. Moreover, the best classification results are obtained over the TW_3
and TW_9 datasets, whereas the worst are obtained over the TW_8 dataset. In general, IG and
RFECV are the feature selection strategies offering the most improvement in terms of fake account
discrimination when combined with the NB model.</p>
        <p>Figure 7 reports classification results achieved by employing the SVC model. In particular, it
is possible to notice that classification results computed over the IG_2, IG_4, and TW_8 datasets
by exploiting MRMR and RFECV overcome the baseline results computed by using only the SVC
model, whereas those computed over the remaining datasets preserve the baseline. Moreover,
the best classification results are obtained over the TW_3, TW_7, TW_9_LP, TW_9_M, and
TW_9_10 datasets, whereas the worst classification results are obtained over the TW_8 dataset.
In general, RFECV is the feature selection strategy offering the most improvement in terms of fake
account discrimination when combined with the SVC model.</p>
        <p>[Figure: accuracy (0.75–1.00, y-axis) of the LR model over the datasets IG_1, IG_2, IG_4, TW_3, TW_7, TW_8, TW_9_LP, TW_9_M, TW_9_10, and PD (x-axis).]</p>
        <p>Figure 8 reports classification results achieved by employing the LR model. In particular, it
is possible to notice that classification results computed over the IG_1 and TW_8 datasets by
exploiting SC and RFECV overcome the baseline results computed by using only the LR model,
whereas those computed over the remaining datasets preserve the baseline. Moreover, the best
classification results are obtained over TW_3, TW_7, TW_9_LP, and TW_9_M datasets, whereas
the worst classification results are obtained over the TW_8 dataset. In general, RFECV is the
feature selection strategy offering the most improvement in terms of fake account discrimination
when combined with the LR model.</p>
        <p>To evaluate the number of features selected by each feature selection technique over
each collected dataset, we report the obtained results in Figure 9. In particular, the
x-axis represents the analyzed datasets, whereas the y-axis represents the number of features
selected for each dataset. Each line in Figure 9 represents the application of Spearman’s
Correlation (SC), Information Gain (IG), and Recursive Feature Elimination with Cross-Validation
(RFECV) as feature selection strategies, respectively. We do not report results concerning
Principal Component Analysis (PCA) and Minimum Redundancy Maximum Relevance (MRMR),
since PCA does not yield a number of selected features and MRMR yields the same number
of features as RFECV (see Section 4.3). In general, as can be seen from Figure 9, the IG
strategy selects the minimum number of features for almost all datasets, whereas RFECV yields
the maximum for all datasets except TW_3. Concerning SC, it exceeds the RFECV count only over
the TW_3 dataset and yields fewer features than IG only over the IG_1, IG_4, and PD datasets.</p>
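        <p>Counting the features retained by each strategy can be sketched as below. The correlation and information-gain thresholds are illustrative assumptions (the paper's exact cut-offs are not restated here), and the data is synthetic.</p>
        <preformat>
```python
# Hedged sketch: how many features SC, IG, and RFECV each retain.
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=12,
                           n_informative=5, random_state=2)

# SC: keep features whose Spearman correlation with the label passes a threshold
sc_kept = sum(abs(spearmanr(X[:, j], y)[0]) > 0.1 for j in range(X.shape[1]))

# IG: keep features with non-negligible mutual information (information gain)
ig_kept = int((mutual_info_classif(X, y, random_state=2) > 0.01).sum())

# RFECV: cross-validation on the wrapped estimator decides the count directly
rfecv_kept = RFECV(LogisticRegression(max_iter=1000), cv=5).fit(X, y).n_features_

print({"SC": sc_kept, "IG": ig_kept, "RFECV": rfecv_kept})
```
        </preformat>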
        <p>As illustrated by our results, feature selection strategies improve the classification results
of machine learning models. In particular, the model achieving the best results is the
RF, whereas, for almost all models, RFECV is the feature selection strategy offering the most
improvement in fake account discrimination, even though it selects a large number of features.</p>
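        <p>The best-performing combination reported above, RF with RFECV, can be assembled as a single scikit-learn pipeline; the estimator sizes and synthetic data here are illustrative assumptions.</p>
        <preformat>
```python
# Hedged sketch: random forest classification on RFECV-selected features,
# combined in one pipeline so selection is refit inside each CV fold.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=12,
                           n_informative=5, random_state=3)

pipe = Pipeline([
    ("select", RFECV(RandomForestClassifier(n_estimators=50, random_state=3), cv=3)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=3)),
])
score = cross_val_score(pipe, X, y, cv=3).mean()
print(f"cross-validated accuracy: {score:.3f}")
```
        </preformat>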
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The increasingly widespread use of malicious accounts can compromise the trustworthiness of
social network platforms. In this context, the number of techniques for detecting fake accounts
has grown in proportion to the number of new algorithms developed for harmful purposes,
and it is necessary to collect and organize data associated with social network accounts to
improve discrimination capabilities. Under this view, we cleaned and organized fake account
datasets from online sources for analytical activities. Evaluation results achieved over different
machine learning models demonstrated that feature selection strategies improve classification
performance. The main objective of our proposal is to support stakeholders, data analysts, and
researchers by offering the possibility of quickly accessing fake account datasets together with
machine learning classification results.</p>
      <p>In the future, we would like to collect more data, including additional social network platforms,
to improve the proposed analysis. Moreover, we would like to analyze the impact of the features
removed by each feature selection strategy on training times and classification performance.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This Publication was produced with the co-funding of the European union - Next Generation EU:
NRRP Initiative, Mission 4, Component 2, Investment 1.3 – Partnerships extended to universities,
research centers, companies and research D.D. MUR n. 341 del 5.03.2022 – Next Generation EU
(PE0000014 - "Security and Rights In the CyberSpace - SERICS" - CUP: H93C22000620001).</p>
    </sec>
  </body>
  <back>
  </back>
</article>