<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Classification benchmarking of fake account datasets using machine learning models and feature selection strategies</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Danilo</forename><surname>Caivano</surname></persName>
							<email>danilo.caivano@uniba.it</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>via Edoardo Orabona n.4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari (BA)</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mary</forename><surname>Cerullo</surname></persName>
							<email>m.cerullo13@studenti.unisa.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Salerno</orgName>
								<address>
									<addrLine>via Giovanni Paolo II n.132</addrLine>
									<postCode>84084</postCode>
									<settlement>Fisciano (SA)</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Domenico</forename><surname>Desiato</surname></persName>
							<email>domenico.desiato@uniba.it</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>via Edoardo Orabona n.4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari (BA)</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Polese</surname></persName>
							<email>gpolese@unisa.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Salerno</orgName>
								<address>
									<addrLine>via Giovanni Paolo II n.132</addrLine>
									<postCode>84084</postCode>
									<settlement>Fisciano (SA)</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Classification benchmarking of fake account datasets using machine learning models and feature selection strategies</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">5ADF7CAF623C552463A46924FBCBFD68</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Data Analytics</term>
					<term>Fake accounts</term>
					<term>Machine Learning</term>
					<term>Feature selection</term>
					<term>Social networks</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Social network platforms are widely used for social interaction, and given their growing number of registered users, verifying the authenticity of accounts and the data they generate has become essential. In particular, malicious accounts represent a critical issue that social network platforms must deal with, making it necessary to develop new methodologies and strategies to discriminate malicious accounts automatically. To this end, data from social network platforms plays a central role in defining analytical activities devoted to fake account discrimination. In this proposal, we organized and cleaned fake account datasets collected from online sources and provide classification results obtained by employing machine learning models and feature selection strategies. Moreover, we extend the classification results with a newly proposed fake account dataset collected through a data crawling activity. Experimental results produced by employing several machine learning models and feature selection techniques on the fake account datasets reveal discrimination improvements when feature selection strategies are exploited. Our proposal aims to support stakeholders, data analysts, and researchers by providing them with fake account datasets cleaned and organized for analytical activities, together with statistical classification results obtained using machine learning models and feature selection strategies.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Social networks have simplified how people communicate and exchange information globally. In particular, social interaction platforms such as Instagram, Twitter, Tumblr, etc., have improved the dynamics of human interaction, impacting the daily lives of their users and society as a whole.</p><p>In social network platforms, it is crucial to monitor the popularity of a profile. In particular, the number of friends or followers significantly determines a profile's influence and reputation. Social network profiles with a large following are considered more influential and are more attractive to better-paid advertisements. A common practice among social network users is to buy fake followers to appear more influential. Users are also enticed by the meagre price at which fake followers may be bought (a few dollars for hundreds of them). Of course, such a practice might be considered harmless if used to support individual vanity, but dangerous if used to make an account appear more reliable and influential. For example, spammers can use fake followers to promote products, trends, and fashions, compromising the integrity of social network platforms <ref type="bibr" target="#b0">[1]</ref>. Moreover, phishing campaigns are commonly spread by exploiting fake accounts created ad hoc with many followers and followings to appear trustworthy, and in most cases, the users attracted by such numbers are defrauded <ref type="bibr" target="#b1">[2]</ref>.</p><p>Many anomalous accounts, such as Spammers, Bots, Cyborgs, and Trolls, are disseminated on social network platforms. In particular, Spammers try to share malicious content or dangerous links, while Bots are automated accounts that simulate human behaviour, trying to perform typical human actions automatically. 
In contrast, Cyborgs are accounts partly operated by humans and are not necessarily malicious, while Trolls are human-managed accounts that deliberately post provocative content.</p><p>In this proposal, we focus on the discrimination of fake accounts on social network platforms. In particular, we present an extensive empirical evaluation of fake account datasets available from online sources. In detail, we employ several machine learning models and evaluate their capabilities in terms of fake account discrimination. Further, we exploit different feature selection techniques to enhance the models' discrimination performance.</p><p>The general idea of our proposal is to offer stakeholders, data analysts, and researchers who work on defining new methodologies for malicious account discrimination the possibility of quickly accessing fake account datasets that have been cleaned and organized for analytical activities. Moreover, we offer comparative results in terms of fake account discrimination. To this end, the usage of machine learning models is motivated by the fact that we want to provide a baseline for classification performance, whereas feature selection techniques are employed to improve the computed baseline. Moreover, we highlight the combinations of feature selection strategy and model to adopt on each specific dataset to achieve the best classification results. Additionally, we extend our analysis by using a new fake account dataset collected through a data crawling activity.</p><p>In summary, the main contributions of our proposal are i) fake account datasets collected from online sources, cleaned and organized for analytical activities, ii) baseline and improved classification performance results using machine learning models and feature selection techniques, and iii) a new fake account dataset collected through a data crawling activity.</p><p>The remainder of the paper is organized as follows. Section 2 reports relevant works concerning fake account discrimination, whereas Section 3 presents our methodology. 
Section 4 presents the experimental results, and conclusions and future directions are provided in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>A key factor to be monitored by social network platforms is the identification of malicious accounts. The automatic collection of social network accounts has been addressed in <ref type="bibr" target="#b2">[3]</ref>. In particular, the authors developed an ad hoc web crawler to automatically collect and filter public Twitter accounts and organize the data into training and testing datasets. Moreover, a multi-layer perceptron neural network has been modelled and trained on over nine features characterizing a fake account. Another machine learning approach is provided in <ref type="bibr" target="#b3">[4]</ref>. In particular, the authors propose DeepProfile, which performs account classification through a dynamic CNN, exploiting a novel pooling layer to optimize the neural network performance during the training process. Moreover, in <ref type="bibr" target="#b4">[5]</ref>, content and metadata at the tweet level have been exploited for recognizing bots by employing a deep neural network based on contextual long short-term memory (LSTM). In particular, this approach extrapolates contextual features from user metadata and uses LSTM deep nets to process the tweet text, yielding a model capable of achieving high classification accuracy with little data.</p><p>Statistical text analysis is exploited in a novel general framework to discover compromised accounts <ref type="bibr" target="#b5">[6]</ref>. The framework relies on the observation that an account's owner uses his/her profile in a way that is entirely different from the same account when it is hacked, enabling a syntactic analyzer to identify the features used by hackers (or spammers) when they compromise a genuine account. 
Thus, a language modelling algorithm is used to extrapolate the similarities between language models of genuine users and those of hackers/spammers to characterize hackers' features and use them in supervised machine learning approaches.</p><p>Further approaches devoted to fake account discrimination also considered feature engineering and/or selection issues <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>. Nevertheless, most of the proposals including a feature engineering process rely on domain experts or require manual work to characterize meaningful features that allow a classifier to work with high accuracy. For instance, in <ref type="bibr" target="#b8">[9]</ref>, the authors have enumerated the main characteristics that discriminate a fake account from a genuine one. In particular, by manually examining different types of accounts, they extracted a set of features to highlight the characteristics of malicious accounts. Moreover, they analyzed the liking behaviour of each account to build an automated mechanism to detect fake likes on Instagram. Furthermore, in <ref type="bibr" target="#b9">[10]</ref>, the authors propose a novel technique to discriminate real accounts on social networks from fake ones. Their technique exploits knowledge automatically extracted from big data to characterize typical patterns of fake accounts. Additionally, in <ref type="bibr" target="#b10">[11]</ref>, the authors extend the previous work by exploiting data correlations to develop a new feature engineering strategy to augment the social network account dataset with additional features, aiming to enhance the capability of existing machine learning strategies to discriminate fake accounts.</p><p>Compared to the fake account discrimination approaches described above, in this proposal, we release cleaned and organized fake account datasets collected from online sources for analytical activities. 
To this end, we exploit machine learning models and feature selection techniques to compute baseline and improved classification performance results. Finally, we extend our analysis by exploiting a new fake account dataset collected through a data crawling activity.</p><p>In what follows, we introduce the steps of our methodology for computing classification results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>This section presents our approach for computing data utility metrics over malicious account datasets. In the first phase of the approach, we employ machine learning models to classify the malicious account datasets and compute data utility metrics over them. In the second phase, we employ feature selection strategies over the malicious account datasets to improve the discrimination performance of the machine learning models. The first phase of our approach (represented by Figure <ref type="figure" target="#fig_0">1</ref>) aims to compute data utility metrics over the malicious account datasets. In particular, we exploit machine learning models to compute classification metrics over each dataset in order to yield a first baseline that describes all datasets in terms of data utility.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The second phase of our approach (represented by Figure <ref type="figure" target="#fig_1">2</ref>) aims to improve the data utility metrics obtained in the previous step. In particular, we employ feature selection strategies over each dataset to improve the discrimination performance of the machine learning models.</p><p>In what follows, we introduce the malicious account datasets, machine learning models, and feature selection strategies employed, and the results obtained.</p></div>
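The two-phase flow above can be sketched in Python with scikit-learn. This is a minimal illustration on a synthetic stand-in dataset, not the paper's actual data or pipeline: phase one computes a baseline accuracy with a classifier alone, phase two prepends a feature selection step; dataset size and feature counts are assumed for illustration.

```python
# Two-phase sketch: baseline classification vs. classification after
# feature selection, on a synthetic stand-in dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic fake/real account data (labels y: 1 = fake, 0 = real).
X, y = make_classification(n_samples=500, n_features=11, n_informative=5,
                           random_state=42)

# Phase 1: baseline data utility metric (accuracy) with the model alone.
baseline = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                           cv=5, scoring="accuracy").mean()

# Phase 2: the same model preceded by a feature selection strategy.
selected = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=5)),
    ("clf", RandomForestClassifier(random_state=42)),
])
improved = cross_val_score(selected, X, y, cv=5, scoring="accuracy").mean()
print(f"baseline={baseline:.3f} with-selection={improved:.3f}")
```

The same pattern applies per dataset: the baseline score from phase one is compared against the score obtained once a selector is placed in front of the model.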
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Evaluation</head><p>In this section, we describe the datasets employed in our study and the data preparation techniques exploited to organize and clean them for data analytics activities. Next, we provide details concerning the machine learning models, feature selection strategies, and the proposed fake account dataset. Lastly, we present classification results achieved by using machine learning models and feature selection techniques over the collected datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Data preparation and dataset descriptions</head><p>In this section, we describe the collected datasets and the data preparation activities used to clean and organize them for data analysis. In particular, we collected ten different datasets, four related to Instagram accounts and six related to X (formerly Twitter) accounts. In what follows, we describe the datasets and provide data sources. Concerning data preparation activities, we replaced null values with zero and adopted an encoding strategy for non-numeric features in order to use machine learning models. In particular, the encoding strategy exploits a dictionary to map the values of categorical columns into integers. This strategy allowed us to maintain the uniqueness of the tuples while avoiding altering the original data.</p><p>In the next section, we describe the machine learning models adopted and the parameter tuning used.</p></div>
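The preparation steps described above (null replacement and dictionary-based encoding) can be sketched with pandas. The column names and values below are hypothetical, purely to illustrate the transformation; the actual dataset schemas are in the released repositories.

```python
# Sketch of the data preparation: replace nulls with zero, then map each
# categorical value to an integer via a per-column dictionary.
import pandas as pd

df = pd.DataFrame({
    "followers": [120, None, 4500],
    "following": [300, 80, None],
    "profile_type": ["private", "public", "private"],  # non-numeric feature
})

# Replace null values with zero.
df = df.fillna(0)

# Dictionary-based encoding: each distinct value of a categorical column is
# mapped to an integer, preserving tuple uniqueness without altering the
# numeric columns.
for col in df.select_dtypes(include="object").columns:
    mapping = {v: i for i, v in enumerate(sorted(df[col].unique()))}
    df[col] = df[col].map(mapping)

print(df.dtypes)
```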
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Adopted machine learning models and parameter tuning</head><p>This section describes the machine learning models employed in our study and the parameter tuning used for each model. In particular, we involved Decision Tree (DT) <ref type="bibr" target="#b12">[13]</ref>, K-Nearest Neighbors (KNN) <ref type="bibr" target="#b13">[14]</ref>, Logistic Regression (LR) <ref type="bibr" target="#b14">[15]</ref>, Gaussian Naïve Bayes (NB) <ref type="bibr" target="#b15">[16]</ref>, Random Forest (RF) <ref type="bibr" target="#b16">[17]</ref> and Support Vector Classifier (SVC) [KSB00] by considering their versions available in the Scikit-learn<ref type="foot" target="#foot_0">1</ref> Python library. Moreover, for each model, we performed hyperparameter tuning using GridSearchCV with 5-fold cross-validation <ref type="bibr" target="#b17">[18]</ref>, aiming to identify the best combination of hyperparameters for the predictive models based on the accuracy scores. Details concerning the machine learning models and parameter tuning are reported below. The decision tree (DT) model is a supervised learning model that, given a labelled dataset, recursively defines a tree structure where, at each level, local decisions are associated with a feature. After constructing the tree, each path from the root to a leaf node represents a classification pattern <ref type="bibr" target="#b12">[13]</ref>. In detail, the hyperparameters tuned for the DT model are max_leaf_nodes in a range from 2 to 100 and [2, 3, 4] for min_samples_split. The k-Nearest Neighbor (KNN) algorithm is an instance-based technique that operates under the assumption that new instances are similar to those already provided with a class label. 
In this algorithm, all instances are treated as points in an n-dimensional space and are classified based on their similarity to other instances. In detail, the hyperparameters tuned for the KNN model are in the range of 1 to 25 for the n_neighbors parameter, ['uniform', 'distance'] for the weights parameter and [1, 2] for the p parameter. Logistic regression (LR) is a supervised learning approach capable of inferring a vector of weights whose elements are associated with each feature. In particular, a weight specifies the relevance of a feature with respect to the classification task <ref type="bibr" target="#b14">[15]</ref>. In detail, the hyperparameters tuned for the LR model are 'l2' for penalty and [0.001, 0.01, 0.1, 1, 10, 100, 1000] for the C parameter. Gaussian Naïve Bayes (NB) is a supervised learning method based on the application of Bayes' theorem with the assumption of conditional independence between each pair of variables. In detail, the hyperparameter tuned for the NB model is var_smoothing, with values ranging over powers of ten from 10^0 to 10^-9. The random forest (RF) model is an approach based on the ensemble concept <ref type="bibr" target="#b18">[19]</ref>, i.e., exploiting a set of DTs to derive a global model that performs better than the single DTs composing the ensemble. 
In detail, the hyperparameters tuned for the RF model are the bootstrap parameter set to true, [10, 20, 30, 100] for max_depth, [2, 3] for max_features, [3, 4, 5] for min_samples_leaf, [8, 10, 12] for min_samples_split, [10, 20, 30, 100] for n_estimators and, finally, 'gini' or 'entropy' for the criterion. Support vector classification (SVC) is a model in which the training instances are mapped to separate points of a space and organized into separated groups. The SVC tries to achieve the optimal separation hyperplane by computing the largest margins of separation between the different classes <ref type="bibr" target="#b19">[20]</ref>. In detail, the hyperparameters tuned for the SVC model are [0.1, 1, 10, 100, 1000] for the C parameter, [1, 0.1, 0.01, 0.001, 0.0001] for gamma and, finally, 'rbf' as the kernel.</p><p>In the next section, we describe the feature selection strategies adopted and their settings.</p></div>
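The tuning procedure described above can be sketched for one of the models. This is an illustrative GridSearchCV run over the DT grid reported in the text, on a synthetic stand-in dataset; the actual datasets and full grids for all six models are those described in the paper.

```python
# Sketch of hyperparameter tuning: GridSearchCV with 5-fold cross-validation
# over the Decision Tree grid reported above, scored by accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a fake-account dataset (11 features, as in PD).
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

param_grid = {
    "max_leaf_nodes": list(range(2, 101)),  # 2 to 100
    "min_samples_split": [2, 3, 4],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern repeats for the other models, swapping in their estimators and the grids listed above.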
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Adopted feature selection strategies</head><p>This section describes the feature selection strategies employed and their settings. In particular, we involved Spearman's Correlation (SC) <ref type="bibr" target="#b20">[21]</ref>, Information Gain (IG) <ref type="bibr" target="#b21">[22]</ref>, Recursive Feature Elimination with Cross-Validation (RFECV) <ref type="bibr" target="#b22">[23]</ref>, Minimum Redundancy Maximum Relevance (MRMR) <ref type="bibr" target="#b23">[24]</ref> and Principal Component Analysis (PCA) <ref type="bibr" target="#b20">[21]</ref>. Details of the feature selection strategies are provided below. Spearman's Correlation <ref type="bibr" target="#b20">[21]</ref> is a technique that measures the correlation between two variables. In particular, a positive value indicates that the variables have a positive relationship, whereas a negative value indicates a negative relationship. Moreover, if the value is zero, no relationship between the two variables is defined. For this reason, variables with a higher absolute correlation value may be considered more relevant and retained. Information Gain <ref type="bibr" target="#b21">[22]</ref> is a non-parametric entropy-based technique that measures the dependence between two variables. In particular, if such dependence is equal to zero, the two variables are independent, while a higher value indicates greater dependence. Selected features are those with the highest scores, while discarded features are those with scores closest to zero. Recursive Feature Elimination with Cross-Validation (RFECV) <ref type="bibr" target="#b22">[23]</ref> is a technique for extracting the most significant features by iteratively removing the weakest feature. A Logistic Regression estimator provides information on the importance of features. The best subset of features is then selected using the accuracy scores in combination with cross-validation. 
Minimum Redundancy Maximum Relevance (MRMR) <ref type="bibr" target="#b23">[24]</ref> involves selecting the features with the highest relevance and least redundancy through mutual information. In this case, the number of features to be selected is chosen based on the k returned by the RFECV method. Finally, Principal Component Analysis (PCA) <ref type="bibr" target="#b20">[21]</ref> is a dimensionality reduction method that identifies principal components in the directions that preserve most of the variance in the original data. To this end, we use PCA by selecting a number of components that preserves 95% of the data variance considering all features.</p><p>In the next section, we provide details concerning the proposed fake accounts dataset.</p></div>
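Three of the strategies above have direct scikit-learn counterparts, sketched below on a synthetic stand-in dataset: mutual information as an Information Gain-style score, RFECV with a Logistic Regression estimator, and PCA at a 95% variance threshold. The top-4 cutoff for the mutual-information ranking is an illustrative choice, not a setting from the paper.

```python
# Sketch of IG-style scoring, RFECV, and 95%-variance PCA on stand-in data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFECV, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=11, n_informative=4,
                           random_state=1)

# Information-Gain-style ranking: keep the highest mutual-information scores.
ig_scores = mutual_info_classif(X, y, random_state=1)
ig_keep = np.argsort(ig_scores)[-4:]          # illustrative top-4 cutoff

# RFECV: iteratively drop the weakest feature, validating with accuracy.
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=5, scoring="accuracy")
rfecv.fit(X, y)

# PCA: retain the components explaining 95% of the variance.
pca = PCA(n_components=0.95).fit(X)
print(len(ig_keep), rfecv.n_features_, pca.n_components_)
```

Note that, as stated above, PCA returns components rather than a subset of the original features, which is why its "number of selected features" is not directly comparable to the other strategies.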
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Proposed fake accounts dataset</head><p>In this section, we describe the proposed fake accounts dataset. In particular, we collected the Proposed Dataset (PD) from the Instagram social platform. The dataset consists of 1937 accounts, divided into 944 real accounts and 993 fake accounts, and contains 11 features. In detail, we collected Instagram accounts by exploiting a data crawling technique implemented with an ad hoc script to extract the features needed. More specifically, fake accounts were purchased using the online service https://serviceiggrowthstar.it, whereas real accounts were collected by involving real users and verified manually. All the above-described datasets are organized, cleaned, and made accessible at the following GitHub repository: https://github.com/Macerul/FakeAccountDatasets.git. Additionally, we also provide in-depth statistics concerning additional metrics computed using machine learning models over each dataset, i.e., Precision, Recall, F1 and Accuracy, at the following link: https://github.com/Macerul/FakeAccountDatasets_Scores.git. In the next section, we describe the experimental results achieved by employing machine learning models and feature selection techniques over each dataset.</p></div>
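The four per-dataset metrics named above (Precision, Recall, F1 and Accuracy) can be computed with standard scikit-learn metric functions. The labels below are an illustrative hand-made fake/real split, not figures from the released statistics.

```python
# Sketch of the per-dataset score computation with scikit-learn metrics.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = fake account, 0 = real account
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # a classifier's predictions

scores = {
    "Precision": precision_score(y_true, y_pred),
    "Recall":    recall_score(y_true, y_pred),
    "F1":        f1_score(y_true, y_pred),
    "Accuracy":  accuracy_score(y_true, y_pred),
}
print(scores)  # each score equals 0.75 for this toy split (TP=3, FP=1, FN=1, TN=3)
```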
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Results</head><p>In order to compute classification results, we ran several experimental sessions in which different classification models were trained over the datasets described in Section 4.1. In particular, we discuss the performances achieved with the employed predictive models over each dataset and evaluate the feature selection strategies described in Section 4.3. Figure <ref type="figure">3</ref> reports classification results achieved by employing the KNN model. In particular, it is possible to notice that the classification results computed over the IG_1, IG_4, TW_8, TW_9_10 and PD datasets by exploiting SC, IG and RFECV overcome the baseline results computed by using only the KNN model, whereas those computed over the remaining datasets preserve the baseline. Moreover, the best classification results are obtained over the TW_3, TW_7, TW_9_LP, and TW_9_M datasets, whereas the worst are obtained over the TW_8 dataset. In general, RFECV is the feature selection strategy offering the most improvement in terms of fake account discrimination when combined with the KNN model.</p><p>Figure <ref type="figure">4</ref> reports classification results achieved by employing the RF model. In particular, the classification results computed over the IG_2 dataset by exploiting IG and RFECV overcome the baseline results computed by using only the RF model, whereas those computed over the remaining datasets preserve the baseline. Moreover, the best classification results are obtained over the TW_3, TW_7, TW_9_LP, TW_9_M, and TW_9_10 datasets, whereas the worst classification results are obtained over the TW_8 dataset. In general, RFECV and IG are the feature selection strategies offering the most improvement in terms of fake account discrimination when combined with the RF model. Figure <ref type="figure">5</ref> reports classification results achieved by employing the DT model. 
In particular, the classification results computed over the IG_1 and IG_2 datasets by exploiting SC and IG overcome the baseline results computed by using only the DT model, whereas those computed over the remaining datasets preserve the baseline. Moreover, the best classification results are obtained over the TW_3, TW_7, TW_9_LP, TW_9_M, and TW_9_10 datasets, whereas the worst classification results are obtained over the TW_8 dataset. In general, SC and IG are the feature selection strategies offering the most improvement in terms of fake account discrimination when combined with the DT model.</p><p>Figure <ref type="figure">6</ref> reports classification results achieved by employing the NB model. In particular, it is possible to notice that the classification results computed over the IG_1, IG_2, IG_4, TW_7, TW_8, TW_9, and TW_9_10 datasets by exploiting IG, RFECV, and MRMR overcome the baseline results computed by using only the NB model, whereas those computed over the remaining datasets preserve the baseline. Moreover, the best classification results are obtained over the TW_3 and TW_9 datasets, whereas the worst are obtained over the TW_8 dataset. In general, IG and RFECV are the feature selection strategies offering the most improvement in terms of fake account discrimination when combined with the NB model.</p><p>Figure <ref type="figure">7</ref> reports classification results achieved by employing the SVC model. In particular, it is possible to notice that the classification results computed over the IG_2, IG_4, and TW_8 datasets by exploiting MRMR and RFECV overcome the baseline results computed by using only the SVC model, whereas those computed over the remaining datasets preserve the baseline. 
Moreover, the best classification results are obtained over the TW_3, TW_7, TW_9_LP, TW_9_M, and TW_9_10 datasets, whereas the worst classification results are obtained over the TW_8 dataset. In general, RFECV is the feature selection strategy offering the most improvement in terms of fake account discrimination when combined with the SVC model. Figure <ref type="figure">8</ref> reports classification results achieved by employing the LR model. In particular, it is possible to notice that the classification results computed over the IG_1 and TW_8 datasets by exploiting SC and RFECV overcome the baseline results computed by using only the LR model, whereas those computed over the remaining datasets preserve the baseline. Moreover, the best classification results are obtained over the TW_3, TW_7, TW_9_LP, and TW_9_M datasets, whereas the worst classification results are obtained over the TW_8 dataset. In general, RFECV is the feature selection strategy offering the most improvement in terms of fake account discrimination when combined with the LR model.</p><p>In order to evaluate the number of features selected by each feature selection technique over each collected dataset, we report the obtained results in Figure <ref type="figure" target="#fig_6">9</ref>. In particular, the x-axis represents the analyzed datasets, whereas the y-axis represents the number of features selected for each dataset. Each line reported in Figure <ref type="figure" target="#fig_6">9</ref> represents the application of Spearman's Correlation (SC), Information Gain (IG), and Recursive Feature Elimination with Cross-Validation (RFECV), respectively. 
We do not report results for Principal Component Analysis (PCA) and Minimum Redundancy Maximum Relevance (MRMR), since PCA yields principal components rather than a subset of the original features, and MRMR selects the same number of features as RFECV (see Section 4.3). In general, as can be seen from Figure <ref type="figure" target="#fig_6">9</ref>, the IG strategy selects the minimum number of features for almost all datasets, whereas RFECV yields the maximum number for all datasets except TW_3. Concerning SC, it exceeds the RFECV results only over the TW_3 dataset and yields fewer features than IG only over the IG_1, IG_4, and PD datasets. As illustrated by our results, feature selection strategies improve the classification results of machine learning models. In particular, the model that achieved the best results is RF, whereas, for almost all models, RFECV is the feature selection strategy offering the most improvement in fake account discrimination, even though it selects a large number of features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>The increasingly widespread use of malicious accounts can compromise the trustworthiness of social network platforms. In this context, the number of techniques for detecting fake accounts has grown in proportion to the number of new algorithms developed for harmful purposes, and it is necessary to collect and organize data associated with social network accounts to improve discrimination capabilities. To this end, we cleaned and organized fake account datasets from online sources for analytical activities. Evaluation results achieved over different machine learning models demonstrated that feature selection strategies improve classification performance. The main objective of our proposal is to support stakeholders, data analysts, and researchers by offering quick access to fake account datasets together with machine learning classification results.</p><p>In the future, we would like to collect more data, including additional social network platforms, to improve the proposed analysis. Moreover, we would like to analyze the impact of the features removed by each feature selection strategy on training times and classification performance.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Classifier working on malicious accounts datasets.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Features selection and classification.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :Figure 4 :</head><label>34</label><figDesc>Figure 3: Accuracy results of KNN model over the collected fake account datasets.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>3 regarding classification improvements. In detail, Figures 3 to 8 highlight results obtained by employing each classification model (x-axis) for each dataset (y-axis). Moreover, each figure presents results obtained by using only the model (B), Principal Component Analysis (PCA), Spearman's Correlation (SC), Information Gain (IG), Recursive Feature Elimination with Cross-Validation (RFECV), and Minimum Redundancy Maximum Relevance (MRMR) as feature selection strategies.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :Figure 6 :</head><label>56</label><figDesc>Figure 5: Accuracy results of DT model over the collected fake account datasets.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 7 :Figure 8 :</head><label>78</label><figDesc>Figure 7: Accuracy results of SVC model over the collected fake account datasets.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Number of selected features by each feature selection strategy w.r.t. analyzed datasets.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>• IG_1: The dataset consists of 1194 Instagram accounts, of which 994 real and 200 fake, and contains 10 features. The dataset source can be found at the following link: https://github.com/Blacjar/instafake-dataset#fake-account-detection • IG_2: The dataset consists of 785 Instagram accounts, of which 93 real and 692 fake, and contains 13 features. The dataset source can be found at the following link: https://www.kaggle.com/datasets/rezaunderfit/instagram-fake-and-real-accounts-dataset/data • IG_4: The dataset consists of 576 Instagram accounts, of which 288 real and 288 fake, and contains 12 features. The dataset source can be found at the following link: https://www.kaggle.com/datasets/jasvindernotra/instagram-detecting-fake-accounts/data?select=instagram.csv • TW_3: The dataset consists of 2818 Twitter accounts, of which 1481 real and 1337 fake, and contains 34 features. The dataset source can be found at the following link:</figDesc><table /><note>https://www.kaggle.com/datasets/whoseaspects/genuinefake-user-profile-dataset/data • TW_7: The dataset consists of 9019 Twitter accounts, of which 5706 real and 3313 fake, and contains 16 features. The dataset source can be found in<ref type="bibr" target="#b11">[12]</ref>. The datasets mentioned below can all be found at the following link: https://botometer.osome.iu.edu/bot-repository/datasets.html • TW_8 (gilani2017): The dataset consists of 2503 Twitter accounts, of which 1413 real and 1090 fake, and contains 41 features. • TW_9_LP (caverlee11): The dataset consists of 41499 Twitter accounts, of which 19276 real and 22223 fake, and contains 8 features. • TW_9_M (midterm2018): The dataset consists of 50538 Twitter accounts, of which 8092 real and 42446 fake, and contains 18 features. • TW_9_10: The dataset consists of 15,810 Twitter accounts, of which 7905 real and 7905 fake, and contains 43 features. 
In particular, the latter dataset was obtained by concatenating multiple datasets: the verified-2019 and celebrity-2019 datasets (real accounts), and the pronbots-2019, botwiki-2019, political-bots-2019, and vendor-purchased-2019 datasets (fake accounts).</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://scikit-learn.org/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This publication was produced with the co-funding of the European Union -Next Generation EU: NRRP Initiative, Mission 4, Component 2, Investment 1.3 -Partnerships extended to universities, research centers, companies and research D.D. MUR n. 341 del 5.03.2022 -Next Generation EU (PE0000014 -"Security and Rights In the CyberSpace -SERICS" -CUP: H93C22000620001).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Information credibility on Twitter</title>
		<author>
			<persName><forename type="first">C</forename><surname>Castillo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mendoza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Poblete</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th international conference on World wide web</title>
				<meeting>the 20th international conference on World wide web<address><addrLine>Hyderabad, India</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="675" to="684" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Phishing attacks: A recent comprehensive study and a new anatomy</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Alkhalil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hewage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Nawaf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Khan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Frontiers in Computer Science</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page">563060</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">BotSpot: Deep learning classification of bot accounts within Twitter</title>
		<author>
			<persName><forename type="first">C</forename><surname>Braker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shiaeles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bendiab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Savage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Limniotis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Internet of Things, Smart Spaces, and Next Generation Networks and Systems</title>
				<meeting><address><addrLine>Petersburg, Russia</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="165" to="175" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">DeepProfile: Finding fake profile in online social network using dynamic CNN</title>
		<author>
			<persName><forename type="first">P</forename><surname>Wanda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">J</forename><surname>Jie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Information Security and Applications</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="page">102465</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Deep neural networks for bot detection</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kudugunta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ferrara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">467</biblScope>
			<biblScope unit="page" from="312" to="322" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Identifying compromised accounts on social media using statistical text analysis</title>
		<author>
			<persName><forename type="first">D</forename><surname>Seyler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">L</forename><surname>Zhai</surname></persName>
		</author>
		<idno>Repository abs/1804.07247</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
	<note type="report_type">Computing Research</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Detection of fake profiles on Twitter using hybrid SVM algorithm</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kodati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">R</forename><surname>Kumbala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mekala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Murthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">C S</forename><surname>Reddy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">E3S Web of Conferences</title>
				<meeting><address><addrLine>Ternate, Indonesia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">309</biblScope>
			<biblScope unit="page">1046</biblScope>
		</imprint>
		<respStmt>
			<orgName>EDP Sciences</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Fake account detection in Twitter using logistic regression with particle swarm optimization</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">K</forename><surname>Bharti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pandey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Soft Computing</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="11333" to="11345" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Worth its weight in likes: Towards detecting fake likes on Instagram</title>
		<author>
			<persName><forename type="first">I</forename><surname>Sen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aggarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kumaraguru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Datta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th ACM Conference on Web Science</title>
				<meeting>the 10th ACM Conference on Web Science<address><addrLine>Amsterdam, Netherlands</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="205" to="209" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Fake account identification in social networks</title>
		<author>
			<persName><forename type="first">L</forename><surname>Caruccio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Desiato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Polese</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2018 IEEE international conference on big data (big data)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="5078" to="5085" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Malicious account identification in social network platforms</title>
		<author>
			<persName><forename type="first">L</forename><surname>Caruccio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cimino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cirillo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Desiato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Polese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Tortora</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Internet Technology</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="1" to="25" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Social fingerprinting: Detection of spambot groups through DNA-inspired behavioral modeling</title>
		<author>
			<persName><forename type="first">S</forename><surname>Cresci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Di Pietro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Petrocchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Spognardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tesconi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Dependable and Secure Computing</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="561" to="576" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The decision tree classifier: Design and potential</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H</forename><surname>Swain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hauska</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Geoscience Electronics</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="142" to="147" />
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">O</forename><surname>Kramer</surname></persName>
		</author>
		<title level="m">K-nearest neighbors, Dimensionality reduction with unsupervised nearest neighbors</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="13" to="23" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Classification of normal and pre-ictal EEG signals using permutation entropies and a generalized linear model as a classifier</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">O</forename><surname>Redelico</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Traversaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D C</forename><surname>García</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">A</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Risk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Entropy</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page">72</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Bayesian Naïve Bayes classifiers to text classification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Information Science</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="page" from="48" to="59" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Random forest classifier for remote sensing classification</title>
		<author>
			<persName><forename type="first">M</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International journal of remote sensing</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page" from="217" to="222" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Hyperparameter tuning using GridSearchCV on the comparison of the activation function of the ELM method to the classification of pneumonia in toddlers</title>
		<author>
			<persName><forename type="first">D</forename><surname>Kartini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">T</forename><surname>Nugrahadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farmadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2021 4th International Conference of Computer and Informatics Engineering (IC2IE)</title>
				<meeting><address><addrLine>Jakarta, Indonesia</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="390" to="395" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Evaluation of random forest and adaboost tree-based ensemble classification and spectral band selection for ecotope mapping using airborne hyperspectral imagery</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>-W. Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Paelinckx</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Remote Sensing of Environment</title>
		<imprint>
			<biblScope unit="volume">112</biblScope>
			<biblScope unit="page" from="2999" to="3011" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A fast iterative nearest point algorithm for support vector machine classifier design</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Keerthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Shevade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bhattacharyya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Murthy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE transactions on neural networks</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="124" to="136" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Detecting Twitter fake accounts using machine learning and data reduction techniques</title>
		<author>
			<persName><forename type="first">A</forename><surname>Homsi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Al-Nemri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Naimat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kareem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Al-Fayoumi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Abu</forename><surname>Snober</surname></persName>
		</author>
		<idno type="DOI">10.5220/0010604300880095</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="88" to="95" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">A feature selection method based on information gain and genetic algorithm</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lei</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCSEE.2012.97</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Computer Science and Electronics Engineering</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="355" to="358" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">The effect of recursive feature elimination with cross-validation (RFECV) feature selection algorithm toward classifier performance on credit card fraud detection</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Z</forename><surname>Mustaqim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Adi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Pristyanto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Astuti</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICAICST53116.2021.9497842</idno>
	</analytic>
	<monogr>
		<title level="m">International Conference on Artificial Intelligence and Computer Science Technology (ICAICST)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="270" to="275" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">mRMR-based feature selection for classification of cotton foreign matter using hyperspectral imaging</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.compag.2015.10.017</idno>
		<ptr target="https://doi.org/10.1016/j.compag.2015.10.017" />
	</analytic>
	<monogr>
		<title level="j">Computers and Electronics in Agriculture</title>
		<imprint>
			<biblScope unit="volume">119</biblScope>
			<biblScope unit="page" from="191" to="200" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
