Identifying Fake Profile in Online Social Network

Himanshi Gupta Nagariya, Neha Dhanotiya, Shruti Joshi and Sarika Jain
National Institute of Technology, Kurukshetra, India

Abstract
Online Social Networks (OSNs) connect a huge number of people from all over the world and have become a big part of their lives. People use social networks to share their feelings, make new friends, set up new businesses, and stay connected with friends and family. OSNs benefit individuals in many ways, but they also have disadvantages: many people use these networks to harm others by creating fake accounts. Machine learning algorithms can be used to tell such fake accounts from genuine ones; they predict and classify data through the different models that are trained. It is sometimes difficult to choose between the results of different models, and a hybrid approach that combines machine learning algorithms can make this task easier. In our work we compared 8 different combinations of classification algorithms and calculated their accuracy on a dataset from an Online Social Network. We used combinations of Random Forest, Support Vector Machine, Logistic Regression, KNN, and Decision Trees. After comparing the results of each hybrid approach, we concluded that the best accuracy was obtained by the combination of SVM, Logistic Regression and Neural Network. We therefore propose a model for fake account detection based on the hybrid approach that gives the best accuracy among all the combinations.

Keywords
Online Social Network, Fake Account Detection, Feature Extraction, Spammer.

ACI'21: Workshop on Advances in Computational Intelligence at ISIC 2021, February 25–27, 2021, New Delhi, India
EMAIL: himanshi100497@gmail.com (H. Nagariya); nehadhanotiya1612@gmail.com (N. Dhanotiya); shrutijoshijma@gmail.com (S. Joshi); jasarika@nitkkr.ac.in (S. Jain)
ORCID: 0000-0002-7432-8506 (S. Jain)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Machine Learning is a branch of artificial intelligence (AI) that gives a system the ability to act without being explicitly programmed. It is used in many fields, such as Google cars, recommendation engines, friend suggestions in social media networks, shopping apps and cybercrime. Machine Learning has brought a phenomenal change in the way data is extracted and interpreted by replacing the old statistical techniques. Machine learning techniques are classified as Reinforcement, Supervised and Unsupervised Machine Learning.

Our work is concerned with classification algorithms, which come under Supervised Machine Learning. Classification is a supervised learning approach in which the machine learns from the input data and then classifies the testing data according to its training data. Although classification algorithms (Support Vector Machine, Logistic Regression, Decision Tree, Random Forest, Artificial Neural Network) can be used separately and individually, in our system we develop a hybrid model: combining two or three machine learning models has helped in increasing the accuracy of the model and its predictive power. Which hybrid model will perform better is not known in advance; it also depends on the dataset provided and on the feature selection. A hybrid model is developed in a two-stage manner: the first stage uses clustering or classification techniques for pre-processing of data, and the second stage uses the output of the first stage to build the second-stage predictive classifier. It can be built using different algorithms of supervised or unsupervised learning, but in our work we developed the model using classification algorithms of supervised learning. Our main contribution is to propose a hybrid approach of machine learning algorithms and to compare hybrids of different classification algorithms. Eight different experiments were conducted, and the accuracy thus obtained was compared.

The total number of users of online social networking sites is continuously increasing, and with it the number of fake accounts. As of September 2019, Facebook had 2.45 billion monthly active users worldwide. According to Alexa, Facebook is the third most visited website after Google and YouTube. One survey found that there are more female accounts in the world than the total female population, from which we can infer how many fake profiles have been created. According to a statistics report from April 2018, Twitter has more than 336 million active accounts, but Facebook is the leader with 2,196 million users worldwide. Of Facebook's 2.45 billion monthly active users in September 2019, India has the most, with 270 million users. Approximately 1.62 billion people log on to Facebook daily, and 83 million Facebook accounts are fake. These statistics were given by Facebook in its Wall Street reports (source: Zephoria Digital Marketing). Figure 1 shows the monthly active users of various OSNs in the year 2019.

Figure 1: Monthly active users in different OSNs in year 2019

1.1. Motivation

As the number of people using OSNs increases, so does the creation of fake social media accounts. The main motivation for identifying fake accounts is the cybercrime rate, which has increased significantly over the last few years, as these accounts are created primarily to commit cyber robbery or other cybercrime anonymously. Fake account owners also try to take advantage of people's kindness by composing fake messages and spreading false news through these accounts in order to steal money from innocent people. In addition, people create multiple accounts that do not belong to anyone, just to raise votes in an online voting system or to receive referral incentives, as in online games.

The detection of fake accounts in OSNs attracts many researchers, so several algorithms for detecting fake accounts have been developed using machine learning techniques and various features associated with the account. Spammers, however, find ways around such techniques, so these security technologies require the continuous development of new approaches to spam detection. The main difficulties in detecting fake accounts are achieving accuracy and an acceptable response time in the analysis of account characteristics.

1.2. Challenges

Modeling a fake profile detection system is an old problem, but because of the many challenges it presents, there are still many gaps that have been identified and need to be worked upon. The challenges are listed below:

• The data is not readily available: Accounts on online social networks are highly private and protected, so the networking sites do not reveal any account information, in order to maintain confidentiality and keep the trust of their users.
• There is a lot of overlap between genuine and fake accounts: At times the feature sets of legitimate and fake accounts overlap, which is a considerable setback when training the neural network to learn the pattern that differentiates between them.
• The number of parameters to process: The enormous number of parameters between learning and decision making is a major obstacle in developing systems for detecting fake accounts.
• Selection of optimal features (variables) is a big challenge: Optimal feature selection needs to be handled with care, as the performance of the whole system depends on which features it considers when classifying fake and genuine accounts. At times it is really perplexing to decide on these optimal features.
• Ability to handle noise in the data: Noise means missing or incorrect data, which poses challenges while processing the dataset. There is no means of making up for this lost information, as such systems aren't partition tolerant, so this adversely affects the outcome.
• Heterogeneity in features.
• Single user multiple accounts.
• Many of the times it resembles a legitimate transaction: At times fake account activities closely resemble legitimate ones. Hence, it becomes difficult to recognize them and abort them before they make it to completion.

1.3. Gaps Identified

• The evaluation of the proposed features can be extended by testing on different social networking sites such as Facebook and Twitter, as most previous research was done on only one site among Facebook, Twitter, LinkedIn, Myspace, etc.
• The existing systems do not work accurately in real time when the features change.
• Identification of rumor sources on social media by using content-based features.

1.4. Related Work

Fake profile detection is an old problem, and a lot of work has been done on providing an optimal solution. But because the behavior of fake accounts keeps evolving with time and an enormous number of challenges and gaps are still left to tackle, this problem still has a lot of significance. To study the work already done on fake account detection, we searched for articles and research papers in two major kinds of sources: i) general online indexing websites, and ii) publisher databases. Examples of the former are ResearchGate, Towards Data Science, IEEE Xplore and Google Scholar; examples of the latter are Scopus, Springer, ACM Digital Library and Elsevier databases.

The major machine learning techniques we considered for the comparative analysis of fake account detection are Neural Network, Support Vector Machine, Random Forest, and hybrid models.

Yang et al. trained an SVM on the ground truth obtained from RenRen to detect fake accounts. Using simple features such as frequency of friend requests, accepted requests and per-account clustering coefficient, they trained the classifier and obtained a 99% true-positive rate (TPR) and a 0.7% false-positive rate (FPR). Íntegro extracts low-cost features from user-level activities to train a classifier that identifies potential victims in the social graph, and uses feature-based detection.

A different hybrid approach was introduced by Mateen et al., who used content-based features such as total number of tweets, hashtag ratio and URL ratio, together with some graph-based features, on a Twitter dataset. They also compared J48, Decorate and Naïve Bayes, among which Decorate was the best performer. Somya et al.'s approach was quite different from the others: they tried to flag an account as fake on the user's homepage using a Chrome extension that runs on the user's side. Along with this, they used a Petri net based solution, running in the Pn2 simulator environment, to identify the source of malicious content.

Using a support vector machine and a neural network, Khaled et al. [20] obtained 98% accuracy and compared it with the accuracy obtained by the hybrid of SVM and NN. BalaAnand et al. [3] achieved 90.3% accuracy using a random forest classifier, a support vector machine, and the k-nearest neighbor method. For their work, Gupta et al. [7] selected a labeled Twitter dataset with specific user and tweet features. They used a hybrid of naive algorithms to classify, cluster, and make highly accurate decisions.

1.5. Organization

In our work we implemented various algorithms to find the most efficient one. To do so we conducted several experiments and compared their results. The rest of this paper has three sections, which are briefly described below.

Section 2, System Architecture, introduces and briefly describes the flow diagram and architecture of our work.

Section 3, Experimental Results, shows the modules we have implemented so far along with pseudocode, discusses the results produced by our system, and presents the outputs generated on various inputs as graphs for better understanding. The algorithm of the technique is also mentioned in this section.

Section 4, Conclusion, gives the overall conclusion of the proposed solution, i.e. the combination of techniques that is more efficient than the others and gives better accuracy.

2. System Architecture

Although fake profile detection is a robust field, it has many challenges and gaps, which we have discussed and on which we have based our work. There are many existing solutions to fake profile detection, but all of them have some drawback. A lot of work has already been done in this field, and a lot more needs to be done, such as improving the response time and preventing fake accounts instead of detecting them and dealing with their aftermath. Our work aims to deliver a system with the highest accuracy, which will therefore be effective in preventing such fake profiles, by implementing and comparing different algorithms. This is done with an ensemble machine learning technique, which speeds up the training of neural networks and helps them to take decisions faster. Efficient parameter selection is also one of the major objectives of this work, for which we select six features manually, giving better control over the output of the neural network. The proposed solution uses a hybrid of three machine learning techniques, combining their advantages and using one to cancel out the weaknesses of another, and hence delivers an efficient and cost-effective system.

In our proposed system we aim to design a hybrid system using an artificial neural network, a support vector machine and logistic regression that will be able to precisely and accurately detect fake profiles in online social networks. The goal of the work is to maximize the accuracy and to minimize the time required by using this hybrid approach of Neural Network, Support Vector Machine and Logistic Regression.

Figure 2 depicts the flowchart of our system. The dataset is partitioned into two sets, Train Dataset and Test Dataset, in the ratio 4:1. The train dataset goes into the Support Vector Machine and Logistic Regression classifiers, where classes are predicted. These classifiers are then appended to a voting classifier, where the final decision on the class is made. The output from the voting classifier, i.e. the train data together with the predicted class, is fed to the Neural Network classifier as input. After training has been completed, we get a trained system on which the test dataset is run to find the accuracy of the system.

Figure 2: Flowchart of proposed system

Figure 3 depicts the architecture of the proposed system. The first step is the collection of data from the social networking site on which fake accounts are to be detected; in our work we collect the data from web sources. The data is then preprocessed using feature extraction techniques; in our work we select the features manually. The data is then used for training, the result is passed to the voting classifier, the neural network classifier is trained and tested, and we obtain the result in the form of fake and real accounts.

Figure 3: Architecture of proposed system

2.1. Algorithm for System

Algorithm
INPUT: The dataset from CSV files
OUTPUT: Accuracy
1. Read dataset: Read genuineusers.csv and fakeusers.csv and append them in a list named x, and make a list y for labelling the class. Return x, y.
2. Feature Extraction: Convert non-integer features in the dataset to integer. Store and overwrite the selected 6 features in list x. Return x.
3. Split data into training data and test data using 5-fold cross validation and store them separately in x_train, x_test, y_train, y_test.
4. Scaling of the X data (x_train, x_test).
5. Use an ensemble classifier, the voting classifier, with SVM and Logistic Regression.
6. Store the result in the y_pred variable and return y_pred.
7. Repeat step 3 with y_pred and x_test.
8. The output from step 3 is given to the Neural Network, and the output is then stored in y_pred.
9. Testing: Evaluate our trained model against the test data. The output is a visual graph of True_Positive_Rate against False_Positive_Rate with accuracy, i.e. the ROC curve.
10. Print the classification accuracy on the testing dataset. Plot the confusion matrix. Print the execution time.
11. Exit

3. Experimental Result

No proposal can be modeled into a system without some experiments to support it. In this section we include the results and outputs produced during experiments with our system under various inputs and parameters.

3.1. Implementation Details

Each phase of our proposed system is briefly described in this section, along with the results at each stage.

3.1.1. Data Collection

For the model to work upon, there is a need for data collection. The dataset can be collected from various online platforms and can also be created by using a crawler. We collected two datasets online from the well-known websites Kaggle and GitHub. We worked on the dataset collected from Kaggle, which consists of two CSV files corresponding to fake and genuine users. Figure 5 shows a sample of the CSV file. The code for reading both files is:

    import pandas as pd
    genuineusers = pd.read_csv("users.csv")
    fakeusers = pd.read_csv("fusers.csv")

Figure 5: Sample of CSV file

3.1.2. Data Preprocessing

Data pre-processing is used to achieve a better result from any machine learning model by cleaning the raw data. We import libraries that rescale or clean our data: numpy, pandas and scikit-learn, and from sklearn we import preprocessing to clean our data.

In the next part of data pre-processing we use feature extraction. We first tried principal component analysis, then a genetic algorithm, and after that we selected the features manually. We compared the results obtained in these three ways and got the better result from manual selection of features. The features we selected manually are:

• statuses_count
• followers_count
• friends_count
• favourites_count
• listed_count
• lang_code

The language code feature is of string type, so we convert it into an integer. After the feature extraction function is called, it prints the extracted feature names and summarizes each extracted feature by printing count, mean, std, min, quartiles, max, etc.

Figure 4 shows the data distribution in each column or feature in terms of count, mean, standard deviation, minimum and maximum values, and the 25%, 50% and 75% quartile values of the data points when taken in ascending order.

Figure 4: Data distribution in each column

3.1.3. Training of Classifiers

Since our proposed system uses a hybrid approach, we experimented with six techniques, i.e. SVM, RF, LR, DTC, NN and KNN, and selected the techniques that give the best result: Support Vector Machine, Logistic Regression and Neural Network. First we train on our data using the support vector machine alone, and then using Logistic Regression alone. After analyzing the results of both classification techniques, we merge them to check their combined accuracy; the hybrid of the two techniques gives us the best result. After both have been trained, a voting classifier is used to obtain the best result from the two and pass on the value of one of them. We then use 5-fold cross validation to avoid overfitting: in k-fold cross validation the dataset is divided into k folds, where one fold is used for validation or testing while the others are used for training. After the score of each fold is obtained, the final estimated score is printed; here we got 0.91, and the accuracy on the testing dataset is 99.56%. The confusion matrix is then plotted, which gives 261 true positives, 7 false negatives, 29 false positives and 267 true negatives. We then plot the normalized confusion matrix, which gives all four values (TP, TN, FP, FN) in percentage form, along with precision, recall, F1 score and support, all of which are evaluation criteria. The recall is 0.98 for fake accounts and 1.00 for genuine accounts, the F1 score for both is 0.99, and the overall accuracy is 0.99.

3.1.4. Training of Neural Network

Figure 6 shows the training of the neural network. Each line corresponds to one round of forward and backward propagation, called an epoch. For this instance, we took the number of epochs to be 10 and the total number of layers to be 3. Training the system took approximately … minutes and … seconds, with final accuracy and loss values of … and … respectively.

Figure 6: Training of Neural Network

We now turn to the output produced by the several hybrid techniques. We collected two datasets, D1 and D2, which differ in size: D2 is larger, with approximately 3500 rows, while D1 contains approximately 1500 rows. The results obtained with the different algorithms differ between the two datasets, and the system is less accurate on D2 than on D1.

Table 1: Comparison of Accuracy (Support Vector Machine, Decision Tree Classifier, Logistic Regression, Random Forest, Neural Network)

Hybrid Techniques    D2        D1
SVM+RF+NN            91.94%    96.01%
SVM+LR+NN            97.3%     99.56%
RF+LR+NN             93.32%    95.79%
SVM+DTC+LR+NN        96.34%    99.33%
SVM+DTC+RF+NN        92.87%    95.79%
SVM+DTC+NN           91.48%    96.45%
SVM+RF+KNN+NN        92.31%    97.12%

As we can see, the accuracy of the different algorithms differs between the two datasets, so from here on we work with and show results for dataset D1 only. We use two CSV files, one of genuine users and the other of fake users. Figure 7 shows the accuracy of each of our experimental models in ascending order, the model with the highest accuracy being our trained system.

Figure 8: Confusion matrix of proposed hybrid model

Table 2 shows the results of the seven experiments that we performed using different combinations of classification algorithms such as Support Vector Machine, Random Forest, Logistic Regression and KNN with Neural Network. In Table 2 we can see that the combination of SVM, Logistic Regression and NN gives the highest combined count of true positives and true negatives, resulting in the maximum accuracy of all.
Table 2: Results of combination of several techniques

Hybrid Techniques    TP    FP    FN    TN    Accuracy (%)
SVM+RF+NN            56    3     0     54    96.01
SVM+LR+NN            55    1     0     57    99.56
RF+LR+NN             55    4     0     54    95.79
SVM+DTC+LR+NN        54    2     0     57    99.33
SVM+DTC+RF+NN        55    4     0     54    95.79
SVM+DTC+NN           55    5     0     53    96.45
SVM+RF+KNN+NN        55    4     0     54    97.12

Figure 7: Accuracy of models in ascending order

Figure 8 shows the confusion matrices for our proposed hybrid model, which summarize the true positives, true negatives, false positives and false negatives without normalization.

4. Conclusion

If we look at system designs, the majority of implementations for fake account detection are either graph-based or feature-based, and they use graph analysis techniques or machine learning techniques to identify accounts as fake or real. In our proposed framework we use a feature-based dataset and select the features manually. This approach is based on user-level activities and the user's account details. We compare hybrid approaches of different classification algorithms: the classifiers are passed to a voting classifier, and the result obtained from the voting classifier is then passed to a neural network. We achieved the highest accuracy in detecting fake accounts by training and testing the dataset on different hybrid approaches of classification algorithms. The results show an increase in accuracy for the hybrids of the different classification algorithms.

5. References

[1] Joshi, Shruti, et al. "Identifying Fake Profile in Online Social Network: An Overview and Survey." International Conference on Machine Learning, Image Processing, Network Security and Data Sciences. Springer, Singapore, 2020.
[2] Mohanty, Sachi, et al. Recommender System with Machine Learning and Artificial Intelligence. Wiley-Scrivener, 2020.
[3] Balaanand, Muthu, et al. "An enhanced graph-based semi-supervised learning algorithm to detect fake users on Twitter." The Journal of Supercomputing 75.9 (2019): 6085-6105.
[4] Boshmaf, Yazan, et al. "Integro: Leveraging Victim Prediction for Robust Fake Account Detection in OSNs." NDSS. Vol. 15. 2015.
[5] Erşahin, Buket, et al. "Twitter fake account detection." 2017 International Conference on Computer Science and Engineering (UBMK). IEEE, 2017.
[6] Mateen, Malik, et al. "A hybrid approach for spam detection for Twitter." 2017 14th International Bhurban Conference on Applied Sciences and Technology (IBCAST). IEEE, 2017.
[7] Gupta, Arushi, and Rishabh Kaushal. "Improving spam detection in online social networks." 2015 International Conference on Cognitive Computing and Information Processing (CCIP). IEEE, 2015.
[8] Rahman, Sazzadur, et al. "Detecting malicious Facebook applications." IEEE/ACM Transactions on Networking 24.2 (2015): 773-787.
[9] Sahoo, SomyaRanjan, and Brij B. Gupta. "Hybrid approach for detection of malicious profiles in twitter." Computers & Electrical Engineering 76 (2019): 65-81.
[10] Kaur, Ravneet, and Sarbjeet Singh. "A survey of data mining and social network analysis based anomaly detection techniques." Egyptian Informatics Journal 17.2 (2016): 199-216.
[11] Jia, Jinyuan, Binghui Wang, and Neil Zhenqiang Gong. "Random walk based fake account detection in online social networks." 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 2017.
[12] Dhawan, Sanjeev. "Implications of Various Fake Profile Detection Techniques in Social Networks." IOSR Journal of Computer Engineering (IOSR-JCE), AETM'16 (2016): 49-55.
[13] Gurajala, Supraja, et al. "Fake Twitter accounts: profile characteristics obtained using an activity-based pattern detection approach." Proceedings of the 2015 International Conference on Social Media & Society. 2015.
[14] Xiao, Cao, David Mandell Freeman, and Theodore Hwa. "Detecting clusters of fake accounts in online social networks." Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security. 2015.
[15] Adikari, Shalinda, and Kaushik Dutta. "Identifying fake profiles in linkedin." arXiv preprint arXiv:2006.01381 (2020).
[16] Al-Qurishi, Muhammad, et al. "A prediction system of Sybil attack in social network using deep-regression model." Future Generation Computer Systems 87 (2018): 743-753.
[17] Masood, Faiza, et al. "Spammer detection and fake user identification on social networks." IEEE Access 7 (2019): 68140-68152.
[18] Cresci, Stefano, et al. "Fame for sale: Efficient detection of fake Twitter followers." Decision Support Systems 80 (2015): 56-71.
[19] Yang, Zhi, et al. "Uncovering social network sybils in the wild." ACM Transactions on Knowledge Discovery from Data (TKDD) 8.1 (2014): 1-29.
[20] Khaled, Sarah, Neamat El-Tazi, and Hoda MO Mokhtar. "Detecting fake accounts on social media." 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018.
[21] Gupta, Aditi, and Rishabh Kaushal. "Towards detecting fake user accounts in facebook." 2017 ISEA Asia Security and Privacy (ISEASP). IEEE, 2017.
[22] Benevenuto, Fabricio, et al. "Detecting spammers on twitter." Collaboration, electronic messaging, anti-abuse and spam conference (CEAS). Vol. 6. No. 2010. 2010.
[23] Stein, Tao, Erdong Chen, and Karan Mangla. "Facebook immune system." Proceedings of the 4th workshop on social network systems. 2011.