=Paper=
{{Paper
|id=Vol-2915/paper3
|storemode=property
|title=Research on Phishing Email Detection Based on URL Parameters Using Machine Learning Algorithms
|pdfUrl=https://ceur-ws.org/Vol-2915/paper3.pdf
|volume=Vol-2915
|authors=Milda Tubytė,Agnė Paulauskaitė-Tarasevičienė
|dblpUrl=https://dblp.org/rec/conf/ivus/TubyteP21
}}
==Research on Phishing Email Detection Based on URL Parameters Using Machine Learning Algorithms==
Research on phishing email detection based on URL parameters using machine learning algorithms Milda Tubyte and Agne Paulauskaite-Taraseviciene a a Department of Applied Informatics, Kaunas University of Technology, Studentu 50, Kaunas, Lithuania milda.tubyte@ktu.edu, agne.paulauskaite-taraseviciene@ktu.lt Abstract Abstract – phishing is the most frequent data breach problem in cybersecurity. Cyber scammers use the phishing approach to outwit or obtain sensitive information without a consumer agreement. The victims might receive an email that promotes clicking on the following malicious links that lead to sensitive data leaks. This problem is especially relevant to large companies. Attackers tend to prepare emails that contain work-related information and include familiar keywords in phishing URL1s. This paper addresses the URL Boolean classification problem using various machine learning methods such as Support Vector Machine, Random Forest, Decision Tree, Linear Discriminant Analysis, and Logistic Regression. This paper provides a comparative study on these algorithms applied for two different URL classification datasets. Keywords Phishing, URL, machine learning, cybersecurity, classification 1. Introduction The phishing technique still scores as one of the top cybersecurity threats even though it was found in the late 90s. Cybercriminals use this approach because of its simplicity. By creating trustworthy email and encouraging to press provided URL, cybercriminals trick their victim and achieve sensitive data by manipulation. The website might appear as a well-known login form that asks for credentials. Additionally, the website could automatically install malware or inject drive-by exploits in the user's browser. The most targeted companies are Google, Microsoft, Dropbox, PayPal, Apple [1]. By targeting the most popular companies, cybercriminals attempt to gain credentials that would lead them to examine the organization's infrastructure and collect data. Moreover, some consumers choose to use the same credentials to other websites or applications that might open opportunities for attackers to expand criminal activities. Phishing emails and business email compromise cause 67% or more data breaches [2], as stated in the Verizon 2020 data breach investigation report. These numbers provide insight that as organizations moved to SaaS2 applications, phishing will continue to grow. F5 labs provide vital statistics on phishing patterns. As reported by 2019 research, 85% of tested phishing sites used certificates signed by trusted Certificate Authorities, 71% of phishing sites used HTTPS3 [3]. It is one of the cybercriminal strategies to outwit consumers into thinking that the website is legitimate. Although, 36% of phishing sites had certificates lasting only 90 days. It might be worth checking the expiration date since criminals do not use SSL4 certificates for long periods. Three main phishing target groups can be distinguished: indiscriminate, semi-targeted, and spear phisher [4]. The indiscriminate group usually gets the same content phishing email that often pretends to be a tech brand like Microsoft or Google. Semi-target groups most often are from the same working space. And the last one usually targets C-levels or system admins. Diverse phishing techniques apply to each group which is why phishing email detection is so troublesome. Attackers improve their methods 1 Uniform Resource Locator 2 Software as a service 3 Hypertext Transfer Protocol 4 Secure Sockets Layer Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). each month, besides various automation tools increase the processing speed. Machine learning is one of the components that could help to prevent phishing attacks. Combining it with authorization tools and phishing detection courses could minimize the possibility of a successful attack [5]. Since the most common phishing type is receiving a letter with a malicious URL, this paperwork investigates machine learning predictions based on URL features. By extracting URL features from a letter, machine learning algorithms can find the patterns that lead to identifying the URL type. In this research, two datasets were used with five different machine learning techniques. By creating the future extraction code, deployed machine learning models provide a prediction on the selected URL. Experiments include Logistic Regression, Linear Discriminant Analysis, Decision Tree, Support Vector Machine, and Random Forest machine learning methods. The rest of the paper is organized as follows. Section 2 reviews related work and compares methods that are used in this paper. Section 3 describes used machine learning techniques. Section 4 reveals the results, selected each model parameter, and dataset comparison. Section 5 concludes the paperwork results. 2. Related work Machine learning algorithms became a prevalent research study for the past years. Recent machine learning studies gave high accuracy results on classifying URLs as phishing and legitimate type [6], [7], [8]. For instance, a study declared 99.7% accuracy with a negligible false positive rate of about 0.06% using a random forest algorithm [9]. Another study based on phishing website detection has implemented the SVM method and reached 95% accuracy using six features only [10]. The study dataset has been created using legitimate URLs from browsing history and phishing URLs from the PhishTank database. However, the study estimates, if the URL does not include all features, the prediction gives wrong results. A. Akinyelu and A. O. Adewumi developed fifteen URL feature extraction code using C# language and performed the Random Forest algorithm. The model overall accuracy reached an impressive 99.7% result with a negligible false positive rate of about 0.06%. [11] The study Unbiased Phishing Detection Using Domain Name-Based Features by Hossein Shirazi used Rami M. Mohammad, Fadi Thabtah, and Lee McCluskey created rule-based dataset that this paperwork investigates in section [12]. A rule-based threshold becomes a critical point that sets the features as phishing or legitimate type for each URL attribute [13]. Because of strict rules, the model might lose its flexibility. That is the reason why the researcher Hossein Shirazi removed rule-based thresholds and experimented with actual values. The SVM method with the Gaussian kernel parameter provided a 93.7% overall accuracy result. For comparison, the experiments were done on a threshold- based dataset in the experiments section. 3. Research methods Five machine learning approaches used in this paper are related to the URL classification problem. The target result is to detect if the URL is a legitimate or a phishing type. Logistic regression is an efficient and powerful algorithm for Boolean classification problems, therefore can be used to classify the URL as phished or legitimate [14]. Different studies have shown promising results while using logistic regression for predicting phishing [15], [16], [17]. The algorithm uses the Sigmoid function that maps result values between 0 and 1. The model requires defining a threshold value that sets a boundary between two classes. Predictions are made based on that boundary. The Linear Discriminant Analysis (LDA) data classification technique is based on dimension reduction. The method maximizes class separability by reaching the maximum ratio of between-class variance [18]. However, in some cases, alternative and discriminant analysis-based methods can provide higher accuracy results than standard LDA. For example, Biased Discriminant Analysis (BDA) [19], cluster space linear discriminant analysis (CSLDA) [20]. Categorical variable decision tree targets to predict class type values. The model constructs a tree structure that contains a root node that represents entire dataset observations, leaves which are the last node that does not split, decision nodes which are sub-node that splits into further sub-nodes. Some studies display quite great prediction results using the DT algorithm [21], [8], [22]. Support Vector Machine (SVM) approach for classification problem requires that classes could be separable by a linear boundary. The algorithm is constructed to predict Boolean type targets by determining the separating hyperplane between two classes. The method uses kernel functions for classification [23], [24]. The random forest (RF) algorithm randomizes classification or regression trees, averages calculated predictions, and merges them to achieve more accurate results. Instead of looking for the most important feature in the node, the algorithm searches for the best feature among randomized subsets. The algorithm is widely used in phishing URL classification problems because of its great performance [25], [26], [7]. 4. Experiments The experiments below represent two separate dataset accuracy results applying supervised machine learning algorithms. The first dataset focuses on URL symbol appearance and the second one on the created threshold rules. The experimental results provide a comparison of datasets and insight into each dataset's advantages and drawbacks. The University of Maribor provided two datasets that contain 111 features of URL specifics [27]. The last dataset feature represents the target phishing attribute. The phishing attribute is the Boolean value (0 is legitimate, 1 is phishing URL). The class balance in provided datasets variates. One of the datasets has approximately the same amount of both classes. The second variant distribution is about 30% of phishing and 70 % of legitimate observations. The purpose of different variations is to refer to life conditions were about the same present URL distribution. The datasets examine Domain, Directory, File, Parameter, URL, and external services features (Figure 1). It is not enough to evaluate the symbols that appear in the URL. At first sight, the URL can seem trustworthy, so additional information from external services about the domain and URL is significant. The datasets with unbalanced variation were chosen for the first accuracy experiment. Figure 1: URL feature separation Researchers Rami M. Mohammad, Fadi Thabtah, and Lee McCluskey created an intelligent rulebased phishing website classification method [13]. In 2012, seventeen features were described with rule-based threshold values that determine values as 1 (legitimate), 0 (suspicious), or -1 (phishing). The final feature represents the URL as a Boolean result (-1 - phishing, 1 - legitimate). The observations were collected from the Phishtank database and yahoo directory. In 2015 researchers added more rules and published a dataset that is still commonly used in scientific experiments. The dataset investigates five groups’ attributes based on URL, DNS5, External statistics, HTML6, and JavaScript. The final dataset contains 31 features, including targeting URL type. Most features require additional information from external services like the WHOIS or the Alexa databases. 4.1. Accuracy evaluation In this research, confusion matrix metrics were used to evaluate the performance of included machine learning algorithms. Since phishing URL classification is a two-class Boolean problem, confusion matrix dimensions are M2x2. The confusion matrix's purpose is to display model predicted and actual class values [28]. The correct model provided predictions are True Negative (TN) and True Positive 5 Domain Name System 6 Hyper Text Markup Language (TP) values. Incorrect predictions accordingly are False Negative (FN) and False Positive (FP) values. The following formulas need to be applied to obtain the accuracy (1) and the error rate (2) of the final model. Table 1 Confusion matrix metrics Predicted negative Predicted Positive Actual negative TN FP Actual positive FN TP 4.2. Experimental results The experiments below represent the results of .csv type data. The second dataset format was converted from .arff to .csv. The first experiments were done at the University of Maribor provided an unbalanced dataset that contained 88648 observations. A correlation matrix of a dataset was created with a threshold value of 0.7 based on the target column. Twenty-four features were selected with a higher than 70% positive correlation result based on correlation with phishing feature (Figure 2). Features also correlate with each other with a high positive result. Figure 2: Correlation matrix with 0.7 threshold based on phishing feature. Below presented tables (2, 3, 4, 5, 6) provide the confusion matrix of every classification algorithm. Table 7 pictures each class and overall model accuracy results. The LR model was implemented as a binomial family model with a threshold value of 0.5. The DT model was implemented as a categorical CART model. The most impactful parameter was detected as directory length. The SVM model provided the best performance with linear kernels (cost 10). The RF algorithm involves 80 iterations to grow trees. The most impactful parameter was indicated as time-domain activation according to the Mean Decrease Accuracy metrics and directory length according to the Mean Decrease Gini metrics. Directory length feature represents how many symbols the URL contains, and time-domain activation investigates how long the domain is active in days. These features are also investigated by the second dataset. Table 2 Table 3 Confusion matrix of RF algorithm Confusion matrix of SVM algorithm RF Predicted class SVM Predicted class Actual Class Legitimate Phishing Total Actual Class Legitimate Phishing Total Legitimate 16972 428 17400 Legitimate 16816 584 17400 Phishing 529 8665 9194 Phishing 642 8552 9194 Total 17501 9093 26594 Total 17458 9136 26594 Table 4 Table 5 Confusion matrix of Decision Tree (DT) algorithm Confusion matrix of LDA algorithm DT Predicted class LDA Predicted class Actual Class Legitimate Phishing Total Actual Class Legitimate Phishing Total Legitimate 15774 1626 17400 Legitimate 15514 440 17400 Phishing 563 8631 9194 Phishing 886 8754 9194 Total 16337 10257 26594 Total 17400 9194 26594 Table 6 Confusion matrix of Logistic Regression (LR) algorithm LR Predicted class Actual Class Legitimate Phishing Total Legitimate 16379 1021 17400 Phishing 812 8382 9194 Total 17191 9403 26594 The second dataset contains 11056 observations. The correlation matrix below represented that the most correlated feature with the target value was the SSL final state and URL of Anchor (Figure 3). The URL of the Anchor feature examines if HTML tags and the website have different domain names. The SSL final state feature investigates if the URL uses HTTPS protocol and checks certificate age. The previous dataset also includes the SLL attribute that is called the tls ssl certificate. Below presented tables (7, 8, 9, 10, 11) provide statistics of each machine learning accuracy result. Random Forest and Support Vector Machine provided the highest overall accuracy results that reach 96%. However, the SVM algorithm predicted each class with almost the same percentage accuracy, and RF predicted the Legitimate class with a higher result than the Phishing class. The decision tree provided 94% overall accuracy result, Linear Discriminant Analysis and Logistic Regression scored with lowest accuracy results 91 % and 92 % accordingly. Comparing the SVM algorithm with Hossein's research results, using the rule-based threshold dataset, the model provided a 3% higher overall accuracy result. However, the threshold method deprives model flexibility as rule boundaries decide feature class. Since phishing trends change over the years, threshold rules might variate also. The SVM algorithm scored best with kernel option as polynomial, cost value as 50. RF algorithm used 150 tree iterations. The most impactful future was indicated as SSL final state according to Mean Decrease Accuracy and Gini metrics. Decision Tree also detected SLL final state as the most impactful feature. The pruned tree had 51 leaves, and the CP value was selected as 4.37-04. LR performed best with a threshold value of 0.6. Figure 3: Correlation matrix with 0.7 Comparing the two dataset results, the overall prediction accuracy is approximately the same. The first dataset contains 3.6 times more features than the second dataset. It would take more time to develop code that would extract all features represented in a first dataset however, only 15 features require additional information from external services. Moreover, the first dataset uses actual values, and it might avoid URL trend change more over the years. The second dataset is based on strict rules and depends on many external recourses that could fail to provide data. Machine learning models compute faster since there are fewer observations and features. It is possible to remove threshold rules and use actual values, but model accuracy might slightly decrease. Experiments were coded in the R Studio environment using the x86_64-w64-mingw32 platform, 4.0.2 version R. The training was executed on the PC with Intel(R) Core (TM) i5-9600KF CPU @ 3.70GHz 3.70 GHz processor, NVIDIA GeForce GTX 1660 SUPER graphic card 6GB. The first dataset results reveal that the RF (95%) out-performed DT (91%), LDA (91%), LR (93%), SVM (94%) algorithms with the highest overall accuracy (Figure 4). The DT and LDA algorithm scored with the same 91% overall accuracy result. However, LDA classified the phishing feature with the best score of all provided models with 95% accuracy. The lowest precisely classified legitimate type URL was using LDA (89%) and DT (90%). The second dataset result shows that RF and SVM scored with the best 96% overall accuracy results (Figure 5). DT (94%) outperformed LDA (91%) and LR (92%). The second dataset scored with 1% higher accuracy using the Random Forest algorithm. Altogether, each model performed with high accuracy results. Table 12 Percentage accuracy results of included algorithms ML algorithm Average accuracy, % Accuracy results each class Legitimate, % Phishing. % The first dataset results RF 95 97.540 94.246 SVM 94 96.664 93.017 DT 91 90.655 93.876 LDA 91 89.160 95.214 LR 93 94.132 91.168 The second dataset results RF 96.8 97.942 95.543 SVM 96.6 96.661 96.304 DT 94 93.873 95.073 LDA 91 91.848 91.442 LR 92 92.239 92.220 Accuracy results 100 98 96 94 92 90 88 86 84 95 94 91 91 93 RF SVM DT LDA LR Legitimate, % Phishing. % Average accuracy Figure 4: The first dataset accuracy results of five supervised learning algorithms Figure 5: The second dataset accuracy results of five supervised learning algorithms 5. Conclusion This paperwork represented supervised machine learning algorithm classification on phishing and legitimate type URLs. In this study, phishing URL classification is defined as a two-class problem. Five different algorithms were employed using two datasets: Random Forest, Support Vector Machine, Logistic Regression, Linear Discriminant Analysis, Decision Trees. The accuracy of each model was higher than 91% despite the dataset. The RF algorithm performed the highest overall accuracy result on both datasets. However, LDA classified the Phishing URL class with the highest 95% accuracy rate for the first dataset. The SVM algorithm provided the highest accuracy on classifying Phishing URLs for the second dataset. Each dataset accuracy test was performed on a subset that contained 30% of original dataset observations. The first dataset was published in 2020 and used actual values that mostly involve symbol search in the URL. The second dataset was created in 2015 and has strict rules that determine the threshold value of each feature. The better results were reached using the second dataset however, most features from the dataset require additional information from external services that could fail to provide accurate information. In future research, more effective features (new, derived features from classical URML features) can be included for determining the most relevant signs of malicious URLs. 6. Acknowledgements I cannot express enough thanks to my lecture Agne Paulauskaite-Taraseviciene that consulted me in every step and helped me to overcome obstacles. Also, my project could not be completed without the Littelfuse Digital Innovation manager and IT security team that introduced me to this relevant problem and told me more about it. My heartful thanks. 7. References [1] Webroot, "2019 Webroot Threat Report," p. 24, 2019. [2] P. Langlois, "2020 Data Breach Investigations Report," p. 37, 2020. [3] D. Warburton, R. Pompon, "2019 Phishing and Fraud Report," p. 10, 2019. [4] P. N. Mangut, K. A. Datukun, "The Current Phishing Techniques – Perspective of the Nigerian Environment," World Journal of Innovative Research (WJIR), vol. 10, no. 1, pp. 34-44, 2021 . [5] H. Wechsler. V. Ramanathan, "Phishing detection and impersonated entity discovery using Conditional Random Field and Latent Dirichlet Allocation," 2013. [6] V. Marcinkevicius, P. Vaitkevicius, "Comparison of Classification Algorithms for Detection of Phishing Websites," Informatica, vol. 31, pp. 143-160, 2020. [7] W. Ali, "Phishing Website Detection based on Supervised Machine Learning with Wrapper Features Selection," International Journal of Advanced Computer Science and Applications,, vol. 8, no. 9, 2017. [8] A. A. Orunsol, "PERFORMANCE COMPARISON OF PREDICTIVE MODELS BASED ON REDUCED PHISHING FEATURE CORPUS," Anale. Seria Informatică, vol. 18, no. 2, 2020. [9] A. O. Adewumi, A. A. Akinyelu, "Classification of Phishing Email Using Random Forest Machine Learning Technique," Journal of Applied Mathematics, p. 6, 2014. [10] B. Outtaj, M. Zouina, "A novel lightweight URL phishing detection system using SVM and similarity index," 2017. [11] A. O. Adewumi, A. A. Akinyelu, "Classification of Phishing Email Using Random Forest Machine Learning Technique," vol. 2014 , 2014 . [12] H. Shirazi, "Initial Attempt: Fresh-Phish Framework," Fort Collins, Colorado, 2018. [13] F. Thabtah, L. McCluskey, R. M. Mohammad, "Intelligent rule‐based phishing websites classification," Springer-Verlag, 2014. [14] L. H. Ungar, A. I. Schein, "Active learning for logistic regression: an evaluation," Mach Learn, p. 235–265, 2007. [15] M. M. Darabi, M. I. Vahid Shahrivari, "Phishing Detection Using Machine Learning Techniques," vol. 10, 2020. [16] V. Anandkumar, "Malicious-URL Detection using Logistic Regression Technique," International Journal of Engineering Business Management, vol. 9, pp. 108-113, 2019. [17] M. N. Kumar, M. Sowjanya, G. Kumari, "FAKE WEBSITE DETECTION USING REGRESSION," International Journal of Advance Reasearch in Science and Engineering, vol. 6, no. 8, 2017. [18] G. A. Montazer, M. Imani, "Phishing Website Detection Using Weighted Feature Line Embedding," The ISC Int'l Journal of, vol. 9, no. 2, pp. 49-61, 2017. [19] M. F. Moens, J. C. Gomez, "Using Biased Discriminant Analysis for Email Filtering," 2010. [20] M. Imani, G. A. Montazer, "Email Spam Detection Using Linear Discriminant Analysis Based on Clustering," 2017. [21] M. Kula, J. Bohacik, "Webpage phishing detection with data mining," Journal of Information, Control and Management Systems, vol. 17, no. 2, 2019. [22] S. He, X. Y. Shi, B. Cui, "Malicious URL detection with feature extraction based on machine learning," International Journal of High Performance Computing and Networking , vol. 12, 2018. [23] Y. Zhou, H. Liu, N. Zhu, Shourong Hou, "Wavelet Support Vector Machine Algorithm in Power Analysis Attacks," Shanghai, China, 2017. [24] H. Mojeed, A. Balogun, A. N. Oluwatobi, V. Adeyemo, "Ensemble-Based Logistic Model Trees for Website Phishing Detection," in Advances in Cyber Security, Springer, 2021, pp. 627-641. [25] S. Eddie, W. Shou, "Critical Analysis of Current Research Aimed at Improving Detection of Phishing Attacks," Selected Computing Research Papers, vol. 9, 2020. [26] C. X. S. Nazir, A. Hafeez, S. Wan, S. Khan, "Deep Learning-Based Efficient Model Development for Phishing Detection Using Random Forest and BLSTM Classifiers," 2017. [27] I. Fister, J. V. Podgoreleca, GregaVrbančič, "Datasets for phishing websites detection," Journal Data in Brief, vol. 33, 2020. [28] M. B. Chaudhari, Purvi Pujara, "Phishing Website Detection using Machine Learning : A Review," International Journal of Scientific Research in Computer Science, vol. 3, no. 7.