Detecting Phishing Websites by using Neural Network
Models
Dominika Zurawska
Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, POLAND


Abstract
This article addresses the problem of classifying domains that may be phishing, using parameters and
information extracted from sample pages. The tests use several machine learning classification models
taken from open libraries in the selected programming language. The methods are implemented in a simple
way, only to test the selected models and compare them on standard metrics. The models selected for the
tests are neural networks, decision tree, SVM, logistic regression and random forest. Their effectiveness
was tested in order to select the best option for phishing detection.

Keywords
neural network, classification, phishing, security domain


ICYRIME 2021 @ International Conference of Yearly Reports on Informatics Mathematics and Engineering,
online, July 9, 2021
domizur257@student.polsl.pl (D. Zurawska)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
Dominika Zurawska, CEUR Workshop Proceedings, 45–50

1. Introduction

Machine learning methods have become very popular in recent years [1, 2, 3, 4]. In the development of
IT, many applications use such methods to improve their operation in important respects. Several
applications of neural networks in image processing are presented in [5], [6], and [7]. The models
presented in [8, 9] show that neural networks are very good extractors of potentially dangerous
situations on the internet. Tests on classifiers for IoT environments show that both neural networks
and fuzzy systems are well suited to such applications [10, 11].
   Phishing attacks attempt to gain sensitive, confidential information such as usernames, passwords,
credit card information, network credentials and more [12]. By posing as a legitimate individual or
institution via phone or email, cyber attackers use social engineering to manipulate victims into
performing specific actions, such as clicking on a malicious link or attachment or willfully divulging
confidential information. Both individuals and organizations are at risk; almost any kind of personal
or organizational data can be valuable, whether it be to commit fraud or to access an organization’s
network. In addition, some phishing scams can target organizational data in order to support espionage
efforts or state-backed spying on opposition groups. Very interesting comments on this topic can be
found in the online resources at https://www.antiphishing.org/resources/apwg-reports/.
   To properly classify our domains, we decided to check and compare different classifiers to see
whether there are any significant differences between the results and which one is best suited to this
problem. We also check whether we can extract the features of a given domain that are most important
in the classification of phishing.

2. Phishing Websites Features

In this project, we shed light on the important features that have proved to be sound and effective in
predicting phishing websites. We classified our domains based on features such as: having IP Address,
URL Length, Shortening Service, having At Symbol, double slash redirecting, Prefix Suffix, having Sub
Domain, SSLfinal State, Domain registration length, Favicon, port, HTTPS token, Request URL, URL of
Anchor, Links in tags, SFH, Submitting to email, Abnormal URL, Redirect, on mouseover, RightClick, pop
up window, Iframe, age of a domain, DNSRecord, web traffic, Page Rank, Google Index, Links pointing to
the page, Statistical report, Result.

3. Main decision parameters

The features that matter the most in the context of phishing website detection are described below.

3.1. SSL final State

The Subject Common Name of the certificate has to match the hostname of the phishing site that
returned it. Some sites will return the hosting company’s certificate when requested over HTTPS. As
most modern browsers display warnings when a non-matching certificate is encountered, such certificates
only serve to make the user more suspicious instead of increasing the perceived security of the site.

3.2. URL of Anchor

An anchor is an element defined by the <a> tag. This feature is treated exactly as “Request URL”.
However, for
this feature we examine:
    1. If the <a> tags and the website have different domain names. This is similar to the Request URL
       feature.
    2. If the anchor does not link to any webpage, e.g.:
           a) <a href="#">
           b) <a href="#content">
           c) <a href="#skip">
           d) <a href="JavaScript::void(0)">

Rule:
% of URL Of Anchor < 31% → Legitimate
% of URL Of Anchor ≥ 31% and ≤ 67% → Suspicious
Otherwise → Phishing

3.3. Links in tags

Given that our investigation covers all angles likely to be used in the webpage source code, we find
that it is common for legitimate websites to use <Meta> tags to offer metadata about the HTML document;
<Script> tags to create a client side script; and <Link> tags to retrieve other web resources.

3.4. Prefix Suffix

The dash symbol is rarely used in legitimate URLs. Phishers tend to add prefixes or suffixes separated
by (-) to the domain name so that users feel that they are dealing with a legitimate webpage. For
example http://www.Confirme-paypal.com.

Rule:
Domain Name Part Includes (-) Symbol → Phishing
Otherwise → Legitimate

3.5. Web traffic

This feature measures the popularity of the website by determining the number of visitors and the
number of pages they visit. However, since phishing websites live for a short period of time, they may
not be recognized by the Alexa database. A website ranked among the top 100,000 is classified as
“Legitimate”. If the domain has no traffic or is not recognized by the Alexa database, it is classified
as “Phishing”. Otherwise, it is classified as “Suspicious”.

Rule:
Website Rank < 100,000 → Legitimate
Website Rank > 100,000 → Suspicious
Otherwise → Phishing

3.6. Links pointing to page

The number of links pointing to the webpage indicates its legitimacy level, even if some links are of
the same domain. Because of their short life span, 98% of the phishing items in our datasets have no
links pointing to them. On the other hand, legitimate websites have at least 2 external links pointing
to them.

Rule:
Number of Links Pointing to the Webpage = 0 → Phishing
Number of Links Pointing to the Webpage > 0 and ≤ 2 → Suspicious
Otherwise → Legitimate

4. Classifications Algorithms

In this work some selected models were tested. The presented results come from open libraries that
were available for student tests in online services. Data for the experiments were collected from
https://archive.ics.uci.edu/ml/datasets.php.

4.1. Logistic regression

Logistic regression develops the concept of a perceptron by using a nonlinear activation function and
updating the weights with the logistic regression cost function. In the experiments I used the model
from https://machinelearningmastery.com/logistic-regression-for-machine-learning/. This model can be
extended with regularization to prevent excessive variance. With the sigmoid activation function the
model returns the probability of a class; in our case we use the tanh activation function because it
returns values from -1 to 1, which is exactly how the classes are represented in our dataset.

4.2. SVM

SVM is very similar to logistic regression, but it determines the decision boundary differently: it
looks for the boundary whose distance to the samples of the different classes is as large as possible.
The applied model is from https://paperswithcode.com/method/svm. This algorithm can also control
variance with the help of the C parameter (regularization), and it can handle linearly non-separable
classes with the help of kernel functions, which increase the number of dimensions so that a separating
hyperplane can be found.

4.3. Decision tree
The next classifier is the decision tree [13]. It creates a decision boundary by asking questions about
the data; the answers assign the data to successive branches of the tree (there may be many branches,
but in practice each node is usually divided into two sub-trees). In theory, such a tree can split the
data until there is only one class in each leaf, which means that the classifier will be over-trained
and will not cope with new data. To prevent such high variance, the tree is pruned at a given height.
In our case, the tree will work well because the features in our data set are binary values that
represent answers to questions about domain metadata. The model of this section is from
https://www.geeksforgeeks.org/decision-tree-implementation-python/.
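
The idea described above, asking questions about feature values and limiting the height of the tree,
can be sketched in a few lines of plain Python. This is only a minimal illustration under stated
assumptions, not the implementation used in the experiments: the function names (gini, best_split,
build_tree, predict), the Gini impurity criterion, the three feature columns and the toy samples are
hypothetical, and the encoding 1 = legitimate, 0 = suspicious, -1 = phishing is assumed.

```python
from collections import Counter

# Hypothetical toy samples: (SSLfinal_State, URL_of_Anchor, Prefix_Suffix).
# Assumed encoding: 1 = legitimate, 0 = suspicious, -1 = phishing.
ROWS = [(1, 1, 1), (1, 0, 1), (1, -1, 1), (-1, 1, -1),
        (-1, -1, -1), (0, -1, 1), (0, 1, 1), (-1, 0, 1)]
LABELS = [1, 1, 1, -1, -1, -1, 1, -1]


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) question that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        # Candidate question: "is feature f <= t?" (the largest value is skipped
        # so that both answers always receive at least one sample).
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, max_depth):
    """Grow a binary tree; max_depth is the height limit that curbs over-training."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
            build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1))


def predict(tree, row):
    """Follow the questions from the root until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


if __name__ == "__main__":
    for depth in (1, 3):
        tree = build_tree(ROWS, LABELS, max_depth=depth)
        hits = sum(predict(tree, r) == y for r, y in zip(ROWS, LABELS))
        print(f"max_depth={depth}: {hits}/{len(ROWS)} training samples correct")
```

A larger max_depth lets the tree fit the training samples more closely, which is exactly the
over-training risk discussed above; the height limit trades some training accuracy for lower variance.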

4.4. Random forest

Random forest uses many decision trees and averages their results. Thanks to this we can use very tall
trees, and thus trees with a large variance (fitted too closely to the training data), because after
averaging with other trees this ceases to be a problem and the model is efficient and accurate.

4.5. Neural network

The multi-layer neural network is a model with a layer of input neurons whose number equals the number
of features of our data, hidden layers of neurons (50 in our case) and output neurons whose number
equals the number of our classes.

5. Accuracy measure parameter

After building all classifiers, it turns out that all models have proven themselves. However, they had
different running times, and unfortunately for some of them real-time learning is not possible. On the
other hand, a decision tree, for example, lets us visualize the decision-making process and check which
features the decision is based on.
   Formulas for determining the parameters for model evaluation:

TP - true positive (the phishing sample is classified as phishing)
FN - false negative (the phishing sample is classified as non-phishing)
FP - false positive (the non-phishing sample is classified as phishing)
TN - true negative (the non-phishing sample is classified as non-phishing)

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}$$

6. Results of tests

Our own implementation of logistic regression obtained a decent accuracy of 0.9, while the error matrix
shows that the model has a tendency to falsely classify websites as phishing: Accuracy: 0.9, Precision:
0.87, Recall: 0.974, F1: 0.919.
   The SVM model with a complementary variable and regularization scored very high: Accuracy: 0.97,
Precision: 0.97, Recall: 0.99, F1: 0.98.
   The decision tree obtained very good results at a depth of about 15. Increasing the depth further
does not improve the results much, and at very large depths the accuracy decreased, which results from
the previously discussed excessive variance: Accuracy: 0.964, Precision: 0.966, Recall: 0.969, F1:
0.968.
   An example tree with a depth of 3 is presented so that it is relatively clear; such a tree also
achieves a satisfactory result of about 93%.
   The random forest achieves practically the same result as a properly trimmed decision tree:
Accuracy: 0.965, Precision: 0.966, Recall: 0.969, F1: 0.968.
   The multi-layer neural network obtained a very high result, as one might expect: Accuracy: 0.975,
Precision: 0.971, Recall: 0.984, F1: 0.977.

Figure 1: Results of SVM classification on input data.
Figure 2: Decision tree diagram created on the basis of our classifier; it can be used to see how the
decision-making process progresses. A precise description of the most important parameters according to
which the decision was made is given in the fifth paragraph.
Figure 3: Sample error matrix of classification on input data. This pattern is used in all further
figures.
Figure 4: Sample error matrix of logistic classification on input data.
Figure 5: Sample error matrix of SVM classification on input data.
Figure 6: Sample error matrix of decision tree classification on input data.
Figure 7: Sample error matrix of random forest classification on input data.
Figure 8: Sample error matrix of neural network classification on input data.

7. Conclusion

Most classifiers work well in predicting whether a given domain is a phishing attack, and the
effectiveness of prediction is at a very high level. Thanks to algorithms such as the decision tree, we
are able to extract from the model the features that are most important for recognizing this type of
attack, which makes it possible to defend against it more effectively. The learned model can also be
used in user protection programs. As the models obtained very similar results, the choice of the one
appropriate to our needs should be based on an assessment of efficiency, flexibility for learning with
new data and transparency of operation.
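
The evaluation measures defined in Section 5, on which the comparison of the models relies, can be
computed directly from the four error-matrix counts. The following is a minimal sketch rather than the
code used in the experiments: the function name classification_metrics, the accuracy formula (the
standard (TP + TN) / total, which the text reports but does not define), the zero-denominator
convention and the example counts are all assumptions made for illustration.

```python
def _ratio(numerator, denominator):
    """Divide safely; an empty denominator is reported as 0.0 (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fn, fp, tn):
    """Return accuracy, precision, recall and F1 for error-matrix counts.

    tp: phishing samples classified as phishing
    fn: phishing samples classified as non-phishing
    fp: non-phishing samples classified as phishing
    tn: non-phishing samples classified as non-phishing
    """
    precision = _ratio(tp, tp + fp)                                # formula (1)
    recall = _ratio(tp, tp + fn)                                   # formula (2)
    f1 = _ratio(2 * precision * recall, precision + recall)        # formula (3)
    accuracy = _ratio(tp + tn, tp + fn + fp + tn)                  # standard definition
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts, not taken from the experiments in this paper.
    for name, value in classification_metrics(tp=90, fn=10, fp=30, tn=70).items():
        print(f"{name}: {value:.3f}")
```

A classifier that tends to label legitimate sites as phishing produces many false positives, which
lowers precision while leaving recall high; this is the pattern a precision below recall indicates.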


References

 [1] G. Cardarilli, L. Nunzio, R. Fazzolari, D. Giardino, M. Matta, M. Patetta, M. Re, S. Spanò,
     Approximated computing for low power neural networks, Telkomnika (Telecommunication Computing
     Electronics and Control) 17 (2019) 1236–1241.
 [2] G. Capizzi, G. Lo Sciuto, C. Napoli, M. Woźniak, G. Susi, A spiking neural network-based long-term
     prediction system for biogas production, Neural Networks 129 (2020) 271–279.
     doi:10.1016/j.neunet.2020.06.001.
 [3] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, A multithread nested neural network
     architecture to model surface plasmon polaritons propagation, Micromachines 7 (2016).
 [4] G. Capizzi, C. Napoli, L. Paternò, An innovative hybrid neuro-wavelet method for reconstruction of
     missing data in astronomical photometric surveys, Lecture Notes in Computer Science (including
     subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7267 LNAI
     (2012) 21–29.
 [5] R. Brociek, G. De Magistris, F. Cardia, F. Coppa, S. Russo, Contagion prevention of covid-19 by
     means of touch detection for retail stores, volume 3092, 2021, p. 89–94.
 [6] D. Połap, M. Woźniak, Meta-heuristic as manager in federated learning approaches for image
     processing purposes, Applied Soft Computing 113 (2021) 107872.
 [7] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition
     algorithm for individual protection applications, volume 2768, 2020, p. 41–45.
 [8] M. Wozniak, J. Silka, M. Wieczorek, M. Alrashoud, Recurrent neural network model for iot and
     networking malware threat detection, IEEE Transactions on Industrial Informatics 17 (2021)
     5583–5594.
 [9] X. Liu, S. Chen, L. Song, M. Woźniak, S. Liu, Self-attention negative feedback network for
     real-time image super-resolution, Journal of King Saud University-Computer and Information
     Sciences (2021).
[10] M. Woźniak, M. Wieczorek, J. Siłka, D. Połap, Body pose prediction based on motion sensor data and
     recurrent neural network, IEEE Transactions on Industrial Informatics 17 (2020) 2101–2111.
[11] M. Woźniak, A. Zielonka, A. Sikora, M. J. Piran, A. Alamri, 6g-enabled iot home environment
     control using fuzzy rules, IEEE Internet of Things Journal 8 (2020) 5442–5452.
[12] M. Silic, A. Back, The dark side of social networking sites: Understanding phishing risks,
     Computers in Human Behavior 60 (2016) 35–43.
[13] S. Russo, S. Illari, R. Avanzato, C. Napoli, Reducing the psychological burden of isolated
     oncological patients by means of decision trees, volume 2768, 2020, p. 46–53.