Detecting Phishing Websites by using Neural Network Models
Dominika Zurawska¹
¹ Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, POLAND
Abstract
This article presents the problem of classifying domains that may be phishing sites, using parameters and information extracted from sample pages. The tests use various machine learning classification models taken from open libraries in the selected programming language. The methods are implemented in a simple way, just to test the selected models and compare them on standard metrics. For the tests I selected neural networks, decision tree, SVM, logistic regression and random forest. I tested their effectiveness in order to select the best option for phishing detection.
Keywords
neural network, classification, phishing, security domain
1. Introduction
Machine learning methods have become very popular in recent years [1, 2, 3, 4]. In the development of IT, many applications use such methods to improve their operation in some important aspects. In [5], [6], and [7] there are several applications of neural networks in image processing. The models presented in [8, 9] show that neural networks are very good extractors of potentially dangerous situations on the internet. Tests on classifiers for IoT environments show that both neural networks and fuzzy systems are well suited to such applications [10, 11].
Phishing attacks attempt to gain sensitive, confidential information such as usernames, passwords, credit card information, network credentials and more [12]. By posing as a legitimate individual or institution via phone or email, cyber attackers use social engineering to manipulate victims into performing specific actions, such as clicking on a malicious link or attachment or willfully divulging confidential information. Both individuals and organizations are at risk; almost any kind of personal or organizational data can be valuable, whether it be to commit fraud or access an organization’s network. In addition, some phishing scams can target organizational data in order to support espionage efforts or state-backed spying on opposition groups. Interesting commentary on these attacks can be found in the online resources at https://www.antiphishing.org/resources/apwg-reports/.
To properly classify our domains, we decided to check and compare different classifiers to see whether there are any significant differences between the results and which one is best suited to this problem. We also check whether we can extract the features of a given domain that are most important in the classification of phishing.
2. Phishing Websites Features
In this project, we shed light on the important features that have proved to be sound and effective in predicting phishing websites. We classified our domains based on features such as: having IP Address, URL Length, Shortening Service, having At Symbol, double slash redirecting, Prefix Suffix, having Sub Domain, SSLfinal State, Domain registration length, Favicon, port, HTTPS token, Request URL, URL of Anchor, Links in tags, SFH, Submitting to email, Abnormal URL, Redirect, on mouseover, RightClick, pop up window, Iframe, age of a domain, DNSRecord, web traffic, Page Rank, Google Index, Links pointing to the page, Statistical report, Result.
3. Main decision parameters
This section describes the features that matter the most in the context of phishing website detection.
3.1. SSL final State
The Subject Common Name of the certificate has to match the hostname of the phishing site that returned it. Some sites will return the hosting company’s certificate when requested over HTTPS. As most modern browsers display warnings when a non-matching certificate is encountered, such certificates only serve to make the user more suspicious instead of increasing the perceived security of the site.
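The certificate check described in Section 3.1 can be sketched as follows. This is only an illustration, not the feature extractor used for the dataset: the function names are hypothetical, the certificate is assumed to be a dict in the layout returned by Python's ssl.SSLSocket.getpeercert(), and checking subjectAltName entries before the Subject Common Name is an added assumption reflecting common browser behaviour.

```python
def _matches(pattern: str, hostname: str) -> bool:
    """Compare one certificate name against a hostname (case-insensitive).

    Only a single leading "*." wildcard label is supported in this sketch.
    """
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        # "*.example.com" matches "www.example.com", but not "example.com"
        # and not "a.b.example.com".
        head, _, tail = hostname.partition(".")
        return bool(head) and tail == pattern[2:]
    return pattern == hostname


def certificate_matches_host(cert: dict, hostname: str) -> bool:
    """Return True if any DNS name in the certificate matches the hostname.

    `cert` is assumed to follow the dict layout of ssl.SSLSocket.getpeercert().
    """
    names = [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]
    if not names:
        # Fall back to the Subject Common Name when no SAN entries exist.
        for rdn in cert.get("subject", ()):
            for key, value in rdn:
                if key == "commonName":
                    names.append(value)
    return any(_matches(name, hostname) for name in names)
```

A site that returns its hosting company's certificate would fail this check, which corresponds to the browser warning case described above.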
ICYRIME 2021 @ International Conference of Yearly Reports on Informatics Mathematics and Engineering, online, July 9, 2021
domizur257@student.polsl.pl (D. Zurawska)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073, pp. 45–50
3.2. URL of Anchor
An anchor is an element defined by the <a> tag. This feature is treated exactly as “Request URL”. However, for
this feature we examine:
1. If the <a> tags and the website have different domain names. This is similar to the Request URL feature.
2. If the anchor does not link to any webpage, e.g.:
a)
b)
c)
d)
Rule:
% of URL Of Anchor < 31% → Legitimate
% of URL Of Anchor ≥ 31% and ≤ 67% → Suspicious
Otherwise → Phishing
3.6. Links pointing to page
The number of links pointing to the webpage indicates its legitimacy level, even if some links are of the same domain. In our datasets, because phishing pages have a short life span, we find that 98% of phishing dataset items have no links pointing to them. On the other hand, legitimate websites have at least 2 external links pointing to them.
Rule:
Number of links pointing to the webpage = 0 → Phishing
Number of links pointing to the webpage > 0 and ≤ 2 → Suspicious
Otherwise → Legitimate
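The two threshold rules (URL of Anchor and links pointing to the page) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the extraction code behind the dataset: all function and class names are hypothetical, and treating empty, "#"-prefixed and "javascript:" hrefs as anchors that "do not link to any webpage" is an assumption made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")


def percent_suspect_anchors(html: str, page_host: str) -> float:
    """Percentage of anchors that point to another domain or to no page."""
    collector = AnchorCollector()
    collector.feed(html)
    if not collector.hrefs:
        return 0.0
    suspect = 0
    for href in collector.hrefs:
        host = urlparse(href).hostname
        # Assumption: these href forms count as "not linking to any webpage".
        is_empty_link = (
            href == "" or href.startswith("#") or href.lower().startswith("javascript:")
        )
        is_foreign = host is not None and host != page_host
        if is_empty_link or is_foreign:
            suspect += 1
    return 100.0 * suspect / len(collector.hrefs)


def classify_anchor_share(percent: float) -> str:
    """URL of Anchor rule: < 31% legitimate, 31% to 67% suspicious, else phishing."""
    if percent < 31:
        return "Legitimate"
    if percent <= 67:
        return "Suspicious"
    return "Phishing"


def classify_inbound_links(link_count: int) -> str:
    """Links pointing to page rule: 0 phishing, 1 or 2 suspicious, else legitimate."""
    if link_count == 0:
        return "Phishing"
    if link_count <= 2:
        return "Suspicious"
    return "Legitimate"
```

Keeping the percentage computation separate from the threshold functions makes it possible to adjust the cut-off values without touching the HTML parsing.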
4. Classifications Algorithms
In this work some selected models were tested. The presented results are from open libraries that were available for student tests in online services.
3.3. Links in tags
Given that our investigation covers all angles likely to be used in the webpage source code, we find that it is common for legitimate websites to use <meta> tags to offer metadata about the HTML document;