Proactive Brand-Targeting Phishing Website Detection
using a Hybrid Feature-based Approach with Machine
Learning
Nadezda Demidova1, Philip Lawson1 and Jake Sloan1
1 EBAY (UK) LIMITED, 1 More London Place, London, United Kingdom, SE1 2AF
Abstract
Phishing and online scam sites are on the rise, and the sophistication of these attacks continues to
develop. Phishing websites exploit the target brand's identity, using its logo, website design, and
reputation to trick customers into divulging sensitive information like login credentials and financial
details. This, in turn, can cause financial losses, identity theft, and harm to the brand's reputation,
ultimately eroding customer trust. Notably, the number of reported phishing attacks has grown more
than five-fold in the last three years. Meanwhile, the number of brands attacked each month has
remained relatively consistent. This forces businesses into a highly reactive, defensive mode, unable to
get ahead of the problem, while exposing their customers and brand to abuse and financial loss.
Moreover, the longer it takes for a business to identify and respond to an attack, the greater the potential
damage to their reputation. To mitigate the impact of phishing attacks, businesses need to embrace
proactive measures, moving away from purely reactive strategies and toward addressing these threats as
close to the source of the attack as possible.
Detecting threats that are targeting customers outside of a brand's platform and infrastructure can be
challenging. The methods used for distributing phishing attacks are constantly evolving, with
cybercriminals targeting new victims and the latest generation of internet users. In addition to classic
email attacks, cybercriminals are now also using social networks and instant messaging platforms to
reach potential victims, making it difficult for brands to identify and respond to these threats.
While many techniques for combating phishing attempt to address the issue broadly, our approach is
focused specifically on brand protection and the abuse of brand assets no matter how a phishing website
was distributed to potential victims. We use a combination of features based on URL structure and
wording, DOM structure, HTML, and text content that provide agility and adaptability, allowing us to
more precisely detect a wider variety of brand-related phishing websites. These features enable
Machine Learning algorithms to capture semantics and create a comprehensive, high-accuracy model
capable of detecting phishing websites across multiple languages. Our approach delivers the proactive
detection of classical phishing websites and scam-pages targeting a brand across a range of different
scenarios and methods and can be easily adapted to suit the needs of any brand seeking to protect itself
and its customers from phishing threats.
Keywords
Phishing, Machine Learning, Cybersecurity, Phishing detection
1. Introduction

According to the Anti-Phishing Working Group (APWG), the number of reported phishing attacks has grown more than five-fold in the last three years [1]. Meanwhile, the number of brands attacked each month has remained relatively consistent.

Phishing attacks have become increasingly sophisticated, posing a significant threat to businesses and their customers. In the typical lifecycle of a phishing URL, cybercriminals first establish their infrastructure, leading to the creation of deceptive phishing pages. Subsequently, phishing campaigns are initiated, attracting traffic to these malicious URLs. During this process, third-party vendors might detect the phishing activities and notify the targeted brands, enabling them to act and add the relevant information to their phishing collection for further investigation.

The time lag between the initiation of a phishing campaign and its detection poses a critical challenge for businesses. Customers remain exposed to phishing infrastructure outside the brand's platform, leading to potential financial losses, identity theft, and damage to the brand's reputation. To address this issue, we
APWG.EU Technical Summit and Researchers Sync-Up 2023, Dublin, Ireland, June 21 & 22, 2023
EMAIL: nadi.demidova@gmail.com (N. Demidova); plawson03@qub.ac.uk (P. Lawson); jsloan@red-button.com (J. Sloan)
ORCID: 0009-0002-9775-2729 (N. Demidova); 0009-0003-3107-5523 (P. Lawson); 0009-0009-5356-7573 (J. Sloan)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
sought to develop a proactive approach that identifies phishing URLs and infrastructure earlier in the customer compromise cycle, effectively reducing the exposure time for our customers.

Our approach uses the concept of "Shift Left," emphasizing early identification of phishing assets. To achieve this, we created a custom Anti-Phishing Ecosystem tailored to the unique challenges faced by our brand. A custom solution allows us to leverage our in-depth brand knowledge and our understanding of our business workings, customers, and communication channels in the best possible way.

Machine Learning (ML) plays a central role in our custom solution. By harnessing ML capabilities, we gain a competitive advantage in staying ahead of evolving threats and detecting new zero-day attacks. ML allows us to be agile and adaptive, enabling swift responses to emerging phishing patterns. Additionally, we continue to leverage trusted external data sources and valuable insights from partners to strengthen our approach further.

In this paper, we present our hybrid feature-based approach with machine learning for proactive brand-targeting phishing website detection. Our custom solution focuses on brand protection and the incorporation of internal signals and data sources, providing a comprehensive and highly accurate model capable of detecting a wide variety of brand-related phishing websites across multiple languages and distribution methods.

By implementing our proactive approach, businesses can fortify their defenses, protect their customers from phishing attacks, and safeguard their brand reputation. As cybercriminals continually refine their strategies, the need for early identification and agile detection becomes paramount in the fight against phishing threats. Our research aims to contribute to the evolving field of cybersecurity, empowering businesses to take a proactive stance against brand-targeting phishing attacks.

The paper has the following structure: Section 2 provides an overview of related work in the field, laying the foundation for our contributions. Section 3 outlines the key elements of our methodology, presenting our approach, system overview, and data collection methodology. In Section 4, we delve into the critical process of feature engineering, detailing how we transform raw data into insights. Section 5 introduces the models that power our detection system. To assess model performance, Section 6 elaborates on the evaluation metrics we have chosen. Section 7 is dedicated to presenting our results and review. Finally, in Section 8, we conclude with remarks that summarize our findings and pave the way for future research endeavors.

2. Literature review

The domain of phishing detection has been a focal point in cybersecurity research, driven by the increasing sophistication of cybercriminal activities. Researchers have proposed various machine learning-based solutions to tackle this pervasive threat, each with unique approaches and features. In this section, we review a selection of pertinent studies that contribute to the advancement of phishing detection methodologies.

One study by Das Guptta, S., Shahriar, K.T., Alqahtani, H. et al. [2] advances hybrid feature-based phishing website detection. The authors leverage URL and hyperlink features for real-time accuracy, minimizing reliance on third-party systems. This addresses the challenge of new websites and zero-hour attacks.

Q. A. Al-Haija and A. A. Badawi [3] propose an efficient phishing website detection system focusing on URL patterns. Machine learning techniques, including neural networks and decision trees, classify authentic and phishing sites effectively.

Arun Kulkarni and Leonard L. Brown III [4] delve into machine learning classifiers such as decision trees, the Naive Bayesian classifier, SVM, and neural networks to distinguish real from fake websites. The classifiers are demonstrated on real-world URL datasets.

Additionally, A. Ghimire, A. Kumar Jha, S. Thapa, S. Mishra and A. Mani Jha [5] champion a machine learning-driven approach to detecting phishing URLs. Balanced datasets and varied algorithms reveal high precision, recall, and F-score potential.

S. Zaman, S. M. Uddin Deep, Z. Kawsar, M. Ashaduzzaman and A. I. Pritom [6] demonstrate the effectiveness of Naive Bayes, J48, and HNB classifiers in phishing detection. Innovative feature selection enhances accuracy.

Lastly, P. Yang, G. Zhao and P. Zeng [7] propose multidimensional feature-based phishing detection with deep learning. Character sequence features facilitate quick deep learning-based classification, complemented by URL statistics, webpage code, and text features.

In summary, the reviewed studies collectively contribute to the ongoing efforts in phishing detection using machine learning-based approaches. The variety of methodologies and feature sets underscores the need for adaptable and comprehensive solutions to counter the dynamic nature of phishing attacks.

This paper brings novelty by emphasizing brand-specific abuse, combining structural and textual features, and promoting the collection of compatible clean training samples for effective phishing detection.

3. Methods

3.1. Definitions and notations

Table 1
Definitions and notations
Term | Definition
URL | Address of a given unique resource on the Web
Phishing URL | Address of phishing content on the Web
Document Object Model (DOM) | It defines the logical structure of documents and the way a document is accessed and manipulated
Page content | Captured web page source when a given phishing URL is requested in a browser.
FQDN | Domain name that specifies its exact location in the tree hierarchy of the Domain Name System (DNS). It specifies all domain levels, including the top-level domain and the root zone.
TLD | Top-level domain
Subdomains | All domains to the left of the second-level domain
Path | The path refers to the exact location of a page, post, file, or other asset. It is often analogous to the underlying file structure of the website. The path resides after the hostname and is separated by "/" (forward slash).
Directories | Folder in a path (directory names separated by "/")
Parameters | Extra parameters provided to the Web server; they follow the "?" symbol.
Anchor | Represents a sort of "bookmark" inside the web resource.

3.2. Approach

There is a common approach that underlies the Customer Compromise Cycle (Figure 1) and basic off-platform anti-phishing strategy:
1. Cybercriminal infrastructure setup
2. Phishing page creation
3. Phishing campaign launch
4. As campaigns gain momentum, third-party vendors identify and share this information.
5. This prompts notifications; we add relevant data to our phishing collection and take necessary actions.

Figure 1: Customer Compromise Cycle

Our approach is designed to minimize our customers' exposure to off-platform phishing infrastructure and focuses on early identification of phishing assets. At its core, our solution integrates a machine learning model that plays a pivotal role in automating and scaling the phishing page detection process. This model utilizes a combination of features meticulously selected to ensure high accuracy in detection. A subset of these features revolves around the use of brand assets, aligning with our concentrated approach tailored for a specific brand. This synergy empowers our system to process large volumes of suspicious URLs from diverse sources, elevating our overall phishing detection efficacy.

The incorporation of machine learning allows us to proactively address evolving threats, including the detection of new zero-day attacks. This adaptability and agility are integral to staying ahead in the rapidly changing landscape of online security.

3.3. System overview

At a high level, our solution follows a streamlined workflow (Figure 2) to detect and mitigate phishing threats:
1. Data Collection: Our system actively collects URLs that exhibit suspicious characteristics from diverse sources. These sources encompass various avenues, including new domains, SSL Certificate stream data, our internal signals, and other repositories of potentially suspicious URLs.
2. Data Retrieval: From the gathered URLs, the system extracts the content of the web pages associated with these URLs.
3. Data Processing: Raw data is subjected to a comprehensive processing phase to derive meaningful data points that are conducive to effective phishing detection.
4. Feature Extraction: The system transforms the processed data points into a structured numeric representation.
5. Model Evaluation: Utilizing the numeric representation, our machine learning model takes over. It evaluates each sample and provides a verdict: whether the URL is indicative of phishing or not.
6. Action and Collection: If the model identifies a URL as phishing, we initiate an appropriate response.

This process has a feedback loop, as the insights gleaned from the collected data continuously contribute to the refinement and evolution of our machine learning model. This iterative approach ensures that our model remains adaptive to emerging trends and effectively addresses new challenges that may arise in the dynamic landscape of phishing threats.

Figure 2: High-level System Architecture
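
The six-step workflow of Section 3.3 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Sample`, `run_pipeline`, and `demo`, the callables passed in, and the two toy features are all hypothetical placeholders chosen only to show how the stages hand data to one another.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Sample:
    """One suspicious URL as it moves through the workflow (hypothetical type)."""
    url: str
    page_content: Optional[str] = None
    features: Optional[List[float]] = None
    is_phishing: Optional[bool] = None


def run_pipeline(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    extract_features: Callable[[str, str], List[float]],
    predict: Callable[[List[float]], bool],
    act: Callable[[Sample], None],
) -> List[Sample]:
    """Run steps 2-6 for URLs that step 1 (data collection) has gathered."""
    results = []
    for url in urls:
        sample = Sample(url=url)
        sample.page_content = fetch(url)              # 2. data retrieval
        if sample.page_content is None:
            continue                                  # page could not be retrieved
        # 3-4. data processing and feature extraction into a numeric vector
        sample.features = extract_features(url, sample.page_content)
        sample.is_phishing = predict(sample.features)  # 5. model verdict
        if sample.is_phishing:
            act(sample)                               # 6. action and collection
        results.append(sample)
    return results


def demo():
    """Toy run with stand-in stages; none of these are the paper's features."""
    pages = {
        "http://brand-login.example.test/verify": "<form><input type='password'></form>",
        "http://news.example.test/": "<p>hello</p>",
    }
    flagged: List[Sample] = []
    results = run_pipeline(
        urls=list(pages) + ["http://unreachable.example.test/"],
        fetch=pages.get,
        extract_features=lambda url, html: [float("brand" in url),
                                            float(html.count("<input"))],
        predict=lambda feats: sum(feats) >= 2.0,      # stand-in for the ML model
        act=flagged.append,
    )
    return results, flagged
```

Passing each stage in as a callable keeps the sketch independent of any particular crawler or classifier. In a real system, the samples collected in step 6 would also feed the retraining loop described above.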
3.4. Data collection

As our model makes its decisions based on features extracted from URLs and page content, this is what we needed to gather for our learning collection. To facilitate the training of our detection model, the acquisition of a substantial number of examples for each distinct targeted group was imperative.

Presently, our dataset comprises more than 62,000 samples. Within the "phishing" category, we encompass a broad spectrum of deceptive pages designed to exploit our customers, ranging from traditional phishing schemes that replicate login interfaces to fraudulent support pages employing vishing tactics.

The efficacy of machine learning hinges on the caliber of the data it learns from, coupled with the algorithms' capacity to assimilate it. The quality of data exerts a direct influence on the model's performance; it can exclusively glean insights from the data it is provided. As such, the data must meet the following criteria:
• Relevance
• Non-duplication
• Accurate labeling
• A combination of recent and historical data
• Representative of real-world production scenarios
• Sourced from diverse origins.

Mitigating data selection bias is also paramount. This bias manifests when the collected data inadequately encapsulates the full spectrum of possible information or information combinations that the model may encounter in practical scenarios.

For instance, consider the analogy of fruits and vegetables. In reality, these come in a myriad of colors. However, if data collection predominantly focuses on red fruits and green vegetables, it would introduce data selection bias.

To ensure diversity within each targeted group, strategic sampling is essential. For instance, if top-level domains (TLDs) are employed as features and phishing samples are available for each TLD, solely featuring ".com" samples in the clean dataset could predispose the model to label anything else as phishing. Similarly, the nature of web pages must be adequately represented; relying solely on top-ranked pages might not correlate well with our phishing dataset, which predominantly mimics login and registration forms. Therefore, a comprehensive collection is necessary to mirror analogous representations in the clean dataset.

Moreover, our solution leverages visible text prevalent in phishing pages, necessitating the accumulation of authentic instances that deploy similar terminologies without malicious intent. This includes legitimate pages from our customers who feature their businesses on our platform, personal websites, or articles about our brand. To bolster our clean dataset, we employed search engines with intelligent search queries to curate samples embodying diverse feature combinations. This strategic approach ensured that our clean dataset matched the multifaceted nature of the phishing collection we had amassed over time.

As a result, our model is equipped to discern nuanced patterns and characteristics in both phishing and legitimate content, enhancing its predictive accuracy in real-world scenarios.

For the evaluation of our model's performance, the dataset was divided into a training set and a test set in an 80:20 ratio, allowing us to assess its performance on previously unseen data.

4. Feature engineering

Feature engineering is a fundamental step in converting raw webpage data into numeric vectors that can be effectively utilized by machine learning algorithms for phishing detection. Since machine learning algorithms operate on numerical data, we need to find a suitable representation for each sample that provides valuable information to the model, enabling it to distinguish phishing instances effectively. In our solution, we adopt a hybrid feature-based approach, combining URL structure and wording, DOM structure, HTML, and text content to create numeric vectors for each webpage sample.

4.1. URL structure and rank features

The first type of features are URL-based features, such as the number of subdomains used, the count of path folders, and the domain's association with highly ranked domains based on the number of referring subnets.

This structural foundation, as depicted in Figure 3, forms the basis upon which our URL-based features are constructed.

Figure 3: Parts of a URL

As a result, the following features are extracted:
1. Does the root domain name rank within the top 1 million of widely recognized domains, based on the number of referring subnets?
2. Fully Qualified Domain Name (FQDN) rank, indicating whether the domain resides in the first 1,000, first 100,000, within the top 1 million, or outside this range.
3. Presence of brand-related keywords within the URL.
4. Count of path directories.
5. Count of subdomains.

4.2. Page content structure and links features

The second type of features we employ are based on the webpage's structure. Additionally, we analyse the links used on the page to identify any brand assets or links to the original brand logo. Since cybercriminals
often copy the original page, there is a high chance of finding traces left behind. Furthermore, we assess the DOM structure counts to determine the presence of forms and inputs on the page, contributing to effective phishing detection.

Figure 4: Snippet of page content with highlighted links

Figure 5: Snippet of page content with highlighted input, form, button elements

As a result, the following features are extracted:
1. Number of links in /

The id attribute specifies a unique identifier for an HTML element. The value of the id attribute is usually unique within the HTML document. The class attribute is often used to point to a class name in a style sheet. It can also be used by JavaScript to access and manipulate elements with the specific class name.

The construction of dictionaries adheres to the following process:
1. Compilation of unique term sets from each distinct document within the phishing segment of the training set.
2. Aggregation of these sets into a comprehensive list of terms.
3. Retention of the most frequently occurring terms through a counting mechanism.

For each sample within the dataset, we employ a count vectorization technique to align the extracted terms with the prepared dictionaries. This alignment is grounded in the frequency of occurrence exhibited by each token within the entire text of the respective sample.

To add further significance to the numeric vectors, we perform TF-IDF (term frequency-inverse document frequency). This statistical measure evaluates the relevance of a word to a document within a collection of documents. It considers both how frequently a word appears in a document and its inverse document frequency across the entire dataset:

$f_{t,d}$ (1)