Proactive Brand-Targeting Phishing Website Detection
                         using a Hybrid Feature-based Approach with Machine
                         Learning
                         Nadezda Demidova1, Philip Lawson1 and Jake Sloan1
                         1 EBAY (UK) LIMITED, 1 More London Place, London, United Kingdom, SE1 2AF


                                           Abstract
                                           Phishing and online scam sites are on the rise, and the sophistication of these attacks continues to
                                           develop. Phishing websites exploit the target brand's identity, using its logo, website design, and
                                           reputation to trick customers into divulging sensitive information like login credentials and financial
                                           details. This, in turn, can cause financial losses, identity theft, and harm to the brand's reputation,
                                           ultimately eroding customer trust. Notably, the number of reported phishing attacks has grown more
                                           than five-fold in the last three years. Meanwhile, the number of brands attacked each month has
                                           remained relatively consistent. This forces businesses into a highly reactive, defensive mode, unable to
                                           get ahead of the problem, while exposing their customers and brand to abuse and financial loss.
                                           Moreover, the longer it takes for a business to identify and respond to an attack, the greater the potential
                                           damage to their reputation. To mitigate the impact of phishing attacks, businesses need to embrace
                                           proactive measures, moving away from purely reactive strategies and towards addressing these threats as
                                           close to the source of the attack as possible.
                                           Detecting threats that are targeting customers outside of a brand's platform and infrastructure can be
                                           challenging. The methods used for distributing phishing attacks are constantly evolving, with
                                           cybercriminals targeting new victims and the latest generation of internet users. In addition to classic
                                           email attacks, cybercriminals are now also using social networks and instant messaging platforms to
                                           reach potential victims, making it difficult for brands to identify and respond to these threats.
                                           While many techniques for combating phishing attempt to address the issue broadly, our approach is
                                           focused specifically on brand protection and the abuse of brand assets no matter how a phishing website
                                           was distributed to potential victims. We use a combination of features based on URL structure and
                                           wording, DOM structure, HTML, and text content that provide agility and adaptability, allowing us to
                                           more precisely detect a wider variety of brand-related phishing websites. These features enable
                                           Machine Learning algorithms to capture semantics and create a comprehensive high accuracy model
                                           capable of detecting phishing websites across multiple languages. Our approach delivers the proactive
                                           detection of classical phishing websites and scam-pages targeting a brand across a range of different
                                           scenarios and methods and can be easily adapted to suit the needs of any brand seeking to protect itself
                                           and its customers from phishing threats.

                                           Keywords
                                           Phishing, Machine Learning, Cybersecurity, Phishing detection

                         1. Introduction                                                                              phishing pages. Subsequently, phishing campaigns are
                                                                                                                      initiated, attracting traffic to these malicious URLs.
                                                                                                                      During this process, third-party vendors might detect
                         According to the Anti-Phishing Working Group                                                 the phishing activities and notify the targeted brands,
                         (APWG), the number of reported phishing attacks has                                          enabling them to act and add the relevant information
                         grown more than five-fold in the last three years [1].                                       to their phishing collection for further investigation.
                         Meanwhile, the number of brands attacked each                                                    The time lag between the initiation of a phishing
                         month has remained relatively consistent.                                                    campaign and its detection poses a critical challenge
                             Phishing attacks have become increasingly                                                for businesses. Customers remain exposed to phishing
                         sophisticated, posing a significant threat to businesses                                     infrastructure outside the brand's platform, leading to
                         and their customers. In the typical lifecycle of a                                           potential financial losses, identity theft, and damage to
                         phishing URL, cybercriminals first establish their                                           the brand's reputation. To address this issue, we
                         infrastructure, leading to the creation of deceptive

                         APWG.EU Technical Summit and Researchers Sync-Up 2023, Dublin,
                         Ireland, June 21 & 22, 2023
                              nadi.demidova@gmail.com (N. Demidova);
                         plawson03@qub.ac.uk (P. Lawson); jsloan@red-button.com
                         (J. Sloan)
                            0009-0002-9775-2729 (N. Demidova); 0009-0003-3107-5523 (P.
                         Lawson); 0009-0009-5356-7573 (J. Sloan)
                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative
                                       Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
sought to develop a proactive approach that identifies     with unique approaches and features. In this section,
phishing URLs and infrastructure earlier in the            we review a selection of pertinent studies that
customer compromise cycle, effectively reducing the        contribute to the advancement of phishing detection
exposure time for our customers.                           methodologies.
    Our approach uses the concept of "Shift Left,"             One study by Das Guptta, S., Shahriar, K.T.,
emphasizing early identification of phishing assets. To    Alqahtani, H. et al. [2] advances hybrid feature-based
achieve this, we created a custom Anti-Phishing            phishing website detection. The authors leverage URL
Ecosystem tailored to the unique challenges faced by       and hyperlink features for real-time accuracy,
our brand. A custom solution allows us to leverage our     minimizing reliance on third-party systems. This
in-depth brand knowledge, the understanding of our         addresses the challenge of new websites and zero-
business workings, customers, and communication            hour attacks.
channels in the best possible way.                             Q. A. Al-Haija and A. A. Badawi [3] propose an
    Machine Learning (ML) plays a central role in our      efficient phishing website detection system focusing
custom solution. By harnessing ML capabilities, we         on URL patterns. Machine learning techniques,
gain a competitive advantage in staying ahead of           including neural networks and decision trees, classify
evolving threats and detecting new zero-day attacks.       authentic and phishing sites effectively.
ML allows us to be agile and adaptive, enabling swift          Arun Kulkarni and Leonard L. Brown III [4] delve into
responses to emerging phishing patterns. Additionally,     machine learning classifiers such as decision trees,
we continue to leverage trusted external data sources      Naive Bayesian classifiers, SVMs, and neural networks to
and draw on valuable insights from partners to             distinguish real from fake websites. Their performance
strengthen our approach further.                           is demonstrated on real-world URL datasets.
    In this paper, we present our hybrid feature-based         Additionally, A. Ghimire, A. Kumar Jha, S. Thapa, S.
approach with machine learning for proactive brand-        Mishra and A. Mani Jha [5] champion a machine
targeting phishing website detection. Our custom           learning-driven approach to detecting phishing URLs.
solution focuses on brand protection and the               Balanced datasets and varied algorithms reveal high
incorporation of internal signals and data sources,        precision, recall, and F-score potential.
providing a comprehensive and highly accurate model            S. Zaman, S. M. Uddin Deep, Z. Kawsar, M.
capable of detecting a wide variety of brand-related       Ashaduzzaman and A. I. Pritom [6] demonstrate the
phishing websites across multiple languages and            effectiveness of Naive Bayes, J48, and HNB classifiers
distribution methods.                                      in phishing detection. Innovative feature selection
    By implementing our proactive approach,                enhances accuracy.
businesses can fortify their defenses, protect their           Lastly, P. Yang, G. Zhao and P. Zeng [7] propose
customers from phishing attacks, and safeguard their       multidimensional feature-based phishing detection
brand reputation. As cybercriminals continually refine     with deep learning. Character sequence features
their strategies, the need for early identification and    facilitate quick deep learning-based classification,
agile detection becomes paramount in the fight against     complemented by URL statistics, webpage code, and
phishing threats. Our research aims to contribute to       text features.
the evolving field of cybersecurity, empowering                In summary, the reviewed studies collectively
businesses to take a proactive stance against brand-       contribute to the ongoing efforts in phishing detection
targeting phishing attacks.                                using machine learning-based approaches. The variety
    The paper has the following structure: Section 2       of methodologies and feature sets underscores the
provides an overview of related work in the field,         need for adaptable and comprehensive solutions to
laying the foundation for our contributions. Moving        counter the dynamic nature of phishing attacks.
forward, Section 3 outlines the key elements of our            This paper brings novelty by emphasizing brand-
methodology, presenting our approach, system               specific abuse, combining structural and textual
overview, and data collection methodology. In Section      features, and promoting the collection of compatible
4, we delve into the critical process of feature           clean training samples for effective phishing detection.
engineering, detailing how we transform raw data into
insights. Section 5 introduces the models that power
our detection system. To assess model performance,         3. Methods
Section 6 elaborates on the evaluation metrics we have
chosen. Section 7 is dedicated to presenting our results       3.1. Definitions and notations
and review. Finally, in Section 8, we conclude with
remarks that summarize our findings and pave the           Table 1
way for future research endeavors.                         Definitions and notations
                                                            Term               Definition
                                                            URL                Address of a given unique resource
                                                                               on the Web
2. Literature review                                        Phishing URL       Address of phishing content on
                                                                               the Web
The domain of phishing detection has been a focal           Document           It defines the logical structure of
point in cybersecurity research, driven by the              Object Model       documents and the way a
increasing sophistication of cybercriminal activities.      (DOM)              document is accessed and
Researchers have proposed various machine learning-                            manipulated
based solutions to tackle this pervasive threat, each
 Page content      Captured web page source when a         process. This model utilizes a combination of features
                   given phishing URL is requested in      meticulously selected to ensure high accuracy in
                   a browser.                              detection. A subset of these features revolves around
 FQDN              Domain name that specifies its          the use of brand assets, aligning with our concentrated
                   exact location in the tree hierarchy    approach tailored for a specific brand. This synergy
                   of the Domain Name System (DNS).        empowers our system to process large volumes of
                   It specifies all domain levels,         suspicious URLs from diverse sources, elevating our
                   including the top-level domain and      overall phishing detection efficacy.
                   the root zone.                              The incorporation of machine learning allows us to
 TLD               Top level domain                        proactively address evolving threats, including the
                                                           detection of new zero-day attacks. This adaptability
 Subdomains        All domains to the left of the          and agility are integral to staying ahead in the rapidly
                   second-level domain                     changing landscape of online security.
 Path              The path refers to the exact
                   location of a page, post, file, or
                   other asset. It is often analogous to       3.3. System overview
                   the underlying file structure of the
                   website. The path resides after the     At a high level, our solution follows a streamlined
                   hostname and is separated by “/”        workflow (Figure 2) to detect and mitigate phishing
                   (forward slash).                        threats:
 Directories       Folder in a path (directory names           1.    Data Collection: Our system actively collects
                   separated by "/")                       URLs that exhibit suspicious characteristics from
 Parameters        Extra parameters provided to the        diverse sources. These sources encompass various
                   Web server; they follow the "?"         avenues, including new domains, SSL Certificate
                   symbol.                                 stream data, our internal signals, and other
 Anchor            Represents a sort of "bookmark"         repositories of potentially suspicious URLs.
                   inside the web resource.                    2.    Data Retrieval: From the gathered URLs, the
                                                           system extracts the content of the web pages
                                                           associated with these URLs.
                                                               3.    Data Processing: Raw data is subjected to a
    3.2. Approach                                          comprehensive processing phase to derive meaningful
                                                           data points that are conducive to effective phishing
There is a common approach that underlies the              detection.
Customer Compromise Cycle (Figure 1) and basic off-            4.    Feature Extraction: The system transforms
platform anti-phishing strategy:                           the processed data points into a structured numeric
    1.   Cybercriminal infrastructure setup                representation.
    2.   Phishing page creation                                5.    Model Evaluation: Utilizing the numeric
    3.   Phishing campaign launch                          representation, our machine learning model takes
    4.   As campaigns gain momentum, third-party           over. It evaluates each sample and provides a verdict:
         vendors identify and share this information.      whether the URL is indicative of phishing or not.
    5.   These notifications prompt us to add relevant         6.    Action and Collection: If the model identifies
         data to our phishing collection and take          a URL as phishing, we initiate an appropriate response.
         necessary actions.                                    This process has a feedback loop, as the insights
                                                           gleaned from the collected data continuously
                                                           contribute to the refinement and evolution of our
                                                           machine learning model. This iterative approach
                                                           ensures that our model remains adaptive to emerging
                                                           trends and effectively addresses new challenges that
                                                           may arise in the dynamic landscape of phishing
                                                           threats.


Figure 1: Customer Compromise Cycle

    Our approach is designed to minimize our
customers' exposure to off-platform phishing
infrastructure and focuses on early identification of      Figure 2: High-level System Architecture
phishing assets. At its core, our solution integrates a
machine learning model that plays a pivotal role in
automating and scaling the phishing page detection
     3.4. Data collection                                        As a result, our model is equipped to discern
                                                              nuanced patterns and characteristics in both phishing
As our model makes its decisions based on features            and legitimate content, enhancing its predictive
extracted from URLs and page content, this is what we         accuracy in real-world scenarios.
needed to gather for our learning collection. To                 For the evaluation of our model's performance, the
facilitate the training of our detection model, the           dataset was divided into a training set and a test set in
acquisition of a substantial number of examples for           an 80:20 ratio, allowing us to assess its performance
each distinct targeted group was imperative.                  on previously unseen data.
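As an illustration of the 80:20 division, the following is a minimal sketch using only the Python standard library. It assumes labelled samples held in memory and stratifies by label so that both sets keep the same phishing-to-clean proportion. The stratification, the `split_80_20` name, the fixed seed, and the toy data are assumptions made for this example and are not details taken from our pipeline.

```python
import random
from collections import defaultdict


def split_80_20(samples, labels, test_ratio=0.2, seed=42):
    """Stratified split: each label contributes about test_ratio of its items to the test set."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


# Toy data: 10 "phishing" and 10 "clean" placeholder samples.
samples = [f"sample-{i}" for i in range(20)]
labels = ["phishing"] * 10 + ["clean"] * 10
train_set, test_set = split_80_20(samples, labels)
print(len(train_set), len(test_set))
```

Splitting before any dictionary construction or fitting step keeps information from the test samples out of the training features.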
Presently, our dataset comprises more than 62,000
samples. Within the "phishing" category, we
encompass a broad spectrum of deceptive pages
designed to exploit our customers, ranging from               4. Feature engineering
traditional phishing schemes that replicate login
interfaces to fraudulent support pages employing              Feature engineering is a fundamental step in
vishing tactics.                                              converting raw webpage data into numeric vectors
    The efficacy of machine learning hinges on the            that can be effectively utilized by machine learning
caliber of the data it learns from, coupled with the          algorithms for phishing detection. Since machine
algorithms' capacity to assimilate it. The quality of data    learning algorithms operate on numerical data, we
exerts a direct influence on the model's performance;         need to find a suitable representation for each sample
it can exclusively glean insights from the data it is         that provides valuable information to the model,
provided. As such, the data must meet the following           enabling it to distinguish phishing instances
criteria:                                                     effectively. In our solution, we adopt a hybrid feature-
    •      Relevance                                          based approach, combining URL structure and
    •      Non-duplication                                    wording, DOM structure, HTML, and text content to
    •      Accurate labeling                                  create numeric vectors for each webpage sample.
    •      A combination of recent and historical data
    •      Representative of real-world production                4.1. URL structure and rank features
           scenarios
    •      Sourced from diverse origins.
    Mitigating data selection bias is also paramount.         The first type of features are URL-based features, such
This bias manifests when the collected data                   as the number of subdomains used, the count of path
inadequately encapsulates the full spectrum of                folders, and the domain's association with highly
possible information or information combinations that         ranked domains based on the number of referring
the model may encounter in practical scenarios.               subnets.
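The URL-based features of this section can be sketched with the Python standard library as follows. This is an illustrative sketch under stated assumptions, not our production extractor: the keyword list `BRAND_KEYWORDS`, the rank table `DOMAIN_RANKS`, the bucket labels, and the function names are hypothetical. The root domain is approximated as the last two labels of the host name, which is a simplification; suffixes with several parts (for example ".co.uk") would need a Public Suffix List lookup.

```python
from urllib.parse import urlsplit

# Hypothetical inputs: a brand keyword list and a domain -> rank lookup table.
BRAND_KEYWORDS = ("examplebrand",)
DOMAIN_RANKS = {"example.com": 500}


def rank_bucket(rank):
    """Bucket a rank (None when unranked) into the ranges used for the FQDN rank feature."""
    if rank is None or rank > 1_000_000:
        return "outside_top_1m"
    if rank <= 1_000:
        return "top_1k"
    if rank <= 100_000:
        return "top_100k"
    return "top_1m"


def url_features(url, domain_ranks=DOMAIN_RANKS, keywords=BRAND_KEYWORDS):
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = [label for label in host.split(".") if label]
    # Simplification: the last two labels are taken as the root domain.
    root_domain = ".".join(labels[-2:])
    root_rank = domain_ranks.get(root_domain)
    # Path segments before the final one are counted as directories;
    # with a trailing slash every segment is a directory.
    segments = [segment for segment in parts.path.split("/") if segment]
    if parts.path.endswith("/"):
        directory_count = len(segments)
    else:
        directory_count = max(len(segments) - 1, 0)
    return {
        "root_in_top_1m": root_rank is not None and root_rank <= 1_000_000,
        "fqdn_rank_bucket": rank_bucket(domain_ranks.get(host)),
        "has_brand_keyword": any(keyword in url.lower() for keyword in keywords),
        "directory_count": directory_count,
        "subdomain_count": max(len(labels) - 2, 0),
    }


features = url_features(
    "http://login.examplebrand.secure-check.example.com/account/verify/index.php?id=1"
)
print(features)
```

In this toy URL the brand keyword sits in a subdomain of an unrelated root domain, so the keyword flag and the subdomain count both carry signal even though the root domain itself is highly ranked.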
    For instance, consider the analogy of fruits and              This structural foundation, as depicted in Figure 3,
vegetables. In reality, these come in a myriad of colors.     forms the basis upon which our URL-based features
However, if data collection predominantly focuses on          are constructed.
red fruits and green vegetables, it would introduce
data selection bias.
    To ensure diversity within each targeted group,
strategic sampling is essential. For instance, if top-level
domains (TLDs) are employed as features and
phishing samples are available for each TLD, solely           Figure 3: Parts of a URL
featuring ".com" samples in the clean dataset could
predispose the model to label anything else as                   As a result, the following features are extracted:
phishing. Similarly, the nature of web pages must be             1.    Does the root domain name rank within the
adequately represented; relying solely on top ranked                   top 1 million of widely recognized domains,
pages might not correlate well with our phishing                       based on the number of referring subnets?
dataset, which predominantly mimics login and                    2.    Fully Qualified Domain Name (FQDN) rank –
registration forms. Therefore, a comprehensive                         indicating whether the domain resides in the
collection is necessary to mirror analogous                            first 1000, first 100000, within the top 1
representations in the clean dataset.                                  million, or outside this range.
    Moreover, our solution leverages visible text                3.    Presence of brand-related keywords within
prevalent in phishing pages, necessitating the                         the URL.
accumulation of authentic instances that deploy                  4.    Count of path directories.
similar terminologies without malicious intent. This             5.    Count of subdomains.
includes legitimate pages from our customers who
feature their businesses on our platform, personal                4.2. Page content structure and links features
websites, or articles about our brand. To bolster our
clean dataset, we employed search engines with
intelligent search queries to curate samples
embodying diverse feature combinations. This                  The second type of features we employ are based on
strategic approach ensured that our clean dataset             the webpage's structure. Additionally, we analyse the
matched the multifaceted nature of the phishing               links used on the page to identify any brand assets or
collection we had amassed over time.                          links to the original brand logo. Since cybercriminals
often copy the original page, there is a high chance of       The id attribute specifies a unique identifier for an
finding traces left behind. Furthermore, we assess the    HTML element. The value of the id attribute is usually
DOM structure counts to determine the presence of         unique within the HTML document. The class attribute
forms and inputs on the page, contributing to effective   is often used to point to a class name in a style sheet. It
phishing detection.                                       can also be used by JavaScript to access and
                                                          manipulate elements with the specific class name.
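To make the structural counts and the attribute extraction concrete, the sketch below walks a page with Python's built-in `html.parser` module. It counts forms, inputs, buttons, and link-bearing tags, records form methods, and collects id, class, and form name values as terms. The class name `StructureCollector`, the set of counted tags, and the sample page are assumptions for illustration; the classification of links into brand assets, brand keywords, and non-brand links is not covered here.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureCollector(HTMLParser):
    """Counts selected tags and collects id, class, and form name terms."""

    COUNTED_TAGS = {"form", "input", "button", "a", "img", "link", "script"}

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.form_methods = []
        self.terms = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in self.COUNTED_TAGS:
            self.tag_counts[tag] += 1
        if tag == "form":
            # HTML forms default to GET when no method is given.
            self.form_methods.append((attributes.get("method") or "get").lower())
            if attributes.get("name"):
                self.terms.append(attributes["name"])
        if attributes.get("id"):
            self.terms.append(attributes["id"])
        # A class attribute may hold several space-separated class names.
        self.terms.extend((attributes.get("class") or "").split())


page = """
<html><body>
  <form id="signin-form" name="login" method="POST" action="/submit">
    <input id="userid" class="field required" type="text">
    <input id="pass" class="field" type="password">
    <button class="btn btn-primary" type="submit">Sign in</button>
  </form>
  <a href="/help">Help</a>
</body></html>
"""
collector = StructureCollector()
collector.feed(page)
print(dict(collector.tag_counts), collector.form_methods, collector.terms)
```

The collected `terms` list is the kind of per-document input that a dictionary of frequent phishing terms can then be matched against.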
                                                          The construction of dictionaries adheres to the
                                                          following process:
                                                              1.    Compilation of unique term sets from each
                                                                    distinct document within the phishing
                                                                    segment of the training set.
                                                              2.    Aggregation of these sets into a
                                                                    comprehensive list of terms.
Figure 4: Snippet of page content with highlighted            3.    Retention of the most frequently occurring
links                                                               terms through a counting mechanism.
                                                              For each sample within the dataset, we employ a
                                                          count vectorization technique to align the extracted
                                                          terms with the prepared dictionaries. This alignment
                                                          is grounded in the frequency of occurrence exhibited
                                                          by each token within the entire text of the respective
                                                          sample.
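The dictionary construction and count vectorization described above can be sketched as follows. The dictionary size, the alphabetical tie-breaking, the function names, and the toy documents are assumptions for this example; the source text does not specify them.

```python
from collections import Counter


def build_dictionary(phishing_documents, size):
    """Keep the `size` terms that occur in the largest number of phishing documents."""
    document_frequency = Counter()
    for terms in phishing_documents:
        # Step 1: reduce each document to its set of unique terms.
        document_frequency.update(set(terms))
    # Steps 2-3: aggregate the sets and retain the most frequent terms;
    # ties are broken alphabetically to keep the result deterministic.
    ranked = sorted(document_frequency.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:size]]


def count_vector(terms, dictionary):
    """Count how often each dictionary term occurs in one sample."""
    counts = Counter(terms)
    return [counts[term] for term in dictionary]


# Toy phishing documents, each given as a list of id/class/form name terms.
phishing_training_docs = [
    ["login", "field", "field", "btn"],
    ["login", "btn", "captcha"],
    ["login", "field", "footer"],
]
dictionary = build_dictionary(phishing_training_docs, size=3)
vector = count_vector(["field", "field", "login", "nav"], dictionary)
print(dictionary, vector)
```

Terms that are absent from the dictionary, such as "nav" in the sample above, are ignored, so every sample maps to a vector of the same length.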
                                                              To add further significance to the numeric vectors,
                                                          we perform TF-IDF (term frequency-inverse
                                                          document frequency). This statistical measure
                                                          evaluates the relevance of a word to a document within
                                                          a collection of documents. It considers both how
Figure 5: Snippet of page content with highlighted        frequently a word appears in a document and its
input, form, button elements                              inverse document frequency across the entire dataset:

   As a result, the following features are extracted:     $tf(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$,    (1)
   1.    Number of links in <link>/<script>/<img>/<a>
         tags (links to brand assets, links with brand     $idf(t, D) = \log \frac{1 + N}{1 + |\{d \in D : t \in d\}|}$,    (2)
         keywords, non-brand related).
   2.    Number of inputs.                                 $tfidf(t, d, D) = tf(t, d) \cdot idf(t, D)$,    (3)
   3.    Number of forms.
   4.    Number of buttons.                                where $f_{t,d}$ is the raw count of a term in a document, i.e.,
   5.    Form methods used (the attribute specifies       the number of times that term t occurs in document d.
         how to send form-data).
   6.    Use of original brand logo.
                                                          Note the denominator is simply the total number of
    4.3. Tag names features                               terms in document d (counting each occurrence of the
                                                          same term separately). N is the total number of documents
The third type of features revolves around unique ID      in the corpus, N = |D|, and |{d∈D:t∈d}| is the number of
names of HTML elements, class names, and form             documents in which the term t appears.
names. We extract them from the HTML and map them to
dictionaries of the most frequent terms from phishing         By applying TF-IDF, we emphasize the importance
pages. By doing so, we create a linkage between these     of each term in the context of phishing detection.
HTML element identifiers and common phishing                  Even when focusing solely on the most frequently
patterns, enhancing the model's capability to identify    occurring phishing terms to map textual information,
suspicious content.                                       the resultant array of variables remains substantial. In
                                                          addressing this, and with the dual aim of distilling
                                                          valuable insights while mitigating overfitting, we
                                                          employ Principal Component Analysis (PCA). This
                                                          technique serves to condense the dimensions of our
                                                          data vectors, effectively retaining the maximal
                                                          information within more compact representations.
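Equations (1) to (3) can be checked numerically with a short standard-library sketch. The natural logarithm is assumed for the log in equation (2), and the function names and toy corpus are illustrative only. The guard for empty documents is an addition for robustness and is not part of the equations.

```python
import math
from collections import Counter


def tf(term, document):
    """Equation (1): raw count of the term divided by the total number of terms."""
    if not document:
        return 0.0  # guard added for empty documents
    return Counter(document)[term] / len(document)


def idf(term, corpus):
    """Equation (2): smoothed inverse document frequency over the corpus D."""
    containing = sum(1 for document in corpus if term in document)
    return math.log((1 + len(corpus)) / (1 + containing))


def tfidf(term, document, corpus):
    """Equation (3): product of term frequency and inverse document frequency."""
    return tf(term, document) * idf(term, corpus)


corpus = [
    ["login", "field", "field", "btn"],
    ["login", "btn", "captcha"],
    ["login", "field", "footer"],
]
print(tfidf("login", corpus[0], corpus))    # term in every document: idf = log(4/4) = 0
print(tfidf("field", corpus[0], corpus))    # (2/4) * log(4/3)
print(tfidf("captcha", corpus[1], corpus))  # (1/3) * log(4/2)
```

A term that appears in every document, such as "login" here, receives a weight of zero under equation (2), while a rarer term such as "captcha" is weighted up.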


Figure 6: Snippet of page content with highlighted element attributes

Figure 7: First 3 Principal Components on unseen data (red – Brand-targeting Phishing, blue – “clean” samples)

Figure 8: First 3 Principal Components on unseen data (red – Brand-targeting Phishing, blue – “clean” samples)
Figure 7 provides a visualization of the first three principal components of the test data, showing the distribution of ID values and class names. These test samples were first mapped onto dictionaries that were constructed from the training data. Following this mapping, we applied PCA to obtain the visualization depicted in the figure, which illustrates the separation of phishing and clean samples achieved with these features.

As we have the capacity to translate all text into English, we have incorporated features that indicate the original language group of the text. This capability enables us to detect phishing attempts aimed at our global customer base within a unified model. Where such capabilities are unavailable, we recommend creating separate models for each language or excluding the visible page text as a feature. However, the URL text can still be considered for use.
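
The PCA step behind Figures 7 and 8 can be sketched as follows: the projection is fitted on training vectors only and then applied to unseen samples. This is a minimal, dependency-free illustration that recovers just the first principal component by power iteration; the function names and the toy data are hypothetical, and a library implementation that retains several components would normally be preferable.

```python
def fit_first_component(rows, n_iter=200):
    """Return (mean, direction) of the first principal component of `rows`."""
    n, dim = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix (dim x dim)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration converges to the eigenvector with the largest
    # eigenvalue, provided the start vector is not orthogonal to it.
    v = [1.0] * dim
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # no variance along the current vector: stop early
            break
        v = [x / norm for x in w]
    return mean, v

def project(row, mean, direction):
    """Coordinate of `row` along the fitted component."""
    return sum((x - m) * d for x, m, d in zip(row, mean, direction))

# Toy training vectors lying on the line y = 2x (hypothetical data)
train = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
mean, direction = fit_first_component(train)

# An unseen sample is centered with the training mean before projection
unseen_score = project([4.0, 8.0], mean, direction)
```

Reusing the training mean and direction for unseen samples mirrors the text above, where test samples are mapped onto dictionaries built from the training data before PCA is applied.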
     4.4. Visible text features

Our fourth set of features involves visible text obtained from the webpage, which we categorize into four parts: Title, URL (treated as text), Body (entire visible text), and Footer. To make the text more informative, we map it onto dictionaries containing the most frequent terms and phrases extracted from known phishing pages. This step enables the model to recognize key indicators of phishing attempts, such as the presence of "login" or "register" in the Title or "copyrights" in the Footer.

Before converting text into a numeric representation, we perform text pre-processing, which puts all text on an equal footing. It involves the following steps:
    1.    Translation to English
    2.    Removing non-ASCII characters
    3.    Conversion to lowercase
    4.    Removing punctuation
    5.    Removing numbers
    6.    Removing extra spaces

5. Classifiers

We compared the performance of different classifiers using varying combinations of feature sets to enhance phishing detection accuracy. The selected classifiers are Logistic Regression, Random Forest, and XGBoost, each offering distinct advantages.

Logistic Regression: This classic algorithm suits straightforward tasks with linear relationships between features and outcomes, providing an interpretable baseline for comparison.

Random Forest: By aggregating the outputs of multiple decision trees, Random Forest effectively captures intricate feature interactions and reduces overfitting.

XGBoost: Known for its predictive power, it constructs an ensemble of weak learners and iteratively improves their performance, accommodating various data types and complex patterns.
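
Steps 2 to 6 of the pre-processing list in Section 4.4 can be sketched with the Python standard library as below. Translation (step 1) depends on an external service and is assumed to have been applied already. The function name is hypothetical, and replacing punctuation and digits with spaces (rather than deleting them outright) is an assumption made here so that adjacent words do not merge.

```python
import re
import string

def preprocess(text):
    """Apply steps 2-6 to text that is assumed to be in English already."""
    # 2. Remove non-ASCII characters
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # 3. Convert to lowercase
    text = text.lower()
    # 4. Remove punctuation (replaced by spaces so words do not merge)
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # 5. Remove numbers
    text = re.sub(r"\d+", " ", text)
    # 6. Remove extra spaces
    return re.sub(r"\s+", " ", text).strip()

cleaned = preprocess("  Sign-in to your account… Verify NOW!!! Case #12345 ")
```

Here `cleaned` is the normalized string that would then be mapped onto the term dictionaries.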

Subsequently, the processed text is converted into a numeric format using pre-established dictionaries containing the most prevalent terms or tokens, derived from the phishing data within the training set. This process involves the application of both count vectorization, which captures token frequency across the entire text of each sample, and the TF-IDF statistical technique.

6. Evaluation metrics

Balanced accuracy is better suited to imbalanced data than plain accuracy. It accounts for both the positive and negative outcome classes, so a dominant class cannot mask poor performance on the other.
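
To illustrate the point on hypothetical counts, the sketch below contrasts plain accuracy with balanced accuracy, taken as the mean of the true-positive rate TP / (TP + FN) and the true-negative rate TN / (TN + FP), which is the quantity given in Equation (4). The function names and the counts are assumptions for illustration, with phishing treated as the positive class.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of the true-positive rate and the true-negative rate."""
    tpr = tp / (tp + fn)  # share of positive (phishing) samples detected
    tnr = tn / (tn + fp)  # share of negative (clean) samples kept clean
    return 0.5 * (tpr + tnr)

def accuracy(tp, fn, tn, fp):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)

# Hypothetical imbalanced sample: 10 phishing pages, 990 clean pages,
# and a degenerate model that labels every page as clean.
tp, fn, tn, fp = 0, 10, 990, 0
plain = accuracy(tp, fn, tn, fp)
balanced = balanced_accuracy(tp, fn, tn, fp)
```

With these counts plain accuracy is 0.99 even though no phishing page is detected, while balanced accuracy is 0.5, the value expected from a model that carries no information about the positive class.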
$$BA = \frac{1}{2} \cdot \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right), \tag{4}$$

                                                                 Where:
                                                                 TP – true positive (the correctly predicted positive
                                                             class outcome of the model),
                                                                 TN – true negative (the correctly predicted
                                                             negative class outcome of the model),
                                                                 FP – false positive (the incorrectly predicted
                                                             positive class outcome of the model),
    FN – false negative (the incorrectly predicted negative class outcome of the model).

7. Results

In this section, we evaluate the performance of three classifiers across five distinct feature sets. Our evaluation is based on cross-validation, a technique where the training data is split into multiple folds (we use five folds), with each fold serving as validation data in rotation. During each iteration, the model is trained on four folds and validated on the fifth, and this process is repeated five times so that every fold is used for validation once.

The five feature sets are as follows:

Features1: URL-Based Features. This set includes features related to URL structure, domain ranking, and the URL as text.

Features2: URL and page content-based countable features, together with features derived from page structure and brand assets analysis.

Features3: Tag Names Features. Features extracted from unique HTML element IDs, class names, and form names.

Features4: Visible Text Features. Features derived from visible text across different parts of the webpage, including Title, URL, Body, and Footer.

Combined: A comprehensive set comprising all the features from the previous sets.

Table 2
Comparison of different classifiers and feature combinations. Mean balanced accuracy, with standard deviation in parentheses.

Feature set   LR               RF               XGBoost
Features1     0.7082 (0.007)   0.8306 (0.007)   0.8362 (0.008)
Features2     0.8804 (0.005)   0.9799 (0.001)   0.9766 (0.001)
Features3     0.9439 (0.006)   0.9849 (0.002)   0.9851 (0.002)
Features4     0.9705 (0.005)   0.9803 (0.003)   0.9898 (0.003)
Combined      0.9857 (0.002)   0.9897 (0.002)   0.9941 (0.001)

The performance of different classifiers and feature sets was evaluated using cross-validation on the training data. XGBoost achieved the highest mean balanced accuracy (0.9941) when using the combined feature set. When applied to the test data, the XGBoost model with the combined features achieved an accuracy of 99.8342.

To identify the key drivers of accurate predictions, we examined the top 20 most influential features:

   1.    Class names. PC #0 (first principal component)
   2.    Form names. PC #0
   3.    ID values. PC #0
   4.    Title. PC #25
   5.    Class names. PC #21
   6.    # non-brand related <a>
   7.    Class names. PC #4
   8.    Title. PC #17
   9.    Title. PC #6
   10.   Title. PC #16
   11.   URL contains brand keyword
   12.   Body text. PC #13
   13.   ID values. PC #3
   14.   Class names. PC #18
   15.   Domain rank not in 1m
   16.   # subdomains
   17.   ID values. PC #2
   18.   # brand’s assets <img>
   19.   Title. PC #13
   20.   Title. PC #0

8. Conclusion and further work

In this study, we presented a comprehensive approach for detecting phishing pages that target our brand's customers. By leveraging a hybrid feature-based approach, encompassing URL structure, HTML elements, text content, and brand-specific signals, we developed a robust detection model. Through rigorous evaluation, we demonstrated the effectiveness of our approach in accurately identifying phishing attempts.

While our approach shows promising results, there are opportunities for further enhancement and exploration. We plan to explore integration with social media phishing detection, develop better strategies to counter cloaking and filtering techniques, optimize takedown processes, and leverage the potential of Large Language Models. These endeavors aim to reinforce our brand's cybersecurity measures and protect our customers from evolving threats.

Acknowledgements

We extend our gratitude to the Anti-Phishing Working Group (APWG) for their valuable insights and resources that contributed to the success of our research. Special thanks to our colleagues at eBay for their continuous support and collaboration throughout this project. We also acknowledge the contributions of Marc Green and the broader cybersecurity community for their discussions and feedback, which enriched our understanding and approach.

References

[1]   Anti-Phishing Working Group (APWG). Phishing
      activity trends report. 3rd Quarter 2022. URL:
      https://docs.apwg.org/reports/apwg_trends_report_q3_2022.pdf
[2]   Das Guptta, S., Shahriar, K.T., Alqahtani, H. et al.
      Modeling Hybrid Feature-Based Phishing
      Websites Detection Using Machine Learning
      Techniques. Ann. Data. Sci. (2022).
      https://doi.org/10.1007/s40745-022-00379-8
[3]   Q. A. Al-Haija and A. A. Badawi, "URL-based
      Phishing Websites Detection via Machine
      Learning," 2021 International Conference on
      Data Analytics for Business and Industry
      (ICDABI), Sakheer, Bahrain, 2021, pp. 644-649,
      doi: 10.1109/ICDABI53623.2021.9655851.
[4]   Arun Kulkarni, Leonard L. Brown III, "Phishing
      Websites Detection using Machine Learning,"
      International Journal of Advanced Computer
      Science and Applications, vol. 10, no. 7, 2019, pp. 8. URL:
      https://thesai.org/Downloads/Volume10No7/Paper_2-Phishing_Websites_Detection_using_Machine_Learning.pdf
[5]   A. Ghimire, A. Kumar Jha, S. Thapa, S. Mishra and
      A. Mani Jha, "Machine Learning Approach Based
      on Hybrid Features for Detection of Phishing
      URLs," 2021 11th International Conference on
      Cloud Computing, Data Science & Engineering
      (Confluence), Noida, India, 2021, pp. 954-959,
      doi: 10.1109/Confluence51648.2021.9377113.
[6]   S. Zaman, S. M. Uddin Deep, Z. Kawsar, M.
      Ashaduzzaman and A. I. Pritom, "Phishing
      Website Detection Using Effective Classifiers and
      Feature Selection Techniques," 2019 2nd
      International Conference on Innovation in
      Engineering and Technology (ICIET), Dhaka,
      Bangladesh, 2019, pp. 1-6,
      doi: 10.1109/ICIET48527.2019.9290554.
[7]   P. Yang, G. Zhao and P. Zeng, "Phishing Website
      Detection Based on Multidimensional Features
      Driven by Deep Learning," in IEEE Access, vol. 7,
      pp. 15196-15209, 2019,
      doi: 10.1109/ACCESS.2019.2892066.