Proactive Brand-Targeting Phishing Website Detection using a Hybrid Feature-based Approach with Machine Learning

Nadezda Demidova¹, Philip Lawson¹ and Jake Sloan¹

¹ EBAY (UK) LIMITED, 1 More London Place, London, United Kingdom, SE1 2AF

Abstract

Phishing and online scam sites are on the rise, and the sophistication of these attacks continues to develop. Phishing websites exploit the target brand's identity, using its logo, website design, and reputation to trick customers into divulging sensitive information like login credentials and financial details. This, in turn, can cause financial losses, identity theft, and harm to the brand's reputation, ultimately eroding customer trust.

Notably, the number of reported phishing attacks has grown more than five-fold in the last three years. Meanwhile, the number of brands attacked each month has remained relatively consistent. This forces businesses into a highly reactive, defensive mode, unable to get ahead of the problem, while exposing their customers and brand to abuse and financial loss. Moreover, the longer it takes for a business to identify and respond to an attack, the greater the potential damage to its reputation. To mitigate the impact of phishing attacks, businesses need to embrace proactive measures, moving away from purely responsive strategies and toward addressing these threats as close to the source of the attack as possible.

Detecting threats that are targeting customers outside of a brand's platform and infrastructure can be challenging. The methods used for distributing phishing attacks are constantly evolving, with cybercriminals targeting new victims and the latest generation of internet users. In addition to classic email attacks, cybercriminals are now also using social networks and instant messaging platforms to reach potential victims, making it difficult for brands to identify and respond to these threats.
While many techniques for combating phishing attempt to address the issue broadly, our approach is focused specifically on brand protection and the abuse of brand assets, no matter how a phishing website was distributed to potential victims. We use a combination of features based on URL structure and wording, DOM structure, HTML, and text content that provide agility and adaptability, allowing us to more precisely detect a wider variety of brand-related phishing websites. These features enable Machine Learning algorithms to capture semantics and create a comprehensive, high-accuracy model capable of detecting phishing websites across multiple languages. Our approach delivers the proactive detection of classical phishing websites and scam pages targeting a brand across a range of different scenarios and methods, and can be easily adapted to suit the needs of any brand seeking to protect itself and its customers from phishing threats.

Keywords

Phishing, Machine Learning, Cybersecurity, Phishing detection

APWG.EU Technical Summit and Researchers Sync-Up 2023, Dublin, Ireland, June 21 & 22, 2023
Email: nadi.demidova@gmail.com (N. Demidova); plawson03@qub.ac.uk (P. Lawson); jsloan@red-button.com (J. Sloan)
ORCID: 0009-0002-9775-2729 (N. Demidova); 0009-0003-3107-5523 (P. Lawson); 0009-0009-5356-7573 (J. Sloan)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

According to the Anti-Phishing Working Group (APWG), the number of reported phishing attacks has grown more than five-fold in the last three years [1]. Meanwhile, the number of brands attacked each month has remained relatively consistent.

Phishing attacks have become increasingly sophisticated, posing a significant threat to businesses and their customers. In the typical lifecycle of a phishing URL, cybercriminals first establish their infrastructure, leading to the creation of deceptive phishing pages. Subsequently, phishing campaigns are initiated, attracting traffic to these malicious URLs. During this process, third-party vendors might detect the phishing activities and notify the targeted brands, enabling them to act and add the relevant information to their phishing collection for further investigation.

The time lag between the initiation of a phishing campaign and its detection poses a critical challenge for businesses. Customers remain exposed to phishing infrastructure outside the brand's platform, leading to potential financial losses, identity theft, and damage to the brand's reputation. To address this issue, we sought to develop a proactive approach that identifies phishing URLs and infrastructure earlier in the customer compromise cycle, effectively reducing the exposure time for our customers.

Our approach uses the concept of "Shift Left," emphasizing early identification of phishing assets. To achieve this, we created a custom Anti-Phishing Ecosystem tailored to the unique challenges faced by our brand. A custom solution allows us to leverage our in-depth brand knowledge and our understanding of our business workings, customers, and communication channels in the best possible way.

Machine Learning (ML) plays a central role in our custom solution. By harnessing ML capabilities, we gain a competitive advantage in staying ahead of evolving threats and detecting new zero-day attacks. ML allows us to be agile and adaptive, enabling swift responses to emerging phishing patterns. Additionally, we continue to leverage trusted external data sources and draw on valuable insights from partners to strengthen our approach further.

In this paper, we present our hybrid feature-based approach with machine learning for proactive brand-targeting phishing website detection. Our custom solution focuses on brand protection and the incorporation of internal signals and data sources, providing a comprehensive and highly accurate model capable of detecting a wide variety of brand-related phishing websites across multiple languages and distribution methods.

By implementing our proactive approach, businesses can fortify their defenses, protect their customers from phishing attacks, and safeguard their brand reputation. As cybercriminals continually refine their strategies, the need for early identification and agile detection becomes paramount in the fight against phishing threats. Our research aims to contribute to the evolving field of cybersecurity, empowering businesses to take a proactive stance against brand-targeting phishing attacks.

The paper has the following structure: Section 2 provides an overview of related work in the field, laying the foundation for our contributions. Section 3 outlines the key elements of our methodology, presenting our approach, system overview, and data collection methodology. In Section 4, we delve into the critical process of feature engineering, detailing how we transform raw data into insights. Section 5 introduces the models that power our detection system. To assess model performance, Section 6 elaborates on the evaluation metrics we have chosen. Section 7 is dedicated to presenting our results and review. Finally, in Section 8, we conclude with remarks that summarize our findings and pave the way for future research endeavors.

2. Literature review

The domain of phishing detection has been a focal point in cybersecurity research, driven by the increasing sophistication of cybercriminal activities. Researchers have proposed various machine learning-based solutions to tackle this pervasive threat, each with unique approaches and features. In this section, we review a selection of pertinent studies that contribute to the advancement of phishing detection methodologies.

One study by Das Guptta, S., Shahriar, K.T., Alqahtani, H. et al. [2] advances hybrid feature-based phishing website detection. The authors leverage URL and hyperlink features for real-time accuracy, minimizing reliance on third-party systems. This addresses the challenge of new websites and zero-hour attacks.

Q. A. Al-Haija and A. A. Badawi [3] propose an efficient phishing website detection system focusing on URL patterns. Machine learning techniques, including neural networks and decision trees, classify authentic and phishing sites effectively.

Arun Kulkarni and Leonard L. Brown III [4] delve into machine learning classifiers such as decision trees, the Naive Bayesian classifier, SVM, and neural networks to distinguish real from fake websites. Real-world URL datasets exhibit their prowess.

Additionally, A. Ghimire, A. Kumar Jha, S. Thapa, S. Mishra and A. Mani Jha [5] champion a machine learning-driven approach to detecting phishing URLs. Balanced datasets and varied algorithms reveal high precision, recall, and F-score potential.

S. Zaman, S. M. Uddin Deep, Z. Kawsar, M. Ashaduzzaman and A. I. Pritom [6] demonstrate the effectiveness of Naive Bayes, J48, and HNB classifiers in phishing detection. Innovative feature selection enhances accuracy.

Lastly, P. Yang, G. Zhao and P. Zeng [7] propose multidimensional feature-based phishing detection with deep learning. Character sequence features facilitate quick deep learning-based classification, complemented by URL statistics, webpage code, and text features.

In summary, the reviewed studies collectively contribute to the ongoing efforts in phishing detection using machine learning-based approaches. The variety of methodologies and feature sets underscores the need for adaptable and comprehensive solutions to counter the dynamic nature of phishing attacks.

This paper brings novelty by emphasizing brand-specific abuse, combining structural and textual features, and promoting the collection of compatible clean training samples for effective phishing detection.

3. Methods

3.1. Definitions and notations

Table 1
Definitions and notations

| Term | Definition |
|---|---|
| URL | Address of a given unique resource on the Web. |
| Phishing URL | Address of phishing content on the Web. |
| Document Object Model (DOM) | Defines the logical structure of documents and the way a document is accessed and manipulated. |
| Page content | Web page source captured when a given phishing URL is requested in a browser. |
| FQDN | Domain name that specifies its exact location in the tree hierarchy of the Domain Name System (DNS). It specifies all domain levels, including the top-level domain and the root zone. |
| TLD | Top-level domain. |
| Subdomains | All domains to the left of the second-level domain. |
| Path | The exact location of a page, post, file, or other asset. It is often analogous to the underlying file structure of the website. The path resides after the hostname and is separated by "/" (forward slash). |
| Directories | Folders in a path (directory names separated by "/"). |
| Parameters | Extra parameters provided to the Web server; they follow the "?" symbol. |
| Anchor | Represents a sort of "bookmark" inside the web resource. |

3.2. Approach

There is a common approach that underlies the Customer Compromise Cycle (Figure 1) and basic off-platform anti-phishing strategy:

1. Cybercriminal infrastructure setup.
2. Phishing page creation.
3. Phishing campaign launch.
4. As campaigns gain momentum, third-party vendors identify and share this information.
5. This prompts notifications; we add relevant data to our phishing collection and take necessary actions.

Figure 1: Customer Compromise Cycle

Our approach is designed to minimize our customers' exposure to off-platform phishing infrastructure and focuses on early identification of phishing assets. At its core, our solution integrates a machine learning model that plays a pivotal role in automating and scaling the phishing page detection process. This model utilizes a combination of features meticulously selected to ensure high accuracy in detection. A subset of these features revolves around the use of brand assets, aligning with our concentrated approach tailored for a specific brand. This synergy empowers our system to process large volumes of suspicious URLs from diverse sources, elevating our overall phishing detection efficacy.

The incorporation of machine learning allows us to proactively address evolving threats, including the detection of new zero-day attacks. This adaptability and agility are integral to staying ahead in the rapidly changing landscape of online security.

3.3. System overview

At a high level, our solution follows a streamlined workflow (Figure 2) to detect and mitigate phishing threats:

1. Data Collection: Our system actively collects URLs that exhibit suspicious characteristics from diverse sources. These sources encompass various avenues, including new domains, SSL Certificate stream data, our internal signals, and other repositories of potentially suspicious URLs.
2. Data Retrieval: From the gathered URLs, the system extracts the content of the web pages associated with these URLs.
3. Data Processing: Raw data is subjected to a comprehensive processing phase to derive meaningful data points that are conducive to effective phishing detection.
4. Feature Extraction: The system transforms the processed data points into a structured numeric representation.
5. Model Evaluation: Utilizing the numeric representation, our machine learning model takes over. It evaluates each sample and provides a verdict: whether the URL is indicative of phishing or not.
6. Action and Collection: If the model identifies a URL as phishing, we initiate an appropriate response.

This process has a feedback loop, as the insights gleaned from the collected data continuously contribute to the refinement and evolution of our machine learning model. This iterative approach ensures that our model remains adaptive to emerging trends and effectively addresses new challenges that may arise in the dynamic landscape of phishing threats.

Figure 2: High-level System Architecture

3.4. Data collection

As our model makes its decisions based on features extracted from URLs and page content, this is what we needed to gather for our learning collection. To facilitate the training of our detection model, the acquisition of a substantial number of examples for each distinct targeted group was imperative.

Presently, our dataset comprises more than 62,000 samples. Within the "phishing" category, we encompass a broad spectrum of deceptive pages designed to exploit our customers, ranging from traditional phishing schemes that replicate login interfaces to fraudulent support pages employing vishing tactics.

The efficacy of machine learning hinges on the caliber of the data it learns from, coupled with the algorithms' capacity to assimilate it. The quality of data exerts a direct influence on the model's performance; it can exclusively glean insights from the data it is provided. As such, the data must meet the following criteria:

• Relevance
• Non-duplication
• Accurate labeling
• A combination of recent and historical data
• Representative of real-world production scenarios
• Sourced from diverse origins.

Mitigating data selection bias is also paramount. This bias manifests when the collected data inadequately encapsulates the full spectrum of possible information or information combinations that the model may encounter in practical scenarios.

For instance, consider the analogy of fruits and vegetables. In reality, these come in a myriad of colors. However, if data collection predominantly focuses on red fruits and green vegetables, it would introduce data selection bias.

To ensure diversity within each targeted group, strategic sampling is essential. For instance, if top-level domains (TLDs) are employed as features and phishing samples are available for each TLD, solely featuring ".com" samples in the clean dataset could predispose the model to label anything else as phishing. Similarly, the nature of web pages must be adequately represented; relying solely on top-ranked pages might not correlate well with our phishing dataset, which predominantly mimics login and registration forms. Therefore, a comprehensive collection is necessary to mirror analogous representations in the clean dataset.

Moreover, our solution leverages visible text prevalent in phishing pages, necessitating the accumulation of authentic instances that deploy similar terminologies without malicious intent. This includes legitimate pages from our customers who feature their businesses on our platform, personal websites, or articles about our brand. To bolster our clean dataset, we employed search engines with intelligent search queries to curate samples embodying diverse feature combinations. This strategic approach ensured that our clean dataset matched the multifaceted nature of the phishing collection we had amassed over time.

As a result, our model is equipped to discern nuanced patterns and characteristics in both phishing and legitimate content, enhancing its predictive accuracy in real-world scenarios.

For the evaluation of our model's performance, the dataset was divided into a training set and a test set in an 80:20 ratio, allowing us to assess its performance on previously unseen data.

4. Feature engineering

Feature engineering is a fundamental step in converting raw webpage data into numeric vectors that can be effectively utilized by machine learning algorithms for phishing detection. Since machine learning algorithms operate on numerical data, we need to find a suitable representation for each sample that provides valuable information to the model, enabling it to distinguish phishing instances effectively. In our solution, we adopt a hybrid feature-based approach, combining URL structure and wording, DOM structure, HTML, and text content to create numeric vectors for each webpage sample.

4.1. URL structure and rank features

The first type of features is URL-based, such as the number of subdomains used, the count of path folders, and the domain's association with highly ranked domains based on the number of referring subnets.

This structural foundation, as depicted in Figure 3, forms the basis upon which our URL-based features are constructed.

Figure 3: Parts of a URL

As a result, the following features are extracted:

1. Does the root domain name rank within the top 1 million of widely recognized domains, based on the number of referring subnets?
2. Fully Qualified Domain Name (FQDN) rank: indicating whether the domain resides in the first 1000, first 100000, within the top 1 million, or outside this range.
3. Presence of brand-related keywords within the URL.
4. Count of path directories.
5. Count of subdomains.

4.2. Page content structure and links features

The second type of features we employ is based on the webpage's structure. Additionally, we analyse the links used on the page to identify any brand assets or links to the original brand logo. Since cybercriminals often copy the original page, there is a high chance of finding traces left behind. Furthermore, we assess the DOM structure counts to determine the presence of forms and inputs on the page, contributing to effective phishing detection.

Figure 4: Snippet of page content with highlighted links

Figure 5: Snippet of page content with highlighted input, form, button elements

As a result, the following features are extracted:

1. Number of links in /

The id attribute specifies a unique identifier for an HTML element. The value of the id attribute is usually unique within the HTML document. The class attribute is often used to point to a class name in a style sheet. It can also be used by JavaScript to access and manipulate elements with the specific class name.

The construction of dictionaries adheres to the following process:

1. Compilation of unique term sets from each distinct document within the phishing segment of the training set.
2. Aggregation of these sets into a comprehensive list of terms.
3. Retention of the most frequently occurring terms through a counting mechanism.

For each sample within the dataset, we employ a count vectorization technique to align the extracted terms with the prepared dictionaries. This alignment is grounded in the frequency of occurrence exhibited by each token within the entire text of the respective sample.

To add further significance to the numeric vectors, we perform TF-IDF (term frequency-inverse document frequency). This statistical measure evaluates the relevance of a word to a document within a collection of documents. It considers both how frequently a word appears in a document and its inverse document frequency across the entire dataset:

𝑓!,# (1)