<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Web Content Filtering Through Knowledge Distillation of Large Language Models</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tamás</forename><surname>Vörös</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Sophos AI</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sean</forename><surname>Bergeron</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Sophos AI</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Konstantin</forename><surname>Berlin</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Sophos AI</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Web Content Filtering Through Knowledge Distillation of Large Language Models</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">BCE101E372FC59618673A3B358E9F534</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:58+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Machine Learning</term>
					<term>Web Content Filtering</term>
					<term>Large Language Models</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We introduce a state-of-the-art approach for URL categorization that leverages the power of Large Language Models (LLMs) to address the primary objectives of web content filtering: safeguarding organizations from legal and ethical risks, limiting access to high-risk or suspicious websites, and fostering a secure and professional work environment. Our method uses LLMs to generate accurate classifications and then employs established knowledge distillation techniques to create smaller, more specialized student models tailored for web content filtering. Distillation yields a student model with a 9% improvement in accuracy when classifying websites, sourced from customer telemetry data collected by a large security vendor, into 30 distinct content categories based on their URLs, surpassing the current state-of-the-art approach. Our student model matches the performance of the teacher LLM with 175 times fewer parameters, allowing the model to be used for in-line scanning of large volumes of URLs, and requires three orders of magnitude less manually labeled training data than the current state-of-the-art approach. Depending on the specific use case, the output generated by our approach can either be returned directly or employed as a pre-filter for more resource-intensive operations involving website images or HTML.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><note place="foot">CAMLIS'23: Conference on Applied Machine Learning for Information Security, October 19-20, 2023, Arlington, VA. tamas.voros@sophos.com (T. Vörös); sean.bergeron@sophos.com (S. Bergeron); konstantin.berlin@sophos.com (K. Berlin)</note><p>Web content filtering is crucial for maintaining network security and regulatory compliance in organizations <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>. The aim of a web content filtering system is to prevent employees from accessing inappropriate content that violates regulatory requirements or company policies. By filtering out high-risk content categories, such as pornography and weapons, it helps to avoid legal liability, reduces the risk of legal or ethical issues arising from exposure to unsuitable content, and promotes a professional work environment. Unlike security classification, which detects hosted malware and phishing attacks, content filtering models address a more general problem that is independent of the attack mechanism. In this work, we address the problem of web content categorization.</p><p>Traditional approaches to website categorization have relied upon creating and maintaining domain-to-category mappings, which are lists of domains grouped by their manually assigned categories <ref type="bibr" target="#b2">[3]</ref>. A natural extension to list-based URL categorization is to enhance the lists with signatures created by analysts, which generalize better than exact string matching. In the case of web content filtering, the most straightforward signature-based approach is to propagate labels based on domains and subdomains, although more complex rules may be applied <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>. 
An example of this kind of label propagation is to maintain a list of known domains with predetermined labels, such as labeling "online-shop.com" as an e-commerce site and "news-site.com" as a news site. All URLs under these domains inherit the label. For instance, any URL under "online-shop.com", such as "online-shop.com/products/clothing", "online-shop.com/products/electronics", and "online-shop.com/cart", can be labeled as e-commerce. Similarly, any URL under "news-site.com", such as "news-site.com/politics", "news-site.com/technology", and "news-site.com/entertainment", can be labeled as news. In this manuscript, for simplicity, we focus on domain label propagation signatures for acquiring ground truth, but the approach could be trivially extended to longest-prefix matching of the URL for ambiguous websites. To provide comprehensive customer telemetry coverage for organizations, one of the most resource-effective manual methods is to rank domains by frequency and label them in descending order; this maximizes the coverage gained per labeled domain. As new websites emerge daily and with over a billion existing websites, maintaining and scaling signature approaches manually for the long tail has become increasingly challenging. This necessitates the integration of machine learning into the classification pipeline <ref type="bibr" target="#b7">[7,</ref><ref type="bibr" target="#b8">8,</ref><ref type="bibr" target="#b9">9]</ref>. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the telemetry coverage of a large security vendor, with the space above the bars representing the infrequently seen long-tail distribution of domains not already covered by domain labeling and label propagation signatures. Maintaining domain-to-category mapping lists and extending them with signatures remains critical in the early stages of security pipelines <ref type="bibr" target="#b10">[10]</ref>. 
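The label propagation just described amounts to a suffix lookup on the hostname. A minimal sketch, with an illustrative mapping and helper name (not the vendor's actual implementation):

```python
from typing import Optional
from urllib.parse import urlparse

# Illustrative domain-to-category mapping; in practice such lists are
# curated by analysts and cover the head of the traffic distribution.
DOMAIN_LABELS = {
    "online-shop.com": "Shopping",
    "news-site.com": "News",
}

def propagate_label(url: str) -> Optional[str]:
    """Return the category inherited from the closest labeled parent domain."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # Walk from the full host down toward the registered domain, so
    # "blog.news-site.com" inherits the label of "news-site.com".
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in DOMAIN_LABELS:
            return DOMAIN_LABELS[candidate]
    return None  # unknown long-tail domain: fall through to the model
```

Ranking the hostnames that return `None` by frequency and sending the most common ones to analysts reproduces the coverage-maximizing labeling order described above.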
These labels serve as initial shortcuts in the filtering pipeline to prevent catastrophic false positives and to provide low latency on more commonly seen websites. Websites like 'stackoverflow.com' are well-known and need not be evaluated by a model, where a potential false positive would translate to a negative impact on the productivity of an organization. In this work, we focus our evaluations on the long tail of the distribution, which aligns with actual deployment scenarios and emphasizes the need for machine learning to address the challenges associated with classifying this ever-growing subset of domains.</p><p>In addition to acting as a pre-filter, domain-to-category mapping lists and label propagation signatures are often used to create the training sets for machine learning models. However, machine learning algorithms tend to memorize patterns rather than understand underlying concepts <ref type="bibr" target="#b11">[11,</ref><ref type="bibr" target="#b12">12]</ref>; thus, learning from already labeled URLs is insufficient for accurate content classification in the long tail of the URL distribution. A model whose parameters are configured to memorize the head of the distribution is undesirable, as signatures already cover such domains without risking false positives. Therefore, our objective is to identify models with superior generalization capabilities for out-of-distribution samples.</p><p>For unknown or new domains, the model must infer a description from the URL. It is useful to view URL classification, especially for web content filtering, as a natural language processing task, treating URLs as semi-sentences. For many of our categories, the URL will frequently contain explicit words advertising its content, specifically semantically related keywords for the given category. For example, a site selling weapons will often contain keywords such as "armaments", "glock", or "gun". 
The current state-of-the-art in URL detection and our chosen baseline, URLTran <ref type="bibr" target="#b9">[9]</ref>, frames the problem as a natural language processing task and fine-tunes a pre-trained BERT model <ref type="bibr" target="#b13">[13]</ref> to detect phishing URLs. The BERT model is an early example of the transformer architecture <ref type="bibr" target="#b14">[14]</ref>, which has since been refined and scaled, giving rise to large language models. Large language models (LLMs) are state-of-the-art on natural language tasks <ref type="bibr" target="#b15">[15]</ref>. LLMs are first pre-trained on large amounts of unlabeled textual data in a task-agnostic manner, learning a general understanding of language such as syntax and semantics <ref type="bibr" target="#b15">[15]</ref>. Once pre-trained, LLMs can effectively generalize to new tasks upon fine-tuning or few-shot prompting with much smaller amounts of data <ref type="bibr" target="#b15">[15]</ref>. The amount of data needed for an LLM to generalize to a new task is often several orders of magnitude less than the amount needed to fully train a smaller model. Direct use of LLMs for URL content classification in production is prohibitive due to cost considerations at scale <ref type="bibr" target="#b16">[16]</ref>. Fine-tuning smaller LLMs with lower inference costs results in a loss of performance. Through knowledge distillation <ref type="bibr" target="#b17">[17]</ref>, LLM-labeled long-tail data enables a smaller student model to improve its performance while maintaining the computational efficiency necessary for production. Turc et al. <ref type="bibr" target="#b18">[18]</ref> proposed an approach that utilizes knowledge distillation from the teacher's predictive distribution (soft labels) followed by supervised fine-tuning of the student model. 
In the domain of web content classification, we combine the steps of distillation and fine-tuning, and our computationally efficient student matches the performance of the teacher model. Instead of a predictive distribution, we distill the teacher using hard labels. The student model has a low inference cost and is well-suited for web content filtering in production.</p><p>The main contributions of this paper are as follows:</p><p>• We demonstrate that when fine-tuned on data labeled with domain propagation signatures, large language models outperform standard deep learning models by 9% in terms of accuracy on the long tail categorization problem. • We demonstrate that we can fine-tune a large language model using 10,000 samples to achieve better performance than the current state-of-the-art approach trained on 10 million samples. • We showcase the effective application of knowledge distillation from a fine-tuned LLM to boost the performance of a smaller, more computationally efficient model, specifically for web content filtering tasks. We attain performance levels comparable to the original LLM using a model that is 175 times smaller, decreasing from 770 million parameters to just 4 million. This reduction in size makes the model more suitable for production and enables practical deployment across various contexts, such as serving as a general pre-filter for all incoming network traffic in firewalls. • We propose a novel validation approach for the community to adopt, which more accurately assesses model performance in realistic scenarios where the model works alongside a domain-to-category mapping list of ground truth labels, extended via domain label propagation signatures. In this setting, the analysis focuses on the model's ability to label the long tail, yielding a more relevant metric.</p><p>Our paper is structured as follows: In Section 1 we introduce the research problem and elucidate the motivation behind our proposed approach. 
In Section 2 we review relevant literature and prior work in the field. In Section 3 we provide a comprehensive description of our methodology, encompassing the dataset and experimental setups. In Section 4 we present our results, which include a comparison of our approach's performance against the current state-of-the-art, an analysis of the benefits of LLMs in terms of accuracy and sample efficiency, and an exploration of deployment challenges together with our proposed solution utilizing knowledge distillation for more compact and computationally efficient models. Lastly, in Section 5 we conclude the paper, outlining potential avenues for future research in this domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Previous work in this field has primarily focused on security classification rather than content classification and filtering. Since machine learning approaches to security classification can be readily reformulated from binary to multi-class classification by modifying the last layer of the neural network, they remain relevant to the task of content classification. We therefore compare against and build upon security publications, as they are better studied.</p><p>Early work on URL-only classification for phishing detection using manually derived feature sets employed both generic features and features meant to detect certain obfuscation techniques, such as obfuscation of the host with another domain <ref type="bibr" target="#b19">[19]</ref>. The features were divided into four groups: Page Based, Domain Based, Type Based, and Word Based. The authors focused on manual feature engineering and applied only logistic regression as their classifier. A range of machine learning models, including Random Forests, Logistic Regression, Support Vector Machines, Naive Bayes, and Gradient Boosting, have been applied to detect phishing URLs using manually extracted feature sets <ref type="bibr" target="#b7">[7,</ref><ref type="bibr" target="#b20">20]</ref>. Feature sets may be entirely lexically derived, such as the length of the URL, the number of digits in the primary domain, and the number of special characters in the path <ref type="bibr" target="#b21">[21]</ref>. In addition to lexical features, domain-specific features such as the number of passive DNS changes or the remaining time of the SSL certificate may be incorporated <ref type="bibr" target="#b22">[22]</ref>. 
Manual features may also be extracted from the retrieved information of lookups (Whois, GSB Reporting, Google Ranking, and Selenium Rendering) <ref type="bibr" target="#b23">[23]</ref>.</p><p>The manual feature extraction approach is difficult to maintain, as adversaries tend to adapt their obfuscation methods to avoid detection, so models have shifted to a featureless approach based on the raw string as input. Deep learning methods learn and then automatically extract the feature set from the raw URL during training. The use of automatically extracted features does not preclude the inclusion of manual features, however, as the optimal combination of manual and automatic features can be found with genetic algorithms <ref type="bibr" target="#b24">[24,</ref><ref type="bibr" target="#b25">25]</ref>.</p><p>Automatic feature extraction can be done at various levels of granularity, starting at the character level. Saxe et al. <ref type="bibr" target="#b8">[8]</ref> encode a URL by replacing each character with its corresponding ID, whereby features are extracted from the encoded URL with sequential embedding and convolutional layers. This approach outperformed a baseline that uses a manual feature set. Learning meaningful context-independent representations is difficult with character-level tokenization, as a character token does not carry the same meaning that a word does. More recent approaches, such as subword-level and word-level tokenization, have been developed in natural language processing to make it easier for models to maintain semantic meaning in common subwords and learn more meaningful context-independent representations.</p><p>The application of word-level tokenization to URL classification was first proposed by Le et al. <ref type="bibr" target="#b26">[26]</ref>, who extracted both character-level and word-level features. Each feature set is fed through its own series of sequential embedding and convolutional layers before being fused. 
Tajaddodianfar et al. <ref type="bibr" target="#b27">[27]</ref> expand on this approach by first training the word embeddings in an unsupervised manner via FastText <ref type="bibr" target="#b28">[28]</ref>. The word and character convolutional stems include several convolutional layers in parallel with dilated convolutions, allowing the model to adaptively grow in depth and width, extracting N-grams of various lengths. In addition to using both character and word-level features, Bu et al. <ref type="bibr" target="#b29">[29]</ref> apply a triplet network structure to address class imbalances and better learn the similarity between URLs.</p><p>In addition to feature set selection, the choice of model architecture plays a large role in the performance of a URL classification model. Transformers have achieved state-of-the-art results in many natural language processing tasks, making them a good candidate for URL classification after fine-tuning or even custom pre-training <ref type="bibr" target="#b30">[30,</ref><ref type="bibr" target="#b31">31,</ref><ref type="bibr" target="#b9">9,</ref><ref type="bibr" target="#b32">32,</ref><ref type="bibr" target="#b33">33]</ref>. In addition to the URL, a transformer can leverage tokenized features of the HTML <ref type="bibr" target="#b34">[34]</ref>. A URL classification system might employ different architectures in parallel, fusing the output of models with a convolutional architecture and a transformer architecture <ref type="bibr" target="#b35">[35]</ref>. 
Instead of fusing model outputs, a system may employ an ensemble of different architectures, including Decision Trees, LSTMs, and transformers, for URL classification <ref type="bibr" target="#b36">[36]</ref>.</p><p>Other architectures applied to URL classification include graphical networks <ref type="bibr" target="#b37">[37,</ref><ref type="bibr" target="#b38">38]</ref> and GANs <ref type="bibr" target="#b39">[39,</ref><ref type="bibr" target="#b40">40]</ref>. AutoEncoders have proven useful against zero-day attacks <ref type="bibr" target="#b41">[41]</ref>. Beyond the scope of this paper, images of the webpage may be incorporated in addition to the URL and HTML sequences <ref type="bibr" target="#b42">[42,</ref><ref type="bibr" target="#b43">43]</ref>. The task of classification may also be reformulated by approaching detection from a reinforcement learning perspective <ref type="bibr" target="#b44">[44]</ref> or from the perspective of thwarting an adversarial opponent <ref type="bibr" target="#b45">[45,</ref><ref type="bibr" target="#b46">46]</ref>.</p><p>The current state-of-the-art for URL-only classification for phishing detection, URLTran <ref type="bibr" target="#b9">[9]</ref>, utilizes the transformer architecture underpinning LLMs. Maneriker et al. fine-tune a pre-trained BERT model on Microsoft's Edge and Internet Explorer production browsing telemetry. Parallel to URLTran is the Unified Text-to-Text Cybersecurity (UTS) model. Pal et al. <ref type="bibr" target="#b47">[47]</ref> train a multi-task encoder-decoder LLM on cybersecurity data that includes URL phishing detection. Although Pal et al. introduce LLMs to URL phishing detection, they neither explore the few-shot capabilities of LLMs in the URL domain nor test the capabilities of LLMs at scale. 
Compared to URLTran, UTS does not consider a methodology that would allow its large model to be used in production, and it reports a lower F1 score on a random split than URLTran achieves on the industry-standard time split. Therefore, URLTran will act as our baseline against which all of our results are compared.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>In this section, we describe our methodology for collecting data and constructing training, validation, and test sets. We also explain our experimental setup and provide a detailed account of how we trained our model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data</head><p>We obtained our dataset from a large security vendor's customer telemetry data, sourced from its firewall and endpoint products over a period spanning July 1, 2022 to December 23, 2022.</p><p>We track 30 categories in our dataset. These categories were defined by a team of expert analysts to be representative of the most common internet content categories as well as the most impactful, where impact is defined by the potential to affect productivity, the degree of liability for the organization, and the degree of associated ethical concerns. The categories include: "Chat", "Games", "Shopping", "Sports", "News", "Job Search", "Search Engines", "Alcohol", "Gambling", "Weapons", "Porn", "Banking", "Business", "Education", "Entertainment", "Food and Dining", "Government", "Health and Medicine", "Motor Vehicles", "Peer to Peer", "Real Estate", "Religion", "Travel", "Translators", "Computer and Internet", "Hunting and Fishing", "Marijuana", "Radio and Audio Hosting", "Social Networking", and "Video Hosting". The majority of websites in our dataset belong to categories such as "Computer and Internet", "Search Engines", and "Business", while niche categories such as "Hunting and Fishing" and "Marijuana" have fewer instances. Figure <ref type="figure">A5</ref> shows the distribution of categories in our dataset. We define our categorization task as a closed-world problem, meaning every URL belongs to exactly one of the 30 categories. It is important to note that, due to limitations in the domain-to-category database, we only consider a single category per URL, even though some pages may realistically have multiple category labels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Training Sets</head><p>To construct our training dataset, which spans July 1, 2022 to August 19, 2022, we uniformly sampled 10 million distinct URLs (out of billions of URL lookups) that had been labeled using a domain-to-category mapping database with label propagation. Additionally, we sampled 10 million URLs from this period that did not correspond to a signature (unlabeled). The unlabeled URLs were set aside for training augmentation purposes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Validation and Test Sets</head><p>We sampled an evaluation dataset spanning August 19, 2022 to December 23, 2022 and divided it into validation and test sets for two evaluation scenarios. The validation sets were based on data first seen between August 19, 2022 and November 24, 2022, while the test sets included data first seen between November 24, 2022 and December 23, 2022.</p><p>We created a domain and time split to simulate a long-tail deployment setting. We separated the data based on the first-seen time of the URL and the first-seen time of the URL's domain, where the first-seen time of a domain refers to the earliest instance of a URL from that domain. As a result, there is no domain overlap between the training, validation, and test sets. This approach allowed us to better approximate the unlabeled part of the telemetry.</p><p>To compare our results with the industry-standard evaluation methodology, we also created a time split. This split was sampled from the same time span as the domain and time split but without the constraint of dividing based on the domain's first-seen time.</p><p>For the domain and time split, the validation set comprised 79,313 unique URLs from 30,897 unique domains, with a maximum of 5 URLs per domain. The test set included 110,624 unique URLs from 43,996 domains. For the time split, we sampled 183,935 URLs from 62,961 domains.</p><p>To compare the various splits, we display the most common domains and their frequencies for the labeled training data, both test splits, and the unlabeled training data in Table <ref type="table" target="#tab_0">1</ref>. The labeled training set and the time split are dominated by common domains such as "google.com". The domain and time split is most similar to the unlabeled long tail of the data, where the desired value of machine learning resides. 
To quantitatively assess the disparities between the domain distributions of the time split and the domain and time split, which more closely models the long tail, we employed the Kullback-Leibler (KL) divergence as a metric for measuring the dissimilarity between token distributions. The KL divergence values were calculated between each validation split and the training dataset as the reference. We tokenized all the URLs in the training dataset using BERT tokenization and combined all of the tokens to define the distribution of the base training dataset. We then tokenized all the URLs in both validation splits using BERT tokenization and converted each URL's token sequence into a probability distribution by computing a normalized histogram. The KL divergence between the token probability distribution of each URL and the reference distribution was then determined using the entropy function. Figure <ref type="figure" target="#fig_1">2</ref> illustrates that the token distribution of the domain and time split displays substantially higher KL divergence values from the reference compared to the time split. This observation highlights the distinct nature of the domain distributions in the two validation splits and the similarity between the unlabeled part of the customer telemetry and the domain and time split. </p></div>
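The KL computation above can be sketched as follows. For brevity, this sketch substitutes a simple regex tokenizer for BERT tokenization, and the function names and toy URLs are illustrative:

```python
import math
import re
from collections import Counter

def tokenize(url: str) -> list[str]:
    # Stand-in for BERT tokenization: split on non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def token_distribution(urls: list[str]) -> dict[str, float]:
    """Normalized histogram of tokens across a collection of URLs."""
    counts = Counter(t for url in urls for t in tokenize(url))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kl_divergence(p: dict[str, float], q: dict[str, float], eps: float = 1e-9) -> float:
    """D_KL(P || Q), smoothing tokens absent from the reference Q."""
    return sum(pv * math.log(pv / q.get(t, eps)) for t, pv in p.items())

# Reference distribution built from (toy) training URLs; each validation
# URL is then scored individually against it, as in Figure 2.
reference = token_distribution(["google.com/mail", "google.com/maps"])
per_url_kl = kl_divergence(token_distribution(["rare-hobby-forum.net/threads"]), reference)
```

A long-tail URL shares few tokens with the head-dominated reference, so its per-URL divergence is large, mirroring the gap between the two splits shown in Figure 2.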
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Experiments</head><p>The primary objective of our experiments was to identify the best-performing model in terms of accuracy on our dataset, while using as few training labels as possible and keeping the model as small as possible. To achieve this, we varied the training set size as a hyperparameter for each LLM, compact model, and the baseline. We explored training set sizes ranging from few-shot to large-scale learning, increasing the sample size from 10 samples per category to 5 million total samples, growing by an order of magnitude at each step. For a given sample step size 𝑁 , the exact number of samples per category was determined by the minimum of 𝑁 and the total labeled instances in that category.</p><p>Our next goal was to refine the top-performing large language model (LLM) configuration into a more compact student model. We achieved this by using labels generated by the best-performing LLM to train smaller models.</p><p>We labeled 10 million unlabeled URLs from our dataset using the best-performing LLM, utilizing its outputs as hard labels. This resulted in a total training pool of 20 million samples: the 10 million signature-labeled base training set and an additional 10 million samples labeled by the LLM. We then investigated the impact of combining these labels using various mixing ratios of labeled samples from the base training set and LLM-labeled samples. Each compact student model and the baseline were trained on a variety of dataset configurations, each containing a total of 10 million samples.</p><p>We began with the 10-million base training set, incorporating LLM-generated labels at ratios of 0.0, 0.25, 0.50, 0.75, and 1.0. The 0.0 ratio used only the base training set, while the 0.25 ratio included 7.5 million base URLs and 2.5 million LLM-generated ones. At 0.5, the sources were evenly split with 5 million each. The 0.75 ratio contained 2.5 million base and 7.5 million LLM URLs, and the 1.0 ratio relied solely on LLM-generated labels. 
By varying the mixing ratios, we were able to assess the effectiveness of our knowledge distillation process and compare the contribution of LLM-generated labels against simply using signature-generated labels.</p><p>We trained and compared the performance of five models: BERT-based URLTran as the baseline <ref type="bibr" target="#b9">[9]</ref>, which demonstrated state-of-the-art performance for URL classification; eXpose <ref type="bibr" target="#b8">[8]</ref> and BERTiny <ref type="bibr" target="#b48">[48]</ref> as the student models; and T5 Large <ref type="bibr" target="#b49">[49]</ref> and GPT-3 Babbage <ref type="bibr" target="#b15">[15]</ref> as the teacher models. The size configurations of our teacher models were limited by budgetary constraints, precluding larger configurations such as GPT-3 Davinci and T5-11B. Our student models were chosen for the following reasons: BERTiny is the smallest pre-trained configuration of the baseline, and the inclusion of eXpose allows us to demonstrate the improvements of the transformer architecture over convolutional models on natural language tasks, specifically web content categorization. Unless otherwise noted, all experiments were evaluated on the test set of both validation splits. The GPT-3 Babbage model was not fine-tuned on 5 million samples due to cost considerations.</p></div>
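The dataset mixing described above can be sketched as follows; the helper name and toy pool sizes are illustrative stand-ins for the two 10-million-sample sources:

```python
import random

def mixed_training_set(base, llm_labeled, ratio, total, seed=0):
    """Compose a fixed-size training set from signature-labeled (base)
    and LLM-labeled pools. `ratio` is the fraction of LLM-labeled
    samples: 0.0 = signatures only, 1.0 = LLM labels only."""
    rng = random.Random(seed)
    n_llm = round(total * ratio)
    mix = rng.sample(llm_labeled, n_llm) + rng.sample(base, total - n_llm)
    rng.shuffle(mix)
    return mix

# Toy pools; each element is a (url, label_source) pair.
base = [(f"base-url-{i}", "signature") for i in range(1000)]
llm = [(f"llm-url-{i}", "llm") for i in range(1000)]
train = mixed_training_set(base, llm, ratio=0.25, total=400)
```

Sweeping `ratio` over 0.0, 0.25, 0.5, 0.75, and 1.0 at a fixed `total` reproduces the five dataset configurations compared in the experiments.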
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Training</head><p>For all models, we pre-processed the data by splitting at the first occurrence of the "?" character and removing the query parameters, as the query is assumed to be noisy and to carry no meaningful information. All URLs were truncated to a fixed length of 128 characters, as we saw no improvement from further increasing the length. The base pre-trained models and tokenizers for all T5 Large, BERT, and BERTiny configurations were the HuggingFace defaults <ref type="bibr" target="#b50">[50]</ref>.</p><p>For all reported T5 configurations, we fine-tuned all weights of a pre-trained T5 Large model using the Adafactor optimizer <ref type="bibr" target="#b51">[51]</ref>. Early stopping was applied by monitoring performance on the validation set of the domain and time split. For all reported GPT-3 configurations, we fine-tuned the Babbage model using the OpenAI API.</p><p>T5 and GPT-3 are generative models that can utilize semantic relationships between class labels and keywords in a URL when making predictions. Consequently, we employed the literal class labels as our prediction target. When reporting aggregate metrics, out-of-vocabulary (OOV) predictions are not considered a separate class but are counted as misclassifications for every class. Additionally, any unlabeled data for which the LLM generates an OOV prediction is excluded from the distillation process.</p><p>For GPT-3, the temperature was set to 0 to ensure deterministic results at inference. The logit bias for tokens associated with the class labels was set to 100 to ensure exclusive selection of the expected tokens. Finally, the stop token was set to the stop sequence seen during training.</p><p>For the student models, we trained a 1D convolutional eXpose model and fine-tuned all weights of a pre-trained BERTiny model. We fine-tuned all weights of a pre-trained BERT model to reproduce the architecture of URLTran as our baseline. 
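The pre-processing applied to every model above (dropping the query string and truncating to 128 characters) can be sketched as follows; the helper name is illustrative:

```python
def preprocess_url(url: str, max_len: int = 128) -> str:
    """Drop the query string (assumed noisy) and truncate to a fixed
    length, mirroring the pre-processing applied before tokenization."""
    return url.split("?", 1)[0][:max_len]
```

For example, `preprocess_url("news-site.com/search?q=weather")` yields `"news-site.com/search"`.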
No custom vocabulary was created for the BERT-based models. Hyperparameter configurations for T5, BERTiny, BERT, and eXpose may be found in Tables A7, A4, A5, and A6, respectively. In this section, we present the key findings and results of our two sets of experiments. We report the results in terms of accuracy, with additional metrics for both experiments provided in the Appendix.</p></div>
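The URL pre-processing described in the Training section (dropping everything from the first "?" and truncating to 128 characters) can be sketched as a small helper; the function name is ours, not from the paper:

```python
def preprocess_url(url: str, max_len: int = 128) -> str:
    """Strip the query string and truncate the URL to a fixed length,
    mirroring the pre-processing described above (illustrative sketch)."""
    url = url.split("?", 1)[0]  # drop query parameters, assumed noisy
    return url[:max_len]        # fixed length; longer gave no gains
```

The truncated string is what the respective tokenizer (HuggingFace defaults for the BERT/T5 variants) would then consume.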
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>The performance of the various models as a function of the logarithm of the training sample count is displayed in Figure <ref type="figure" target="#fig_2">3</ref>. The top-scoring configuration for each model is detailed in Table <ref type="table">2</ref>. On the domain and time split, the best performing model, T5 Large, achieves 46.3% accuracy after being fine-tuned on 10,000 samples. GPT-3 Babbage attains 44.4% accuracy after fine-tuning on 10,000 samples. Both LLMs surpass the best baseline configuration, which achieves 38.3% accuracy. BERTiny and eXpose achieve 35.7% and 30.2% accuracy, respectively, when trained on 5 million samples.</p><p>On the time split, eXpose achieves 92.8% accuracy when trained on 5 million samples. BERTiny, fine-tuned on 5 million samples, attains 97.6% accuracy. The best configurations for the baseline, GPT-3 Babbage, and T5 Large achieve 97.1%, 98.14%, and 97.5% accuracy, respectively. Additional metrics for all experiments are reported in Table <ref type="table" target="#tab_0">A10</ref> for the time split and in Table <ref type="table">A9</ref> for the domain and time split.</p><p>On the domain and time split, the best performance was achieved with T5 on 10,000 training samples, so we selected it as our teacher model. For the domain and time split, we found the best ratio to be 1.0, where the training set consists entirely of the 10 million previously unlabeled URLs labeled by T5. Training eXpose on all of them increases the accuracy from 31.5% to 45%. Fine-tuning BERTiny on all 10 million LLM labels, compared to the 10 million base training set, improves the accuracy from 37.5% to 46.2%. 
Finally, fine-tuning URLTran on all 10 million LLM labels, compared to the 10 million base training set, raises the accuracy from 41.6% to 46.8%.</p><p>For the traditional time split, augmentation at a ratio of 0.75 also increased performance, albeit marginally.</p><p>The performance of the students and the baseline trained via knowledge distillation is shown in the augmentation plot of Figure <ref type="figure" target="#fig_2">3</ref> as a function of the LLM label ratio in the training data. The top-scoring distillation configuration for each student model is detailed in Table <ref type="table">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>The best performing model configurations are presented, including the distilled versions of the students and baseline. The accuracy for each model's top configuration is displayed, along with the model's parameter count relative to the best performing LLM. For the Time and Domain split, the LLM label ratios correspond to 1.0, and for the Time split, the ratio is 0.75, as these were consistently the best across the models. More detailed results can be found in Table <ref type="table">A9</ref>  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Discussion</head><p>A comparison of the models' performance on the two evaluation splits reveals that results on the time split, the traditional validation approach, are overly optimistic. Small models such as BERTiny, trained merely on signature-driven data, exhibit performance comparable to T5 and GPT-3. The disparity in model performance between the domain and time split and the time split, particularly for small models, underscores that signature-sourced data is repetitive and can be memorized with just a few million parameters. Time split validation measures a model's ability to match the signature distribution, whereas in a production setting the primary concern within the context of the overall pipeline is a model's capacity to generalize to new data from the long tail that falls outside the coverage of signatures. When considering the domain and time split, which aligns more closely with real-world performance on unlabeled data, small models no longer match the performance of LLMs, as seen in Figure <ref type="figure" target="#fig_2">3</ref>. Beyond 10,000 samples, LLMs show minimal to no performance gains when scaling up further. Conversely, the performance of the small models and the baseline has not yet converged at 5 million training samples. This demonstrates the sample efficiency of LLMs in the domain of website content categorization.</p><p>LLMs outperform the student models, but they still fall short of perfect accuracy on the domain and time split. This discrepancy can be attributed to two main factors. First, due to dataset limitations, web content classification is framed as a single-label classification problem. Table <ref type="table" target="#tab_2">3</ref> displays a set of LLM misclassifications on the domain and time split, highlighting that a URL could potentially belong to multiple categories. 
In the first three samples, the analyst opted for the more generic label, while the model chose the more generic label in the following three samples. Both predicted and true labels could be considered correct in all six cases, suggesting that the true performance is likely better than the metrics indicate because of the single-label limitation. This trade-off between equally correct specific and general labels becomes evident when examining the confusion matrix for a T5 Large model's performance on the domain and time split, displayed in Figure <ref type="figure">A4</ref> in the Appendix. As the confusion matrix shows, the LLM tends to generate class labels that are more specific than the manual labels.</p><p>The second factor occurs when a URL lacks keywords or context related to its category, as demonstrated by the last six entries in Table <ref type="table" target="#tab_2">3</ref>. For the middle two URLs, the model was misled by a prominent keyword in the URL that was unrelated to its content. The final four URLs contain no apparent signal. Consequently, the large-scale pre-training of LLMs struggles to transfer knowledge effectively to such URLs from the long tail: if a URL lacks clear indicators of its category, the LLM may not classify it accurately.</p><p>Our results reveal that mixing in the LLM-generated labels significantly enhances the performance of the student models BERTiny and eXpose, as shown in Figure <ref type="figure" target="#fig_2">3</ref>. Through this simple form of augmentation, we nearly matched the 46.3% accuracy of T5 Large, the best-performing LLM, with a transformer model whose parameter count is several orders of magnitude smaller (0.57% of the teacher) and which could reasonably be deployed in-line in production.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In conclusion, our paper contributes to the field of web content classification with the development of lightweight models distilled from fine-tuned LLMs. We have demonstrated that LLMs, when fine-tuned on data labeled with domain propagation signatures, significantly outperform the current state-of-the-art approach on the long tail categorization problem. Our teacher-student training approach enables the distillation of LLMs into models 175 times smaller without sacrificing accuracy, making deployment practical in a wide variety of new contexts. The number of manual labels required to fine-tune the teacher LLM is orders of magnitude smaller than what is required for convergence of the current state-of-the-art approach. Furthermore, we have proposed a new validation approach that better measures model performance in realistic scenarios, which should be adopted by the community to improve generalization to unseen data.</p><p>Expanding beyond web content classification, the cybersecurity field could greatly benefit from proven methods of distilling large language models (LLMs) into more compact versions. This approach is particularly valuable when dealing with large data volumes and expensive training samples, especially when the model is applied to out-of-distribution cases. For web content classification specifically, we suggest future work focus on augmenting the training data and feature space with HTML and image data, utilizing GPT-4 as a teacher, allowing URLs to have more than one label, and re-working signatures for the assignment of general categories. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Labeling drop-off. The plot visualizes the proportion of analyst-labeled domains across different popularity levels, as observed by a large security vendor. 
The domain popularity is represented on the 𝑥-axis using a logarithmic scale, where higher values indicate more popular domains. Each bar in the plot corresponds to a specific popularity bin, with the height of the bar illustrating the proportion of labeled domains within that bin.</figDesc><graphic coords="2,183.05,394.21,229.18,143.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: KL Divergence over BERT tokens. The 𝑥-axis represents the possible range of KL divergence values over BERT tokens, while the 𝑦-axis represents the estimated probability density of these values. The plot quantifies the difference between validation split URLs as compared to the training set token distribution, with higher KL divergence values indicating greater differences between the BERT token distributions of the base training set and validation split.</figDesc><graphic coords="8,183.05,204.52,229.18,174.10" type="bitmap" /></figure>
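The token-distribution comparison behind Figure 2 can be approximated with a simple smoothed KL-divergence estimate over token frequency distributions. The smoothing scheme, the divergence direction KL(validation || training), and the function name are our assumptions for this sketch:

```python
import math
from collections import Counter

def token_kl(train_tokens, val_tokens, eps=1e-9):
    """Smoothed KL divergence of the validation token distribution from
    the training token distribution (illustrative sketch; the paper's
    exact estimator may differ)."""
    p, q = Counter(val_tokens), Counter(train_tokens)
    vocab = set(p) | set(q)
    n_p, n_q = sum(p.values()), sum(q.values())
    kl = 0.0
    for tok in vocab:
        # additive smoothing so unseen tokens do not produce log(0)
        pi = (p[tok] + eps) / (n_p + eps * len(vocab))
        qi = (q[tok] + eps) / (n_q + eps * len(vocab))
        kl += pi * math.log(pi / qi)
    return kl
```

Higher values indicate that a validation split's BERT-token distribution diverges more from the training set, as the figure caption describes.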
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Scaling and augmentation results on the domain and time split. Left: Scaling results: the performance of various models in relation to the logarithm of the training sample size. Right: Augmentation results: compares the top-performing LLM and baseline configurations with the performance of the student models as a function of the mixing ratio for LLM-generated labels. The GPT-3 Babbage model was not fine-tuned on 5 million samples due to cost considerations.</figDesc><graphic coords="10,89.29,193.12,204.18,122.51" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="16,172.63,397.84,250.01,212.27" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="17,120.54,115.02,354.19,175.32" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Domains and their frequencies in the train and test sets.</figDesc><table><row><cell cols="2">Training Set</cell><cell>Time Split Test Set</cell><cell cols="2">Domain and Time Split Test Set</cell><cell>Unlabeled Data</cell></row><row><cell>Domain</cell><cell>Frequency % Domain</cell><cell cols="2">Frequency % Domain</cell><cell>Frequency % Domain</cell><cell>Frequency %</cell></row><row><cell>google.com</cell><cell cols="2">20 google.com</cell><cell>33 tomcleaneraddon.com</cell><cell cols="2">&lt;1 wymondhamcollege.org</cell><cell>2</cell></row><row><cell>microsoft.com</cell><cell cols="2">6 microsoft.com</cell><cell>4 ammdx.com</cell><cell>&lt;1 mfa.cloud</cell><cell>1</cell></row><row><cell>googleapis.com</cell><cell cols="2">5 gstatic.com</cell><cell>2 ogp.me</cell><cell cols="2">&lt;1 dimmittisd.net</cell><cell>&lt;1</cell></row><row><cell>cedexis-radar.net</cell><cell cols="2">3 googlesyndication.com</cell><cell>2 officested.com</cell><cell>&lt;1 qq.com.cn</cell><cell>&lt;1</cell></row><row><cell>gvt1.com</cell><cell cols="2">3 doubleclick.net</cell><cell>2 dimensionu.com</cell><cell>&lt;1 gnsmat.co.uk</cell><cell>&lt;1</cell></row><row><cell>facebook.com</cell><cell>2 msn.com</cell><cell></cell><cell>2 shreemaruti.com</cell><cell cols="2">&lt;1 headlandentertainment.com</cell><cell>&lt;1</cell></row><row><cell>zeotap.com</cell><cell cols="2">2 googleusercontent.com</cell><cell>2 wbe-eindhoven.nl</cell><cell>&lt;1 murray.edu</cell><cell>&lt;1</cell></row><row><cell>youtube.com</cell><cell cols="2">1 youtube.com</cell><cell>1 vapornodes.finance</cell><cell>&lt;1 pcbid.top</cell><cell>&lt;1</cell></row><row><cell>amazonaws.com</cell><cell cols="2">1 amazonaws.com</cell><cell>1 trendingtrck.com</cell><cell cols="2">&lt;1 stoughtonwi.com</cell><cell>&lt;1</cell></row><row><cell>sharepoint.com</cell><cell cols="2">1 cloudfront.net</cell><cell>1 
bluedrop360.com</cell><cell>&lt;1 aveha.com</cell><cell>&lt;1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>and TableA10.</figDesc><table><row><cell></cell><cell>Accuracy</cell><cell>Accuracy</cell><cell>Parameter Count</cell><cell>Parameter Count Relative</cell><cell></cell></row><row><cell>Model</cell><cell>Time and Domain Split</cell><cell>Time Split</cell><cell>in millions</cell><cell>to the Teacher (%)</cell><cell>Training Samples Count</cell></row><row><cell>eXpose (Conv)</cell><cell>0.30</cell><cell>0.93</cell><cell>3.3</cell><cell></cell><cell>5 × 10 6</cell></row><row><cell>BERTiny</cell><cell>0.36</cell><cell>0.97</cell><cell>4.4</cell><cell></cell><cell>5 × 10 6</cell></row><row><cell>URLTran (BERT)</cell><cell>0.38</cell><cell>0.97</cell><cell>110</cell><cell></cell><cell>1 × 10 5</cell></row><row><cell>T5 Large</cell><cell>0.46</cell><cell>0.97</cell><cell>770</cell><cell></cell><cell>1 × 10 4</cell></row><row><cell>GPT3 Babbage</cell><cell>0.45</cell><cell>0.98</cell><cell>6700</cell><cell></cell><cell>1 × 10 5</cell></row><row><cell>eXpose + T5 Labels</cell><cell>0.45</cell><cell>0.98</cell><cell>3.3</cell><cell>0.42</cell><cell>1 × 10 7</cell></row><row><cell>BERTiny + T5 Labels</cell><cell>0.46</cell><cell>0.98</cell><cell>4.4</cell><cell>0.57</cell><cell>1 × 10 7</cell></row><row><cell>URLTran + T5 Labels</cell><cell>0.47</cell><cell>0.99</cell><cell>110</cell><cell>14.29</cell><cell>1 × 10 7</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Examples of LLM (T5) misclassifications. Comparison of LLM performance on the domain and time split, highlighting the impact of the single-label strategy and keyword-absent URLs.</figDesc><table><row><cell>Domain</cell><cell>LLM Label</cell><cell>True Label</cell></row><row><cell cols="2">citytocoastneurosurgery.com.au HEALTH AND MEDICINE</cell><cell>BUSINESS</cell></row><row><cell>twittodon.com</cell><cell>SOCIAL NETWORKING</cell><cell>COMPUTER AND INTERNET</cell></row><row><cell>robinsonmalls.com/mall-info</cell><cell>SHOPPING</cell><cell>BUSINESS</cell></row><row><cell>online-weinshop.at</cell><cell>SHOPPING</cell><cell>ALCOHOL</cell></row><row><cell>www.fourbakery.com</cell><cell>BUSINESS</cell><cell>FOOD</cell></row><row><cell>sargenttoolsonline.com</cell><cell>BUSINESS</cell><cell>SHOPPING</cell></row><row><cell>praeyforthegods.com</cell><cell>RELIGION</cell><cell>GAMES</cell></row><row><cell>www.hygiene-3d.com</cell><cell>HEALTH AND MEDICINE</cell><cell>SHOPPING</cell></row><row><cell>beta.x9zb.live</cell><cell cols="2">COMPUTING AND INTERNET GAMBLING</cell></row><row><cell>www.857zb6.com</cell><cell>ENTERTAINMENT</cell><cell>SPORTS</cell></row><row><cell>www.lxf.cz</cell><cell>BUSINESS</cell><cell>SHOPPING</cell></row><row><cell>g11.178tiyu.com</cell><cell>ENTERTAINMENT</cell><cell>SPORTS</cell></row></table></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Supplementary Plots and Tables</head></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>García</surname></persName>
		</author>
		<ptr target="https://www.academia.edu/11471179/Web_Content_Filtering" />
		<title level="m">Web content filtering. advances in computers</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S K</forename><surname>Baishya</surname></persName>
		</author>
		<ptr target="http://www.ijcstjournal.org/volume-7/issue-3/IJCST-V7I3P5.pdf" />
		<title level="m">A review on web content filtering, its technique and prospects</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">An empirical analysis of phishing blacklists</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wardman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Warner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cranor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Sixth Conference on Email and Anti-Spam (CEAS)</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Snyder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Livshits</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kapravelos</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.11910</idno>
		<title level="m">Improving web content blocking with event-loop-turn granularity javascript signatures</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Mitigate web phishing using site signatures</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-P</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-L</forename><surname>Yeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-T</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">TENCON 2010-2010 IEEE Region 10 Conference</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="803" to="808" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">A novel visual similarity-based phishing detection scheme using hue information with auto updating database</title>
		<author>
			<persName><forename type="first">S</forename><surname>Haruta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Yamazaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Asahina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sasase</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page">25</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m">Asia-Pacific Conference on Communications (APCC)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="280" to="285" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Beyond blacklists: learning to detect malicious web sites from suspicious urls</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">K</forename><surname>Saul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Savage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Voelker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</title>
				<meeting>the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="1245" to="1254" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Saxe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Berlin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1702.08568</idno>
		<title level="m">expose: A character-level convolutional neural network with embeddings for detecting malicious urls, file paths and registry keys</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Urltran: Improving phishing url detection using transformers</title>
		<author>
			<persName><forename type="first">P</forename><surname>Maneriker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Stokes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">G</forename><surname>Lazo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Carutasu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tajaddodianfar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gururajan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MILCOM 2021-2021 IEEE Military Communications Conference (MILCOM)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="197" to="204" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">real attackers don&apos;t compute gradients</title>
		<author>
			<persName><forename type="first">G</forename><surname>Apruzzese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">S</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dambra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Freeman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pierazzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">A</forename><surname>Roundy</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2212.14315</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2212.14315" />
	</analytic>
	<monogr>
		<title level="m">Bridging the gap between adversarial ml research and practice</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dhurandhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tajer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Yan</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2212.00850</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2212.00850" />
		<title level="m">When neural networks fail to generalize? a model sensitivity perspective</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Generalization in nli: Ways (not) to go beyond simple heuristics</title>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Drozd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2110.01518</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">Bert: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<ptr target="https://siteefy.com/how-many-websites-are-there/" />
		<title level="m">How many websites are there</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1503.02531</idno>
		<title level="m">Distilling the knowledge in a neural network</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Turc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno>CoRR abs/1908.08962</idno>
		<ptr target="http://arxiv.org/abs/1908.08962" />
		<title level="m">Well-read students learn better: The impact of student initialization on knowledge distillation</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A framework for detection and measurement of phishing attacks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Garera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Provos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Rubin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2007 ACM workshop on Recurring malcode</title>
				<meeting>the 2007 ACM workshop on Recurring malcode</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Oshingbesan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ekoh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Okobi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Munezero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Richard</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2209.09630</idno>
		<title level="m">Detection of malicious websites using machine learning techniques</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lloyd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Westin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Seethapathy</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.06277</idno>
		<title level="m">Using lexical features for malicious url detection-a machine learning approach</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Less is more: Robust and novel features for malicious domain detection</title>
		<author>
			<persName><forename type="first">C</forename><surname>Hajaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hason</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dvir</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Electronics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page">969</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Abuadbba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Almashor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gaire</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Camtepe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nepal</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2204.00985</idno>
		<title level="m">Towards web phishing detection limitations and mitigation</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Optimized url feature selection based on genetic-algorithm-embedded deep learning for phishing website detection</title>
		<author>
			<persName><forename type="first">S.-J</forename><surname>Bu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-J</forename><surname>Kim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Electronics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page">1090</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Evolutionary optimization of neuro-symbolic integration for phishing url detection</title>
		<author>
			<persName><forename type="first">K.-W</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-J</forename><surname>Bu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-B</forename><surname>Cho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Hybrid Artificial Intelligence Systems</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="88" to="100" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Pham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sahoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C</forename><surname>Hoi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.03162</idno>
		<title level="m">Urlnet: Learning a url representation with deep learning for malicious url detection</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Texception: a character/word-level deep learning model for phishing url detection</title>
		<author>
			<persName><forename type="first">F</forename><surname>Tajaddodianfar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Stokes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gururajan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2857" to="2861" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Douze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jégou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1612.03651</idno>
		<title level="m">FastText.zip: Compressing text classification models</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Learning disentangled representation of web address via convolutionalrecurrent triplet network for classifying phishing urls</title>
		<author>
			<persName><forename type="first">S.-J</forename><surname>Bu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-J</forename><surname>Kim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2021 International Conference on Electronics, Information, and Communication (ICEIC)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Rudd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Abdallah</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2011.03040</idno>
		<title level="m">Training transformers for information security tasks: A case study on malicious url prediction</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Transformers for end-to-end infosec tasks: A feasibility study</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Rudd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Rahman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tully</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st Workshop on Robust Malware Analysis</title>
				<meeting>the 1st Workshop on Robust Malware Analysis</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="21" to="31" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Research on malicious url detection technology based on bert model</title>
		<author>
			<persName><forename type="first">W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2021 IEEE 9th International Conference on Information, Communication and Networks (ICICN)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="340" to="345" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Towards performance of nlp transformers on url-based phishing detection for mobile devices</title>
		<author>
			<persName><forename type="first">H</forename><surname>Shirazi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Haynes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ray</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Ubiquitous Systems and Pervasive Networks</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="35" to="42" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Phishing website detection based on multi-feature stacking</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2021 2nd International Conference on Artificial Intelligence and Computer Engineering (ICAICE)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="716" to="720" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Tcurl: Exploring hybrid transformer and convolutional neural network on phishing url detection</title>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge-Based Systems</title>
		<imprint>
			<biblScope unit="page">109955</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Detection of malicious urls through an ensemble of machine learning techniques</title>
		<author>
			<persName><forename type="first">S</forename><surname>Venugopal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">Y</forename><surname>Panale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kashyap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Ananthanagu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Combining long-term recurrent convolutional and graph convolutional networks to detect phishing sites using url and html</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ariyadasa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fernando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fernando</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="82355" to="82375" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Phishgnn: A phishing website detection framework using graph neural networks</title>
		<author>
			<persName><forename type="first">T</forename><surname>Bilot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Geis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hammi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SECRYPT 2022</title>
				<meeting><address><addrLine>Lisbon</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Kamran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sengupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tavakkoli</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.01852</idno>
		<title level="m">Semi-supervised conditional gan for simultaneous generation and detection of phishing urls: A game theoretic perspective</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Effective malicious url detection by using generative adversarial networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Web Engineering</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="341" to="356" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Deep character-level anomaly detection based on a convolutional autoencoder for zero-day phishing url detection</title>
		<author>
			<persName><forename type="first">S.-J</forename><surname>Bu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-B</forename><surname>Cho</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Electronics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page">1492</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Malicious url detection based on a parallel neural joint model</title>
		<author>
			<persName><forename type="first">J</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Pei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="9464" to="9472" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">Inferring phishing intention via webpage appearance and dynamics: A deep vision based approach</title>
		<author>
			<persName><forename type="first">R</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">H</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Divakaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Dong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">30th USENIX Security Symposium (USENIX Security 21)</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<monogr>
		<author>
			<persName><forename type="first">O</forename><surname>Lavie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shabtai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Katz</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2209.09033</idno>
		<title level="m">A transferable and automatic tuning of deep reinforcement learning for cost effective phishing detection</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Crafting text adversarial examples to attack the deep-learning-based malicious url detection</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Niu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Deng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICC 2022-IEEE International Conference on Communications</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3118" to="3123" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-W</forename><surname>Kim</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2209.01454</idno>
		<title level="m">Phishing url detection: A network-based approach robust to evasion</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b47">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">K</forename><surname>Pal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kashihara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Anantheswaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C</forename><surname>Kuznia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jagtap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Baral</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.10346</idno>
		<title level="m">Exploring the limits of transfer learning with unified model in the cybersecurity domain</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b48">
	<monogr>
		<title level="m" type="main">Generalization in nli: Ways (not) to go beyond simple heuristics</title>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Drozd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2110.01518</idno>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<analytic>
		<title level="a" type="main">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="5485" to="5551" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<analytic>
		<title level="a" type="main">Transformers: State-of-the-art natural language processing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chaumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Delangue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cistac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Louf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Funtowicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations</title>
				<meeting>the 2020 conference on empirical methods in natural language processing: system demonstrations</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="38" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<analytic>
		<title level="a" type="main">Adafactor: Adaptive learning rates with sublinear memory cost</title>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stern</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
		<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="4596" to="4604" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
