<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Models for Phishing Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Milita Songailaitė</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eglė Kankevičiūtė</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bohdan Zhyhun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Justina Mandravickaitė</string-name>
          <email>justina.mandravickaite@vdu.lt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Applied Research and Development (CARD)</institution>
          ,
          <addr-line>Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vytautas Magnus University</institution>
          ,
          <addr-line>Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>phishing detection</institution>
          ,
          <addr-line>transformers, BERT, DistilBERT, TinyBERT, RoBERTa, cybersecurity</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we report the application of BERT-based models for phishing detection in emails. We fine-tuned 3 BERT-based models (DistilBERT, TinyBERT and RoBERTa) for the task. All the fine-tuned models attained scores above 0.985 for each metric (accuracy, precision, recall and F1-score). Nevertheless, the RoBERTa model demonstrated the highest classification scores across all metrics, indicating that it can classify the selected phishing data with the utmost accuracy. The models from each BERT architecture were then assessed more deeply by using them in a pseudo-real-life situation. For this purpose, we created an entirely new dataset from actual phishing emails and used text augmentation techniques to increase their quantity. The DistilBERT and RoBERTa models produced very similar outcomes, i.e., most of the emails were classified correctly. However, as DistilBERT uses fewer resources and performs better than the RoBERTa model, it was regarded as the best model for detecting phishing emails in our case. The TinyBERT variant had the worst results, as its size was insufficient for learning to categorize emails and detect phishing.</p>
      </abstract>
      <kwd-group>
        <kwd>phishing detection</kwd>
        <kwd>transfer learning</kwd>
        <kwd>transformers</kwd>
        <kwd>BERT</kwd>
        <kwd>DistilBERT</kwd>
        <kwd>TinyBERT</kwd>
        <kwd>RoBERTa</kwd>
        <kwd>cybersecurity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>People are becoming more and more involved in the digital world, which contributes to the pervasive
issue of phishing, a sort of cyber-attack [20]. User data is frequently stolen using this method as the
attackers' primary strategy is to pose as reliable entities to collect sensitive or private information from
their victims [17]. Such an attack might take the form of emails, messages, phony website visits, etc. as
the victim is persuaded to open a malicious link, which may install malware, damage the system, or
reveal private data. A phishing attack can have severe consequences, such as identity theft, money loss,
or other negative outcomes [24].</p>
      <p>
        Phishing attacks are often initiated through emails that appear to be from appropriate sources, such
as banks, government authorities, or company management [17]. As these emails contain links that
take recipients to fraudulent webpages that imitate legitimate ones, the attacker acquires access to the
victim's accounts after (s)he submits their login credentials or other personal information, which
may lead to financial loss or identity theft. Phishing attempts can also lead to the theft of private
company information, damage a company's brand, and cause stakeholders and customers to lose faith
in it [19]. Moreover, phishing is frequently used to attack governmental systems as a part of significant
attacks, such as advanced persistent threat (APT) events [23]. Therefore, the accounts of government
employees can be hacked, allowing the attackers to bypass security barriers, spread malware, or gain
access to secured data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>There are several methods and tools that people have commonly used for phishing detection.
Software programs called email filters examine incoming emails and eliminate the ones that may be
phishing emails [18]. These filters use such techniques as phishing email address blacklists or
examination of the email's content. Anti-phishing toolbars are another technique for phishing detection
as they provide alerts or prevent access to phishing websites [19]. Another technique for phishing
detection is URL analysis to find potentially harmful or questionable information [20]. Such tools may
examine the domain name or URL path to match it to a known phishing website. However, user
awareness and education are the most effective strategies for phishing detection as well as prevention.
Users who are aware of the indications and dangers of phishing attempts, including suspicious email
sender addresses and requests for personal data [21], can take the necessary precautions. Another
strategy is the use of two-factor authentication (2FA) or multifactor authentication (MFA) [22]. This
strengthens the authentication process by adding a second level of security, such as a code sent
to the user's mobile phone in addition to a password [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Focusing on phishing email detection, a variety of methods have been used for development of
solutions for this task. In recent years deep learning approaches have become popular for phishing
detection. Deep learning has such benefits as automated feature extraction, reduced reliance on data
pre-processing, extraction of high-dimensional features, and increased accuracy, therefore its
application is increasing in various areas, including phishing detection [17]. Such architectures as
Convolutional Neural Network (CNN) [25-26], Recurrent Neural Network (RNN) [27], [33], Long
Short-Term Memory (LSTM) [28-29], Gated Recurrent Unit (GRU) [30], Multi-Layer Perceptron
(MLP) [31], etc. have been used for phishing detection. LSTM and BiLSTM are considered the most
widely applied deep learning approaches in phishing detection [17]. The transformer architecture has
also been utilised for phishing detection, e.g., for developing CatBERT [34], which is a modified BERT
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] model, capable of identifying social engineering emails.
      </p>
      <p>
        Phishing is a major threat that can seriously hurt both people and businesses. Detecting and
preventing phishing attacks is critical to protect sensitive information and prevent a variety of losses.
Email filters, anti-phishing toolbars, machine learning, URL analysis tools, and user education are a
few techniques and tools that have been utilized for phishing detection. But despite this, all tools and
methods need to be improved and supplemented, since an increasing number of new means of
influencing systems are being invented [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this paper we report the application of BERT-based
models for phishing detection in emails. The rest of the paper is structured as follows: Data briefly
introduces data we used for our experiments; Methods describes methods and base models we used in
our experimentation; Experimental Setup presents the set of parameters we used for fine-tuning the
selected BERT-based models; Results reports the results of our experiments and the assessment of the
fine-tuned models; the final section ends the paper with Conclusions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>The starting dataset used in the experiments consisted of 1086 phishing email messages. All
messages were anonymized using pseudonymization tools and methods. Each email message was
assigned a unique ID during the data preparation stage. Information about links and attachments in the
dataset is presented separately. In total, there are 1510 links and 190 attachments in the dataset.</p>
      <p>The email messages were divided into the following elements during the data preparation stage (a sketch of the resulting record structure is given after this list):
• Sent at: date and time of sending of the email
• Subject: email subject
• From email: sender's email address
• From name: sender's name
• Reply to: name for reply to the email
• Return path: real email address for reply to the email
• Category: thematic category of the email message
• Risk: degree of risk (evaluated by an expert)
• Risk source: source of risk level evaluation (evaluated by an expert)
• Link: number of links in the email message
• Attachments count: number of attachments in the email message
• Plaintext: email message text.</p>
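      <p>A minimal sketch of one parsed email record, expressed as a Python data structure, is given below; the field names and types are illustrative assumptions, as the paper does not specify the parser or storage format.</p>
      <preformat>
from dataclasses import dataclass
from typing import Optional

# Illustrative record layout for one email in the dataset; the names and
# types are assumptions based on the element list above.
@dataclass
class PhishingEmail:
    email_id: str               # unique ID assigned during data preparation
    sent_at: str                # date and time of sending
    subject: str                # email subject
    from_email: str             # sender's email address
    from_name: str              # sender's name
    reply_to: Optional[str]     # name for reply to the email
    return_path: Optional[str]  # real email address for reply
    category: str               # thematic category of the message
    risk: str                   # degree of risk (evaluated by an expert)
    risk_source: str            # source of the risk level evaluation
    link_count: int             # number of links in the message
    attachments_count: int      # number of attachments in the message
    plaintext: str              # email message text used for classification
      </preformat>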
      <p>
        To understand phishing emails better, we explored their distribution. In the pie charts below, all
emails are grouped and analysed according to different classification schemes: a general classification,
a technical classification based on the data theft techniques used in the emails, and the target of the
attack. We based our general classification of phishing emails on [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and the technical and attack-target classifications on [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Therefore, Figure 1 presents the constitution of the starting dataset by the
general classification. Two-thirds (66.7%) of all the data were attributed to the category of domain or
brand impersonation (originally distinguished as separate categories, i.e., domain impersonation and
brand impersonation, but merged for simplicity under the label of impersonation). The undefined
category (7.73% of emails) consists of emails which could not be classified as a specific type of data
theft. A small proportion of the entire dataset was classified as belonging to the categories of extortion,
whaling (targeted phishing attack, aimed at senior executives [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]), business email compromise, and
ransomware. Since the content of emails belonging to the spear phishing (personalized form of email
phishing [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]), lateral phishing (a hijacked corporate account is used to send phishing emails to other
users [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]), and account takeover categories is particularly sensitive, and these categories generally
encompass a data theft process rather than individual emails, these data theft types were not included
in the final experimental dataset.
      </p>
      <p>According to the techniques used in the emails for data theft, all messages were classified into the
following categories:
1. Ransomware - malicious software that demands payment in exchange for returning control of
a victim's data.
2. Trojan horse and content injection - the use of malware that appears to be legitimate software
but is designed to disrupt, damage, or gain unauthorized access to a computer system.
3. Keylogger and screen logger - software designed to track and record the keys struck on a
keyboard or the images displayed on a screen.
4. Man-in-the-middle attacks - instances where cybercriminals intercept target email accounts,
gain access to them, and monitor or manipulate information exchanged for malicious purposes.
5. Social engineering - a form of manipulation that involves deceiving users into divulging
confidential information or downloading malicious software.
6. Scams - fraudulent schemes aimed at tricking individuals into providing personal information
or making financial transactions.
7. Undefined category - emails that could not be attributed to a specific data theft technique.</p>
      <p>The structure of the dataset according to the data theft techniques used in the emails is presented in
Figure 2. As can be seen from the results presented in the diagram, most emails (66.15%) were
identified as using the social engineering technique. Emails using the keylogger and screen
logger techniques could not be obtained. Therefore, emails of these categories were not included in the dataset we
used for the experiments.</p>
      <p>According to the attack target, all emails were divided into the categories presented in Figure 3.
As the data presented in the diagram shows, more than half (55.57%) of the analyzed emails were
identified as payment requests. Shared documents were identified as a target in 13.98% of the emails,
while links were identified as a target in 8.37% of the emails. Personal data was identified as a target
in 6.35% of the emails, delivery alerts were identified as a target in 6.07% of the emails, and
password expiration was identified as a target in 4.14% of the emails. The targets for the scam,
undefined and code categories were only identified in a small fraction of the analyzed emails.</p>
      <p>Finally, after analysis and filtering, the starting dataset for fine-tuning pretrained models was
complemented with 5323 phishing email messages and 6403 neutral or “ham” email messages from
publicly available sources (https://github.com/TanusreeSharma/phishingdata-Analysis and
https://github.com/KostasKoutrou/Text_Phishing_Email_ML_Classification).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>
        The phishing detection task was performed using the transfer learning methodology [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. It involves
using a pre-trained language model as a starting point for training a new model on a specific task. This
approach is particularly effective when working with small datasets or when training a model for a very
specific task [
        <xref ref-type="bibr" rid="ref15">15</xref>
         ]. In our case, we chose three popular pretrained deep learning models for the English
language: DistilBERT, TinyBERT and RoBERTa transformer models, which we fine-tuned for the task of
phishing detection in email messages.
      </p>
      <sec id="sec-3-1">
        <title>DistilBERT model</title>
        <p>
          Transformer BERT (Bidirectional Encoder Representations from Transformers) is a deep learning
model based on the attention mechanism [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which is usually applied to solve various language
technology problems [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. This model works on the principles of transfer learning [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. A neural network
is trained to generate word embeddings, which are then used as input features for models that solve
mainstream language technology tasks. One of the most significant advantages of the BERT
architecture models over other neural network models is understanding the context between words in
the text. The model learns the context using the attention mechanism characteristic of transformer
models, which consists of encoding and decoding mechanisms [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          DistilBERT is a variant of the BERT model that has been optimized for smaller size and faster
performance. It achieves this by employing a process called knowledge distillation, where a smaller
model learns from the predictions and representations of a larger pre-trained model [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This is a
common method for developing low-resource Large Language Models. The process involves
pretraining a large BERT model, fine-tuning it on a task, selecting a sub-network, training a small model,
and applying knowledge distillation to allow the small model to learn from the large model's predictions
and representations. This produces a smaller model that performs similarly to the larger one, making it
suitable for resource-constrained applications or those requiring faster inference times.
        </p>
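        <p>The following is a minimal sketch of the distillation loss described above, assuming the usual temperature-softened soft-target term combined with a hard-label cross-entropy term, as in [7]; it is an illustration under those assumptions, not the authors' training code.</p>
        <preformat>
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: the small (student) model learns to match the large
    # (teacher) model's temperature-softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
        </preformat>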
      </sec>
      <sec id="sec-3-2">
        <title>TinyBERT model</title>
        <p>
          Similarly to DistilBERT, TinyBERT also uses knowledge compression methodology to achieve
faster model performance [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. However, there are several key differences between the two models:
1. DistilBERT is already a smaller version of BERT, but TinyBERT is even smaller, with a size of
only a few hundred megabytes, making it ideal for low resource development.
2. DistilBERT uses a technique called knowledge distillation to transfer the knowledge learned
from a larger pre-trained model like BERT to a smaller model. TinyBERT, on the other hand,
uses a similar approach called "teacher-student" learning, where the smaller model is trained to
mimic the behavior of a larger model by matching the outputs of the two models on the same
inputs.
3. DistilBERT is trained on a combination of unlabeled and labeled data, while TinyBERT is
trained only on labeled data, making it more efficient for specific tasks.
        </p>
        <p>Overall, both models require fewer resources and produce predictions comparable to those of the
BERT model. As a result, it was decided to test both for the phishing detection task.</p>
      </sec>
      <sec id="sec-3-3">
        <title>RoBERTa model</title>
        <p>
          RoBERTa (Robustly Optimized BERT approach) extends the BERT language masking approach,
in which the system learns to predict the masked text portions within unlabeled language samples.
RoBERTa modifies critical hyperparameters in BERT, such as removing BERT's next-sentence
prediction objective, and it was trained with much bigger mini-batches and learning rates [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. This
enables RoBERTa to outperform BERT on the masked learning goal, resulting in superior downstream
task performance. Furthermore, RoBERTa was trained on a larger and more diverse corpus of data,
enabling the model to comprehend complex information that may span a longer time period. This is
particularly significant in the context of phishing detection, where the content of messages may change
over time. Finally, unlike BERT, which always masks out the same tokens during pre-training,
RoBERTa uses dynamic masking. This means that the model is trained to predict masked tokens based
on the surrounding context, making it better at handling out-of-vocabulary words.
        </p>
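        <p>A minimal sketch of dynamic masking with the HuggingFace data collator is given below: a fresh random set of tokens is masked every time a batch is drawn, instead of being fixed once during preprocessing. This is an illustration of the idea, not the original RoBERTa pretraining code.</p>
        <preformat>
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# mlm=True with a masking probability of 0.15: each call to the collator
# re-samples which tokens are masked, i.e. dynamic masking.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer("Dear customer, your account has been suspended.")
batch = collator([encoded])  # different masked positions on every call
        </preformat>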
      </sec>
      <sec id="sec-3-4">
        <title>Fine-tuning the models</title>
        <p>The second step of a transfer learning methodology is fine-tuning the pretrained BERT language
models. At this stage, the already pre-trained model is learning how to classify the given data based on
the training data [16]. The process begins by initializing the BERT model with pre-trained weights on
a large corpus of text data. Then, a new classification layer is added on top of the pre-trained model,
which is trained on the specific task using labeled data. During training, the weights of the pre-trained
model are updated along with the weights of the classification layer. Once training is complete, the
fine-tuned model can be used to predict the classification of new text inputs. Fine-tuning BERT models
has proven to be extremely successful in achieving exceptional results on a wide range of natural
language processing tasks.</p>
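        <p>A minimal sketch of this setup with the HuggingFace transformers API follows; the checkpoint name, example text, and label order are illustrative assumptions, not the authors' exact code.</p>
        <preformat>
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# num_labels=2 places a new classification layer on top of the pre-trained
# encoder; during fine-tuning all weights are updated, not just the new layer.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

inputs = tokenizer(
    "Your mailbox is full, click here to verify your password.",
    return_tensors="pt", truncation=True,
)
predicted = model(**inputs).logits.argmax(dim=-1)  # 0 = ham, 1 = phishing (assumed)
        </preformat>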
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>The phishing detection model was built with the transfer learning approach outlined in the Methods
section and trained on the data described in the Data section. The three base models (DistilBERT3,
TinyBERT4 and RoBERTa5) were already pre-trained on a large English language corpus by the
creators of these models. Then we fine-tuned these models to classify phishing email data.</p>
      <p>The fine-tuning was done three times for each of the base large language models, using a different
random weight initialization seed each time. Overall, nine models were fine-tuned to classify phishing emails into
two classes – phishing or not phishing. The models were fine-tuned for 30 epochs using a variable
learning rate that began at 0.001. Each architecture had a distinct training batch size: TinyBERT had
64, DistilBERT had 36, and RoBERTa had 24. That is, the smaller the model, the bigger the batch size
we could choose. While training, each of the models was evaluated by looking at the loss function
scores. In addition, after each epoch the evaluation step was done, where the model’s ability to classify
phishing emails was evaluated by four selected classification metrics.</p>
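      <p>A minimal sketch of this configuration, expressed with HuggingFace TrainingArguments, is shown below; only the epoch count, initial learning rate, per-architecture batch sizes, and per-epoch evaluation come from the text, while the remaining arguments are assumptions.</p>
      <preformat>
from transformers import TrainingArguments

# Batch sizes per architecture, as reported above.
batch_sizes = {"tinybert": 64, "distilbert": 36, "roberta": 24}

args = TrainingArguments(
    output_dir="phishing-classifier",
    num_train_epochs=30,
    learning_rate=1e-3,                 # variable learning rate starting at 0.001
    per_device_train_batch_size=batch_sizes["distilbert"],
    evaluation_strategy="epoch",        # classification metrics after each epoch
    seed=42,                            # repeated with three different random seeds
)
      </preformat>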
      <p>The comparison of the fine-tuned models’ training parameters is given in Figure 4. The RoBERTa
models took the longest to train, since this architecture is the most complex and was pre-trained
on more data than the two distilled models (DistilBERT and TinyBERT). However,
most of the RoBERTa models also had the lowest training loss, which later resulted in higher
classification performance. The two distilled models both had a lower train runtime; the
fastest to fine-tune was the TinyBERT model. However, both models had higher training loss than the
larger RoBERTa model.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The experiments on the classification of phishing emails were performed using three different BERT
model architectures: DistilBERT, TinyBERT and RoBERTa. The models were trained with parameters
described in the section Experimental Setup. The initial model results are shown in Figure 5. The
models’ abilities to classify the emails into two classes (phishing or not phishing) were evaluated with
four classification metrics: accuracy, precision, recall and F1-score.</p>
      <p>Overall, it can be observed that all of the models exhibit strong performance in classifying phishing
emails. All of the fine-tuned models attained scores above 0.985 for each metric. Nevertheless, the
RoBERTa model demonstrated the highest classification scores across all metrics, indicating that it can
classify the selected phishing data with the utmost accuracy. While DistilBERT and TinyBERT models
may not have performed as well as RoBERTa, they do offer the advantage of requiring significantly
fewer computing resources and less time to train. This makes them ideal for low resource applications.</p>
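      <p>A minimal sketch of the four metrics, computed here with scikit-learn (the paper does not state which library was used), is shown below; phishing is assumed to be the positive class.</p>
      <preformat>
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def classification_metrics(y_true, y_pred):
    # Binary setting: phishing is treated as the positive class.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary"
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
      </preformat>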
      <p>The next stage in model assessment was to see how well the models classified real-world phishing
email data. We created an entirely new dataset from the actual phishing emails we gathered for this
purpose. Several text augmentation techniques were used to increase the quantity of gathered emails
(a sketch of both steps is given after this list):
1. The introductions and endings of the emails were rewritten in several ways so that the idea would
stay the same. These parts were then exchanged, thus generating more variations of the same
email.
2. A database of fictitious personal information (email addresses, phone numbers, personal identity
numbers, and so on) was developed. The same variables were detected in each of the real emails.
The emails were augmented with variables from the personal information database, resulting in
more phishing emails of the same type.</p>
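      <p>A minimal sketch of both augmentation steps follows; the rewritten introductions and endings, the placeholder names, and the fictitious personal data are illustrative assumptions, not the actual database contents.</p>
      <preformat>
import itertools
import random

intros = ["Dear customer,", "Hello,", "Dear account holder,"]
endings = ["Regards, Support Team", "Best wishes, IT Department"]
fake_pii = {
    "{EMAIL}": ["jon.doe@example.com", "a.smith@example.org"],
    "{PHONE}": ["+370 600 00000", "+370 611 11111"],
}

def augment(body_with_placeholders):
    # Step 1: exchange rewritten introductions and endings of the same email.
    for intro, ending in itertools.product(intros, endings):
        text = intro + "\n" + body_with_placeholders + "\n" + ending
        # Step 2: substitute fictitious personal information for each variable.
        for placeholder, values in fake_pii.items():
            text = text.replace(placeholder, random.choice(values))
        yield text
      </preformat>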
      <p>After the augmentation, there were 8994 phishing emails in the augmented testing database. These
emails were then used to test the best models from each BERT architecture. The results are presented
in Table 1.</p>
      <p>DistilBERT and RoBERTa models produced very similar outcomes. Almost all the emails were
accurately classified by these models. However, because DistilBERT uses fewer resources and
performs better than the RoBERTa model, it is regarded as the best model for detecting phishing emails
in our case. The TinyBERT variant had the worst results. Although this BERT design is an improvement
over DistilBERT, the model is also significantly smaller. As a result, the TinyBERT size was
insufficient to learn how to categorize different emails and detect phishing.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In this paper we reported the application of BERT-based models for phishing detection in emails.
We fine-tuned 3 BERT-based models (DistilBERT, TinyBERT and RoBERTa) for the task. The
fine-tuning was done three times for each of the base large language models, using a different random weight
initialization seed each time. Overall, nine models were fine-tuned to classify phishing emails into two classes –
phishing or not phishing. The models were fine-tuned for 30 epochs using a variable learning rate that
began at 0.001. Each architecture had a distinct training batch size: TinyBERT had 64, DistilBERT had
36, and RoBERTa had 24. All the fine-tuned models attained scores above 0.985 for each metric
(accuracy, precision, recall and F1-score). Nevertheless, the RoBERTa model demonstrated the highest
classification scores across all metrics, indicating that it can classify the selected phishing data with the
utmost accuracy. The models from each BERT architecture were then assessed more deeply by
using them in a pseudo-real-life situation. For this purpose, we created an entirely new dataset from
actual phishing emails and used text augmentation techniques (the introductions and endings of the
emails were rewritten in several ways; a database of fictitious personal information (email addresses,
phone numbers, personal identity numbers, and so on) was developed) to increase their quantity. After
the augmentation, there were 8994 phishing emails in the augmented testing database. DistilBERT and
RoBERTa models produced very similar outcomes, i.e., most of the emails were classified correctly
(8590/8994 by DistilBERT and 8552/8994 by RoBERTa). However, as DistilBERT uses fewer
resources and performs better than the RoBERTa model, it was regarded as the best model for
detecting phishing emails in our case. The TinyBERT variant had the worst results, as its size was
insufficient for learning to categorize emails and detect phishing.</p>
      <p>Our future plans include experimentation with a more diverse variety of models and datasets. We
also plan to explore the application of BERT-based models for the detection of phishing emails written
in non-English languages.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pagán</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Elleithy</surname>
          </string-name>
          ,
          <article-title>A Multi-Layered Defense Approach to Safeguard Against Ransomware</article-title>
          ,
          <source>In 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC)</source>
          , pp.
          <fpage>0942</fpage>
          -
          <lpage>0947</lpage>
          . IEEE,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z. A.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Andersen</surname>
          </string-name>
          ,
          <article-title>What.Hack: engaging anti-phishing training through a role-playing phishing simulation game</article-title>
          ,
          <source>In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Binks</surname>
          </string-name>
          ,
          <article-title>The art of phishing: past, present and future</article-title>
          ,
          <source>Computer Fraud &amp; Security</source>
          , no.
          <issue>4</issue>
          (
          <year>2019</year>
          ):
          <fpage>9</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ying</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Ruibin</surname>
          </string-name>
          ,
          <article-title>Review of attention mechanism in natural language processing</article-title>
          ,
          <source>Data Analysis and Knowledge Discovery</source>
          <volume>4</volume>
          , no.
          <issue>5</issue>
          (
          <year>2020</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , (
          <year>2018</year>
          ). URL: https://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swayamdipta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>Transfer learning in natural language processing</article-title>
          ,
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter</article-title>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.48550/arXiv.1910.01108
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>TinyBERT: Distilling BERT for Natural Language Understanding</article-title>
          , arXiv, October 15,
          <year>2020</year>
          . URL: http://arxiv.org/abs/1909.10351
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          ,
          <year>2019</year>
          . URL: http://arxiv.org/abs/1907.11692
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Barracuda</surname>
          </string-name>
          ,
          <article-title>13 email threat types to know about right now</article-title>
          ,
          <year>2020</year>
          . URL: https://assets.barracuda.com/assets/docs/dms/Barracuda-eBook_13-emailthreats_may2020.pdf
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Aleroud</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Phishing environments, techniques, and countermeasures: A survey</article-title>
          ,
          <source>Computers &amp; Security</source>
          <volume>68</volume>
          (
          <year>2017</year>
          ):
          <fpage>160</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shetty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Nath</surname>
          </string-name>
          ,
          <article-title>A review on phishing attacks</article-title>
          ,
          <source>International Journal of Applied Engineering Research</source>
          <volume>14</volume>
          , no.
          <issue>9</issue>
          (
          <year>2019</year>
          ):
          <fpage>2171</fpage>
          -
          <lpage>2175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cidon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gavish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schweighauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Paxson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Voelker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <article-title>Detecting and characterizing lateral phishing at scale</article-title>
          ,
          <source>In 28th USENIX Security Symposium (USENIX Security 19)</source>
          (
          <year>2019</year>
          ):
          <fpage>1273</fpage>
          -
          <lpage>1290</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on transfer learning</article-title>
          ,
          <source>Proceedings of the IEEE 109, no. 1</source>
          (
          <year>2020</year>
          ):
          <fpage>43</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Bashar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Nayak</surname>
          </string-name>
          ,
          <article-title>Active learning for effectively fine-tuning transfer learning to downstream task</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology (TIST) 12</source>
          , no.
          <issue>2</issue>
          (
          <year>2021</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>