<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Information &amp; Knowledge Management (JIKM)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CDS49703.2020.00009</article-id>
      <title-group>
        <article-title>Based on Artificial Neural Networks Technology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vitaliy Tsyganok</string-name>
          <email>vitaliy.tsyganok@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaroslav Khrolenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Domanetska</string-name>
          <email>domanetska@knu.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olena Fedusenko</string-name>
          <email>fedusenko@knu.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Information Recording of NAS of Ukraine</institution>
          ,
          <addr-line>2 Shpaka Street, Kyiv, 03113</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>24 Bogdana Gavrylishina Str., 04116, Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>769</volume>
      <issue>2</issue>
      <fpage>1</fpage>
      <lpage>2</lpage>
      <abstract>
        <p>The widespread use of Internet technologies, for all its benefits to the development of society, has led to the emergence and rapid growth of criminal activity carried out with high technologies. Phishing is currently one of the most common types of Internet crime. Detecting phishing is an urgent task because phishing attacks cause large losses through the malicious use of personal data, confidential information, and commercial or state secrets. The paper examines modern threats and methods of countering phishing attacks, analyzes available methods and means of protection, and proposes a method of protecting against phishing attacks using neural networks. The novelty of the proposed approach to detecting phishing links lies in the use of hybrid neural network architectures, namely a combination of convolutional and recurrent neural networks.</p>
        <p>Keywords: phishing, artificial neural network, convolutional neural network, recurrent neural network, conditional random fields, long short-term memory network, hybrid neural network.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The accelerated growth of information technologies around the world and in Ukraine,
especially over the last decade, is inevitably accompanied by the dynamic development of
crime in this sphere. Along with global computerization and the development of digital technologies,
which have greatly simplified human life, the concept of cybercrime has entered our lives. Cybercrimes are
the most dynamic group of socially dangerous acts: every year they become more widespread and
more dangerous. Today, almost all experts in the field of information technology
acknowledge that the situation with cybercrime in the world is getting worse.</p>
      <p>Phishing remains the most massive threat to Ukrainian Internet users, and its scale is growing.
Notably, phishing sites account for 88% of blocked resources, while the remaining 12% are fraudulent
online stores, fraudulent money-making schemes, "investment" and service fraud that extort money
from citizens, and sites with malicious software. Nowadays, this problem is becoming even more
urgent in Ukraine. During 2022, the number of Russian phishing attempts against Ukraine increased
by 250%. The main target of Russian hackers was more than 150 government institutions, with the
Ministry of Defense of Ukraine being the primary target. Therefore, the creation of effective software
tools for detecting phishing links is an urgent problem.</p>
      <p>2023 Copyright for this paper by its authors.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The aim of the study</title>
      <p>The study aims to analyze the distinctive features of protection against phishing attacks,
to review comprehensively the existing approaches to identifying phishing web links with
computational intelligence tools, and to develop a software application built on an
Internet link classifier that uses a hybrid artificial neural network.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and research methods</title>
      <p>The research seeks an effective combination of two different neural network
architectures for classifying Internet links in order to identify phishing links. The study
analyzed mono-architectures of convolutional networks (CNN) and recurrent networks
(SimpleRNN), hybrid models that combine convolutional and recurrent networks, deep
feedforward networks (DNN), and recurrent neural networks based on LSTM elements. The
dataset used to evaluate the performance of these hybrid architectures for phishing link identification
was obtained from the UCI Machine Learning Repository and comprised approximately 2500 labeled
examples. This dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY
4.0) license.</p>
      <p>The project was implemented using the Python programming language and the Keras framework.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Literature review</title>
      <p>According to the Law of Ukraine "On the Fundamental Principles of Cybersecurity of Ukraine,"
cybersecurity is defined as the protection of the vital interests of individuals, citizens, society, and the
state during the use of cyberspace. It ensures the sustainable development of the information society
and digital communication environment, and the timely detection, prevention, and neutralization of
real and potential threats to Ukraine's national security in cyberspace.</p>
      <p>A cyber threat encompasses a combination of factors and conditions that pose risks to information
security. Malicious actors may target IT infrastructure, workstations, mobile devices, other technical
tools, and ultimately, individuals as elements in cyberspace. Phishing is an automated form of social
engineering used by malefactors to exploit the Internet to deceptively acquire confidential information
from companies and individuals, often impersonating legitimate websites [1].</p>
      <p>Malicious actors send harmful email attachments or URLs to users, seeking access to their
accounts or computers [2]. Cybercriminals have become very sophisticated, and many of their emails
escape spam detection. Users receive emails that purportedly require them to change passwords or
update payment information and, by complying, unintentionally grant criminals access to confidential data.</p>
      <p>The high potential for gains, such as accessing bank accounts and credit card numbers, the ease of
disseminating forged emails posing as legitimate authorities, and the challenges faced by law
enforcement in apprehending such criminals, have led to a surge in phishing attacks in recent years.</p>
      <p>The "State of the Phish" report [3] for the year 2019 revealed that nearly 90% of organizations
experienced targeted phishing attacks during that year. 84% reported phishing through SMS/text, 83%
encountered voice phishing, and the email phishing volume grew by 67% in a year. These data
indicate a rising trend of people avoiding internet commerce due to identity fraud concerns, despite
companies taking on the risk of fraud. According to Microsoft's annual report, the number of
cyberattacks increased by 3.5 times in 2022 compared to 2021. Financial institutions, social media
platforms, payment systems, and e-commerce are the most attractive targets for phishing (Figure 1).</p>
      <p>Researchers discuss and propose a variety of solutions to phishing, yet no
solution can be trusted to fully mitigate these attacks. The anti-phishing
measures proposed in the literature can be categorized into three main defense strategies.</p>
      <p>The first line of defense assumes human-factor solutions that educate end-users to recognize
phishing attempts and avoid falling victim to them.</p>
      <p>The second line of defense comprises technical solutions that stop attacks at an early stage,
for example at the vulnerability level, so that threats do not materialize on user devices. These solutions
reduce reliance on the human factor and detect attacks. This also involves employing specific methods to detect the source of
attacks (e.g., identifying newly registered domains closely resembling well-known domain names).</p>
      <p>The third line of defense assumes the involvement of law enforcement agencies as a restraining
control. These approaches can be combined to create significantly stronger anti-phishing solutions.</p>
      <p>User education is an effective countermeasure against phishing attacks. Awareness
and education form the first line of defense in anti-phishing methodology, even though they do not
eliminate the threat completely. End-user training reduces users' susceptibility to phishing attacks
and complements technical solutions. According to the analysis in [5], 95% of phishing
attacks are caused by the human factor. Various technical solutions exist for eliminating phishing
threats. The proposed technical solutions for detecting and stopping phishing attacks follow
two main approaches: non-content-based solutions and content-based solutions. Non-content-based
methods include blacklists and whitelists, which classify fake emails or web pages using
information that is not part of the email or web page content, such as URLs and domain name features.
The disadvantage of this approach is that it cannot identify all phishing websites: once a
phishing site is taken down, the phisher can easily register a new domain [6, 7].</p>
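      <p>As a minimal sketch of the list-based approach, the Python fragment below checks the host name of a link against a blacklist and a whitelist. The list contents and the function name check_lists are hypothetical placeholders, not part of the described system. The last branch illustrates the limitation noted above: a freshly registered domain appears in neither list.</p>
      <preformat>
```python
from urllib.parse import urlparse

# Hypothetical example lists; a real deployment would load curated feeds.
BLACKLIST = {"phish.example.net"}
WHITELIST = {"bank.example.com"}


def check_lists(url: str) -> str:
    """Classify a link by host name only, without inspecting page content."""
    host = (urlparse(url).hostname or "").lower()
    if host in BLACKLIST:
        return "phishing"
    if host in WHITELIST:
        return "legitimate"
    # A newly registered phishing domain ends up here: in neither list.
    return "unknown"


print(check_lists("https://phish.example.net/login"))
print(check_lists("https://bank.example.com/"))
print(check_lists("https://new-phish.example.org/"))
```
</preformat>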
      <p>Content-based methods classify a page or email according to the information in its content, for
instance, text, images, HTML, JavaScript, or cascading style sheets (CSS). Content-based solutions
incorporate machine learning, heuristics, visual similarity, and image processing techniques [8].</p>
      <p>Finally, multi-aspect methods use a combination of the previous approaches to detect and avert
phishing attacks. Lately, there has been a tendency of applying ML technologies in the
implementation of anti-phishing solutions for the early detection of phishing threats and minimizing
the risks of danger. Currently, security strategies based on neural network technologies are becoming
more and more widespread [9-11]. The article [12] reviewed 16 classification systems based on the
semantic characteristics of URLs. Ten characteristics that distinguish safe websites from phishing
websites were also collected and analyzed using semantic features. According to the results of the
comparison, GradientBoostingClassifier and RandomForestClassifier showed the highest accuracy.
The researchers noted that one possible limitation is the task of feature selection. The study [13] treats
malicious URL detection as a binary classification problem and examines the performance of known
classifiers (Naive Bayes, Support Vector Machines, Multilayer Perceptron, Decision Trees, Random
Forest, and k-Nearest Neighbors).</p>
      <p>The authors of [14] focused on semantic feature extraction methods using word2vec to improve
the description of the features of phishing sites, and then combined these features with other statistical
features to create a more robust phishing detection model. Experimental results with actual datasets
have shown that feature combinations improve phishing detection performance.</p>
      <p>The authors of [15] compare the random forest method and recurrent neural networks on the
task of URL classification. Neural networks proved more efficient, so the authors
concluded that they are preferable.</p>
      <p>Unsupervised learning deserves separate mention: despite its low accuracy, this
approach is now one of the promising directions of scientific research. Such models do not require
labeled data and therefore have great potential for application, including in the field of cybersecurity.
The accuracy of such a classifier is clearly not sufficient for independent use in real conditions.
However, given that the data are unlabeled and no supervised training is required, this is a
good result. Such classifiers may be useful as an additional mechanism within a
heuristic approach, to flag suspicious links that deserve attention and possibly
deeper investigation. Clustering can also help label objects, which,
after being checked by an expert, can later be used to train a classifier (supervised
learning) or in deterministic methods, that is, the search for an exact match with a known
phishing link. Criminals constantly change their attack tactics to exploit system vulnerabilities
and user ignorance. Choosing an inappropriate countermeasure algorithm can lead to unpredictable
results and wasted effort, which ultimately affects the accuracy and effectiveness of a deep
learning (DL) cybersecurity model [16].</p>
      <p>This is another argument in favor of deep learning algorithms: they make it possible
to automate the detection of signs of phishing links and provide
flexibility and adaptability to change, albeit at the cost of additional retraining of the algorithm.</p>
      <p>Nowadays, there are many different DL algorithms that numerous researchers have implemented
to detect phishing websites. There are various DL-based approaches designed to solve a specific
problem or meet certain system requirements; each has its advantages and disadvantages [17, 18].
Although deep learning techniques, and artificial neural networks in particular, take a long time to train, they
often provide greater accuracy and automatically extract features from raw data without any prior
knowledge. However, selecting the approach best suited to a particular application or
dataset is a challenging task. Different performance measures were used to evaluate the effectiveness of
DL-based phishing detection models. The experimental results
indicate that, among the four DL methods (DNN, CNN, LSTM, GRU), no algorithm gave the best
values for all performance indicators. The algorithm should therefore be chosen to suit the
specific application and its requirements.</p>
      <p>In this research, the possibility of combining different neural network architectures into a hybrid or
ensemble model was investigated to achieve the advantages and eliminate the disadvantages of
monoarchitectural artificial neural networks. Based on the conducted research, we can conclude that a
promising direction in the task of increasing the effectiveness of phishing detection is the use of
hybrid models, in particular, the models that combine layers of different nature [4].</p>
      <p>Promising combinations for research include the following:
1) CNN + DNN
2) DNN + LSTM
3) CNN + RNN</p>
      <p>The dataset from [19] was used for the experiments. The augmented dataset consists of 10,000
instances obtained from 5,000 phishing and 5,000 legitimate websites.</p>
      <p>General description of attributes:
• havingIPAddress – checking for the availability of an IP address in the link;
• URLLength – checking the number of characters in the link;
• ShorteningService – checking if the link is displayed in a shortened format;
• havingAtSymbol – checking if the link has the sign "@";
• doubleslashredirecting – checking if the link contains the sign "//";
• PrefixSuffix – checking if there is no prefix or suffix attached to the link;
• havingSubDomain – checking if no subdomain is attached to the link;
• SSLfinalState – verification of the SSL certificate;
• Domainregistrationlength – domain lifecycle check;
• Favicon – check if its personal Favicon is attached to the link;
• Port – checking if no ports are attached to the link and their status;
• HTTPStoken – HTTPS certificate verification;
• RequestURL – checking if no automatically downloaded data is attached to the link;
• URLofAnchor – checking if "anchor" links are not attached;
• Linksintags – checking if no SQL injections are attached to the link;
• SFH (server form handler) – checking if no SFH injections are attached to the link;
• Submittingtoemail – check for attachment to mail;
• AbnormalURL – checking for a fake domain;
• Redirectpage – checking for redirection to another page;
• onMouseOver – checking for a hidden link;
• RightClick – checking if the link is displayed as an "&lt;a&gt;" element;
• Using pop-upwidnow – checking for pop-up windows;
• Iframe – checking for the presence of an Iframe element;
• Ageofdomain – checking for the length of the life cycle;
• DNSRecord – checking for redirection through additional DNS servers;
• Webtraffic – traffic volume checking;
• PageRank – checking the link rating in the Black/White lists;
• GoogleIndex – checking the link ranking in Google;
• Linkspointingtopage – checking for the presence of "magnetic" links;
• Statisticalreport – obtaining confirmation of security from open databases;</p>
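      <p>Several of the attributes above can be derived from the link text alone. The following Python sketch computes six of them with the standard library. It is an illustration under stated assumptions, not the procedure used to build the dataset: the function names, the short list of shortening hosts, the reading of the double-slash check as "//" after the scheme, and the reading of PrefixSuffix as a hyphen in the host name are all assumptions. Each value is a raw indicator (True means the suspicious sign is present), and the dataset's own encoding may differ.</p>
      <preformat>
```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical, deliberately short list of link-shortening hosts.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}


def is_ip(host: str) -> bool:
    """Return True if the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(url: str) -> dict:
    """Compute a few of the listed attributes from the link text alone."""
    host = (urlparse(url).hostname or "").lower()
    return {
        "havingIPAddress": is_ip(host),
        "URLLength": len(url),
        "ShorteningService": host in SHORTENERS,
        "havingAtSymbol": "@" in url,
        # Assumption: "//" anywhere after the scheme separator counts.
        "doubleslashredirecting": "//" in url.split("://", 1)[-1],
        # Assumption: a hyphen in the host marks a prefix or suffix.
        "PrefixSuffix": "-" in host,
    }


print(url_features("http://192.168.0.1/a//b@c"))
```
</preformat>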
      <p>Python was selected as the main software platform for the development of the phishing link
recognition system due to its effective collection of scientific tools.</p>
      <p>Libraries used in the development of the system included:
• numpy - an extension of the Python language that adds support for large multidimensional
arrays and matrices;
• Keras - a high-level neural network API for building deep networks on top of
lower-level software tools;
• matplotlib - an extensive library for creating 2D visualizations;
• pandas - a Python library used for data manipulation and analysis.</p>
      <p>To compare the efficiency of the models, the following experimental scheme was
proposed:
• Stage 1 - comparative analysis of networks built on a single (mono) basic model, namely
CNN and RNN.
• Stage 2 - comparative analysis of the winner network of the 1st stage with the CNN-RNN
hybrid network.
• Stage 3 - comparative analysis of the winner network of the 2nd stage with the DNN-LSTM
hybrid network.</p>
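      <p>The three-stage scheme amounts to a small elimination procedure, which can be sketched in Python as follows. The function name run_stages is hypothetical, and the scores in the example are arbitrary placeholders chosen only to exercise the code; they are not the measured results.</p>
      <preformat>
```python
def run_stages(accuracy: dict) -> str:
    """Return the model that survives the three comparison stages."""
    # Stage 1: the two mono-architectures against each other.
    winner = max(["CNN", "RNN"], key=accuracy.get)
    # Stages 2 and 3: the current winner against each hybrid in turn.
    for challenger in ["CNN-RNN", "DNN-LSTM"]:
        winner = max([winner, challenger], key=accuracy.get)
    return winner


# Placeholder scores for illustration only, not the measured results.
placeholder = {"CNN": 0.2, "RNN": 0.1, "CNN-RNN": 0.4, "DNN-LSTM": 0.3}
print(run_stages(placeholder))
```
</preformat>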
      <p>During the experiment, all the models were trained on the same input data for the
same number of epochs (30). At the first stage of the experiment, the efficiency of the convolutional
and recurrent networks was analyzed. The network architectures that participated in the experiment are
presented in Figure 2 and Figure 3, respectively.</p>
      <p>A detailed description of the convolutional network architecture using Keras framework tools:
• convolutional layer (Conv1D), size (batch_size) 200, filters 200;
• pooling layer (MaxPooling1D) of size 2;
• dropout layer (Dropout) with a rate of 20%;
• flattening layer (Flatten);
• dense layer (Dense) of size 2 with the ReLU activation function;
• dense layer (Dense) of size 1 with the sigmoid activation function.</p>
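      <p>To make the first two layers concrete, the sketch below reproduces in plain Python what a single one-dimensional convolution filter followed by ReLU and max pooling of size 2 computes. It is a toy illustration rather than the Keras implementation: the input values, the kernel, and the function names are hypothetical, and a real Conv1D layer learns many such kernels during training.</p>
      <preformat>
```python
def conv1d(signal, kernel):
    """Valid 1D convolution (cross-correlation, as used in deep learning)."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]


signal = [1.0, -1.0, 1.0, 0.0, 2.0, 1.0]   # toy encoded input
kernel = [1.0, 0.0, 1.0]                   # one hypothetical filter
feature_map = relu(conv1d(signal, kernel))
pooled = max_pool1d(feature_map, size=2)
print(feature_map, pooled)
```
</preformat>
      <p>With six inputs and a kernel of width 3, the convolution yields four values, and pooling halves that to two.</p>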
      <p>The approach relies on a convolutional neural network that operates at the character level. In
particular, URL and DNS strings are converted to vector form using natural language processing
techniques. The CNN extracts phishing features and is trained as a binary classification
model.</p>
      <p>The competitor of the convolutional network in the first experiment was a recurrent neural
network (RNN). It was chosen because recurrent neural networks specialize in processing
sequential data and are widely used for text processing. Input text is usually abstracted to a sequence
of characters, words, or phrases; in our experiment, to characters.</p>
      <p>A detailed description of the recurrent network architecture using Keras framework tools:
• recurrent layer (SimpleRNN) of 128 neurons, (batch_size) 200;
• dropout layer (Dropout) with a rate of 20%;
• dense layer (Dense) of size 2 with the ReLU activation function;
• dense layer (Dense) of size 1 with the sigmoid activation function.</p>
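      <p>The character-level representation mentioned above can be sketched as follows. This is one common way to turn a link into a fixed-length sequence of integers and is given as an assumption, since the exact encoding is not specified; the function names and the choice of 0 for both padding and unseen characters are hypothetical.</p>
      <preformat>
```python
def build_vocab(urls):
    """Map every character seen in the links to a positive integer index."""
    chars = sorted(set("".join(urls)))
    return {ch: idx for idx, ch in enumerate(chars, start=1)}


def encode(url, vocab, max_len):
    """Turn a link into a fixed-length list of indices.

    Index 0 is used both for padding and for characters not in the vocabulary.
    """
    ids = [vocab.get(ch, 0) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))


vocab = build_vocab(["ab.c", "ba"])
print(vocab)
print(encode("ab.c", vocab, max_len=6))
```
</preformat>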
      <p>The chosen SimpleRNN architecture is the basic form of an RNN. Its implementation in Keras
differs from the classic architecture described in many articles, but it is simple:
each RNN cell accepts one data input and has one hidden state that is passed
from one step to the next. The results of comparing the effectiveness of the CNN and
RNN (SimpleRNN) monoarchitectures are presented in Figure 4. Contrary to expectations, the CNN model showed
the better result. Most likely, this is due to the simplicity of the recurrent network architecture used.</p>
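      <p>The recurrence described above, in which one input and one hidden state are combined at every step, can be written out in plain Python. The sketch assumes a scalar input per step and the tanh activation that SimpleRNN uses by default; the weights and function names are hypothetical, and a trained layer would learn the weights from data.</p>
      <preformat>
```python
import math


def rnn_step(x_t, h_prev, w_in, w_rec, bias):
    """One step: h_t[k] = tanh(w_in[k]*x_t + sum_j w_rec[k][j]*h_prev[j] + b[k])."""
    size = len(h_prev)
    return [
        math.tanh(
            w_in[k] * x_t
            + sum(w_rec[k][j] * h_prev[j] for j in range(size))
            + bias[k]
        )
        for k in range(size)
    ]


def rnn_forward(sequence, w_in, w_rec, bias):
    """Carry the hidden state across the sequence; return the last state."""
    h = [0.0] * len(bias)
    for x_t in sequence:
        h = rnn_step(x_t, h, w_in, w_rec, bias)
    return h


# Hypothetical weights for a layer with a single hidden unit.
print(rnn_forward([0.5, 0.0], w_in=[1.0], w_rec=[[1.0]], bias=[0.0]))
```
</preformat>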
      <p>At the second stage, the CNN monoarchitecture from the previous stage (Figure 2) was compared
with a hybrid network in which the convolutional network is reinforced by a
recurrent layer, with the layers connected in series (Figure 5).</p>
      <p>Detailed description of the architecture of the hybrid CNN-RNN network:
• convolutional layer (Conv1D), size (batch_size) 200, filters (filters) 150;
• pooling layer (MaxPooling1D) of size 2;
• recurrent layer (SimpleRNN) with 50 neurons;
• dropout layer (Dropout) with a rate of 10%;
• dense layer of size 1 with the sigmoid activation function.</p>
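      <p>As a rough check of how data flows through the hybrid stack, the following Python sketch traces tensor shapes and trainable parameter counts layer by layer. Two values are assumptions because the text does not state them: an input of 30 steps with one channel (one step per listed attribute) and a convolution kernel of size 3 with valid padding and stride 1. The helper names are hypothetical.</p>
      <preformat>
```python
def conv1d_shape(steps, channels, filters, kernel):
    """Output shape and parameter count of a valid, stride-1 Conv1D."""
    return (steps - kernel + 1, filters), kernel * channels * filters + filters


def simple_rnn_params(inputs, units):
    """Input weights, recurrent weights and biases of a SimpleRNN layer."""
    return units * (inputs + units + 1)


def dense_params(inputs, units):
    """Weights and biases of a fully connected layer."""
    return inputs * units + units


# Assumed input: 30 steps (one per attribute), 1 channel; kernel size 3.
(steps, channels), conv_p = conv1d_shape(30, 1, filters=150, kernel=3)
steps = steps // 2                    # MaxPooling1D with pool size 2
rnn_p = simple_rnn_params(channels, units=50)
out_p = dense_params(50, units=1)     # Dropout adds no parameters
total = conv_p + rnn_p + out_p
print((steps, channels), conv_p, rnn_p, out_p, total)
```
</preformat>
      <p>Under these assumptions the recurrent layer receives a sequence of 14 steps with 150 features each, and the whole stack has 10,701 trainable parameters, most of them in the recurrent layer.</p>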
      <p>The effectiveness of the CNN and CNN-RNN models was evaluated by comparing the values of
the accuracy metric. The hybrid model showed the better result (Figure 6).</p>
      <p>The next stage of
the experiment compared the CNN-RNN and DNN-LSTM hybrid architectures, shown in
Figure 5 and Figure 7, respectively. This experiment used the
complex ensemble architecture proposed in [20]. This complex neural network
consists of two parallel networks: a deep feedforward network (DNN) and a recurrent
network built on LSTM blocks.</p>
      <p>The DNN architecture consists of four fully connected layers and a dropout layer:
• dense layer of neurons of size 40 with ReLU activation function;
• dense layer of neurons of size 64 with ReLU activation function;
• dense layer of neurons of size 32 with ReLU activation function;
• dropout layer (Dropout) with a rate of 20%;
• dense layer of neurons of size 16 with ReLU activation function.</p>
      <p>The LSTM network consists of two LSTM layers of 32 neurons each, a Dropout selection layer
and a fully connected layer of 16 neurons. Both networks are connected by a fully connected layer of
8 nodes. The output Dense layer consists of 1 neuron with a sigmoidal activation function and
calculates the result of the entire system (Figure 7). At the third stage, when comparing hybrid
architectures (CNN-RNN and DNN-LSTM), the CNN-RNN model (Figure 8) showed the best results
according to the accuracy metric, although the spread of values is very small.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Experimental studies on the "Phishing Websites Dataset" from the UCI Machine Learning
Repository have shown that simple monoarchitectural networks are outperformed by hybrid ones
in identifying phishing Internet links. Table 1 summarizes the accuracy obtained on the
test set in the conducted experiments.</p>
      <p>Among complex hybrid architectures, the CNN-RNN model is one of the most effective. On
one-dimensional sequence data, a CNN is highly successful at extracting and learning
features. In a hybrid model with convolutional and recurrent networks connected in series, the
CNN interprets the input sequence of characters and extracts features, which
are then passed sequentially to the RNN for further interpretation and classification. This
combination gives the model a high level of flexibility and efficiency.</p>
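      <p>The accuracy metric used for the comparison is the share of links classified correctly. A minimal Python sketch is given below; it assumes that the sigmoid output is turned into a class label with a threshold of 0.5, which the text does not state explicitly. The function name and the toy values are hypothetical and unrelated to the reported results.</p>
      <preformat>
```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of links whose thresholded sigmoid output matches the label."""
    hits = sum(
        int(p > threshold) == t for t, p in zip(y_true, y_prob)
    )
    return hits / len(y_true)


# Toy labels (1 = phishing) and toy sigmoid outputs.
labels = [1, 0, 1, 0]
outputs = [0.9, 0.2, 0.4, 0.6]
print(accuracy(labels, outputs))
```
</preformat>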
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>The study analyzed existing research on the detection of phishing links using
different types of neural network architectures. The CNN, DNN, RNN, and
LSTM architectures were considered, and their advantages and disadvantages were studied.</p>
      <p>Mono and hybrid neural network models were trained and tested. The following
architectural models were compared according to efficiency indicators: CNN, RNN, CNN-RNN,
DNN-LSTM. The computational experiment showed that the most effective model is the CNN-RNN
network. The created neural network core is the basis of a software product for automating the
detection and blocking of phishing links. Architecturally, the phishing link detection system is
implemented as a browser extension. A promising area of future research is the development of neural
network architectures using ensemble methods.</p>
    </sec>
    <sec id="sec-7">
      <title>7. References</title>
      <p>[1] Types of Cybercrime. Panda Security, 2023. URL:
https://www.pandasecurity.com/en/mediacenter/panda-security/types-of-cybercrime/ (date of access:
24.06.2023).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>