<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Weapon to Detect the Phishing URLs by Machine Learning Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Suseta Datta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shibaprasad Sen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pritam Kundu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Natural Language Processing</institution>
          ,
          <addr-line>Security, Interactive and Responsive Web Framework</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Engineering &amp; Management</institution>
          ,
          <addr-line>Kolkata, Newtown</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>39</fpage>
      <lpage>46</lpage>
      <abstract>
        <p>Phishing is now a common attack in which victims are tricked into disclosing their personal information through counterfeit URLs. The main purpose of detecting phishing URLs with machine learning is to protect sensitive information such as passwords for personal portals, personal details, and online transactions. Machine learning offers several powerful tools for combating phishing attacks. This paper applies several machine learning approaches to the detection of phishing URLs. The best-fitting approach was identified and then modified using another ML approach, giving almost 97% testing accuracy. Precision, recall, F1-score, and training and testing accuracy were calculated from the confusion matrix for each applied approach. An interactive and responsive web framework was designed to make the project user-friendly. The characteristics of phishing domains, and the features that distinguish them from legitimate domains, are explained in detail. Phishing URLs embedded in the body of a message are designed to appear to lead to the impersonated organization, using that organization's logos and other legitimate content.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Phishing is a growing concern for security researchers, primarily because of the ease with which a
fraudulent URL can be forged to resemble a legitimate one [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Phishers use URLs that are visually similar to authentic URLs. Phishing attacks
succeed largely because of a lack of user awareness [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To evade blacklists, attackers use ingenious
techniques to deceive users by modifying the URL so that it appears legitimate [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Attackers try
to convince online users to reveal passwords, account numbers, social security numbers, or other
personal information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this work, URLs and all of their features are analyzed to detect phishing
URLs. To prevent extensive losses, various authors have proposed identifying the characteristic features of
phishing emails. These features serve as inputs to statistical classification techniques, which are
then trained to identify phishing URLs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. H. Huang et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed a framework that detects
phishing using page-section similarity and breaks down Uniform Resource Locator tokens to generate
predictions; phishing pages usually keep the CSS style of their target pages [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. S. Marchal et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed a technique that distinguishes phishing URLs based on the examination of
legitimate-site server log information. Mustafa Aydin et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed a classification method that detects phishing
URLs from their URL features and analyzes subset-based feature selection methods. PhishStorm is an automated
phishing detection system that can analyze any URL in real time [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Muhemmet Baykara et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
proposed an application, described as an anti-phishing simulator, that gives details about the phishing
detection problem and the ways of detecting phishing URLs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>2022 Copyright for this paper by its authors.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Statement and Proposed Solution</title>
      <p>
        As technology continues to grow, phishing techniques have evolved rapidly to evade the
anti-phishing mechanisms used to detect them [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Phishing has become a major area of
concern for security researchers because it is not difficult to create a fraudulent URL
that looks very close to a legitimate URL [16]. A major drawback of previous techniques is that they cannot
detect ‘zero-hour’ phishing URL attacks. This paper applies the most prevalent techniques and
shows which best-fitting algorithm predicts the result correctly with the highest accuracy. To
develop the best-fit model, programs are divided into their appropriate domains and subsequently categorized
as phishing or legitimate. Two methods were used: a Naïve Bayes classifier and a logistic regression model. The
logistic regression model gave better accuracy than the Naïve Bayes classifier. A confusion matrix was computed
for each algorithm to determine the best fit based on precision, recall, and F1-score. Some strategies use page
content inspection to overcome the false negative problem and to compensate for the
weaknesses of stale blacklists. A toolkit was developed to serve as a platform for all users.
It is used to determine whether a given URL is phishing or not. The toolkit is built for all users,
so it must be easy to operate, and no user should face any difficulty while using
it. Different feature-based datasets need to be taken into account when determining whether a URL
is phishing [19]. The features used for detecting and classifying phishing URLs fall into the following groups: HTML
and JavaScript based, abnormality based, address bar based, and domain based [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
The machine learning strategies, natural language processing steps, and other applied approaches are described
later in this paper.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset for detecting phishing sites was taken from
https://www.kaggle.com/taruntiwarihp/phishing-site-urls. The raw dataset contains 549,346 samples,
of which 72% are legitimate URLs and 28% are phishing URLs. Of the total, 507,195 samples are
unique. Legitimate URLs are labelled ‘good’ and phishing URLs are labelled ‘bad’. The dataset has
two columns, neither of which contains null values.
The dataset was prepared using natural language processing methods.
A vectorization step transforms the stemmed words into vector form. The vectors
are generated from the tokenized and stemmed dataset, and the URL string itself is
used as input once the model has been developed. 75% of the dataset was used for training, while the
remaining 25% was held out for testing.</p>
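      <p>The 75/25 split described above can be sketched with the Python standard library alone. This is a minimal illustration and not the authors' code: the function name split_train_test, the fixed seed, and the sample rows are hypothetical, and the original work may well have used a library routine instead.</p>
      <preformat><![CDATA[
```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical sample rows: (url, label) with labels 'good' or 'bad'.
rows = [
    ("example.com/home", "good"),
    ("example.org/docs", "good"),
    ("example.net/about", "good"),
    ("example.com/home", "good"),  # duplicate of the first row
    ("login-example.test/verify", "bad"),
    ("secure-update.test/account", "bad"),
    ("example.edu/courses", "good"),
    ("example.io/blog", "good"),
]

unique_rows = set(rows)  # counting unique samples, as reported for the dataset
train, test = split_train_test(rows)
print(len(rows), len(unique_rows), len(train), len(test))
```
]]></preformat>
      <p>Fixing the seed keeps the split reproducible, so accuracy figures from different runs stay comparable.</p>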
      <p>[Figure: bar chart of the class distribution in the dataset: 72% legitimate URLs, 28% phishing URLs.]</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
    </sec>
    <sec id="sec-5">
      <title>4.1. Tokenization, Stemming and Joining Root Words</title>
      <p>[Figure: the raw dataset divided into bad URLs and good URLs.]</p>
      <p>Tokenization is the process of splitting a string into smaller units called tokens.
Here, tokenization is used to break each URL, which is given as a string, into tokens that
serve as the meaningful units for later processing. Stemming is then applied to reduce these tokens
to their root words, so that morphological variants of a word map to the same root.
After the root words were generated, three columns were added to the raw dataset to
store the tokens, the root words, and the joined root words: ‘Tokenized_text’,
‘Stemmed_text’, and ‘Sent_text’. Table 1 shows the processing time for tokenizing, stemming, and
storing.</p>
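      <p>The tokenizing, stemming, and joining steps can be sketched as follows. This is a simplified illustration under stated assumptions: the split pattern, the toy suffix list, and the sample URL are hypothetical, and the suffix stripper below is only a stand-in for whatever stemmer the original work used. Only the three column names come from the text above.</p>
      <preformat><![CDATA[
```python
import re

# Assumption: tokens are runs of letters; everything else is a separator.
TOKEN_SPLIT = re.compile(r"[^A-Za-z]+")

# Assumption: a toy suffix list; a real stemmer applies far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(url):
    """Break a URL string into alphabetic tokens."""
    return [tok for tok in TOKEN_SPLIT.split(url) if tok]


def stem(token):
    """Strip the first matching suffix, keeping a root of at least three letters."""
    lowered = token.lower()
    for suffix in SUFFIXES:
        root = lowered[: len(lowered) - len(suffix)]
        if lowered.endswith(suffix) and len(root) >= 3:
            return root
    return lowered


url = "login.example-bank.test/accounts/verifying"  # hypothetical sample URL
tokens = tokenize(url)
roots = [stem(tok) for tok in tokens]

# One dataset row with the three added columns named in the text.
row = {
    "Tokenized_text": tokens,
    "Stemmed_text": roots,
    "Sent_text": " ".join(roots),
}
print(row["Sent_text"])
```
]]></preformat>
      <p>Joining the root words back into one string gives a form that a text vectorizer can consume directly.</p>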
    </sec>
    <sec id="sec-6">
      <title>4.2. Embedding Root Words</title>
      <p>The extracted root words of both good and bad URLs were visualized as word clouds,
which show the most frequent root words in each class. The figures below show the
output for good and bad URLs.</p>
    </sec>
    <sec id="sec-7">
      <title>4.3. Web Driver Automation for Hyperlink Extraction</title>
      <p>A web driver automation tool was used for automated testing of a sample phishing URL.
The web driver was used in place of a manually operated web browser. The relevant hyperlinks of
the tested phishing URL were extracted and loaded into a data frame.</p>
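      <p>The hyperlink extraction step can be illustrated without a browser. The sketch below is a stand-in, not the web driver workflow itself: it parses a static HTML string with the standard library, and the class name LinkCollector, the page content, and the base URL are all hypothetical. With a web driver, the page source would come from the driven browser instead of a string literal.</p>
      <preformat><![CDATA[
```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute target of every anchor tag that has an href."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page source standing in for a fetched phishing page.
PAGE = """
<html><body>
  <a href="/login">Sign in</a>
  <a href="http://other.test/reset?id=1">Reset</a>
  <a name="anchor-without-href">No link</a>
</body></html>
"""

collector = LinkCollector("http://sample.test/index.html")
collector.feed(PAGE)
links = collector.links

# One record per hyperlink, ready to be loaded into a tabular frame.
frame_rows = [{"source": collector.base_url, "hyperlink": link} for link in links]
print(frame_rows)
```
]]></preformat>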
      <p>Vectorization was used to transform a collection of texts into vectors of token counts.
This makes it possible to carry out validation on the training and testing sets. The labels and the
vectors are the two inputs to the dataset split, and both are then used to create
the model. Two main algorithms were used to create the model; they are discussed further in
the result analysis section.</p>
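      <p>A vector of token counts can be built as in the following sketch. It is an illustration only: the helper names build_vocabulary and vectorize and the two sample documents are hypothetical, and a library vectorizer would normally produce a sparse matrix instead of plain lists.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(document, vocabulary):
    """Return the token-count vector of one document."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens unseen during fitting are ignored
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical joined root words for two URLs.
docs = ["login secure login", "secure shop"]
vocab = build_vocabulary(docs)
matrix = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(matrix)
```
]]></preformat>
      <p>Each row of the matrix corresponds to one URL and each column to one root word, which is the form that both models take as input.</p>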
    </sec>
    <sec id="sec-8">
      <title>5. Result Analysis</title>
      <p>Logistic Regression (LR) and Multinomial Naïve Bayes (MNB) are probabilistic learning techniques that
are widely used in natural language processing. Both algorithms were applied to
the training set, which comprises the 75% of the dataset assigned to training during
pre-processing. Both models were fitted to the training set produced by the natural language
processing steps described earlier, each through its respective model class.
To assess the predictions on the test set, a confusion matrix was plotted for each
algorithm and used to evaluate accuracy (see Table 1).
LR gave 96.35% testing accuracy and
MNB gave 95.79% testing accuracy. The counts of Actual Good
- Predicted Good (True Positive), Actual Bad - Predicted Good (False Positive), Actual Good -
Predicted Bad (False Negative), and Actual Bad - Predicted Bad (True Negative) were then measured.
From these counts, precision, recall, F1-score, and training and testing accuracy were
calculated for each algorithm (see Table 1). A scikit-learn pipeline
was then built around the best-fitting algorithm. For this step, the inputs to the dataset split were
changed again: the vectorized column was replaced by the column of raw URL
strings. The training set was fitted as before. This approach gave
96.58% testing accuracy, and a confusion matrix was created from the predicted and
actual labels. The final model was dumped into a pickle file and loaded into the responsive
and interactive web framework, where it works in the background of the interface.</p>
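      <p>Persisting a model to a pickle file and loading it again can be sketched as below. This is a hedged illustration and not the authors' pipeline: the dictionary of weights is a hypothetical stand-in for a fitted model, the file name model.pkl is assumed, and predict is a hypothetical helper that applies a logistic score to space-separated tokens.</p>
      <preformat><![CDATA[
```python
import math
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: plain data, so it pickles anywhere.
model = {
    "weights": {"login": 1.5, "verify": 2.0, "docs": -1.0},
    "bias": -1.0,
}


def predict(model, text):
    """Label text 'bad' when the logistic score reaches 0.5, else 'good'."""
    z = model["bias"] + sum(model["weights"].get(tok, 0.0) for tok in text.split())
    probability = 1.0 / (1.0 + math.exp(-z))
    return "bad" if probability >= 0.5 else "good"


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.pkl")
    with open(path, "wb") as handle:
        pickle.dump(model, handle)  # training side: persist the model
    with open(path, "rb") as handle:
        restored = pickle.load(handle)  # web side: load it once at start-up

print(predict(restored, "login verify account"))
print(predict(restored, "docs home"))
```
]]></preformat>
      <p>Loading the file once when the web application starts avoids retraining on every request. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.</p>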
      <p>Finally, this study lays the groundwork for further phishing URL detection strategies.
Logistic Regression performed best in terms of ACC, TPR/FPR, PPV, and F1-score when
machine learning approaches were applied to identify the given URLs correctly. The Multinomial Naïve Bayes
algorithm also performed very well but gave lower accuracy than LR according to the confusion matrix.
Logistic Regression also gave the best accuracy when fitted to the training set.
On the test set, Logistic Regression alone reached 96.35% accuracy,
while the pipeline reached 96.58%. The pipelined approach therefore performs about the same as,
and slightly better than, standalone LR.</p>
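      <p>The metrics named above follow directly from the four confusion matrix counts. The sketch below uses the standard definitions with ‘good’ as the positive class, matching the labelling in this section; the function name confusion_metrics and the sample counts are hypothetical and are not the values from Table 1.</p>
      <preformat><![CDATA[
```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute standard scores from confusion matrix counts ('good' is positive)."""
    precision = tp / (tp + fp)  # PPV: predicted good that is actually good
    recall = tp / (tp + fn)     # TPR: actual good that is predicted good
    fpr = fp / (fp + tn)        # FPR: actual bad that is predicted good
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts chosen only to exercise the formulas.
scores = confusion_metrics(tp=90, fp=10, fn=5, tn=45)
for name, value in scores.items():
    print(name, round(value, 4))
```
]]></preformat>
      <p>Reporting precision and recall alongside accuracy matters here because the dataset is imbalanced, with 72% of samples in the ‘good’ class.</p>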
    </sec>
    <sec id="sec-9">
      <title>6. Conclusion and Future Scope</title>
      <p>To summarize, this work has shown how phishing is a major
threat to the security and safety of the web and how phishing detection is an important problem domain.
Two machine learning approaches were tested on the phishing URLs dataset and
their results were calculated. The best algorithm was then selected based on its performance,
and a Chrome Driver extension was built for detecting phishing URLs. Phishing URLs were detected
using Logistic Regression with a precision of almost 97%.</p>
      <p>This paper aims to improve techniques for detecting phishing URLs using machine learning.
In future work, we intend to combine the model with the Random Forest algorithm and a blacklist method
to build the phishing detection system as a scalable web service that incorporates online
learning, so that new phishing attack patterns can be learned easily and the precision of
our models improved.</p>
    </sec>
    <sec id="sec-10">
      <title>7. Acknowledgements</title>
      <p>The co-author would like to express his gratitude to Dr. Shibaprasad Sen, Department of Computer
Science and Technology, and Suseta Datta, Department of Computer Science and Engineering,
University of Engineering &amp; Management, Kolkata, for their insightful comments and for providing
him with all pertinent details and information about the paper in a timely manner.</p>
    </sec>
    <sec id="sec-11">
      <title>8. References</title>
      <p>[16] Abdelhamid, Neda, Fadi Thabtah, and Hussein Abdel-jaber. "Phishing detection: A recent intelligent machine learning comparison based on models content and features." 2017 IEEE International Conference on Intelligence and Security Informatics (ISI). IEEE, 2017.</p>
      <p>[17] Ubing, Alyssa Anne, et al. "Phishing website detection: an improved accuracy through feature selection and ensemble learning." International Journal of Advanced Computer Science and Applications 10.1 (2019): 252-257.</p>
      <p>[18] Kumar, Abhishek, Jyotir Moy Chatterjee, and Vicente García Díaz. "A novel hybrid approach of SVM combined with NLP and probabilistic neural network for email phishing." International Journal of Electrical and Computer Engineering 10.1 (2020): 486.</p>
      <p>[19] Zamir, Ammara, et al. "Phishing web site detection using diverse machine learning algorithms." The Electronic Library (2020).</p>
      <p>[20] Gharge, Sagar, and Manik Chavan. "An integrated approach for malicious tweets detection using NLP." 2017 International Conference on Inventive Communication and Computational Technologies (ICICCT). IEEE, 2017.</p>
      <p>[21] Kulkarni, Arun D., and Leonard L. Brown III. "Phishing websites detection using machine learning." (2019).</p>
      <p>[22] Alswailem, Amani, et al. "Detecting phishing websites using machine learning." 2019 2nd International Conference on Computer Applications &amp; Information Security (ICCAIS). IEEE, 2019.</p>
      <p>[23] Lakshmanarao, A., P. Surya Prabhakara Rao, and MM Bala Krishna. "Phishing website detection using novel machine learning fusion approach." 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS). IEEE, 2021.</p>
      <p>[24] Wu, Che-Yu, Cheng-Chung Kuo, and Chu-Sing Yang. "A phishing detection system based on machine learning." 2019 International Conference on Intelligent Computing and its Emerging Applications (ICEA). IEEE, 2019.</p>
      <p>[25] Alam, Mohammad Nazmul, et al. "Phishing attacks detection using machine learning approach." 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT). IEEE, 2020.</p>
      <p>[26] Kumar, G. Ravi, S. Gunasekaran, and Vignesh AS. "URL Phishing Data Analysis and Detecting Phishing Attacks using Machine Learning in NLP." International Journal of Engineering Applied Sciences and Technology (IJEAST) 3.8 (2018): 70-75.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Buber</surname>
            , Ebubekir,
            <given-names>Banu</given-names>
          </string-name>
          <string-name>
            <surname>Diri</surname>
          </string-name>
          , and Ozgur Koray Sahingoz.
          <article-title>"NLP based phishing attack detection from URLs."</article-title>
          <source>International Conference on Intelligent Systems Design and Applications</source>
          . Springer, Cham,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Sahoo</surname>
          </string-name>
          , Doyen, Chenghao Liu, and Steven CH Hoi.
          <article-title>"Malicious URL detection using machine learning: A survey."</article-title>
          <source>arXiv preprint arXiv:1701.07179</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Lemley</surname>
            , Joe,
            <given-names>Shabab</given-names>
          </string-name>
          <string-name>
            <surname>Bazrafkan</surname>
            , and
            <given-names>Peter</given-names>
          </string-name>
          <string-name>
            <surname>Corcoran</surname>
          </string-name>
          .
          <article-title>"Deep Learning for Consumer Devices and Services: Pushing the limits for machine learning, artificial intelligence, and computer vision."</article-title>
          <source>IEEE Consumer Electronics Magazine 6.2</source>
          (
          <year>2017</year>
          ):
          <fpage>48</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Ghafir</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ibrahim</surname>
          </string-name>
          , et al.
          <article-title>"Detection of advanced persistent threat using machine-learning correlation analysis."</article-title>
          <source>Future Generation Computer Systems</source>
          <volume>89</volume>
          (
          <year>2018</year>
          ):
          <fpage>349</fpage>
          -
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Gandotra</surname>
            , Ekta, and
            <given-names>Deepak</given-names>
          </string-name>
          <string-name>
            <surname>Gupta</surname>
          </string-name>
          .
          <article-title>"An efficient approach for phishing detection using machine learning."</article-title>
          <source>Multimedia Security</source>
          . Springer, Singapore,
          <year>2021</year>
          .
          <fpage>239</fpage>
          -
          <lpage>253</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Sahingoz</surname>
            ,
            <given-names>Ozgur</given-names>
          </string-name>
          <string-name>
            <surname>Koray</surname>
          </string-name>
          , et al.
          <article-title>"Machine learning based phishing detection from URLs."</article-title>
          <source>Expert Systems with Applications</source>
          <volume>117</volume>
          (
          <year>2019</year>
          ):
          <fpage>345</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Gualberto</surname>
          </string-name>
          , Eder S., et al.
          <article-title>"From feature engineering and topics models to enhanced prediction rates in phishing detection."</article-title>
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          ):
          <fpage>76368</fpage>
          -
          <lpage>76385</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Lee</surname>
            , Minyoung, and
            <given-names>Eunil</given-names>
          </string-name>
          <string-name>
            <surname>Park</surname>
          </string-name>
          .
          <article-title>"Real-time Korean voice phishing detection based on machine learning approaches."</article-title>
          <source>Journal of Ambient Intelligence and Humanized Computing</source>
          (
          <year>2021</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Yadollahi</surname>
            ,
            <given-names>Mohammad</given-names>
          </string-name>
          <string-name>
            <surname>Mehdi</surname>
          </string-name>
          , et al.
          <article-title>"An adaptive machine learning based approach for phishing detection using hybrid features."</article-title>
          <source>2019 5th International Conference on Web Research (ICWR)</source>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>Brij B.</given-names>
          </string-name>
          , et al.
          <article-title>"A novel approach for phishing URLs detection using lexical based machine learning in a real-time environment."</article-title>
          <source>Computer Communications</source>
          <volume>175</volume>
          (
          <year>2021</year>
          ):
          <fpage>47</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Kiruthiga</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Akila</surname>
          </string-name>
          .
          <article-title>"Phishing websites detection using machine learning."</article-title>
          <source>International Journal of Recent Technology and Engineering 8</source>
          .2 (
          <year>2019</year>
          ):
          <fpage>111</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Chiew</surname>
            ,
            <given-names>Kang</given-names>
          </string-name>
          <string-name>
            <surname>Leng</surname>
          </string-name>
          , et al.
          <article-title>"A new hybrid ensemble feature selection framework for machine learning-based phishing detection system."</article-title>
          <source>Information Sciences</source>
          <volume>484</volume>
          (
          <year>2019</year>
          ):
          <fpage>153</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>Ankit</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <surname>Brij</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Gupta</surname>
          </string-name>
          .
          <article-title>"A machine learning based approach for phishing detection using hyperlinks information."</article-title>
          <source>Journal of Ambient Intelligence and Humanized Computing 10.5</source>
          (
          <year>2019</year>
          ):
          <fpage>2015</fpage>
          -
          <lpage>2028</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>Ankit</given-names>
          </string-name>
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <surname>Brij</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Gupta</surname>
          </string-name>
          .
          <article-title>"Towards detection of phishing websites on client-side using machine learning based approach."</article-title>
          <source>Telecommunication Systems 68.4</source>
          (
          <year>2018</year>
          ):
          <fpage>687</fpage>
          -
          <lpage>700</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Peng</surname>
            , Tianrui, Ian Harris, and
            <given-names>Yuki</given-names>
          </string-name>
          <string-name>
            <surname>Sawa</surname>
          </string-name>
          .
          <article-title>"Detecting phishing attacks using natural language processing and machine learning."</article-title>
          <source>2018 IEEE 12th International Conference on Semantic Computing (ICSC)</source>
          . IEEE,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>