A Middleware based Anti-Phishing Architecture
A.A Orunsolu
Department of Computer Science
Moshood Abiola Polytechnic
Abeokuta, Nigeria
orunsolu.abdul@mapoly.edu.ng

A.S Sodiya
Department of Computer Science
Federal University of Agriculture,
Abeokuta, Nigeria
sodiyaas@funaab.edu.ng

A.T Akinwale
Department of Computer Science
Federal University of Agriculture,
Abeokuta, Nigeria

CoRI’16, Sept 7–9, 2016, Ibadan, Nigeria.

ABSTRACT
Phishing attacks are becoming an everyday threat to the ever-growing cyber community. Regrettably, most online users do not understand some of the simplest indicators of a typical phishing scam. In addition, the sophistication of some of the newest phishing scams defeats most of the current software-based countermeasures and anti-phishing education. In this work, a new paradigm-shift architecture is proposed after an extensive survey of current client/server-based anti-phishing techniques. Although the architecture is at the implementation stage, we present this paper to communicate the state of anti-phishing research and to support the efficiency of the new approach in the fight against phishing attacks.

CCS Concepts
Computers and Society ➝ Electronic Commerce – security, payment schemes, electronic data interchange (EDI)

Keywords
Attacks, E-Commerce, Middleware, Phishing, Internet

1. INTRODUCTION
The prevalence of e-services in today’s digital world has opened a door for various cyber-crimes that threaten the acceptability of such services. Hackers have continuously managed a host of online black markets which undermine stakeholders’ confidence in the usability of internet services [13]. This range of criminal enterprises includes spam-advertised commerce, botnet attacks, and vectors for propagating malware [4]. Among all the cybercrimes targeting e-services, phishing attacks have become a significant security threat which causes tremendous losses every day to both experienced and unwary internet users [5]. This is mostly due to the unhealthy disclosure of users’ credentials to phishing-related sites, chats, SMS or e-mail. Thus, these crimes have subjected the popular advantages of the Internet to debate, as businesses, governments, individuals and financial institutions record millions of dollars in losses and espionage.
Phishing is an e-communication criminal act which uses social engineering and technical subterfuge to exploit unwary internet users and acquire their confidential data such as credit card numbers, PINs, passwords, answers to security questions etc. Social engineering-based phishing techniques use spoofed emails, chat or SMS to lead internet users to fake agents, websites etc. On the other hand, technical subterfuge-based phishing schemes plant crimeware onto computers to steal sensitive data. Recently, phishers have developed “ransomware”, which executes a cryptovirology attack that adversely affects computing resources and demands a ransom payment to restore the resources to their original state. According to an online report by CSO, 93% of phishing emails are now “ransomware”. The report observed that most victims tend to pay quickly because of the sensitive nature of their resources [3].
Basically, a typical phishing attack begins with an unauthenticated message crafted by phishers. These messages arrive at the client or user’s machine in the form of email, e-advert, SMS, websites etc. with the brand logos and call center number of a known company. One of the core features of these messages is their deceptive view, which may not be easily identified even by an experienced IT expert [5, 8]. The user falls for a phish by actively following the instruction in the message through performing a click action or download action. In the end, the user’s actions result in the execution of the phishers’ payload. A payload is the functional part of a phisher’s code where their malicious intention is achieved. Figure 1 presents the life cycle of a phishing attack.
Confronted with the unhealthy reality of phishing attacks, researchers have proposed a number of countermeasures ranging from user education to software enhancements. In spite of the existence of various anti-phishing measures, the frequency of phishing incidents continues to increase [14, 29]. For instance, RSA’s online fraud report showed estimated losses of over $4.6 billion by global organizations in 2015. In a similar vein, the Central Bank of Nigeria White paper estimated that about $250 million was lost to cybercrime in 2013 [15].
To this end, we report a survey of anti-phishing research and examine its weaknesses. After the survey of relevant extant literature, we provide a brief discussion of a new approach that is intended to mitigate the weaknesses of the current approaches. Bringing diverse anti-phishing defense systems together in one study is an important milestone, because it provides the basis for evaluating the proposed paradigm-shift approach.
The rest of the paper is organized as follows: Section 2 presents related works on why phishing works. The overview of the current anti-phishing defense architecture is examined in Section 3. In Section 4, we present the proposed paradigm-shift architecture to address current challenges. Section 5 presents our conclusions.

2. WHY PHISHING WORKS?
A number of studies have examined the reasons that people fall for phishing attacks. For instance, Dhamija et al. [5] identified lack of computer system knowledge, lack of knowledge of security and security indicators, visual deception and bounded attention. The authors further showed that a large number of people cannot differentiate between legitimate and phishing web sites, even when they are made aware that their ability to identify phishing attacks is being tested. In another related work, Downs et al. [6] conducted a study in which 20 non-expert computer users revealed their strategies and understanding when faced with possibly suspicious e-mails.

The investigation showed that participants used basic, often incorrect heuristics in deciding how to respond to email messages. In another development, Sheng et al [31] and Jakobsson et al. [11] provided useful insights on why phishing works using demographic data. While Sheng et al [31] revealed that women are more vulnerable than men due to their lower exposure to technical knowledge, Jakobsson et al [12] revealed users’ sensitivity to a variety of common trust indicators such as logos, padlock icons etc. when navigating web pages. Jagatic et al. [11] investigated more sophisticated spear-phishing attacks in which the attackers use specific knowledge of individuals and their organizations to conduct the attack. Their investigation showed that people were 4.5 times more likely to fall for a phish sent from an existing contact than for a standard phishing attack. This is why social networking sites like Facebook are now increasingly used by phishers.
Appealing to people’s sense of greed is an ancient technique now adapted to the digital world, especially in phishing scams [10]. This kind of phishing scam may look like an online survey in which unsuspecting users are promised some financial return for participating in the survey exercise. In a similar vein, phishers might pose as a relief agency asking for help with recent natural disasters to appeal to people’s emotions [10]. Most unsuspecting users may not suspect anything negative even when asked to provide their financial details because of the gory pictures that usually accompany such campaigns.
In a more recent study, Mohammed et al. conducted a user study with an eye tracker to obtain objective quantitative data on user judgment of phishing sites. Their results indicated that users detected 53% of phishing sites even when primed to identify them, and paid little attention to security indicators [20].

[Figure 1 flowchart: Phishing scam begins -> 1. Unauthenticated message arrives on unwary user’s machine -> 2. The message bears a deceptive view -> 3. User interacts with the message by performing some actions requested by the message, e.g. click, update etc. -> Has user performed the requested action? Yes: 4. Phisher’s payload; No: No payload execution -> End of campaign]
Figure 1. Life cycle of a phishing attack

3. THE CURRENT COUNTERMEASURES
In this section, we consider the state of current countermeasures against phishing attacks from the software enhancement perspective. Software enhancement techniques are computer programs that are designed to defeat or mitigate phishing attacks. These software approaches use techniques such as list-based, machine learning, visual similarity and multi-channel authentication algorithms. They are deployed either on the client side or on the server side.

3.1 Client-side Anti-phishing approaches
PhishNet [26] proposed an active blacklist approach in which new malicious URLs can be effectively predicted from the existing blacklist entries. This is achieved by processing blacklisted URLs and producing multiple variations of the same URL using IP address equivalence, query string substitution, brand name equivalence, directory structure similarity and top level domain replacement. In this way, multiple variations of the same URL, called children, are obtained. In order to filter out non-existent child URLs, the system performs a DNS query, TCP connect, HTTP header response check and content similarity check. The approach achieved remarkable results against new malicious URLs during real-time blacklist feeds. However, the problem of false positives still exists.
PhishZoo [1] built profiles of trusted websites based on fuzzy hashing techniques in a whitelist-based approach. The approach also used blacklisting and heuristic approaches to warn users about malicious sites. This approach compared the stored profiles of authentic sites with the content of sites under investigation. The approach achieved a significant accuracy rate of about 96% with the possibility of defeating zero-day attacks. However, it lacks generalization to new phishing attacks because it depends on human intervention.
Cao et al [4] developed an Automated Individual White-List (AIWL) in which a record of well-known benign sites visited by the user is kept. In this way, AIWL maintains a record of every URL along with the Login User Interface (LUI) information where the user inputs his or her details, to prevent unhealthy disclosure of confidential information to malicious sites. The LUI information maintained by AIWL for any suspicious website includes the URL, the Input Area and the IPs. The URL refers to the Uniform Resource Locator of the website. The input area includes the form username path and password path. The IPs is a list of legitimate IP addresses mapping to a URL. This method is very effective against pharming and dynamic phishing attacks. However, a login from a new site can result in a false alert.
In the work of Downs et al [6], a user-behavior-based phishing detection system (UBPD) was proposed, which monitors the submission of user credentials by building binding relationships between users and web pages. This is done by constructing a personal whitelist for the user by adding web sites the user has visited more than three times. UBPD consists of three components, namely the user profile, the monitor and the detection engine. The user profile contains data describing the user’s binding relationships and the user’s personal whitelist. The monitor collects the data the user intends to submit and the identity of the destination websites. The detection engine uses the data provided by the monitor to detect phishing websites and update the user profile if necessary. The approach can be effectively applied to static authentication credentials such as user name, password, security questions etc. However, a zero-day attack is possible since prediction is only applied to websites that the user once visited.

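The child-URL generation attributed to PhishNet in Section 3.1 can be illustrated with a minimal sketch. It covers only two of the five heuristics named above (top level domain replacement and directory structure similarity). The function names, the TLD list and the `exists` predicate are assumptions made for illustration and are not taken from [26]; the predicate stands in for the DNS query, TCP connect and HTTP checks that the real system performs.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical candidate list; the TLDs actually used by PhishNet are not given here.
CANDIDATE_TLDS = ["com", "net", "org", "info"]


def tld_variants(url, tlds=CANDIDATE_TLDS):
    """Return child URLs obtained by replacing the last label of the host name.

    Any port in the original URL is dropped to keep the sketch short.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    if len(labels) < 2:
        return []
    children = []
    for tld in tlds:
        if tld == labels[-1]:
            continue  # skip the parent URL itself
        host = ".".join(labels[:-1] + [tld])
        children.append(
            urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
        )
    return children


def directory_variants(url, known_paths):
    """Pair the host of one blacklisted URL with paths seen on other blacklisted URLs."""
    parts = urlsplit(url)
    return [
        urlunsplit((parts.scheme, parts.netloc, path, "", ""))
        for path in known_paths
        if path != parts.path
    ]


def filter_children(children, exists):
    """Keep only children for which the caller-supplied existence check succeeds."""
    return [child for child in children if exists(child)]
```

For example, `tld_variants("http://paypa1-login.com/secure/index.php")` yields the same path under the `.net`, `.org` and `.info` hosts, and `filter_children` would then discard the variants that do not resolve.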

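The three UBPD components described in Section 3.1 (user profile, monitor and detection engine) can be sketched roughly as follows. Only the visit threshold of three comes from the text. The class name, the `check_submission` function and the rule that a known credential sent to a non-whitelisted host triggers a warning are illustrative assumptions, not the published design.

```python
from collections import Counter
from urllib.parse import urlsplit

VISIT_THRESHOLD = 3  # from the text: sites visited more than three times are whitelisted


def _host(url):
    """Extract the lower-cased host name from a URL."""
    return urlsplit(url).hostname or ""


class UserProfile:
    """Holds the personal whitelist data and the user-to-site binding relationships."""

    def __init__(self):
        self.visits = Counter()
        self.bindings = {}  # host -> set of usernames submitted to that host

    def record_visit(self, url):
        self.visits[_host(url)] += 1

    def whitelist(self):
        return {host for host, n in self.visits.items() if n > VISIT_THRESHOLD}

    def bind(self, url, username):
        self.bindings.setdefault(_host(url), set()).add(username)


def check_submission(profile, url, username):
    """Detection engine (assumed rule): warn when a bound credential goes elsewhere."""
    host = _host(url)
    if host in profile.whitelist():
        profile.bind(url, username)  # update the profile, as the text describes
        return "allow"
    for bound_host, users in profile.bindings.items():
        if username in users and bound_host != host:
            return "warn"
    return "allow"
```

Note that a credential never seen before is allowed on an unknown host, which mirrors the limitation stated above: the check only helps for sites and credentials the user profile already knows.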
Gowtham et al [8] presented a dynamic defense approach in which the direct and indirect links associated with a malicious page are generated. In this way, the target domain set is constructed as input to a Target Identification algorithm that recognizes a phishing page. Using DNS lookup and IP address resolution, the suspicious page can be predicted without the use of machine learning algorithms or existing restriction lists. The accuracy rate of this approach was 99.62%. However, the prediction of this approach is largely dependent on the TF-IDF algorithms, search engine speed and DNS lookup. The unavailability of any of these defeats the efficacy of this approach.
A model to test the trustworthiness of a suspected phishing page was developed in [30] by checking whether the response of a website matches the known behavior of phishing or legitimate sites. The model used the notion of a Finite State Machine to capture the submission of forms with random inputs and the corresponding responses, in order to describe the website’s behavior. The experimental results showed zero false negative and false positive rates. The ability to detect advanced XSS-based attacks is another plus for this method. However, the approach cannot handle phishing attacks where images are employed.
PhishAri [2] detects phishing on Twitter in real time. The approach uses Twitter-specific features along with URL features to detect whether a tweet posted with a URL is phishing or not. The features used in this approach are classified as URL-based, Tweet-based, WHOIS-based and Network-based. The approach is implemented as a Chrome browser extension which makes a call to a RESTful API and accordingly shows an indicator next to each tweet indicating whether the tweet is phishing or not. Experimental results show that the system achieves 92.52% accuracy. The system’s detection speed can be improved with the presence of external database repositories. However, an XSS attack is still possible.

In another work, Islam et al [13] proposed a multi-stage methodology that employed natural language processing and machine learning algorithms to detect phishing attacks and discover the organization that the attackers impersonated during the attack. The approach first discovered named entities and hidden topics in a suspected message using Conditional Random Field (CRF) and Latent Dirichlet Allocation (LDA) after parsing the message with a Multipurpose Internet Mail Extensions (MIME) parser and an HTML parser. In the next stage, utilizing topics and named entities as features, each message was classified as phishing or non-phishing using AdaBoost. In the final stage, the approach discovered the impersonated organization using CRF. The approach ensured automatic discovery of an impersonated entity, which helps the legitimate organization to take necessary action against the offending site. The problems of scalability, false positives and the requirement of an efficient parser still exist.

The work of Maurer et al [21] focused on URL similarity for detecting phishing pages by extracting and verifying different terms of a URL using search engine recommendations. The authors developed algorithms to detect possible search terms that were worth checking using basename, subdomains, pathdomain and brand name. The Top Level Domain was used to extract the base domain that was used with the search engine. The approach was evaluated with a large set of 8730 URLs from an online phishing website database. The approach is effective against software toolkits that launch a large number of phishing pages using different URLs. However, high false positive rates affect the efficiency of this approach. In addition, significant performance issues such as high overhead arise because the system relies on consecutively querying search engines to identify the legitimate domain.
An offensive approach, BogusBiter, in which a large number of bogus credentials are transparently fed into a suspected phishing page was proposed in [30]. In this way, the victim’s real credential is concealed among bogus credentials, thereby increasing the overhead on the phishers’ side in discerning the real credentials. BogusBiter consists of four main modules: information extraction, bogus credential generation, request submission and response process. The information extraction module extracts the username and password pair and its corresponding form element on a login page. The bogus credential generation module generates bogus credentials based on the original credential. The request submission module is responsible for spawning and submitting multiple HTTP requests. The response process module determines the legitimacy of a website based on its response to the HTTP requests. The approach is not bound to any specific phishing detection scheme and can be incrementally deployed over the internet. However, this approach can result in increased bandwidth overhead, and it can be blocked by the phisher since the bogus credentials are submitted from a dedicated IP address.

3.2 Server-side Anti-phishing Approaches
The deployment of server-side anti-phishing defense systems is not as popular as client-side solutions. One such server-side solution is a practical authentication service in which the need for a preset user password is eliminated during information flow between the client and the server [16]. This is achieved through the use of one-time passwords (OTPs) delivered on demand via a reliable secondary communication channel. On receipt of the OTP, the user can log in before the password expires. The proposed approach involves two processes, namely a registration process and a login process, with four participating entities: websites, instant messaging (IM) service provider, users and phishers. In the registration process, a user chooses a unique account name, selects a login password, fills in all the required information fields, completes an additional IM account registration and provides at least one type of personal contact information. In the login process, the registered user can log in with the OTP assigned by the website. The approach does not suffer from client-side vulnerabilities, and the cost of deployment is low, which increases the practicability of this method. However, the approach cannot detect XSS attacks and phishing sites hosted on compromised domains.

In another approach, Chen et al [7] proposed an image-based anti-phishing strategy that measures a suspicious page’s similarity to actual sites based on discriminative key point features in web pages. The approach defined three aspects of visual similarity, consisting of block-level similarity, layout similarity and overall style similarity, to compare pages during the detection process. Their invariant content descriptor, which uses the contrast context histogram (CCH), computes the similarity degree between suspicious and authentic pages. The proposed method takes a snapshot of a suspected page and treats it as an image throughout the detection process. It uses CCH to capture invariant information around discriminative key points on the suspect page and then matches the descriptors with those of authentic pages that are often targeted by phishers. However, the approach cannot detect phishing pages in which phishers use images to mimic their target.

The concept of dynamic security skins that allow humans to distinguish one computer from another was proposed in [17]. Dynamic security skins allow a remote web server to prove its identity in a way that is easy for a human user to verify and difficult for attackers to spoof. This approach assigns each user a random

personalized photographic image that will always appear in the            like internet; but with the emergence of cloud computing
password window. However, the ability of user to recall this image        infrastructure this challenge can be easily leveraged [22].
is a subject of debate. In addition, it is difficult to convince web      In this paradigm-shift approach, we shall employ Map Reduce
master to apply these rules in web page creation.                         algorithm to aggregate web streams into different jobs as suggested
                                                                          [27]. Map Reduce is a programming model and software
3.3 Summary of problems with the existing                                 framework intended to facilitate and simplify the processing of
countermeasures                                                           vast amount of data in parallel on large clusters. The aggregation
In this subsection, we itemized the summary of the problems with          of tasks results into non-computational and computational classes.
the existing anti-phishing system.                                        In the non-computational class (PhishDetect C1), phishing
 a. The inability of most existing anti-phishing countermeasures          detection is done using list-based approach to reduce the
     to efficiently detect newer phishing scams i.e. possibility of       unnecessary computation within the system. If a phishing attack
     zero-day attacks which a type of attack mounted using hosts          cannot be detected by non-computational class, the computational
     that are not blacklisted or using techniques that evade known        class is invoked to complete the detection process. In this case, the
     approaches to phishing detection [25, 20]                            extracted features from the suspected sites are compared with
 b. Most of the existing countermeasures consider small set of            trained feature vectors from a hybrid classifier (NB-SVM). The
     heuristics features in their approach and most browsers’             proposed system will be implemented and evaluated using datasets
     plugins anti-phishing solutions are susceptible to java              from research sources such as PhishTank, APWG etc. The
     vulnerabilities [25, 27]                                             overview of the algorithm for the proposed scheme is presented in
 c. Although there has been substantial performance                       Figure 3
     improvement in detecting phishing, the foremost drawback of                Get Web Document (webpages, email message, e-chat)
     methods currently in use, in particular for classification based                   Sort web document using Map-Reduce algorithms
     methods using statistical learning algorithms, continue to be                   Generate the Mapper and the Reducer Function
     the false positive problem [13]                                              For each Mapper and Reducer Function, invoke non-
 d. High computational overhead of most classification-based              computational Class
     anti-phishing countermeasures [13]                                               If detection is accurately performed, the exit
 e. Lack of consensus and problems of coverage of most blacklist                  Else send uncompleted task to computational Class
     techniques. In addition, the blacklist method cannot adapt the                  Trained NB-SVM classifier on feature class on the
     filter to identify emerging rule changes in the intruders’           uncompleted task
     attacks [19]                                                                   Classify the task and exit
 f. Intensive configuration and lack of proper user attention with most client-side solutions [22]
 g. Absence of holistic countermeasures that detect, prevent and disrupt phishing scams. Most existing anti-phishing systems focus on either phishing email or phishing website detection [6,7,13,19]

4. THE PROPOSED APPROACH
Phishing remains a serious problem, and the current detect-and-warn approach is not sufficient to address it. Motivated by this challenge, we propose a paradigm-shift architecture (Fig. 3) based on middleware technology (MT). Middleware is a viable alternative that addresses the challenges of client/server anti-phishing techniques. Its primary advantage is that it leverages the benefits of the software-as-a-service model: the software solution remains external to the client and server systems and is accessible to and executable by a large number of individuals. The approach has potentially great benefits for anti-phishing design. MT keeps the system up to date (fully maintained), since administration is under the control of the service provider; keeps the anti-phishing service efficient by automatically adding new filter rules as required; interacts with a large volume of data traffic, which can be collated and analyzed for improved security coverage in the fight against phishing; provides a suitable basis for anomaly detection technology; and is transparent to both the client and the server. Nevertheless, MT raises the issue of scalability, especially in a user-intensive environment.

Figure 2. Pseudo code of the proposed scheme

Stage 1: The first stage of the architecture captures client transactions before they are forwarded to the phishing detection manager (PDM). When a user opens a page in the web browser, the extension module accesses the DOM tree of the downloaded page from the web browser's IFrame. The Document Object Model (DOM) is a World Wide Web Consortium standard that allows programs and scripts to dynamically access and update the content, structure and style of documents. After the DOM is constructed, the transaction is parsed to extract any hyperlinks present in the body of the transaction or webpage.
At the same time, the transaction is tokenized in an attempt to identify the named entities, such as organizations, and the hidden topics with which the phisher is trying to deceive unsuspecting users. Named entities are proper names of people, organizations, locations etc. in the body of a document. Named entity recognition, an information extraction task that seeks to locate and classify elements in text documents as one of these proper names, is performed with a robust Conditional Random Field (CRF). The second task of the tokenizer is to discover the hidden topics in a transaction, which is achieved with Latent Dirichlet Allocation (LDA). LDA is sensitive to changes in feature usage, which makes it good at handling synonyms. It is also robust to polysemy, that is, features with different meanings in different contexts. In addition, it can discover threatening themes in a message as well as intentionally misspelled and conjoined features. The most powerful feature of LDA is its ability to discover multiple topics from a single document.
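The hyperlink extraction and tokenization steps of Stage 1 can be illustrated with a minimal sketch. It assumes the page is available as an HTML string rather than through a browser extension, and the names `TransactionParser` and `capture_transaction` are hypothetical. The CRF-based entity recognition and LDA topic discovery are not implemented here; the sketch only produces the links and tokens that such models would consume.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class TransactionParser(HTMLParser):
    """Collects hyperlinks and visible text from one page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []        # href values of <a> elements, in document order
        self.text_parts = []   # visible text fragments
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; keep everything else as text.
        if self._skip_depth == 0:
            self.text_parts.append(data)


def capture_transaction(html):
    """Return the links, link hostnames and lowercase word tokens of a page."""
    parser = TransactionParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    hosts = [urlparse(link).hostname for link in parser.links]
    return {"links": parser.links, "hosts": hosts, "tokens": tokens}


if __name__ == "__main__":
    page = ('<html><body><p>Dear customer, verify your PayPal account</p>'
            '<a href="http://paypa1-secure.example.net/login">PayPal</a>'
            '</body></html>')
    print(capture_transaction(page))
```

A mismatch between the organization named in the tokens and the hostnames of the extracted links is one kind of cue that the later stages could use.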


[Figure: architecture diagram. Stage 1: Client/User Interface. Stage 2: Transaction Fetch Component, Non-ML units, ML unit, Transaction Classification Component, Transaction Response Component, Orchestration engine/Service Description. Stage 3: Internet resources, Server. Labelled flows: 1. Request retrieved; 2. Transaction web calls; 3. Secured transaction information; 4. Response; 5. Safe transaction path.]
Figure 3. The Middleware-based anti-phishing architecture

Stage 2: This stage contains the core component of the proposed technique, which determines the phishing label of a transaction. The Phishing Detection Manager (PDM) provides the link between the client interface and the Secured Server Side Transaction (Stage 3). The Orchestration Engine (OE) is responsible for managing communication between the Machine Learning Detection units and the Non-Machine Learning Detection units. In this way, exception management, transaction management, resource management and component management are easily coordinated. The Phishing Detection Manager offers methods for all the basic tasks associated with the construction of, and interaction with, the phishing detection process. The core components of the phishing detection manager are:
     a. Transaction Fetch Component (TFC)
     b. Transaction Preliminary Filter Component (TPFC)
     c. Transaction Classification Component (TCC)
     d. Transaction Response Component (TRC)
These four components are integrated into a middleware system as an anti-phishing scheme using a service model architecture. That is, the anti-phishing scheme remains external to the server and the client. In addition, the system is accessible to and executable by a large number of client machines irrespective of the browser type.

Transaction Fetch Component (TFC)
The Transaction Fetch Component represents the entry point of web requests into the Anti-Phishing System, where billions of user transactions are aggregated after DOM construction and tokenization so that a phishing label can be generated with a cost-efficient response. The task of aggregation is made computationally less expensive by employing the Map Reduce framework. This is consistent with the suggestion of [27]. Map Reduce is a programming model and software framework intended to facilitate and simplify the processing of vast amounts of data, such as Internet Web Streams (IWS), in parallel on large clusters. The Map Reduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing failed tasks. The slaves execute the tasks as directed by the master. The core idea behind Map Reduce is to map the data set into a collection of <key, value> pairs and then reduce over all pairs with the same key. However, it can be more efficient to sort data once during insertion than to sort them for each Map Reduce query. In the light of this, an insertion sorting technique is adopted to increase the efficiency of the Map Reduce capability of the TFC.

Transaction Preliminary Filter Component (TPFC)
The output of the Map Reduce algorithms provides the input to the Transaction Preliminary Filter Component, which involves the following tasks:
1. Preliminary Transaction Filtering Module (PTFM), which determines the phishiness of a transaction without learning algorithms, using an Anti-Phishing Dictionary with a Customized Source Code Scanner, an Anti-Phishing Authentication System with the ability to detect abnormalities in the login form, and a Phishing Toolkit Analyzer using a Phishing Toolkit Corpus. The rationale for introducing this module is to reduce system computation and to make efficient use of memory in a time-critical setting like the web. This module is especially suited to preapproved sites and sites with known popularity.
2. Feature Selection Module (FSM), which determines efficient features for classification in a Feature Generator Process using an efficient feature selection approach. The main attraction of this module is that it selects the most informative Comprehensive Anti-Phishing Features for efficient classification. FSM takes advantage of the factors embedded within or surrounding a message (called heuristic cues), such as its source, format, length, and subject, to quickly make a validity assessment.
3. Cached Internet Resource Module, which provides faster lookup and speeds up the labeling of a transaction using data from WHOIS properties, PhishTank, crawling instances etc. This reduces superfluous computation on an already labeled suspicious webpage or transaction.

Transaction Classification Component
Given an identity and a set of features, the task of determining the genuineness of a transaction is executed by a classification algorithm. A classification algorithm automatically learns how to make accurate predictions based on past observations used for training. The Transaction Classification Component of HAPS uses a hybrid classifier approach to determine the status of a transaction efficiently. Naïve Bayes (NB) and Support Vector Machine (SVM) are combined as a hierarchical hybrid system model (NB-SVM) to maximize detection accuracy and minimize computational complexity. NB is a relatively accurate classifier, especially for high-dimensional datasets like web streams. However, capacity control and generalization remain an issue. The main problem associated with using SVM as a classifier is the computational overhead needed to transform text data into numerical data, which is sometimes termed "vectorization". Generally in PDM, the features of a web transaction are directly vectorized by transforming the text documents into numerical format using SVM. Thus, NB is used as a pre-processor for selected features at the front end of the SVM to vectorize the corpus before the actual training and classification are carried out. The motivations for adopting this hybrid classifier approach are to:
     i. Improve the generalization of the overall system
    ii. Maintain a comparatively feasible training time and categorization time
   iii. Overcome the limitations of list-based methods (e.g. the blacklist approach) by dynamically updating the training patterns whenever a new pattern appears during classification
    iv. Ignore serious deficiencies in the underlying algorithms of both classifiers
     v. Produce a simple, computationally effective and highly accurate classifier

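The way the Stage 2 units hand a transaction from the cheaper non-ML checks to the ML classifier can be sketched as a simple cascade. This is only an illustration under stated assumptions, not the authors' implementation: the class `PhishingDetectionManager`, its fields, the string labels and the feature name `suspicious_links` are all hypothetical, and the classifier is passed in as an arbitrary callable standing in for the NB-SVM model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class PhishingDetectionManager:
    """Hypothetical cascade: cached lookup, then non-ML filter, then ML unit."""

    # Stand-in for the classification component: returns True for phishing.
    classifier: Callable[[Dict[str, float]], bool]
    # Preapproved hosts, accepted without running any learning algorithm.
    approved_hosts: Set[str] = field(default_factory=set)
    # Hosts already listed in an anti-phishing dictionary.
    dictionary_hosts: Set[str] = field(default_factory=set)
    # Labels of transactions that were already examined.
    cache: Dict[str, str] = field(default_factory=dict)
    # Number of times the ML unit was actually invoked.
    ml_calls: int = 0

    def label(self, host: str, features: Dict[str, float]) -> str:
        # 1. Cached resource lookup avoids recomputing a known label.
        if host in self.cache:
            return self.cache[host]
        # 2. Preliminary filtering decides without a learning algorithm.
        if host in self.approved_hosts:
            verdict = "legitimate"
        elif host in self.dictionary_hosts:
            verdict = "phishing"
        else:
            # 3. Only undecided transactions reach the classifier.
            self.ml_calls += 1
            verdict = "phishing" if self.classifier(features) else "legitimate"
        self.cache[host] = verdict
        return verdict


if __name__ == "__main__":
    pdm = PhishingDetectionManager(
        classifier=lambda f: f.get("suspicious_links", 0) > 2,
        approved_hosts={"bank.example.com"},
        dictionary_hosts={"phish.example.net"},
    )
    print(pdm.label("bank.example.com", {}))
    print(pdm.label("new.example.org", {"suspicious_links": 5}))
    print(pdm.ml_calls)
```

Ordering the checks from cheapest to most expensive reflects the stated rationale of the preliminary filter, which is to reduce computation before the classifier is consulted.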
Transaction Response Component
The Transaction Response Component provides a cost-efficient response to a classified transaction based on the severity of the attack as computed by the Threat Identification Module (TIM). The TIM measures and identifies the threat severity associated with a classified transaction, so that the level of seriousness of the attack can be predicted proactively. This is necessary to advance the notion of HAPS to a high level, especially for assessing the severity of a phishing campaign. Consider a TIM algorithm that assigns a threat score, 0 ≤ t_i ≤ 1, to the ith transaction upon the occurrence of the jth classification by the PDM. The threat score may qualitatively identify the threat level upon classification as compromised if t_i = 1, threatened if 0 < t_i < 1, and unthreatened if t_i = 0.
Stage 3: The third stage of the architecture ensures that only safe transactions are forwarded or returned to the client for the completion of the initiated task after the necessary anti-phishing computations have been performed. The orchestration engine of the PDM also makes web calls into this stage when external sources of data are needed to validate a transaction under investigation. All transactions are directed to benign servers, while malicious servers are bypassed.

5. CONCLUSIONS AND FUTURE WORK
As e-commerce has seen unprecedented adoption by online communities, phishing activities continue to wreak havoc on unsuspecting users who access e-commerce services. In the process, both users and service providers have suffered millions of dollars in losses compared with other forms of cybercrime. Phishing has therefore become a plague that threatens stakeholders' confidence in the security of online products and services. Considerable research has been done towards protecting users from phishing attacks. Despite the efforts of the research community, the industry, and law enforcement to develop solutions to the problem, phishing has shown no sign of abating (Basnet et al. 2012), as each of the existing techniques suffers from major challenges. In this paper, we provide a survey of relevant literature on anti-phishing defense systems from the client-side and server-side perspectives. We illustrate some open problems with the current counter-strategies and make a case for a paradigm shift to a middleware-based defense system. The middleware-based approach overcomes some inherent challenges of client-side and server-side approaches through the provision of enhanced security, ease of configuration, optimization of load balancing, management of connections etc. In addition, we present a working architecture of the proposed method. Future work will consider the implementation of the proposed architecture on a real-time phishing data corpus as well as a benign data corpus.

6. REFERENCES
[1] Afroz, A. and Greenstadt, R. 2011. PhishZoo: detecting phishing websites by looking at them. In Proceedings of the IEEE Fifth International Conference on Semantic Computing, pp. 368-375.
[2] Aggarwal, A., Rajadesingan, A. and Kumaraguru, P. 2012. PhishAri: Automatic realtime phishing detection on Twitter. In Seventh IEEE APWG eCrime Researchers Summit (eCRS). Las Croabas, Puerto Rico, 22-25.
[3] CSO Online report on phishing activities. Accessed 2016 (www.csoonline.com/articles).
[4] Cao, Y., Han, W. and Le, Y. 2008. Anti-phishing based on automated individual white-list. Proceedings of the 4th ACM Workshop on Digital Identity Management, Alexandria, USA.
[5] Dhamija, R., Tygar, J.D. and Hearst, M. 2006. Why phishing works. Proc. of the SIGCHI Conference on Human Factors in Computing Systems, ACM Press, pp. 581-590.
[6] Downs, J.S., Holbrook, M.B. and Cranor, L.F. 2006. Decision strategies and susceptibility to phishing. In Proceedings of the Second Symposium on Usable Privacy and Security (SOUPS 2006), pp. 79-90.
[7] Chen, K., Chen, J., Huang, C. and Chen, C. 2009. Fighting phishing with discriminative keypoint features. IEEE Internet Computing, Vol. 13 No. 3, pp. 56-63.
[8] Gowtham, R. and Krishnamurthi, I. 2014. An efficacious method for detecting phishing webpages through target domain identification. Journal of Decision Support Systems. Elsevier Press.
[9] Han, W., Cao, Y., Bertino, E. and Yong, J. 2012. Using automated individual white-list to protect web digital identities. Expert Systems with Applications.
[10] Hong, J. 2012. The state of phishing attacks. Contributed Articles in the Communication of the ACM. Vol. 55 No. 1.
[11] Jagatic, T., Johnson, N., Jakobsson, M. and Menczer, F. 2007. Social phishing. Communications of the ACM, Vol. 50.
[12] Jakobsson, M. and Myers, S. A. 2007. Phishing and countermeasures: Understanding the increasing problem of identity theft. Introduction to Phishing (Eds.), pp. 1-2. New York: John Wiley & Sons, Inc.
[13] Islam, R. and Abawajy, J. 2013. Multi-tier phishing detection and filtering approach. Journal of Network and Computer Applications.
[14] Kathryn P., Agata M., Malcolm P., Marcus B. and Cate J. 2015. The design of phishing studies: Challenges for researchers. Journal of Computers and Security.
[15] Longe, T. 2014. Ensuring Information Security Assurance through Policy Framework. Proc. of First National Cyber Security Forum. Lagos, Nigeria.
[16] Huang, C., Ma, S. and Chen, K. 2011. Using one-time passwords to prevent password phishing attacks. Journal of Network and Computer Applications. Elsevier Press.
[17] Dhamija, R. and Tygar, J. D. 2005. The battle against phishing: Dynamic security skins. In Proceedings of the Symposium on Usable Privacy and Security (SOUPS), pp. 77-88.
[18] Lovet, G. 2009. Fighting cybercrime: technical, juridical and ethical challenges. Proceedings of the Virus Bulletin Conference.
[19] Moghimi, M. and Varjani, A.Y. 2016. New rule-based phishing detection method. Journal of Expert Systems with Applications. Vol. 53, pp. 231-242.
[20] Mohammed A., Furkan A. and Sonia C. 2015. Why phishing still works: User strategies for combating phishing attacks. International Journal of Human-Computer Studies. Volume 82, pp. 70-82. Elsevier Press.
[21] Maurer, M. and Hofer, L. 2012. Sophisticated Phishers Make More Spelling Mistakes: Using URL Similarity Against Phishing. Springer.
[22] Ofuonye, E. and Miller, J. 2013. Securing web-clients with instrumented code and dynamic runtime monitoring. Journal of Systems and Software.
[23] Pan, Y. and Ding, X. 2006. Anomaly based web phishing page detection. Proc. of the 22nd Annual Computer Security Applications Conference.
[24] Parno, B., Kuo, C. and Perrig, A. 2006. Phoolproof phishing prevention. Financial Cryptography and Data Security, Lecture Notes in Computer Science, Vol. 4107, Springer, Berlin.


[25] Purkait, S. 2012. Phishing counter measures and their effectiveness - literature review. Information Management and Computer Security, Vol. 20 No. 5.
[26] Prakash, P., Kumar, M., Kompella, R.R. and Gupta, M. 2010. Phishnet: predictive blacklisting to detect phishing attacks. Proceedings of the 29th Conference on Information Communications, San Diego, CA, USA, pp. 346-350.
[27] Ramanathan, V. and Wechsler, H. 2013. Phishing detection and impersonated entity discovery using Conditional Random Field and Latent Dirichlet Allocation. Journal of Computers and Security.
[28] Ralf K., Peter F. and Wolfgang N. 2009. Latent Dirichlet Allocation for Tag Recommendation. Proc. of RecSys, ACM.
[29] RSA Anti-Fraud Command Center. RSA monthly online fraud report, 2014.
[30] Shahriar, H. and Zulkernine, M. 2011. Trustworthiness testing of phishing websites: a behavior model-based approach. Future Generation Computer Systems.
[31] Sheng, S., Holbrook, M., Kumaraguru, P., Cranor, L.F. and Downs, J. 2010. Who falls for phish? A demographic analysis of phishing susceptibility and effectiveness of interventions. Proc. of the 28th International Conference on Human Factors in Computing Systems, USA.
[32] Xiang, G., Hong, J., Rose, C.P. and Cranor, L. 2011. CANTINA+: a feature-rich machine learning framework for detecting phishing web sites. ACM Transactions on Information and System Security.
[33] Yue, C. and Wang, H. 2010. BogusBiter: a transparent protection against phishing attacks. ACM Transactions on Internet Technology, Vol. 10 No. 2, pp. 1-31.
[34] Zhang, Y., Egelman, S., Cranor, L. and Hong, J. 2007. Phishing Phish: Evaluating Anti-Phishing Tools. Proc. of Network and Distributed Systems Security Symposium (NDSS).
[35] Zhang, H., Liu, G., Chow, T.W.S. and Liu, W. 2011. Textual and visual content-based anti-phishing: a Bayesian approach. IEEE Transactions on Neural Networks, Vol. 2

