A Middleware based Anti-Phishing Architecture

A.A Orunsolu
Department of Computer Science
Moshood Abiola Polytechnic, Abeokuta, Nigeria
orunsolu.abdul@mapoly.edu.ng

A.S Sodiya
Department of Computer Science
Federal University of Agriculture, Abeokuta, Nigeria
sodiyaas@funaab.edu.ng

A.T Akinwale
Department of Computer Science
Federal University of Agriculture, Abeokuta, Nigeria

ABSTRACT
Phishing attacks are becoming an everyday threat to the ever growing cyber community. Regrettably, most online users do not understand some of the simplest indicators of a typical phishing scam. In addition, the sophistication of some of the newest phishing attacks defeats most of the current software-based countermeasures and anti-phishing education. In this work, a new paradigm-shift architecture is proposed after an extensive survey of current client/server-based anti-phishing techniques. Although the architecture is at the implementation stage, we present this paper to communicate the state of anti-phishing research and to support the efficiency of the new approach in the fight against phishing attacks.

CCS Concepts
Computers and Society ➝ Electronic Commerce – security, payment schemes, electronic data interchange (EDI)

Keywords
Attacks, E-Commerce, Middleware, Phishing, Internet

CoRI’16, Sept 7–9, 2016, Ibadan, Nigeria.

1. INTRODUCTION
The prevalence of e-services in today’s digital world has opened a door for various cyber-crimes that threaten the acceptability of such services. Hackers have continuously managed a host of online black markets which undermine stakeholders’ confidence in the usability of internet services [13]. This range of criminal enterprises includes spam-advertised commerce, botnet attacks, and vectors for propagating malware [4]. Among all the cybercrimes targeting e-services, phishing attacks have become a significant security threat which causes tremendous losses every day to both experienced and unwary internet users [5]. This is mostly due to the unhealthy disclosure of users’ credentials to phishing-related sites, chats, SMS or e-mail. Thus, these crimes have subjected the popular advantages of the Internet to debate, as businesses, governments, individuals and financial institutions have recorded millions of dollars in losses and espionage.

Phishing is an e-communication criminal act which uses social engineering and technical subterfuge to exploit unwary internet users and acquire their confidential data such as credit card numbers, PINs, passwords, answers to security questions etc. Social engineering-based phishing techniques use spoofed emails, chat or SMS to lead internet users to fake agents, websites etc. On the other hand, technical subterfuge-based phishing schemes plant crimeware onto computers to steal sensitive data. Recently, phishers have developed “ransomware”, which executes a cryptovirology attack that adversely affects computing resources and demands a ransom payment to restore the resources to their original state. According to an online report by CSO, 93% of phishing emails are now “ransomware”. The report observed that most victims tend to pay quickly because of the sensitive nature of their resources [3].

Basically, a typical phishing attack begins with an unauthenticated message crafted by phishers. These messages arrive at the client or user’s machine in the form of email, e-advert, SMS, websites etc. with brand logos and the call center number of a known company. One of the core features of these messages is their deceptive view, which may not be easily identified even by an experienced IT expert [5, 8]. The user falls for a phish by actively following the instruction in the message through performing a click action or download action. In the end, the user’s actions result in the execution of the phishers’ payload. A payload is the functional part of a phisher’s code where their malicious intention is achieved. Figure 1 presents the life cycle of a phishing attack.

Confronted with the unhealthy reality of phishing attacks, researchers have proposed a number of countermeasures ranging from user education to software enhancements. In spite of the existence of various anti-phishing measures, the frequency of phishing incidents continues to increase [14, 29]. For instance, RSA’s online fraud report showed estimated losses of over $4.6 billion by global organizations in 2015. In a similar vein, the Central Bank of Nigeria White paper estimated that about $250 million was lost to cybercrime in 2013 [15].

To this end, we report a survey of anti-phishing research and examine its weaknesses. After the survey of relevant extant literature, we provide a brief discussion of a new approach that is intended to mitigate the weaknesses of the current approaches. This is a very important milestone in harnessing diverse anti-phishing defense systems in one study to provide the basis for evaluating the proposed paradigm-shift approach.

The rest of the paper is organized as follows: Section 2 presents related works on why phishing works. The overview of the current anti-phishing defense architecture is examined in Section 3. In Section 4, we present the proposed paradigm-shift architecture to address current challenges. Section 5 presents our conclusions.

2. WHY PHISHING WORKS?
A number of studies have examined the reasons that people fall for phishing attacks. For instance, Dhamija et al. [5] identified lack of computer system knowledge, lack of knowledge of security and security indicators, visual deception and bounded attention. The authors further showed that a large number of people cannot differentiate between legitimate and phishing web sites, even when they are made aware that their ability to identify phishing attacks is being tested.
In another related work, Downs et al. [6] conducted a study in which 20 non-expert computer users revealed their strategies and understanding when faced with possibly suspicious e-mails. The investigation showed that participants used basic, often incorrect heuristics in deciding how to respond to email messages.

[Figure 1. Life cycle of a phishing attack. Flowchart: phishing scam begins; (1) unauthenticated message arrives on the unwary user’s machine; (2) the message bears a deceptive view; (3) the user interacts with the message by performing some action requested by it, e.g. click, update; if the user has performed the requested action, (4) the phisher’s payload executes, otherwise there is no payload execution; end of campaign.]

In another development, Sheng et al. [31] and Jakobsson et al. [11] provided useful insights on why phishing works using demographic data. While Sheng et al. [31] revealed that women are more vulnerable than men due to their lesser exposure to technical knowledge, Jakobsson et al. [12] revealed users’ sensitivity to a variety of common trust indicators such as logos, padlock icons etc. when navigating web pages. Jagatic et al. [11] investigated more sophisticated spear-phishing attacks in which the attackers use specific knowledge of individuals and their organizations to conduct the attack. Their investigation showed that people were 4.5 times more likely to fall for a phish sent from an existing contact than for standard phishing attacks. This is why social networking sites like Facebook are now more patronized by phishers.

Appealing to people’s sense of greed is an ancient technique now adapted to the digital world, especially in phishing scams [10]. This kind of phishing scam may look like an online survey in which unsuspecting users are promised some financial returns for participating in the survey exercise. In a similar vein, phishers might pose as a relief agency asking for help with recent natural disasters to appeal to people’s emotions [10]. Most unsuspecting users may not suspect anything negative even when asked to provide their financial details, because of the gory pictures that usually accompany such campaigns.

In a more recent study, Mohammed et al. conducted a user study with an eye tracker to obtain objective quantitative data on user judgment of phishing sites. Their results indicated that users detected 53% of phishing sites even when primed to identify them, and paid little attention to security indicators [20].

3. THE CURRENT COUNTERMEASURES
In this section, we consider the state of current countermeasures against phishing attacks from the software enhancement perspective. Software enhancement techniques are computer programs that are designed to defeat or mitigate phishing attacks. These software approaches use techniques such as list-based, machine learning, visual similarity and multi-channel authentication algorithms. They are deployed either on the client side or on the server side.

3.1 Client-side Anti-phishing approaches
PhishNet [26] proposed an active blacklist approach in which new malicious URLs can be effectively predicted from the existing blacklist entries. This is achieved by processing blacklisted URLs and producing multiple variations of the same URL using IP address equivalence, query string substitution, brand name equivalence, directory structure similarity and top level domain replacement. In this way, multiple variations of the same URL, called children, are obtained. In order to filter out non-existent children URLs, the system performs a DNS query, a TCP connect, an HTTP header response check and a content similarity check. The approach achieved remarkable results during real-time blacklist feeds against new malicious URLs. However, the problem of false positives still exists.

PhishZoo [1] built profiles of trusted websites based on fuzzy hashing techniques in a whitelist-based approach. The approach also used blacklisting and heuristic approaches to warn users about malicious sites. It compared the stored profiles of authentic sites with the content of sites under investigation. The approach achieved a significant accuracy rate of about 96% with the possibility of defeating zero-day attacks. However, there is a lack of generalization to new phishing due to human interventions.

Cao et al. [4] developed an Automated Individual White-List (AIWL) in which a record of well-known benign sites visited by users is kept. In this way, AIWL maintains a record of every URL along with its Login User Interface (LUI) information, where the user inputs his or her details, to prevent unhealthy disclosure of confidential information to malicious sites. The LUI information maintained by AIWL for any suspicious website includes the URL, the Input Area and the IPs. The URL refers to the Uniform Resource Locator of the website. The input area includes the form username path and password path. The IPs is a list of legitimate IP addresses mapping to a URL. This method is very effective against pharming and dynamic phishing attacks. However, a new login can result in a false alert.

In the work of Downs et al. [6], a behavior-based phishing detection system (UBPD) was proposed which monitors the submission of user credentials by building binding relationships between users and web pages. This is done by constructing a personal whitelist for the user by adding web sites the user has visited more than three times. UBPD consists of three components, namely the user profile, the monitor and the detection engine. The user profile contains data to describe the user’s binding relationships and the user’s personal whitelist. The monitor collects the data the user intends to submit and the identity of the destination websites. The detection engine uses the data provided by the monitor to detect phishing websites and update the user profile if necessary. The approach can be effectively applied to static authentication credentials such as user name, password, security questions etc. However, a zero-day attack is possible since prediction is only applied to websites that the user once visited.

Gowtham et al. [8] presented a dynamic defense approach in which the direct and indirect links associated with a malicious page are generated. In this way, the target domain set is constructed as input to the Target Identification algorithm to recognize a phishing page. Using DNS lookup and IP address resolution, the suspicious page can be predicted without the use of machine learning algorithms or existing restriction lists. The accuracy rate of this approach was 99.62%. However, the prediction of this approach is largely dependent on the TF-IDF algorithm, search engine speed and DNS lookup. The unavailability of any of these defeats the efficacy of this approach.

A model to test the trustworthiness of a suspected phishing page was developed in [30] by checking if the response of websites matches the known behavior of phishing or legitimate sites. The model used the notion of a Finite State Machine to capture the submission of forms with random inputs and then their corresponding responses to describe the website’s behavior. The experimental results showed zero false negative and positive rates. The ability to detect advanced XSS-based attacks is another plus for this method. However, the approach cannot handle phishing attacks where images are employed.

PhishAri [2] detects phishing on Twitter in real time. The approach uses Twitter-specific features along with URL features to detect whether a tweet posted with a URL is phishing or not. The features used in this approach are classified into URL-based, Tweet-based, WHOIS-based and Network-based. The approach is implemented as a Chrome browser extension which makes a call to a purpose-built RESTful API and accordingly shows an indicator next to each tweet indicating whether the tweet is phishing or not. Experimental results show that the system achieves 92.52% accuracy. The system detection speed can be improved with the presence of external database repositories. However, an XSS attack is still possible.

In another work, Islam et al. [13] proposed a multi-stage methodology that employed natural language processing and machine learning algorithms to detect phishing attacks and discover the organization that the attackers impersonated during phishing attacks. The approach first discovered named entities and hidden topics in a suspected message using Conditional Random Field (CRF) and Latent Dirichlet Allocation (LDA) after parsing the message with the Multipurpose Internet Mail Extensions (MIME) parser and HTML parser. In the next stage, utilizing topics and named entities as features, each message was classified as phishing or non-phishing using AdaBoost. In the final stage, the approach discovered the impersonated organization using CRF. The approach ensured automatic discovery of an impersonated entity, which helps the legitimate organization to take necessary action against the offending site. The problems of scalability, false positives and the requirement of an efficient parser still exist.

The work of Maurer et al. [21] focused on URL similarity for detecting phishing pages by extracting and verifying different terms of a URL using search engine recommendation. The authors developed algorithms to detect possible search terms that were worth checking using basename, subdomains, pathdomain and brand name. The Top Level Domain was used to extract the base domain that was used with the search engine. The approach was evaluated with a large set of 8730 URLs from an online phishing website database. The approach is effective against software toolkits that launch a large number of phishing pages using different URLs. However, high false positive rates affect the efficiency of this approach. In addition, significant performance issues like high overhead resulted, as the system relies on consecutively querying search engines to identify the legitimate domain.

An offensive approach, BogusBiter, in which a large number of bogus credentials are transparently fed into a suspected phishing page was proposed in [30]. In this way, the victim’s real credential is concealed among bogus credentials, thereby increasing the overhead on the phishers’ side in discerning the real credentials. BogusBiter consists of four main modules: information extraction, bogus credential generation, request submission and response process. The information extraction module extracts the username and password pair and its corresponding form element on a login page. The bogus credential generation module generates bogus credentials based on the original credential. The request submission module is responsible for spawning and submitting multiple HTTP requests. The response process module determines the legitimacy of a website based on its response to HTTP requests. The approach is not bound to any specific phishing detection scheme and can be incrementally deployed over the internet. However, this approach can result in increased bandwidth overhead, and it can be blocked by the phisher since the bogus credentials are submitted from a dedicated IP address.

3.2 Server-side Anti-phishing Approaches
The deployment of server-side anti-phishing defense systems is not as popular as client-side solutions. One such server-side solution is a practical authentication service in which the need for a preset user password is eliminated during information flow between the client and the server [16]. This is achieved through the use of one-time passwords (OTPs) delivered on demand via a reliable secondary communication channel. On receipt of the OTP, the user can log in before the password expires. The proposed approach involves two processes, namely a registration process and a login process, with four participating entities: websites, instant messaging service provider, users and phishers. In the registration process, a user chooses a unique account name, selects a login password, fills in all the required information fields, completes an additional IM account registration and provides at least one type of personal contact information. In the login process, the registered user can log in with the OTP assigned by the website. The approach does not suffer from client-side vulnerabilities, and the cost of deployment is low, which increases the practicability of this method. However, the approach cannot detect XSS attacks and phishing sites hosted on compromised domains.

In another approach, Chen et al. [7] proposed an image-based anti-phishing strategy that measures suspicious pages’ similarity to actual sites based on discriminative key point features in web pages. The approach defined three aspects of visual similarity, consisting of block-level similarity, layout similarity and overall style similarity, to compare pages during the detection process. Their invariant content descriptor, which uses the contrast context histogram (CCH), computes the similarity degree between suspicious and authentic pages. The proposed method takes a snapshot of a suspected page and treats it as an image throughout the detection process. It uses CCH to capture invariant information around discriminative key points on the suspect page and then matches the descriptors with those of authentic pages that are often targeted by phishers. However, the approach cannot detect phishing pages in which phishers use images to mimic their target.

The concept of dynamic security skins that allow humans to distinguish one computer from another was proposed in [17]. Dynamic security skins allow a remote web server to prove its identity in a way that is easy for a human user to verify and difficult for attackers to spoof. This approach assigns each user a random personalized photographic image that will always appear in the password window. However, the ability of users to recall this image is a subject of debate. In addition, it is difficult to convince web masters to apply these rules in web page creation.

3.3 Summary of problems with the existing countermeasures
In this subsection, we itemize the problems with the existing anti-phishing systems.
a. The inability of most existing anti-phishing countermeasures to efficiently detect newer phishing scams, i.e. zero-day attacks, which are a type of attack mounted using hosts that are not blacklisted or using techniques that evade known approaches to phishing detection [25, 20]
b. Most of the existing countermeasures consider a small set of heuristic features in their approach, and most browser plug-in anti-phishing solutions are susceptible to Java vulnerabilities [25, 27]
c. Although there has been substantial performance improvement in detecting phishing, the foremost drawback of methods currently in use, in particular classification-based methods using statistical learning algorithms, continues to be the false positive problem [13]
d. High computational overhead of most classification-based anti-phishing countermeasures [13]
e. Lack of consensus and problems of coverage of most blacklist techniques. In addition, the blacklist method cannot adapt the filter to identify emerging rule changes in the intruders’ attacks [19]
f. Intensive configuration and lack of users’ proper attention with most client-side solutions [22]
g. Absence of holistic countermeasures that detect, prevent and disrupt phishing scams. Most existing anti-phishing systems focus on either phishing email or phishing website detection [6, 7, 13, 19]

4. THE PROPOSED APPROACH
The phishing problem has been and still is very important, and the current detection and warning approach taken to address the problem is not enough. Motivated by this challenge, we propose a paradigm-shift architecture (Fig. 3) based on middleware technology (MT). Middleware technology is one of the viable alternatives to the challenges of client/server anti-phishing techniques. The primary advantage of MT is that it leverages the benefits of the software-as-a-service model. That is, the software solution design remains external to the client and server systems and is accessible and executable by a large number of individuals. The approach has potentially great benefits for anti-phishing design: MT is able to always keep the system up to date (fully maintained), as administration is under the control of the service provider; to ensure that the anti-phishing service remains efficient (by automatically adding new filter rules as required); to interact with a large volume of data traffic which can be collated and analyzed for improved security coverage in the fight against phishing; to provide a suitable basis for anomaly detection technology; and to offer transparency to both the client and the server. Nevertheless, MT raises the issue of scalability, especially in a user-intensive environment like the internet; but with the emergence of cloud computing infrastructure this challenge can be easily addressed [22].

In this paradigm-shift approach, we shall employ the Map Reduce algorithm to aggregate web streams into different jobs, as suggested in [27]. Map Reduce is a programming model and software framework intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters. The aggregation of tasks results in non-computational and computational classes. In the non-computational class (PhishDetect C1), phishing detection is done using a list-based approach to reduce the possibility of unnecessary computation within the system. If a phishing attack cannot be detected by the non-computational class, the computational class is invoked to complete the detection process. In this case, the extracted features from the suspected sites are compared with trained feature vectors from a hybrid classifier (NB-SVM). The proposed system will be implemented and evaluated using datasets from research sources such as PhishTank, APWG etc. The overview of the algorithm for the proposed scheme is presented in Figure 2.

    Get Web Document (webpages, email message, e-chat)
    Sort web document using Map-Reduce algorithms
    Generate the Mapper and the Reducer Function
    For each Mapper and Reducer Function, invoke non-computational Class
    If detection is accurately performed, then exit
    Else send uncompleted task to computational Class
    Apply trained NB-SVM classifier to the feature class of the uncompleted task
    Classify the task and exit

Figure 2. Pseudo code of the proposed scheme

Stage 1: The first stage of the architecture is where client transactions are captured before being forwarded to the phishing detection manager (PDM). When a user opens a page in the web browser, the extension module accesses the DOM tree of the downloaded page from the web browser’s IFrame. The Document Object Model (DOM) is a World Wide Web Consortium standard that allows programs and scripts to dynamically access and update the content, structure and style of documents. After the construction of the DOM, the transaction is also parsed to extract any hyperlinks present in the body of the transaction or webpage.

At the same time, the transaction is also tokenized in an attempt to identify the named entities, such as the organization, and the hidden topics with which the phishers are trying to deceive unsuspecting users. Named entities are proper names, i.e. names of people, organizations, locations etc., in the body of a document. Named entity recognition, an information extraction task that seeks to locate and classify elements in text documents as one of these proper names, is performed with the robust Conditional Random Field (CRF) model. The second task of the tokenizer is to discover the hidden topics in a transaction, which is achieved by employing Latent Dirichlet Allocation (LDA). LDA is sensitive to changes in feature usage, which makes it good at handling synonyms. It is also robust to polysemy, i.e. features with different meanings in different contexts. In addition, it can discover a threatening theme in a message as well as intentionally misspelled features and conjoined features. The most powerful feature of LDA is its ability to discover multiple topics from a single document.

[Figure 3. The Middleware-based anti-phishing architecture. Labels recoverable from the diagram: Client/User Interface; Stage 1; Stage 2 with the Transaction Fetch Component, Non-ML units, ML unit, Transaction Classification Component, Transaction Response Component and Orchestration Engine/Service Description; Stage 3 with Internet resources and Server. Numbered flows: 1. Request; 2. Transaction web calls; 3. Secured transaction information; 4. Response; 5. Safe transaction path.]

Stage 2: This contains the core component of the proposed technique, which determines the phishing label of a transaction. The phishing detection manager provides the link between the client interface and the Secured Server Side Transaction (Stage 3). The Orchestration Engine (OE) is responsible for managing communication between the Machine Learning Detection units and the Non-Machine Learning Detection units. In this way, exception management, transaction management, resource management and components management are easily coordinated. The Phishing Detection Manager offers methods for all the basic tasks associated with the construction and interaction of the phishing detection process. The core components of the phishing detection manager are:
a. Transaction Fetch Component (TFC)
b. Transaction Preliminary Filter Component (TPFC)
c. Transaction Classification Component (TCC)
d. Transaction Response Component (TRC)
These four components are integrated into a middleware system as an anti-phishing scheme using a service model architecture. That is, the anti-phishing scheme remains external to their system (i.e. the server and client). In addition, the system is accessible and executable by a large number of client machines irrespective of the browser type.

Transaction Fetch Component (TFC)
The Transaction Fetch Component represents the entry point of web requests into the Anti-Phishing System, where billions of user transactions are aggregated after the DOM construction and tokenization for onward generation of a phishing label with a cost-efficient response. The task of aggregation is made computationally less expensive by employing the Map Reduce framework. This is consistent with the suggestion of [27]. Map Reduce is a programming model and software framework intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters such as Internet Web Streams (IWS). The Map Reduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs’ component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master. The core idea behind Map Reduce is mapping the data set into a collection of (key, value) pairs, and then reducing over all pairs with the same key. However, it can be more efficient to sort data once during insertion than to sort them for each Map Reduce query. In the light of this, an insertion sorting technique is adopted to increase the efficiency of the Map Reduce capability of the TFC.

Transaction Preliminary Filter Component (TPFC)
The output of the Map Reduce algorithms provides the input to the Transaction Preliminary Filter Component, which involves the following tasks:
1. Preliminary Transaction Filtering Module (PTFM), which determines the phishiness of a transaction without learning algorithms, using an Anti-Phishing Dictionary with a Customized Source Code Scanner, an Anti-Phishing Authentication System with the ability to detect abnormality in the login form, and a Phishing Toolkit Analyzer using a Phishing Toolkit Corpus. The rationale for the introduction of this module is to reduce the system computation and enhance efficient memory usage in a time-critical scenario like the web scape. This module is especially suited for preapproved sites and sites with known popularity.
2. Feature Selection Module (FSM), which determines efficient features for classification in a Feature Generator Process using an efficient feature selection approach. The main attraction of this module is to select the most informative Comprehensive Anti-Phishing Features for efficient classification. FSM takes advantage of the factors embedded within or surrounding a message (called heuristic cues), such as its source, format, length and subject, to quickly make a validity assessment.
3. Cached Internet Resource Module, which provides for faster lookup and speeds up the phishing labelling of a transaction using data from WHOIS properties, PhishTank, Crawling Instances etc. This is to reduce superfluous computation on already labeled suspicious webpages or transactions.

Transaction Classification Component
Given an identity and a set of features, the task of determining the genuineness of a transaction is executed by a classification algorithm. A classification algorithm automatically learns how to make accurate predictions based on past or trained observations. The Transaction Classification Component of HAPS uses a hybrid classifier approach to provide an efficient status of a transaction. Naïve Bayes (NB) and Support Vector Machine (SVM) are combined as a hierarchical hybrid system model (NB-SVM) to maximize detection accuracy and minimize computational complexity. NB is a relatively accurate classifier, especially for large-dimensional datasets like web streams. However, capacity control and generalization remain an issue. The main problem associated with using SVM as a classifier is the computational overhead needed to transform text data into numerical data, which is sometimes termed “vectorization”. Generally in PDM, the features of a web transaction are directly vectorized by transforming the text documents into numerical format using SVM. Thus, NB is used as a pre-processor for selected features in the front end of the SVM to vectorize the corpus before the actual training and classification are carried out. The motivations for the adoption of this hybrid classifier approach are:
i. Improve the generalization of the overall system
ii. Maintain a comparatively feasible training time and categorization time
iii. Overcome the limitations of list-based methods (e.g. blacklist approach) by dynamically updating the training patterns whenever there is a new pattern during classification
iv. Ignore serious deficiencies in underlying algorithms of both classifiers
v. Produce a simple, computationally effective and highly accurate classifier

Transaction Response Component
The Transaction Response Component provides a cost-efficient response to a classified transaction based on the severity of the attack as computed by the Threat Identification Module (TIM). The TIM measures and identifies the threat severity associated with a classified transaction. With classified transactions, a TIM is proposed to proactively predict the level of seriousness of the attack. This is necessary in advancing the notion of HAPS to a high level, especially for assessing the severity of a phishing campaign. Consider the TIM algorithm that assigns a threat score, 0 ≤ t_i ≤ 1, to the ith transaction upon the occurrence of the jth classification by PDM. The threat scores may qualitatively identify the threat level upon classification as compromised if t_i = 1, threatened if 0

[5] Dhamija, R., Tygar, J.D. and Hearst, M. 2006. Why phishing works. Proc. of the SIGCHI Conference on Human Factors in Computing Systems, ACM Press, pp. 581-90.
[6] Downs, J.S., Holbrook, M.B. and Cranor, L.F. 2006. Decision strategies and susceptibility to phishing. In Proceedings of the Second Symposium on Usable Privacy and Security (SOUPS 2006), pp. 79-90.
[7] Chen, K., Chen, J., Huang, C. and Chen, C. 2009. Fighting phishing with discriminative keypoint features. IEEE Internet Computing, Vol. 13 No. 3, pp. 56-63.
[8] Gowtham, R. and Krishnamurthi, I. 2014. An efficacious method for detecting phishing webpages through target domain identification. Journal of Decision Support Systems. Elsevier
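As an illustration of the child-URL generation attributed to PhishNet [26], the following minimal Python sketch applies only one of the five heuristics named in the text, top level domain replacement, and then filters the children with a caller-supplied existence check. The candidate TLD list, the function names and the `resolves` callback are assumptions made for this example; they are not taken from PhishNet, which is described as filtering children with DNS queries, TCP connects, HTTP header responses and content similarity.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical candidate TLDs; the list PhishNet actually uses is not given here.
CANDIDATE_TLDS = ("com", "net", "org", "info")


def tld_children(url, tlds=CANDIDATE_TLDS):
    """Return variations of `url` that differ only in the top level domain."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if "." not in host:
        return []  # nothing to replace, e.g. "localhost"
    stem, current_tld = host.rsplit(".", 1)
    children = []
    for tld in tlds:
        if tld == current_tld:
            continue
        # Note: the port and any user info of the parent URL are dropped.
        child = urlunsplit(
            (parts.scheme, f"{stem}.{tld}", parts.path, parts.query, parts.fragment)
        )
        children.append(child)
    return children


def filter_existing(children, resolves):
    """Keep only children whose host passes the supplied existence check."""
    return [child for child in children if resolves(urlsplit(child).hostname)]
```

In a deployment, `resolves` would wrap a real DNS lookup; passing it in as a parameter keeps the sketch testable without network access.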
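The personal whitelist and binding relationships described for UBPD can be sketched in a few lines of Python. This is a simplified reading of the description, not the UBPD implementation: the class layout, the use of a SHA-256 digest to remember credentials, and the decision rule in `looks_like_phishing` are all assumptions made for illustration. Only the "visited more than three times" threshold comes from the text.

```python
import hashlib
from collections import Counter


def _digest(secret):
    # Store only a digest so the profile never holds the credential itself.
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


class UserProfile:
    """Personal whitelist plus credential-to-site bindings (hypothetical layout)."""

    def __init__(self, visit_threshold=3):
        self.visit_threshold = visit_threshold
        self.visits = Counter()
        self.bindings = {}  # credential digest -> domain it belongs to

    def record_visit(self, domain):
        self.visits[domain] += 1

    def is_whitelisted(self, domain):
        # A site joins the whitelist once visited more than three times.
        return self.visits[domain] > self.visit_threshold

    def bind(self, secret, domain):
        self.bindings[_digest(secret)] = domain


def looks_like_phishing(profile, secret, destination):
    """Flag a known credential that is about to go to an unbound, non-whitelisted site."""
    if profile.is_whitelisted(destination):
        return False
    bound_domain = profile.bindings.get(_digest(secret))
    return bound_domain is not None and bound_domain != destination
```

The sketch also shows the limitation noted in the text: a credential that has never been bound to any site produces no warning, so a first-time submission cannot be judged.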
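To make the one-time password login of the server-side approach [16] concrete, here is a minimal Python sketch of issuing and verifying an OTP that expires. It is an assumption-laden illustration rather than the scheme of [16]: the `OtpService` class, the six-digit code format, the 120-second lifetime and the single-attempt rule are hypothetical, and delivery over the instant messaging channel is left out.

```python
import hmac
import secrets
import time


class OtpService:
    """Issues short-lived, single-use login codes (illustrative only)."""

    def __init__(self, ttl_seconds=120, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._pending = {}  # account -> (code, expiry time)

    def issue(self, account):
        # In the described scheme this code would be sent over the
        # secondary channel instead of being returned to the caller.
        code = f"{secrets.randbelow(10**6):06d}"
        self._pending[account] = (code, self._clock() + self._ttl)
        return code

    def verify(self, account, code):
        # pop() makes every issued code usable for one attempt only.
        entry = self._pending.pop(account, None)
        if entry is None:
            return False
        expected, expires_at = entry
        if self._clock() > expires_at:
            return False
        return hmac.compare_digest(expected.encode(), code.encode())
```

Consuming the code on the first attempt, successful or not, is a deliberate choice in this sketch: a phisher who relays a captured code after the user has logged in finds nothing left to verify against.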
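The two-class detection flow of the proposed scheme, a list-based check first and a classifier only when the lists cannot decide, can be sketched as follows. Everything concrete in this Python example is an assumption: the list contents, the function names and in particular `toy_classifier`, which is a stand-in rule and not the NB-SVM hybrid classifier proposed in the paper.

```python
from urllib.parse import urlsplit

# Hypothetical list entries, for illustration only.
BLACKLIST = {"phish.example"}
WHITELIST = {"bank.example"}


def list_based_check(url):
    """Non-computational class: return a label, or None if the lists cannot decide."""
    host = urlsplit(url).hostname or ""
    if host in BLACKLIST:
        return "phishing"
    if host in WHITELIST:
        return "legitimate"
    return None


def toy_classifier(url):
    """Placeholder for the computational class; NOT the NB-SVM classifier."""
    return "phishing" if "@" in url else "legitimate"


def detect(url, classify):
    """Try the cheap list lookup first and fall back to the classifier."""
    label = list_based_check(url)
    if label is not None:
        return label, "list"
    return classify(url), "classifier"
```

Returning the deciding stage alongside the label makes it easy to measure how often the expensive path is taken, which is the cost the non-computational class is meant to reduce.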
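The Map Reduce aggregation that the Transaction Fetch Component relies on can be illustrated with a small in-memory Python sketch. It is not a Hadoop job and does not reproduce the authors' mapper or reducer: the choice of the host name as key, the count as value, and the three function names are assumptions made for this example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_phase(urls):
    """Emit one (host, 1) pair per transaction URL."""
    for url in urls:
        yield (urlsplit(url).hostname or "", 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce all values that share a key to a single count."""
    return {key: sum(values) for key, values in grouped.items()}
```

For example, `reduce_phase(shuffle(map_phase(urls)))` yields one entry per host, so the later filter and classification stages can label each host once instead of once per transaction.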