=Paper=
{{Paper
|id=Vol-2311/paper_21
|storemode=property
|title=Address Fraud: Monkey Typed Address Classification for e-Commerce Applications
|pdfUrl=https://ceur-ws.org/Vol-2311/paper_21.pdf
|volume=Vol-2311
|authors=T. Ravindra Babu,Vishal Kakkar
|dblpUrl=https://dblp.org/rec/conf/sigir/BabuK17
}}
==Address Fraud: Monkey Typed Address Classification for e-Commerce Applications==
Address Fraud: Monkey Typed Address Classification for e-Commerce Applications T. Ravindra Babu Vishal Kakkar Flipkart Internet Private Limited Flipkart Internet Private Limited Bangalore, India Bangalore, India ravindra.bt@flipkart.com vishal.kakkar@flipkart.com ABSTRACT chain from order management system to supply chain leading to E-Commerce companies face a wide variety of fraudulent activities. ship corresponding packed item to delivery hub for delivery. This Machine Learning models are continuously built to detect them for results in avoidable mis-utlilization of resources. their effective mitigation. Randomly typed alphanumeric characters Address systems are well studied. Davis and Fonseca [7] discuss as customer addresses is one peculiar fraud. The orders are placed theory of addresses while discussing on certainty of locations in ad- with such addresses. We refer to them as monkey-typed addresses. dress geocoding systems. Address structures differ across countries. Accurate methods of identifying such addresses is important in In Japan, the houses are numbered according to their date of con- reducing operational cost. The current work presents machine struction and most streets do not carry names. In Korea, the houses learning based approaches that classify a given address as normal are numbered within a neighbourhood with reference to a hierar- or monkey-typed with a high accuracy. The approach integrates chy of area names instead of numbers. The cited work also notes address preprocessing, novel feature generation and classification. that in countries like India and Brazil, a complete and organised address database that could be correlated with their geolocations KEYWORDS is lacking. In view of absence of proper recommended structure for the addresses, it is empirically noticed that a given address is Geographical addresses, monkey-typed, machine learning, classifi- cation written with different hierarchy levels by different residents of the same dwelling. We work on such a system which has multiple vari- ACM Reference format: ants in writing an address to the same given location which we T. Ravindra Babu and Vishal Kakkar. 2017. Address Fraud: Monkey Typed refer to normal addresses in contrast to monkey-typed addresses. We Address Classification for e-Commerce Applications. In Proceedings of ACM SIGIR eCom Workshop, Tokyo, Japan, Aug 2017 (SIGIR eCom), 7 pages. discuss more elaborately on various aspects of such addresses and present sample addresses in Section 3.1. In order to classify these two classes of addresses, we propose 1 INTRODUCTION two novel solutions that classify monkey-typed addresses with very Online e-commerce companies facilitate its users to order items high classification accuracy. The methods integrate the following online and deliver them to the suggested addresses. Users register aspects. themselves by entering one or more of their addresses that would • Address preprocessing ensure that an ordered shipment is positively delivered to them. • Novel way of feature generation Thus the customer addresses form important part of e-commerce • Pattern Classification companies. A number of challenges in relation to the addresses, We organise the paper in the following manner. We present especially, in developing countries include[2] lack of pre-defined motivation for this work and discuss related literature in Section 2. structure, absence of geolocation information, spell variations in Section 3 contains proposed methods, discussion on challenges in address components due to different literacy levels in the population data, and preprocessing steps. Section 4 contains a discussion on of customers, etc. experimentation and results. We present approaches to solve class In order to avail price discounts or with possible fraudulent in- imbalance problem in Section 5. Section 6 contains summary and tent, there are occasions where orders are placed with addresses directions for further work. containing randomly typed alphanumeric characters. We refer to them as monkey-typed addresses in this work. Early identification of 2 MOTIVATION such addresses in the operations chain is necessary to reduce oper- ations cost. For example, when such addresses remain undetected, E-Commerce companies face a wide variety of fraudulent activi- it would lead to activation of a number of stages in the operations ties. They include payment frauds, return frauds, delivery frauds, lost-and-stolen articles, fake reviews and fraudulent ratings. Such Copyright © 2017 by the paper’s authors. Copying permitted for private and academic frauds are prevalent in both developing and developed countries. purposes. They are well discussed and many solutions are proposed [1, 3, 10, In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.): 15, 16]. The nature of fraud, which is the focus of the proposed Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published at http://ceur-ws.org work, was not reported earlier. It is interesting explore possible reasons for such a fraud. Generally, most e-Commerce platforms offer a number of attractive discounts. In order to reach a larger customer base, the discounts offered are limited to purchase of one SIGIR eCom, Aug 2017, Tokyo, Japan T. Ravindra Babu and Vishal Kakkar or a small number of units. Fraudulent users, presumably resellers, hype a target product. User generated reviews play significant role would order multiple items beyond the prescribed limit through in e-commerce retail. People with fraud intent spam the reviews false addresses. The resellers are profited by selling those products with an intent to distort the perception about the product. Fake in the market at a price higher than the discounted price. Since the online review detection is studied in [1]. The work presents a unsu- objective of the company is to serve larger customer base, such pervised, scalable framework that exploits the correlation of labels fraud defeats this purpose. In this context, they possibly resort to between users and products, in order to detect fraudulent users and monkey typed addresses in order not get detected through multiple online fake reviews. User profiling is used to detect fraud in mobile orders. With a nexus at later stages, the orders could get delivered communication networks [10]. The fraud in this case relates to ille- too. The orders that contain monkey typed addresses could also gal use of services which could cause loss to service provider. The possibly be created by fraudsters to cause a strain on the system. work provides multiple approaches to solve the problem, such as When unchecked at the time of order, such addresses, would lead feed-forward neural network for normal and fraud classes, Gauss- to mis-utilization of resources. This necessitates a solution to the ian mixture model for user behaviour modeling to detect sudden problem. The solution should be deployable and trackable. In view changes in behaviour, and cost-sensitive classification. of this, we approach this as a 2-class classification problem. We propose two approaches to classify a given user address as We have not come across any peer-reviewed, published work on anomalous or genuine. We discuss them in detail in Section 3. address fraud. Most of the discussions in blogs that relate address fraud [6] to false declaration of one’s address linked to credit card 3 PROPOSED METHODS fraud. The problem we propose to solve is unique, but falls in the The problem can be defined as, classification of a given address broad class of fraudulent activities that e-commerce companies face. into either normal or monkey-typed accurately and efficiently. Effi- In view of this, we discuss some closely related works. ciency is required for real time detection of such fraud in customer E-commerce companies face a number of fraudulent activities. on-boarding process. The solution is also useful to clean the his- They include frauds caused by buyers, sellers and logistic partners. torical database of customer data. A study of monkey-typed and Some relate to exploiting discounts offered by the companies to normal addresses initially indicated that a possible solution can resell the products for profit. The activities also include number be obtained through language modeling. We experimented with of fraudulent claims and counter claims about non-receipt of a standard language models and found as not effective as shown later genuine product, claims of receipt of part of shipments as against a in Section 3.3. In view of this, we propose to solve this problem in promised list, etc. Some are more subtle and difficult to detect. We the following two ways of pattern representation. discuss some such frauds and their solutions. (1) Approach-1: In this approach, we derive text features as Credit card fraud detection is one important activity. They are groups of characters. Through machine learning models, we broadly classified into two [3], viz., application fraud, where fraud- learn from data of those combinations of characters that are sters apply for new cards with false information and behavioural normal or abnormal and culminate in classification of an fraud, which include mail theft, stolen cards, counterfeit card, usage address. in the absence of the card holder, etc. In general, models incorpo- (2) Approach-2: In this approach, we exploit importance of rate binary classification approaches. The work provides a detailed vowels in English [5]. We mark vowels and non-vowels as account of credit card fraud, data mining approaches to solve the binary features. Regularity in the presence of vowel helps to problem, feature identification, and comparative experimental re- classify a normal address vis-a-vis an abnormal address. sults. The experiments are conducted on real-life dataset using logistic regression, support vector machines and random forests. The motivation for Approach-2 is its possible generality Trust fraud which relates seller fraud in hiking their own ratings across a large country like India, where different languages and reducing competitor’s ratings is rampant in e-commerce indus- are spoken and the address words would be distinctly differ- try [16]. The work provides interesting insights on its prevalence ent. However, the words even though expressed in English, in Chinese C2C (customer to customer) e-commerce industry. The do have similar vowel combination. It should be mentioned sources of fraud include professional scammers, and nexus between that the method would not work if (a) the syllables contain sellers, scammers, buyers, trust fraud companies and logistic com- only consonants or (b) addresses contain only numbers. We, panies. The work also traces historical evolution of trust models. It however, did not come across monkey typed addresses as proposes a dynamic trust model that integrates weightage for trans- numbers alone, possibly because the addresses are expressed action amount, time decay to account for recency of transactions, as characters and fraudsters mimic them with jumbled char- and trust factors such as quality, service and shipping speed. The acters. solution is shown to provide reliable trust scores. Anomalous be- haviour indicates potential fraud activity. Anomaly detection with We achieve this objective through the following steps of the edge-attributed graphs on complex heterogeneous networks is ex- proposed methods. plored in [15]. With product ratings by users as data, the proposed (1) Preprocess the addresses: The addresses at the time of work identifies networks of users that are anomalous. The proposed entry would contain a number of special characters such approach extends to heterogeneous networks, support independent as $, ^,%, *,/, etc. Earlier studies [2] show that they are not edge attributes and has ability to identify and rank anomalies. It helpful in classification of the addresses. Such terms are is shown to have high precision and is scalable. Opinion fraud [1] eliminated. The addresses also contain mixture of upper and relates to spamming the reviews either to defame or artificially lower case. They are all converted to lower case. Address Fraud: Monkey Typed Address Classification for e-Commerce Applications SIGIR eCom, Aug 2017, Tokyo, Japan (2) For Approach-1, generate fixed length features that Table 1: Sample Spell Variants identified through clustering are devoid of repititve substrings: The normal as well as monkey-typed addresses are of non-uniform length. In Spell variants of a word apartment as entered by users order to bring tractability such addresses are reduced to apartment apartmenet apartmanet apartmenst fixed length substrings that form features. We consider two apartmenrt apartmennt apartment appratment different lengths of substrings. Repeated substrings of char- aparatment appratmant aparment apartent acters are identified and removed. This is done so as to avoid apprment apartement apartament apartnment over-representation of repeating substrings. apprtement apartmment aparttment apartrment (3) For Approach-2, we make use of presence of vowels in apartememt apaprtment aparrtment apaertment an address: In this approach, we transform the address data apartmt apartme apartmernt apartmeant into an binary sequence of vowel and non-vowels. Thus nor- apartmemt apartmebt apatrtments apaternments mal and monkey-typed addresses form a sequence of binary apasrtments apartmensts apartmenmts aparntments features. The present work is applicable for terms contain- ing vowels like in Indian languages, although expressed in English. Some examples are koramangala, an area name in Banaglore, and dhaula kuan, an area name in Delhi. It may observe presence of many variants of these words when we study be noted here that lengths of different patterns containing a large set of customer entered addresses. binary values features are not same. In the given dataset, For example, in a dataset of about 10000 addresses of Bangalore, the maximum of pattern lengths is noted. It is extended by we notice about 18 different spell variations of an area name “kora- another 25% of its length in order to take care of unseen test mangala", 61 variants of “Bannerghatta Road", 263 variants of the patterns, say, l. Each pattern is appended with 0’s in order word "apartments". The variants are obtained through clustering till length of pattern reaches common length, l. the addresses [2]. Table 1 contains a sample of spell variants of the (4) Label the patterns as normal or monkey-typed addresses: word, “apartments". They occur possibly either due to an oversight In order to classify the addresses using supervised methods, while entering and due to varying literacy levels. In the preprocess- the addresses are labeled as normal and monkey-typed. This ing step, these variants are replaced by a representative term in is a manual activity. order to reduce the vocabulary. (5) Training, Test and Cross-validation: The dataset is di- Table 3 contains monkey-typed, and potentially fraudulent ad- vided into training and test datasets. We consider multiple dresses. It can be noted that they contain many combinations of such sets by randomly selecting from those two sets in a characters. Interestingly, some intelligent fraud customers, leave mutually exclusive manner. spaces while entering monkey typed addresses, to make them ap- (6) Classify the data and record the average performance: pear as normal addresses which typically contain multiple words. We carry out experiments, with different mutually exclusive The study of such addresses provide interesting insights. It is no- train-test combinations. The performance is presented. ticed that such customers also write random characters with a mix of upper and lower case characters. Thus the proposed solution For classifying a given address as normal or monkey-typed, it should overcome the issues of inter-word spaces, and mixed case is not possible to obtain all combinations of monkey-typing possi- letters. Further, in such randomly typed letters, either vowels are bilities. We initially place our efforts in understanding the conven- extremely rare or completely dominate and remain repetitive. tional addresses as well as monkey-typed addresses. Subsequently Figure 1 contains a word-cloud of a sample of such addresses after we devise methods for effective classification of such addresses. We splitting them into 4-character long strings. We can observe from discuss about the nuances of the data and its transformation as it the figure that frequently used substrings predominantly match passes through important stages of preprocessing. with successive letters on qwerty keyboard although there exist other combinations too. 3.1 Data Description We will further discuss challenges in the data and preprocessing The number of addresses is in the order of millions. Monkey-typed methods that are necessary for working with the proposed approach addresses form a very small percentage of the total address set. in the following section. Table 2 contains a sample of normal addresses. The table contains addresses as entered by the registered users. In the table, in order to 3.2 Address Data Preprocessing maintain privacy of the users, we replaced specific dwelling identity Both the classes of addresses of normal and monkey-typed have with x’s. It can be noticed from the data that a number of challenges many specific observations. Normal address data have the following exist in processing these addresses. The details that could have been challenges. provided in subsequent stages of address entry, such as city name, and ZIP code would form part of the main address. It should further (1) Control and special characters, that are inadvertently intro- be noted that some city names are inadvertently spelled in different duced during data entry, storage and retrieval. ways. Some examples are BANAGALORE in place of BANGALORE (2) Upper and lower case letters and Bhubaneswar in place of Bhubaneshwar. The preprocessing (3) Many spell variants of a given word models also need to account for some city names that are renamed (4) Missing spaces, such as, IndustrialLayout in place of Indus- officially such as {Benguluru, Bangalore} or {Gurgaon, Gurgram}. We trial Layout SIGIR eCom, Aug 2017, Tokyo, Japan T. Ravindra Babu and Vishal Kakkar Table 2: A Sample of Normal Addresses The monkey-typed addresses data have the following challenges. Sl. No. Address (1) Variable length strings with maximum number reaching 1 x-2xx, Beta 2 Greater Noida, Delhi nearly 100 characters. 2 x-x, POLO ROAD ,OLD GUPTA COLONY , (2) Many of them have repeated substrings such as asdfasdfgij OPPOSITE SHANI MANDIR, DELHI etc. 3 Plot no.-xxx/2xxxx, Jagamohan Nagar, (3) Some addresses are provided as multiple monkey-typed PO-Khandagiri, Near ITER College, Bhubaneswar strings so as to mimic a normal address. 4 No xxx,1st stage,4th cross, behind LVK Garments, (4) Combination of upper and lower case characters Opp SRS Water tank, PEENYA, BANAGALORE The above discussion presents a list of all the challenges present 5 XXX/X, KENDRIYA VIHAR, SEC-11, in addresses. For the problem under consideration, some challenges KHARGHAR,NAVI MUMBAI like accidental merging or separation of words is not important since we ultimately remove the spaces in the address while gener- Table 3: A Sample of Monkey-typed Addresses ating features as we present in the following discussion. We take following preprocessing or data preparation steps Sl. No. Address while generating features. 1 OEGVOQCS (1) Removing control and special characters, such as, $, ^,%, *,/, 2 ddfkddd (2) Reducing all data to lower case 3 afadfsf (3) Generating equivalent set of words with the help of cluster- 4 gdfgtdf ing of empirical data [2] 5 fjrjnvhejvnjdjdfogjfn vmfjgfnl (4) After incorporating these changes, identifying unique ad- dresses among the datasets (5) Further steps for Approach-1 • Remove spaces between address words to convert it into a single string without spaces • Split the strings into k-character substrings to form fea- tures, for k=4 and 8. • Blank spaces remain in the last string when the address string length is not divisible by substring length ‘k’. Re- place those blanks that remain after “n modulo k” split, with *’s, where n and k are lengths of full string and sub- string respectively. (6) Further steps for Approach-2 • Identify vowels. • Place ‘1’ for vowel and ‘0’ for the rest. For example, for the word anjaneyacomplex, highlighting vowels, it forms the word, anjaneyacomplex. The corresponding features are 100101010100010. • Append 0’s to arrive at common pattern length, as dis- cussed earlier. Figure 1: Word Cloud of Monkey-typed Addresses. The ad- dresses are split into 4-character long strings. The word Table 4 illustrates processing of the addresses for Approach-1. It cloud corresponds to them. contains addresses in three stages such as (a) raw form as entered by customers, (b) preprocessed and (c) feature form. We provide illustrations for both 8-character and 4-character long features. Dif- (5) Undesirable spaces such as, Industrial Lay out in place of ferent feature subsequence lengths are studied through experiments Industrial Layout and reported in Section 4. In the table, we consider two normally (6) Terms that do not help classification[2] with the proposed written addresses and three monkey-typed addresses. It is inter- solution framework, such as, ZIP codes, city names, time esting to note the variety of letter combinations and word spaces stamp, direction to reach address, landmarks, etc. that fraudsters use in writing monkey-typed address. The ‘*’s at (7) Repeated terms such as those which are accidentally typed the end of a word indicate blanks that are necessary to complete 8- twice while entering the addresses character bound. It should however be noted here that as discussed (8) Lack of structure where address-words appear at different in Section 3.2, repeat substrings are removed. locations by different users while writing an address of a Table 5 contains features for the preprocessed addresses for location. For example, 3rd Stage, BTM Layout is also written Approach-2. The preprocessed addresses are taken from Table 4. as BTM Layout, 3rd Stage. We consider each of these binary digits as features. Address Fraud: Monkey Typed Address Classification for e-Commerce Applications SIGIR eCom, Aug 2017, Tokyo, Japan Table 4: Addresses and Features for Approach-1 Table 5: Preprocessed Addresses and Features for Approach- 2 Preprocess Sample Data Stage Address sample and Sample Data Raw ADDRESS-ID Anjaneya Temple, om corresponding Addresses Shanthitemple road,Hegnahalli cross, Features SunkadaKatte Address-1 anjaneyatempleomshanthitempleroad ADDRESS-ID KR Layout, JP Nagar, hegnahallicrosssunkadakatte 4 Phase, Kalayana Magnum Techpark Features 10010101010001000010000100000110 ADDRESS-ID hshsshshshsdfhdhsh 010010100100100000010101001 ADDRESS-ID scbmdbsdvgjfsgk Address-2 krlayoutjpnagar4phasekalayanamagnum ADDRESS-ID GTYSF, KJHJUH, KJ, techpark KERALA, DJUHGJDHI,IDJIDJID, Features 00010110000101000010101010101010010 KGJFKHKDF, IJKJKL, FIJKJFGH, K 01000100 IJGFKHJ, DFKL, FI, FKIJKLDFL, FI Address-3 hshshshsdfhdhsh Preprocessed anjaneyatempleomshanthitempleroad Features 000000000000000 Addresses hegnahallicrosssunkadakatte Address-4 scbmdbsdvgjfsgk krlayoutjpnagar4phasekalayanamagnum Features 000000000000000 techpark Address-5 gtysfkjhjuhkjkeraladjuhgjdhiidjid hshshshsdfhdhsh jidkgj scbmdbsdvgjfsgk Features 00000000010000101010010000110010 gtysfkjhjuhkjkeraladjuhgjdhiidjidjidkgj 010000 fkjhkdfijkjklfijkjfghkijgfkhjdfklfifkijkldflfi Address-6 fkjhkdfijkjklfijkjfghkijgfkhj Features anjaneya templeom shanthit empleroa dfklfifkijkldflfi (8-char long) dhegnaha llicross sunkadak atte**** Features 00000001000000100000001000000 krlayout jpnagar4 tphaseba ngalaore 00000100100000001 hshshshs jshshs** scbmdbsd vgjfsgk* Table 6: n-gram scores for Monkey typed and Normal Ad- gtysfkjh juhkjker aladjuhg jdhiidji dresses using Language Model dkgjfkjh kdfijkjk lfijkjfg hkijgfkh jdfklfif kijkldfl fi****** Features anja neya temp leom shan thit empl eroa Sl Address Score(*10−3 ) (4-char long) dheg naha llic ross sunk adak atte No. n=3 n=7 krla yout jpna gar4 tpha seba ngal aore 1 6crosswilliamtown 9.4439 11.3258 hshs hshs jshs hs** 2 1floorlloydvillalloyd scbm dbsd vgjf sgk* roadcookertown 10.7745 15.3010 gtys fkjh juhk jker alad juhg jdhi idji 3 kjnlknmlmklnmklnm 9.4387 19.4918 dkgj fkjh kdfi jkjk lfij kjfg hkij gfkh jdfk lfif 4 mgrdtimesol 12.6973 13.3279 kijk ldfl fi** 0.020} and {0.002,0.28} which overlap. This motivated us to look at 3.3 Experiments with Language model alternate approaches to solve the problem. Initial approach to solve this problem is by using statistical lan- guage models, where a probability is assigned to a character se- 4 EXPERIMENTATION AND RESULTS quence, {w 1 , w 2 , . . . , w n }. It is expected that the scores for monkey Monkey typed addresses occur very rarely in a given dataset. But typed addresses such as gtysfkjhjuhkjkeraladjuhgjdhiidjidjid- they are present and need to be located through a model. The ratio kgj will be distinctly different from that of a normal address. We of normal addresses to monkey typed addresses leads to severe class conducted experiments with kylm language modeling toolkit [13] imbalance. Here we consider equal number of patterns for each and generated scores for all the addresses. We notice that the scores of the classes to demonstrate working of our proposed methods. overlap. We experimented with different values of n. Table 6 con- We consider 30,000 addresses with nearly equal number of normal tains scores for a sample set of addresses for n=3 and 7. In the table, and monkey-typed addresses. We present case studies and their we present 2 examples each of normal and monkey typed addresses corresponding datasets in Table 7. For cases 1-3, we consider 23 : 13 for n=3 and n=7. For n=3, it can observed that score for the monkey division between training and test datasets. As a first step, the data typed address kjnlknmlmklnmklnm is closer to normal address is randomly shuffled. Each time an exclusive set of training dataset placed in sl no.1. The score for the second monkey typed address of size 23 , is chosen and the remaining data is considered as test data, falls within the range of normal addresses, viz., {0.0091,0.042}. For thus resulting in three distinct cases. For cases 4-5, the training and n=7, the range for monkey typed and normal addresses are {0.0104, test data are considered as 12 : 12 . SIGIR eCom, Aug 2017, Tokyo, Japan T. Ravindra Babu and Vishal Kakkar Table 7: Case Studies and Datasets Table 8: Experimental Results for Approach-1 Case Training Test Case Preci- Recall F- No dataset dataset No sion(%) (%) Score 1 1-10000 10001-15000 8-Character long features 2 5001-15000 1-5000 1 97.75 91.30 94.42 3 1-5000, 2 97.81 90.94 94.25 10001-15000 5001-10000 3 97.39 91.00 94.09 4 1-7500 7501-15000 4 97.84 90.61 94.09 5 7501-15000 1-7500 5 97.86 90.33 93.95 6 1-10000 10001-15000 4-Character long features 7 5001-15000 1-5000 6 97.98 97.86 97.92 8 1-5000, 7 98.35 97.74 98.04 10001-15000 5001-10000 8 98.19 97.46 97.82 9 1-7500 7501-15000 9 98.58 97.72 98.05 10 7501-15000 1-7500 10 98.35 97.68 98.01 Table 9: Experimental Results for Approach-2 Table 7 contains serial number of case studies and corresponding datasets. The pattern numbers presented in Column 2 and 3 of the Case Preci- Recall F- table correspond to training and test datasets of each of the classes. No sion(%) (%) Score For example, Case-3 indicates that from randomly shuffled training 1 96.81 96.81 96.81 data corresponding to class-1 and class-2, {1 to 5000} and {10001 2 96.76 96.75 96.75 to 15000} patterns each are considered for training and the rest 3 96.81 96.81 96.81 as test data. The training data and their corresponding labels are 4 96.81 96.80 96.80 considered accordingly. 5 96.75 96.72 96.72 The data is subjected to 2-class classification using linear SVM and Naive Bayes classifier. The performance is comparable. For the sake of brevity, we present the results of the exercises using Percentage of monkey FPR FNR linear SVM alone for both the methods. Table 8 contains results typed addresses corresponding to Approach-1. The precision, recall and F-score in training data numbers are presented in columns 2,3 and 4 for the cases considered 5 0.002 0.655 in Table 7. 10 0.002 0.4968 Experiments are carried out with different character lengths for 20 0.0042 0.0424 the features such as 4,8, and 16. We observe that the performance is 50 0.0062 0.0176 better for 4 and 8. Further, the results with 4-character long features 75 0.007 0.011 show better performance as evident from the table. Table 10: FPR and FNR for different proportion of monkey Performance of 4-character long features perform nearly the typed addresses in the training data same for both the data splits of 23 : 13 and 12 : 12 . The best and average classification accuracy of 98% is obtained for features of length 4 and training-test data split of 12 : 21 . Table 9 contains results with Approach-2 for the same cases as The classification time per pattern is 38 microsec on Intel i3, listed in Table 7. The performance of all the five cases is nearly the 2.6Ghz, 4-core, 8GB RAM machine. This is well within the single same. Approach-1 shows superior performance to Approach-2, in digit millisec latency requirement of the operational system. general. It is interesting to see how some failures have happened in the The true positive rate and false negative rates for Case-4 of classification. We look at an example of a normal address that is Table 7 for different proportion of monkey-typed and normal ad- classified as monkey typed. The address ’microwave lab Dept: ECEE, dresses are provided in Table 10. This is provided in order to in- C04, IISC’, when split into 4-char features, amounts to ’micr owav dicate performance under varying natural rates of monkey typed elab dept ecee c04i isc*’. Such combination of characters, such as addresses ranging from 5% to 75%. ‘elab’, ‘isc*’, ’owav’, etc., occur in monkey typed addresses as well. Although Approach-1 outperforms the other approach, deploy- This is the possible reason for its misclassification. Consider the ing the solution across the country requires training of different address which is monkey typed but classified as normal address, viz., combination of k-char groups, since the language spoken, names of ’figb irds igjr educ tion sjfk kuit ebgo b***. It has mix of monkey-typed locality, and ethnic words are distinct in different regions. Approach- 4-char features, such as, ‘igjr’,‘sjfk’, and ‘edgo’ and few features that 2 is immune to basic nuances of words and focus on pronounceabil- occur in normal addresses, such as, ‘educ’, and ‘tion’. This possibly ity of words. Thus, it is a more generic pattern representation. resulted in its classification as a normal address. Address Fraud: Monkey Typed Address Classification for e-Commerce Applications SIGIR eCom, Aug 2017, Tokyo, Japan An important aspect of the deployment is dealing with class im- feature lengths of 4 and 8 for Approach-1 and features marking balance. We provide a discussion in Section 5 on various approaches vowels for Approach-2. The performance is similar in both the to tackle imbalance. approaches. Further work includes reducing the number of features through 5 CLASS IMBALANCE AND MITIGATION optimal feature selection approaches, experimenting with models that could work across different geographical regions within a large Like in many fraud or spam classification problems, monkey-typed country and adapting to class imbalance. addresses form a very small percentage of combined class dataset leading to severe class imbalance [4, 9, 11, 14]. The problem formu- ACKNOWLEDGMENTS lation, feature selection, approaches to solve the problem and the experimental results indicate that the chosen approaches lead to The authors would like to thank Aditya Rachakonda for his helpful effective solution to the problem. As a next step, we need to make comments. it operational under class imbalance too. The costs of misclassifi- cation [8, 12] of a monkey typed address and normal address are REFERENCES [1] Leman Akoglu, Rishi Chandy, and Christos Faloutsos. 2013. Opinion Fraud not same. We discuss class imbalance problem and some important Detection in Online Reviews by Network Effects. In ICWSM. 2–11. approaches to mitigate this problem. [2] T. Ravindra Babu, A. Chatterjee, S. Khandeparker, A.V. Subhash, and S. Gupta. Consider a 2-class classification scenario, where the number 2015. Geographical address classification without using geolocation coordinates. In Proceedings of the 9th Workshop on Geographic Information Retrieval. ACM. of patterns belonging to one class far exceeds those belonging to [3] S. Bhattacharyya, Jha, K. Tharakunnel, and J.C. Westland. 2011. Data mining for second class. Such data is termed imbalanced between the classes. credit card fraud: A comparative study. Decision Support Systems 50, 3 (2011), 602–613. Finding an effective classification model in such a scenario is chal- [4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE: lenging, since many classification models favour the majority class. synthetic minority over-sampling technique. Journal of artificial intelligence In many practical problems, the minority class represents fraudu- research 16 (2002), 321–357. [5] R.A. Cole, Y. Yan, B. Mak, M. Fanty, and T. Bailey. 1996. The contribution of lent patterns. Misclassification of fraudulent patterns as genuine consonants versus vowels to word recognition in fluent speech. In Proccedings ones is expensive. Some conventional approaches to solve the prob- of IEEE Conf. on Acoustics, Speech, and Signal Processing, 1996. ICASSP-96, Vol. 2. lem are under-sampling majority class till balance is obtained, over- 853–856. [6] Sharon Curry. 2000. An inside look at e-commerce fraud - Prevention and sampling the minority class in similar manner and choice of unequal solutions. http://www.scambusters.org/ecommercefraud.pdf (2000). classification costs. In case of under-sampling, when representa- [7] C. A. Davis-Jr. and F. T. Fonseca. 2007. Assessing the Certainty of Locations Produced by an Address Geocoding System. Geoinformatica 11 (2007), 103–129. tive patterns intra-class variations are eliminated, the classification [8] C. Elkan. 2001. The foundations of cost-sensitive learning. In International joint accuracy suffers. In case of over-sampling of existing patterns of conference on artificial intelligence, Vol. 17. Lawrence Erlbaum Associates Ltd. minority class, it is not guaranteed to arrive at inter-class discrimina- [9] H. He and E. A. Garcia. 2009. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering 21, 9 (2009), 1263–1284. tive boundary. In the last 15 years, a number of intelligent sampling [10] J. Hollmen. 2000. User profiling and classification for fraud detection in mobile methods, boosting and combination of sampling and boosting, are communications networks. Helsinki University of Technology. studied to effectively overcome these issues [4, 14]. Synthetic Mi- [11] N. Japkowicz. 2000. The class imbalance problem: Significance and strategies. In In Proc. of the International Conf. on Artificial Intelligence. nority Over-sampling Technique (SMOTE) [4] marks intelligent [12] M. Maloof. 2003. Learning when data sets are imbalanced and when costs are sampling approach where a combination of over-sampling and unequal and unknown. In ICML-2003 workshop on learning from imbalanced data sets II, Vol. 2. under-sampling is resorted to, to demonstrate its superior perfor- [13] Graham Neubig. [n. d.]. Kylm - The kyoto Language Modeling Toolkit. ([n. d.]). mance using area under Receiver Operating Characterstic (ROC) http://www.phontron.com/kylm/ curve. Subsequent extensions of SMOTE and many variants of [14] C. Seiffert, T. M. Khoshgoftaar, J. V. Hulse, and Amri Napolitano. 2010. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, boosting are outlined in [14], where RusBoost is shown to outper- Man, and Cybernetics-Part A: Systems and Humans 40, 1 (2010), 185–197. form the rest of the approaches. [15] N. Shah, Alex Beutel, Bryan Hooi, Leman Akoglu, Stephan Gunnemann, Disha Adapting the above discussed approaches would help mitigating Makhija, Mohit Kumar, , and Christos Faloutsos. 2016. EdgeCentric: Anomaly Detection in Edge-Attributed Networks. In In Data Mining Workshops (ICDMW), class imbalance problem for monkey-typed address classification. 2016 IEEE 16th International Conference on. 327–334. [16] Y. Zhang, J. Bian, and W. Zhu. 2013. Trust fraud: A crucial challenge for China’s e-commerce market. Electronic Commerce Research and Applications 12, 5 (2013), 6 SUMMARY AND FURTHER WORK 299–308. We consider a practical problem that affects e-commerce companies where potentially fraudulent addresses containing a single or multi- ple strings of randomly typed addresses occur. We refer to them as monkey-typed addresses. When such addresses remain undetected, they would amount to loss of revenue and cause strain on resources at various stages of processing, starting from order checkout to logistics. The problem is challenging. Through an initial study, we observed that conventional language modeling approaches do not generalise well. We present two novel solutions (a) by forming fixed-length features and (b) by exploiting inter-vowel distance, after subjecting the addresses to a number of preprocessing steps. We present the data challenges, preprocessing steps, proposed ap- proaches and feature generation. We carry out experiments for