=Paper=
{{Paper
|id=Vol-2311/paper_21
|storemode=property
|title=Address Fraud: Monkey Typed Address Classification for e-Commerce Applications
|pdfUrl=https://ceur-ws.org/Vol-2311/paper_21.pdf
|volume=Vol-2311
|authors=T. Ravindra Babu,Vishal Kakkar
|dblpUrl=https://dblp.org/rec/conf/sigir/BabuK17
}}
==Address Fraud: Monkey Typed Address Classification for e-Commerce Applications==
<pdf width="1500px">https://ceur-ws.org/Vol-2311/paper_21.pdf</pdf>
<pre>
          Address Fraud: Monkey Typed Address Classification for
                         e-Commerce Applications
                             T. Ravindra Babu                                                               Vishal Kakkar
                     Flipkart Internet Private Limited                                             Flipkart Internet Private Limited
                             Bangalore, India                                                               Bangalore, India
                         ravindra.bt@flipkart.com                                                     vishal.kakkar@flipkart.com

ABSTRACT                                                                              chain from order management system to supply chain leading to
E-Commerce companies face a wide variety of fraudulent activities.                    ship corresponding packed item to delivery hub for delivery. This
Machine Learning models are continuously built to detect them for                     results in avoidable mis-utlilization of resources.
their effective mitigation. Randomly typed alphanumeric characters                        Address systems are well studied. Davis and Fonseca [7] discuss
as customer addresses is one peculiar fraud. The orders are placed                    theory of addresses while discussing on certainty of locations in ad-
with such addresses. We refer to them as monkey-typed addresses.                      dress geocoding systems. Address structures differ across countries.
Accurate methods of identifying such addresses is important in                        In Japan, the houses are numbered according to their date of con-
reducing operational cost. The current work presents machine                          struction and most streets do not carry names. In Korea, the houses
learning based approaches that classify a given address as normal                     are numbered within a neighbourhood with reference to a hierar-
or monkey-typed with a high accuracy. The approach integrates                         chy of area names instead of numbers. The cited work also notes
address preprocessing, novel feature generation and classification.                   that in countries like India and Brazil, a complete and organised
                                                                                      address database that could be correlated with their geolocations
KEYWORDS                                                                              is lacking. In view of absence of proper recommended structure
                                                                                      for the addresses, it is empirically noticed that a given address is
Geographical addresses, monkey-typed, machine learning, classifi-
cation                                                                                written with different hierarchy levels by different residents of the
                                                                                      same dwelling. We work on such a system which has multiple vari-
ACM Reference format:                                                                 ants in writing an address to the same given location which we
T. Ravindra Babu and Vishal Kakkar. 2017. Address Fraud: Monkey Typed
                                                                                      refer to normal addresses in contrast to monkey-typed addresses. We
Address Classification for e-Commerce Applications. In Proceedings of ACM
SIGIR eCom Workshop, Tokyo, Japan, Aug 2017 (SIGIR eCom), 7 pages.
                                                                                      discuss more elaborately on various aspects of such addresses and
                                                                                      present sample addresses in Section 3.1.
                                                                                          In order to classify these two classes of addresses, we propose
1    INTRODUCTION                                                                     two novel solutions that classify monkey-typed addresses with very
Online e-commerce companies facilitate its users to order items                       high classification accuracy. The methods integrate the following
online and deliver them to the suggested addresses. Users register                    aspects.
themselves by entering one or more of their addresses that would                          • Address preprocessing
ensure that an ordered shipment is positively delivered to them.                          • Novel way of feature generation
Thus the customer addresses form important part of e-commerce                             • Pattern Classification
companies. A number of challenges in relation to the addresses,
                                                                                         We organise the paper in the following manner. We present
especially, in developing countries include[2] lack of pre-defined
                                                                                      motivation for this work and discuss related literature in Section 2.
structure, absence of geolocation information, spell variations in
                                                                                      Section 3 contains proposed methods, discussion on challenges in
address components due to different literacy levels in the population
                                                                                      data, and preprocessing steps. Section 4 contains a discussion on
of customers, etc.
                                                                                      experimentation and results. We present approaches to solve class
   In order to avail price discounts or with possible fraudulent in-
                                                                                      imbalance problem in Section 5. Section 6 contains summary and
tent, there are occasions where orders are placed with addresses
                                                                                      directions for further work.
containing randomly typed alphanumeric characters. We refer to
them as monkey-typed addresses in this work. Early identification of
                                                                                      2   MOTIVATION
such addresses in the operations chain is necessary to reduce oper-
ations cost. For example, when such addresses remain undetected,
                                                                                         E-Commerce companies face a wide variety of fraudulent activi-
it would lead to activation of a number of stages in the operations
                                                                                      ties. They include payment frauds, return frauds, delivery frauds,
                                                                                      lost-and-stolen articles, fake reviews and fraudulent ratings. Such
Copyright © 2017 by the paper’s authors. Copying permitted for private and academic   frauds are prevalent in both developing and developed countries.
purposes.                                                                             They are well discussed and many solutions are proposed [1, 3, 10,
In: J. Degenhardt, S. Kallumadi, M. de Rijke, L. Si, A. Trotman, Y. Xu (eds.):        15, 16]. The nature of fraud, which is the focus of the proposed
Proceedings of the SIGIR 2017 eCom workshop, August 2017, Tokyo, Japan, published
at http://ceur-ws.org                                                                 work, was not reported earlier. It is interesting explore possible
                                                                                      reasons for such a fraud. Generally, most e-Commerce platforms
                                                                                      offer a number of attractive discounts. In order to reach a larger
                                                                                      customer base, the discounts offered are limited to purchase of one
SIGIR eCom, Aug 2017, Tokyo, Japan                                                                       T. Ravindra Babu and Vishal Kakkar


or a small number of units. Fraudulent users, presumably resellers,      hype a target product. User generated reviews play significant role
would order multiple items beyond the prescribed limit through           in e-commerce retail. People with fraud intent spam the reviews
false addresses. The resellers are profited by selling those products    with an intent to distort the perception about the product. Fake
in the market at a price higher than the discounted price. Since the     online review detection is studied in [1]. The work presents a unsu-
objective of the company is to serve larger customer base, such          pervised, scalable framework that exploits the correlation of labels
fraud defeats this purpose. In this context, they possibly resort to     between users and products, in order to detect fraudulent users and
monkey typed addresses in order not get detected through multiple        online fake reviews. User profiling is used to detect fraud in mobile
orders. With a nexus at later stages, the orders could get delivered     communication networks [10]. The fraud in this case relates to ille-
too. The orders that contain monkey typed addresses could also           gal use of services which could cause loss to service provider. The
possibly be created by fraudsters to cause a strain on the system.       work provides multiple approaches to solve the problem, such as
When unchecked at the time of order, such addresses, would lead          feed-forward neural network for normal and fraud classes, Gauss-
to mis-utilization of resources. This necessitates a solution to the     ian mixture model for user behaviour modeling to detect sudden
problem. The solution should be deployable and trackable. In view        changes in behaviour, and cost-sensitive classification.
of this, we approach this as a 2-class classification problem.              We propose two approaches to classify a given user address as
    We have not come across any peer-reviewed, published work on         anomalous or genuine. We discuss them in detail in Section 3.
address fraud. Most of the discussions in blogs that relate address
fraud [6] to false declaration of one’s address linked to credit card    3    PROPOSED METHODS
fraud. The problem we propose to solve is unique, but falls in the        The problem can be defined as, classification of a given address
broad class of fraudulent activities that e-commerce companies face.     into either normal or monkey-typed accurately and efficiently. Effi-
In view of this, we discuss some closely related works.                  ciency is required for real time detection of such fraud in customer
    E-commerce companies face a number of fraudulent activities.         on-boarding process. The solution is also useful to clean the his-
They include frauds caused by buyers, sellers and logistic partners.     torical database of customer data. A study of monkey-typed and
Some relate to exploiting discounts offered by the companies to          normal addresses initially indicated that a possible solution can
resell the products for profit. The activities also include number       be obtained through language modeling. We experimented with
of fraudulent claims and counter claims about non-receipt of a           standard language models and found as not effective as shown later
genuine product, claims of receipt of part of shipments as against a     in Section 3.3. In view of this, we propose to solve this problem in
promised list, etc. Some are more subtle and difficult to detect. We     the following two ways of pattern representation.
discuss some such frauds and their solutions.                                (1) Approach-1: In this approach, we derive text features as
    Credit card fraud detection is one important activity. They are              groups of characters. Through machine learning models, we
broadly classified into two [3], viz., application fraud, where fraud-           learn from data of those combinations of characters that are
sters apply for new cards with false information and behavioural                 normal or abnormal and culminate in classification of an
fraud, which include mail theft, stolen cards, counterfeit card, usage           address.
in the absence of the card holder, etc. In general, models incorpo-          (2) Approach-2: In this approach, we exploit importance of
rate binary classification approaches. The work provides a detailed              vowels in English [5]. We mark vowels and non-vowels as
account of credit card fraud, data mining approaches to solve the                binary features. Regularity in the presence of vowel helps to
problem, feature identification, and comparative experimental re-                classify a normal address vis-a-vis an abnormal address.
sults. The experiments are conducted on real-life dataset using
logistic regression, support vector machines and random forests.                The motivation for Approach-2 is its possible generality
Trust fraud which relates seller fraud in hiking their own ratings              across a large country like India, where different languages
and reducing competitor’s ratings is rampant in e-commerce indus-               are spoken and the address words would be distinctly differ-
try [16]. The work provides interesting insights on its prevalence              ent. However, the words even though expressed in English,
in Chinese C2C (customer to customer) e-commerce industry. The                  do have similar vowel combination. It should be mentioned
sources of fraud include professional scammers, and nexus between               that the method would not work if (a) the syllables contain
sellers, scammers, buyers, trust fraud companies and logistic com-              only consonants or (b) addresses contain only numbers. We,
panies. The work also traces historical evolution of trust models. It           however, did not come across monkey typed addresses as
proposes a dynamic trust model that integrates weightage for trans-             numbers alone, possibly because the addresses are expressed
action amount, time decay to account for recency of transactions,               as characters and fraudsters mimic them with jumbled char-
and trust factors such as quality, service and shipping speed. The              acters.
solution is shown to provide reliable trust scores. Anomalous be-
haviour indicates potential fraud activity. Anomaly detection with         We achieve this objective through the following steps of the
edge-attributed graphs on complex heterogeneous networks is ex-          proposed methods.
plored in [15]. With product ratings by users as data, the proposed          (1) Preprocess the addresses: The addresses at the time of
work identifies networks of users that are anomalous. The proposed               entry would contain a number of special characters such
approach extends to heterogeneous networks, support independent                  as $, ^,%, *,/, etc. Earlier studies [2] show that they are not
edge attributes and has ability to identify and rank anomalies. It               helpful in classification of the addresses. Such terms are
is shown to have high precision and is scalable. Opinion fraud [1]               eliminated. The addresses also contain mixture of upper and
relates to spamming the reviews either to defame or artificially                 lower case. They are all converted to lower case.
Address Fraud: Monkey Typed Address Classification for e-Commerce Applications                          SIGIR eCom, Aug 2017, Tokyo, Japan


   (2) For Approach-1, generate fixed length features that                Table 1: Sample Spell Variants identified through clustering
       are devoid of repititve substrings: The normal as well
       as monkey-typed addresses are of non-uniform length. In                  Spell variants of a word apartment as entered by users
       order to bring tractability such addresses are reduced to                apartment apartmenet apartmanet apartmenst
       fixed length substrings that form features. We consider two              apartmenrt apartmennt apartment appratment
       different lengths of substrings. Repeated substrings of char-            aparatment appratmant aparment apartent
       acters are identified and removed. This is done so as to avoid           apprment apartement apartament apartnment
       over-representation of repeating substrings.                             apprtement apartmment aparttment apartrment
   (3) For Approach-2, we make use of presence of vowels in                     apartememt apaprtment aparrtment apaertment
       an address: In this approach, we transform the address data              apartmt apartme apartmernt apartmeant
       into an binary sequence of vowel and non-vowels. Thus nor-               apartmemt apartmebt apatrtments apaternments
       mal and monkey-typed addresses form a sequence of binary                 apasrtments apartmensts apartmenmts aparntments
       features. The present work is applicable for terms contain-
       ing vowels like in Indian languages, although expressed in
       English. Some examples are koramangala, an area name in
       Banaglore, and dhaula kuan, an area name in Delhi. It may          observe presence of many variants of these words when we study
       be noted here that lengths of different patterns containing        a large set of customer entered addresses.
       binary values features are not same. In the given dataset,            For example, in a dataset of about 10000 addresses of Bangalore,
       the maximum of pattern lengths is noted. It is extended by         we notice about 18 different spell variations of an area name “kora-
       another 25% of its length in order to take care of unseen test     mangala", 61 variants of “Bannerghatta Road", 263 variants of the
       patterns, say, l. Each pattern is appended with 0’s in order       word "apartments". The variants are obtained through clustering
       till length of pattern reaches common length, l.                   the addresses [2]. Table 1 contains a sample of spell variants of the
   (4) Label the patterns as normal or monkey-typed addresses:            word, “apartments". They occur possibly either due to an oversight
       In order to classify the addresses using supervised methods,       while entering and due to varying literacy levels. In the preprocess-
       the addresses are labeled as normal and monkey-typed. This         ing step, these variants are replaced by a representative term in
       is a manual activity.                                              order to reduce the vocabulary.
   (5) Training, Test and Cross-validation: The dataset is di-               Table 3 contains monkey-typed, and potentially fraudulent ad-
       vided into training and test datasets. We consider multiple        dresses. It can be noted that they contain many combinations of
       such sets by randomly selecting from those two sets in a           characters. Interestingly, some intelligent fraud customers, leave
       mutually exclusive manner.                                         spaces while entering monkey typed addresses, to make them ap-
   (6) Classify the data and record the average performance:              pear as normal addresses which typically contain multiple words.
       We carry out experiments, with different mutually exclusive        The study of such addresses provide interesting insights. It is no-
       train-test combinations. The performance is presented.             ticed that such customers also write random characters with a mix
                                                                          of upper and lower case characters. Thus the proposed solution
   For classifying a given address as normal or monkey-typed, it
                                                                          should overcome the issues of inter-word spaces, and mixed case
is not possible to obtain all combinations of monkey-typing possi-
                                                                          letters. Further, in such randomly typed letters, either vowels are
bilities. We initially place our efforts in understanding the conven-
                                                                          extremely rare or completely dominate and remain repetitive.
tional addresses as well as monkey-typed addresses. Subsequently
                                                                             Figure 1 contains a word-cloud of a sample of such addresses after
we devise methods for effective classification of such addresses. We
                                                                          splitting them into 4-character long strings. We can observe from
discuss about the nuances of the data and its transformation as it
                                                                          the figure that frequently used substrings predominantly match
passes through important stages of preprocessing.
                                                                          with successive letters on qwerty keyboard although there exist
                                                                          other combinations too.
3.1    Data Description                                                      We will further discuss challenges in the data and preprocessing
The number of addresses is in the order of millions. Monkey-typed         methods that are necessary for working with the proposed approach
addresses form a very small percentage of the total address set.          in the following section.
Table 2 contains a sample of normal addresses. The table contains
addresses as entered by the registered users. In the table, in order to   3.2    Address Data Preprocessing
maintain privacy of the users, we replaced specific dwelling identity
                                                                          Both the classes of addresses of normal and monkey-typed have
with x’s. It can be noticed from the data that a number of challenges
                                                                          many specific observations. Normal address data have the following
exist in processing these addresses. The details that could have been
                                                                          challenges.
provided in subsequent stages of address entry, such as city name,
and ZIP code would form part of the main address. It should further          (1) Control and special characters, that are inadvertently intro-
be noted that some city names are inadvertently spelled in different             duced during data entry, storage and retrieval.
ways. Some examples are BANAGALORE in place of BANGALORE                     (2) Upper and lower case letters
and Bhubaneswar in place of Bhubaneshwar. The preprocessing                  (3) Many spell variants of a given word
models also need to account for some city names that are renamed             (4) Missing spaces, such as, IndustrialLayout in place of Indus-
officially such as {Benguluru, Bangalore} or {Gurgaon, Gurgram}. We              trial Layout
SIGIR eCom, Aug 2017, Tokyo, Japan                                                                  T. Ravindra Babu and Vishal Kakkar

             Table 2: A Sample of Normal Addresses                      The monkey-typed addresses data have the following challenges.

   Sl. No.    Address                                                   (1) Variable length strings with maximum number reaching
      1       x-2xx, Beta 2 Greater Noida, Delhi                            nearly 100 characters.
      2       x-x, POLO ROAD ,OLD GUPTA COLONY ,                        (2) Many of them have repeated substrings such as asdfasdfgij
              OPPOSITE SHANI MANDIR, DELHI                                  etc.
     3        Plot no.-xxx/2xxxx, Jagamohan Nagar,                      (3) Some addresses are provided as multiple monkey-typed
              PO-Khandagiri, Near ITER College, Bhubaneswar                 strings so as to mimic a normal address.
     4        No xxx,1st stage,4th cross, behind LVK Garments,          (4) Combination of upper and lower case characters
              Opp SRS Water tank, PEENYA, BANAGALORE                    The above discussion presents a list of all the challenges present
     5        XXX/X, KENDRIYA VIHAR, SEC-11,                         in addresses. For the problem under consideration, some challenges
              KHARGHAR,NAVI MUMBAI                                   like accidental merging or separation of words is not important
                                                                     since we ultimately remove the spaces in the address while gener-
         Table 3: A Sample of Monkey-typed Addresses                 ating features as we present in the following discussion.
                                                                        We take following preprocessing or data preparation steps
              Sl. No.   Address                                      while generating features.
                 1      OEGVOQCS
                                                                        (1) Removing control and special characters, such as, $, ^,%, *,/,
                 2      ddfkddd
                                                                        (2) Reducing all data to lower case
                 3      afadfsf
                                                                        (3) Generating equivalent set of words with the help of cluster-
                 4      gdfgtdf
                                                                            ing of empirical data [2]
                 5      fjrjnvhejvnjdjdfogjfn vmfjgfnl
                                                                        (4) After incorporating these changes, identifying unique ad-
                                                                            dresses among the datasets
                                                                        (5) Further steps for Approach-1
                                                                            • Remove spaces between address words to convert it into
                                                                              a single string without spaces
                                                                            • Split the strings into k-character substrings to form fea-
                                                                              tures, for k=4 and 8.
                                                                            • Blank spaces remain in the last string when the address
                                                                              string length is not divisible by substring length ‘k’. Re-
                                                                              place those blanks that remain after “n modulo k” split,
                                                                              with *’s, where n and k are lengths of full string and sub-
                                                                              string respectively.
                                                                        (6) Further steps for Approach-2
                                                                            • Identify vowels.
                                                                            • Place ‘1’ for vowel and ‘0’ for the rest. For example, for
                                                                              the word anjaneyacomplex, highlighting vowels, it forms
                                                                              the word, anjaneyacomplex. The corresponding features
                                                                              are 100101010100010.
                                                                            • Append 0’s to arrive at common pattern length, as dis-
                                                                              cussed earlier.
Figure 1: Word Cloud of Monkey-typed Addresses. The ad-
dresses are split into 4-character long strings. The word                Table 4 illustrates processing of the addresses for Approach-1. It
cloud corresponds to them.                                           contains addresses in three stages such as (a) raw form as entered
                                                                     by customers, (b) preprocessed and (c) feature form. We provide
                                                                     illustrations for both 8-character and 4-character long features. Dif-
  (5) Undesirable spaces such as, Industrial Lay out in place of     ferent feature subsequence lengths are studied through experiments
      Industrial Layout                                              and reported in Section 4. In the table, we consider two normally
  (6) Terms that do not help classification[2] with the proposed     written addresses and three monkey-typed addresses. It is inter-
      solution framework, such as, ZIP codes, city names, time       esting to note the variety of letter combinations and word spaces
      stamp, direction to reach address, landmarks, etc.             that fraudsters use in writing monkey-typed address. The ‘*’s at
  (7) Repeated terms such as those which are accidentally typed      the end of a word indicate blanks that are necessary to complete 8-
      twice while entering the addresses                             character bound. It should however be noted here that as discussed
  (8) Lack of structure where address-words appear at different      in Section 3.2, repeat substrings are removed.
      locations by different users while writing an address of a         Table 5 contains features for the preprocessed addresses for
      location. For example, 3rd Stage, BTM Layout is also written   Approach-2. The preprocessed addresses are taken from Table 4.
      as BTM Layout, 3rd Stage.                                      We consider each of these binary digits as features.
Address Fraud: Monkey Typed Address Classification for e-Commerce Applications                                 SIGIR eCom, Aug 2017, Tokyo, Japan

       Table 4: Addresses and Features for Approach-1                           Table 5: Preprocessed Addresses and Features for Approach-
                                                                                2
   Preprocess         Sample Data
   Stage                                                                         Address sample and       Sample Data
   Raw                ADDRESS-ID Anjaneya Temple, om                             corresponding
   Addresses          Shanthitemple road,Hegnahalli cross,                       Features
                      SunkadaKatte                                               Address-1                anjaneyatempleomshanthitempleroad
                      ADDRESS-ID KR Layout, JP Nagar,                                                     hegnahallicrosssunkadakatte
                      4 Phase, Kalayana Magnum Techpark                          Features                 10010101010001000010000100000110
                      ADDRESS-ID hshsshshshsdfhdhsh                                                       010010100100100000010101001
                      ADDRESS-ID scbmdbsdvgjfsgk                                 Address-2                krlayoutjpnagar4phasekalayanamagnum
                      ADDRESS-ID GTYSF, KJHJUH, KJ,                                                       techpark
                      KERALA, DJUHGJDHI,IDJIDJID,                                Features                 00010110000101000010101010101010010
                      KGJFKHKDF, IJKJKL, FIJKJFGH, K                                                      01000100
                      IJGFKHJ, DFKL, FI, FKIJKLDFL, FI                           Address-3                hshshshsdfhdhsh
   Preprocessed       anjaneyatempleomshanthitempleroad                          Features                 000000000000000
   Addresses          hegnahallicrosssunkadakatte                                Address-4                scbmdbsdvgjfsgk
                      krlayoutjpnagar4phasekalayanamagnum                        Features                 000000000000000
                      techpark                                                   Address-5                gtysfkjhjuhkjkeraladjuhgjdhiidjid
                      hshshshsdfhdhsh                                                                     jidkgj
                      scbmdbsdvgjfsgk                                            Features                 00000000010000101010010000110010
                      gtysfkjhjuhkjkeraladjuhgjdhiidjidjidkgj                                             010000
                      fkjhkdfijkjklfijkjfghkijgfkhjdfklfifkijkldflfi             Address-6                fkjhkdfijkjklfijkjfghkijgfkhj
   Features           anjaneya templeom shanthit empleroa                                                 dfklfifkijkldflfi
   (8-char long)      dhegnaha llicross sunkadak atte****                        Features                 00000001000000100000001000000
                      krlayout jpnagar4 tphaseba ngalaore                                                 00000100100000001
                      hshshshs jshshs**
                      scbmdbsd vgjfsgk*
                                                                                Table 6: n-gram scores for Monkey typed and Normal Ad-
                      gtysfkjh juhkjker aladjuhg jdhiidji
                                                                                dresses using Language Model
                      dkgjfkjh kdfijkjk lfijkjfg hkijgfkh jdfklfif
                      kijkldfl fi******
   Features           anja neya temp leom shan thit empl eroa                             Sl    Address                         Score(*10−3 )
   (4-char long)      dheg naha llic ross sunk adak atte                                  No.                                  n=3      n=7
                      krla yout jpna gar4 tpha seba ngal aore                              1    6crosswilliamtown           9.4439 11.3258
                      hshs hshs jshs hs**                                                  2    1floorlloydvillalloyd
                      scbm dbsd vgjf sgk*                                                       roadcookertown             10.7745    15.3010
                      gtys fkjh juhk jker alad juhg jdhi idji                               3   kjnlknmlmklnmklnm           9.4387    19.4918
                      dkgj fkjh kdfi jkjk lfij kjfg hkij gfkh jdfk lfif                     4   mgrdtimesol                12.6973    13.3279
                      kijk ldfl fi**

                                                                                0.020} and {0.002,0.28} which overlap. This motivated us to look at
3.3    Experiments with Language model                                          alternate approaches to solve the problem.
Initial approach to solve this problem is by using statistical lan-
guage models, where a probability is assigned to a character se-                4    EXPERIMENTATION AND RESULTS
quence, {w 1 , w 2 , . . . , w n }. It is expected that the scores for monkey    Monkey typed addresses occur very rarely in a given dataset. But
typed addresses such as gtysfkjhjuhkjkeraladjuhgjdhiidjidjid-                   they are present and need to be located through a model. The ratio
kgj will be distinctly different from that of a normal address. We              of normal addresses to monkey typed addresses leads to severe class
conducted experiments with kylm language modeling toolkit [13]                  imbalance. Here we consider equal number of patterns for each
and generated scores for all the addresses. We notice that the scores           of the classes to demonstrate working of our proposed methods.
overlap. We experimented with different values of n. Table 6 con-               We consider 30,000 addresses with nearly equal number of normal
tains scores for a sample set of addresses for n=3 and 7. In the table,         and monkey-typed addresses. We present case studies and their
we present 2 examples each of normal and monkey typed addresses                 corresponding datasets in Table 7. For cases 1-3, we consider 23 : 13
for n=3 and n=7. For n=3, it can observed that score for the monkey             division between training and test datasets. As a first step, the data
typed address kjnlknmlmklnmklnm is closer to normal address                     is randomly shuffled. Each time an exclusive set of training dataset
placed in sl no.1. The score for the second monkey typed address                of size 23 , is chosen and the remaining data is considered as test data,
falls within the range of normal addresses, viz., {0.0091,0.042}. For           thus resulting in three distinct cases. For cases 4-5, the training and
n=7, the range for monkey typed and normal addresses are {0.0104,               test data are considered as 12 : 12 .
SIGIR eCom, Aug 2017, Tokyo, Japan                                                                            T. Ravindra Babu and Vishal Kakkar

               Table 7: Case Studies and Datasets                                     Table 8: Experimental Results for Approach-1

                 Case       Training           Test                                             Case Preci- Recall           F-
                  No         dataset         dataset                                             No sion(%)        (%)     Score
                   1        1-10000        10001-15000                                              8-Character long features
                   2      5001-15000         1-5000                                               1     97.75     91.30 94.42
                   3         1-5000,                                                              2     97.81     90.94 94.25
                          10001-15000      5001-10000                                             3     97.39     91.00 94.09
                   4         1-7500        7501-15000                                             4     97.84     90.61 94.09
                   5      7501-15000         1-7500                                               5     97.86     90.33 93.95
                   6        1-10000        10001-15000                                              4-Character long features
                   7      5001-15000         1-5000                                               6     97.98     97.86 97.92
                   8         1-5000,                                                              7     98.35     97.74 98.04
                          10001-15000      5001-10000                                             8     98.19     97.46 97.82
                    9        1-7500        7501-15000                                             9     98.58     97.72 98.05
                   10     7501-15000         1-7500                                              10     98.35     97.68 98.01


                                                                                      Table 9: Experimental Results for Approach-2
    Table 7 contains serial number of case studies and corresponding
datasets. The pattern numbers presented in Column 2 and 3 of the                                 Case     Preci-    Recall       F-
table correspond to training and test datasets of each of the classes.                            No     sion(%)     (%)       Score
For example, Case-3 indicates that from randomly shuffled training                                 1      96.81     96.81      96.81
data corresponding to class-1 and class-2, {1 to 5000} and {10001                                  2      96.76     96.75      96.75
to 15000} patterns each are considered for training and the rest                                   3      96.81     96.81      96.81
as test data. The training data and their corresponding labels are                                 4      96.81     96.80      96.80
considered accordingly.                                                                            5      96.75     96.72      96.72
    The data is subjected to 2-class classification using linear SVM
and Naive Bayes classifier. The performance is comparable. For
the sake of brevity, we present the results of the exercises using                          Percentage of monkey         FPR       FNR
linear SVM alone for both the methods. Table 8 contains results                                typed addresses
corresponding to Approach-1. The precision, recall and F-score                                 in training data
numbers are presented in columns 2,3 and 4 for the cases considered                                    5                0.002      0.655
in Table 7.                                                                                           10                0.002     0.4968
    Experiments are carried out with different character lengths for                                  20               0.0042     0.0424
the features such as 4,8, and 16. We observe that the performance is                                  50               0.0062     0.0176
better for 4 and 8. Further, the results with 4-character long features                               75                0.007      0.011
show better performance as evident from the table.                            Table 10: FPR and FNR for different proportion of monkey
    Performance of 4-character long features perform nearly the               typed addresses in the training data
same for both the data splits of 23 : 13 and 12 : 12 . The best and average
classification accuracy of 98% is obtained for features of length 4
and training-test data split of 12 : 21 .
    Table 9 contains results with Approach-2 for the same cases as                The classification time per pattern is 38 microsec on Intel i3,
listed in Table 7. The performance of all the five cases is nearly the        2.6Ghz, 4-core, 8GB RAM machine. This is well within the single
same. Approach-1 shows superior performance to Approach-2, in                 digit millisec latency requirement of the operational system.
general.                                                                          It is interesting to see how some failures have happened in the
    The true positive rate and false negative rates for Case-4 of             classification. We look at an example of a normal address that is
Table 7 for different proportion of monkey-typed and normal ad-               classified as monkey typed. The address ’microwave lab Dept: ECEE,
dresses are provided in Table 10. This is provided in order to in-            C04, IISC’, when split into 4-char features, amounts to ’micr owav
dicate performance under varying natural rates of monkey typed                elab dept ecee c04i isc*’. Such combination of characters, such as
addresses ranging from 5% to 75%.                                             ‘elab’, ‘isc*’, ’owav’, etc., occur in monkey typed addresses as well.
    Although Approach-1 outperforms the other approach, deploy-               This is the possible reason for its misclassification. Consider the
ing the solution across the country requires training of different            address which is monkey typed but classified as normal address, viz.,
combination of k-char groups, since the language spoken, names of             ’figb irds igjr educ tion sjfk kuit ebgo b***. It has mix of monkey-typed
locality, and ethnic words are distinct in different regions. Approach-       4-char features, such as, ‘igjr’,‘sjfk’, and ‘edgo’ and few features that
2 is immune to basic nuances of words and focus on pronounceabil-             occur in normal addresses, such as, ‘educ’, and ‘tion’. This possibly
ity of words. Thus, it is a more generic pattern representation.              resulted in its classification as a normal address.
Address Fraud: Monkey Typed Address Classification for e-Commerce Applications                                    SIGIR eCom, Aug 2017, Tokyo, Japan


   An important aspect of the deployment is dealing with class im-          feature lengths of 4 and 8 for Approach-1 and features marking
balance. We provide a discussion in Section 5 on various approaches         vowels for Approach-2. The performance is similar in both the
to tackle imbalance.                                                        approaches.
                                                                               Further work includes reducing the number of features through
5   CLASS IMBALANCE AND MITIGATION                                          optimal feature selection approaches, experimenting with models
                                                                            that could work across different geographical regions within a large
Like in many fraud or spam classification problems, monkey-typed
                                                                            country and adapting to class imbalance.
addresses form a very small percentage of combined class dataset
leading to severe class imbalance [4, 9, 11, 14]. The problem formu-        ACKNOWLEDGMENTS
lation, feature selection, approaches to solve the problem and the
experimental results indicate that the chosen approaches lead to            The authors would like to thank Aditya Rachakonda for his helpful
effective solution to the problem. As a next step, we need to make          comments.
it operational under class imbalance too. The costs of misclassifi-
cation [8, 12] of a monkey typed address and normal address are
                                                                            REFERENCES
                                                                             [1] Leman Akoglu, Rishi Chandy, and Christos Faloutsos. 2013. Opinion Fraud
not same. We discuss class imbalance problem and some important                  Detection in Online Reviews by Network Effects. In ICWSM. 2–11.
approaches to mitigate this problem.                                         [2] T. Ravindra Babu, A. Chatterjee, S. Khandeparker, A.V. Subhash, and S. Gupta.
    Consider a 2-class classification scenario, where the number                 2015. Geographical address classification without using geolocation coordinates.
                                                                                 In Proceedings of the 9th Workshop on Geographic Information Retrieval. ACM.
of patterns belonging to one class far exceeds those belonging to            [3] S. Bhattacharyya, Jha, K. Tharakunnel, and J.C. Westland. 2011. Data mining for
second class. Such data is termed imbalanced between the classes.                credit card fraud: A comparative study. Decision Support Systems 50, 3 (2011),
                                                                                 602–613.
Finding an effective classification model in such a scenario is chal-        [4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. SMOTE:
lenging, since many classification models favour the majority class.             synthetic minority over-sampling technique. Journal of artificial intelligence
In many practical problems, the minority class represents fraudu-                research 16 (2002), 321–357.
                                                                             [5] R.A. Cole, Y. Yan, B. Mak, M. Fanty, and T. Bailey. 1996. The contribution of
lent patterns. Misclassification of fraudulent patterns as genuine               consonants versus vowels to word recognition in fluent speech. In Proccedings
ones is expensive. Some conventional approaches to solve the prob-               of IEEE Conf. on Acoustics, Speech, and Signal Processing, 1996. ICASSP-96, Vol. 2.
lem are under-sampling majority class till balance is obtained, over-            853–856.
                                                                             [6] Sharon Curry. 2000. An inside look at e-commerce fraud - Prevention and
sampling the minority class in similar manner and choice of unequal              solutions. http://www.scambusters.org/ecommercefraud.pdf (2000).
classification costs. In case of under-sampling, when representa-            [7] C. A. Davis-Jr. and F. T. Fonseca. 2007. Assessing the Certainty of Locations
                                                                                 Produced by an Address Geocoding System. Geoinformatica 11 (2007), 103–129.
tive patterns intra-class variations are eliminated, the classification      [8] C. Elkan. 2001. The foundations of cost-sensitive learning. In International joint
accuracy suffers. In case of over-sampling of existing patterns of               conference on artificial intelligence, Vol. 17. Lawrence Erlbaum Associates Ltd.
minority class, it is not guaranteed to arrive at inter-class discrimina-    [9] H. He and E. A. Garcia. 2009. Learning from imbalanced data. IEEE Transactions
                                                                                 on knowledge and data engineering 21, 9 (2009), 1263–1284.
tive boundary. In the last 15 years, a number of intelligent sampling       [10] J. Hollmen. 2000. User profiling and classification for fraud detection in mobile
methods, boosting and combination of sampling and boosting, are                  communications networks. Helsinki University of Technology.
studied to effectively overcome these issues [4, 14]. Synthetic Mi-         [11] N. Japkowicz. 2000. The class imbalance problem: Significance and strategies. In
                                                                                 In Proc. of the International Conf. on Artificial Intelligence.
nority Over-sampling Technique (SMOTE) [4] marks intelligent                [12] M. Maloof. 2003. Learning when data sets are imbalanced and when costs are
sampling approach where a combination of over-sampling and                       unequal and unknown. In ICML-2003 workshop on learning from imbalanced data
                                                                                 sets II, Vol. 2.
under-sampling is resorted to, to demonstrate its superior perfor-          [13] Graham Neubig. [n. d.]. Kylm - The kyoto Language Modeling Toolkit. ([n. d.]).
mance using area under Receiver Operating Characterstic (ROC)                    http://www.phontron.com/kylm/
curve. Subsequent extensions of SMOTE and many variants of                  [14] C. Seiffert, T. M. Khoshgoftaar, J. V. Hulse, and Amri Napolitano. 2010. RUSBoost:
                                                                                 A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems,
boosting are outlined in [14], where RusBoost is shown to outper-                Man, and Cybernetics-Part A: Systems and Humans 40, 1 (2010), 185–197.
form the rest of the approaches.                                            [15] N. Shah, Alex Beutel, Bryan Hooi, Leman Akoglu, Stephan Gunnemann, Disha
    Adapting the above discussed approaches would help mitigating                Makhija, Mohit Kumar, , and Christos Faloutsos. 2016. EdgeCentric: Anomaly
                                                                                 Detection in Edge-Attributed Networks. In In Data Mining Workshops (ICDMW),
class imbalance problem for monkey-typed address classification.                 2016 IEEE 16th International Conference on. 327–334.
                                                                            [16] Y. Zhang, J. Bian, and W. Zhu. 2013. Trust fraud: A crucial challenge for China’s
                                                                                 e-commerce market. Electronic Commerce Research and Applications 12, 5 (2013),
6   SUMMARY AND FURTHER WORK                                                     299–308.
 We consider a practical problem that affects e-commerce companies
where potentially fraudulent addresses containing a single or multi-
ple strings of randomly typed addresses occur. We refer to them as
monkey-typed addresses. When such addresses remain undetected,
they would amount to loss of revenue and cause strain on resources
at various stages of processing, starting from order checkout to
logistics. The problem is challenging. Through an initial study, we
observed that conventional language modeling approaches do not
generalise well. We present two novel solutions (a) by forming
fixed-length features and (b) by exploiting inter-vowel distance,
after subjecting the addresses to a number of preprocessing steps.
We present the data challenges, preprocessing steps, proposed ap-
proaches and feature generation. We carry out experiments for

</pre>