The Classification of Imbalanced Spatial Data

Alina Lazar
Department of Computer Science and Information Systems
Youngstown State University
Youngstown, OH 44555
alazar@ysu.edu

Bradley A. Shellito
Department of Geography
Youngstown State University
Youngstown, OH 44555
bashellito@ysu.edu

Abstract

This paper describes a method of improving the prediction of urbanization. The four datasets used in this study were extracted using Geographical Information Systems (GIS). Each dataset contains seven independent variables related to urban development and a class label that distinguishes urban areas from rural areas. Two classification methods, Support Vector Machines (SVM) and Neural Networks (NN), were used in previous studies to perform the two-class classification task. Previous results achieved high accuracy but low sensitivity, because of the imbalanced nature of the datasets. There are several ways to deal with imbalanced data; two sampling methods are compared in this study.

Introduction

The aim of this paper is to show that class imbalance has a powerful impact on the performance of binary classification algorithms. Most machine learning algorithms produce models with better performance when trained on balanced training datasets. However, most real-world datasets from domains such as medical diagnosis, document classification, and fraud and intrusion detection are highly imbalanced with respect to the positive, or minority, class.

In general, classification algorithms are designed to optimize overall accuracy. For imbalanced data, however, good accuracy does not mean that most examples from the minority class were correctly classified. Therefore, additional performance measures such as recall, f-measure, g-means, and AUC should be included when we study imbalanced problems.

One common approach to the imbalance problem is to sample the data to build an equally distributed training dataset. Several sampling techniques have been proposed and analyzed in the literature (Van Hulse, Khoshgoftaar, and Napolitano 2007), including random under-sampling, random over-sampling, and more intelligent sampling techniques. A second class of methods uses meta-costs and assigns different penalties to misclassified instances depending on their true class. The problem with this type of method is that it is hard to come up with a good penalty cost. The last type of method is the algorithmic approach, which tweaks the classifier itself to accommodate imbalanced datasets. Algorithm-based methods use meta-learning (Liu, An, and Huang 2006, Zhu 2007) or on-line active learning (Ertekin et al. 2007) to build better classifiers. Different combinations of these methods have also been reported.

Real-world imbalanced datasets come from diverse application areas such as medical diagnosis, fraud detection, intrusion detection, gene profiling, and object detection from satellite images (Kubat, Holte, and Matwin 1998). Our study investigates the effect of two sampling techniques applied to four large GIS datasets with imbalance ratios between 2.4 and 12.5. The four datasets contain over a million instances each, so there is no need to use over-sampling; besides, over-sampling is known to introduce excessive noise and ambiguity. Instead, the sampling methods considered were random sampling, under-sampling, and Wilson's editing algorithm in combination.

SVM and NN have been used in various studies to predict urbanization and land cover, with similar overall results but different prediction patterns (Lazar and Shellito 2005, Shellito and Lazar 2005). Even though SVM itself does not provide a mechanism to deal with imbalanced data, it can be easily modified. SVM builds the decision boundary from a limited number of instances that are close to the boundary, and it is unaffected by instances far away from the boundary. This observation can be used as an active learning selection strategy that provides a balanced training set for the early training stages of the SVM algorithm (Ertekin et al. 2007).

In the Background section we summarize related studies that deal with the problem of imbalanced datasets. The section on Support Vector Machines and Multi-Layer Perceptrons presents the methods used, while the section describing our experiments presents a comparison between random sampling, under-sampling, and Wilson's editing. The last section presents the conclusions.
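To make the accuracy caveat above concrete, the short sketch below (ours, not from the study) scores a trivial majority-class predictor on an artificially imbalanced label set; the 5:1 ratio and the 1/-1 label encoding are illustrative only.

```python
# Illustrative sketch: why overall accuracy misleads on imbalanced data.
# Labels are 1 (minority, e.g. urban) and -1 (majority, e.g. non-urban);
# the 5:1 ratio below is made up, merely echoing the Mahoning ratio.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p != 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

y_true = [1] * 100 + [-1] * 500   # 5:1 imbalance
y_pred = [-1] * 600               # always predict the majority class

print(round(accuracy(y_true, y_pred), 3))  # 0.833 -- looks acceptable
print(recall(y_true, y_pred))              # 0.0 -- no minority instance found
```

A classifier that never finds a single urban cell still reaches 83% accuracy here, which is exactly why recall and g-means are reported throughout this paper.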
Background

Previous research (Lazar and Shellito 2005, Pijanowski et al. 2005, Pijanowski et al. 2002, Pijanowski et al. 2001, Shellito and Lazar 2005, Shellito and Pijanowski 2003) has shown that classification methods such as Support Vector Machines (SVM) and Neural Networks (NN) can be used successfully to predict patterns of urbanization in large datasets. SVM and NN can then serve as predictive tools to determine whether grid cells can be accurately predicted as urban or non-urban. The effectiveness of their predictive capability can be measured through standard accuracy and other measures. The dataset generated for Mahoning County had over 1,000,000 instances and an imbalance ratio of approximately 5:1. Even though the accuracies of both SVM and NN were over 90%, the recall was quite low, at 55%.

Lately, several studies have dealt with imbalanced datasets and their effect on classification performance; however, none of these studies included datasets with over a million instances. Extensive experimental results using several sampling techniques, combined with several classification methods and applied to several datasets, were reported by (Van Hulse, Khoshgoftaar, and Napolitano 2007). The sampling techniques considered were random minority oversampling, random majority undersampling, one-sided selection, Wilson's editing, SMOTE (Akbani, Kwek, and Japkowicz 2004), borderline SMOTE, and cluster-based oversampling. They concluded that some of the more complicated sampling techniques, especially one-sided selection and cluster-based oversampling, exhibit inferior performance in comparison with some of the simpler techniques.

Support Vector Machines

The support vector machine, proposed by (Vapnik 1999), consists of two important steps. First, the dot product of the data points in the feature space, called the kernel, is computed. Second, a hyperplane learning algorithm is applied to the kernel.

Let (x_i, y_i), i = 1, \dots, l, be the training set of examples. For a binary classification task, a decision y_i \in \{-1, 1\} is associated with each input instance x_i \in \mathbb{R}^N. In order to find a linear separating hyperplane with good generalization ability for the input data points, the set of hyperplanes \langle w, x \rangle + b = 0 is considered. The optimal hyperplane maximizes the distance between the hyperplane and the closest input data points, and it is the solution of the following problem:

    \min_{w \in \mathbb{R}^N, b \in \mathbb{R}} \tau(w) = \frac{1}{2} \|w\|^2    (1)

subject to y_i (\langle w, x_i \rangle + b) \ge 1 for all i = 1, \dots, l.

One challenge is that in practice an ideal separating hyperplane may not exist, due to a large overlap between input data points from the two classes. In order to make the algorithm flexible, a noise (slack) variable \varepsilon_i \ge 0, i = 1, \dots, l, is introduced into the objective function as follows:

    \min_{w \in \mathbb{R}^N, b \in \mathbb{R}} \tau(w, \varepsilon) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \varepsilon_i    (2)

subject to y_i (\langle w, x_i \rangle + b) \ge 1 - \varepsilon_i for all i = 1, \dots, l.

By using Lagrange multipliers, the previous problem can be formulated as the following convex maximization problem (Liu, An, and Huang 2006):

    \max_{\alpha} W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j)    (3)

subject to 0 \le \alpha_i \le C for all i = 1, \dots, l, and \sum_{i=1}^{l} \alpha_i y_i = 0. Here the positive constant C controls the trade-off between the maximization of (3) and the minimization of the training error \sum_i \varepsilon_i.

From the optimal hyperplane equation, the decision function for classification can be derived. For any unknown instance x the decision is made based on:

    f(x) = \mathrm{sign}\left( \sum_{i=1}^{l} y_i \alpha_i K(x_i, x) + b \right)    (4)

whose argument geometrically corresponds to the distance of the unknown instance from the hyperplane.

The method described so far works well on linear problems. The function K, the kernel in (4), enables good results for nonlinear decision problems as well: it computes the dot product of the input instances after they have been mapped, via a function \phi, into a new higher-dimensional feature space.

    K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}, \quad K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle    (5)

A polynomial kernel, the radial basis function, and the sigmoid function are all suitable kernels, with similar behavior in terms of the resulting accuracy, and they can be tuned by changing the values of their parameters. There is no general method for choosing the best kernel function. The results reported in this paper were obtained by using the following radial basis function (Schölkopf and Smola 2002) as the kernel:

    K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\gamma^2} \right)    (6)
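The decision function (4) with the rbf kernel (6) transcribes directly into a few lines. The sketch below is ours: the toy support vectors, coefficients, bias, and gamma value are made up for illustration, whereas in practice they come from solving the dual problem (3). Note that libSVM parameterizes the rbf kernel as exp(-g·||x_i - x_j||²), i.e. g = 1/(2γ²) relative to (6).

```python
import math

def rbf_kernel(xi, xj, gamma):
    # Equation (6): K(xi, xj) = exp(-||xi - xj||^2 / (2 * gamma^2)).
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * gamma ** 2))

def svm_decision(x, support_vectors, labels, alphas, b, gamma):
    # Equation (4): f(x) = sign(sum_i y_i * alpha_i * K(x_i, x) + b).
    s = sum(y * a * rbf_kernel(sv, x, gamma)
            for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if s + b >= 0 else -1

# Made-up two-support-vector model, for illustration only.
svs    = [(0.0, 0.0), (2.0, 2.0)]
labels = [1, -1]
alphas = [1.0, 1.0]
print(svm_decision((0.2, 0.1), svs, labels, alphas, b=0.0, gamma=1.0))  # 1
print(svm_decision((1.9, 2.1), svs, labels, alphas, b=0.0, gamma=1.0))  # -1
```

A query point near the positive support vector gets label 1, one near the negative support vector gets label -1; only instances near the boundary (the support vectors) influence the decision, which is the property exploited by the active learning strategy mentioned in the Introduction.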
Multi-layer Perceptron Neural Networks

The multi-layer perceptron (MLP) (Witten and Frank 2000) is a popular technique because of its well-known ability to perform arbitrary mappings, not only classifications. Usually built from three or four layers of neurons — an input layer, one or two hidden layers, and an output layer — this network of neurons can be trained to identify almost any input-output function. The back-propagation algorithm used for training adjusts the synaptic weights of the neurons according to the error at the output. During the first step of the algorithm, the predicted outputs are calculated from the input values and the network weights. Afterwards, in the backward pass, the partial derivatives of the cost function are propagated back through the network and the weights are adjusted accordingly.

One problem with MLP methods is that they are susceptible to converging to local minima. MLP methods are also considered "black boxes," because it is impossible to obtain snapshots of the process.

Datasets

Seven broad predictor variables, which aid in describing the distribution of urbanization within the counties, were constructed using ESRI's ArcGIS 9.2 software package. ArcGIS allows for the modeling of a vast array of geospatial techniques, including cell-by-cell raster models. These variables were chosen because they reflect large-scale factors that influence the patterns of urbanization and general urban trends for the region, and because they are similar to GIS variables used for urban modeling within the Midwest (Pijanowski et al. 2005, Pijanowski et al. 2002, Pijanowski et al. 2001, Shellito and Pijanowski 2003). The variables constructed were:

a. Distance to City Centers
b. Distance to Highways
c. Distance to Interstates
d. Distance to Railroads
e. Distance to Lakes
f. Distance to Rivers
g. Density of Agriculture

For each county, a series of base layers was compiled to build the variables. The NLCD (National Land Cover Database) 2001 data was used for the location of urban areas and as a source of agricultural data. Base layers for highways, interstates, and railways were drawn from US Census 2000 TIGER files. Lakes and rivers data was derived from Ohio Department of Transportation (ODOT) data. All base layers were projected into the UTM (Universal Transverse Mercator) projection and used to develop the predictor variables in raster format at 30 m resolution. Distance variables were created by calculating the Euclidean distance of each cell from the closest feature in the base layers. The density variable was constructed by using a 3x3 moving-window neighborhood operation and summing the number of base-layer grid cells in the neighborhood. Urban land was identified by selecting all grid cells with the "developed" classification in the NLCD dataset.

Predictor variables for each county were constructed by incorporating data from the bordering Ohio counties, to simulate the influence of nearby spatial factors outside the county borders (for instance, the proximity of a city center in a bordering county could potentially affect urban development within the target county). The resulting predictor variables, created at this multi-county level, were then clipped down to the boundaries of the chosen county and used in the analysis.

This type of data was extracted for four counties from the state of Ohio: Delaware, Holmes, Mahoning, and Medina. All four resulting datasets contain more than a million instances each. Table 1 shows, for each county dataset, how many instances belong to the positive class, how many belong to the negative class, and the ratio between the negative and the positive instances.

Sampling Methods

Since the datasets considered have over a million instances, we decided to investigate under-sampling (US). This sampling technique discards random instances from the majority class until the two classes are equally represented. The other sampling method used in this study is Wilson's editing (WE) (Barandela et al. 2004). A k-nearest-neighbor classification procedure with k = 3 is used to classify each instance in the training set using all the remaining instances. Afterwards, all instances from the majority class that were misclassified are removed.

Performance Metrics

Especially in the case of imbalanced datasets, classification accuracy alone is not the best metric for evaluating a classifier. Several other performance metrics can be used in order to get a more comprehensive picture of a classifier's capabilities. Recall, or sensitivity, measures the accuracy on the positive instances; it can be defined as TruePositive / (TruePositive + FalseNegative). Specificity measures the accuracy on the negative instances and can be defined as TrueNegative / (TrueNegative + FalsePositive). Both sensitivity and specificity are incorporated in the g-means measure (Ertekin et al. 2007), which is defined as the square root of sensitivity × specificity.
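The three metrics just defined reduce to a few lines of code. The confusion counts in the example below are hypothetical, chosen only to show the computation; they are not taken from the paper's experimental runs.

```python
import math

def sensitivity(tp, fn):
    # Recall on the positive (minority) class: TP / (TP + FN).
    return tp / (tp + fn)

def specificity(tn, fp):
    # Accuracy on the negative (majority) class: TN / (TN + FP).
    return tn / (tn + fp)

def g_means(tp, fn, tn, fp):
    # Geometric mean of sensitivity and specificity.
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))

# Hypothetical confusion counts, for illustration only.
print(round(g_means(tp=70, fn=30, tn=90, fp=10), 3))  # 0.794
```

Because g-means multiplies the two class-wise accuracies, it collapses toward zero whenever either class is badly misclassified, which is exactly the behavior a metric for imbalanced data should have.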
Table 1. Number of Training Instances and Ratios

County      # Positive    # Negative    Ratio
Delaware       209,765     1,106,749     5.2761:1
Holmes          90,164     1,129,403    12.5260:1
Mahoning       353,411       868,423     2.4576:1
Medina         228,819       987,405     4.3152:1

All four datasets are imbalanced, from a mild 2.4:1 ratio for Mahoning County to a 12.5:1 ratio for Holmes County.

Experiments

For the first set of experiments we used two classifiers, the SVM and the Multi-Layer Perceptron (MLP). We used the libSVM software (Chang and Lin 2001) to run the parameter search, the training, and the testing for the SVM, and Weka for the MLP. The experiments were similar to those reported in (Lazar and Shellito 2005, Shellito and Lazar 2005) for Mahoning County. Random stratified sampling, which maintains the ratio of positive versus negative instances in the datasets, was used to generate datasets of 10,000 instances for the parameter search and datasets of 50,000 instances for the training sets.

We ran the experiments using RapidMiner (Mierswa et al. 2006) on the four datasets Delaware, Holmes, Mahoning, and Medina as follows. For each dataset we performed 5 runs of five-fold cross-validation with the libSVM software. The rbf kernel was used, and the two parameters C and gamma were set to values found by running a grid parameter search.
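Random stratified sampling as used above can be sketched as follows. This is our own minimal helper, not the RapidMiner operator the experiments actually relied on; the toy dataset is illustrative.

```python
import random

def stratified_sample(X, y, n, seed=42):
    # Draw roughly n instances while preserving each class's share,
    # as the random stratified sampling step described above requires.
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    out_X, out_y = [], []
    for label, items in by_class.items():
        take = round(n * len(items) / len(y))   # keep the class ratio
        for xi in rng.sample(items, take):
            out_X.append(xi)
            out_y.append(label)
    return out_X, out_y

# 5:1 toy dataset; a sample of 60 keeps ~10 positives and ~50 negatives.
X = [[i] for i in range(600)]
y = [1] * 100 + [-1] * 500
sX, sy = stratified_sample(X, y, n=60)
print(sy.count(1), sy.count(-1))  # 10 50
```

Per-class rounding means the returned size can differ from n by one or two instances on skewed ratios; for the 10,000- and 50,000-instance subsets used here that discrepancy is negligible.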
A grid parameter search was performed for the SVM classifier, and the resulting values of the two parameters C and gamma are listed below in Table 2.

Table 2. Parameters C and gamma for the LibSVM

County      C       gamma
Delaware    8192    0.125
Holmes      2048    0.5
Mahoning    2       32
Medina      128     0.125

Next, both classifiers, SVM and MLP, were trained on the 50,000-instance datasets, and the models obtained were tested using the entire datasets. The results obtained are reported in Table 3. For each dataset and for each classifier (SVM and MLP), three performance metrics are listed: accuracy, recall, and g-means.

Table 3. Classification Performances for NN and SVM

                     Del      Hol      Mah      Med
Accuracy   SVM      91.11    92.86    87.87    87.67
           MLP      80.36    93.8     85.72    87.53
Recall     SVM      57.35     4.19    70.44    47.13
           MLP      18.86    21.68    70.85    49.36
G-means    SVM      74.78    20.47    81.79    67.64
           MLP      40.02    46.46    80.63    68.97

The results show that even though SVM has higher accuracy on three of the datasets, MLP has higher recall, and therefore a better classification of the positive instances, on three of the datasets. Recall has the largest values for the Mahoning County dataset, which also has the lowest imbalance ratio. Looking at the low recall values for the other three datasets, we need to investigate ways to better classify the instances from the positive class. Experiments using different sampling techniques are reported next.

Three sampling techniques were used: random stratified sampling (RS), equal under-sampling (US), and Wilson's editing sampling (WE). Each experiment was iterated over subsample datasets with sizes between 100 and 5000, with a step of 100. The results are shown for two counties, Holmes and Medina, due to space limitations. Holmes County has the highest imbalance ratio, approximately 12.5:1, and Medina has a 4.3:1 ratio.

Figure 1. Recall for Holmes County Dataset
Figure 2. G-means for Holmes County Dataset
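The Wilson's editing step used in the WE runs can be sketched with a plain 3-NN pass over the training set. The helper and the toy dataset below are illustrative only, not the implementation used in the experiments.

```python
from collections import Counter

def wilsons_editing(X, y, k=3, majority_label=-1):
    # Classify each instance with k-NN over all remaining instances,
    # then drop majority-class instances that were misclassified.
    def knn_label(i):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), y[j])
            for j in range(len(X)) if j != i)
        votes = Counter(label for _, label in dists[:k])
        return votes.most_common(1)[0][0]

    keep = [i for i in range(len(X))
            if y[i] != majority_label or knn_label(i) == majority_label]
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: one majority point sits inside the minority cluster and is
# removed; the clean majority cluster survives, minority points are kept.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),          # minority (+1)
     (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),          # majority (-1)
     (0.05, 0.05)]                                # stray majority point
y = [1, 1, 1, -1, -1, -1, -1]
eX, ey = wilsons_editing(X, y)
print(len(eX), (0.05, 0.05) in eX)  # 6 False
```

Unlike equal under-sampling, which discards majority instances at random, this edit targets majority instances that intrude into minority neighborhoods, which is why it needs the extensive k-NN preprocessing noted in the results discussion. The brute-force O(n²) neighbor search here is only workable on small subsamples.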
Figure 3. Recall for Medina County Dataset
Figure 4. G-means for Medina County Dataset

All four figures show that both under-sampling and Wilson's editing sampling have a great influence on the classification performance of the SVM learner. Since accuracy is not relevant in the case of imbalanced datasets, we looked at recall and g-means. Wilson's editing worked only slightly better than equal under-sampling, but required extensive preprocessing. The biggest difference in performance can be seen in Figure 1, in the recall for Holmes County.

Conclusions

We have presented an experimental analysis performed on large imbalanced GIS-extracted datasets. The goal was to find which sampling techniques improve the classification performance, especially for the minority class. It is important in the case of imbalanced datasets that additional performance measures, such as recall and g-means, are compared in addition to the usual accuracy. We concluded that both equal under-sampling and Wilson's editing work better than simple random stratified sampling, but there is no significant difference between the two.

Further research may investigate how other learners, such as Neural Networks or Decision Trees, perform with under-sampling and Wilson's editing sampling. Over-sampling, cost-sensitive learning, and meta-learners are other alternatives that could be used to improve the performance for our datasets.

References

Akbani, R.; Kwek, S.; and Japkowicz, N. 2004. Applying support vector machines to imbalanced datasets. In Proceedings of the European Conference on Machine Learning, 39-50. Pisa, Italy: Springer-Verlag.

Barandela, R.; Valdovinos, R. M.; Sanchez, J. S.; and Ferri, F. J. 2004. The Imbalanced Training Sample Problem: Under or Over Sampling? In Joint IAPR International Workshops on Structural, Syntactic, and Statistical Pattern Recognition (SSPR/SPR'04), Lecture Notes in Computer Science 3138: 806-814.

Chang, C.-C.; and Lin, C.-J. 2001. LIBSVM: a library for support vector machines. Software at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Last accessed 01/15/2011.
Cristianini, N.; and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge, England: Cambridge University Press.

Ertekin, S.; Huang, J.; Bottou, L.; and Lee Giles, C. 2007. Learning on the border: active learning in imbalanced data classification. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (Lisbon, Portugal, November 06-10, 2007), CIKM '07, 127-136. New York, NY: ACM.

Koggallage, R.; and Halgamuge, S. 2004. Reducing the Number of Training Samples for Fast Support Vector Machine Classification. Neural Information Processing – Letters and Reviews 2(3): 57-65.

Kubat, M.; Holte, R. C.; and Matwin, S. 1998. Machine Learning for the detection of oil spills in satellite radar images. Machine Learning 30(2-3): 195-215.

Lazar, A.; and Shellito, B. A. 2005. Comparing Machine Learning Classification Schemes – a GIS Approach. In Proceedings of ICMLA'2005: The 2005 International Conference on Machine Learning and Applications. IEEE.

Liu, Y.; An, A.; and Huang, X. 2006. Boosting Prediction Accuracy on Imbalanced Datasets with SVM Ensembles. Lecture Notes in Artificial Intelligence 3918: 107-118.

Mierswa, I.; Wurst, M.; Klinkenberg, R.; Scholz, M.; and Euler, T. 2006. YALE: Rapid Prototyping for Complex Data Mining Tasks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06).

Pijanowski, B.; Pithadia, S.; Shellito, B. A.; and Alexandridis, K. 2005. Calibrating a neural network based urban change model for two metropolitan areas of the upper Midwest of the United States. International Journal of Geographical Information Science 19(2): 197-216.

Pijanowski, B.; Brown, D.; Shellito, B. A.; and Manik, G. 2002. Use of Neural Networks and GIS to Predict Land Use Change. Computers, Environment, and Urban Systems 26(6): 553-575.

Pijanowski, B.; Shellito, B. A.; Bauer, M.; and Sawaya, K. 2001. Using GIS, Artificial Neural Networks and Remote Sensing to Model Urban Change in the Minneapolis-St. Paul and Detroit Metropolitan Areas. In Proceedings of the ASPRS Annual Conference, St. Louis, MO.

Schölkopf, B.; and Smola, A. 2002. Learning with Kernels. Cambridge, MA: MIT Press.

Shellito, B. A.; and Lazar, A. 2005. Applying Support Vector Machines and GIS to Urban Pattern Recognition. In Papers of the Applied Geography Conferences, volume 28.

Shellito, B. A.; and Pijanowski, B. 2003. Using Neural Nets to Model the Spatial Distribution of Seasonal Homes. Cartography and Geographic Information Science 30(3): 281-290.

Van Hulse, J.; Khoshgoftaar, T. M.; and Napolitano, A. 2007. Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning (Corvallis, Oregon, June 20-24, 2007), ICML '07, vol. 227, 935-942. New York, NY: ACM.

Vapnik, V. N. 1999. The Nature of Statistical Learning Theory, 2nd edition. New York, NY: Springer-Verlag.

Witten, I. H.; and Frank, E. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Burlington, MA: Morgan Kaufmann.

Zhu, X. 2007. Lazy Bagging for Classifying Imbalanced Data. In Seventh IEEE International Conference on Data Mining, 763-768. Omaha, NE.