The Classification of Imbalanced Spatial Data

Alina Lazar
Department of Computer Science and Information Systems
Youngstown State University
Youngstown, OH 44555
alazar@ysu.edu

Bradley A. Shellito
Department of Geography
Youngstown State University
Youngstown, OH 44555
bashellito@ysu.edu

Abstract

This paper describes a method of improving the prediction of urbanization. The four datasets used in this study were extracted using Geographical Information Systems (GIS). Each dataset contains seven independent variables related to urban development and a class label that distinguishes urban areas from rural areas. Two classification methods, Support Vector Machines (SVM) and Neural Networks (NN), were used in previous studies to perform the two-class classification task. Previous results achieved high accuracy but low sensitivity, because of the imbalanced nature of the datasets. There are several ways to deal with imbalanced data; two sampling methods are compared in this study.

Introduction

The aim of this paper is to show that class imbalance has a powerful impact on the performance of binary classification algorithms. Most machine learning algorithms produce models with better performance when trained on balanced training datasets. However, most real-world datasets from domains such as medical diagnosis, document classification, and fraud and intrusion detection are highly imbalanced with respect to the positive, or minority, class.

In general, classification algorithms are designed to optimize overall accuracy. For imbalanced data, however, good accuracy does not mean that most examples from the minority class were correctly classified. Therefore, additional performance measures such as recall, f-measure, g-means, and AUC should be included when we study imbalanced problems.

One common approach to the imbalance problem is to sample the data to build an equally distributed training dataset. Several sampling techniques have been proposed and analyzed in the literature (Van Hulse, Khoshgoftaar, and Napolitano 2007), including random under-sampling, random over-sampling, and more intelligent sampling techniques. A second class of methods uses meta-costs and assigns different penalties to misclassified instances depending on their true class. The problem with this type of method is that it is hard to come up with a good penalty cost. The last type of method is the algorithmic approach, which tweaks the classifier itself to accommodate imbalanced datasets. Algorithm-based methods use meta-learning (Liu, An, and Huang 2006, Zhu 2007) or on-line active learning (Ertekin et al. 2007) to build better classifiers. Different combinations of these methods have also been reported.

Real-world imbalanced datasets come from diverse application areas such as medical diagnosis, fraud detection, intrusion detection, gene profiling, and object detection from satellite images (Kubat, Holte, and Matwin 1998). Our study investigates the effect of two sampling techniques applied to four large GIS datasets with imbalance ratios between 2.4 and 12.5. The four datasets contain over a million instances each, so there is no need to use over-sampling; besides, over-sampling is known to introduce excessive noise and ambiguity. Instead, the sampling methods considered were random sampling, under-sampling, and Wilson's editing algorithm in combination.

SVM and NN have been used in various studies to predict urbanization and land cover, with similar overall results but different prediction patterns (Lazar and Shellito 2005, Shellito and Lazar 2005). Even though SVM itself does not provide a mechanism to deal with imbalanced data, it can be easily modified. SVM builds the decision boundary from a limited number of instances that are close to the boundary, and it is unaffected by instances far away from the boundary. This observation can be used as an active learning selection strategy that provides a balanced training set for the early training stages of the SVM algorithm (Ertekin et al. 2007).

In the Background section we summarize related studies that deal with the problem of imbalanced datasets. The section on Support Vector Machines and Multi-Layer Perceptrons presents the methods used, while the section describing our experiments presents a comparison between random sampling, under-sampling, and Wilson's editing. The last section presents the conclusions.
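To make the accuracy caveat above concrete, the short sketch below (ours, not from the study) scores a trivial majority-class predictor on an artificially imbalanced label set; the 5:1 ratio and the 1/-1 label encoding are illustrative only.

```python
# Illustrative sketch: why overall accuracy misleads on imbalanced data.
# Labels are 1 (minority, e.g. urban) and -1 (majority, e.g. non-urban);
# the 5:1 ratio below is made up, merely echoing the Mahoning ratio.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p != 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

y_true = [1] * 100 + [-1] * 500   # 5:1 imbalance
y_pred = [-1] * 600               # always predict the majority class

print(round(accuracy(y_true, y_pred), 3))  # 0.833 -- looks acceptable
print(recall(y_true, y_pred))              # 0.0 -- no minority instance found
```

A classifier that never finds a single urban cell still reaches 83% accuracy here, which is exactly why recall and g-means are reported throughout this paper.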
Background

Previous research (Lazar and Shellito 2005, Pijanowski et al. 2005, Pijanowski et al. 2002, Pijanowski et al. 2001, Shellito and Lazar 2005, Shellito and Pijanowski 2003) has shown that classification methods such as Support Vector Machines (SVM) and Neural Networks (NN) can be used successfully to predict patterns of urbanization in large datasets. SVM and NN can then serve as predictive tools to determine whether grid cells can be accurately predicted as urban or non-urban. The effectiveness of their predictive capability can be measured through standard accuracy and other measures. The dataset generated for Mahoning County had over 1,000,000 instances and an imbalance ratio of approximately 5:1. Even though the accuracies of both SVM and NN were over 90%, the recall was quite low, at 55%.

Lately, several studies have dealt with imbalanced datasets and their effect on classification performance; however, none of these studies included datasets with over a million instances. Extensive experimental results using several sampling techniques, combined with several classification methods and applied to several datasets, were reported by (Van Hulse, Khoshgoftaar, and Napolitano 2007). The sampling techniques considered were random minority oversampling, random majority undersampling, one-sided selection, Wilson's editing, SMOTE (Akbani, Kwek, and Japkowicz 2004), borderline SMOTE, and cluster-based oversampling. They concluded that some of the more complicated sampling techniques, especially one-sided selection and cluster-based oversampling, exhibit inferior performance in comparison with some of the simpler techniques.

Support Vector Machines

The support vector machine, proposed by (Vapnik 1999), consists of two important steps. First, the dot product of the data points in the feature space, called the kernel, is computed. Second, a hyperplane learning algorithm is applied to the kernel.

Let (x_i, y_i), i = 1, \dots, l, be the training set of examples. For a binary classification task, a decision y_i \in \{-1, 1\} is associated with each input instance x_i \in \mathbb{R}^N. In order to find a linear separating hyperplane with good generalization ability for the input data points, the set of hyperplanes \langle w, x \rangle + b = 0 is considered. The optimal hyperplane maximizes the distance between the hyperplane and the closest input data points, and it is the solution of the following problem:

    \min_{w \in \mathbb{R}^N, b \in \mathbb{R}} \tau(w) = \frac{1}{2} \|w\|^2    (1)

subject to y_i (\langle w, x_i \rangle + b) \ge 1 for all i = 1, \dots, l.

One challenge is that in practice an ideal separating hyperplane may not exist, due to a large overlap between input data points from the two classes. In order to make the algorithm flexible, a noise (slack) variable \varepsilon_i \ge 0, i = 1, \dots, l, is introduced into the objective function as follows:

    \min_{w \in \mathbb{R}^N, b \in \mathbb{R}} \tau(w, \varepsilon) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \varepsilon_i    (2)

subject to y_i (\langle w, x_i \rangle + b) \ge 1 - \varepsilon_i for all i = 1, \dots, l.

By using Lagrange multipliers, the previous problem can be formulated as the following convex maximization problem (Liu, An, and Huang 2006):

    \max_{\alpha} W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j)    (3)

subject to 0 \le \alpha_i \le C for all i = 1, \dots, l, and \sum_{i=1}^{l} \alpha_i y_i = 0. Here the positive constant C controls the trade-off between the maximization of (3) and the minimization of the training error \sum_i \varepsilon_i.

From the optimal hyperplane equation, the decision function for classification can be derived. For any unknown instance x the decision is made based on:

    f(x) = \mathrm{sign}\left( \sum_{i=1}^{l} y_i \alpha_i K(x_i, x) + b \right)    (4)

whose argument geometrically corresponds to the distance of the unknown instance from the hyperplane.

The method described so far works well on linear problems. The function K, the kernel in (4), enables good results for nonlinear decision problems as well: it computes the dot product of the input instances after they have been mapped, via a function \phi, into a new higher-dimensional feature space.

    K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}, \quad K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle    (5)

A polynomial kernel, the radial basis function, and the sigmoid function are all suitable kernels, with similar behavior in terms of the resulting accuracy, and they can be tuned by changing the values of their parameters. There is no general method for choosing the best kernel function. The results reported in this paper were obtained by using the following radial basis function (Schölkopf and Smola 2002) as the kernel:

    K(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\gamma^2} \right)    (6)
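The decision function (4) with the rbf kernel (6) transcribes directly into a few lines. The sketch below is ours: the toy support vectors, coefficients, bias, and gamma value are made up for illustration, whereas in practice they come from solving the dual problem (3). Note that libSVM parameterizes the rbf kernel as exp(-g·||x_i - x_j||²), i.e. g = 1/(2γ²) relative to (6).

```python
import math

def rbf_kernel(xi, xj, gamma):
    # Equation (6): K(xi, xj) = exp(-||xi - xj||^2 / (2 * gamma^2)).
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * gamma ** 2))

def svm_decision(x, support_vectors, labels, alphas, b, gamma):
    # Equation (4): f(x) = sign(sum_i y_i * alpha_i * K(x_i, x) + b).
    s = sum(y * a * rbf_kernel(sv, x, gamma)
            for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if s + b >= 0 else -1

# Made-up two-support-vector model, for illustration only.
svs    = [(0.0, 0.0), (2.0, 2.0)]
labels = [1, -1]
alphas = [1.0, 1.0]
print(svm_decision((0.2, 0.1), svs, labels, alphas, b=0.0, gamma=1.0))  # 1
print(svm_decision((1.9, 2.1), svs, labels, alphas, b=0.0, gamma=1.0))  # -1
```

A query point near the positive support vector gets label 1, one near the negative support vector gets label -1; only instances near the boundary (the support vectors) influence the decision, which is the property exploited by the active learning strategy mentioned in the Introduction.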
Multi-layer Perceptron Neural Networks

The multi-layer perceptron (MLP) (Witten and Frank 2000) is a popular technique because of its well-known ability to perform arbitrary mappings, not only classifications. Usually built from three or four layers of neurons — an input layer, one or two hidden layers, and an output layer — this network of neurons can be trained to identify almost any input-output function. The back-propagation algorithm used for training adjusts the synaptic weights of the neurons according to the error at the output. During the first step of the algorithm, the predicted outputs are calculated from the input values and the network weights. Afterwards, in the backward pass, the partial derivatives of the cost function are propagated back through the network and the weights are adjusted accordingly.

One problem with MLP methods is that they are susceptible to converging to local minima. MLP methods are also considered "black boxes," because it is impossible to obtain snapshots of the process.

Datasets

Seven broad predictor variables, which aid in describing the distribution of urbanization within the counties, were constructed using ESRI's ArcGIS 9.2 software package. ArcGIS allows for the modeling of a vast array of geospatial techniques, including cell-by-cell raster models. These variables were chosen because they reflect large-scale factors that influence the patterns of urbanization and general urban trends for the region, and because they are similar to GIS variables used for urban modeling within the Midwest (Pijanowski et al. 2005, Pijanowski et al. 2002, Pijanowski et al. 2001, Shellito and Pijanowski 2003). The variables constructed were:

a. Distance to City Centers
b. Distance to Highways
c. Distance to Interstates
d. Distance to Railroads
e. Distance to Lakes
f. Distance to Rivers
g. Density of Agriculture

For each county, a series of base layers was compiled to build the variables. The NLCD (National Land Cover Database) 2001 data was used for the location of urban areas and as a source of agricultural data. Base layers for highways, interstates, and railways were drawn from US Census 2000 TIGER files. Lakes and rivers data was derived from Ohio Department of Transportation (ODOT) data. All base layers were projected into the UTM (Universal Transverse Mercator) projection and used to develop the predictor variables in raster format at 30 m resolution. Distance variables were created by calculating the Euclidean distance of each cell from the closest feature in the base layers. The density variable was constructed by using a 3x3 moving-window neighborhood operation and summing the number of base-layer grid cells in the neighborhood. Urban land was identified by selecting all grid cells with the "developed" classification in the NLCD dataset.

Predictor variables for each county were constructed by incorporating data from the bordering Ohio counties, to simulate the influence of nearby spatial factors outside the county borders (for instance, the proximity of a city center in a bordering county could potentially affect urban development within the target county). The resulting predictor variables, created at this multi-county level, were then clipped down to the boundaries of the chosen county and used in the analysis.

This type of data was extracted for four counties from the state of Ohio: Delaware, Holmes, Mahoning, and Medina. All four resulting datasets contain more than a million instances each. Table 1 shows, for each county dataset, how many instances belong to the positive class, how many belong to the negative class, and the ratio between the negative and the positive instances.

Sampling Methods

Since the datasets considered have over a million instances, we decided to investigate under-sampling (US). This sampling technique discards random instances from the majority class until the two classes are equally represented. The other sampling method used in this study is Wilson's editing (WE) (Barandela et al. 2004). A k-nearest-neighbor classification procedure with k = 3 is used to classify each instance in the training set using all the remaining instances. Afterwards, all instances from the majority class that were misclassified are removed.

Performance Metrics

Especially in the case of imbalanced datasets, classification accuracy alone is not the best metric for evaluating a classifier. Several other performance metrics can be used in order to get a more comprehensive picture of a classifier's capabilities. Recall, or sensitivity, measures the accuracy on the positive instances; it can be defined as TruePositive / (TruePositive + FalseNegative). Specificity measures the accuracy on the negative instances and can be defined as TrueNegative / (TrueNegative + FalsePositive). Both sensitivity and specificity are incorporated in the g-means measure (Ertekin et al. 2007), which is defined as the square root of sensitivity × specificity.
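The three metrics just defined reduce to a few lines of code. The confusion counts in the example below are hypothetical, chosen only to show the computation; they are not taken from the paper's experimental runs.

```python
import math

def sensitivity(tp, fn):
    # Recall on the positive (minority) class: TP / (TP + FN).
    return tp / (tp + fn)

def specificity(tn, fp):
    # Accuracy on the negative (majority) class: TN / (TN + FP).
    return tn / (tn + fp)

def g_means(tp, fn, tn, fp):
    # Geometric mean of sensitivity and specificity.
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))

# Hypothetical confusion counts, for illustration only.
print(round(g_means(tp=70, fn=30, tn=90, fp=10), 3))  # 0.794
```

Because g-means multiplies the two class-wise accuracies, it collapses toward zero whenever either class is badly misclassified, which is exactly the behavior a metric for imbalanced data should have.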
Table 1. Number of Training Instances and Ratios

County      # Positive    # Negative    Ratio
Delaware       209,765     1,106,749     5.2761:1
Holmes          90,164     1,129,403    12.5260:1
Mahoning       353,411       868,423     2.4576:1
Medina         228,819       987,405     4.3152:1

All four datasets are imbalanced, from a mild 2.4:1 ratio for Mahoning County to a 12.5:1 ratio for Holmes County.

Experiments

For the first set of experiments we used two classifiers, the SVM and the Multi-Layer Perceptron (MLP). We used the libSVM software (Chang and Lin 2001) to run the parameter search, the training, and the testing for the SVM, and Weka for the MLP. The experiments were similar to those reported in (Lazar and Shellito 2005, Shellito and Lazar 2005) for Mahoning County. Random stratified sampling, which maintains the ratio of positive versus negative instances in the datasets, was used to generate datasets of 10,000 instances for the parameter search and datasets of 50,000 instances for the training sets.

We ran the experiments using RapidMiner (Mierswa et al. 2006) on the four datasets Delaware, Holmes, Mahoning, and Medina as follows. For each dataset we performed 5 runs of five-fold cross-validation with the libSVM software. The rbf kernel was used, and the two parameters C and gamma were set to values found by running a grid parameter search.
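Random stratified sampling as used above can be sketched as follows. This is our own minimal helper, not the RapidMiner operator the experiments actually relied on; the toy dataset is illustrative.

```python
import random

def stratified_sample(X, y, n, seed=42):
    # Draw roughly n instances while preserving each class's share,
    # as the random stratified sampling step described above requires.
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    out_X, out_y = [], []
    for label, items in by_class.items():
        take = round(n * len(items) / len(y))   # keep the class ratio
        for xi in rng.sample(items, take):
            out_X.append(xi)
            out_y.append(label)
    return out_X, out_y

# 5:1 toy dataset; a sample of 60 keeps ~10 positives and ~50 negatives.
X = [[i] for i in range(600)]
y = [1] * 100 + [-1] * 500
sX, sy = stratified_sample(X, y, n=60)
print(sy.count(1), sy.count(-1))  # 10 50
```

Per-class rounding means the returned size can differ from n by one or two instances on skewed ratios; for the 10,000- and 50,000-instance subsets used here that discrepancy is negligible.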
A grid parameter search was performed for the SVM classifier, and the resulting values of the two parameters C and gamma are listed below in Table 2.

Table 2. Parameters C and gamma for the LibSVM

County      C       gamma
Delaware    8192    0.125
Holmes      2048    0.5
Mahoning    2       32
Medina      128     0.125

Next, both classifiers, SVM and MLP, were trained on the 50,000-instance datasets, and the models obtained were tested using the entire datasets. The results obtained are reported in Table 3. For each dataset and for each classifier (SVM and MLP), three performance metrics are listed: accuracy, recall, and g-means.

Table 3. Classification Performances for NN and SVM

                     Del      Hol      Mah      Med
Accuracy   SVM      91.11    92.86    87.87    87.67
           MLP      80.36    93.8     85.72    87.53
Recall     SVM      57.35     4.19    70.44    47.13
           MLP      18.86    21.68    70.85    49.36
G-means    SVM      74.78    20.47    81.79    67.64
           MLP      40.02    46.46    80.63    68.97

The results show that even though SVM has higher accuracy on three of the datasets, MLP has higher recall, and therefore a better classification of the positive instances, on three of the datasets. Recall has the largest values for the Mahoning County dataset, which also has the lowest imbalance ratio. Looking at the low recall values for the other three datasets, we need to investigate ways to better classify the instances from the positive class. Experiments using different sampling techniques are reported next.

Three sampling techniques were used: random stratified sampling (RS), equal under-sampling (US), and Wilson's editing sampling (WE). Each experiment was iterated over subsample datasets with sizes between 100 and 5000, with a step of 100. The results are shown for two counties, Holmes and Medina, due to space limitations. Holmes County has the highest imbalance ratio, approximately 12.5:1, and Medina has a 4.3:1 ratio.

Figure 1. Recall for Holmes County Dataset
Figure 2. G-means for Holmes County Dataset
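The Wilson's editing step used in the WE runs can be sketched with a plain 3-NN pass over the training set. The helper and the toy dataset below are illustrative only, not the implementation used in the experiments.

```python
from collections import Counter

def wilsons_editing(X, y, k=3, majority_label=-1):
    # Classify each instance with k-NN over all remaining instances,
    # then drop majority-class instances that were misclassified.
    def knn_label(i):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), y[j])
            for j in range(len(X)) if j != i)
        votes = Counter(label for _, label in dists[:k])
        return votes.most_common(1)[0][0]

    keep = [i for i in range(len(X))
            if y[i] != majority_label or knn_label(i) == majority_label]
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: one majority point sits inside the minority cluster and is
# removed; the clean majority cluster survives, minority points are kept.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),          # minority (+1)
     (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),          # majority (-1)
     (0.05, 0.05)]                                # stray majority point
y = [1, 1, 1, -1, -1, -1, -1]
eX, ey = wilsons_editing(X, y)
print(len(eX), (0.05, 0.05) in eX)  # 6 False
```

Unlike equal under-sampling, which discards majority instances at random, this edit targets majority instances that intrude into minority neighborhoods, which is why it needs the extensive k-NN preprocessing noted in the results discussion. The brute-force O(n²) neighbor search here is only workable on small subsamples.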
Figure 3. Recall for Medina County Dataset
Figure 4. G-means for Medina County Dataset

All four figures show that both under-sampling and Wilson's editing sampling have a great influence on the classification performance of the SVM learner. Since accuracy is not relevant in the case of imbalanced datasets, we looked at recall and g-means. Wilson's editing worked only slightly better than equal under-sampling, but required extensive preprocessing. The biggest difference in performance can be seen in Figure 1, in the recall for Holmes County.

Conclusions

We have presented an experimental analysis performed on large imbalanced GIS-extracted datasets. The goal was to find which sampling techniques improve the classification performance, especially for the minority class. It is important in the case of imbalanced datasets that additional performance measures, such as recall and g-means, are compared in addition to the usual accuracy. We concluded that both equal under-sampling and Wilson's editing work better than simple random stratified sampling, but there is no significant difference between the two.

Further research may investigate how other learners, such as Neural Networks or Decision Trees, perform with under-sampling and Wilson's editing sampling. Over-sampling, cost-sensitive learning, and meta-learners are other alternatives that could be used to improve the performance for our datasets.

References

Akbani, R.; Kwek, S.; and Japkowicz, N. 2004. Applying support vector machines to imbalanced datasets. In Proceedings of the European Conference on Machine Learning, 39-50. Pisa, Italy: Springer-Verlag.

Barandela, R.; Valdovinos, R. M.; Sanchez, J. S.; and Ferri, F. J. 2004. The Imbalanced Training Sample Problem: Under or Over Sampling? In Joint IAPR International Workshops on Structural, Syntactic, and Statistical Pattern Recognition (SSPR/SPR'04), Lecture Notes in Computer Science 3138: 806-814.

Chang, C.-C.; and Lin, C.-J. 2001. LIBSVM: a library for support vector machines. Software at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Last accessed 01/15/2011.
Cristianini, N.; and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge, England: Cambridge University Press.

Ertekin, S.; Huang, J.; Bottou, L.; and Lee Giles, C. 2007. Learning on the border: active learning in imbalanced data classification. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (Lisbon, Portugal, November 06-10, 2007), CIKM '07, 127-136. New York, NY: ACM.

Koggallage, R.; and Halgamuge, S. 2004. Reducing the Number of Training Samples for Fast Support Vector Machine Classification. Neural Information Processing – Letters and Reviews 2(3): 57-65.

Kubat, M.; Holte, R. C.; and Matwin, S. 1998. Machine Learning for the detection of oil spills in satellite radar images. Machine Learning 30(2-3): 195-215.

Lazar, A.; and Shellito, B. A. 2005. Comparing Machine Learning Classification Schemes – a GIS Approach. In Proceedings of ICMLA'2005: The 2005 International Conference on Machine Learning and Applications. IEEE.

Liu, Y.; An, A.; and Huang, X. 2006. Boosting Prediction Accuracy on Imbalanced Datasets with SVM Ensembles. Lecture Notes in Artificial Intelligence 3918: 107-118.

Mierswa, I.; Wurst, M.; Klinkenberg, R.; Scholz, M.; and Euler, T. 2006. YALE: Rapid Prototyping for Complex Data Mining Tasks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06).

Pijanowski, B.; Pithadia, S.; Shellito, B. A.; and Alexandridis, K. 2005. Calibrating a neural network based urban change model for two metropolitan areas of the upper Midwest of the United States. International Journal of Geographical Information Science 19(2): 197-216.

Pijanowski, B.; Brown, D.; Shellito, B. A.; and Manik, G. 2002. Use of Neural Networks and GIS to Predict Land Use Change. Computers, Environment, and Urban Systems 26(6): 553-575.

Pijanowski, B.; Shellito, B. A.; Bauer, M.; and Sawaya, K. 2001. Using GIS, Artificial Neural Networks and Remote Sensing to Model Urban Change in the Minneapolis-St. Paul and Detroit Metropolitan Areas. In Proceedings of the ASPRS Annual Conference, St. Louis, MO.

Schölkopf, B.; and Smola, A. 2002. Learning with Kernels. Cambridge, MA: MIT Press.

Shellito, B. A.; and Lazar, A. 2005. Applying Support Vector Machines and GIS to Urban Pattern Recognition. In Papers of the Applied Geography Conferences, volume 28.

Shellito, B. A.; and Pijanowski, B. 2003. Using Neural Nets to Model the Spatial Distribution of Seasonal Homes. Cartography and Geographic Information Science 30(3): 281-290.

Van Hulse, J.; Khoshgoftaar, T. M.; and Napolitano, A. 2007. Experimental perspectives on learning from imbalanced data. In Proceedings of the 24th International Conference on Machine Learning (Corvallis, Oregon, June 20-24, 2007), ICML '07, vol. 227, 935-942. New York, NY: ACM.

Vapnik, V. N. 1999. The Nature of Statistical Learning Theory, 2nd edition. New York, NY: Springer-Verlag.

Witten, I. H.; and Frank, E. 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Burlington, MA: Morgan Kaufmann.

Zhu, X. 2007. Lazy Bagging for Classifying Imbalanced Data. In Seventh IEEE International Conference on Data Mining, 763-768. Omaha, NE.