=Paper=
{{Paper
|id=Vol-1348/maics2013_paper_8
|storemode=property
|title=Comparison of Fast Learning Large-Scale Multi-Class Classification
|pdfUrl=https://ceur-ws.org/Vol-1348/maics2013_paper_8.pdf
|volume=Vol-1348
|dblpUrl=https://dblp.org/rec/conf/maics/JovanovichL13
}}
==Comparison of Fast Learning Large-Scale Multi-Class Classification==
A. Jovanovich and A. Lazar
Department of Computer Science and Information Systems, Youngstown State University, Youngstown, OH 44555
{ajovanovich, alazar}@gmail.com

===Abstract===
Recent progress in the development of techniques to optimize large-scale classification problems has extended the use of multi-class classification, specifically the use of multi-class algorithms when the dataset is too large to fit into the limited memory available on most computers. The most prominent algorithms used today solve the multi-class classification problem through an optimization approach based on coordinate descent. Two of the most recognized implementations, Vowpal Wabbit and the LIBLINEAR extension of LibSVM, have emerged as the most consistent options for such multi-class problems. This paper presents an analysis of these methods and tests the efficiency and performance of each algorithm. The results are recorded and compared. After analyzing the results, we conclude that the Vowpal Wabbit algorithm is best suited for solving large-scale multi-class classification problems when computer memory is constrained.

===Introduction===
Big Data has changed the way statistical analysis is conducted. Many real-world datasets contain hundreds or thousands of variables of interest and hundreds of thousands or millions of records. Time spent reading and writing between memory and disk becomes the bottleneck, rendering most algorithms inefficient (Yu et al. 2012). Even with the growing memory sizes of computers, a large dataset can still be problematic. As a consequence, the complexity of analysis increasingly becomes unmanageable with traditional machine learning algorithms. Extracting useful knowledge from dense data makes analysis time consuming, and because most algorithms use an iterative process that cycles through the dataset multiple times, the process can become practically impossible to finish.

In the past, classification models have been shown to handle large amounts of data well, and several optimization techniques have been applied to efficiently train data-intensive models (Aly 2005). However, the performance of these algorithms begins to decline when the data cannot be processed in memory (Yu et al. 2012). In these cases, training techniques that deal well with memory limitations become critical.

Recent progress has been made in the development of techniques to optimize over memory-constrained systems, including for binary classification. Binary classification is a well-studied special case of the classification problem and provides the foundation for multi-class problem solvers. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Most machine learning algorithms are originally developed for binary decision problems and are then extended to handle multi-class problems (Argyriou, Herbster, and Pontil 2005).

The goal of this paper is to study existing multi-class classification methods and provide a noteworthy comparison. In order to provide a thorough and comprehensive comparison, classification tests were run multiple times over contrasting datasets. The paper is organized as follows. Section 2 contains background information, and Section 3 presents related work. Our approach is described in Section 4, followed by experiments in Section 5. Conclusions and references close the paper.
===Background===
The principles of multi-class classification are founded upon the same principles as binary classification. Each case involves assigning a class label to each instance present. Given a training dataset of the form (xi, yi), where xi ∈ R^n is the ith example and yi ∈ {1, ..., K} is the ith class label, we aim at finding a learning model H such that H(xi) = yi for new, unseen examples (Menon 2009). The problem is simply formulated in the two-class case, where the labels yi are just 0 or 1 for the two classes involved.

Several algorithms have been proposed to solve the binary problem. Some of these can be easily extended to the multi-class case, while others involve high levels of computation. In contrast to traditional classification, where each instance is assigned only one label, instances used in multi-class classification may be simultaneously relevant to several labels; in text categorization, for example, one document can belong to multiple topics.

Existing approaches to large-scale multi-class classification can be categorized into two types. The first solves the problem through a block minimization scheme; the second considers online learning algorithms (Dekel 2008). Both attempt to solve the classification problem of assigning labels from Y to instances X (Tewari and Bartlett 2007).

===Block Minimization for Support Vector Machines===
Optimization through block minimization has been used in the past to deal efficiently with data too large to fit into memory. One of the most prominent learning algorithms associated with block minimization is the Support Vector Machine (SVM). Proposed by Cortes and Vapnik in 1995, the SVM has since grown into one of the most widely used learning algorithms in the world (Bottou and Lin 2007), and its implementation and broad uses have been well documented. The SVM framework can be modified to solve classification problems when only limited memory is available.

The process entails an optimization technique based on block minimization, where the term "block" refers to a partition of the dataset small enough to be read into the available memory. The size and content of each block vary from approach to approach. Pérez-Cruz, Figueiras, and Artes (2004) propose "double chunking", where data is partitioned into both "large chunks" and "small chunks". Another approach, described by Chang and Roth (2011), uses selective sampling for block minimization: by selecting only significant instances, the goal is to minimize the size of the data blocks and speed up the iteration process. Yu et al. (2012) suggest the block minimization framework that is also used for testing in this paper. In this approach the amount of memory available for processing determines the size of the blocks. The framework consists of three steps: split the data and read the blocks into memory, set initial values, and solve for classification through an iterative process. The algorithm can be summarized as follows:

Algorithm 1: A block minimization framework for the dual SVM
1. Split {1, ..., l} into B1, ..., Bm and store the data into m partitions accordingly.
2. Set initial w.
3. For k = 1, 2, ... (outer iterations):
   For j = 1, ..., m (inner iterations):
   3.1 Read xr, ∀r ∈ Bj, from disk.
   3.2 Approximately solve the sub-problem (1).
   3.3 Update w by (2).

Here, {B1, ..., Bm} are sequential partitions of all data indices {1, ..., l}. The size of the blocks is determined by the known memory constraints. Each instance is read and randomly assigned to a block; Algorithm 2 explains the data splitting process. Optimization over a single block is identified as an inner iteration, whereas the m steps of going over all blocks constitute an outer iteration (Yu et al. 2012).

Algorithm 2: Framework for block splitting
1. Decide m and create m empty files.
2. For i = 1, ..., l:
   2.1 Convert xi to a binary format.
   2.2 Randomly choose a number j ∈ {1, ..., m}.
   2.3 Append xi to the end of the jth file.
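To make the splitting and iteration structure concrete, the sketch below expresses Algorithms 1 and 2 in Python. This is our own illustrative reading of the framework, not code from LIBLINEAR or Yu et al.; the use of NumPy `.npz` files as the binary block format and the `solve_subproblem` callable are assumptions made for illustration.

```python
import numpy as np

def split_into_blocks(X, y, m, prefix="block"):
    """Algorithm 2 sketch: randomly assign each instance to one of m
    block files stored in a binary format (.npz is an assumption)."""
    rng = np.random.default_rng(0)
    assignment = rng.integers(0, m, size=len(y))          # step 2.2
    paths = []
    for j in range(m):
        idx = np.where(assignment == j)[0]
        path = f"{prefix}_{j}.npz"
        np.savez(path, X=X[idx], y=y[idx])                # step 2.3: write the jth file
        paths.append(path)
    return paths

def block_minimization(paths, n_features, outer_iters, solve_subproblem):
    """Algorithm 1 sketch: outer passes over all blocks, with only one
    block in memory at a time; solve_subproblem stands in for the solver."""
    w = np.zeros(n_features)                              # step 2: initial w
    for k in range(outer_iters):                          # outer iterations
        for path in paths:                                # inner iterations
            blk = np.load(path)                           # step 3.1: read block from disk
            w = solve_subproblem(w, blk["X"], blk["y"])   # steps 3.2-3.3
    return w
```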
In order to optimize through block minimization, the dual form of the SVM must be used. By examining the dual form of the optimization problem, the entire algorithm can be written in terms of inner products between input feature vectors. Because updates involve the weight vector w, which corresponds to the entire dataset and treats instances uniformly, the primal form of the SVM cannot be used here (Shalev-Shwartz et al. 2007). Algorithm 1 is therefore able to learn efficiently in very high dimensional spaces.

The sub-problem is solved in the inner iteration step, and the solution is then used to update w. The solution uses only instances that belong to block Bj. The relation between w and the dual variables is

w = ∑_{i=1,...,l} αi yi xi      (1)

The iteration round is complete after w is updated. If dBj is an optimal solution of the sub-problem, w is updated by

w ← w + ∑_{r ∈ Bj} dr yr xr      (2)

The iteration process continues until optimization is reached, converging when one of two conditions is met (Yu et al. 2012). The first condition states that optimization is complete when the sub-problem for each block has been solved and the solutions converge. The second condition is a stopping criterion: usually a finite number of iterations is chosen, or an accuracy threshold is reached.

LIBLINEAR addresses both conditions while solving the sub-problem. The software contains a library with tools for SVM classification when the data cannot fit into memory (Yu et al. 2012). LIBLINEAR sequentially selects one variable to update and fixes the others inside the block; the full framework, not shown in this paper, is explained by Yu et al. (2012). LIBLINEAR uses an SVM coordinate descent method as its solver to update the instances in block Bj while maintaining relation (1).
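To illustrate how a single block is handled, the sketch below performs a few passes of dual coordinate descent over one block, keeping w synchronized with (1) and applying the per-coordinate form of update (2). It is a generic hinge-loss sketch in the spirit of the description above, not LIBLINEAR's actual solver; binary labels in {+1, −1}, the regularization constant C, and the number of sweeps are assumptions, and the per-block alpha values would have to be stored alongside the block (for example next to the files written in the earlier sketch) for it to serve as the sub-problem solver there.

```python
import numpy as np

def dual_cd_on_block(w, X, y, alpha, C=1.0, sweeps=10):
    """A few passes of dual coordinate descent over one block B_j.
    One alpha_r changes at a time while the others stay fixed, and w is
    kept equal to sum_i alpha_i * y_i * x_i (equation (1)); the change
    d_r = new_alpha - alpha[r] applies equation (2) coordinate by coordinate."""
    for _ in range(sweeps):
        for r in range(len(y)):
            x_r, y_r = X[r], y[r]
            grad = y_r * w.dot(x_r) - 1.0            # gradient of the dual objective in alpha_r
            q_rr = x_r.dot(x_r)
            if q_rr == 0.0:
                continue
            new_alpha = min(max(alpha[r] - grad / q_rr, 0.0), C)
            w = w + (new_alpha - alpha[r]) * y_r * x_r   # equation (2), one coordinate
            alpha[r] = new_alpha
    return w, alpha
```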
===Online Learning with Stochastic Gradient Descent===
Online learning algorithms are designed to effectively classify data by building a weight model derived from sequentially received training examples. Compared to block minimization, which solves a sub-problem for each block, online learning updates instances through the use of a cache file. Each iteration round updates the cache file where the weight model is stored. The algorithm classifies each instance and uses the new instance-label pair to update and improve the stored model (Tewari and Bartlett 2007). This method is expected to accurately predict the labels of instances that are not part of the training set.

Several strategies have been proposed to optimize online learning algorithms, most of which aim to extend algorithms originally designed for binary classification to multi-class learning. Argyriou, Herbster, and Pontil (2005) discuss an extension of a graph-based approach to online learning. Shalev-Shwartz et al. (2007) exploit the dual formulation of the optimization problem to create a more efficient online learning algorithm.

John Langford and his colleagues at Yahoo! Research developed Vowpal Wabbit, a fast online learning algorithm based on stochastic gradient descent. Vowpal Wabbit can handle very large datasets without ever needing to load the entire dataset into memory. The algorithm also requires less computational power and far fewer resources by learning through truncated online gradient descent (Langford, Li, and Zhang 2009). Algorithm 3 presents the framework.

Algorithm 3: Online learning framework for Vowpal Wabbit
Inputs:
* threshold θ ≥ 0
* gravity sequence gi ≥ 0
* learning rate η ∈ (0, 1)
Initialize weights w^j ← 0 (j = 1, ..., d).
1. Acquire an unlabeled example x = [x^1, x^2, ..., x^d].
2. Compute the prediction ŷ = ∑_j w^j x^j.
3. Acquire the label y from the dataset. For all weights w^j (j = 1, ..., d):
   (a) if w^j > 0 and w^j ≤ θ then w^j ← max{w^j − gi η, 0}
   (b) else if w^j < 0 and w^j ≥ −θ then w^j ← min{w^j + gi η, 0}
4. Update the weights for all features j: w^j ← w^j + 2η(y − ŷ)x^j.

Here the superscript in w^j denotes the jth component of the vector w, to differentiate it from wi, which references the ith weight vector (Langford, Li, and Zhang 2009).

Updates to the assigned weights are stored in a cache file that is small enough to be read into the limited memory available. Vowpal Wabbit stores only the cache file, allowing for a faster implementation and increased performance throughout the optimization process.
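Algorithm 3 translates almost line for line into code. The sketch below is an illustrative re-implementation of one truncated-gradient round for a squared-loss linear model, not Vowpal Wabbit's source; dense NumPy vectors and the example parameter values are assumptions.

```python
import numpy as np

def truncated_gradient_round(w, x, y, eta=0.1, theta=1.0, gravity=0.01):
    """One round of Algorithm 3: predict, shrink small weights toward
    zero (the "gravity" truncation), then take a squared-error gradient step."""
    y_hat = w.dot(x)                                   # step 2: prediction
    pos = (w > 0) & (w <= theta)                       # step 3(a)
    neg = (w < 0) & (w >= -theta)                      # step 3(b)
    w[pos] = np.maximum(w[pos] - gravity * eta, 0.0)
    w[neg] = np.minimum(w[neg] + gravity * eta, 0.0)
    w += 2.0 * eta * (y - y_hat) * x                   # step 4: gradient update
    return w

# Example use over a stream of (x, y) pairs:
# w = np.zeros(d)
# for x, y in stream:
#     w = truncated_gradient_round(w, x, y)
```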
===Related Work===
In this section we discuss related work pertaining to comparisons of large-scale classification methods. The topic has produced a comprehensive literature primarily for binary classification, so there remains a gap in knowledge pertaining to multi-class classification. We derive the techniques used for our comparison from strategies implemented in related work.

Yu et al. (2012) first introduced a comparison for large-scale linear classification. Their comparison of SVM solvers and online learning algorithms extended only to binary classification. The study defined one assumption that is significant in our comparison: the amount of available memory is limited, so entire datasets cannot be stored in memory, but they can be stored on the disk of a single computer. The datasets used in this paper are large enough to satisfy this constraint and must be accessed through the hard drive where they are stored.

Multi-class classification was addressed by Chang and Roth (2011), who drew comparisons from both batch learning and online learning algorithms. Unfortunately, Vowpal Wabbit at that time was limited to binary classification and no comparisons with it were drawn. Instead, the focus was primarily on block minimization algorithms, among which LibSVM was deemed superior to several other algorithms. It is our intent to extend this comparison of LibSVM to Vowpal Wabbit.

===Approach===
The aim of this paper is to test the performance of Vowpal Wabbit (Langford, Li, and Zhang 2009) against the LIBLINEAR block learning extension of LibSVM (Yu et al. 2012). Due to recent advances in the Vowpal Wabbit library, this is one of the first tests using its multi-class solver.

Testing was performed using the one-against-all multi-class solver options implemented in both algorithms. For a K-class problem, the one-against-all method constructs K models, where each model separates one class from the rest. The ith model is trained with all of the instances, relabeled as binary instances according to membership in the ith class. The final output of the one-against-all method is the class that corresponds to the SVM with the highest output value (Liu, Wang, and Zeng 2007).
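The one-against-all construction can be sketched independently of either solver. The `fit_binary` callable below stands in for whichever binary learner is used (the block-minimization solver or the online learner sketched earlier) and is assumed to return a weight vector whose dot product with an instance is that model's score; classes are indexed 0..K−1 in this sketch.

```python
import numpy as np

def train_one_against_all(X, y, n_classes, fit_binary):
    """Train K binary models; model k separates class k from the rest."""
    models = []
    for k in range(n_classes):
        y_k = np.where(y == k, 1.0, -1.0)       # relabel: class k vs. all others
        models.append(fit_binary(X, y_k))
    return models

def predict_one_against_all(X, models):
    """Assign each instance to the class whose model scores it highest."""
    scores = np.column_stack([X.dot(w) for w in models])
    return scores.argmax(axis=1)
```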
===Data===
Comparisons were drawn from tests conducted over multi-class datasets. The SensIT Vehicle dataset was used, in addition to a larger 53-class dataset (RCV1). The smaller dataset was used to establish a baseline reading and evaluation; the classification problem was then extended from this simple case to the more complex one posed by the RCV1 dataset, whose tests were intended to reinforce the results gathered from SensIT.

The SensIT Vehicle dataset features 3-class labels. The instances were extracted from sensor data collected during a real-world experiment carried out at Twentynine Palms, CA in November of 2001 (Duarte and Hen Hu 2004). The sensors were used to obtain both acoustic and seismic activity from vehicles in the vicinity; in total, 50 features were extracted. Each vehicle was driven around a road while sensors collected information as it passed. Classes included in the training set are AAV3 (class 1), DW3 (class 2), and a third class for noise (class 3). In total there are 78,823 training samples and 19,705 testing samples. Figure 1 presents the frequency of each class used for training. The distribution shows an unbalanced dataset in which more instances of noise are recorded than of either vehicle; the total sample size of the two vehicle classes combined is close to that of the noise class. Data skewed in this way can increase the error rate and degrade performance in certain cases; later in the paper we discuss its impact on classification accuracy.

Figure 1. SensIT Vehicle class training frequency (x-axis: class; y-axis: frequency).

The RCV1 dataset was used in part because of its 53-class problem. The Reuters Corpus Volume I (RCV1) is one of the most widely used test collections for text categorization research (Lewis et al. 2004). It contains 534,135 newswire documents, split into 15,564 training documents and 518,571 test documents, and the dataset contains 47,236 features. The class frequency of all 53 labels is presented in Figure 2. Again, the frequencies of the samples collected in the training set represent an unbalanced sample, in which the first 25 class frequencies are significantly larger than the last 28.

Figure 2. RCV1 class training frequency (x-axis: class; y-axis: frequency).

===Experiment===
In order to draw comparisons successfully, testing criteria were first established. The criteria are based on overall performance: the ability of each algorithm to learn a classification model and label instances with the right class. Stopping criteria were also established before the tests were conducted: the iteration threshold was set to 78 rounds. Figures 3-6 present the individual performance measures of the algorithms from iteration round 1 to the stopping threshold reached in round 78. The algorithms were compiled in C++ and run on the Linux operating system. All experiments were run on a laptop with 6 GB of memory. To ensure accurate results, each test was repeated 3 times and the averages were used to rate performance.

Before the experiment can be run using the LIBLINEAR extension of LibSVM, the training dataset must be partitioned into smaller blocks; the optimal number of partitions was found to be 8 in this case. Vowpal Wabbit, on the other hand, uses a cache file, which fulfills the memory constraint. The dataset does not need to be split or compressed, and Vowpal Wabbit can access each instance without reading the entire dataset.
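The measurement procedure described above (accuracy and cumulative training time recorded at each iteration checkpoint, each test repeated three times and averaged) can be written as a small harness. The sketch below is illustrative only; the `train_for` and `predict` callables are stand-ins for whichever solver is being measured, and the checkpoint list mirrors the rounds reported in Figures 7 and 8.

```python
import time
import numpy as np

CHECKPOINTS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 50, 78]

def benchmark(train_for, predict, X_test, y_test, repeats=3):
    """Average cumulative training time and test accuracy over `repeats`
    runs, measured at each iteration checkpoint."""
    results = {}
    for rounds in CHECKPOINTS:
        times, accs = [], []
        for _ in range(repeats):
            start = time.time()
            model = train_for(rounds)                   # train up to `rounds` iterations
            times.append(time.time() - start)
            accs.append(float(np.mean(predict(model, X_test) == y_test)))
        results[rounds] = (np.mean(times), np.mean(accs))
    return results
```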
The algorithms were first trained and tested using the SensIT dataset, which represents a simple scenario where binary classification is extended into a 3-class problem. The results from both algorithms are shown in Figures 3 and 4.

Figure 3. Iteration time over the SensIT data (x-axis: iteration; y-axis: time in seconds).
Figure 4. Percent accuracy per iteration over the SensIT data (x-axis: iteration; y-axis: accuracy).

On average, the runtime per iteration for Vowpal Wabbit was 0.36 seconds, while the average for LibSVM was 1.48 seconds: Vowpal Wabbit ran roughly 4 times as fast as LibSVM. Both algorithms scored over 80% accuracy when classifying testing instances, and by the end of the 78th iteration both were close to converging around 80.3%. The biggest differences recorded can be seen within the first 5 iteration rounds; this is where the optimization approaches can be distinguished. As the Vowpal Wabbit algorithm passes through the dataset and the weights are updated from the initial labels, its classification accuracy begins to rise. The consistent accuracy rate of LibSVM can be attributed to the block minimization framework: because the sub-problem within each equally distributed block is solved first, the updated weights do not yield higher accuracy; instead, the rate stays consistent throughout the optimization process.

The second experiment was conducted over the RCV1 dataset. Classification using the one-against-all option becomes increasingly complicated with each additional class, and the increased sample, class, and feature size of RCV1 taxes the memory load in such a constrained environment. Performance from both algorithms stayed consistent with the baseline established in the SensIT tests. The average runtime per iteration for Vowpal Wabbit increased slightly to 0.43 seconds, while LibSVM's decreased slightly to 1.43 seconds. The runtime results for RCV1, presented in Figure 5, are similar to those of the SensIT testing in Figure 3: as with the SensIT dataset, Vowpal Wabbit is consistently quicker than LibSVM.

Figure 5. Iteration time over the RCV1 data (x-axis: iteration; y-axis: time in seconds).
Figure 6. Percent accuracy per iteration over the RCV1 data (x-axis: iteration; y-axis: accuracy).

Classification accuracy rose roughly 8 percentage points when comparing the results of Figure 6 to those of Figure 4; the larger dataset had a positive impact on the accuracy of both algorithms. Again, LibSVM scores a higher classification accuracy while achieving more consistent results. The accuracy of Vowpal Wabbit rises until around the 20th iteration and then drops until the 78th iteration; unlike in the SensIT tests, the algorithms never appear to converge.

Overall, the LIBLINEAR extension of LibSVM achieved the highest classification rates; at no point in the iteration process did the Vowpal Wabbit algorithm correctly classify more test instances. The highest classification accuracy occurred during the first iteration step, when LibSVM correctly classified 80.47% of the SensIT testing dataset, whereas Vowpal Wabbit obtained a slightly lower rate of 80.32% after the final iteration. Figure 7 presents a table with the results gathered over the SensIT test dataset. Once more, LibSVM attained the highest classification rate over the RCV1 testing dataset; the results from both algorithms are presented in Figure 8. The highest accuracy mark was obtained during the first iteration round, when LibSVM correctly classified 88.64% of the testing instances, and this rate was maintained throughout the iteration process. Vowpal Wabbit reached its highest mark, an accuracy of 86.99%, after the 6th iteration round.

Figure 7. SensIT data: cumulative time (seconds) and accuracy per iteration round.

Iteration | Vowpal Wabbit time (s) | Vowpal Wabbit acc | libSVM time (s) | libSVM acc
1 | 1.28 | 79.86% | 5.52 | 80.47%
2 | 1.64 | 80.06% | 8.00 | 80.47%
3 | 2.05 | 80.12% | 9.53 | 80.43%
4 | 2.46 | 80.15% | 11.25 | 80.46%
5 | 2.87 | 80.16% | 13.03 | 80.42%
6 | 3.29 | 80.20% | 15.43 | 80.43%
7 | 3.56 | 80.19% | 16.09 | 80.46%
8 | 3.95 | 80.19% | 17.27 | 80.43%
9 | 4.40 | 80.19% | 18.76 | 80.43%
10 | 4.81 | 80.19% | 20.14 | 80.47%
20 | 7.13 | 80.24% | 34.28 | 80.45%
30 | 10.83 | 80.22% | 48.93 | 80.43%
50 | 17.52 | 80.27% | 76.93 | 80.44%
78 | 27.81 | 80.32% | 116.56 | 80.43%

Figure 8. RCV1 data: cumulative time (seconds) and accuracy per iteration round.

Iteration | Vowpal Wabbit time (s) | Vowpal Wabbit acc | libSVM time (s) | libSVM acc
1 | 1.06 | 83.40% | 8.20 | 88.64%
2 | 1.53 | 85.81% | 9.46 | 88.64%
3 | 1.94 | 86.57% | 10.80 | 88.64%
4 | 2.36 | 86.85% | 12.25 | 88.64%
5 | 2.85 | 86.96% | 13.61 | 88.64%
6 | 3.37 | 86.99% | 14.87 | 88.64%
7 | 3.84 | 86.96% | 16.20 | 88.64%
8 | 4.02 | 86.90% | 17.63 | 88.64%
9 | 4.48 | 86.83% | 19.01 | 88.64%
10 | 5.26 | 86.75% | 20.33 | 88.64%
20 | 9.46 | 86.12% | 33.99 | 80.45%
30 | 13.91 | 85.76% | 47.30 | 80.43%
50 | 22.06 | 85.45% | 74.17 | 80.44%
78 | 33.85 | 85.28% | 111.93 | 80.43%

The higher accuracy ratings from LibSVM do come at a cost, however: the runtime for each iteration was significantly slower than that of Vowpal Wabbit. The total time Vowpal Wabbit needed to reach the 78th iteration round in Figure 7 was only 27.81 seconds, compared to 116.56 seconds for LibSVM. Figure 8 presents similar results: Vowpal Wabbit was recorded at 33.85 seconds, whereas LibSVM reached the stopping threshold in 111.93 seconds. In both cases Vowpal Wabbit was roughly 4 times as fast.

Surprisingly, both algorithms achieved higher levels of performance when solving the larger RCV1 classification problem. The iteration runtimes of both algorithms show a small increase when Figures 3 and 5 are compared, while the classification accuracy in Figure 6 improved over the ratings in Figure 4. The lower classification accuracy on the smaller SensIT dataset can be attributed to the skewed training data: the frequency distributions presented in Figures 1 and 2 are unbalanced, and as a result the classification accuracy over SensIT was lower than over the larger RCV1 dataset. The increase in training instances correlated with overall classification accuracy: the more instances the algorithms had to train over, the more their classification accuracy improved.

===Conclusion===
Using the results recorded in Figures 3-8, the following question about fast learning for large-scale multi-class classification can be answered: which algorithm is most efficient when there are constraints on memory? When compared, Vowpal Wabbit is the more efficient multi-class classification algorithm.
The results from the SensIT test case suggest that Vowpal Wabbit was the quicker algorithm while maintaining a slightly lower accuracy than LibSVM. Moving from the SensIT dataset to the larger RCV1 dataset, the results remained consistent. We conclude that Vowpal Wabbit has a slight advantage in overall efficiency when a constraint is placed on computer memory size. The performance of both algorithms was relatively close, however. Due to the small size of the experiment, further testing is needed for a thorough comparison; testing over datasets that are more expansive, in both sample and feature size, could be used for a more significant experiment. Nevertheless, we feel that the sizes of the datasets used in this paper are adequate to provide conclusive results for the comparisons made.

===References===
Aly, M. 2005. Survey on multiclass classification methods. Neural Networks: 1-9.

Argyriou, A., Herbster, M., and Pontil, M. 2005. Combining graph Laplacians for semi-supervised learning.

Bottou, L., and Lin, C. J. 2007. Support vector machine solvers. In Large Scale Kernel Machines: 301-320.

Chang, K. W., and Roth, D. 2011. Selective block minimization for faster convergence of limited memory large-scale linear models. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 699-707.

Chang, C., and Lin, C. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3): 27:1-27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Cortes, C., and Vapnik, V. 1995. Support-vector networks. Machine Learning 20(3): 273-297.

Dekel, O. 2008. From online to batch learning with cutoff-averaging. NIPS.

Duarte, M. F., and Hen Hu, Y. 2004. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing 64(7): 826-838.

Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., and Lin, C. J. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9: 1871-1874.

Langford, J., Li, L., and Zhang, T. 2009. Sparse online learning via truncated gradient. Journal of Machine Learning Research 10: 777-801.

Lewis, D., Yang, Y., Rose, T., and Li, F. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5: 361-397.

Liu, Y., Wang, R., and Zeng, Y. S. 2007. An improvement of the one-against-one method for multi-class support vector machines. In Machine Learning and Cybernetics, 2007 International Conference 5: 2915-2920.

Menon, A. K. 2009. Large-scale support vector machines: Algorithms and theory. Research Exam, University of California, San Diego.

Pechyony, D., Shen, L., and Jones, R. Solving large scale linear SVM with distributed block minimization.

Pérez-Cruz, F., Figueiras, A., and Artes, A. 2004. Double chunking for solving SVMs for very large datasets.

Shalev-Shwartz, S., Singer, Y., and Srebro, N. 2007. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning: 807-814. ACM.

Tewari, A., and Bartlett, P. L. 2007. On the consistency of multiclass classification methods. Journal of Machine Learning Research 8: 1007-1025.

Yu, H. F., Hsieh, C. J., Chang, K. W., and Lin, C. J. 2012. Large linear classification when data cannot fit in memory. ACM Transactions on Knowledge Discovery from Data (TKDD) 5(4): 23.