    Comparison of Fast Learning Large-Scale Multi-Class Classification

                                             A. Jovanovich and A. Lazar
                                    Department of Computer Science and Information Systems
                                                  Youngstown State University
                                                  Youngstown, OH 44555
                                               {ajovanovich, alazar}@gmail.com




                                  Abstract

   Recent progress in the development of techniques to optimize large-scale
   classification problems has extended the use of multi-class classification,
   specifically the use of multi-class classification algorithms when the dataset
   is too large to fit into the limited memory available on most computers. The
   most prominent algorithms used today solve the multi-class classification
   problem through an optimization approach based on coordinate descent. Two of
   the most recognized algorithms, Vowpal Wabbit and the LIBLINEAR extension of
   LibSVM, have emerged as the most consistent options for solving multi-class
   problems. This paper presents an analysis of these methods and tests the
   efficiency and performance of each algorithm. The results are recorded and
   comparisons are made. After analyzing the results, we conclude that the Vowpal
   Wabbit algorithm is best suited for solving large-scale multi-class
   classification problems when computer memory is constrained.


                                Introduction

Big Data has affected the way statistical analysis is conducted. Many real-world
datasets contain hundreds or thousands of variables of interest and hundreds of
thousands or millions of records. Time spent reading and writing between memory and
disk becomes the bottleneck, rendering most algorithms inefficient (Yu et al. 2012).
Even with the growing memory sizes of computers, a large dataset can still be
problematic. As a consequence, the complexity of analysis increasingly becomes
unmanageable with traditional machine learning algorithms, and extracting useful
knowledge from dense data is time consuming. Coupled with the fact that most
algorithms use an iterative process that cycles through the dataset multiple times,
the analysis can seem impossible to finish.
   In the past, classification models have been shown to handle large amounts of
data well, and several optimization techniques have been applied to efficiently
train data-intensive models (Aly 2005). However, the performance of these algorithms
begins to decline when the data cannot fit into memory (Yu et al. 2012). In these
cases, training techniques that deal well with memory limitations become critical.
   Recent progress has been made in the development of techniques to optimize over
memory-constrained systems, including for binary classification. Binary
classification is a well-studied special case of the classification problem that
provides the foundation for multi-class problem solvers. Statistical properties of
binary classifiers, such as consistency, have been investigated in a variety of
settings. Most machine learning algorithms are originally developed for binary
decision problems and then extended to handle multi-class problems (Argyriou,
Herbster, and Pontil 2005).
   The goal of this paper is to study existing multi-class classification methods
and provide a noteworthy comparison. In order to provide a thorough and
comprehensive comparison, classification tests were run multiple times over
contrasting datasets. The paper is organized in the following manner. Section 2
contains background information. Section 3 presents related work. Our approach is
in Section 4, followed by experiments in Section 5. Conclusions follow and, finally,
the works cited.


                                 Background

The principles of multi-class classification are founded upon the same principles as
binary classification. Each case involves assigning a class label to each instance
present. Given a training data set of the form (xi, yi), where xi ∈ R^n is the ith
example and yi ∈ {1, . . . , K} is the ith class label, we aim at finding a learning
model H such that H(xi) = yi for new unseen examples (Menon 2009).
The problem is simply formulated in the two-class case, where the labels yi are just
0 or 1 for the two classes involved.
   Several algorithms have been proposed to solve the binary problem. Some of these
can be easily extended to the multi-class case, while others involve high levels of
computation. In contrast to traditional data classification, where each instance is
assigned only one label, instances used in multi-class classification may be
simultaneously relevant to several labels. For example, in text categorization one
document can belong to multiple topics.
   Existing approaches to handling large-scale multi-class classification can be
categorized into two types. The first approach solves problems through a block
minimization scheme. The second approach considers online learning algorithms
(Dekel 2008). Both attempt to solve the classification problem of assigning labels
from Y to instances X (Tewari and Bartlett 2007).


             Block Minimization for Support Vector Machines

Optimization through block minimization has been used in the past to efficiently
deal with data too large to fit into memory. One of the most prominent learning
algorithms associated with block minimization is the Support Vector Machine (SVM).
Proposed by Cortes and Vapnik in 1995, the algorithm has since grown into one of the
most widely used learning algorithms in the world (Bottou and Lin 2007). The
implementation and broad uses of SVMs have been well documented in the years since.
   The SVM framework can be modified to solve classification problems when only
limited memory is available. The process entails using an optimization technique
based on block minimization, where the term "block" refers to a partition of the
dataset small enough to be read into the available memory. The size and content of
each block varies from approach to approach. Pérez-Cruz, Figueiras, and Artes (2004)
propose the use of "double chunking," where data is partitioned into both "large
chunks" and "small chunks." Another approach, described by Chang and Roth (2011),
uses selective sampling for block minimization: by selecting only significant
instances, the goal is to minimize the size of the data blocks and speed up the
iteration process. Yu et al. (2012) suggest a framework for block minimization that
is also used for testing in this paper. In this approach, the amount of memory
available for processing determines the size of the blocks. The framework consists
of three steps: split the data and read the blocks into memory, set initial values,
and then solve for classification through an iterative process. The algorithm can be
summarized as follows:

 Algorithm 1 A block minimization framework for the dual SVM

  1. Split {1, . . . , l} into B1, . . . , Bm and store the data in m partitions
     accordingly.
  2. Set initial w.
  3. For k = 1, 2, . . . (outer iterations)
       For j = 1, . . . , m (inner iterations)
       3.1 Read xr, ∀r ∈ Bj from disk
       3.2 Approximately solve the sub-problem (1)
       3.3 Update w by (2)

   Here, {B1, . . . , Bm} are sequential partitions of all data indices
{1, . . . , l}. The size of the blocks is determined by the known memory
constraints. Each instance is read and randomly assigned to a block; Algorithm 2
explains the process of data splitting. Optimization over a single block is
identified as an inner iteration, whereas the m steps of going over all blocks are
deemed an outer iteration (Yu et al. 2012).

 Algorithm 2 Framework for block splitting

  1. Decide m and create m empty files.
  2. For i = 1, . . . , l
     2.1 Convert xi to a binary format.
     2.2 Randomly choose a number j ∈ {1, . . . , m}.
     2.3 Append xi to the end of the jth file.

   In order to optimize through block minimization, the dual form of the SVM must be
used. By examining the dual form of the optimization problem, we are able to write
the entire algorithm in terms of only inner products between input feature vectors.
Updates to the weight vector w correspond to the entire data set and treat instances
uniformly, which prevents the primal form of the SVM from being used (Shalev-Shwartz
et al. 2007). Algorithm 1 is able to efficiently learn in very high-dimensional
spaces.
   The sub-problem is solved in the inner iteration step, and the solution is then
used to update w. The solution uses only instances that belong to block Bj. The
solution to the sub-problem is presented below in (1):

      w = ∑_{i=1}^{l} αi yi xi                                            (1)
   The iteration round is then complete after w is updated. To update w, if dBj is
an optimal solution of the sub-problem:

      w ← w + ∑_{r∈Bj} dr yr xr                                           (2)

   The iteration process continues until optimization is reached, converging when
one of two conditions is met (Yu et al. 2012). The first condition states that
optimization is complete when the sub-problem for each block is solved and the
solutions converge. The second condition is a stopping criterion: usually a finite
number of iterations is chosen, or an accuracy threshold is reached.
   LIBLINEAR addresses both conditions while solving the sub-problem. The software
contains a library with tools for SVM classification when the data cannot fit into
memory (Yu et al. 2012). LIBLINEAR sequentially selects one variable to update and
fixes the others inside the block. The framework, not shown in this paper, is
explained by Yu et al. (2012). LIBLINEAR uses an SVM coordinate descent method and
solver to update the instances in block Bj before solving for (1).
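   As an illustration of how Algorithms 1 and 2 fit together, the following Python
sketch splits a training file in LIBSVM text format into block files and then runs
outer and inner iterations, applying a simplified dual coordinate step and the
update in (2). It is only a sketch under stated assumptions (binary labels in
{-1, +1}, a plain hinge-loss dual step, and a hypothetical file path); it is not the
LIBLINEAR block minimization implementation used in the experiments.

  # Sketch of Algorithms 1-2: block minimization for the dual of a linear SVM.
  # Assumptions (not from the paper): binary labels in {-1, +1}, hinge-loss dual
  # with upper bound C, and data stored in LIBSVM text format under the
  # hypothetical path "train.libsvm".
  import random
  from collections import defaultdict

  def split_into_blocks(path, m):
      """Algorithm 2: randomly assign each instance to one of m block files."""
      block_paths = [f"{path}.block{j}" for j in range(m)]
      files = [open(p, "w") for p in block_paths]
      with open(path) as src:
          for line in src:
              random.choice(files).write(line)   # steps 2.2-2.3
      for f in files:
          f.close()
      return block_paths

  def read_block(path):
      """Load one block: lists of labels and sparse feature dicts."""
      ys, xs = [], []
      with open(path) as f:
          for line in f:
              parts = line.split()
              ys.append(float(parts[0]))
              xs.append({int(k): float(v) for k, v in
                         (tok.split(":") for tok in parts[1:])})
      return ys, xs

  def dot(w, x):
      return sum(w[j] * v for j, v in x.items())

  def block_minimization(path, m=8, outer_iters=10, C=1.0):
      """Algorithm 1: outer passes, inner dual coordinate updates per block."""
      block_paths = split_into_blocks(path, m)
      w = defaultdict(float)                       # step 2: initial w = 0
      alpha = defaultdict(float)                   # dual variable per instance
      for k in range(outer_iters):                 # outer iterations
          for j, bpath in enumerate(block_paths):  # inner iterations over blocks
              ys, xs = read_block(bpath)           # step 3.1: read block from disk
              for r, (y, x) in enumerate(zip(ys, xs)):
                  # step 3.2: approximate dual coordinate step for instance r
                  q = sum(v * v for v in x.values()) or 1.0
                  d = min(max(alpha[(j, r)] + (1 - y * dot(w, x)) / q, 0.0), C) \
                      - alpha[(j, r)]
                  alpha[(j, r)] += d
                  # step 3.3: update w by (2): w <- w + d * y * x
                  for idx, v in x.items():
                      w[idx] += d * y * v
      return w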
             Online Learning with Stochastic Gradient Descent

Online learning algorithms are designed to effectively classify data by building a
weight model derived from sequentially received training examples. Compared to block
minimization, which solves a sub-problem for each block, online learning updates
instances through the use of a cache file. Each iteration round updates the cache
file where the weight model is stored. The algorithm classifies each instance and
uses the new "instance-label pair" to update and improve the stored model (Tewari
and Bartlett 2007). This method is expected to accurately predict the labels of
instances that are not part of the training set.
   Several strategies have been proposed to optimize online learning algorithms,
most of which aim to extend the original purpose of binary classification to
multi-class learning. In the study conducted by Argyriou, Herbster, and Pontil
(2005), an extension to a graph-based approach to online learning was discussed.
Shalev-Shwartz et al. (2007) exploited the dual formulation of the optimization to
create a more efficient online learning algorithm.
   John Langford and his colleagues at Yahoo! Research developed Vowpal Wabbit, a
fast online-learning algorithm that uses stochastic gradient descent. Vowpal Wabbit
can handle very large datasets without ever needing to load the entire dataset into
memory. The algorithm also requires less computational power and far fewer resources
by learning through online gradient descent (Langford, Li, and Zhang 2009).
Algorithm 3 below presents the framework.

 Algorithm 3 Online-learning framework for Vowpal Wabbit

 Inputs:
  • threshold θ ≥ 0
  • gravity sequence gi ≥ 0
  • learning rate η ∈ (0, 1)

 Initialize weights w^j ← 0 (j = 1, . . . , d)
  1. Acquire an unlabeled example x = [x^1, x^2, . . . , x^d]
  2. Compute the prediction: ŷ = ∑_j w^j x^j
  3. Acquire the label y from the dataset
     For all weights w^j (j = 1, . . . , d):
     (a) if w^j > 0 and w^j ≤ θ then w^j ← max{w^j − gi η, 0}
     (b) else if w^j < 0 and w^j ≥ −θ then w^j ← min{w^j + gi η, 0}
  4. Update the weights for all features j: w^j ← w^j + 2η(y − ŷ)x^j

   Here, the superscript in w^j denotes the jth component of the vector w, in order
to differentiate it from wi, which references the ith weight vector (Langford, Li,
and Zhang 2009).
   Updates to the assigned weights are stored in a cache file that is small enough
to be read into the limited memory available. Vowpal Wabbit stores only the cache
file, allowing for faster implementation and increased performance through the
optimization process.
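   To make the update rule in Algorithm 3 concrete, the following Python sketch
performs one online pass with truncated-gradient weight updates. It assumes dense
feature vectors, a constant gravity value, and the squared-loss update written in
step 4; it illustrates the mechanism only and is not the Vowpal Wabbit
implementation itself.

  # Sketch of Algorithm 3: online learning with truncated gradient
  # (Langford, Li, and Zhang 2009). Assumptions not taken from the paper:
  # dense feature vectors, squared loss as in step 4, and a constant gravity.
  def truncated_gradient_pass(examples, d, eta=0.1, theta=1.0, gravity=0.001):
      """One pass over (x, y) pairs; x is a length-d list, y a float label."""
      w = [0.0] * d                                   # initialize weights to 0
      for x, y in examples:
          y_hat = sum(w[j] * x[j] for j in range(d))  # step 2: prediction
          for j in range(d):                          # step 3: truncation
              if 0 < w[j] <= theta:
                  w[j] = max(w[j] - gravity * eta, 0.0)
              elif -theta <= w[j] < 0:
                  w[j] = min(w[j] + gravity * eta, 0.0)
          for j in range(d):                          # step 4: gradient update
              w[j] += 2 * eta * (y - y_hat) * x[j]
      return w

  # Hypothetical usage on a tiny 3-feature stream:
  stream = [([1.0, 0.0, 2.0], 1.0), ([0.0, 1.0, 0.5], 0.0)]
  weights = truncated_gradient_pass(stream, d=3)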
                                Related Work

In this section we discuss related work pertaining to comparisons of large-scale
classification methods. The topic has accumulated a comprehensive body of work,
primarily for binary classification; as such, there appears to be a gap in knowledge
pertaining to multi-class classification. We derive the techniques used for our
comparison from strategies implemented in related work.
   Yu et al. (2012) first introduced a comparison for large linear classification.
Their comparison of SVM solvers and online-learning algorithms extended only to
binary classification. The study made one assumption that is significant for our
comparison.
   Figure 1 SensIT Vehicle Class Training Frequency (frequency of each class in the
   training set)

   Figure 2 RCV1 Class Training Frequency (frequency of each class in the training
   set)

   Assuming that the amount of available memory is limited, entire datasets cannot
be stored in memory but can be stored on the disk of one computer. The datasets used
in this paper are large enough to satisfy this constraint and must be accessed from
the hard drive where they are stored.
   Multi-class classification was addressed by Chang and Roth (2011). In that paper,
comparisons were drawn between batch learning and online-learning algorithms.
Unfortunately, Vowpal Wabbit at that time was limited to binary classification, so
no comparisons involving it were drawn. Instead the focus was primarily on block
minimization algorithms, of which LibSVM was deemed superior to several others. It
is our intent to extend this comparison of LibSVM to Vowpal Wabbit.


                                  Approach

The aim of this paper is to test the performance of Vowpal Wabbit (Langford, Li, and
Zhang 2009) against the LIBLINEAR block learning extension of LibSVM (Yu et al.
2012). Due to recent advances in the Vowpal Wabbit library, this is one of the first
tests using its multi-class solver.
   Testing was performed using the one-against-all multi-class solver options
implemented in both algorithms. For a K-class problem, the one-against-all method
constructs K models, where each model separates one class from the rest. The ith
model is trained on all of the instances, with binary labels indicating whether each
instance belongs to the ith class. The final output of the one-against-all method is
the class that corresponds to the SVM with the highest output value (Liu, Wang, and
Zeng 2007).
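   The one-against-all scheme just described can be sketched in a few lines of
Python. The binary learner shown here (scikit-learn's LinearSVC) and the toy data
are illustrative assumptions, not the solvers or datasets used in our experiments.

  # Sketch of one-against-all: K binary models, each separating one class from
  # the rest; the predicted class is the model with the largest score.
  import numpy as np
  from sklearn.svm import LinearSVC

  def train_one_vs_all(X, y, classes):
      models = {}
      for c in classes:
          binary_labels = (y == c).astype(int)        # class c vs. the rest
          models[c] = LinearSVC().fit(X, binary_labels)
      return models

  def predict_one_vs_all(models, X):
      # decision_function gives a signed score; pick the class with the max score
      scores = np.column_stack([models[c].decision_function(X) for c in models])
      class_list = list(models)
      return np.array([class_list[i] for i in scores.argmax(axis=1)])

  # Hypothetical toy usage with 3 classes:
  X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.2, 0.8]])
  y = np.array([1, 2, 3, 1])
  models = train_one_vs_all(X, y, classes=[1, 2, 3])
  print(predict_one_vs_all(models, X))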
Data

Comparisons were drawn from tests conducted over multi-class datasets. The SensIT
Vehicle dataset was used, in addition to the larger 53-class RCV1 dataset. The
smaller dataset was used to establish a baseline reading and evaluation; the
classification problem was then extended from this simple case to the more complex
one posed by the RCV1 dataset, whose tests reinforce the results gathered from the
SensIT dataset.
   The SensIT Vehicle dataset features 3 class labels. The instances were extracted
from sensor data collected during a real-world experiment carried out at Twentynine
Palms, CA, in November 2001 (Duarte and Hen Hu 2004). The sensors were used to
capture both acoustic and seismic activity from vehicles in the vicinity; in total,
50 features were extracted. Each vehicle was driven around a road while the sensors
collected information as it passed. Classes included in the training set are AAV3
(class 1), DW3 (class 2), and a third class for noise (class 3). In total there are
78,823 training samples and 19,705 testing samples. Figure 1 presents the frequency
of each class used for training. The distribution of samples shows an unbalanced
dataset in which more instances of noise are recorded than of vehicles; the total
sample size of the two vehicle classes combined is close to the total sample size of
the noise class. Data skewed in this way can increase the error rate and degrade
performance in certain cases. Later in the paper we discuss the impact of this on
classification accuracy.
   The RCV1 dataset was used in part because of its 53-class problem. The Reuters
Corpus Volume I (RCV1) is one of the most widely used test collections for text
categorization research (Lewis et al. 2004). It contains 534,135 newswire documents,
which are split into 15,564 training documents and 518,571 test documents. The
dataset contains 47,236 features. The class frequency of all 53 labels is presented
in figure 2. Again, the frequencies of the samples collected in the training dataset
represent an unbalanced sample, in which the first 25 class frequencies are
significantly larger than the last 28.
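   Both collections are commonly distributed in sparse LIBSVM text format. As an
illustration only, such a file can be loaded and inspected with scikit-learn as
sketched below; the file name is hypothetical.

  # Sketch: inspecting a training file in sparse LIBSVM format, the format in
  # which the SensIT Vehicle and RCV1 collections are commonly distributed.
  # The file name below is hypothetical.
  from sklearn.datasets import load_svmlight_file

  X_train, y_train = load_svmlight_file("sensit_combined.train")
  print(X_train.shape)         # (number of samples, number of features)
  print(sorted(set(y_train)))  # the class labels present in the training set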
   Figure 3 Iteration Time over SensIT Data (time in seconds vs. iteration, Vowpal
   Wabbit vs. libSVM)

   Figure 4 Percent Accuracy Per Iteration over SensIT Data (Vowpal Wabbit vs.
   libSVM)

   Figure 5 Iteration Time over RCV1 Data (time in seconds vs. iteration, Vowpal
   Wabbit vs. libSVM)

   Figure 6 Percent Accuracy Per Iteration over RCV1 Data (Vowpal Wabbit vs. libSVM)

                                 Experiment

In order to successfully draw comparisons, testing criteria were first established.
The criteria are based on the overall performance of the algorithms: their ability
to learn a classification model and successfully label each instance with the right
class. A stopping criterion was also established before the tests were conducted;
the iteration threshold was set to 78 rounds. Figures 3-6 present the individual
performance measures of the algorithms from iteration round 1 to the stopping
threshold reached at round 78. The algorithms were compiled in C++ and run on the
Linux operating system. All experiments were run on a laptop with 6 GB of memory. To
ensure accurate results, each test was repeated 3 times and the averages were used
to rate performance.
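   The measurement protocol just described (a budget of 78 iteration rounds, with
cumulative training time and test accuracy recorded after each round and averaged
over 3 repeated runs) can be summarized by a small Python harness. The
train_one_pass and predict callables are hypothetical placeholders for whichever
learner is being timed; this is a sketch of the protocol, not the actual test
scripts.

  # Sketch of the measurement protocol: run a learner for 78 passes, record the
  # cumulative training time and test accuracy after each pass, and average the
  # numbers over 3 repetitions. `train_one_pass` and `predict` are hypothetical
  # callables standing in for the learner under test.
  import time

  def evaluate(train_one_pass, predict, test_X, test_y, passes=78, repeats=3):
      totals = {p: {"time": 0.0, "acc": 0.0} for p in range(1, passes + 1)}
      for _ in range(repeats):
          model, elapsed = None, 0.0
          for p in range(1, passes + 1):
              start = time.time()
              model = train_one_pass(model)          # one pass over training data
              elapsed += time.time() - start
              preds = predict(model, test_X)
              acc = sum(a == b for a, b in zip(preds, test_y)) / len(test_y)
              totals[p]["time"] += elapsed           # cumulative training time
              totals[p]["acc"] += acc
      # average over the repeated runs
      return {p: {k: v / repeats for k, v in d.items()} for p, d in totals.items()}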
   Before the experiment can be run using the LIBLINEAR extension of LibSVM, the
training dataset must be partitioned into smaller blocks. The optimal number of
partitions was found to be 8 in this case. Vowpal Wabbit, on the other hand, uses a
cache file, which fulfills the memory constraint: the dataset does not need to be
split or compressed, and Vowpal Wabbit can access each instance without reading the
entire dataset. The algorithms were first trained and tested using the SensIT
dataset, which represents a simple scenario where binary classification is extended
into a 3-class problem. The results from both algorithms are shown in figures 3
and 4.
   On average, the runtime per iteration for Vowpal Wabbit was 0.36 seconds, while
the average for libSVM was 1.48 seconds; Vowpal Wabbit performed roughly 4 times as
fast as LibSVM. Both algorithms scored over 80% accuracy classifying the testing
instances. By the end of the 78th iteration the algorithms were close to converging
around 80.3 percent. The biggest differences recorded can be seen in the first 5
iteration rounds. This is where the optimization approaches can be distinguished.
                                SensIT Dataset
              Vowpal Wabbit                          libSVM
  Iteration   Time (s)   Accuracy          Time (s)   Accuracy
      1         1.28      79.86%             5.52      80.47%
      2         1.64      80.06%             8.00      80.47%
      3         2.05      80.12%             9.53      80.43%
      4         2.46      80.15%            11.25      80.46%
      5         2.87      80.16%            13.03      80.42%
      6         3.29      80.20%            15.43      80.43%
      7         3.56      80.19%            16.09      80.46%
      8         3.95      80.19%            17.27      80.43%
      9         4.40      80.19%            18.76      80.43%
     10         4.81      80.19%            20.14      80.47%
     20         7.13      80.24%            34.28      80.45%
     30        10.83      80.22%            48.93      80.43%
     50        17.52      80.27%            76.93      80.44%
     78        27.81      80.32%           116.56      80.43%

                Figure 7 SensIT Data Table of Statistics

                                 RCV1 Dataset
              Vowpal Wabbit                          libSVM
  Iteration   Time (s)   Accuracy          Time (s)   Accuracy
      1         1.06      83.40%             8.20      88.64%
      2         1.53      85.81%             9.46      88.64%
      3         1.94      86.57%            10.80      88.64%
      4         2.36      86.85%            12.25      88.64%
      5         2.85      86.96%            13.61      88.64%
      6         3.37      86.99%            14.87      88.64%
      7         3.84      86.96%            16.20      88.64%
      8         4.02      86.90%            17.63      88.64%
      9         4.48      86.83%            19.01      88.64%
     10         5.26      86.75%            20.33      88.64%
     20         9.46      86.12%            33.99      80.45%
     30        13.91      85.76%            47.30      80.43%
     50        22.06      85.45%            74.17      80.44%
     78        33.85      85.28%           111.93      80.43%

                 Figure 8 RCV1 Data Table of Statistics

   As the Vowpal Wabbit algorithm passes through the dataset and the weights are
updated from the initial labels, the accuracy of classification begins to rise. The
consistent accuracy rate of LibSVM, by contrast, can be attributed to the block
minimization framework: because the sub-problem is solved within each equally
distributed block first, the updated weights do not yield higher accuracy; instead
the rate is consistent throughout the optimization process.
   The second experiment was conducted over the RCV1 dataset. Classification using
the one-against-all option becomes increasingly complicated with each additional
class, and the increased sample, class, and feature size of the RCV1 dataset taxes
the memory load in such a constrained environment. Performance of both algorithms
stayed consistent with the baseline ratings established by the tests run over the
SensIT dataset. The average runtime per iteration for Vowpal Wabbit increased
slightly to 0.43 seconds, while LibSVM decreased slightly to 1.43 seconds. The
runtime results for RCV1, presented in figure 5, are similar to those of the SensIT
testing in figure 3. As with the SensIT dataset, Vowpal Wabbit is consistently
quicker than libSVM.
   Classification accuracy rose 8 percent when comparing the results of figure 6 to
figure 4. The larger dataset had a positive impact on the accuracy rate of
classification for the two algorithms. Again, LibSVM scores a higher classification
accuracy while achieving more consistent results. The accuracy of Vowpal Wabbit
rises until the 20th iteration; from there the accuracy drops until the 78th
iteration. Unlike the SensIT tests, the algorithms never appear to converge.
   Overall, the LIBLINEAR extension of LibSVM achieved the highest classification
rates. At no point in the iteration process did the Vowpal Wabbit algorithm
correctly classify more test instances. The highest classification accuracy rate
occurred during the first iteration step, when LibSVM correctly classified 80.47% of
the SensIT testing dataset, whereas Vowpal Wabbit obtained a lower rate of 80.32%
after the final iteration. Figure 7 presents a table with the results gathered from
testing over the SensIT test dataset. Once more, LibSVM attained the highest
classification rate over the RCV1 testing dataset. The results from both algorithms
are presented in figure 8. The highest accuracy mark was obtained during the first
iteration round, when LibSVM correctly classified 88.64% of the testing instances, a
rate that was maintained throughout the iteration process. Vowpal Wabbit reached its
highest mark with an accuracy of 86.99% after the 6th iteration round.
   The higher accuracy ratings of LibSVM do come at a cost, however. The runtime for
each iteration was significantly slower than that of Vowpal Wabbit. The total time
Vowpal Wabbit needed to reach the 78th iteration round, presented in figure 7, was
only 27.81 seconds, compared to 116.56 seconds for LibSVM. Figure 8 presents similar
results: Vowpal Wabbit was recorded at 33.85 seconds, whereas LibSVM reached the
stopping threshold in 111.93 seconds. In both cases Vowpal Wabbit was roughly 4
times as fast.
   Surprisingly, both algorithms achieved higher levels of performance solving the
larger RCV1 classification problem. The iteration runtime of both algorithms
experienced a small increase when figures 3 and 5 are compared, and the
classification accuracy improved in figure 6 when compared to the accuracy ratings
in figure 4. The lower rates of classification accuracy can be attributed to the
skewed training data. The frequency distributions presented in figures 1 and 2 are
unbalanced, and as such the classification accuracy over the smaller SensIT dataset
was lower than over the larger RCV1 dataset. The increase in instances correlated
with the accuracy of overall classification: the more instances the algorithm had to
train over, the more the classification accuracy improved.
                     Conclusion
Using the results recorded in figures 3-8, the following question about fast
learning large-scale multi-class classification can be answered:
   Question: Which algorithm is most efficient when there are constraints on memory?
When compared, Vowpal Wabbit is the more efficient multi-class classification
algorithm. The results from the SensIT test case suggested that Vowpal Wabbit was
the quicker algorithm while maintaining a slightly lower accuracy percentage than
LibSVM. Moving from the SensIT dataset to the larger RCV1, the results remained
consistent. We conclude that Vowpal Wabbit has a slight advantage in overall
efficiency when there is a constraint placed on computer memory size.
   The performance of both algorithms was relatively close, however. Due to the
small size of the experiment, further testing is needed for a thorough comparison.
Testing over datasets that are more expansive, in both sample and feature size,
could be used for a more significant experiment. However, we feel that the size of
the datasets used in this paper is adequate to provide conclusive results for the
comparisons made.
                                 References

Aly, M. 2005. Survey on multiclass classification methods. Neural Networks: 1-9.

Argyriou, A., Herbster, M., and Pontil, M. 2005. Combining graph Laplacians for
semi-supervised learning.

Bottou, L., and Lin, C. J. 2007. Support vector machine solvers. Large Scale Kernel
Machines: 301-320.

Chang, K. W., and Roth, D. 2011. Selective block minimization for faster convergence
of limited memory large-scale linear models. In Proceedings of the 17th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining: 699-707.

Chang, C., and Lin, C. 2011. LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology 2(3): 27:1-27:27. Software
available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Cortes, C., and Vapnik, V. 1995. Support-vector networks. Machine Learning 20(3):
273-297.

Dekel, O. 2008. From online to batch learning with cutoff-averaging. NIPS.

Duarte, M. F., and Hen Hu, Y. 2004. Vehicle classification in distributed sensor
networks. Journal of Parallel and Distributed Computing 64(7): 826-838.

Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., and Lin, C. J. 2008. LIBLINEAR:
A library for large linear classification. The Journal of Machine Learning Research
9: 1871-1874.

Langford, J., Li, L., and Zhang, T. 2009. Sparse online learning via truncated
gradient. The Journal of Machine Learning Research 10: 777-801.

Lewis, D., Yang, Y., Rose, T., and Li, F. 2004. RCV1: A new benchmark collection for
text categorization research. Journal of Machine Learning Research 5: 361-397.

Liu, Y., Wang, R., and Zeng, Y. S. 2007. An improvement of one-against-one method
for multi-class support vector machine. In Machine Learning and Cybernetics, 2007
International Conference 5: 2915-2920.

Menon, A. K. 2009. Large-scale support vector machines: algorithms and theory.
Research Exam, University of California, San Diego.

Pechyony, D., Shen, L., and Jones, R. Solving large scale linear SVM with
distributed block minimization.

Pérez-Cruz, F., Figueiras, A., and Artes, A. 2004. Double chunking for solving SVMs
for very large datasets.

Shalev-Shwartz, S., Singer, Y., and Srebro, N. 2007. Pegasos: Primal estimated
sub-gradient solver for SVM. In Proceedings of the 24th International Conference on
Machine Learning: 807-814. ACM.

Tewari, A., and Bartlett, P. L. 2007. On the consistency of multiclass
classification methods. Journal of Machine Learning Research 8: 1007-1025.

Yu, H. F., Hsieh, C. J., Chang, K. W., and Lin, C. J. 2012. Large linear
classification when data cannot fit in memory. ACM Transactions on Knowledge
Discovery from Data (TKDD) 5(4): 23.