=Paper=
{{Paper
|id=Vol-1348/maics2013_paper_8
|storemode=property
|title=Comparison of Fast Learning Large-Scale Multi-Class Classification
|pdfUrl=https://ceur-ws.org/Vol-1348/maics2013_paper_8.pdf
|volume=Vol-1348
|dblpUrl=https://dblp.org/rec/conf/maics/JovanovichL13
}}
==Comparison of Fast Learning Large-Scale Multi-Class Classification==
A. Jovanovich and A. Lazar
Department of Computer Science and Information Systems, Youngstown State University, Youngstown, OH 44555
{ajovanovich, alazar}@gmail.com

===Abstract===
Recent progress in the development of techniques to optimize large-scale classification problems has extended the use of multi-class classification, specifically the use of multi-class algorithms when the dataset is too large to fit into the limited memory available on most computers. The most prominent algorithms used today solve the multi-class classification problem through an optimization approach based on coordinate descent. Two of the most recognized implementations, Vowpal Wabbit and the LIBLINEAR extension of LibSVM, have emerged as the most consistent options for such multi-class problems. This paper presents an analysis of these methods and tests the efficiency and performance of each algorithm. The results are recorded and compared. After analyzing the results, we conclude that the Vowpal Wabbit algorithm is best suited for solving large-scale multi-class classification problems when computer memory is constrained.

===Introduction===
Big Data has changed the way statistical analysis is conducted. Many real-world datasets contain hundreds or thousands of variables of interest and hundreds of thousands or millions of records. Time spent reading and writing between memory and disk becomes the bottleneck, rendering most algorithms inefficient (Yu et al. 2012). Even with the growing memory sizes of computers, a large dataset can still be problematic. As a consequence, the complexity of analysis increasingly becomes unmanageable with traditional machine learning algorithms. Extracting useful knowledge from dense data makes analysis time consuming, and because most algorithms use an iterative process that cycles through the dataset multiple times, the process can become practically impossible to finish.

In the past, classification models have been shown to handle large amounts of data well, and several optimization techniques have been applied to efficiently train data-intensive models (Aly 2005). However, the performance of these algorithms begins to decline when the data cannot be processed in memory (Yu et al. 2012). In these cases, training techniques that deal well with memory limitations become critical.

Recent progress has been made in the development of techniques to optimize over memory-constrained systems, including for binary classification. Binary classification is a well-studied special case of the classification problem and provides the foundation for multi-class problem solvers. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Most machine learning algorithms are originally developed for binary decision problems and are then extended to handle multi-class problems (Argyriou, Herbster, and Pontil 2005).

The goal of this paper is to study existing multi-class classification methods and provide a noteworthy comparison. In order to provide a thorough and comprehensive comparison, classification tests were run multiple times over contrasting datasets. The paper is organized as follows. Section 2 contains background information, and Section 3 presents related work. Our approach is described in Section 4, followed by experiments in Section 5. Conclusions and references close the paper.
===Background===
The principles of multi-class classification are founded upon the same principles as binary classification. Each case involves assigning a class label to each instance present. Given a training dataset of the form (xi, yi), where xi ∈ R^n is the ith example and yi ∈ {1, ..., K} is the ith class label, we aim at finding a learning model H such that H(xi) = yi for new, unseen examples (Menon 2009). The problem is simply formulated in the two-class case, where the labels yi are just 0 or 1 for the two classes involved.

Several algorithms have been proposed to solve the binary problem. Some of these can be easily extended to the multi-class case, while others involve high levels of computation. In contrast to traditional classification, where each instance is assigned only one label, instances used in multi-class classification may be simultaneously relevant to several labels; in text categorization, for example, one document can belong to multiple topics.

Existing approaches to large-scale multi-class classification can be categorized into two types. The first solves the problem through a block minimization scheme; the second considers online learning algorithms (Dekel 2008). Both attempt to solve the classification problem of assigning labels from Y to instances X (Tewari and Bartlett 2007).

===Block Minimization for Support Vector Machines===
Optimization through block minimization has been used in the past to deal efficiently with data too large to fit into memory. One of the most prominent learning algorithms associated with block minimization is the Support Vector Machine (SVM). Proposed by Cortes and Vapnik in 1995, the SVM has since grown into one of the most widely used learning algorithms in the world (Bottou and Lin 2007), and its implementation and broad uses have been well documented. The SVM framework can be modified to solve classification problems when only limited memory is available.

The process entails an optimization technique based on block minimization, where the term "block" refers to a partition of the dataset small enough to be read into the available memory. The size and content of each block vary from approach to approach. Pérez-Cruz, Figueiras, and Artes (2004) propose "double chunking", where data is partitioned into both "large chunks" and "small chunks". Another approach, described by Chang and Roth (2011), uses selective sampling for block minimization: by selecting only significant instances, the goal is to minimize the size of the data blocks and speed up the iteration process. Yu et al. (2012) suggest the block minimization framework that is also used for testing in this paper. In this approach the amount of memory available for processing determines the size of the blocks. The framework consists of three steps: split the data and read the blocks into memory, set initial values, and solve for classification through an iterative process. The algorithm can be summarized as follows:

Algorithm 1: A block minimization framework for the dual SVM
1. Split {1, ..., l} into B1, ..., Bm and store the data into m partitions accordingly.
2. Set initial w.
3. For k = 1, 2, ... (outer iterations):
   For j = 1, ..., m (inner iterations):
   3.1 Read xr, ∀r ∈ Bj, from disk.
   3.2 Approximately solve the sub-problem (1).
   3.3 Update w by (2).

Here, {B1, ..., Bm} are sequential partitions of all data indices {1, ..., l}. The size of the blocks is determined by the known memory constraints. Each instance is read and randomly assigned to a block; Algorithm 2 explains the data splitting process. Optimization over a single block is identified as an inner iteration, whereas the m steps of going over all blocks constitute an outer iteration (Yu et al. 2012).

Algorithm 2: Framework for block splitting
1. Decide m and create m empty files.
2. For i = 1, ..., l:
   2.1 Convert xi to a binary format.
   2.2 Randomly choose a number j ∈ {1, ..., m}.
   2.3 Append xi to the end of the jth file.
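To make the splitting and iteration structure concrete, the sketch below expresses Algorithms 1 and 2 in Python. This is our own illustrative reading of the framework, not code from LIBLINEAR or Yu et al.; the use of NumPy `.npz` files as the binary block format and the `solve_subproblem` callable are assumptions made for illustration.

```python
import numpy as np

def split_into_blocks(X, y, m, prefix="block"):
    """Algorithm 2 sketch: randomly assign each instance to one of m
    block files stored in a binary format (.npz is an assumption)."""
    rng = np.random.default_rng(0)
    assignment = rng.integers(0, m, size=len(y))          # step 2.2
    paths = []
    for j in range(m):
        idx = np.where(assignment == j)[0]
        path = f"{prefix}_{j}.npz"
        np.savez(path, X=X[idx], y=y[idx])                # step 2.3: write the jth file
        paths.append(path)
    return paths

def block_minimization(paths, n_features, outer_iters, solve_subproblem):
    """Algorithm 1 sketch: outer passes over all blocks, with only one
    block in memory at a time; solve_subproblem stands in for the solver."""
    w = np.zeros(n_features)                              # step 2: initial w
    for k in range(outer_iters):                          # outer iterations
        for path in paths:                                # inner iterations
            blk = np.load(path)                           # step 3.1: read block from disk
            w = solve_subproblem(w, blk["X"], blk["y"])   # steps 3.2-3.3
    return w
```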
In order to optimize through block minimization, the dual form of the SVM must be used. By examining the dual form of the optimization problem, the entire algorithm can be written in terms of inner products between input feature vectors. Because updates involve the weight vector w, which corresponds to the entire dataset and treats instances uniformly, the primal form of the SVM cannot be used here (Shalev-Shwartz et al. 2007). Algorithm 1 is therefore able to learn efficiently in very high dimensional spaces.

The sub-problem is solved in the inner iteration step, and the solution is then used to update w. The solution uses only instances that belong to block Bj. The relation between w and the dual variables is

w = ∑_{i=1,...,l} αi yi xi      (1)

The iteration round is complete after w is updated. If dBj is an optimal solution of the sub-problem, w is updated by

w ← w + ∑_{r ∈ Bj} dr yr xr      (2)

The iteration process continues until optimization is reached, converging when one of two conditions is met (Yu et al. 2012). The first condition states that optimization is complete when the sub-problem for each block has been solved and the solutions converge. The second condition is a stopping criterion: usually a finite number of iterations is chosen, or an accuracy threshold is reached.

LIBLINEAR addresses both conditions while solving the sub-problem. The software contains a library with tools for SVM classification when the data cannot fit into memory (Yu et al. 2012). LIBLINEAR sequentially selects one variable to update and fixes the others inside the block; the full framework, not shown in this paper, is explained by Yu et al. (2012). LIBLINEAR uses an SVM coordinate descent method as its solver to update the instances in block Bj while maintaining relation (1).
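To illustrate how a single block is handled, the sketch below performs a few passes of dual coordinate descent over one block, keeping w synchronized with (1) and applying the per-coordinate form of update (2). It is a generic hinge-loss sketch in the spirit of the description above, not LIBLINEAR's actual solver; binary labels in {+1, −1}, the regularization constant C, and the number of sweeps are assumptions, and the per-block alpha values would have to be stored alongside the block (for example next to the files written in the earlier sketch) for it to serve as the sub-problem solver there.

```python
import numpy as np

def dual_cd_on_block(w, X, y, alpha, C=1.0, sweeps=10):
    """A few passes of dual coordinate descent over one block B_j.
    One alpha_r changes at a time while the others stay fixed, and w is
    kept equal to sum_i alpha_i * y_i * x_i (equation (1)); the change
    d_r = new_alpha - alpha[r] applies equation (2) coordinate by coordinate."""
    for _ in range(sweeps):
        for r in range(len(y)):
            x_r, y_r = X[r], y[r]
            grad = y_r * w.dot(x_r) - 1.0            # gradient of the dual objective in alpha_r
            q_rr = x_r.dot(x_r)
            if q_rr == 0.0:
                continue
            new_alpha = min(max(alpha[r] - grad / q_rr, 0.0), C)
            w = w + (new_alpha - alpha[r]) * y_r * x_r   # equation (2), one coordinate
            alpha[r] = new_alpha
    return w, alpha
```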
===Online Learning with Stochastic Gradient Descent===
Online learning algorithms are designed to effectively classify data by building a weight model derived from sequentially received training examples. Compared to block minimization, which solves a sub-problem for each block, online learning updates instances through the use of a cache file. Each iteration round updates the cache file where the weight model is stored. The algorithm classifies each instance and uses the new instance-label pair to update and improve the stored model (Tewari and Bartlett 2007). This method is expected to accurately predict the labels of instances that are not part of the training set.

Several strategies have been proposed to optimize online learning algorithms, most of which aim to extend algorithms originally designed for binary classification to multi-class learning. Argyriou, Herbster, and Pontil (2005) discuss an extension of a graph-based approach to online learning. Shalev-Shwartz et al. (2007) exploit the dual formulation of the optimization problem to create a more efficient online learning algorithm.

John Langford and his colleagues at Yahoo! Research developed Vowpal Wabbit, a fast online learning algorithm based on stochastic gradient descent. Vowpal Wabbit can handle very large datasets without ever needing to load the entire dataset into memory. The algorithm also requires less computational power and far fewer resources by learning through truncated online gradient descent (Langford, Li, and Zhang 2009). Algorithm 3 presents the framework.

Algorithm 3: Online learning framework for Vowpal Wabbit
Inputs:
* threshold θ ≥ 0
* gravity sequence gi ≥ 0
* learning rate η ∈ (0, 1)
Initialize weights w^j ← 0 (j = 1, ..., d).
1. Acquire an unlabeled example x = [x^1, x^2, ..., x^d].
2. Compute the prediction ŷ = ∑_j w^j x^j.
3. Acquire the label y from the dataset. For all weights w^j (j = 1, ..., d):
   (a) if w^j > 0 and w^j ≤ θ then w^j ← max{w^j − gi η, 0}
   (b) else if w^j < 0 and w^j ≥ −θ then w^j ← min{w^j + gi η, 0}
4. Update the weights for all features j: w^j ← w^j + 2η(y − ŷ)x^j.

Here the superscript in w^j denotes the jth component of the vector w, to differentiate it from wi, which references the ith weight vector (Langford, Li, and Zhang 2009).

Updates to the assigned weights are stored in a cache file that is small enough to be read into the limited memory available. Vowpal Wabbit stores only the cache file, allowing for a faster implementation and increased performance throughout the optimization process.
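Algorithm 3 translates almost line for line into code. The sketch below is an illustrative re-implementation of one truncated-gradient round for a squared-loss linear model, not Vowpal Wabbit's source; dense NumPy vectors and the example parameter values are assumptions.

```python
import numpy as np

def truncated_gradient_round(w, x, y, eta=0.1, theta=1.0, gravity=0.01):
    """One round of Algorithm 3: predict, shrink small weights toward
    zero (the "gravity" truncation), then take a squared-error gradient step."""
    y_hat = w.dot(x)                                   # step 2: prediction
    pos = (w > 0) & (w <= theta)                       # step 3(a)
    neg = (w < 0) & (w >= -theta)                      # step 3(b)
    w[pos] = np.maximum(w[pos] - gravity * eta, 0.0)
    w[neg] = np.minimum(w[neg] + gravity * eta, 0.0)
    w += 2.0 * eta * (y - y_hat) * x                   # step 4: gradient update
    return w

# Example use over a stream of (x, y) pairs:
# w = np.zeros(d)
# for x, y in stream:
#     w = truncated_gradient_round(w, x, y)
```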
===Related Work===
In this section we discuss related work pertaining to comparisons of large-scale classification methods. The topic has produced a comprehensive literature primarily for binary classification, so there remains a gap in knowledge pertaining to multi-class classification. We derive the techniques used for our comparison from strategies implemented in related work.

Yu et al. (2012) first introduced a comparison for large-scale linear classification. Their comparison of SVM solvers and online learning algorithms extended only to binary classification. The study defined one assumption that is significant in our comparison: the amount of available memory is limited, so entire datasets cannot be stored in memory, but they can be stored on the disk of a single computer. The datasets used in this paper are large enough to satisfy this constraint and must be accessed through the hard drive where they are stored.

Multi-class classification was addressed by Chang and Roth (2011), who drew comparisons from both batch learning and online learning algorithms. Unfortunately, Vowpal Wabbit at that time was limited to binary classification and no comparisons with it were drawn. Instead, the focus was primarily on block minimization algorithms, among which LibSVM was deemed superior to several other algorithms. It is our intent to extend this comparison of LibSVM to Vowpal Wabbit.

===Approach===
The aim of this paper is to test the performance of Vowpal Wabbit (Langford, Li, and Zhang 2009) against the LIBLINEAR block learning extension of LibSVM (Yu et al. 2012). Due to recent advances in the Vowpal Wabbit library, this is one of the first tests using its multi-class solver.

Testing was performed using the one-against-all multi-class solver options implemented in both algorithms. For a K-class problem, the one-against-all method constructs K models, where each model separates one class from the rest. The ith model is trained with all of the instances, relabeled as binary instances according to membership in the ith class. The final output of the one-against-all method is the class that corresponds to the SVM with the highest output value (Liu, Wang, and Zeng 2007).
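The one-against-all construction can be sketched independently of either solver. The `fit_binary` callable below stands in for whichever binary learner is used (the block-minimization solver or the online learner sketched earlier) and is assumed to return a weight vector whose dot product with an instance is that model's score; classes are indexed 0..K−1 in this sketch.

```python
import numpy as np

def train_one_against_all(X, y, n_classes, fit_binary):
    """Train K binary models; model k separates class k from the rest."""
    models = []
    for k in range(n_classes):
        y_k = np.where(y == k, 1.0, -1.0)       # relabel: class k vs. all others
        models.append(fit_binary(X, y_k))
    return models

def predict_one_against_all(X, models):
    """Assign each instance to the class whose model scores it highest."""
    scores = np.column_stack([X.dot(w) for w in models])
    return scores.argmax(axis=1)
```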
===Data===
Comparisons were drawn from tests conducted over multi-class datasets. The SensIT Vehicle dataset was used, in addition to a larger 53-class dataset (RCV1). The smaller dataset was used to establish a baseline reading and evaluation; the classification problem was then extended from this simple case to the more complex one posed by the RCV1 dataset, whose tests were intended to reinforce the results gathered from SensIT.

The SensIT Vehicle dataset features 3-class labels. The instances were extracted from sensor data collected during a real-world experiment carried out at Twentynine Palms, CA in November of 2001 (Duarte and Hen Hu 2004). The sensors were used to obtain both acoustic and seismic activity from vehicles in the vicinity; in total, 50 features were extracted. Each vehicle was driven around a road while sensors collected information as it passed. Classes included in the training set are AAV3 (class 1), DW3 (class 2), and a third class for noise (class 3). In total there are 78,823 training samples and 19,705 testing samples. Figure 1 presents the frequency of each class used for training. The distribution shows an unbalanced dataset in which more instances of noise are recorded than of either vehicle; the total sample size of the two vehicle classes combined is close to that of the noise class. Data skewed in this way can increase the error rate and degrade performance in certain cases; later in the paper we discuss its impact on classification accuracy.

Figure 1. SensIT Vehicle class training frequency (x-axis: class; y-axis: frequency).

The RCV1 dataset was used in part because of its 53-class problem. The Reuters Corpus Volume I (RCV1) is one of the most widely used test collections for text categorization research (Lewis et al. 2004). It contains 534,135 newswire documents, split into 15,564 training documents and 518,571 test documents, and the dataset contains 47,236 features. The class frequency of all 53 labels is presented in Figure 2. Again, the frequencies of the samples collected in the training set represent an unbalanced sample, in which the first 25 class frequencies are significantly larger than the last 28.

Figure 2. RCV1 class training frequency (x-axis: class; y-axis: frequency).

===Experiment===
In order to draw comparisons successfully, testing criteria were first established. The criteria are based on overall performance: the ability of each algorithm to learn a classification model and label instances with the right class. Stopping criteria were also established before the tests were conducted: the iteration threshold was set to 78 rounds. Figures 3-6 present the individual performance measures of the algorithms from iteration round 1 to the stopping threshold reached in round 78. The algorithms were compiled in C++ and run on the Linux operating system. All experiments were run on a laptop with 6 GB of memory. To ensure accurate results, each test was repeated 3 times and the averages were used to rate performance.

Before the experiment can be run using the LIBLINEAR extension of LibSVM, the training dataset must be partitioned into smaller blocks; the optimal number of partitions was found to be 8 in this case. Vowpal Wabbit, on the other hand, uses a cache file, which fulfills the memory constraint. The dataset does not need to be split or compressed, and Vowpal Wabbit can access each instance without reading the entire dataset.
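The measurement procedure described above (accuracy and cumulative training time recorded at each iteration checkpoint, each test repeated three times and averaged) can be written as a small harness. The sketch below is illustrative only; the `train_for` and `predict` callables are stand-ins for whichever solver is being measured, and the checkpoint list mirrors the rounds reported in Figures 7 and 8.

```python
import time
import numpy as np

CHECKPOINTS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 50, 78]

def benchmark(train_for, predict, X_test, y_test, repeats=3):
    """Average cumulative training time and test accuracy over `repeats`
    runs, measured at each iteration checkpoint."""
    results = {}
    for rounds in CHECKPOINTS:
        times, accs = [], []
        for _ in range(repeats):
            start = time.time()
            model = train_for(rounds)                   # train up to `rounds` iterations
            times.append(time.time() - start)
            accs.append(float(np.mean(predict(model, X_test) == y_test)))
        results[rounds] = (np.mean(times), np.mean(accs))
    return results
```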
The algorithms were first trained and tested using the SensIT dataset, which represents a simple scenario where binary classification is extended into a 3-class problem. The results from both algorithms are shown in Figures 3 and 4.

Figure 3. Iteration time over the SensIT data (x-axis: iteration; y-axis: time in seconds).
Figure 4. Percent accuracy per iteration over the SensIT data (x-axis: iteration; y-axis: accuracy).

On average, the runtime per iteration for Vowpal Wabbit was 0.36 seconds, while the average for LibSVM was 1.48 seconds: Vowpal Wabbit ran roughly 4 times as fast as LibSVM. Both algorithms scored over 80% accuracy when classifying testing instances, and by the end of the 78th iteration both were close to converging around 80.3%. The biggest differences recorded can be seen within the first 5 iteration rounds; this is where the optimization approaches can be distinguished. As the Vowpal Wabbit algorithm passes through the dataset and the weights are updated from the initial labels, its classification accuracy begins to rise. The consistent accuracy rate of LibSVM can be attributed to the block minimization framework: because the sub-problem within each equally distributed block is solved first, the updated weights do not yield higher accuracy; instead, the rate stays consistent throughout the optimization process.

The second experiment was conducted over the RCV1 dataset. Classification using the one-against-all option becomes increasingly complicated with each additional class, and the increased sample, class, and feature size of RCV1 taxes the memory load in such a constrained environment. Performance from both algorithms stayed consistent with the baseline established in the SensIT tests. The average runtime per iteration for Vowpal Wabbit increased slightly to 0.43 seconds, while LibSVM's decreased slightly to 1.43 seconds. The runtime results for RCV1, presented in Figure 5, are similar to those of the SensIT testing in Figure 3: as with the SensIT dataset, Vowpal Wabbit is consistently quicker than LibSVM.

Figure 5. Iteration time over the RCV1 data (x-axis: iteration; y-axis: time in seconds).
Figure 6. Percent accuracy per iteration over the RCV1 data (x-axis: iteration; y-axis: accuracy).

Classification accuracy rose roughly 8 percentage points when comparing the results of Figure 6 to those of Figure 4; the larger dataset had a positive impact on the accuracy of both algorithms. Again, LibSVM scores a higher classification accuracy while achieving more consistent results. The accuracy of Vowpal Wabbit rises until around the 20th iteration and then drops until the 78th iteration; unlike in the SensIT tests, the algorithms never appear to converge.

Overall, the LIBLINEAR extension of LibSVM achieved the highest classification rates; at no point in the iteration process did the Vowpal Wabbit algorithm correctly classify more test instances. The highest classification accuracy occurred during the first iteration step, when LibSVM correctly classified 80.47% of the SensIT testing dataset, whereas Vowpal Wabbit obtained a slightly lower rate of 80.32% after the final iteration. Figure 7 presents a table with the results gathered over the SensIT test dataset. Once more, LibSVM attained the highest classification rate over the RCV1 testing dataset; the results from both algorithms are presented in Figure 8. The highest accuracy mark was obtained during the first iteration round, when LibSVM correctly classified 88.64% of the testing instances, and this rate was maintained throughout the iteration process. Vowpal Wabbit reached its highest mark, an accuracy of 86.99%, after the 6th iteration round.

Figure 7. SensIT data: cumulative time (seconds) and accuracy per iteration round.

Iteration | Vowpal Wabbit time (s) | Vowpal Wabbit acc | libSVM time (s) | libSVM acc
1 | 1.28 | 79.86% | 5.52 | 80.47%
2 | 1.64 | 80.06% | 8.00 | 80.47%
3 | 2.05 | 80.12% | 9.53 | 80.43%
4 | 2.46 | 80.15% | 11.25 | 80.46%
5 | 2.87 | 80.16% | 13.03 | 80.42%
6 | 3.29 | 80.20% | 15.43 | 80.43%
7 | 3.56 | 80.19% | 16.09 | 80.46%
8 | 3.95 | 80.19% | 17.27 | 80.43%
9 | 4.40 | 80.19% | 18.76 | 80.43%
10 | 4.81 | 80.19% | 20.14 | 80.47%
20 | 7.13 | 80.24% | 34.28 | 80.45%
30 | 10.83 | 80.22% | 48.93 | 80.43%
50 | 17.52 | 80.27% | 76.93 | 80.44%
78 | 27.81 | 80.32% | 116.56 | 80.43%

Figure 8. RCV1 data: cumulative time (seconds) and accuracy per iteration round.

Iteration | Vowpal Wabbit time (s) | Vowpal Wabbit acc | libSVM time (s) | libSVM acc
1 | 1.06 | 83.40% | 8.20 | 88.64%
2 | 1.53 | 85.81% | 9.46 | 88.64%
3 | 1.94 | 86.57% | 10.80 | 88.64%
4 | 2.36 | 86.85% | 12.25 | 88.64%
5 | 2.85 | 86.96% | 13.61 | 88.64%
6 | 3.37 | 86.99% | 14.87 | 88.64%
7 | 3.84 | 86.96% | 16.20 | 88.64%
8 | 4.02 | 86.90% | 17.63 | 88.64%
9 | 4.48 | 86.83% | 19.01 | 88.64%
10 | 5.26 | 86.75% | 20.33 | 88.64%
20 | 9.46 | 86.12% | 33.99 | 80.45%
30 | 13.91 | 85.76% | 47.30 | 80.43%
50 | 22.06 | 85.45% | 74.17 | 80.44%
78 | 33.85 | 85.28% | 111.93 | 80.43%

The higher accuracy ratings from LibSVM do come at a cost, however: the runtime for each iteration was significantly slower than that of Vowpal Wabbit. The total time Vowpal Wabbit needed to reach the 78th iteration round in Figure 7 was only 27.81 seconds, compared to 116.56 seconds for LibSVM. Figure 8 presents similar results: Vowpal Wabbit was recorded at 33.85 seconds, whereas LibSVM reached the stopping threshold in 111.93 seconds. In both cases Vowpal Wabbit was roughly 4 times as fast.

Surprisingly, both algorithms achieved higher levels of performance when solving the larger RCV1 classification problem. The iteration runtimes of both algorithms show a small increase when Figures 3 and 5 are compared, while the classification accuracy in Figure 6 improved over the ratings in Figure 4. The lower classification accuracy on the smaller SensIT dataset can be attributed to the skewed training data: the frequency distributions presented in Figures 1 and 2 are unbalanced, and as a result the classification accuracy over SensIT was lower than over the larger RCV1 dataset. The increase in training instances correlated with overall classification accuracy: the more instances the algorithms had to train over, the more their classification accuracy improved.

===Conclusion===
Using the results recorded in Figures 3-8, the following question about fast learning for large-scale multi-class classification can be answered: which algorithm is most efficient when there are constraints on memory? When compared, Vowpal Wabbit is the more efficient multi-class classification algorithm.
The results from the SensIT test case suggest that Vowpal Wabbit was the quicker algorithm while maintaining a slightly lower accuracy than LibSVM. Moving from the SensIT dataset to the larger RCV1 dataset, the results remained consistent. We conclude that Vowpal Wabbit has a slight advantage in overall efficiency when a constraint is placed on computer memory size. The performance of both algorithms was relatively close, however. Due to the small size of the experiment, further testing is needed for a thorough comparison; testing over datasets that are more expansive, in both sample and feature size, could be used for a more significant experiment. Nevertheless, we feel that the sizes of the datasets used in this paper are adequate to provide conclusive results for the comparisons made.

===References===
Aly, M. 2005. Survey on multiclass classification methods. Neural Networks: 1-9.

Argyriou, A., Herbster, M., and Pontil, M. 2005. Combining graph Laplacians for semi-supervised learning.

Bottou, L., and Lin, C. J. 2007. Support vector machine solvers. In Large Scale Kernel Machines: 301-320.

Chang, K. W., and Roth, D. 2011. Selective block minimization for faster convergence of limited memory large-scale linear models. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 699-707.

Chang, C., and Lin, C. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3): 27:1-27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

Cortes, C., and Vapnik, V. 1995. Support-vector networks. Machine Learning 20(3): 273-297.

Dekel, O. 2008. From online to batch learning with cutoff-averaging. NIPS.

Duarte, M. F., and Hen Hu, Y. 2004. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing 64(7): 826-838.

Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., and Lin, C. J. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9: 1871-1874.

Langford, J., Li, L., and Zhang, T. 2009. Sparse online learning via truncated gradient. Journal of Machine Learning Research 10: 777-801.

Lewis, D., Yang, Y., Rose, T., and Li, F. 2004. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research 5: 361-397.

Liu, Y., Wang, R., and Zeng, Y. S. 2007. An improvement of the one-against-one method for multi-class support vector machines. In Machine Learning and Cybernetics, 2007 International Conference 5: 2915-2920.

Menon, A. K. 2009. Large-scale support vector machines: Algorithms and theory. Research Exam, University of California, San Diego.

Pechyony, D., Shen, L., and Jones, R. Solving large scale linear SVM with distributed block minimization.

Pérez-Cruz, F., Figueiras, A., and Artes, A. 2004. Double chunking for solving SVMs for very large datasets.

Shalev-Shwartz, S., Singer, Y., and Srebro, N. 2007. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning: 807-814. ACM.

Tewari, A., and Bartlett, P. L. 2007. On the consistency of multiclass classification methods. Journal of Machine Learning Research 8: 1007-1025.

Yu, H. F., Hsieh, C. J., Chang, K. W., and Lin, C. J. 2012. Large linear classification when data cannot fit in memory. ACM Transactions on Knowledge Discovery from Data (TKDD) 5(4): 23.