Multi-class Classification in Big Data

Anton Malenichev¹, Olga Krasotkina¹, Vadim Mottl², and Oleg Seredin¹

¹ Tula State University, Tula, Russia
malenichev@mail.ru, o.v.krasotkina@yandex.ru, oseredin@yandex.ru
² Dorodnicyn Computing Centre, RAS, Moscow, Russia
v.v.mottl@yandex.ru

Abstract. The paper suggests an on-line multi-class classifier with sublinear computational complexity relative to the number of training objects. The proposed approach is based on combining two-class probabilistic classifiers. Pairwise coupling is a popular multi-class classification method that combines the comparisons for every pair of classes. Unfortunately, pairwise coupling often suffers from incompatibility: in some regions of the input space the sum of the probabilities is not equal to one. In this paper we propose an optimal approximation for the class probabilities at each point of the object space. The paper also proposes a new probabilistic interpretation of the Support Vector Machine for obtaining class probabilities: we show how the SVM can be viewed as a maximum likelihood estimate of a class of probabilistic models. As a computational method suited to big data we use stochastic gradient descent, minimizing the primal SVM objective directly. Unfortunately, the hinge loss of the standard SVM classifier does not allow the SGD procedure to determine the classifier bias. We propose a piecewise quadratic loss that overcomes this obstacle and makes it possible to obtain the bias from the SGD procedure.

Keywords: pairwise coupling, stochastic gradient descent, support vector machine, multiclass learning, large datasets.

1 Introduction

The multi-class classification problem refers to assigning each observation to one of several classes. Most real-world classification applications, such as image search and text recognition, involve many classes. Thus, the user needs to select from among a large set of labels in addition to handling a huge data set. Broadly speaking, multi-class classification methods fall into two main categories: "single machine" approaches, which solve a single optimization problem that trains many binary classifiers simultaneously, and approaches based on combining independent binary classifiers. Weston and Watkins [1] propose a formulation of the Support Vector Machine that enables a multi-class problem to be solved with a single optimization criterion. Lee, Lin and Wahba [2] propose the multicategory support vector machine (MSVM), which extends the binary SVM to the multicategory case; the proposed method provides a unifying framework for either equal or unequal misclassification costs. Both the Weston and the Lee formulations involve many dummy variables, lack a decomposition method, and cannot be used with big training samples. Crammer and Singer [3] describe an algorithmic implementation of multiclass kernel-based vector machines with an efficient iterative decomposition scheme, but the proposed algorithm uses a lot of memory to cache kernel products. Their paper states that the fastest version has two more technical improvements "which are not discussed here but will be documented in the code that we will shortly make available", but the code was never made available. The dominant approach to solving multiclass problems has been to reduce a single multiclass problem to multiple binary problems.
Constructing all-versus-all (AVA) or one-versus-all (OVA) classifiers is a popular realization of this strategy [4,5,6]. As can be seen from the literature, AVA seems faster and more memory efficient in the case of a big sample. It requires $O(m^2)$ classifiers instead of $O(m)$ (where $m$ is the number of classes), but each classifier is (on average) much smaller. If the time to build a classifier is superlinear in the number of data points, AVA is the better choice.

A common way to combine pairwise comparisons is voting [8,9]: a rule is constructed for discriminating between every pair of classes, and the class with the most winning two-class decisions is selected. The voting procedure, however, only predicts a class label, while in many cases probability estimates are desired. Hastie and Tibshirani [10] proposed probability estimates obtained by combining the pairwise class probabilities. In this paper we propose a method for combining the class probabilities that is more stable than voting and than the method of Hastie and Tibshirani.

This paper also proposes a new probabilistic interpretation of the Support Vector Machine for obtaining class probabilities. We show how the SVM can be viewed as a maximum likelihood estimate of a class of probabilistic models. As a computational method for big data we use a stochastic gradient descent approach that minimizes the primal SVM objective directly. Unfortunately, the hinge loss of the standard SVM classifier does not allow the SGD procedure to determine the classifier bias. The bias term often plays a crucial role when the distribution of the labels is uneven, as is typically the case in text processing applications. Shalev-Shwartz et al. [17] proposed several approaches for learning the bias term. The first simply amounts to adding one more feature to each instance and incorporating the bias into the direction vector; its disadvantage is that a somewhat different optimization problem, not the SVM, is solved. The second approach consists in optimizing the non-convex SVM loss as is, but in this case the algorithm has a slower convergence rate. In this paper we propose a piecewise quadratic loss that overcomes this obstacle and makes it possible to obtain the bias from the SGD procedure.

The paper is organized as follows. Section 2 states the optimal approximation for pairwise probabilities defining the pairwise coupling approach. Section 3 reviews the binary SVM. Section 4 presents a numerical study for illustration. Section 5 presents concluding remarks and a discussion of future directions.

2 The problem of multi-class classification on large data sets

The paper considers the classical formulation of the pattern recognition learning problem. We suppose that there is a universe of real-world objects $\Omega$. We assume that each object $\omega \in \Omega$ may be characterized by a label $y \in \{0, 1, \ldots, m-1\}$, $m > 2$, and by a vector $x \in \mathbb{R}^n$. In other words, each object is fully represented by a triple $(\omega_i, y_i, x_i)$, $i = 1, \ldots, N$, where $N$ is the number of objects in the current set. The number $y_i$ is called the class of the object $\omega_i$, and the vector $x_i$ is its feature vector. The number $n$, the dimension of the feature space, is the length of the vectors $x_i$. Usually we know only the feature vector $x_i$ of each object $\omega_i$ and do not know its class $y_i$. The recognition problem is to build a function that takes the feature vector $x_i$ and returns the class $\hat{y}_i$ of the object, making as few mistakes as possible.
This problem is unsolvable without some additional information about the objects. Suppose that we have a set of objects $\omega_i$, $i = 1, \ldots, N$, for which the feature vector $x_i$ is known as well as the class label $y_i$. The whole set of such objects is called the training set; it consists of $N$ objects. Note at once that in the case of large data set analysis the number $N$ is large; for concreteness, we assume $N \geq 10000$.

2.1 Our pairwise coupling approach

It is easy to see that if we have $m$ different classes, the number of different pairs of classes equals

$$C_m^2 = \frac{m(m-1)}{2}. \qquad (1)$$

Suppose that we want to determine the pairwise confidence function $P^{kl}(z)$. Let us select the objects of classes $k$ and $l$ from the initial training set; objects of the other classes are not considered when constructing the pairwise confidence function. Note that it is necessary to perform the calculation only for $k < l$: if $k = l$ the problem does not make sense, and if $k > l$ we can use the simple and obvious transition $P^{lk} = 1 - P^{kl}$.

After obtaining the whole set of pairwise confidence functions $P^{kl}(z)$, it is required to construct the general classification rule $\pi(z)$.

Suppose that at the point $z$ the required distribution $\pi(z)$ exists and agrees with all pairwise probabilities $P^{kl}(z)$. Strictly speaking, this is not always true, since the pairwise probabilities $P^{kl}(z)$ are obtained independently, and their combination may be inconsistent. However, experience shows that when an inconsistency is observed, it occurs only in very small areas of the feature space. Moreover, we show below how to choose an approximation in the case of inconsistency.

Suppose that each pairwise probability is equal to

$$P^{kl}(z) = \frac{\pi^k(z)}{\pi^k(z) + \pi^l(z)}, \qquad (2)$$

$$P^{lk}(z) = 1 - P^{kl}(z) = \frac{\pi^l(z)}{\pi^k(z) + \pi^l(z)}. \qquad (3)$$

As a simplification we also assume that

$$P^{kk}(z) = 0.5. \qquad (4)$$

Let us express $\pi^l(z)$ from (2):

$$\pi^l(z) = \frac{1 - P^{kl}(z)}{P^{kl}(z)}\, \pi^k(z), \quad l \neq k. \qquad (5)$$

Since the whole set of $\pi(z)$ forms a complete group of events, we can express $\pi^k(z)$:

$$\sum_{k=1}^{m} \pi^k(z) = 1, \qquad (6)$$

$$\pi^k(z) = 1 - \sum_{\substack{l=1 \\ l \neq k}}^{m} \pi^l(z) = 1 - \left( \sum_{\substack{l=1 \\ l \neq k}}^{m} \frac{1 - P^{kl}(z)}{P^{kl}(z)} \right) \pi^k(z), \qquad (7)$$

$$\pi^k(z) = \left( 1 + \sum_{\substack{l=1 \\ l \neq k}}^{m} \frac{1 - P^{kl}(z)}{P^{kl}(z)} \right)^{-1}. \qquad (8)$$

Formula (8) can be simplified. If we assume that the probability of assigning an object to its own class is 0.5, it becomes

$$\pi^k(z) = \left( \sum_{l=1}^{m} \frac{1 - P^{kl}(z)}{P^{kl}(z)} \right)^{-1}. \qquad (9)$$

If we estimate the dichotomous probabilities only for $k < l$, then, using $P^{kl} = 1 - P^{lk}$, we can rewrite formula (9) as follows:

$$\pi^k(z) = \left( \sum_{l=1}^{k-1} \frac{P^{lk}(z)}{1 - P^{lk}(z)} + 1 + \sum_{l=k+1}^{m} \frac{1 - P^{kl}(z)}{P^{kl}(z)} \right)^{-1}. \qquad (10)$$

We have to note that if at least one of the probabilities $P^{kl}$ in the original formula (8) is zero, then the corresponding denominator becomes zero too. Consider an infinitesimal value $P^{kq} \to +0$, $q \neq k$. Then formula (8) can be rewritten using the limit

$$\lim_{P^{kq} \to +0} \pi^k(z) = \lim_{P^{kq} \to +0} \left( \frac{1}{P^{kq}(z)} \right)^{-1} = \lim_{P^{kq} \to +0} P^{kq}(z) = 0. \qquad (11)$$

In other words, if at least one of the dichotomous probabilities $P^{kl}(z) = 0$, $k \neq l$, then the corresponding probability $\pi^k(z) = 0$.

The inconsistency is reflected in the fact that equality (6) is not satisfied, i.e. the sum of all the probabilities is not equal to one. The easiest way to find an approximation is to use normalization as follows:

$$\pi^k(z) = \left( \sum_{l=1}^{k-1} \frac{P^{lk}(z)}{1 - P^{lk}(z)} + 1 + \sum_{l=k+1}^{m} \frac{1 - P^{kl}(z)}{P^{kl}(z)} \right)^{-1} \cdot \left( \sum_{r=1}^{m} \pi^r(z) \right)^{-1}. \qquad (12)$$
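The coupling step (9)-(12) is straightforward to implement. Below is a minimal sketch, assuming the pairwise confidences $P^{kl}(z)$, $k < l$, for one point $z$ have already been produced by the binary classifiers and stored in the upper triangle of an $m \times m$ array; the function name and data layout are ours and are chosen only for illustration.

```python
import numpy as np

def couple_pairwise(P, m):
    """Combine pairwise confidences into class probabilities, eqs. (9)-(12).

    P is an m x m array whose upper triangle P[k, l], k < l, holds P^{kl}(z):
    the confidence that the object belongs to class k in the {k, l} comparison.
    The lower triangle is ignored; P^{lk} = 1 - P^{kl} is used implicitly.
    """
    pi = np.zeros(m)
    for k in range(m):
        s = 1.0  # the '1' term corresponds to P^{kk}(z) = 0.5, cf. (4) and (10)
        for l in range(m):
            if l == k:
                continue
            # the stored value is always the one with the smaller first index
            p_kl = P[k, l] if k < l else 1.0 - P[l, k]
            if p_kl == 0.0:          # limit case (11): pi^k(z) = 0
                s = np.inf
                break
            s += (1.0 - p_kl) / p_kl
        pi[k] = 0.0 if np.isinf(s) else 1.0 / s
    total = pi.sum()                 # normalization (12) restores sum-to-one
    return pi / total if total > 0 else pi
```

For $m$ classes this coupling costs $O(m^2)$ operations per point, matching the number of pairwise classifiers given by (1).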
3 Two-class classification: Optimizing the primal Support Vector Machine objective

We use a linear decision function for classification. Let us assume that there is a hyperplane that correctly classifies almost all objects of the training sample $(X, Y) = \{(x_j, y_j),\ j = 1, \ldots, N\}$:

$$d(x_j \mid a, b) = a^T x_j + b, \quad j = 1, \ldots, N.$$

The loss function in general form is

$$q(x, y, a, b) = \{\max[0,\ 1 - y\, d(x \mid a, b)]\}^{\alpha}. \qquad (13)$$

Fig. 1. The loss functions for different values of $\alpha$.

In the original formulation of the support vector machine the degree $\alpha$ equals one. However, this leads to a kink in the loss function and, consequently, to non-differentiability at the break point. In this paper we use a gradient method, which involves differentiating the original SVM criterion containing the sum of loss values over all objects lying in the area between the hyperplane and the margin boundary. These are the so-called support objects, and the feature vectors describing them are called support vectors. We need a loss function that is differentiable at all points, so we take $\alpha = 2$:

$$q(x, y, a, b) = \{\max[0,\ 1 - y\, d(x \mid a, b)]\}^{2}. \qquad (14)$$

We choose the hyperplane for which the gap between it and the nearest vector of the training set, in the sense of the Euclidean metric in $\mathbb{R}^n$, is maximal:

$$y_j d(x_j \mid a, b) = y_j (a^T x_j + b) \geq \varepsilon, \quad \varepsilon \to \max, \quad a^T a = 1. \qquad (15)$$

This formulation of the problem leads to the following criterion:

$$J(a, b) = a^T a + C \sum_{j:\, y_j(a^T x_j + b) \leq 1} \left( 1 - y_j (a^T x_j + b) \right)^2 \to \min_{a,\, b}. \qquad (16)$$

Usually the SVM criterion is optimized in its dual form. This gives the exact solution, but the high computational complexity and the need to load all training objects into memory at the same time prohibit the use of this method on large training sets. We need a method for on-line learning, which adjusts the decision rule over time on the basis of one or a few random training objects at each iteration. In this paper we propose to solve the primal SVM objective using the iterative approximation method of stochastic gradient descent. It allows on-line training without loading the entire training set into memory. There are several implementations of stochastic gradient descent for solving the primal SVM objective [11,13,14,17]. However, they only estimate the normal vector of the hyperplane, while ignoring the bias value. To estimate the bias, methods based on ROC analysis are usually used, which significantly affects the final computational complexity and eliminates the advantage in speed. This paper proposes a method for optimizing the original SVM criterion with a quadratic loss function using stochastic gradient descent. This simple method combines high performance, the capacity for additional training, and simultaneous estimation of the normal vector and the bias of the hyperplane.

Let us pass to the expanded feature space by introducing new notation:

$$c = \begin{pmatrix} a \\ b \end{pmatrix} \in \mathbb{R}^{n+1}, \qquad A = \begin{pmatrix} I & 0 \\ 0^T & 0 \end{pmatrix} \ [(n+1) \times (n+1)], \qquad z_j = \begin{pmatrix} x_j \\ 1 \end{pmatrix} \in \mathbb{R}^{n+1}. \qquad (17)$$

With notation (17) the training criterion can be rewritten as follows:

$$J(c \in \mathbb{R}^{n+1}) = c^T A c + C \sum_{j:\, y_j z_j^T c \leq 1} \left( 1 - y_j z_j^T c \right)^2. \qquad (18)$$

Let us denote by $c^s = \begin{pmatrix} a^s \\ b^s \end{pmatrix} \in \mathbb{R}^{n+1}$ the approximation of the solution at the $s$-th iteration of the algorithm. The next approximation is calculated by the formula $c^{s+1} = c^s - \alpha^s g\left(J(c^s)\right)$. The step coefficients $\alpha^s$ are selected to satisfy the conditions

$$\sum_{s} \alpha^s = \infty, \qquad \sum_{s} (\alpha^s)^2 < \infty. \qquad (19)$$
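As a concrete reading of notation (17) and criterion (18), the following short sketch builds the augmented vectors $z_j$ and evaluates $J(c)$; the helper names are ours, and the value it returns is also the quantity compared between iterations by the stopping rule given below.

```python
import numpy as np

def augment(X):
    """Append a constant 1 to every feature vector: z_j = (x_j, 1), eq. (17)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def svm_objective(c, Z, y, C):
    """Primal criterion (18) with the piecewise quadratic loss.

    c : (n+1,) vector stacking the normal vector a and the bias b;
    Z : (N, n+1) augmented feature matrix; y : (N,) labels in {-1, +1}.
    """
    a = c[:-1]                             # c^T A c penalizes only a, not the bias b
    margins = y * (Z @ c)                  # y_j z_j^T c
    viol = np.maximum(0.0, 1.0 - margins)  # nonzero only where y_j z_j^T c <= 1
    return a @ a + C * np.sum(viol ** 2)
```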
The subgradient for a single object is equal to

$$g_c\left(J(c)\right) = g_c\left( c^T A c + C \begin{cases} (1 - y z^T c)^2, & y z^T c \leq 1 \\ 0, & y z^T c > 1 \end{cases} \right) = 2Ac + 2C \begin{cases} -y z + \underbrace{y\, y}_{1}\, z z^T c, & y z^T c \leq 1 \\ 0, & y z^T c > 1 \end{cases}$$

$$= 2Ac + 2C \begin{cases} -y z + z z^T c, & y z^T c \leq 1 \\ 0, & y z^T c > 1 \end{cases} = 2\left( A + C \begin{cases} z z^T, & y z^T c \leq 1 \\ 0, & y z^T c > 1 \end{cases} \right) c^s - 2C \begin{cases} y z, & y z^T c \leq 1 \\ 0, & y z^T c > 1. \end{cases} \qquad (20)$$

The algorithm stops when the condition $J(c^{s+1}) - J(c^s) < \xi$ is satisfied, where $\xi$ is the required accuracy.

In this case the posterior probability of belonging to one of the two classes is expressed as

$$P^{kl} = \begin{cases} \dfrac{\exp\left[-C(1 - z^T c)^2\right]}{1 + \exp\left[-C(1 - z^T c)^2\right]}, & z^T c < -1, \\[2mm] \dfrac{\exp\left[-C(1 - z^T c)^2\right]}{\exp\left[-C(1 + z^T c)^2\right] + \exp\left[-C(1 - z^T c)^2\right]}, & -1 \leq z^T c \leq 1, \\[2mm] \dfrac{1}{\exp\left[-C(1 + z^T c)^2\right] + 1}, & z^T c > 1, \end{cases} \qquad P^{lk} = 1 - P^{kl}. \qquad (21)$$
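A minimal sketch of the resulting training procedure follows: it applies the per-object subgradient (20) in the augmented space (17) and evaluates the pairwise confidence (21). The concrete step schedule $\alpha^s = 1/(s+1)$ and the epoch-based loop are our assumptions; the paper itself only requires conditions (19) and the stopping rule on $J(c)$.

```python
import numpy as np

def sgd_svm(X, y, C, n_epochs=5, seed=0):
    """SGD on criterion (18) using the per-object subgradient (20).

    X : (N, n) features; y : (N,) labels in {-1, +1}.
    Returns c = (a, b) in R^{n+1}, i.e. the normal vector and the bias together.
    """
    rng = np.random.default_rng(seed)
    Z = np.hstack([X, np.ones((len(X), 1))])   # z_j = (x_j, 1), eq. (17)
    n1 = Z.shape[1]
    A = np.eye(n1)
    A[-1, -1] = 0.0                            # regularize a only, not the bias b
    c = np.zeros(n1)
    s = 0
    for _ in range(n_epochs):
        for j in rng.permutation(len(Z)):
            z, yj = Z[j], y[j]
            if yj * (z @ c) <= 1.0:            # support object: loss term is active
                g = 2.0 * A @ c + 2.0 * C * (-yj * z + (z @ c) * z)
            else:
                g = 2.0 * A @ c
            c = c - (1.0 / (s + 1)) * g        # c^{s+1} = c^s - alpha^s g, eq. (19)
            s += 1
    return c

def pairwise_confidence(x, c, C):
    """Posterior P^{kl} from eq. (21) for a single object x."""
    t = np.append(x, 1.0) @ c                  # z^T c
    e_minus = np.exp(-C * (1.0 - t) ** 2)      # loss term active for class +1
    e_plus = np.exp(-C * (1.0 + t) ** 2)       # loss term active for class -1
    if t < -1.0:
        return e_minus / (1.0 + e_minus)
    if t > 1.0:
        return 1.0 / (e_plus + 1.0)
    return e_minus / (e_plus + e_minus)
```

The value returned by pairwise_confidence is what enters the coupling formulas (10)-(12) of Section 2.1 as $P^{kl}(z)$.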
4 Experimental research

The experimental research was performed on real datasets from the UCI repository, comparing the proposed method against the method of Hastie and Tibshirani [10]. Three datasets were used. Short dataset descriptions are presented in Table 1 together with the results of the experimental study. The timings are reported for the pairwise coupling procedures only; the binary classifier in both cases is the SGD SVM with the piecewise quadratic loss. For each dataset, the lower error rate and the lower time indicate the better method. The experimental stand consisted of an Intel Core i5-2430M 2.4 GHz CPU with 8 GB RAM; the experiment was performed on a single core.

The experimental research shows that the proposed approach has low computational complexity as well as a low error rate, and it fits well for multi-class big data recognition tasks.

Table 1. Summary table of the experimental research

Dataset     Classes,  Features,  # train   # test    Hastie & Tibshirani   Proposed method
name        m         n          objects   objects   Err, %    Time, s     Err, %    Time, s
Pendigits   10        16         7494      3498      4.49      1.46        4.69      0.39
Satimage    6         36         4435      2000      15.30     1.01        15.20     0.18
Kdd-cup     6         112        47120     20191     29.54     9.44        14.00     1.71

5 Conclusion

A highly effective method for multi-class classification in big data has been proposed. It is based on coupling pairwise probability classifiers according to the AVA scheme. It is easy to implement, yet it has good recognition ability.

As the binary classifier we proposed a modified Stochastic Gradient Descent SVM method that has sublinear computational complexity relative to the number of training objects. The main drawback of the original SGD SVM method is its inability to evaluate the bias along with the normal vector of the hyperplane. To overcome it we proposed to use the piecewise quadratic loss function.

Experimental research shows that the developed method successfully handles the multi-class classification task in big data with acceptable timing and accuracy.

The main direction for future study is further decreasing the computational complexity by using methods for non-enumerative cross-validation based on the classical Akaike Information Criterion.

Acknowledgments. The work was supported by grant 0018322 of the Foundation for Assistance to Small Innovative Enterprises (FASIE) and grants 14-07-00964, 16-37-00399, 14-07-00527, 16-57-52042 of the Russian Foundation for Basic Research.

References

1. Weston, J., Watkins, C.: Support vector machines for multi-class pattern recognition. ESANN'1999 Proceedings - European Symposium on Artificial Neural Networks, Bruges (Belgium), 21-23 April 1999, ISBN 2-600049-9-X, pp. 219-224.
2. Lee, Y., Lin, Y., Wahba, G.: Multicategory Support Vector Machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 2004, 99, 67-81.
3. Crammer, K., Singer, Y.: On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2001, 2, 265-292.
4. Duan, K. B., Keerthi, S. S.: Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, LNCS 3541, 2005, pp. 278-285.
5. Hsu, C.-W., Lin, C.-J.: A Comparison of Methods for Multiclass Support Vector Machines. IEEE Transactions on Neural Networks, 2002.
6. Rifkin, R.: MIT Multiclass Classification. Available online: http://www.mit.edu/9.520/spring09/Classes/multiclass.pdf
7. Dvoenko, S., Mottl, V., Seredin, O.: Multiclass pattern recognition procedure based on pairwise confidence functions for pairs of classes. Izvestija TulGU, series Computer science, automation, management, volume 2, part 2, 1999, pp. 28-35. (in Russian)
8. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Fogelman, J. (ed.) Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, 1990.
9. Friedman, J.: Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1996.
10. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. The Annals of Statistics, 26(1):451-471, 1998.
11. Bordes, A.: New algorithms for large-scale support vector machines. Dissertation, Université Pierre et Marie Curie - Paris VI, 2010.
12. Hosmer Jr., D. W., Lemeshow, S.: Applied logistic regression. New York: John Wiley & Sons, 2004.
13. Duchi, J., Singer, Y.: Online and Batch Learning using Forward Looking Subgradients. Manuscript, 2008.
14. Kivinen, J., Smola, A. J., Williamson, R. C.: Online learning with kernels. Advances in Neural Information Processing Systems, 2001, pp. 785-792.
15. Menon, A. K.: Large-scale support vector machines: algorithms and theory. Research Exam, University of California, San Diego, 2009, pp. 1-17.
16. Jain, P., Kapoor, A.: Active Learning for Large Multi-class Problems. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 762-769.
17. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007: Proceedings of the 24th International Conference on Machine Learning, pp. 807-814, New York, NY, USA, 2007. ACM, ISBN 978-1-59593-793-3.
18. Weston, J., Watkins, C.: Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, May 1998.