Multi-class Classification in Big Data

Anton Malenichev¹, Olga Krasotkina¹, Vadim Mottl², and Oleg Seredin¹

¹ Tula State University, Tula, Russia
malenichev@mail.ru, o.v.krasotkina@yandex.ru, oseredin@yandex.ru
² Dorodnicyn Computing Centre, RAS, Moscow, Russia
v.v.mottl@yandex.ru

Abstract. The paper suggests an on-line multi-class classifier with sublinear computational complexity relative to the number of training objects. The proposed approach is based on combining two-class probabilistic classifiers. Pairwise coupling is a popular multi-class classification method that combines the comparisons for every pair of classes. Unfortunately, pairwise coupling often suffers from incompatibility: in some regions of the input space the sum of the probabilities is not equal to one. In this paper we propose an optimal approximation for the class probabilities at each point of the object space. The paper also proposes a new probabilistic interpretation of the Support Vector Machine for obtaining class probabilities: we show how the SVM can be viewed as a maximum likelihood estimate of a class of probabilistic models. As a computational method suited to big data we use stochastic gradient descent, minimizing the primal SVM objective directly. Unfortunately, the hinge loss of the standard SVM classifier does not allow the SGD procedure to determine the classifier bias. We propose a piecewise quadratic loss that overcomes this obstacle and makes it possible to obtain the bias from the SGD procedure.

Keywords: pairwise coupling, stochastic gradient descent, support vector machine, multiclass learning, large datasets.

1 Introduction

The multi-class classification problem refers to assigning each observation to one of several classes. Most real-world classification applications, such as image search and text recognition, involve many classes. Thus, the user needs to select from among a large set of labels in addition to handling a huge data set. Broadly speaking, multi-class classification methods fall into two main categories: "single machine" approaches, which solve a single optimization problem that trains many binary classifiers simultaneously, and approaches based on combining independent binary classifiers. Weston and Watkins [1] propose a formulation of the Support Vector Machine that enables a multi-class problem to be solved with a single optimization criterion. Lee, Lin and Wahba [2] propose the multicategory support vector machine (MSVM), which extends the binary SVM to the multicategory case; the proposed method provides a unifying framework for either equal or unequal misclassification costs. Both the Weston and the Lee formulations involve many dummy variables, lack a decomposition method, and cannot be used with big training samples. Crammer and Singer [3] describe an algorithmic implementation of multiclass kernel-based vector machines with an efficient iterative decomposition scheme, but the proposed algorithm uses a lot of memory to cache kernel products. Their paper states that the fastest version has two more technical improvements "which are not discussed here but will be documented in the code that we will shortly make available", but the code was never made available. The dominant approach to solving multiclass problems has been to reduce a single multiclass problem to multiple binary problems.
Constructing all-versus-all (AVA) or one-versus-all (OVA) classifiers is a popular realization of this strategy [4,5,6]. As can be seen from the literature, AVA seems faster and more memory efficient in the case of a big sample. It requires $O(m^2)$ classifiers instead of $O(m)$ (where $m$ is the number of classes), but each classifier is (on average) much smaller. If the time to build a classifier is superlinear in the number of data points, AVA is the better choice.

A common way to combine pairwise comparisons is voting [8,9]: a rule is constructed for discriminating between every pair of classes, and the class with the most winning two-class decisions is selected. The voting procedure, however, only predicts a class label, while in many cases probability estimates are desired. Hastie and Tibshirani [10] proposed probability estimates obtained by combining the pairwise class probabilities. In this paper we propose a method for combining the class probabilities that is more stable than voting and than the method of Hastie and Tibshirani.

This paper also proposes a new probabilistic interpretation of the Support Vector Machine for obtaining class probabilities. We show how the SVM can be viewed as a maximum likelihood estimate of a class of probabilistic models. As a computational method for big data we use a stochastic gradient descent approach that minimizes the primal SVM objective directly. Unfortunately, the hinge loss of the standard SVM classifier does not allow the SGD procedure to determine the classifier bias. The bias term often plays a crucial role when the distribution of the labels is uneven, as is typically the case in text processing applications. Shalev-Shwartz et al. [17] proposed several approaches for learning the bias term. The first simply amounts to adding one more feature to each instance and incorporating the bias into the direction vector; its disadvantage is that a somewhat different optimization problem, not the SVM, is solved. The second approach consists in optimizing the non-convex SVM loss as is, but in this case the algorithm has a slower convergence rate. In this paper we propose a piecewise quadratic loss that overcomes this obstacle and makes it possible to obtain the bias from the SGD procedure.

The paper is organized as follows. Section 2 states the optimal approximation for pairwise probabilities defining the pairwise coupling approach. Section 3 reviews the binary SVM. Section 4 presents a numerical study for illustration. Section 5 presents concluding remarks and a discussion of future directions.

2 The problem of multi-class classification on large data sets

The paper considers the classical formulation of the pattern recognition learning problem. We suppose that there is a universe of real-world objects $\Omega$. We assume that each object $\omega \in \Omega$ may be characterized by a label $y \in \{0, 1, \ldots, m-1\}$, $m > 2$, and by a vector $x \in \mathbb{R}^n$. In other words, each object is fully represented by a triple $(\omega_i, y_i, x_i)$, $i = 1, \ldots, N$, where $N$ is the number of objects in the current set. The number $y_i$ is called the class of the object $\omega_i$, and the vector $x_i$ is its feature vector. The number $n$, the dimension of the feature space, is the length of the vectors $x_i$. Usually we know only the feature vector $x_i$ of each object $\omega_i$ and do not know its class $y_i$. The recognition problem is to build a function that takes the feature vector $x_i$ and returns the class $\hat{y}_i$ of the object, making as few mistakes as possible.
This problem is unsolvable without some additional information about the objects. Suppose that we have a set of objects $\omega_i$, $i = 1, \ldots, N$, for which the feature vector $x_i$ is known as well as the class label $y_i$. The whole set of such objects is called the training set; it consists of $N$ objects. Note at once that in the case of large data set analysis the number $N$ is large; for concreteness, we assume $N \geq 10000$.

2.1 Our pairwise coupling approach

It is easy to see that if we have $m$ different classes, the number of different pairs of classes equals

$$C_m^2 = \frac{m(m-1)}{2}. \qquad (1)$$

Suppose that we want to determine the pairwise confidence function $P^{kl}(z)$. Let us select the objects of classes $k$ and $l$ from the initial training set; objects of the other classes are not considered when constructing the pairwise confidence function. Note that it is necessary to perform the calculation only for $k < l$: if $k = l$ the problem does not make sense, and if $k > l$ we can use the simple and obvious transition $P^{lk} = 1 - P^{kl}$.

After obtaining the whole set of pairwise confidence functions $P^{kl}(z)$, it is required to construct the general classification rule $\pi(z)$.

Suppose that at the point $z$ the required distribution $\pi(z)$ exists and agrees with all pairwise probabilities $P^{kl}(z)$. Strictly speaking, this is not always true, since the pairwise probabilities $P^{kl}(z)$ are obtained independently, and their combination may be inconsistent. However, experience shows that when an inconsistency is observed, it occurs only in very small areas of the feature space. Moreover, we show below how to choose an approximation in the case of inconsistency.

Suppose that each pairwise probability is equal to

$$P^{kl}(z) = \frac{\pi^k(z)}{\pi^k(z) + \pi^l(z)}, \qquad (2)$$

$$P^{lk}(z) = 1 - P^{kl}(z) = \frac{\pi^l(z)}{\pi^k(z) + \pi^l(z)}. \qquad (3)$$

As a simplification we also assume that

$$P^{kk}(z) = 0.5. \qquad (4)$$

Let us express $\pi^l(z)$ from (2):

$$\pi^l(z) = \frac{1 - P^{kl}(z)}{P^{kl}(z)}\, \pi^k(z), \quad l \neq k. \qquad (5)$$

Since the whole set of $\pi(z)$ forms a complete group of events, we can express $\pi^k(z)$:

$$\sum_{k=1}^{m} \pi^k(z) = 1, \qquad (6)$$

$$\pi^k(z) = 1 - \sum_{\substack{l=1 \\ l \neq k}}^{m} \pi^l(z) = 1 - \left( \sum_{\substack{l=1 \\ l \neq k}}^{m} \frac{1 - P^{kl}(z)}{P^{kl}(z)} \right) \pi^k(z), \qquad (7)$$

$$\pi^k(z) = \left( 1 + \sum_{\substack{l=1 \\ l \neq k}}^{m} \frac{1 - P^{kl}(z)}{P^{kl}(z)} \right)^{-1}. \qquad (8)$$

Formula (8) can be simplified. If we assume that the probability of assigning an object to its own class is 0.5, it becomes

$$\pi^k(z) = \left( \sum_{l=1}^{m} \frac{1 - P^{kl}(z)}{P^{kl}(z)} \right)^{-1}. \qquad (9)$$

If we estimate the dichotomous probabilities only for $k < l$, then, using $P^{kl} = 1 - P^{lk}$, we can rewrite formula (9) as follows:

$$\pi^k(z) = \left( \sum_{l=1}^{k-1} \frac{P^{lk}(z)}{1 - P^{lk}(z)} + 1 + \sum_{l=k+1}^{m} \frac{1 - P^{kl}(z)}{P^{kl}(z)} \right)^{-1}. \qquad (10)$$

We have to note that if at least one of the probabilities $P^{kl}$ in the original formula (8) is zero, then the corresponding denominator becomes zero too. Consider an infinitesimal value $P^{kq} \to +0$, $q \neq k$. Then formula (8) can be rewritten using the limit

$$\lim_{P^{kq} \to +0} \pi^k(z) = \lim_{P^{kq} \to +0} \left( \frac{1}{P^{kq}(z)} \right)^{-1} = \lim_{P^{kq} \to +0} P^{kq}(z) = 0. \qquad (11)$$

In other words, if at least one of the dichotomous probabilities $P^{kl}(z) = 0$, $k \neq l$, then the corresponding probability $\pi^k(z) = 0$.

The inconsistency is reflected in the fact that equality (6) is not satisfied, i.e. the sum of all the probabilities is not equal to one. The easiest way to find an approximation is to use normalization as follows:

$$\pi^k(z) = \left( \sum_{l=1}^{k-1} \frac{P^{lk}(z)}{1 - P^{lk}(z)} + 1 + \sum_{l=k+1}^{m} \frac{1 - P^{kl}(z)}{P^{kl}(z)} \right)^{-1} \cdot \left( \sum_{r=1}^{m} \pi^r(z) \right)^{-1}. \qquad (12)$$
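The coupling step (9)-(12) is straightforward to implement. Below is a minimal sketch, assuming the pairwise confidences $P^{kl}(z)$, $k < l$, for one point $z$ have already been produced by the binary classifiers and stored in the upper triangle of an $m \times m$ array; the function name and data layout are ours and are chosen only for illustration.

```python
import numpy as np

def couple_pairwise(P, m):
    """Combine pairwise confidences into class probabilities, eqs. (9)-(12).

    P is an m x m array whose upper triangle P[k, l], k < l, holds P^{kl}(z):
    the confidence that the object belongs to class k in the {k, l} comparison.
    The lower triangle is ignored; P^{lk} = 1 - P^{kl} is used implicitly.
    """
    pi = np.zeros(m)
    for k in range(m):
        s = 1.0  # the '1' term corresponds to P^{kk}(z) = 0.5, cf. (4) and (10)
        for l in range(m):
            if l == k:
                continue
            # the stored value is always the one with the smaller first index
            p_kl = P[k, l] if k < l else 1.0 - P[l, k]
            if p_kl == 0.0:          # limit case (11): pi^k(z) = 0
                s = np.inf
                break
            s += (1.0 - p_kl) / p_kl
        pi[k] = 0.0 if np.isinf(s) else 1.0 / s
    total = pi.sum()                 # normalization (12) restores sum-to-one
    return pi / total if total > 0 else pi
```

For $m$ classes this coupling costs $O(m^2)$ operations per point, matching the number of pairwise classifiers given by (1).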
3 Two-class classification: Optimizing the primal Support Vector Machine objective

We use a linear decision function for classification. Let us assume that there is a hyperplane that correctly classifies almost all objects of the training sample $(X, Y) = \{(x_j, y_j),\ j = 1, \ldots, N\}$:

$$d(x_j \mid a, b) = a^T x_j + b, \quad j = 1, \ldots, N.$$

The loss function in general form is

$$q(x, y, a, b) = \{\max[0,\ 1 - y\, d(x \mid a, b)]\}^{\alpha}. \qquad (13)$$

Fig. 1. The loss functions for different values of $\alpha$.

In the original formulation of the support vector machine the degree $\alpha$ equals one. However, this leads to a kink in the loss function and, consequently, to non-differentiability at the break point. In this paper we use a gradient method, which involves differentiating the original SVM criterion containing the sum of loss values over all objects lying in the area between the hyperplane and the margin boundary. These are the so-called support objects, and the feature vectors describing them are called support vectors. We need a loss function that is differentiable at all points, so we take $\alpha = 2$:

$$q(x, y, a, b) = \{\max[0,\ 1 - y\, d(x \mid a, b)]\}^{2}. \qquad (14)$$

We choose the hyperplane for which the gap between it and the nearest vector of the training set, in the sense of the Euclidean metric in $\mathbb{R}^n$, is maximal:

$$y_j d(x_j \mid a, b) = y_j (a^T x_j + b) \geq \varepsilon, \quad \varepsilon \to \max, \quad a^T a = 1. \qquad (15)$$

This formulation of the problem leads to the following criterion:

$$J(a, b) = a^T a + C \sum_{j:\, y_j(a^T x_j + b) \leq 1} \left( 1 - y_j (a^T x_j + b) \right)^2 \to \min_{a,\, b}. \qquad (16)$$

Usually the SVM criterion is optimized in its dual form. This gives the exact solution, but the high computational complexity and the need to load all training objects into memory at the same time prohibit the use of this method on large training sets. We need a method for on-line learning, which adjusts the decision rule over time on the basis of one or a few random training objects at each iteration. In this paper we propose to solve the primal SVM objective using the iterative approximation method of stochastic gradient descent. It allows on-line training without loading the entire training set into memory. There are several implementations of stochastic gradient descent for solving the primal SVM objective [11,13,14,17]. However, they only estimate the normal vector of the hyperplane, while ignoring the bias value. To estimate the bias, methods based on ROC analysis are usually used, which significantly affects the final computational complexity and eliminates the advantage in speed. This paper proposes a method for optimizing the original SVM criterion with a quadratic loss function using stochastic gradient descent. This simple method combines high performance, the capacity for additional training, and simultaneous estimation of the normal vector and the bias of the hyperplane.

Let us pass to the expanded feature space by introducing new notation:

$$c = \begin{pmatrix} a \\ b \end{pmatrix} \in \mathbb{R}^{n+1}, \qquad A = \begin{pmatrix} I & 0 \\ 0^T & 0 \end{pmatrix} \ [(n+1) \times (n+1)], \qquad z_j = \begin{pmatrix} x_j \\ 1 \end{pmatrix} \in \mathbb{R}^{n+1}. \qquad (17)$$

With notation (17) the training criterion can be rewritten as follows:

$$J(c \in \mathbb{R}^{n+1}) = c^T A c + C \sum_{j:\, y_j z_j^T c \leq 1} \left( 1 - y_j z_j^T c \right)^2. \qquad (18)$$

Let us denote by $c^s = \begin{pmatrix} a^s \\ b^s \end{pmatrix} \in \mathbb{R}^{n+1}$ the approximation of the solution at the $s$-th iteration of the algorithm. The next approximation is calculated by the formula $c^{s+1} = c^s - \alpha^s g\left(J(c^s)\right)$. The step coefficients $\alpha^s$ are selected to satisfy the conditions

$$\sum_{s} \alpha^s = \infty, \qquad \sum_{s} (\alpha^s)^2 < \infty. \qquad (19)$$
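As a concrete reading of notation (17) and criterion (18), the following short sketch builds the augmented vectors $z_j$ and evaluates $J(c)$; the helper names are ours, and the value it returns is also the quantity compared between iterations by the stopping rule given below.

```python
import numpy as np

def augment(X):
    """Append a constant 1 to every feature vector: z_j = (x_j, 1), eq. (17)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def svm_objective(c, Z, y, C):
    """Primal criterion (18) with the piecewise quadratic loss.

    c : (n+1,) vector stacking the normal vector a and the bias b;
    Z : (N, n+1) augmented feature matrix; y : (N,) labels in {-1, +1}.
    """
    a = c[:-1]                             # c^T A c penalizes only a, not the bias b
    margins = y * (Z @ c)                  # y_j z_j^T c
    viol = np.maximum(0.0, 1.0 - margins)  # nonzero only where y_j z_j^T c <= 1
    return a @ a + C * np.sum(viol ** 2)
```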
The subgradient for a single object is equal to

$$g_c\left(J(c)\right) = g_c\left( c^T A c + C \begin{cases} (1 - y z^T c)^2, & y z^T c \leq 1 \\ 0, & y z^T c > 1 \end{cases} \right) = 2Ac + 2C \begin{cases} -y z + \underbrace{y\, y}_{1}\, z z^T c, & y z^T c \leq 1 \\ 0, & y z^T c > 1 \end{cases}$$

$$= 2Ac + 2C \begin{cases} -y z + z z^T c, & y z^T c \leq 1 \\ 0, & y z^T c > 1 \end{cases} = 2\left( A + C \begin{cases} z z^T, & y z^T c \leq 1 \\ 0, & y z^T c > 1 \end{cases} \right) c^s - 2C \begin{cases} y z, & y z^T c \leq 1 \\ 0, & y z^T c > 1. \end{cases} \qquad (20)$$

The algorithm stops when the condition $J(c^{s+1}) - J(c^s) < \xi$ is satisfied, where $\xi$ is the required accuracy.

In this case the posterior probability of belonging to one of the two classes is expressed as

$$P^{kl} = \begin{cases} \dfrac{\exp\left[-C(1 - z^T c)^2\right]}{1 + \exp\left[-C(1 - z^T c)^2\right]}, & z^T c < -1, \\[2mm] \dfrac{\exp\left[-C(1 - z^T c)^2\right]}{\exp\left[-C(1 + z^T c)^2\right] + \exp\left[-C(1 - z^T c)^2\right]}, & -1 \leq z^T c \leq 1, \\[2mm] \dfrac{1}{\exp\left[-C(1 + z^T c)^2\right] + 1}, & z^T c > 1, \end{cases} \qquad P^{lk} = 1 - P^{kl}. \qquad (21)$$
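A minimal sketch of the resulting training procedure follows: it applies the per-object subgradient (20) in the augmented space (17) and evaluates the pairwise confidence (21). The concrete step schedule $\alpha^s = 1/(s+1)$ and the epoch-based loop are our assumptions; the paper itself only requires conditions (19) and the stopping rule on $J(c)$.

```python
import numpy as np

def sgd_svm(X, y, C, n_epochs=5, seed=0):
    """SGD on criterion (18) using the per-object subgradient (20).

    X : (N, n) features; y : (N,) labels in {-1, +1}.
    Returns c = (a, b) in R^{n+1}, i.e. the normal vector and the bias together.
    """
    rng = np.random.default_rng(seed)
    Z = np.hstack([X, np.ones((len(X), 1))])   # z_j = (x_j, 1), eq. (17)
    n1 = Z.shape[1]
    A = np.eye(n1)
    A[-1, -1] = 0.0                            # regularize a only, not the bias b
    c = np.zeros(n1)
    s = 0
    for _ in range(n_epochs):
        for j in rng.permutation(len(Z)):
            z, yj = Z[j], y[j]
            if yj * (z @ c) <= 1.0:            # support object: loss term is active
                g = 2.0 * A @ c + 2.0 * C * (-yj * z + (z @ c) * z)
            else:
                g = 2.0 * A @ c
            c = c - (1.0 / (s + 1)) * g        # c^{s+1} = c^s - alpha^s g, eq. (19)
            s += 1
    return c

def pairwise_confidence(x, c, C):
    """Posterior P^{kl} from eq. (21) for a single object x."""
    t = np.append(x, 1.0) @ c                  # z^T c
    e_minus = np.exp(-C * (1.0 - t) ** 2)      # loss term active for class +1
    e_plus = np.exp(-C * (1.0 + t) ** 2)       # loss term active for class -1
    if t < -1.0:
        return e_minus / (1.0 + e_minus)
    if t > 1.0:
        return 1.0 / (e_plus + 1.0)
    return e_minus / (e_plus + e_minus)
```

The value returned by pairwise_confidence is what enters the coupling formulas (10)-(12) of Section 2.1 as $P^{kl}(z)$.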
4 Experimental research

The experimental research was performed on real datasets from the UCI repository, comparing the proposed method against the method of Hastie and Tibshirani [10]. Three datasets were used. Short dataset descriptions are presented in Table 1 together with the results of the experimental study. The timings are reported for the pairwise coupling procedures only; the binary classifier in both cases is the SGD SVM with the piecewise quadratic loss. For each dataset, the lower error rate and the lower time indicate the better method. The experimental stand consisted of an Intel Core i5-2430M 2.4 GHz CPU with 8 GB RAM; the experiment was performed on a single core.

The experimental research shows that the proposed approach has low computational complexity as well as a low error rate, and it fits well for multi-class big data recognition tasks.

Table 1. Summary table of the experimental research

Dataset     Classes,  Features,  # train   # test    Hastie & Tibshirani   Proposed method
name        m         n          objects   objects   Err, %    Time, s     Err, %    Time, s
Pendigits   10        16         7494      3498      4.49      1.46        4.69      0.39
Satimage    6         36         4435      2000      15.30     1.01        15.20     0.18
Kdd-cup     6         112        47120     20191     29.54     9.44        14.00     1.71

5 Conclusion

A highly effective method for multi-class classification in big data has been proposed. It is based on coupling pairwise probability classifiers according to the AVA scheme. It is easy to implement, yet it has good recognition ability.

As the binary classifier we proposed a modified Stochastic Gradient Descent SVM method that has sublinear computational complexity relative to the number of training objects. The main drawback of the original SGD SVM method is its inability to evaluate the bias along with the normal vector of the hyperplane. To overcome it we proposed to use the piecewise quadratic loss function.

Experimental research shows that the developed method successfully handles the multi-class classification task in big data with acceptable timing and accuracy.

The main direction for future study is further decreasing the computational complexity by using methods for non-enumerative cross-validation based on the classical Akaike Information Criterion.

Acknowledgments. The work was supported by grant 0018322 of the Foundation for Assistance to Small Innovative Enterprises (FASIE) and grants 14-07-00964, 16-37-00399, 14-07-00527, 16-57-52042 of the Russian Foundation for Basic Research.

References

1. Weston, J., Watkins, C.: Support vector machines for multi-class pattern recognition. ESANN'1999 Proceedings - European Symposium on Artificial Neural Networks, Bruges (Belgium), 21-23 April 1999, ISBN 2-600049-9-X, pp. 219-224.
2. Lee, Y., Lin, Y., Wahba, G.: Multicategory Support Vector Machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 2004, 99, 67-81.
3. Crammer, K., Singer, Y.: On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research, 2001, 2, 265-292.
4. Duan, K. B., Keerthi, S. S.: Which Is the Best Multiclass SVM Method? An Empirical Study. Multiple Classifier Systems, LNCS 3541, 2005, pp. 278-285.
5. Hsu, C.-W., Lin, C.-J.: A Comparison of Methods for Multiclass Support Vector Machines. IEEE Transactions on Neural Networks, 2002.
6. Rifkin, R.: MIT Multiclass Classification. Available online: http://www.mit.edu/9.520/spring09/Classes/multiclass.pdf
7. Dvoenko, S., Mottl, V., Seredin, O.: Multiclass pattern recognition procedure based on pairwise confidence functions for pairs of classes. Izvestija TulGU, series Computer science, automation, management, volume 2, part 2, 1999, pp. 28-35. (in Russian)
8. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise procedure for building and training a neural network. In: Fogelman, J. (ed.) Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, 1990.
9. Friedman, J.: Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1996.
10. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. The Annals of Statistics, 26(1):451-471, 1998.
11. Bordes, A.: New algorithms for large-scale support vector machines. Dissertation, Université Pierre et Marie Curie - Paris VI, 2010.
12. Hosmer Jr., D. W., Lemeshow, S.: Applied logistic regression. New York: John Wiley & Sons, 2004.
13. Duchi, J., Singer, Y.: Online and Batch Learning using Forward Looking Subgradients. Manuscript, 2008.
14. Kivinen, J., Smola, A. J., Williamson, R. C.: Online learning with kernels. Advances in Neural Information Processing Systems, 2001, pp. 785-792.
15. Menon, A. K.: Large-scale support vector machines: algorithms and theory. Research Exam, University of California, San Diego, 2009, pp. 1-17.
16. Jain, P., Kapoor, A.: Active Learning for Large Multi-class Problems. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 762-769.
17. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML 2007: Proceedings of the 24th International Conference on Machine Learning, pp. 807-814, New York, NY, USA, 2007. ACM, ISBN 978-1-59593-793-3.
18. Weston, J., Watkins, C.: Multi-class support vector machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, May 1998.