Comparison of Optimization Methods for L1-regularized Logistic Regression

Aleksandar Jovanovich
Department of Computer Science and Information Systems
Youngstown State University
Youngstown, OH 44555
aleksjovanovich@gmail.com

Alina Lazar
Department of Computer Science and Information Systems
Youngstown State University
Youngstown, OH 44555
alazar@ysu.edu

Abstract

Logistic regression with L1-regularization has been recognized as a prominent method for feature extraction in linear classification problems. Various optimization methods for L1-regularized logistic regression have been proposed in recent years; however, few studies have been conducted to compare them. This paper reviews existing optimization methods and then tests them on a binary dataset. Results are recorded and comparisons are made. After analyzing the results, the conclusion is that the GLMNET method is the best in terms of time efficiency.

Introduction

Digital information is growing at an extreme rate. Emerging technologies have created an environment that is information driven. From social media to medical records, data is collected in all forms from around the world. Current trends suggest a jump in information gathered and collected over the next decade and beyond. Never before has there been such an abundance of data and information as we see today.

As the amount of data collected continues to grow, so does the challenge of processing it. The data is growing wide: the number of attributes and features that can be derived sometimes outnumbers the sample size. More and more binary large objects are appearing in databases, which requires a different approach to identifying and extracting information.

Researchers have turned to regularized generalized linear models to form relationships about the binary data. Regularization is required to avoid over-fitting when there are a large number of parameters. In particular, L1-regularized regression is often used for feature selection and has been shown to generate sparse models (Yuan, Chang, and Lin 2010).

Recently, a large amount of research has been conducted on such regularization methods. Each method is differentiated by various aspects including convergence speed, ease of implementation, and practicality. There is therefore value in conducting a thorough comparison and evaluation (Yuan, Chang, and Lin 2010). In this paper, we review prevailing methods for L1-regularized logistic regression and give a detailed comparison.

Background

Logistic regression is used to predict the probability of occurrence of an event by fitting data to a function. It is a generalized linear model used for binomial regression. Like other forms of regression analysis, it makes use of one or more predictor variables that may be either numerical or categorical. The logistic regression problem is an optimization problem and can be solved by a wide variety of methods, such as gradient descent, steepest descent, and Newton's method. Once optimization is complete and the maximum likelihood estimates are found, a prediction on the probability of the two possible outcomes can be made (Koh, Kim, and Boyd 2007).

The logistic model has the form:

    Prob(b | x) = 1 / (1 + exp(-b (w^T x + v)))                                  [1]

where b ∈ {-1, +1} denotes the associated binary output, x is the vector of predictor variables, w is the weight vector, v is the intercept, and Prob(b | x) is the conditional probability of b given x.
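To make the model in Equation [1] concrete, the short R sketch below (ours, not part of the original experiments) evaluates the conditional probability for a toy example; the function name logistic_prob and the values of w, v, and x are illustrative only.

    # Minimal sketch of the logistic model in Equation [1]:
    # Prob(b | x) = 1 / (1 + exp(-b * (w'x + v))), with b in {-1, +1}.
    logistic_prob <- function(b, x, w, v) {
      1 / (1 + exp(-b * (sum(w * x) + v)))
    }

    w <- c(0.8, -1.2)   # hypothetical weight vector
    v <- 0.1            # hypothetical intercept
    x <- c(0.5, 0.3)    # one observation

    logistic_prob(+1, x, w, v)   # probability of the positive class
    logistic_prob(-1, x, w, v)   # probability of the negative class; the two sum to 1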
L1-regularized logistic regression has recently received attention. The main motivation is that it yields a sparse weight vector with relatively few nonzero coefficients (Koh et al. 2007). A logistic model with a sparse weight vector is simpler and more efficient when dealing with data that have fewer observations than features. L1-regularized logistic regression has also been shown to outperform L2-regularized logistic regression (Wainwright, Ravikumar, and Lafferty 2007).

The L1-regularized logistic regression problem minimizes the following objective:

    l_avg(v, w) + λ ||w||_1 = (1/m) Σ_{i=1..m} f(w^T a_i + v b_i) + λ ||w||_1     [2]

where λ > 0 is the regularization parameter, f(z) = log(1 + exp(-z)) is the logistic loss, and a_i = b_i x_i (Koh, Kim, and Boyd 2007). A solution must exist, but it need not be unique. The objective function in the L1-regularized logistic regression problem is not differentiable, so solving the problem is a computational challenge (Koh, Kim, and Boyd 2007).

A regularization path is the set of solutions obtained from the L1-regularized regression problem as λ varies. In many cases the entire regularization path needs to be computed in order to determine an appropriate value of λ. The regularization path of a smaller L1-regularized linear regression problem can be computed efficiently (Friedman, Hastie, and Tibshirani 2010). Hastie et al. (2004) describe an algorithm for computing the entire regularization path for general linear models, including logistic regression models. Path-following methods, however, can be slow for large-scale problems where the number of observations is very large.

Optimization

Each method uses a type of optimization approach to find the regularization path as well as λ. The general scheme used in each method consists of descent iterations in which a chosen subset of variables is deemed the working set and all other variables are kept fixed. With every step, the resulting sub-problem contains fewer variables and is therefore easier to solve.

Coordinate Descent Method

Typically, a coordinate descent method sequentially goes through all variables and then repeats the same process. By solving the regression problem along an entire path of values, this method efficiently calculates the regularization parameters (Friedman, Hastie, and Tibshirani 2010).

Generalized Linear Model with Elastic Net

GLMNET applies a shrinking technique to solve smaller optimization problems. GLMNET conducts feature-wise normalization before solving the optimization problem. Then, GLMNET measures the relative step change in the successive coordinate descent iterations (Yuan, Chang, and Lin 2010).

Continuous Generalized Gradient Descent

An effective regularization strategy in generalized regression is to use validation methods to choose a suitable point along a trajectory or within a family of models. Due to the use of gradient information, the number of iterations is smaller than for cyclic coordinate descent methods; however, the cost per iteration is higher (Zhang 2007).

Least Angle Regression

LARS relates to the classic model-selection method known as Forward Selection (Efron, Hastie, Johnstone, and Tibshirani 2004). Given a collection of possible predictors, a selection is made based on the largest absolute correlation with the response y. Thereafter, simple linear regression is performed on the response y. This leaves a residual vector that can be considered the new response. Projection is made over the other predictors orthogonally to the response, and the selection process is then repeated. After n steps this results in a set of predictors that are then used to construct an n-parameter linear model.

Relaxed Lasso

Relaxo is a generalization of the Lasso shrinkage technique for linear regression. Both variable selection and parameter estimation are achieved by the regular Lasso, yet the two steps do not necessarily use the same penalty parameter. The results include all Lasso solutions but allow for sparser models with similar predictive performance when many predictor variables are present. The package is based on the LARS package (Meinshausen 2007).
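All of the methods above produce a path of coefficient vectors indexed by λ. As an illustrative sketch (ours, using the glmnet R package as one concrete instance, with toy data standing in for the gene-expression matrix), fitting such a path and inspecting a sparse solution can be done as follows.

    # Illustrative only: fit an L1-regularized logistic regression path
    # with glmnet (coordinate descent); x and y are toy stand-ins.
    library(glmnet)

    set.seed(1)
    x <- matrix(rnorm(72 * 200), nrow = 72)                          # 72 samples, 200 features
    y <- factor(sample(c("class0", "class1"), 72, replace = TRUE))   # toy binary labels

    # alpha = 1 selects the pure L1 (lasso) penalty; glmnet computes the whole
    # regularization path over a decreasing sequence of lambda values.
    fit <- glmnet(x, y, family = "binomial", alpha = 1)

    coef(fit, s = 0.05)          # sparse coefficient vector at one value of lambda
    plot(fit, xvar = "lambda")   # coefficient profiles along the path (cf. Figures 4-6)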
Datasets

All the experiments were done using the Leukemia dataset, a gene-expression dataset first described by Golub et al. (1999). The version pre-processed with the methods from Dettling (2004) was used. The dataset consists of 72 samples belonging to two classes, 0 and 1; 47 samples are from class 0 and 25 are from class 1. There are 3,571 predictor variables with numeric values in the interval [-10, 10], most of them close to 0.

Figures 1 and 2 show the histograms for two of the variables, the first one and the 250th one. More than 75% of the values of variable 1 lie in the [-1, 0] interval. The values of variable 250 are approximately normally distributed in the [-2.5, 2.5] interval.

Figure 1. Histogram for predictor variable 1.

Figure 2. Histogram for predictor variable 250.

Experiments

So far, we have described several large-scale optimization methods for solving L1-regularized logistic regression problems. In this section, we conduct experiments to investigate their individual and group performances. First we describe the experimental settings; then the optimization methods are compared in terms of accuracy and time.

To run the experiments we used the GLMNET, CGGD, Relaxo, and LARS packages in R. The LARS and Relaxo packages fit lasso model paths, while the GLMNET package fits lasso and elastic-net model paths for logistic and multinomial regression using coordinate descent. These algorithms are extremely fast because they exploit sparsity in the data matrix. CGGD is used for performing regressions while continuously varying the regularization; the method returns the models fit along the continuous path of parameter modification.

To be able to provide good predictions using the GLMNET algorithm, the regularization parameter λ has to be found first. This can be done in R using a grid search and functions from the caret package (Kuhn 2012). First, the trainControl function is used to set the training parameters; bootstrap sampling is repeated 25 times to increase the chance of obtaining high-accuracy results. The model is then obtained using caret's train function (FL denotes the model formula and trainset the training data frame, defined elsewhere in the script; the trainControl settings shown are one plausible configuration consistent with the description above):

    library(caret)
    # 25 bootstrap resamples; class probabilities are needed for the ROC metric.
    MyTrainControl <- trainControl(method = "boot", number = 25,
                                   classProbs = TRUE,
                                   summaryFunction = twoClassSummary)

    model <- train(FL, data = trainset, method = 'glmnet', metric = "ROC",
                   tuneGrid = expand.grid(.alpha = c(0, 1),
                                          .lambda = seq(0.02, 0.4, length = 20)),
                   trControl = MyTrainControl)

The search interval for λ is [0.02, 0.4] with a step of 0.02. Parameter α can take two values, 0 or 1. For α = 0 and all λ values the AUC (area under the curve) reaches its maximum of 0.992. These results are shown in Figure 3.

Figure 3. GLMNET ROC curve for the grid search.

The coefficients from step 1 to 100 were recorded and their profiles are plotted in Figures 4, 5, and 6. Unfortunately, we were unable to plot the coefficients of the Relaxo package.

Figure 4. Profile of estimated coefficients for the GLMNET method.

Figure 5. Profile of estimated coefficients for the CGGD method.

Figure 6. Profile of estimated coefficients for the LARS method.

10-fold cross-validation was used, and timings were recorded. The timings in seconds for GLMNET, CGGD, Relaxo, and LARS over the Leukemia data are presented below; all timings were performed on one HP TX2000 series laptop.

    Method    Optimization (100 steps)    Cross-validation (100 steps)
    GLMNET    0.036 s                     0.420 s
    Relaxo    0.064 s                     4.076 s
    LARS      0.116 s                     1.932 s
    CGGD      1.280 s                     1.380 s
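Timings of this kind can be reproduced in outline with R's system.time; the sketch below (ours, again using glmnet and toy data as a stand-in for the Leukemia matrix) shows one way to time a 100-step path fit and a 10-fold cross-validation run. The other packages (lars, relaxo, CGGD) expose analogous fitting calls that can be timed the same way.

    # Illustrative timing sketch; x and y are toy stand-ins for the expression data.
    library(glmnet)
    set.seed(1)
    x <- matrix(rnorm(72 * 200), nrow = 72)
    y <- factor(sample(c("class0", "class1"), 72, replace = TRUE))

    # Time the 100-step L1 path fit.
    opt_time <- system.time(
      glmnet(x, y, family = "binomial", alpha = 1, nlambda = 100)
    )["elapsed"]

    # Time 10-fold cross-validation over the same path.
    cv_time <- system.time(
      cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)
    )["elapsed"]

    c(optimization = unname(opt_time), cross_validation = unname(cv_time))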
Conclusions

When compared, GLMNET is the most efficient algorithm. By the 100th step, the estimated coefficients for GLMNET are stronger than those of both CGGD and LARS. When comparing the timings, GLMNET is roughly three times as fast as CGGD for cross-validation and more than an order of magnitude faster for the 100-step optimization. Relaxo is almost twice as slow as GLMNET for optimization and almost ten times as slow for cross-validation. We can conclude that the most efficient of the compared methods for L1-regularized logistic regression is GLMNET.

The Leukemia dataset has a larger number of features compared to the number of instances, and linear models work well with datasets having such characteristics. Although the data is high-dimensional, it contains only a small number of samples; testing on a dataset with many samples and few features should be investigated in future work.

References

Dettling, M. 2004. BagBoosting for Tumor Classification with Gene Expression Data. Bioinformatics 20:3583-3593.

Efron, B.; Hastie, T.; Johnstone, I.; and Tibshirani, R. 2004. Least Angle Regression. Annals of Statistics 32:407-499.

Friedman, J.; Hastie, T.; and Tibshirani, R. 2010. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1):1-22.

Golub, T.; Slonim, D. K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J. P.; Coller, H.; Loh, M. L.; Downing, J. R.; Caligiuri, M. A.; Bloomfield, C. D.; and Lander, E. S. 1999. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286:531-536.

Hastie, T.; Rosset, S.; Tibshirani, R.; and Zhu, J. 2004. The Entire Regularization Path for the Support Vector Machine. Journal of Machine Learning Research 5:1391-1415.

Koh, K.; Kim, S. J.; and Boyd, S. 2007. An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression. Journal of Machine Learning Research 8:1519-1555.

Kuhn, M. 2012. Package 'caret'. http://cran.r-project.org/web/packages/caret/caret.pdf

Meinshausen, N. 2007. Relaxed Lasso. Computational Statistics and Data Analysis 52(1):374-393.

Wainwright, M.; Ravikumar, P.; and Lafferty, J. 2007. High-Dimensional Graphical Model Selection Using L1-Regularized Logistic Regression. In Advances in Neural Information Processing Systems (NIPS) 19.

Yuan, G. X.; Chang, K. W.; and Lin, C. J. 2010. A Comparison of Optimization Methods and Software for Large-Scale L1-Regularized Linear Classification. Journal of Machine Learning Research 11:3183-3234.

Yuan, G. X.; Ho, C. H.; and Lin, C. J. 2011. An Improved GLMNET for L1-Regularized Logistic Regression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Zhang, C. H. 2007. Continuous Generalized Gradient Descent. Journal of Computational and Graphical Statistics.