Comparison of Optimization Methods for L1-regularized Logistic Regression

Aleksandar Jovanovich
Department of Computer Science and Information Systems
Youngstown State University
Youngstown, OH 44555
aleksjovanovich@gmail.com

Alina Lazar
Department of Computer Science and Information Systems
Youngstown State University
Youngstown, OH 44555
alazar@ysu.edu

Abstract

Logistic regression with L1-regularization has been recognized as a prominent method for feature extraction in linear classification problems. Various optimization methods for L1-regularized logistic regression have been proposed in recent years; however, few studies have been conducted to compare them. This paper reviews existing optimization methods and then tests them on a binary dataset. Results are recorded and comparisons are made. After analyzing the results, the conclusion is that the GLMNET method is the best in terms of time efficiency.

Introduction

Digital information is growing at an extreme rate. Emerging technologies have created an environment that is information driven. From social media to medical records, data is collected in all forms from around the world. Current trends suggest a jump in information gathered and collected over the next decade and beyond. Never before has there been such an abundance of data and information as we see today.

As the amount of data collected continues to grow, so does the challenge of processing it. The data is growing wide: the number of attributes and features that can be derived sometimes outnumbers the sample size. More and more binary large objects are appearing in databases, which requires a different approach to identifying and extracting information.

Researchers have turned to regularized generalized linear models to form relationships about the binary data. Regularization is required to avoid over-fitting when there are a large number of parameters. In particular, L1-regularized regression is often used for feature selection and has been shown to generate sparse models (Yuan, Chang, and Lin 2010).

Recently, a large amount of research has been conducted on such regularization methods. Each method is differentiated by various aspects including convergence speed, ease of implementation, and practicality. There is therefore value in conducting a thorough comparison and evaluation (Yuan, Chang, and Lin 2010). In this paper, we review prevailing methods for L1-regularized logistic regression and give a detailed comparison.

Background

Logistic regression is used to predict the probability of occurrence of an event by fitting data to a function. It is a generalized linear model used for binomial regression. Like other forms of regression analysis, it makes use of one or more predictor variables that may be either numerical or categorical. The logistic regression problem is an optimization problem and can be solved by a wide variety of methods, such as gradient descent, steepest descent, and Newton's method. Once optimization is complete and the maximum likelihood estimates are found, a prediction on the probability of the two possible outcomes can be made (Koh, Kim, and Boyd 2007).

The logistic model has the form:

    Prob(b | x) = 1 / (1 + exp(-b (w^T x + v)))                                  [1]

where b ∈ {-1, +1} denotes the associated binary output, x is the vector of predictor variables, w is the weight vector, v is the intercept, and Prob(b | x) is the conditional probability of b given x.
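To make the model in Equation [1] concrete, the short R sketch below (ours, not part of the original experiments) evaluates the conditional probability for a toy example; the function name logistic_prob and the values of w, v, and x are illustrative only.

    # Minimal sketch of the logistic model in Equation [1]:
    # Prob(b | x) = 1 / (1 + exp(-b * (w'x + v))), with b in {-1, +1}.
    logistic_prob <- function(b, x, w, v) {
      1 / (1 + exp(-b * (sum(w * x) + v)))
    }

    w <- c(0.8, -1.2)   # hypothetical weight vector
    v <- 0.1            # hypothetical intercept
    x <- c(0.5, 0.3)    # one observation

    logistic_prob(+1, x, w, v)   # probability of the positive class
    logistic_prob(-1, x, w, v)   # probability of the negative class; the two sum to 1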
L1-regularized logistic regression has recently received attention. The main motivation is that it yields a sparse weight vector with relatively few nonzero coefficients (Koh et al. 2007). A logistic model with a sparse weight vector is simpler and more efficient when dealing with data that have fewer observations than features. L1-regularized logistic regression has also been shown to outperform L2-regularized logistic regression (Wainwright, Ravikumar, and Lafferty 2007).

The L1-regularized logistic regression problem minimizes the following objective:

    l_avg(v, w) + λ ||w||_1 = (1/m) Σ_{i=1..m} f(w^T a_i + v b_i) + λ ||w||_1     [2]

where λ > 0 is the regularization parameter, f(z) = log(1 + exp(-z)) is the logistic loss, and a_i = b_i x_i (Koh, Kim, and Boyd 2007). A solution must exist, but it need not be unique. The objective function in the L1-regularized logistic regression problem is not differentiable, so solving the problem is a computational challenge (Koh, Kim, and Boyd 2007).

A regularization path is the set of solutions obtained from the L1-regularized regression problem as λ varies. In many cases the entire regularization path needs to be computed in order to determine an appropriate value of λ. The regularization path of a smaller L1-regularized linear regression problem can be computed efficiently (Friedman, Hastie, and Tibshirani 2010). Hastie et al. (2004) describe an algorithm for computing the entire regularization path for general linear models, including logistic regression models. Path-following methods, however, can be slow for large-scale problems where the number of observations is very large.

Optimization

Each method uses a type of optimization approach to find the regularization path as well as λ. The general scheme used in each method consists of descent iterations in which a chosen subset of variables is deemed the working set and all other variables are kept fixed. With every step, the resulting sub-problem contains fewer variables and is therefore easier to solve.

Coordinate Descent Method

Typically, a coordinate descent method sequentially goes through all variables and then repeats the same process. By solving the regression problem along an entire path of values, this method efficiently calculates the regularization parameters (Friedman, Hastie, and Tibshirani 2010).

Generalized Linear Model with Elastic Net

GLMNET applies a shrinking technique to solve smaller optimization problems. GLMNET conducts feature-wise normalization before solving the optimization problem. Then, GLMNET measures the relative step change in the successive coordinate descent iterations (Yuan, Chang, and Lin 2010).

Continuous Generalized Gradient Descent

An effective regularization strategy in generalized regression is to use validation methods to choose a suitable point along a trajectory or within a family of models. Due to the use of gradient information, the number of iterations is smaller than for cyclic coordinate descent methods; however, the cost per iteration is higher (Zhang 2007).

Least Angle Regression

LARS relates to the classic model-selection method known as Forward Selection (Efron, Hastie, Johnstone, and Tibshirani 2004). Given a collection of possible predictors, a selection is made based on the largest absolute correlation with the response y. Thereafter, simple linear regression is performed on the response y. This leaves a residual vector that can be considered the new response. Projection is made over the other predictors orthogonally to the response, and the selection process is then repeated. After n steps this results in a set of predictors that are then used to construct an n-parameter linear model.

Relaxed Lasso

Relaxo is a generalization of the Lasso shrinkage technique for linear regression. Both variable selection and parameter estimation are achieved by the regular Lasso, yet the two steps do not necessarily use the same penalty parameter. The results include all Lasso solutions but allow for sparser models with similar predictive performance when many predictor variables are present. The package is based on the LARS package (Meinshausen 2007).
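All of the methods above produce a path of coefficient vectors indexed by λ. As an illustrative sketch (ours, using the glmnet R package as one concrete instance, with toy data standing in for the gene-expression matrix), fitting such a path and inspecting a sparse solution can be done as follows.

    # Illustrative only: fit an L1-regularized logistic regression path
    # with glmnet (coordinate descent); x and y are toy stand-ins.
    library(glmnet)

    set.seed(1)
    x <- matrix(rnorm(72 * 200), nrow = 72)                          # 72 samples, 200 features
    y <- factor(sample(c("class0", "class1"), 72, replace = TRUE))   # toy binary labels

    # alpha = 1 selects the pure L1 (lasso) penalty; glmnet computes the whole
    # regularization path over a decreasing sequence of lambda values.
    fit <- glmnet(x, y, family = "binomial", alpha = 1)

    coef(fit, s = 0.05)          # sparse coefficient vector at one value of lambda
    plot(fit, xvar = "lambda")   # coefficient profiles along the path (cf. Figures 4-6)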
Datasets

All the experiments were done using the Leukemia dataset, a gene-expression dataset first described by Golub et al. (1999). The version pre-processed with the methods from Dettling (2004) was used. The dataset consists of 72 samples belonging to two classes, 0 and 1; 47 samples are from class 0 and 25 are from class 1. There are 3,571 predictor variables with numeric values in the interval [-10, 10], most of them close to 0.

Figures 1 and 2 show the histograms for two of the variables, the first one and the 250th one. More than 75% of the values of variable 1 lie in the [-1, 0] interval. The values of variable 250 are approximately normally distributed in the [-2.5, 2.5] interval.

Figure 1. Histogram for predictor variable 1.

Figure 2. Histogram for predictor variable 250.

Experiments

So far, we have described several large-scale optimization methods for solving L1-regularized logistic regression problems. In this section, we conduct experiments to investigate their individual and group performances. First we describe the experimental settings; then the optimization methods are compared in terms of accuracy and time.

To run the experiments we used the GLMNET, CGGD, Relaxo, and LARS packages in R. The LARS and Relaxo packages fit lasso model paths, while the GLMNET package fits lasso and elastic-net model paths for logistic and multinomial regression using coordinate descent. These algorithms are extremely fast because they exploit sparsity in the data matrix. CGGD is used for performing regressions while continuously varying the regularization; the method returns the models fit along the continuous path of parameter modification.

To be able to provide good predictions using the GLMNET algorithm, the regularization parameter λ has to be found first. This can be done in R using a grid search and functions from the caret package (Kuhn 2012). First, the trainControl function is used to set the training parameters; bootstrap sampling is repeated 25 times to increase the chance of obtaining high-accuracy results. The model is then obtained using caret's train function (FL denotes the model formula and trainset the training data frame, defined elsewhere in the script; the trainControl settings shown are one plausible configuration consistent with the description above):

    library(caret)
    # 25 bootstrap resamples; class probabilities are needed for the ROC metric.
    MyTrainControl <- trainControl(method = "boot", number = 25,
                                   classProbs = TRUE,
                                   summaryFunction = twoClassSummary)

    model <- train(FL, data = trainset, method = 'glmnet', metric = "ROC",
                   tuneGrid = expand.grid(.alpha = c(0, 1),
                                          .lambda = seq(0.02, 0.4, length = 20)),
                   trControl = MyTrainControl)

The search interval for λ is [0.02, 0.4] with a step of 0.02. Parameter α can take two values, 0 or 1. For α = 0 and all λ values the AUC (area under the curve) reaches its maximum of 0.992. These results are shown in Figure 3.

Figure 3. GLMNET ROC curve for the grid search.

The coefficients from step 1 to 100 were recorded and their profiles are plotted in Figures 4, 5, and 6. Unfortunately, we were unable to plot the coefficients of the Relaxo package.

Figure 4. Profile of estimated coefficients for the GLMNET method.

Figure 5. Profile of estimated coefficients for the CGGD method.

Figure 6. Profile of estimated coefficients for the LARS method.

10-fold cross-validation was used, and timings were recorded. The timings in seconds for GLMNET, CGGD, Relaxo, and LARS over the Leukemia data are presented below; all timings were performed on one HP TX2000 series laptop.

    Method    Optimization (100 steps)    Cross-validation (100 steps)
    GLMNET    0.036 s                     0.420 s
    Relaxo    0.064 s                     4.076 s
    LARS      0.116 s                     1.932 s
    CGGD      1.280 s                     1.380 s
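Timings of this kind can be reproduced in outline with R's system.time; the sketch below (ours, again using glmnet and toy data as a stand-in for the Leukemia matrix) shows one way to time a 100-step path fit and a 10-fold cross-validation run. The other packages (lars, relaxo, CGGD) expose analogous fitting calls that can be timed the same way.

    # Illustrative timing sketch; x and y are toy stand-ins for the expression data.
    library(glmnet)
    set.seed(1)
    x <- matrix(rnorm(72 * 200), nrow = 72)
    y <- factor(sample(c("class0", "class1"), 72, replace = TRUE))

    # Time the 100-step L1 path fit.
    opt_time <- system.time(
      glmnet(x, y, family = "binomial", alpha = 1, nlambda = 100)
    )["elapsed"]

    # Time 10-fold cross-validation over the same path.
    cv_time <- system.time(
      cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)
    )["elapsed"]

    c(optimization = unname(opt_time), cross_validation = unname(cv_time))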
Conclusions

When compared, GLMNET is the most efficient algorithm. By the 100th step, the estimated coefficients for GLMNET are stronger than those of both CGGD and LARS. When comparing the timings, GLMNET is roughly three times as fast as CGGD for cross-validation and more than an order of magnitude faster for the 100-step optimization. Relaxo is almost twice as slow as GLMNET for optimization and almost ten times as slow for cross-validation. We can conclude that the most efficient of the compared methods for L1-regularized logistic regression is GLMNET.

The Leukemia dataset has a larger number of features compared to the number of instances, and linear models work well with datasets having such characteristics. Although the data is high-dimensional, it contains only a small number of samples; testing on a dataset with many samples and few features should be investigated in future work.

References

Dettling, M. 2004. BagBoosting for Tumor Classification with Gene Expression Data. Bioinformatics 20:3583-3593.

Efron, B.; Hastie, T.; Johnstone, I.; and Tibshirani, R. 2004. Least Angle Regression. Annals of Statistics 32:407-499.

Friedman, J.; Hastie, T.; and Tibshirani, R. 2010. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1):1-22.

Golub, T.; Slonim, D. K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J. P.; Coller, H.; Loh, M. L.; Downing, J. R.; Caligiuri, M. A.; Bloomfield, C. D.; and Lander, E. S. 1999. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286:531-536.

Hastie, T.; Rosset, S.; Tibshirani, R.; and Zhu, J. 2004. The Entire Regularization Path for the Support Vector Machine. Journal of Machine Learning Research 5:1391-1415.

Koh, K.; Kim, S. J.; and Boyd, S. 2007. An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression. Journal of Machine Learning Research 8:1519-1555.

Kuhn, M. 2012. Package 'caret'. http://cran.r-project.org/web/packages/caret/caret.pdf

Meinshausen, N. 2007. Relaxed Lasso. Computational Statistics and Data Analysis 52(1):374-393.

Wainwright, M.; Ravikumar, P.; and Lafferty, J. 2007. High-Dimensional Graphical Model Selection Using L1-Regularized Logistic Regression. In Advances in Neural Information Processing Systems (NIPS) 19.

Yuan, G. X.; Chang, K. W.; and Lin, C. J. 2010. A Comparison of Optimization Methods and Software for Large-Scale L1-Regularized Linear Classification. Journal of Machine Learning Research 11:3183-3234.

Yuan, G. X.; Ho, C. H.; and Lin, C. J. 2011. An Improved GLMNET for L1-Regularized Logistic Regression. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Zhang, C. H. 2007. Continuous Generalized Gradient Descent. Journal of Computational and Graphical Statistics.