 Proceedings of the KI 2017 Workshop on Formal and Cognitive Reasoning




    A New Combined Approach for Inference in
     High-Dimensional Regression Models with
              Correlated Variables

                      Niharika Gauraha and Swapan Parui

                             Indian Statistical Institute



      Abstract. We consider the problem of model selection and estimation in
      sparse high dimensional linear regression models with strongly correlated
      variables. First, we study the theoretical properties of the dual Lasso
      solution, and we show that joint consideration of the Lasso primal and its
      dual solutions are useful for selecting correlated active variables. Second,
      we argue that correlation among active predictors is not problematic,
      and we derive a new weaker condition on the design matrix, called
      Pseudo Irrepresentable Condition (PIC). Third, we present a new variable
      selection procedure, Dual Lasso Selector, and we show that PIC is a
      necessary and sufficient condition for consistent variable selection for
       the proposed method. Finally, by further combining the dual Lasso selector
       with Ridge estimation, even better prediction performance is achieved; we
       call this combination DLSelect+Ridge. We illustrate the
      DLSelect+Ridge method and compare it with popular existing methods
      in terms of variable selection and prediction accuracy by considering a
      real dataset.


Keywords: Correlated Variable Selection, High-dimensional Regression, Lasso,
Dual Lasso, Ridge Regression


1   Introduction and Motivation
We start with the standard linear regression model as

                                    Y = X𝛽 + 𝜖,                                      (1)

with response vector Y𝑛×1 , design matrix X𝑛×𝑝 , true underlying coefficient
vector 𝛽𝑝×1 and error vector 𝜖𝑛×1 ∼ 𝑁𝑛 (0, 𝐼). In particular, we consider the case
of a sparse high-dimensional linear model (𝑝 ≫ 𝑛) with strong correlation among a
few variables. The Lasso is a widely used regularized regression method for finding
sparse solutions; the Lasso estimator is defined as
\[
\hat{\beta}_{\mathrm{Lasso}} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2} \| Y - X\beta \|_2^2 + \lambda \|\beta\|_1 \right\}, \tag{2}
\]
where 𝜆 ≥ 0 is the regularization parameter that controls the amount of
regularization. It is known that the Lasso tends to select a single variable from a





group of strongly correlated variables even if many or all of these variables are
important. In the presence of correlated predictors, the concept of clustering
or grouping correlated predictors and then pursuing group-wise model fitting
has been proposed (see [4] and [5]). When the dimension is very high, or in the
case of overlapping clusters, finding an appropriate group structure remains as
difficult as the original problem. An alternative approach is simultaneous clustering
and model fitting, which involves a combination of two different penalties. For
example, the Elastic-Net [15] combines two regularization techniques: the ℓ2
regularization provides a grouping effect and the ℓ1 regularization produces sparse
models. Therefore, the Elastic-Net selects or drops highly correlated variables
together, depending on the amount of ℓ1 and ℓ2 regularization.
    The influence of correlations on Lasso prediction has been studied in [6]
and [7], where it is shown that Lasso prediction works well in the presence of any
degree of correlation, provided the amount of regularization is chosen appropriately.
However, studies show that correlations are problematic for parameter estimation
and variable selection. It has been proven that the design matrix must satisfy the
following two conditions for the Lasso to perform exact variable selection: the
irrepresentability condition (IC) [14] and the beta-min condition [2]. Highly
correlated variables typically mean that the design matrix violates the IC, and the
Lasso solution is not stable. When active covariates are highly correlated, the Lasso
solution is not unique and the Lasso randomly selects one or a few variables from a
correlated group. However, even in the case of highly correlated variables the
corresponding dual Lasso solution is always unique. The dual of the Lasso
problem (2), as derived in [13], is given by

\[
\sup_{\theta} \; \frac{1}{2}\|Y\|_2^2 - \frac{1}{2}\|\theta - Y\|_2^2
\quad \text{subject to} \quad |X_j^T \theta| \le \lambda \ \text{ for } j = 1, \dots, p, \tag{3}
\]

where 𝜃 is the dual vector. The intuitions drawn from [12] and [13] further
motivate us to consider the Lasso optimum and its dual optimum together,
which helps in selecting correlated active predictors.
    Exploiting the uniqueness of the dual Lasso solution, we propose a new
variable selection procedure, the Dual Lasso Selector (DLS). For a given
Lasso estimator 𝛽^𝐿𝑎𝑠𝑠𝑜 (𝜆), we can compute the corresponding dual Lasso solution
via the following relationship between the Lasso solution and its dual (see [13] for
the derivation):

\[
\hat{\theta}(\lambda) = Y - X\hat{\beta}_{\mathrm{Lasso}}(\lambda). \tag{4}
\]
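For concreteness, the following minimal Python sketch computes the dual vector and the dual active set from a Lasso fit. It assumes scikit-learn, whose Lasso minimises (1/(2n))‖Y − X𝛽‖²₂ + α‖𝛽‖₁, so that α = 𝜆/n corresponds to the parameterization in (2); the numerical tolerance used to detect active dual constraints is an implementation choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

def dual_lasso_active_set(X, y, lam, tol=1e-6):
    """Fit the Lasso for (1/2)||y - X b||_2^2 + lam * ||b||_1, recover the
    dual vector theta = y - X b_hat (equation (4)) and return the primal
    support together with the dual active set {j : |X_j^T theta| = lam}."""
    n = X.shape[0]
    # scikit-learn minimises (1/(2n))||y - X b||^2 + alpha ||b||_1,
    # hence alpha = lam / n corresponds to the lambda used in (2).
    beta_hat = Lasso(alpha=lam / n, fit_intercept=False, max_iter=10000).fit(X, y).coef_
    theta = y - X @ beta_hat                   # dual optimum, equation (4)
    corr = np.abs(X.T @ theta)                 # dual feasibility values |X_j^T theta|
    S_dual = np.where(corr >= lam - tol)[0]    # constraints active up to the tolerance
    S_lasso = np.where(beta_hat != 0)[0]       # primal (Lasso) support
    return S_lasso, S_dual, theta
```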

Basically, the DLS active set (to be defined later) corresponds to the predictors
that satisfy the dual Lasso feasibility constraints with equality (we discuss this in
detail in a later section). We argue that correlation among active predictors is not
problematic, and we define a new, weaker condition on the design matrix that
allows for correlation among active predictors, called the Pseudo Irrepresentable
Condition (PIC). We show that the PIC is a necessary and sufficient condition





for the proposed dual Lasso selector to select the true active set (under the
assumption of the beta-min condition) with high probability. Moreover, we use
the ℓ2 penalty (Ridge regression, [8]), which is known to perform well in the case
of correlated variables, to estimate the coefficients of the predictors selected by
the dual Lasso selector. We call the combination of the two DLSelect+Ridge.
DLSelect+Ridge resembles Ridge-after-Lasso, but it is conceptually different
and behaves differently from the Lasso followed by Ridge, especially in the
presence of highly correlated variables. DLSelect+Ridge also resembles the
Elastic-Net, since both are combinations of ℓ1 and ℓ2 penalties, but the Elastic-Net
is a combination of Ridge regression followed by the Lasso. In addition, the
Elastic-Net needs to cross-validate over a two-dimensional grid, 𝑂(𝑘 2 ), to select
its optimal regularization parameters, whereas DLSelect+Ridge cross-validates
twice over a one-dimensional grid, 𝑂(𝑘), where k is the length of the search space
for a regularization parameter.
    We have organized the rest of the paper as follows. We provide background
and notation in Section 2. In Section 3, we present the Dual Lasso Selector, define
the PIC, discuss variable selection consistency under this assumption on the design
matrix, and illustrate the proposed method on a real dataset. We provide some
concluding remarks in Section 4.


2    Notations and Assumptions
In this section, we state the notation and assumptions used throughout the paper.
We consider the usual sparse high-dimensional linear regression model as given in
equation (1) with 𝑝 ≫ 𝑛. For the design matrix X ∈ R𝑛×𝑝 , we denote its rows
by 𝑥𝑇𝑖 ∈ R𝑝 , 𝑖 = 1, ..., 𝑛, and its columns by 𝑋𝑗 ∈ R𝑛 , 𝑗 = 1, ..., 𝑝. We assume
that the design matrix X is fixed, the data are centred and the predictors are
standardized, so that
\[
\sum_{i=1}^{n} Y_i = 0, \qquad \sum_{i=1}^{n} (X_j)_i = 0 \qquad \text{and} \qquad \frac{1}{n} X_j^T X_j = 1 \quad \text{for all } j = 1, \dots, p.
\]
We denote by 𝑆 = {𝑗 ∈ {1, ..., 𝑝} : 𝛽𝑗 ̸= 0} the true active set, with cardinality
𝑠 = |𝑆|. We assume that the true coefficient vector 𝛽 is sparse, that is, 𝑠 ≪ 𝑝. We
write X𝑆 for the restriction of X to the columns in 𝑆, and 𝛽𝑆 for the vector 𝛽
restricted to the support 𝑆, with zeros outside the support. Without loss of
generality we assume that the first 𝑠 variables are the active variables, and we
partition the covariance matrix 𝐶 = X𝑇 X/𝑛 according to the active and the
redundant variables as follows:
\[
C = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}. \tag{5}
\]

Similarly, the coefficient vector 𝛽 is partitioned as (𝛽1 , 𝛽2 )𝑇 . The ℓ1 -norm and
the squared ℓ2 -norm are defined as ‖𝛽‖1 = |𝛽1 | + · · · + |𝛽𝑝 | and
‖𝛽‖2 2 = 𝛽1 2 + · · · + 𝛽𝑝 2 , respectively. The sub-gradient 𝜕‖𝛽‖1 and the sign
function 𝑠𝑖𝑔𝑛(𝛽) are defined componentwise as
\[
(\partial\|\beta\|_1)_i = \begin{cases} 1 & \text{if } \beta_i > 0 \\ [-1, 1] & \text{if } \beta_i = 0 \\ -1 & \text{if } \beta_i < 0 \end{cases}, \qquad
\mathrm{sign}(\beta)_i = \begin{cases} 1 & \text{if } \beta_i > 0 \\ 0 & \text{if } \beta_i = 0 \\ -1 & \text{if } \beta_i < 0. \end{cases} \tag{6}
\]
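As a practical companion to this section, the following NumPy sketch centres the data, standardizes the predictors as above and forms the partition of C in equation (5); the active set S is assumed to be given as a 0-based index set.

```python
import numpy as np

def standardize_and_partition(X, y, S):
    """Centre y, centre and scale the columns of X so that X_j^T X_j / n = 1,
    and partition C = X^T X / n according to the active set S, cf. equation (5)."""
    n, p = X.shape
    y = y - y.mean()
    X = X - X.mean(axis=0)
    X = X / np.sqrt((X ** 2).mean(axis=0))          # now X_j^T X_j / n = 1 for every column
    C = X.T @ X / n
    Sc = np.setdiff1d(np.arange(p), S)              # complement S^c of the active set
    C11, C12 = C[np.ix_(S, S)], C[np.ix_(S, Sc)]
    C21, C22 = C[np.ix_(Sc, S)], C[np.ix_(Sc, Sc)]
    return X, y, (C11, C12, C21, C22)
```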





3    Dual Lasso Selector

In this section, we present the dual Lasso selector, a variable selection method
for sparse high-dimensional regression models with correlated variables. First,
we state some basic properties of the Lasso and its dual, which have already been
derived and studied by various authors; see [12] and [13] for more details.

 1. Uniqueness of the Lasso fit: There may not be a unique solution to the
    Lasso problem, because the criterion in equation (2) is not strictly convex
    in 𝛽. But the least-squares loss is strictly convex in X𝛽, hence there is
    always a unique fitted value X𝛽^.
 2. Uniqueness of the dual vector: The dual problem is strictly convex in 𝜃,
    therefore the dual optimum 𝜃^ is unique. Another argument for the uniqueness
    of 𝜃^ is that it is a function of X𝛽^, as given in equation (4), which itself is
    unique. The fact that the DLS can achieve consistent variable selection in
    situations (with correlated active predictors) where the Lasso is unstable for
    estimating the true active set is tied to this uniqueness of the dual Lasso
    solution.
 3. Uniqueness of the sub-gradient: The sub-gradient of the ℓ1 -norm is the same
    for every Lasso solution 𝛽^, because it is a function of X𝛽^. More specifically,
    if 𝛽^ and 𝛽˜ are two Lasso solutions for a fixed 𝜆 value, then they must have
    the same signs, 𝑠𝑖𝑔𝑛(𝛽^) = 𝑠𝑖𝑔𝑛(𝛽˜); it is not possible that 𝛽^𝑗 > 0 and 𝛽˜𝑗 < 0
    for some 𝑗.

Let 𝑆^𝐿𝑎𝑠𝑠𝑜 denote the support (active) set of the Lasso estimator 𝛽^, given by
𝑆^𝐿𝑎𝑠𝑠𝑜 (𝜆) = {𝑗 ∈ {1, ..., 𝑝} : (𝛽^𝐿𝑎𝑠𝑠𝑜 )𝑗 ̸= 0}. Similarly, we define the active set of
the dual Lasso vector, which corresponds to the active constraints of the dual
optimization problem: 𝑆^𝑑𝑢𝑎𝑙 (𝜆) = {𝑗 ∈ {1, ..., 𝑝} : |𝑋𝑗𝑇 𝜃^(𝜆)| = 𝜆}.
Now, we state the following lemmas that will be used later for our mathematical
derivations.

Lemma 1. The active set selected by the Lasso, 𝑆^𝐿𝑎𝑠𝑠𝑜 (𝜆), is always contained in
the active set selected by the dual Lasso, 𝑆^𝑑𝑢𝑎𝑙 (𝜆); that is, 𝑆^𝐿𝑎𝑠𝑠𝑜 (𝜆) ⊆ 𝑆^𝑑𝑢𝑎𝑙 (𝜆).

Proof. From the KKT conditions (see [13]), we have
\[
|X_j^T \hat{\theta}| < \lambda \;\Longrightarrow\; \hat{\beta}_j = 0. \tag{7}
\]
Equivalently, if 𝛽^𝑗 ̸= 0 then |𝑋𝑗𝑇 𝜃^| = 𝜆 (since dual feasibility gives |𝑋𝑗𝑇 𝜃^| ≤ 𝜆),
that is, 𝑗 ∈ 𝑆^𝑑𝑢𝑎𝑙 (𝜆). Hence 𝑆^𝐿𝑎𝑠𝑠𝑜 (𝜆) ⊆ 𝑆^𝑑𝑢𝑎𝑙 (𝜆).
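Lemma 1 is easy to check numerically; the snippet below reuses the dual_lasso_active_set sketch given after equation (4) on arbitrary synthetic data in which two columns are identical, so the Lasso support is not unique while the stated inclusion still holds (up to the solver tolerance).

```python
import numpy as np

# Illustration of Lemma 1 on synthetic data with a duplicated column.
rng = np.random.default_rng(1)
n, p = 50, 100
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0]                                  # columns 0 and 1 are identical
y = X[:, 0] + X[:, 2] + 0.1 * rng.standard_normal(n)

S_lasso, S_dual, _ = dual_lasso_active_set(X, y, lam=5.0, tol=1e-3)
print(sorted(S_lasso), sorted(S_dual))             # S_lasso is contained in S_dual
```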

It is known that the IC (assuming that the beta-min condition holds, as we do
throughout the paper) is a necessary and sufficient condition for the Lasso to
select the true model (see [14]).

Lemma 2. Under the assumption of the IC on the design matrix, the active set
selected by the Lasso, 𝑆^𝐿𝑎𝑠𝑠𝑜 (𝜆), is equal to the active set selected by the dual
Lasso, 𝑆^𝑑𝑢𝑎𝑙 (𝜆); that is, 𝑆^𝐿𝑎𝑠𝑠𝑜 (𝜆) = 𝑆^𝑑𝑢𝑎𝑙 (𝜆).





The proof of the above lemma lies in the uniqueness of the Lasso solution under
the IC assumption [14]. Suppose that we partition the covariance matrix as in
equation (5); then the IC is said to be met for the set 𝑆 with a constant 𝜂 > 0 if
\[
\| C_{21}\, C_{11}^{-1}\, \mathrm{sign}(\beta_1) \|_{\infty} \le 1 - \eta. \tag{8}
\]

The IC may fail to hold due to violation of either one (or both) of the following
two conditions: 1. 𝐶11 is (nearly) singular, that is, there is strong correlation
among the variables of the true active set; 2. the active predictors are correlated
with the noise features.
    When there is strong correlation among the variables of the active set, 𝐶11 is
(nearly) singular, the IC does not hold, and the Lasso fails to perform consistent
variable selection. In the following, we argue that the dual Lasso can still perform
variable selection consistently even when 𝐶11 is not invertible, under the assumption
of a milder condition on the design matrix, called the Pseudo Irrepresentable
Condition, defined as follows.
Definition 1 (Pseudo Irrepresentable Condition (PIC)). We partition the
covariance matrix as given in equation (5). Then the PIC is said to be met for
the set 𝑆 with a constant 𝜂 > 0 if
\[
\frac{1}{n}\,\bigl| X_j^T X_S \, G \, \mathrm{sign}(\beta_1) \bigr| \le 1 - \eta \quad \text{for all } j \in S^c, \tag{9}
\]
where 𝐺 is a generalized inverse of 𝐶11 of the form
\[
G = \begin{bmatrix} C_A^{-1} & 0 \\ 0 & 0 \end{bmatrix},
\]
and (9) is required to hold for each 𝐶𝐴 ∈ 𝒞𝑅 , where
𝒞𝑅 := {𝐶𝐴𝐴 : 𝐴 ⊆ 𝑆, |𝐴| = 𝑟, rank(𝐶𝐴𝐴 ) = rank(𝐶11 ) = 𝑟}.
    The following lemma gives a sufficient condition for the dual Lasso to recover
the support. It is similar in spirit to the primal-dual witness constructions used in
the Lasso literature; here, however, we do not assume that 𝐶11 is invertible.
Lemma 3 (Primal-Dual Condition for Variable Selection). Suppose that
we can find a primal-dual pair (𝛽^, 𝜃^) that satisfies the following conditions:
\[
X^T (Y - X\hat{\beta}) = \lambda \hat{v}, \quad \text{where } \hat{v} \in \partial\|\hat{\beta}\|_1 \text{ and } \hat{v}_S = \mathrm{sign}(\hat{\beta}_S), \tag{10}
\]
\[
\hat{\theta} = Y - X\hat{\beta}, \tag{11}
\]
\[
\hat{\beta}_j = 0 \quad \text{for all } j \in S^c, \tag{12}
\]
\[
|\hat{v}_j| < 1 \quad \text{for all } j \in S^c. \tag{13}
\]

Then 𝜃^ is the unique optimal solution to the dual Lasso and 𝑆^𝑑𝑢𝑎𝑙 recovers the
true active set.

Proof. We have already shown that the dual Lasso optimum 𝜃^ is unique, so it
remains to show that 𝑆^𝑑𝑢𝑎𝑙 recovers the true active set 𝑆. Combining (10), (11)
and (13) gives 𝑋𝑗𝑇 𝜃^ = 𝜆𝑣^𝑗 and hence |𝑋𝑗𝑇 𝜃^| < 𝜆 for all 𝑗 ∈ 𝑆 𝑐 , so no redundant
variable enters 𝑆^𝑑𝑢𝑎𝑙 . For 𝑗 ∈ 𝑆 we have |𝑋𝑗𝑇 𝜃^| = 𝜆|𝑠𝑖𝑔𝑛(𝛽^𝑗 )| = 𝜆, where the
beta-min condition ensures that the active coefficients are estimated as non-zero.
Therefore 𝑆^𝑑𝑢𝑎𝑙 = 𝑆.





Theorem 1. Under the assumption of the PIC on the design matrix X, the active
set selected by the dual Lasso, 𝑆^𝑑𝑢𝑎𝑙 , equals the true active set 𝑆 with high
probability; that is, 𝑆^𝑑𝑢𝑎𝑙 = 𝑆.

The proof of the above theorem is similar to the proof of the corresponding
variable selection result for the Lasso under the IC (see, e.g., [2]), with the inverse
of the matrix 𝐶11 replaced by its generalized inverse. We note that the PIC may
hold even when 𝐶11 is not invertible, which implies that the PIC is weaker than
the IC. This is illustrated by the following example.
    Let 𝑆 = {1, 2, 3, 4} be the active set, and let the covariance matrix
𝐶 = X𝑇 X/𝑛 of the design matrix X be given by
\[
C = \begin{bmatrix}
1 & 0 & 0 & 0 & \rho \\
0 & 1 & 0 & 0 & \rho \\
0 & 0 & 1 & 0 & \rho \\
0 & 0 & 0 & 1 & \rho \\
\rho & \rho & \rho & \rho & 1
\end{bmatrix}.
\]
Here the active variables are uncorrelated and the noise variable is equally
correlated with all active covariates. It is easy to check that 𝐶 is positive
semi-definite only for |𝜌| ≤ 1/2, and that 𝐶 satisfies the IC for |𝜌| < 1/4. Now we
augment this matrix with two additional columns, one copy of the first and one
copy of the second active variable, and rearrange the columns so that we obtain
the covariance matrix
\[
C_1 = \begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & \rho \\
1 & 1 & 0 & 0 & 0 & 0 & \rho \\
0 & 0 & 1 & 1 & 0 & 0 & \rho \\
0 & 0 & 1 & 1 & 0 & 0 & \rho \\
0 & 0 & 0 & 0 & 1 & 0 & \rho \\
0 & 0 & 0 & 0 & 0 & 1 & \rho \\
\rho & \rho & \rho & \rho & \rho & \rho & 1
\end{bmatrix}.
\]
Suppose that the set of active variables is 𝑆 = {1, 2, 3, 4, 5, 6} and assume that
|𝜌| < 1/4. We partition 𝐶1 as in equation (5); the corresponding sub-matrix 𝐶11
is not invertible, so the IC does not hold and the Lasso may fail to perform
variable selection. The rank of 𝐶11 is 4. Consider any 4 × 4 sub-matrix 𝐶𝐴 of 𝐶11
of full rank, that is, 𝐴 ⊂ 𝑆 with rank(𝐶𝐴 ) = 4; here 𝒞𝑅 corresponds to the index
sets {1, 3, 5, 6}, {1, 4, 5, 6}, {2, 3, 5, 6} and {2, 4, 5, 6}. Further, we consider the
generalized inverse of 𝐶11 given by
\[
C_{11}^{+} = \begin{bmatrix} C_A^{-1} & 0 \\ 0 & 0 \end{bmatrix}, \qquad C_A \in \mathcal{C}_R .
\]
With this generalized inverse the PIC holds for the design matrix X, so the dual
Lasso will select the true active set 𝑆 with high probability and will set the
coefficient of the noise feature to zero.
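The claims made in this example can be verified numerically. The following sketch, with the illustrative value 𝜌 = 0.2 and 0-based indices, evaluates the IC quantity in (8) for C and the PIC quantity in (9) for C1 over all admissible 4 × 4 sub-matrices C_A; in both cases the value is 4𝜌 < 1.

```python
import numpy as np
from itertools import combinations

rho = 0.2                                      # illustrative value with |rho| < 1/4

# Original 5x5 covariance matrix C: the IC quantity equals 4*|rho| < 1.
C = np.eye(5)
C[:4, 4] = C[4, :4] = rho
C11, C21 = C[:4, :4], C[4:, :4]
print(np.abs(C21 @ np.linalg.inv(C11) @ np.ones(4)).max())      # 0.8

# Augmented 7x7 covariance matrix C1: columns 0,1 identical and columns 2,3 identical.
C1 = np.eye(7)
C1[:6, 6] = C1[6, :6] = rho
C1[0, 1] = C1[1, 0] = 1.0
C1[2, 3] = C1[3, 2] = 1.0
S = list(range(6))                             # active set (0-based)
r = int(np.linalg.matrix_rank(C1[np.ix_(S, S)]))   # rank of C11 is 4

# PIC check: |C_{j,S} G sign(beta_1)| for every full-rank 4x4 sub-matrix C_A.
for A in combinations(S, r):
    C_A = C1[np.ix_(A, A)]
    if np.linalg.matrix_rank(C_A) < r:
        continue                               # skip rank-deficient index sets
    G = np.zeros((6, 6))                       # generalized inverse [[C_A^{-1}, 0], [0, 0]]
    G[np.ix_(A, A)] = np.linalg.inv(C_A)
    print(list(A), np.abs(C1[6, S] @ G @ np.ones(6)))            # 0.8 for each admissible A
```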

3.1    Dual Lasso Selection and Ridge Estimation
Now we combine the dual Lasso selection with Ridge estimation: we use the ℓ2
(Ridge) penalty, which is known to perform well in the case of correlated variables,
to estimate the coefficients of the predictors selected by the dual Lasso. The
resulting algorithm, DLSelect+Ridge, is a two-stage procedure: dual Lasso
selection followed by Ridge regression (Algorithm 1).
    If model selection works perfectly (which requires strong assumptions, i.e. the
IC), then post-model-selection estimators are oracle estimators with well-behaved
properties (see [1]). It has been proven that the Lasso+OLS estimator [1] performs
at least as well as the Lasso in terms of the rate of convergence and has a smaller
bias than the Lasso. Furthermore, the Lasso+mLS (Lasso + modified OLS) and
Lasso+Ridge estimators have been proven to be asymptotically unbiased under the
IC, see [9]. Under the IC the Lasso solution is unique, DLSelect+Ridge coincides
with Lasso+Ridge, and the same arguments therefore apply to DLSelect+Ridge.





 Algorithm 1: DLSelect+Ridge
   Input: dataset (Y, X)
   Output: 𝑆^ := the set of selected variables, 𝛽^ := the estimated coefficient vector
   Steps:
   1. Perform the Lasso on the data (Y, X); denote the Lasso estimator by 𝛽^𝐿𝑎𝑠𝑠𝑜 .
   2. Compute the dual optimum 𝜃^ = Y − X𝛽^𝐿𝑎𝑠𝑠𝑜 and denote the dual Lasso
      active set by 𝑆^𝑑𝑢𝑎𝑙 .
   3. Form the reduced design matrix X𝑟𝑒𝑑 = {𝑋𝑗 : 𝑗 ∈ 𝑆^𝑑𝑢𝑎𝑙 }.
   4. Perform Ridge regression on the data (Y, X𝑟𝑒𝑑 ) and obtain the Ridge
      estimates 𝛽^𝑗 for 𝑗 ∈ 𝑆^𝑑𝑢𝑎𝑙 ; set the remaining coefficients to zero.
   return (𝑆^, 𝛽^)
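A minimal Python sketch of Algorithm 1 is given below; it assumes scikit-learn (LassoCV and RidgeCV, i.e. two one-dimensional cross-validations) and data standardized as in Section 2. The tolerance used to detect active dual constraints and the Ridge grid are implementation choices.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

def dlselect_ridge(X, y, tol=1e-6):
    """Sketch of Algorithm 1 (DLSelect+Ridge): dual Lasso selection followed
    by Ridge estimation on the selected columns."""
    n, p = X.shape
    # Step 1: Lasso with lambda chosen by (one-dimensional) cross-validation.
    lasso = LassoCV(fit_intercept=False, cv=5).fit(X, y)
    lam = n * lasso.alpha_                       # scikit-learn's alpha = lambda / n
    # Step 2: dual optimum theta = y - X beta_hat and its active set S_dual.
    theta = y - X @ lasso.coef_
    S_dual = np.where(np.abs(X.T @ theta) >= lam - tol)[0]
    if S_dual.size == 0:
        return S_dual, np.zeros(p)               # nothing selected
    # Step 3: reduced design matrix restricted to the dual active set.
    X_red = X[:, S_dual]
    # Step 4: Ridge regression on (Y, X_red), the second one-dimensional CV.
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 25), fit_intercept=False).fit(X_red, y)
    beta_hat = np.zeros(p)
    beta_hat[S_dual] = ridge.coef_
    return S_dual, beta_hat
```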



In the following section, we empirically compare the performance of
DLSelect+Ridge with other popular methods.

3.2   Empirical Results with the Riboflavin Dataset
The riboflavin dataset consists of 𝑛 = 71 observations of 𝑝 = 4088 predictors
(gene expressions) and a univariate response, the (log-transformed) riboflavin
production rate; see [3] for details on the riboflavin dataset. Since the ground
truth is not available, we use the riboflavin data only for the design matrix X,
with synthetic parameters 𝛽 and simulated Gaussian errors 𝜖 ∼ N𝑛 (0, 𝜎 2 𝐼). We fix
the size of the active set to 𝑠 = 20 and 𝜎 = 1, and for the true active set 𝑆 we
select the ten predictors that are most highly correlated with the response together
with another ten variables that are most correlated with those selected variables.
The true coefficient vector is 𝛽𝑗 = 1 if 𝑗 ∈ 𝑆 and 𝛽𝑗 = 0 otherwise. We then
compute the response using equation (1). As performance measures we use the
Mean Squared Error (MSE) and the True Positive Rate (TPR), defined as
\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad \text{and} \qquad \mathrm{TPR} = \frac{|\hat{S} \cap S|}{|S|},
\]
where the 𝑦^𝑖 are the estimated responses and 𝑆^ is the estimated active set.
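A sketch of this simulation and scoring step is given below; the riboflavin design matrix X and the chosen active set S are assumed to be available (the dataset ships, for example, with the R package hdi), and select_fn stands for any of the compared procedures, such as the dlselect_ridge sketch above.

```python
import numpy as np

def simulate_and_score(X, S, select_fn, sigma=1.0, seed=0):
    """Draw Y = X beta + eps with beta_j = 1 on the chosen active set S,
    run a selection/estimation procedure and return its MSE and TPR."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    beta[S] = 1.0
    y = X @ beta + sigma * rng.standard_normal(n)
    S_hat, beta_hat = select_fn(X, y)            # e.g. dlselect_ridge(X, y)
    mse = np.mean((y - X @ beta_hat) ** 2)       # in-sample prediction error
    tpr = len(set(S_hat) & set(S)) / len(S)      # |S_hat ∩ S| / |S|
    return mse, tpr
```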
The performance measures (the median MSE with its standard deviation and the
median TPR over the simulation runs) are reported in Table 1. From Table 1 we
conclude that DLSelect+Ridge performs better than the other methods in terms of
prediction, and that it is as good as the Elastic-Net in terms of variable selection.


               Table 1: Performance measures for the riboflavin data
                        Method           MSE (SE)        TPR
                        Lasso            ���.��(��.��)   �.��
                        Ridge            ���.��(��.��)   NA
                        Enet             ���.��(��.��)   �.��
                        DLSelect+Ridge   ��.��(��)       �.��








4    Concluding Remarks
The main achievements of this work are summarized as follows. We argued that
correlation among active predictors is not problematic as long as the PIC is
satisfied by the design matrix. In particular, we showed that the dual Lasso
performs consistent variable selection under the assumption of the PIC. Exploiting
this result, we proposed the DLSelect+Ridge method and compared it with popular
existing methods on a real dataset. The numerical study shows that the proposed
method is very competitive in terms of variable selection and prediction accuracy.


References
 1. Belloni, A., Chernozhukov, V.: Least squares after model selection in high-
    dimensional sparse models. Bernoulli 19(2), 521–547 (2013)
 2. Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data: Methods,
    Theory and Applications. Springer (2011)
 3. Bühlmann, P., Kalisch, M., Meier, L.: High-dimensional statistics with a view
    towards applications in biology. Annual Review of Statistics and Its Application 1,
    255–278 (2014)
 4. Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.H.: Correlated variables in
    regression: clustering and sparse estimation. Journal of Statistical Planning and
    Inference 143(11), 1835–1858 (2013)
 5. Gauraha, N.: Stability feature selection using cluster representative lasso. In: Pro-
    ceedings of the 5th International Conference on Pattern Recognition Applications
    and Methods (ICPRAM) (2016)
 6. van de Geer, S., Lederer, J.: The Lasso, correlated design, and improved oracle
    inequalities. In: From Probability to Statistics and Back: High-Dimensional Models
    and Processes, IMS Collections, vol. 9, pp. 303–316. Institute of Mathematical
    Statistics, Beachwood, Ohio, USA (2013)
 7. Hebiri, M., Lederer, J.: How correlations influence Lasso prediction. IEEE Trans.
    Inf. Theory 59(3), 1846–1854 (2013)
 8. Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal
    problems. Technometrics 12(1), 55–67 (1970)
 9. Liu, H., Yu, B.: Asymptotic properties of Lasso+mLS and Lasso+Ridge in sparse
    high-dimensional linear regression. Electronic Journal of Statistics 7, 3124–3169
    (2013)
10. Omidiran, D., Wainwright, M.J.: High-dimensional variable selection with sparse
    random projections: Measurement sparsity and statistical efficiency. Journal of
    Machine Learning Research 11, 2361–2386 (2010)
11. Osborne, M.R., Presnell, B., Turlach, B.A.: A new approach to variable selection in
    least squares problems. IMA Journal of Numerical Analysis 20(3), 389–403 (2000)
12. Tibshirani, R.J., Taylor, J.: The solution path of the generalized lasso. Annals of
    Statistics 39(3), 1335–1371 (2011)
13. Wang, J., Zhou, J., Wonka, P., Ye, J.: Lasso screening rules via dual polytope
    projection. Journal of Machine Learning Research 16, 1063–1101 (2015)
14. Zhao, P., Yu, B.: On model selection consistency of Lasso. Journal of Machine
    Learning Research 7, 2541–2563 (2006)
15. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. Journal
    of the Royal Statistical Society: Series B 67(2), 301–320 (2005)


