 Proceedings of the KI 2017 Workshop on Formal and Cognitive Reasoning




    Pre-Selection in Cluster Lasso Methods for
         Correlated Variable Selection in
         High-Dimensional Linear Models

                       Niharika Gauraha and Swapan Parui

                              Indian Statistical Institute



      Abstract. We consider variable selection problems in high dimensional
      sparse regression models with strongly correlated variables. To handle
      correlated variables, the concept of clustering or grouping variables and
      then pursuing model fitting is widely accepted. When the dimension is very
      high, finding an appropriate group structure is as difficult as the original
      problem. We propose to use Elastic-net as a pre-selection step for Cluster
      Lasso methods (i.e. Cluster Group Lasso and Cluster Representative
      Lasso). The Elastic-net selects correlated relevant variables, but it fails to
      reveal the correlation structure among the active variables. We use the cluster
      Lasso methods to address this shortcoming of the Elastic-net, while the
      Elastic-net provides a reduced feature set for the cluster Lasso methods. We
      theoretically explore the group selection consistency of the proposed
      combination of algorithms under various conditions, i.e. the Irrepresentable
      Condition (IC), the Elastic-net Irrepresentable Condition (EIC) and the Group
      Irrepresentable Condition (GIC). We support the theory using simulated
      and real dataset examples.


Keywords: Correlated Variable Selection, Group Lasso, Cluster Lasso Methods,
High-dimensional Linear models


1   Introduction

We consider the usual linear regression model

                                    Y = Xβ^0 + ε,                                      (1)

with response vector Y_{n×1}, design matrix X_{n×p}, true underlying coefficient
vector β^0_{p×1} and error vector ε_{n×1}. When the number of predictors (p) is much
larger than the number of observations (n), p ≫ n, the Lasso [11] and its variants
are widely used for sparse estimation and variable selection. However, variable
selection in situations involving high empirical correlation remains one of the
most important issues. This problem is encountered in many applications such
as microarray data analysis and genome analysis, see [10].
    It has been proven that the design matrix must satisfy the following conditions
for the Lasso to perform exact variable selection: the irrepresentable condition (IC)
[16] and the beta-min condition [1]. Having highly correlated variables implies that
the design matrix violates the IC. To deal with correlated variables, mainly two
approaches have been proposed in the literature: simultaneous clustering and model
fitting (see [7]), and clustering followed by sparse estimation (see [2] and [5]).
The former approach imposes restrictive conditions on the design matrix, while the
time complexity of clustering the variables severely limits the dimension of the
data sets that can be processed by the latter approach. Moreover, group selection
in models with a larger number of groups is more difficult (see [14]). To overcome
the limitations of the latter approach, we propose to use the Elastic-net [17] as a
pre-selection procedure for the Cluster Lasso [2] methods. Basically, we first
reduce the noise (dimension) using the Elastic-net, which is known to select
correlated variables, before applying the Cluster Lasso (CL) methods. This
scheme allows for a significant decrease in the cost of clustering, especially in
high dimensional problems.
    Basically, we propose to combine the Elastic-net and the Cluster Lasso methods
to improve both the speed of computation and the accuracy of the results in the
sparse setting. This goal is achieved by using the Elastic-net to first reduce the
number of variables under consideration and then using the CL methods on the
reduced data to select correlated variables. We theoretically explore how the
proposed combination of algorithms performs under various conditions, i.e. the
Irrepresentable Condition (IC), the Elastic-net Irrepresentable Condition (EIC)
and the Group Irrepresentable Condition (GIC). This theoretical analysis is
validated by experiments on simulated datasets by comparison with different
methods: the Lasso, the Elastic-net and the Cluster Group Lasso. Moreover, we
show that the algorithm improves on the mentioned methods on a real-world dataset.
    The rest of this paper is organized as follows. In Section 2, we provide
notations, assumptions and background to be used later. In Section 3, we review
the mathematical theory of the Lasso, the Elastic-net and the group Lasso. In
Section 4, we describe the proposed algorithm, which mostly selects more adequate
models in terms of model interpretation and prediction performance. In Section 5,
we provide numerical results based on simulated and real datasets. Section 6
contains the computational details and Section 7 concludes.


2     Basic Assumptions, Notations and Concepts

In this section, we state notations and assumptions and we define the required
concepts.


2.1   Notations and Assumptions

We consider the usual linear regression set-up as given in Equation (1). We
assume that the components of the noise vector ε ∈ R^n are i.i.d. N(0, σ²). The
columns of the design matrix X are denoted by X^j. We assume that the design
matrix X is fixed, the data are centred and the predictors are standardized, so
that ∑_{i=1}^n Y_i = 0, ∑_{i=1}^n X_i^j = 0 and (1/n)(X^j)'X^j = 1 for all
j = 1, ..., p. The ℓ1-norm is
denoted by ‖.‖1 , the ℓ2 -norm is denoted by ‖.‖2 and the ℓ∞ norm is denoted by
‖.‖∞. The minimum eigenvalue of a matrix A is denoted by Λ_min(A). The true
active set S_0 is the support of the subset selection solution (S_0 = supp(β^0)),
defined as S_0 = {j : β_j^0 ≠ 0}. For any given S ⊂ {1, 2, ..., p}, β_S is the p × 1
vector which has zeroes outside the set S, as given by

                                𝛽𝑆 = {𝛽𝑗 𝐼(𝑗 ∈ 𝑆)},

where I is the indicator function. Then we have

                                  𝛽 = 𝛽 𝑆 + 𝛽𝑆 𝑐 .

The (scaled) Gram matrix (covariance matrix) is defined as

                                    Σ̂ = X'X / n.

The covariance matrix can be partitioned for the subset S as

                      Σ = [ Σ11 = Σ(S)      Σ12(S)       ]
                          [ Σ21(S)          Σ22 = Σ(S^c) ]                       (2)

Similarly, we partition the parameter vector for the subset S as
                                 β_{p×1} = [ β1 ]
                                           [ β2 ]                                (3)

The sign function is defined as
                                      ⎧ −1  if x < 0
                            sign(x) = ⎨  0  if x = 0                             (4)
                                      ⎩  1  if x > 0


2.2   Clustering of Variables

We use correlation-based, bottom-up agglomerative hierarchical clustering to
cluster the predictors, which forms groups of variables based on the correlations
between them. To determine the number of clusters we use a bootstrap approach;
see the stability feature in [5].
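
As an illustration, a minimal R sketch of such correlation-based agglomerative
clustering using base hclust and cutree (the 1 − |correlation| dissimilarity and the
function name are illustrative assumptions; the number of clusters q is assumed to
be chosen separately, e.g. by the bootstrap/stability criterion of [5]):

    # Minimal sketch: correlation-based, bottom-up clustering of the predictors.
    # X: n x p numeric design matrix; q: desired number of clusters.
    cluster_variables <- function(X, q) {
      d <- as.dist(1 - abs(cor(X)))          # dissimilarity between variables
      tree <- hclust(d, method = "average")  # agglomerative hierarchical clustering
      cutree(tree, k = q)                    # vector of group labels 1..q
    }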


2.3   The Regularized Regression Methods

In this section, we briefly review the Lasso, the Group Lasso and the Elastic-net
regularized regression methods.
    The Least Absolute Shrinkage and Selection Operator (Lasso) was introduced
by Tibshirani [11]. It is a penalized least squares method that imposes an ℓ1-
penalty on the regression coefficients and thus performs shrinkage and automatic
variable selection simultaneously. We denote by β̂_LASSO the Lasso estimate of the
parameter vector, defined as:
              β̂_LASSO ∈ argmin_{β ∈ R^p} { (1/n)‖Y − Xβ‖²_2 + λ‖β‖_1 },           (5)
where 𝜆 is the regularization parameter. The Lasso estimated active set is denoted
as 𝑆^𝐿𝐴𝑆𝑆𝑂 and defined as 𝑆^𝐿𝐴𝑆𝑆𝑂 = {𝑗; (𝛽^𝐿𝐴𝑆𝑆𝑂 )𝑗 ̸= 0}.
    The major disadvantages of the Lasso are that it tends to select only a single
variable from a group of correlated variables, and that it can select at most n
variables when p ≫ n.
To overcome the above limitations of the Lasso, the Elastic-net was proposed
in [17]. The naive Elastic-net estimate of the parameter vector is denoted by β̂_EN
and defined as
              β̂_EN = argmin_{β ∈ R^p} { (1/n)‖Y − Xβ‖²_2 + λ‖β‖_1 + α‖β‖²_2 },    (6)
where α ∈ [0, 1] and λ ≥ 0 are the regularization parameters. Since the Elastic-net
estimate is a rescaling of the naive estimate, (1 + λ2)β̂_EN in the notation of [17],
it selects the same variables as β̂_EN, and we therefore consider β̂_EN as the
Elastic-net estimate. The Elastic-net estimated active set is denoted by Ŝ_EN and
defined as Ŝ_EN = {j : (β̂_EN)_j ≠ 0}.
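
A minimal R sketch of the Lasso and the Elastic-net fits using the glmnet package
(X and Y as in Equation (1); the mixing value 0.5 for the Elastic-net and the use of
cross-validation to choose λ are illustrative assumptions, not prescribed by the
theory above):

    library(glmnet)
    # Lasso: pure l1 penalty (alpha = 1 in glmnet's parametrization)
    cv_lasso <- cv.glmnet(X, Y, alpha = 1)
    S_lasso  <- which(coef(cv_lasso, s = "lambda.min")[-1] != 0)  # estimated active set
    # Elastic-net: mixture of l1 and l2 penalties (alpha = 0.5 chosen arbitrarily here)
    cv_enet <- cv.glmnet(X, Y, alpha = 0.5)
    S_enet  <- which(coef(cv_enet, s = "lambda.min")[-1] != 0)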
    When the distinct groups or clusters among the variables are known a priori
and it is desirable to select or drop whole groups instead of single variables,
the Group Lasso (see [15]) or its variants (e.g. the Group Square-Root Lasso [3]
and the Adaptive Group Lasso [14]) are used, which impose an ℓ2-penalty on the
coefficients within each group to achieve group sparsity.
    Here we define some more notations and state assumptions for the group
Lasso. We may use β^0 and β interchangeably for the true regression coefficient
vector, the latter without the superscript. Let us assume that the parameter
vector β is structured into groups G = {G_1, ..., G_q}, where q < n denotes the
number of groups. G forms a partition of the index set {1, ..., p}, with
∪_{r=1}^q G_r = {1, ..., p} and G_r ∩ G_l = ∅ for r ≠ l. The parameter vector β
then has the structure β = {β_{G_1}, ..., β_{G_q}}, where β_{G_j} = {β_r : r ∈ G_j}.
The columns of group G_j are collected in X_{G_j}; the response vector Y can then
also be written as

                           Y = ∑_{j=1}^q X_{G_j} β_{G_j} + ε.
The loss function of the group Lasso is the same as that of the Lasso,
(1/n)‖Y − Xβ‖²_2. The group Lasso penalty is defined as

                    ‖β‖_{2,1} = ∑_{j=1}^q ‖X_{G_j} β_{G_j}‖_2 √(m_j / n),

where m_j = |G_j| is the group size. Since the penalty is invariant under
within-group reparametrizations, without loss of generality we can assume
Σ_rr = I. Hence the group Lasso penalty can be written as
                          ‖β‖_{2,1} = ∑_{j=1}^q √(m_j) ‖β_{G_j}‖_2.






The Group Lasso estimator (with known q groups) is defined as
                 β̂_grp ∈ argmin_{β} { (1/n)‖Y − Xβ‖²_2 + λ‖β‖_{2,1} }             (7)

Let W ⊂ {1, ..., q} denote the active group set, with cardinality w = |W|. We
assume: (i) the size of each group is smaller than the number of observations,
m_max < n, and (ii) the number of active groups w is smaller than the number of
observations (sparsity assumption). We denote the set of clusters selected by the
group Lasso by Ŝ_clust, which is defined as

                Ŝ_clust = {r : cluster G_r is selected, r = 1, ..., q}.           (8)

The union of the selected clusters gives the selected set of variables.

                               Ŝ_grp = ∪_{r ∈ Ŝ_clust} G_r                        (9)
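
A minimal R sketch of the group Lasso fit using the gglasso package (gglasso
expects the columns of X ordered by group and the group labels as consecutive
integers; choosing λ by cross-validation is an illustrative assumption):

    library(gglasso)
    # grp: integer vector of length p with consecutive group labels 1..q,
    # columns of X assumed ordered so that each group is contiguous.
    cv_grp   <- cv.gglasso(x = X, y = Y, group = grp)
    beta_grp <- coef(cv_grp, s = "lambda.min")[-1]     # drop the intercept
    S_clust  <- unique(grp[beta_grp != 0])             # clusters with a non-zero coefficient
    S_grp    <- which(grp %in% S_clust)                # union of the selected clusters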

2.4   The Cluster Lasso Methods
When the group structure is not known, the cluster Lasso methods are preferred;
they perform clustering followed by sparse estimation. The clusters G_1, ..., G_q are
generated from the design matrix X (e.g. using a correlation-based method). When
the group Lasso is applied to the resulting clusters, the method is called the
Cluster Group Lasso, and when the Lasso is applied to the cluster representatives,
it is called the Cluster Representative Lasso; see [2].


3     Mathematical Theory of the Lasso and its Variants
In this section, we briefly review the results required for proving consistent
variable selection (and group variable selection) in high dimensional linear models.
For more details on the mathematical theory of the Lasso and the group Lasso, we
refer to [1], [16], [15], [6] and [8], and for the Elastic-net we refer to [9].

    For the following definitions, let us assume a fixed set S ⊂ {1, ..., p} with
cardinality s = |S|, and partitions of the covariance matrix and parameter vector
as given by Equations (2) and (3) respectively.
Definition 1 (Irrepresentable Condition (IC)). The strong irrepresentable
condition is said to be met for the set S with a constant η > 0, if the following
holds:

                        ‖Σ21(S) Σ11^{-1}(S) sign(β1)‖∞ ≤ 1 − η.                   (10)

    Sufficient conditions (eigenvalue and mutual incoherence conditions) on the
design matrix for the IC to hold are discussed in [16] and [1].
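
As an illustration, a minimal R sketch that checks the IC numerically for a given
(scaled) Gram matrix (the function name, the margin η = 0.05 and the use of a plain
matrix inverse are illustrative assumptions):

    # Sigma: p x p scaled Gram matrix; S: indices of the active set;
    # sign_beta1: sign vector of the active coefficients; eta: required margin.
    check_ic <- function(Sigma, S, sign_beta1, eta = 0.05) {
      Sigma11 <- Sigma[S, S, drop = FALSE]
      Sigma21 <- Sigma[-S, S, drop = FALSE]
      lhs <- max(abs(Sigma21 %*% solve(Sigma11) %*% sign_beta1))
      lhs <= 1 - eta          # TRUE if condition (10) holds with margin eta
    }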
Definition 2 (Beta-min Condition). The beta-min condition is met for the
regression coefficient β, if min |β_S| ≥ 4λs / φ²(S),
where φ(S) is the compatibility constant; see [1].
Lemma 1. Under the following assumptions, the Lasso selects the set S with
high probability:
 – Irrepresentable condition holds for 𝑆.
 – Beta-min condition holds for the parameter vector 𝛽.

Definition 3 (The Elastic-net Irrepresentable Condition (EIC)). The
EIC is met for the set S, with a constant η > 0, if the following holds:

              ‖Σ21 (Σ11 + (α/n) I)^{-1} (sign(β1^0) + (2α/λ) β1^0)‖∞ ≤ 1 − η.
Proposition 1. For a given set S, if the IC holds, then for any λ > 0 there
exists α such that the EIC holds; the EIC, however, does not imply the IC.

When the IC holds, it is easy to see that the EIC also holds, since α = 0 reduces
the EIC to the IC. Conversely, the EIC may hold even when Σ11 is not invertible,
which shows that the EIC does not imply the IC.
    Now we discuss necessary conditions for group variable selection. Our error
analysis for the group Lasso is based on the pure active group and pure noise
group assumptions, that is, all variables within an active group are active and no
variables within a noise group are active; we also assume that the clustering
process identifies the group structure correctly.
    It is convenient to assume the following. Let W ⊂ {1, ..., q} be a group index
set, say W = {1, ..., w}. Consider the full index set corresponding to W as

              S = {(1, 1), ..., (1, m_1), ..., (w, 1), ..., (w, m_w)} = {1, ..., s},

where s = ∑_{j=1}^w m_j. We partition the covariance matrix Σ11(S) group-wise and
denote its inverse by R_S (here we assume that each Σ_{r,r} is non-singular, or one
may use the pseudo-inverse):

        ⎡          I               Σ11^{-1/2} Σ12 Σ22^{-1/2}  ...  Σ11^{-1/2} Σ1w Σww^{-1/2} ⎤
        ⎢ Σ22^{-1/2} Σ21 Σ11^{-1/2}           I               ...  Σ22^{-1/2} Σ2w Σww^{-1/2} ⎥
  R_S = ⎢          ⋮                          ⋮               ⋱              ⋮               ⎥
        ⎣ Σww^{-1/2} Σw1 Σ11^{-1/2}  Σww^{-1/2} Σw2 Σ22^{-1/2} ...              I            ⎦

We note that the diagonal blocks are m_r × m_r identity matrices, due to the
within-group parametrization invariance property.

Definition 4 (Group Irrepresentability Condition (GIC)). The GIC is
met for the set W with a constant η > 0, if the following holds:

                     ‖(Σ21 R_S sign(β))_{G_r}‖ ≤ 1 − η      ∀ r ∉ W,              (11)

where the inequality holds group-wise.





We note that the GIC definition reduces to the IC for singleton groups.

Definition 5 (Group Beta-min Condition). The group beta-min condition is
met for β if ‖β_{G_r}‖∞ ≥ D λ √(m_r) / n  for all r ∈ W, where D > 0 is a constant
which depends on σ, φ_grp and other constants used in the cone constraints and
the GIC.

For its exact form, we refer to [1]. We note that only one component of β_{G_r},
for each r ∈ W, has to be sufficiently large, because we aim to select groups as a
whole, and not individual variables.

Theorem 1. Under the following assumptions, the group Lasso selects the set of
active groups W with high probability.

 – The GIC holds for 𝑊 .
 – Group beta-min condition holds for 𝛽 𝐺𝑟 , ∀𝑟 ∈ 𝑊 .

    Now we show that the IC implies the GIC, and the converse is not true.

Lemma 2. If the IC holds for the set S, then the GIC holds for the set W, but
the converse is not true.

Here we give a sketch of the proof. If the IC holds for the set S, then the GIC
holds for the singleton groups, and therefore the GIC holds for any group structure
within S. On the other hand, Σ11 is not invertible when the active variables are
perfectly correlated, so the IC fails, but Σ11 may become invertible after the
within-group transformation and the GIC may then hold. Therefore the IC is more
restrictive than the GIC, and hence the converse of the lemma is not true.


4    Pre-Selection for the Cluster Lasso Methods

We consider high dimensional settings where groups of variables are highly
correlated. It is known that the Lasso tends to select only one or a few variables
from a group of highly correlated variables, even though many or all of them belong
to the active set. The CL methods have proven to be effective in selecting grouped
variables and reducing false negatives. The two major drawbacks of the CL methods
are that group selection does not work well when there is a large number of groups,
and that the time complexity of clustering is prohibitive when p is large. We
address these problems by using the Elastic-net as a pre-selection step for the CL
methods, as described in Algorithm 1. The variable selection consistency of the
Elastic-net and of the CL methods has already been proven under the EIC and the
GIC respectively; see [9] and [2]. In the following, we discuss various situations
where Elastic-net+CL consistently selects the active set with high probability, and
situations where our scheme may fail to select the true variables.







 Algorithm 1: The Elastic-net+CL Algorithm
   Input: dataset (Y, X)
   Output: Ŝ := set of selected variables
   Steps:
   1. Perform the Elastic-net on the data (Y, X), and denote by Ŝ_EN the set of
      selected variables. Let X_red = X^{Ŝ_EN} be the reduced design matrix.
   2. Perform clustering of variables on X_red, and denote the clusters by
      G_1, ..., G_q.
   3. Perform the group Lasso on (Y, X_red) with group information G_1, ..., G_q,
      and denote the selected set of groups by
      Ŝ_cluster = {r : cluster G_r is selected, r = 1, ..., q}.
   The union of the selected groups is then the selected set of variables,
      Ŝ = ∪_{r ∈ Ŝ_cluster} G_r.
   return Ŝ
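
A minimal end-to-end R sketch of Algorithm 1, under the assumptions already noted
for the individual steps (glmnet for the Elastic-net, base hclust/cutree for the
clustering, gglasso for the group Lasso; the mixing parameter, the number of
clusters q and the use of cross-validation are illustrative choices, not part of
the algorithm's specification):

    library(glmnet)
    library(gglasso)

    enet_cl <- function(X, Y, q, alpha = 0.5) {
      # Step 1: Elastic-net pre-selection
      cv_en <- cv.glmnet(X, Y, alpha = alpha)
      S_en  <- which(coef(cv_en, s = "lambda.min")[-1] != 0)
      X_red <- X[, S_en, drop = FALSE]
      # Step 2: correlation-based clustering of the retained variables
      grp <- cutree(hclust(as.dist(1 - abs(cor(X_red))), method = "average"), k = q)
      # Step 3: group Lasso on the reduced data; gglasso needs columns ordered by group
      ord     <- order(grp)
      cv_grp  <- cv.gglasso(x = X_red[, ord], y = Y, group = grp[ord])
      b_grp   <- coef(cv_grp, s = "lambda.min")[-1]
      S_clust <- unique(grp[ord][b_grp != 0])
      # Union of the selected clusters, mapped back to the original column indices
      S_en[which(grp %in% S_clust)]
    }

The reduced dimension after Step 1 is what makes the clustering in Step 2 cheap;
the sketch assumes q does not exceed the number of retained variables.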



Case 1 The IC holds:
   Suppose that for a given set S, the design matrix X satisfies the IC, which
   implies that the EIC and the GIC are also satisfied for the set S; see
   Proposition 1 and Lemma 2. It follows that with high probability (w.h.p.)
   the Elastic-net will select the true variables and, similarly, any group
   structure within S will yield group selection consistency for the group
   Lasso, w.h.p. Therefore, in this situation the Elastic-net+CL shows variable
   selection consistency.
Case 2 The EIC and the GIC both hold:
   Apart from the previous case, this situation may also arise when two or more
   variables within the set S are highly correlated: then the IC does not hold for
   the set S (since Σ11 is not invertible), but the EIC and the GIC can still be
   satisfied. Using the same argument as for the preceding case, it is easy to show
   that the combination has variable selection consistency.
Case 3 The EIC holds but the GIC does not hold:
   This situation may arise for overlapping groups: when the active groups are
   highly correlated, the design matrix does not satisfy the GIC, but the EIC
   may hold for the set S. In such cases, though the Elastic-net may select the
   true active set w.h.p., the subsequent group Lasso may fail to select the
   active set.
Case 4 The GIC holds but the EIC does not hold:
   This situation can arise when there are near-linear dependences among the
   active variables: the EIC may not hold, and the Elastic-net tends to select
   a single variable or a few variables from the group of linearly dependent
   variables. If we correctly put the linearly dependent variables into
   appropriate clusters (one may use canonical correlation based clustering,
   see [2]), then the GIC may hold. Since the pre-selection step itself fails to
   select the true active set, our scheme of pre-selection for the CL methods may
   not work in this case.
   We illustrate the above four cases with simulation studies in the next section.
We do not consider the case where neither the EIC nor the GIC holds, because
neither the pre-selection nor the CL methods are then guaranteed to select the
true variables.


5     Numerical Results

In this section, we consider four simulation settings and a real data example in
order to empirically compare the performance of the proposed scheme with other
existing methods: the Lasso [11], the Elastic-net [17] and CGLCor [2].


5.1   Simulation Studies

Four examples are considered in this simulation, corresponding to the cases
discussed in the preceding section. In each example, data are simulated from
the linear model in Equation (1) with fixed design X. For each example, the
simulated dataset consisted of a training set and an independent validation set,
and �� such datasets were simulated per example. The models were fitted on the
training data for each of the �� datasets, and the model with the lowest test set
Mean Squared Error (MSE) was selected as the final model. For model interpretation,
we consider the pair (TN, FN), where TN denotes true negatives, that is, the
number of zero coefficients that are correctly estimated as zero, and FN denotes
false negatives, that is, the number of non-zero coefficients that are incorrectly
estimated as zero. As our aim is to avoid false negatives, we mainly focus on FN
when analysing the different estimators. The MSE and (TN, FN) are defined as
follows:

                             MSE = (1/n) ∑_{i=1}^n (y_i − ŷ_i)²                   (12)

                             TN = |Ŝ^c ∩ S^c|                                     (13)

                             FN = |Ŝ^c ∩ S|                                       (14)
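
A minimal R sketch of these three quantities (the argument names are illustrative):

    mse <- function(y, y_hat) mean((y - y_hat)^2)                         # Equation (12)
    tn  <- function(S_hat, S, p) length(setdiff(setdiff(1:p, S_hat), S))  # |S_hat^c ∩ S^c|
    fn  <- function(S_hat, S) length(setdiff(S, S_hat))                   # |S_hat^c ∩ S|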

    We generate the covariates from N_p(0, Σ_i) with p = ���� and n = ���, where Σ_i
for the four different cases is defined below. The noise is ε ∼ N_n(0, σ²I) with
σ = 3. For the regression coefficients we set β^0 = (3, ..., 3, 0, ..., 0)', where
the first twenty variables are the active variables, S = {1, ..., 20}, and the
remaining variables are noise features.
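
A minimal R sketch of this data-generating process (the values of n and p below are
placeholders and not the paper's settings; MASS::mvrnorm is one possible way to
draw the correlated designs):

    library(MASS)                         # for mvrnorm
    n <- 100; p <- 1000                   # placeholder sizes, assumptions for illustration
    s <- 20; sigma <- 3
    beta0 <- c(rep(3, s), rep(0, p - s))  # first twenty coefficients equal to 3, rest zero
    Sigma <- diag(p)                      # orthogonal case; replace by Sigma_i per scenario
    X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
    Y <- X %*% beta0 + rnorm(n, sd = sigma)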


The Orthogonal Model: In this case, the variables are uncorrelated, which
implies that Σ1 is a 40 × 40 identity matrix; hence the IC, EIC and GIC hold for
the set S.
    The simulation results (the minimum MSE with its standard deviation over the
�� runs, and the minimum TN and FN) are presented in Table 1, from which it is
easy to see that all estimators report no false negatives.





                        Table 1. The Orthogonal Model

                      Model         MSE           TN FN
                      Lasso       ��.��� (��.�) ���       �
                      Enet        ��.��� (��.�) ���       �
                      CGLCor      ��.��� (��.�) ���       �
                      Enet+CGLCor ��.��� (��.�) ���       �




Block Diagonal Model: We consider a block diagonal model, where we simulate
X ∼ N_p(0, Σ2) and Σ2 consists of ��� blocks of size 10 × 10, defined as

                      block_{j,k} = 1 if j = k, and 0.9 otherwise.
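
A minimal R sketch of this covariance structure (the number of blocks is written
generically as p/10, since the exact dimensions above are placeholders):

    block  <- matrix(0.9, 10, 10); diag(block) <- 1  # 10x10 block: unit diagonal, 0.9 elsewhere
    Sigma2 <- kronecker(diag(p / 10), block)         # block diagonal; assumes p divisible by 10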
    Table 2 shows the simulation results for the block diagonal model. As
expected, the Lasso fails to select the correlated active components. However,
the Elastic-net, CGLCor and Elastic-net+CGLCor report no false negatives.


                        Table 2. Block Diagonal Model

                       Model          MSE        TN FN
                       Lasso       �.�� (�.�) ��� ��
                       Enet        �.�� (�.�) ��� �
                       CGLCor      ��.�� (�.�) ��� �
                       Enet+CGLCor �.�� (�.�) ��� �




Linear Dependency Model: This situation arises when a covariate is nearly
linearly dependent on two or more other covariates. Here we simulate X ∼ N_p(0, I)
and then introduce the linear dependencies as follows:


                       X_10   = X_1 + ... + X_9 + N(0, .01)
                          ⋮
                       X_1000 = X_991 + ... + X_999 + N(0, .01)

Here we use the canonical correlation based clustering technique for the CL
methods; see the CGL method in [2]. Table 3 shows the simulation results for the
linear dependency model, from which it is clear that the Lasso, the Elastic-net
and the Elastic-net+CGL could not avoid false negatives.





                         Table 3. Linear Dependency Model

                            Model       MSE        TN FN
                            Lasso    �.�� (�.�) ��� ��
                            Enet     �.�� (�.�) ��� ��
                            CGL      �.�� (�.�) ��� �
                            Enet+CGL �.�� (�.�) ��� ��




Correlation within active groups: We consider a similar set-up to the block
diagonal model, except that we make the first two groups (the active groups)
correlated with each other. In this case, the smallest eigenvalue of Σ11 is very
small and the GIC does not hold, but the EIC is satisfied since the correlation is
high within the active groups only. Thus, although the Elastic-net is consistent,
the Elastic-net+CGLCor may perform poorly.


                       Table 4. Results for Correlated Groups

                          Model           MSE        TN FN
                          Lasso       �.�� (�.�) ��� ��
                          Enet        �.�� (�.�) ��� �
                          CGLCor      ��.�� (�.�) ��� �
                          Enet+CGLCor �.�� (�.�) ��� �




    The simulation results are presented in Table 4, from which it is clear that
only the Elastic-net correctly identifies the true active set.



5.2   Real Data Example


In this section, we consider the gene selection problem on the leukaemia data
(see [13]) to compare the Elastic-net+CL scheme with the other methods. The
leukaemia data consist of ���� genes and �� samples, among which �� are of one
leukaemia type and �� of the other. We consider part of the leukaemia dataset: we
first select the �� most significant genes and then the �� least significant genes,
according to their t-statistic scores (see [12]), so that the true active set is
S = {1, ..., 50}. We used tenfold cross-validation for choosing the regularization
parameters.
    The results (TN and FN, along with the cross-validation error) are presented
in Table 5, from which it is clear that the Elastic-net+CGLCor outperforms the
other methods in terms of model interpretation and variable selection.





                    Table 5. Leukemia Gene Selection Results

                         Model          CVError TN FN
                         Lasso       �.���        �� ��
                         Enet        �.���        �� �
                         CGLCor      �.���        �� �
                         Enet+CGLCor �.���        �� �




6    Computational Details
Statistical analysis was performed in R �.�.�. We used the package glmnet for the
penalized regression methods (the Lasso and the Elastic-net), the package gglasso
to perform the group Lasso, and the package ClustOfVar for the clustering of
variables.


7   Conclusion
In this article, we proposed pre-selection using the Elastic-net for the CL
methods for variable selection in high-dimensional linear models with strongly
correlated variables. We proved that the variable set selected by the
Elastic-net+CL method contains the true active set consistently under the IC, or
under the EIC and GIC, together with the beta-min conditions. We discussed how
reducing the dimensionality improves the speed and accuracy of the clustering
process, and we explored theoretically how the proposed combination of algorithms
performs under various conditions. Our simulation studies showed that combining
the Elastic-net and the Cluster Lasso methods improves variable selection and
predictive performance.


Acknowledgment

The authors would like to thank the anonymous reviewers of this paper for their
suggestions and constructive comments.








References
 1. Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data: Methods,
    Theory and Applications. Springer Verlag (����)
 2. Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.H.: Correlated variables in
    regression: clustering and sparse estimation. Journal of Statistical Planning and
    Inference ���, ����–���� (����)
 3. Bunea, F., Lederer, J., She, Y.: The group square-root lasso: Theoretical properties
    and fast algorithms. Information Theory, IEEE Transactions on ��, ����–����
    (����)
 4. Chavent, M., Kuentz, V., Liquet, B., Saracco, J.: ClustOfVar: an R package for the
    clustering of variables. The R User Conference, University of Warwick, Coventry, UK,
    pp. ��–�� (����)
 5. Gauraha, N.: Stability feature selection using cluster representative lasso. In:
    Proceedings of the �th International Conference on Pattern Recognition Applications
    and Methods, pp. ���–��� (����)
 6. van de Geer, S., Bühlmann, P.: On the conditions used to prove oracle results for
    the lasso. Electronic Journal of Statistics �, ����–���� (����)
 7. Bondell, H.D., Reich, B.J.: Simultaneous regression shrinkage, variable selection
    and clustering of predictors with OSCAR. Biometrics pp. ���–��� (����)
 8. Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The
    Lasso and Generalizations. CRC Press (����)
 9. Jia, J., Yu, B.: On model selection consistency of the elastic net when p ≫ n.
    Technical Report (����)
10. Segal, M., Dahlquist, K., Conklin, B.: Regression approaches for microarray data
    analysis. Journal of Computational Biology ��, ���–��� (����)
11. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Statist. Soc.
    ��, ���–��� (����)
12. Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Diagnosis of multiple cancer
    types by shrunken centroids of gene expression. PNAS pp. ����–���� (����)
13. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P.,
    Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.:
    Molecular classification of cancer: Class discovery and class prediction by gene
    expression monitoring. Science pp. ���–��� (����)
14. Wei, F., Huang, J.: Consistent group selection in high-dimensional linear regression.
    Bernoulli ��, ����–���� (����)
15. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped
    variables. J. R. Statist. Soc. ��(�), ��–�� (����)
16. Zhao, P., Yu, B.: On model selection consistency of lasso. Journal of Machine
    Learning Research �, ����–���� (����)
17. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R.
    Statist. Soc. ��, ���–��� (����)



