 Proceedings of the KI 2017 Workshop on Formal and Cognitive Reasoning




    Pre-Selection in Cluster Lasso Methods for
         Correlated Variable Selection in
         High-Dimensional Linear Models

                       Niharika Gauraha and Swapan Parui

                              Indian Statistical Institute



      Abstract. We consider variable selection problems in high dimensional
      sparse regression models with strongly correlated variables. To handle
      correlated variables, the concept of clustering or grouping variables and
      then pursuing model fitting is widely accepted. When the dimension is very
      high, finding an appropriate group structure is as difficult as the original
      problem. We propose to use Elastic-net as a pre-selection step for Cluster
      Lasso methods (i.e. Cluster Group Lasso and Cluster Representative
      Lasso). The Elastic-net selects correlated relevant variables, but it fails to
      reveal the correlation structure among the active variables. We use the cluster
      Lasso methods to address this shortcoming of the Elastic-net, while the
      Elastic-net provides a reduced feature set for the cluster Lasso methods. We
      theoretically explore the group selection consistency of the proposed
      combination of algorithms under various conditions, i.e. the Irrepresentable
      Condition (IC), the Elastic-net Irrepresentable Condition (EIC) and the Group
      Irrepresentable Condition (GIC). We support the theory using simulated
      and real dataset examples.


Keywords: Correlated Variable Selection, Group Lasso, Cluster Lasso Methods,
High-dimensional Linear models


1   Introduction

We consider the usual linear regression model

                                    Y = Xβ^0 + ε,                                      (1)

with response vector Y_{n×1}, design matrix X_{n×p}, true underlying coefficient
vector β^0_{p×1} and error vector ε_{n×1}. When the number of predictors (p) is much
larger than the number of observations (n), p ≫ n, the Lasso [11] and its variants
are widely used for sparse estimation and variable selection. However, variable
selection in situations involving high empirical correlation remains one of the
most important issues. This problem is encountered in many applications such
as microarray data analysis and genome analysis, see [10].
    It has been proven that the design matrix must satisfy the following conditions
for the Lasso to perform exact variable selection: the irrepresentable condition (IC)
[16] and the beta-min condition [1]. Having highly correlated variables implies that
the design matrix violates the IC. To deal with correlated variables, mainly two
approaches have been proposed in the literature: simultaneous clustering and model
fitting (see [7]), and clustering followed by sparse estimation (see [2] and [5]).
The former approach imposes restrictive conditions on the design matrix, while the
time complexity of clustering the variables severely limits the dimension of the
data sets that can be processed by the latter approach. Moreover, group selection
in models with a larger number of groups is more difficult (see [14]). To overcome
the limitations of the latter approach, we propose to use the Elastic-net [17] as a
pre-selection procedure for the Cluster Lasso [2] methods. Basically, we first
reduce the noise (dimension) using the Elastic-net, which is known to select
correlated variables, before applying the Cluster Lasso (CL) methods. This
scheme allows for a significant decrease in the cost of clustering, especially in
high dimensional problems.
    Basically, we propose to combine the Elastic-net and the Cluster Lasso methods
to improve both the speed of computation and the accuracy of the results in the
sparse setting. This goal is achieved by using the Elastic-net to first reduce the
number of variables under consideration and then using the CL methods on the
reduced data to select correlated variables. We theoretically explore how the
proposed combination of algorithms performs under various conditions, i.e. the
Irrepresentable Condition (IC), the Elastic-net Irrepresentable Condition (EIC)
and the Group Irrepresentable Condition (GIC). This theoretical analysis is
validated by experiments on simulated datasets by comparison with different
methods: the Lasso, the Elastic-net and the Cluster Group Lasso. Moreover, we
show that the algorithm improves on the mentioned methods on a real-world dataset.
    The rest of this paper is organized as follows. In Section 2, we provide
notations, assumptions and background to be used later. In Section 3, we review
the mathematical theory of the Lasso, the Elastic-net and the group Lasso. In
Section 4, we describe the proposed algorithm, which mostly selects more adequate
models in terms of model interpretation and prediction performance. In Section 5,
we provide numerical results based on simulated and real datasets. Section 6
contains the computational details and Section 7 concludes.


2     Basic Assumptions, Notations and Concepts

In this section, we state notations and assumptions and we define the required
concepts.


2.1   Notations and Assumptions

We consider the usual linear regression set-up as given in Equation (1). We
assume that the components of the noise vector ε ∈ R^n are i.i.d. N(0, σ²). The
columns of the design matrix X are denoted by X^j. We assume that the design
matrix X is fixed, the data are centred and the predictors are standardized, so
that ∑_{i=1}^n Y_i = 0, ∑_{i=1}^n X_i^j = 0 and (1/n)(X^j)'X^j = 1 for all
j = 1, ..., p. The ℓ1-norm is
denoted by ‖.‖1 , the ℓ2 -norm is denoted by ‖.‖2 and the ℓ∞ norm is denoted by
‖.‖∞. The minimum eigenvalue of a matrix A is denoted by Λ_min(A). The true
active set S_0 is the support of the subset selection solution (S_0 = supp(β^0)),
defined as S_0 = {j : β_j^0 ≠ 0}. For any given S ⊂ {1, 2, ..., p}, β_S is the p × 1
vector which has zeroes outside the set S, as given by

                                𝛽𝑆 = {𝛽𝑗 𝐼(𝑗 ∈ 𝑆)},

where I is the indicator function. Then we have

                                  𝛽 = 𝛽 𝑆 + 𝛽𝑆 𝑐 .

The (scaled) Gram matrix (covariance matrix) is defined as

                                    Σ̂ = X'X / n.

The covariance matrix can be partitioned for the subset S as

                      Σ = [ Σ11 = Σ(S)      Σ12(S)       ]
                          [ Σ21(S)          Σ22 = Σ(S^c) ]                       (2)

Similarly, we partition the parameter vector for the subset S as
                                 β_{p×1} = [ β1 ]
                                           [ β2 ]                                (3)

The sign function is defined as
                                      ⎧ −1  if x < 0
                            sign(x) = ⎨  0  if x = 0                             (4)
                                      ⎩  1  if x > 0


2.2   Clustering of Variables

We use correlation-based, bottom-up agglomerative hierarchical clustering to
cluster the predictors, which forms groups of variables based on the correlations
between them. To determine the number of clusters we use a bootstrap approach;
see the stability feature in [5].
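
As an illustration, a minimal R sketch of such correlation-based agglomerative
clustering using base hclust and cutree (the 1 − |correlation| dissimilarity and the
function name are illustrative assumptions; the number of clusters q is assumed to
be chosen separately, e.g. by the bootstrap/stability criterion of [5]):

    # Minimal sketch: correlation-based, bottom-up clustering of the predictors.
    # X: n x p numeric design matrix; q: desired number of clusters.
    cluster_variables <- function(X, q) {
      d <- as.dist(1 - abs(cor(X)))          # dissimilarity between variables
      tree <- hclust(d, method = "average")  # agglomerative hierarchical clustering
      cutree(tree, k = q)                    # vector of group labels 1..q
    }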


2.3   The Regularized Regression Methods

In this section, we briefly review the Lasso, the Group Lasso and the Elastic-net
regularized regression methods.
    The Least Absolute Shrinkage and Selection Operator (Lasso) was introduced
by Tibshirani [11]. It is a penalized least squares method that imposes an ℓ1-
penalty on the regression coefficients and thus performs shrinkage and automatic
variable selection simultaneously. We denote by β̂_LASSO the Lasso estimate of the
parameter vector, defined as:
              β̂_LASSO ∈ argmin_{β ∈ R^p} { (1/n)‖Y − Xβ‖²_2 + λ‖β‖_1 },           (5)
where 𝜆 is the regularization parameter. The Lasso estimated active set is denoted
as 𝑆^𝐿𝐴𝑆𝑆𝑂 and defined as 𝑆^𝐿𝐴𝑆𝑆𝑂 = {𝑗; (𝛽^𝐿𝐴𝑆𝑆𝑂 )𝑗 ̸= 0}.
    The major disadvantages of the Lasso are that it tends to select only a single
variable from a group of correlated variables, and that it can select at most n
variables when p ≫ n.
To overcome the above limitations of the Lasso, the Elastic-net was proposed
in [17]. The naive Elastic-net estimate of the parameter vector is denoted by β̂_EN
and defined as
              β̂_EN = argmin_{β ∈ R^p} { (1/n)‖Y − Xβ‖²_2 + λ‖β‖_1 + α‖β‖²_2 },    (6)
where α ∈ [0, 1] and λ ≥ 0 are the regularization parameters. Since the Elastic-net
estimate is a rescaling of the naive estimate, (1 + λ2)β̂_EN in the notation of [17],
it selects the same variables as β̂_EN, and we therefore consider β̂_EN as the
Elastic-net estimate. The Elastic-net estimated active set is denoted by Ŝ_EN and
defined as Ŝ_EN = {j : (β̂_EN)_j ≠ 0}.
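
A minimal R sketch of the Lasso and the Elastic-net fits using the glmnet package
(X and Y as in Equation (1); the mixing value 0.5 for the Elastic-net and the use of
cross-validation to choose λ are illustrative assumptions, not prescribed by the
theory above):

    library(glmnet)
    # Lasso: pure l1 penalty (alpha = 1 in glmnet's parametrization)
    cv_lasso <- cv.glmnet(X, Y, alpha = 1)
    S_lasso  <- which(coef(cv_lasso, s = "lambda.min")[-1] != 0)  # estimated active set
    # Elastic-net: mixture of l1 and l2 penalties (alpha = 0.5 chosen arbitrarily here)
    cv_enet <- cv.glmnet(X, Y, alpha = 0.5)
    S_enet  <- which(coef(cv_enet, s = "lambda.min")[-1] != 0)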
    When the distinct groups or clusters among the variables are known a priori
and it is desirable to select or drop whole groups instead of single variables,
the Group Lasso (see [15]) or its variants (e.g. the Group Square-Root Lasso [3]
and the Adaptive Group Lasso [14]) are used, which impose an ℓ2-penalty on the
coefficients within each group to achieve group sparsity.
    Here we define some more notations and state assumptions for the group
Lasso. We may use β^0 and β interchangeably for the true regression coefficient
vector, the latter without the superscript. Let us assume that the parameter
vector β is structured into groups G = {G_1, ..., G_q}, where q < n denotes the
number of groups. G forms a partition of the index set {1, ..., p}, with
∪_{r=1}^q G_r = {1, ..., p} and G_r ∩ G_l = ∅ for r ≠ l. The parameter vector β
then has the structure β = {β_{G_1}, ..., β_{G_q}}, where β_{G_j} = {β_r : r ∈ G_j}.
The columns of group G_j are collected in X_{G_j}; the response vector Y can then
also be written as

                           Y = ∑_{j=1}^q X_{G_j} β_{G_j} + ε.
The loss function of the group Lasso is the same as that of the Lasso,
(1/n)‖Y − Xβ‖²_2. The group Lasso penalty is defined as

                    ‖β‖_{2,1} = ∑_{j=1}^q ‖X_{G_j} β_{G_j}‖_2 √(m_j / n),

where m_j = |G_j| is the group size. Since the penalty is invariant under
within-group reparametrizations, without loss of generality we can assume
Σ_rr = I. Hence the group Lasso penalty can be written as
                          ‖β‖_{2,1} = ∑_{j=1}^q √(m_j) ‖β_{G_j}‖_2.






The Group Lasso estimator (with known q groups) is defined as
                 β̂_grp ∈ argmin_{β} { (1/n)‖Y − Xβ‖²_2 + λ‖β‖_{2,1} }             (7)

Let W ⊂ {1, ..., q} denote the active group set, with cardinality w = |W|. We
assume: (i) the size of each group is smaller than the number of observations,
m_max < n, and (ii) the number of active groups w is smaller than the number of
observations (sparsity assumption). We denote the set of clusters selected by the
group Lasso by Ŝ_clust, which is defined as

                Ŝ_clust = {r : cluster G_r is selected, r = 1, ..., q}.           (8)

The union of the selected clusters gives the selected set of variables.

                               Ŝ_grp = ∪_{r ∈ Ŝ_clust} G_r                        (9)
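
A minimal R sketch of the group Lasso fit using the gglasso package (gglasso
expects the columns of X ordered by group and the group labels as consecutive
integers; choosing λ by cross-validation is an illustrative assumption):

    library(gglasso)
    # grp: integer vector of length p with consecutive group labels 1..q,
    # columns of X assumed ordered so that each group is contiguous.
    cv_grp   <- cv.gglasso(x = X, y = Y, group = grp)
    beta_grp <- coef(cv_grp, s = "lambda.min")[-1]     # drop the intercept
    S_clust  <- unique(grp[beta_grp != 0])             # clusters with a non-zero coefficient
    S_grp    <- which(grp %in% S_clust)                # union of the selected clusters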

2.4   The Cluster Lasso Methods
When the group structure is not known, the cluster Lasso methods are preferred;
they perform clustering followed by sparse estimation. The clusters G_1, ..., G_q are
generated from the design matrix X (e.g. using a correlation-based method). When
the group Lasso is applied to the resulting clusters, the method is called the
Cluster Group Lasso, and when the Lasso is applied to the cluster representatives,
it is called the Cluster Representative Lasso; see [2].


3     Mathematical Theory of the Lasso and its Variants
In this section, we briefly review the results required for proving consistent
variable selection (and group variable selection) in high dimensional linear models.
For more details on the mathematical theory of the Lasso and the group Lasso, we
refer to [1], [16], [15], [6] and [8], and for the Elastic-net we refer to [9].

    For the following definitions, let us assume a fixed set S ⊂ {1, ..., p} with
cardinality s = |S|, and partitions of the covariance matrix and parameter vector
as given by Equations (2) and (3) respectively.
Definition 1 (Irrepresentable Condition (IC)). The strong irrepresentable
condition is said to be met for the set S with a constant η > 0, if the following
holds:

                        ‖Σ21(S) Σ11^{-1}(S) sign(β1)‖∞ ≤ 1 − η.                   (10)

    Sufficient conditions (eigenvalue and mutual incoherence conditions) on the
design matrix for the IC to hold are discussed in [16] and [1].
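
As an illustration, a minimal R sketch that checks the IC numerically for a given
(scaled) Gram matrix (the function name, the margin η = 0.05 and the use of a plain
matrix inverse are illustrative assumptions):

    # Sigma: p x p scaled Gram matrix; S: indices of the active set;
    # sign_beta1: sign vector of the active coefficients; eta: required margin.
    check_ic <- function(Sigma, S, sign_beta1, eta = 0.05) {
      Sigma11 <- Sigma[S, S, drop = FALSE]
      Sigma21 <- Sigma[-S, S, drop = FALSE]
      lhs <- max(abs(Sigma21 %*% solve(Sigma11) %*% sign_beta1))
      lhs <= 1 - eta          # TRUE if condition (10) holds with margin eta
    }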
Definition 2 (Beta-min Condition). The beta-min condition is met for the
regression coefficient β, if min |β_S| ≥ 4λs / φ²(S),
where φ(S) is the compatibility constant; see [1].
Lemma 1. Under the following assumptions, the Lasso selects the set S with
high probability:
 – Irrepresentable condition holds for 𝑆.
 – Beta-min condition holds for the parameter vector 𝛽.

Definition 3 (The Elastic-net Irrepresentable Condition (EIC)). The
EIC is met for the set S, with a constant η > 0, if the following holds:

              ‖Σ21 (Σ11 + (α/n) I)^{-1} (sign(β1^0) + (2α/λ) β1^0)‖∞ ≤ 1 − η.
Proposition 1. For a given set S, if the IC holds, then for any λ > 0 there
exists α such that the EIC holds; the EIC, however, does not imply the IC.

When the IC holds, it is easy to see that the EIC also holds, since α = 0 reduces
the EIC to the IC. Conversely, the EIC may hold even when Σ11 is not invertible,
which shows that the EIC does not imply the IC.
    Now we discuss necessary conditions for group variable selection. Our error
analysis for the group Lasso is based on the pure active group and pure noise
group assumptions, that is, all variables within an active group are active and no
variables within a noise group are active; we also assume that the clustering
process identifies the group structure correctly.
    It is convenient to assume the following. Let W ⊂ {1, ..., q} be a group index
set, say W = {1, ..., w}. Consider the full index set corresponding to W as

              S = {(1, 1), ..., (1, m_1), ..., (w, 1), ..., (w, m_w)} = {1, ..., s},

where s = ∑_{j=1}^w m_j. We partition the covariance matrix Σ11(S) group-wise and
denote its inverse by R_S (here we assume that each Σ_{r,r} is non-singular, or one
may use the pseudo-inverse):

        ⎡          I               Σ11^{-1/2} Σ12 Σ22^{-1/2}  ...  Σ11^{-1/2} Σ1w Σww^{-1/2} ⎤
        ⎢ Σ22^{-1/2} Σ21 Σ11^{-1/2}           I               ...  Σ22^{-1/2} Σ2w Σww^{-1/2} ⎥
  R_S = ⎢          ⋮                          ⋮               ⋱              ⋮               ⎥
        ⎣ Σww^{-1/2} Σw1 Σ11^{-1/2}  Σww^{-1/2} Σw2 Σ22^{-1/2} ...              I            ⎦

We note that the diagonal blocks are m_r × m_r identity matrices, due to the
within-group parametrization invariance property.

Definition 4 (Group Irrepresentability Condition (GIC)). The GIC is
met for the set W with a constant η > 0, if the following holds:

                     ‖(Σ21 R_S sign(β))_{G_r}‖ ≤ 1 − η      ∀ r ∉ W,              (11)

where the inequality holds group-wise.





We note that the GIC definition reduces to the IC for singleton groups.

Definition 5 (Group Beta-min Condition). The group beta-min condition is
met for β if ‖β_{G_r}‖∞ ≥ D λ √(m_r) / n  for all r ∈ W, where D > 0 is a constant
which depends on σ, φ_grp and other constants used in the cone constraints and
the GIC.

For its exact form, we refer to [1]. We note that only one component of β_{G_r},
for each r ∈ W, has to be sufficiently large, because we aim to select groups as a
whole, and not individual variables.

Theorem 1. Under the following assumptions, the group Lasso selects the set of
active groups W with high probability.

 – The GIC holds for 𝑊 .
 – Group beta-min condition holds for 𝛽 𝐺𝑟 , ∀𝑟 ∈ 𝑊 .

    Now we show that the IC implies the GIC, and the converse is not true.

Lemma 2. If the IC holds for the set S, then the GIC holds for the set W, but
the converse is not true.

Here we give a sketch of the proof. If the IC holds for the set S, then the GIC
holds for the singleton groups, and therefore the GIC holds for any group structure
within S. On the other hand, Σ11 is not invertible when the active variables are
perfectly correlated, so the IC fails, but Σ11 may become invertible after the
within-group transformation and the GIC may then hold. Therefore the IC is more
restrictive than the GIC, and hence the converse of the lemma is not true.


4    Pre-Selection for the Cluster Lasso Methods

We consider high dimensional settings where groups of variables are highly
correlated. It is known that the Lasso tends to select only one or a few variables
from a group of highly correlated variables, even though many or all of them belong
to the active set. The CL methods have proven to be effective in selecting grouped
variables and reducing false negatives. The two major drawbacks of the CL methods
are that group selection does not work well when there is a large number of groups,
and that the time complexity of clustering is prohibitive when p is large. We
address these problems by using the Elastic-net as a pre-selection step for the CL
methods, as described in Algorithm 1. The variable selection consistency of the
Elastic-net and of the CL methods has already been proven under the EIC and the
GIC respectively; see [9] and [2]. In the following, we discuss various situations
where Elastic-net+CL consistently selects the active set with high probability, and
situations where our scheme may fail to select the true variables.







 Algorithm 1: The Elastic-net+CL Algorithm
   Input: dataset (Y, X)
   Output: Ŝ := set of selected variables
   Steps:
   1. Perform the Elastic-net on the data (Y, X), and denote by Ŝ_EN the set of
      selected variables. Let X_red = X^{Ŝ_EN} be the reduced design matrix.
   2. Perform clustering of variables on X_red, and denote the clusters by
      G_1, ..., G_q.
   3. Perform the group Lasso on (Y, X_red) with group information G_1, ..., G_q,
      and denote the selected set of groups by
      Ŝ_cluster = {r : cluster G_r is selected, r = 1, ..., q}.
   The union of the selected groups is then the selected set of variables,
      Ŝ = ∪_{r ∈ Ŝ_cluster} G_r.
   return Ŝ
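
A minimal end-to-end R sketch of Algorithm 1, under the assumptions already noted
for the individual steps (glmnet for the Elastic-net, base hclust/cutree for the
clustering, gglasso for the group Lasso; the mixing parameter, the number of
clusters q and the use of cross-validation are illustrative choices, not part of
the algorithm's specification):

    library(glmnet)
    library(gglasso)

    enet_cl <- function(X, Y, q, alpha = 0.5) {
      # Step 1: Elastic-net pre-selection
      cv_en <- cv.glmnet(X, Y, alpha = alpha)
      S_en  <- which(coef(cv_en, s = "lambda.min")[-1] != 0)
      X_red <- X[, S_en, drop = FALSE]
      # Step 2: correlation-based clustering of the retained variables
      grp <- cutree(hclust(as.dist(1 - abs(cor(X_red))), method = "average"), k = q)
      # Step 3: group Lasso on the reduced data; gglasso needs columns ordered by group
      ord     <- order(grp)
      cv_grp  <- cv.gglasso(x = X_red[, ord], y = Y, group = grp[ord])
      b_grp   <- coef(cv_grp, s = "lambda.min")[-1]
      S_clust <- unique(grp[ord][b_grp != 0])
      # Union of the selected clusters, mapped back to the original column indices
      S_en[which(grp %in% S_clust)]
    }

The reduced dimension after Step 1 is what makes the clustering in Step 2 cheap;
the sketch assumes q does not exceed the number of retained variables.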



Case 1 The IC holds:
   Suppose that for a given set S, the design matrix X satisfies the IC, which
   implies that the EIC and the GIC are also satisfied for the set S; see
   Proposition 1 and Lemma 2. It follows that with high probability (w.h.p.)
   the Elastic-net will select the true variables and, similarly, any group
   structure within S will yield group selection consistency for the group
   Lasso, w.h.p. Therefore, in this situation the Elastic-net+CL shows variable
   selection consistency.
Case 2 The EIC and the GIC both hold:
   Apart from the previous case, this situation may also arise when two or more
   variables within the set S are highly correlated: then the IC does not hold for
   the set S (since Σ11 is not invertible), but the EIC and the GIC can still be
   satisfied. Using the same argument as for the preceding case, it is easy to show
   that the combination has variable selection consistency.
Case 3 The EIC holds but the GIC does not hold:
   This situation may arise for overlapping groups: when the active groups are
   highly correlated, the design matrix does not satisfy the GIC, but the EIC
   may hold for the set S. In such cases, though the Elastic-net may select the
   true active set w.h.p., the subsequent group Lasso may fail to select the
   active set.
Case 4 The GIC holds but the EIC does not hold:
   This situation can arise when there are near-linear dependences among the
   active variables: the EIC may not hold, and the Elastic-net tends to select
   a single variable or a few variables from the group of linearly dependent
   variables. If we correctly put the linearly dependent variables into
   appropriate clusters (one may use canonical correlation based clustering,
   see [2]), then the GIC may hold. Since the pre-selection step itself fails to
   select the true active set, our scheme of pre-selection for the CL methods may
   not work in this case.
   We illustrate the above four cases with simulation studies in the next section.
We do not consider the case where neither the EIC nor the GIC holds, because
neither the pre-selection nor the CL methods are then guaranteed to select the
true variables.


5     Numerical Results

In this section, we consider four simulation settings and a real data example in
order to empirically compare the performance of the proposed scheme with other
existing methods: the Lasso [11], the Elastic-net [17] and CGLCor [2].


5.1   Simulation Studies

Four examples are considered in this simulation, corresponding to the cases
discussed in the preceding section. In each example, data are simulated from
the linear model in Equation (1) with fixed design X. For each example, the
simulated dataset consisted of a training set and an independent validation set,
and �� such datasets were simulated per example. The models were fitted on the
training data for each of the �� datasets, and the model with the lowest test set
Mean Squared Error (MSE) was selected as the final model. For model interpretation,
we consider the pair (TN, FN), where TN denotes true negatives, that is, the
number of zero coefficients that are correctly estimated as zero, and FN denotes
false negatives, that is, the number of non-zero coefficients that are incorrectly
estimated as zero. As our aim is to avoid false negatives, we mainly focus on FN
when analysing the different estimators. The MSE and (TN, FN) are defined as
follows:

                             MSE = (1/n) ∑_{i=1}^n (y_i − ŷ_i)²                   (12)

                             TN = |Ŝ^c ∩ S^c|                                     (13)

                             FN = |Ŝ^c ∩ S|                                       (14)
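
A minimal R sketch of these three quantities (the argument names are illustrative):

    mse <- function(y, y_hat) mean((y - y_hat)^2)                         # Equation (12)
    tn  <- function(S_hat, S, p) length(setdiff(setdiff(1:p, S_hat), S))  # |S_hat^c ∩ S^c|
    fn  <- function(S_hat, S) length(setdiff(S, S_hat))                   # |S_hat^c ∩ S|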

    We generate the covariates from N_p(0, Σ_i) with p = ���� and n = ���, where Σ_i
for the four different cases is defined below. The noise is ε ∼ N_n(0, σ²I) with
σ = 3. For the regression coefficients we set β^0 = (3, ..., 3, 0, ..., 0)', where
the first twenty variables are the active variables, S = {1, ..., 20}, and the
remaining variables are noise features.
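
A minimal R sketch of this data-generating process (the values of n and p below are
placeholders and not the paper's settings; MASS::mvrnorm is one possible way to
draw the correlated designs):

    library(MASS)                         # for mvrnorm
    n <- 100; p <- 1000                   # placeholder sizes, assumptions for illustration
    s <- 20; sigma <- 3
    beta0 <- c(rep(3, s), rep(0, p - s))  # first twenty coefficients equal to 3, rest zero
    Sigma <- diag(p)                      # orthogonal case; replace by Sigma_i per scenario
    X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
    Y <- X %*% beta0 + rnorm(n, sd = sigma)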


The Orthogonal Model: In this case, the variables are uncorrelated, which
implies that Σ1 is a 40 × 40 identity matrix; hence the IC, EIC and GIC hold for
the set S.
    The simulation results (the minimum MSE with its standard deviation over the
�� runs, and the minimum TN and FN) are presented in Table 1, from which it is
easy to see that all estimators report no false negatives.





                        Table 1. The Orthogonal Model

                      Model         MSE           TN FN
                      Lasso       ��.��� (��.�) ���       �
                      Enet        ��.��� (��.�) ���       �
                      CGLCor      ��.��� (��.�) ���       �
                      Enet+CGLCor ��.��� (��.�) ���       �




Block Diagonal Model: We consider a block diagonal model, where we simulate
X ∼ N_p(0, Σ2) and Σ2 consists of ��� blocks of size 10 × 10, defined as

                      block_{j,k} = 1 if j = k, and 0.9 otherwise.
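
A minimal R sketch of this covariance structure (the number of blocks is written
generically as p/10, since the exact dimensions above are placeholders):

    block  <- matrix(0.9, 10, 10); diag(block) <- 1  # 10x10 block: unit diagonal, 0.9 elsewhere
    Sigma2 <- kronecker(diag(p / 10), block)         # block diagonal; assumes p divisible by 10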
    Table 2 shows the simulation results for the block diagonal model. As
expected, the Lasso fails to select the correlated active components. However,
the Elastic-net, CGLCor and Elastic-net+CGLCor report no false negatives.


                        Table 2. Block Diagonal Model

                       Model          MSE        TN FN
                       Lasso       �.�� (�.�) ��� ��
                       Enet        �.�� (�.�) ��� �
                       CGLCor      ��.�� (�.�) ��� �
                       Enet+CGLCor �.�� (�.�) ��� �




Linear Dependency Model: This situation arises when a covariate is nearly
linearly dependent on two or more other covariates. Here we simulate X ∼ N_p(0, I)
and then introduce the linear dependencies as follows:


                       X_10   = X_1 + ... + X_9 + N(0, .01)
                          ⋮
                       X_1000 = X_991 + ... + X_999 + N(0, .01)

Here we use the canonical correlation based clustering technique for the CL
methods; see the CGL method in [2]. Table 3 shows the simulation results for the
linear dependency model, from which it is clear that the Lasso, the Elastic-net
and the Elastic-net+CGL could not avoid false negatives.





                         Table 3. Linear Dependency Model

                            Model       MSE        TN FN
                            Lasso    �.�� (�.�) ��� ��
                            Enet     �.�� (�.�) ��� ��
                            CGL      �.�� (�.�) ��� �
                            Enet+CGL �.�� (�.�) ��� ��




Correlation within active groups: We consider a similar set-up to the block
diagonal model, except that we make the first two groups (the active groups)
correlated with each other. In this case, the smallest eigenvalue of Σ11 is very
small and the GIC does not hold, but the EIC is satisfied since the correlation is
high within the active groups only. Thus, although the Elastic-net is consistent,
the Elastic-net+CGLCor may perform poorly.


                       Table 4. Results for Correlated Groups

                          Model           MSE        TN FN
                          Lasso       �.�� (�.�) ��� ��
                          Enet        �.�� (�.�) ��� �
                          CGLCor      ��.�� (�.�) ��� �
                          Enet+CGLCor �.�� (�.�) ��� �




    The simulation results are presented in Table 4, from which it is clear that
only the Elastic-net correctly identifies the true active set.



5.2   Real Data Example


In this section, we consider the gene selection problem on the leukaemia data
(see [13]) to compare the Elastic-net+CL scheme with the other methods. The
leukaemia data consist of ���� genes and �� samples, among which �� are of one
leukaemia type and �� of the other. We consider part of the leukaemia dataset: we
first select the �� most significant genes and then the �� least significant genes,
according to their t-statistic scores (see [12]), so that the true active set is
S = {1, ..., 50}. We used tenfold cross-validation for choosing the regularization
parameters.
    The results (TN and FN, along with the cross-validation error) are presented
in Table 5, from which it is clear that the Elastic-net+CGLCor outperforms the
other methods in terms of model interpretation and variable selection.





                    Table 5. Leukemia Gene Selection Results

                         Model          CVError TN FN
                         Lasso       �.���        �� ��
                         Enet        �.���        �� �
                         CGLCor      �.���        �� �
                         Enet+CGLCor �.���        �� �




6    Computational Details
Statistical analysis was performed in R �.�.�. We used the package glmnet for the
penalized regression methods (the Lasso and the Elastic-net), the package gglasso
to perform the group Lasso, and the package ClustOfVar for the clustering of
variables.


7   Conclusion
In this article, we proposed pre-selection using the Elastic-net for the CL
methods for variable selection in high-dimensional linear models with strongly
correlated variables. We proved that the variable set selected by the
Elastic-net+CL method contains the true active set consistently under the IC, or
under the EIC and GIC, together with the beta-min conditions. We discussed how
reducing the dimensionality improves the speed and accuracy of the clustering
process, and we explored theoretically how the proposed combination of algorithms
performs under various conditions. Our simulation studies showed that combining
the Elastic-net and the Cluster Lasso methods improves variable selection and
predictive performance.


Acknowledgment

The authors would like to thank the anonymous reviewers of this paper for their
suggestions and constructive comments.








References
 1. Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data: Methods,
    Theory and Applications. Springer Verlag (����)
 2. Bühlmann, P., Rütimann, P., van de Geer, S., Zhang, C.H.: Correlated variables in
    regression: clustering and sparse estimation. Journal of Statistical Planning and
    Inference ���, ����–���� (����)
 3. Bunea, F., Lederer, J., She, Y.: The group square-root lasso: Theoretical properties
    and fast algorithms. Information Theory, IEEE Transactions on ��, ����–����
    (����)
 4. Chavent, M., Kuentz, V., Liquet, B., Saracco, J.: ClustOfVar: an R package for the
    clustering of variables. The R User Conference, University of Warwick, Coventry, UK,
    pp. ��–�� (����)
 5. Gauraha, N.: Stability feature selection using cluster representative lasso. In:
    Proceedings of the �th International Conference on Pattern Recognition Applications
    and Methods, pp. ���–��� (����)
 6. van de Geer, S., Bühlmann, P.: On the conditions used to prove oracle results for
    the lasso. Electronic Journal of Statistics �, ����–���� (����)
 7. Bondell, H.D., Reich, B.J.: Simultaneous regression shrinkage, variable selection
    and clustering of predictors with OSCAR. Biometrics pp. ���–��� (����)
 8. Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The
    Lasso and Generalizations. CRC Press (����)
 9. Jia, J., Yu, B.: On model selection consistency of the elastic net when p ≫ n.
    Technical Report (����)
10. Segal, M., Dahlquist, K., Conklin, B.: Regression approaches for microarray data
    analysis. Journal of Computational Biology ��, ���–��� (����)
11. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Statist. Soc.
    ��, ���–��� (����)
12. Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G.: Diagnosis of multiple cancer
    types by shrunken centroids of gene expression. PNAS pp. ����–���� (����)
13. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P.,
    Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.:
    Molecular classification of cancer: Class discovery and class prediction by gene
    expression monitoring. Science pp. ���–��� (����)
14. Wei, F., Huang, J.: Consistent group selection in high-dimensional linear regression.
    Bernoulli ��, ����–���� (����)
15. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped
    variables. J. R. Statist. Soc. ��(�), ��–�� (����)
16. Zhao, P., Yu, B.: On model selection consistency of lasso. Journal of Machine
    Learning Research �, ����–���� (����)
17. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R.
    Statist. Soc. ��, ���–��� (����)



