=Paper=
{{Paper
|id=Vol-1807/05_ISW-LOD2016_45_54
|storemode=property
|title=A Method for Analyzing High-dimensional Datasets[A first approach]
|pdfUrl=https://ceur-ws.org/Vol-1807/05_ISW-LOD2016_45_54.pdf
|volume=Vol-1807
|authors=Edwin Aldana-Bobadilla,Ivan López-Arevalo,J.L. González,Ana B. Rios-Alvarado
|dblpUrl=https://dblp.org/rec/conf/iberamia/Aldana-Bobadilla16
}}
==A Method for Analyzing High-dimensional Datasets [A first approach]==
A Method for Analyzing High-dimensional Datasets [A first approach]

Edwin Aldana-Bobadilla¹, Ivan Lopez-Arevalo¹, J.L. Gonzalez¹, and Ana B. Rios-Alvarado²

¹ CINVESTAV Tamaulipas, Victoria, México. {ealdana,ilopez,jgonzalez}@tamps.cinvestav.mx
² Autonomous University of Tamaulipas, Victoria, México. arios@uat.edu.mx

Abstract. Due to technological advances that allow data to be collected massively, the last decades have raised several challenges for data analysis. Data reduction is one of the most important. Typically such reduction can be made in two ways: decreasing the number of instances (elements) of the dataset, or decreasing the number of attributes (columns) required to describe each of these instances. In the latter case, methods such as Principal Component Analysis (PCA) and the Low Variance Filter are usually applied. They are based on statistical measures that allow us to obtain the set of features or attributes with the minimal correlation between them. Since the dataset may have uncorrelated variables that cannot be eliminated, the number of obtained features might not always be appropriate. A dataset with this characteristic may represent a performance problem when there are constraints of time or space. To avoid this, we propose a method that allows us to represent an n-dimensional instance as a single numerical value in a one-dimensional space. We show that this transformation preserves the properties of the original dataset and thus can be suitable for many applications where a high degree of reduction is required.

Keywords: Feature Selection, Dimensionality Reduction, Density Estimation

1 Introduction

Multivariate data analysis poses challenges at both the theoretical and empirical levels. Until now, several methods for dimensionality reduction, such as Principal Component Analysis [6], the Low Variance Filter [1] and High Correlated Columns [7], have been proposed. In this regard, we propose a method that allows us to represent an n-dimensional instance as a single numerical value in a one-dimensional space. It is compulsory that such a representation preserves the information conveyed by the original data. Our proposal is based on a "discretization" of the data space. We resort to the idea of quantiles, which are cut points dividing a set of observations or instances into equal-sized intervals [4]. Usually the quantiles are defined over a one-dimensional space. A set of instances in such a space may be grouped by the quantile to which they belong. In this sense, a quantile represents all those instances that are close to each other. Many operations on an instance may be approximated through the range defined by its quantile. For example, assume many instances x_i ∈ R that belong to the quantile q defined by the interval [0.25, 0.30). The operation f(x_i) = sin(x_i) ∀x_i ∈ q may be approximated by f(q) = sin(0.25) = 0.004, by f(q) = sin(0.29) = 0.005, or even by the average of the limit evaluations, f(q) = (sin(0.25) + sin(0.29))/2 ≈ 0.004. We can see that the effectiveness of this approximation depends on the size of the interval that defines q. Thus an appropriate size value must be determined.

Based on the above, we assume that the instances of a dataset may be represented by a set of quantiles, which preserves the properties of such instances when several operations are applied. As mentioned, typically the quantiles are defined over a one-dimensional space. We propose a methodology that extends such a definition to an n-dimensional space.
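To make the interval-based approximation above concrete, here is a minimal Python sketch (ours, not part of the paper; the function name and the radian-valued example data are illustrative assumptions) that represents every instance falling in a quantile by a single evaluation of f:

```python
import numpy as np

def approximate_over_quantile(f, lower, upper, how="midpoint"):
    """Represent f(x) for every x in the quantile [lower, upper)
    by a single evaluation of f at a representative point."""
    if how == "lower":
        return f(lower)
    if how == "upper":
        return f(upper)
    return f((lower + upper) / 2.0)  # default: midpoint of the interval

# Instances that fall inside the quantile [0.25, 0.30) (values in radians,
# purely illustrative) are all represented by one value of sin.
xs = np.array([0.251, 0.27, 0.299])
approx = approximate_over_quantile(np.sin, 0.25, 0.30)
print(approx, np.abs(np.sin(xs) - approx).max())  # representative value, worst-case error
```

The narrower the interval, the smaller the worst-case error of this representative value, which is exactly the trade-off governed by the number of quantiles discussed below.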
We want to project every instance of the dataset onto its corresponding quantile. Every quantile is identified by a unique numerical code. The projection of an instance onto its corresponding quantile allows us to obtain the numerical representation of such an instance (encoding instances). The set of encoded instances will be the new dataset; we show experimentally that this set preserves the information and properties of the original dataset when several operations are applied.

The rest of this work is organized as follows: in Section 2 we present the main idea of dataset discretization based on quantiles. In Section 3 we explain how to map a data instance into one-dimensional space through its corresponding quantile. Then, in Section 4, we show the experimental methodology used to measure the effectiveness of our method and present the results. Finally, in Section 5 we conclude and mention some future work.

2 Data Discretization

Given a dataset Y in a one-dimensional space, we can divide its space into a set of quantiles, as shown in Figure 1. In this case, a quantile q_i is an interval of the form q_i = [\underline{y}_i, \overline{y}_i], where \underline{y}_i and \overline{y}_i are the lower and upper limits of q_i respectively. To determine the values of \underline{y}_i and \overline{y}_i, a quantile width δ must be defined. Such a value is given by:

\delta = \frac{|\max(Y) - \min(Y)|}{N}   (1)

where N is a prior value of the desired number of quantiles. Based on the above, the first quantile is defined as a half-closed interval of the form:

q_1 = [\min(Y), \min(Y) + \delta)   (2)

and the subsequent quantiles are defined as:

q_i = \begin{cases} [\overline{y}_{i-1}, \overline{y}_{i-1} + \delta] & \text{if } i = N \\ [\overline{y}_{i-1}, \overline{y}_{i-1} + \delta) & \text{otherwise} \end{cases}   (3)

Fig. 1: A possible division of the space of a one-dimensional random variable Y. The quantile q_k contains a proportion of the instances of Y.

The idea can be extended to higher-dimensional data, in which case a quantile is an n-dimensional partition of the data space. Given a dataset Y ∈ R^n with instances of the form y = [y_1, y_2, ..., y_n], we can divide the data space into a set of n-dimensional quantiles. Each n-dimensional quantile is composed of a set of intervals that determine the upper and lower limits in each dimension. Such a definition is expressed as:

q_i = \big[\, [\underline{y}_{i1}, \overline{y}_{i1}],\; [\underline{y}_{i2}, \overline{y}_{i2}],\; \ldots,\; [\underline{y}_{in}, \overline{y}_{in}] \,\big]   (4)

where \underline{y}_{i,k} and \overline{y}_{i,k} are the lower and upper limits of q_i in the k-th dimension, and the width of each interval is now given by:

\delta_k = \frac{|\max(Y_k) - \min(Y_k)|}{N}   (5)

The variable Y_k corresponds to the data in the k-th dimension. We can generalize the way to determine the limits of a quantile when Y ∈ R^n as:

q_1 = \big[\, [\min(Y_1), \min(Y_1) + \delta_1),\; [\min(Y_2), \min(Y_2) + \delta_2),\; \ldots,\; [\min(Y_n), \min(Y_n) + \delta_n) \,\big]^{T}   (6)

for the first quantile, and:

q_i = \begin{cases} \big[\, [\overline{y}_{(i-1),1}, \overline{y}_{(i-1),1} + \delta_1],\; [\overline{y}_{(i-1),2}, \overline{y}_{(i-1),2} + \delta_2],\; \ldots,\; [\overline{y}_{(i-1),n}, \overline{y}_{(i-1),n} + \delta_n] \,\big]^{T} & \text{if } i = N \\[4pt] \big[\, [\overline{y}_{(i-1),1}, \overline{y}_{(i-1),1} + \delta_1),\; [\overline{y}_{(i-1),2}, \overline{y}_{(i-1),2} + \delta_2),\; \ldots,\; [\overline{y}_{(i-1),n}, \overline{y}_{(i-1),n} + \delta_n) \,\big]^{T} & \text{otherwise} \end{cases}   (7)

for subsequent quantiles. Note that in general a partial order is formed, corresponding to the left-to-right precedence relation in which q_i < q_j if there exists k ≤ n such that \underline{y}_{i,k} < \underline{y}_{j,k}. Thanks to this precedence relation, we can assign to each quantile a numerical code that preserves the partial order. To illustrate this idea, in Figure 1 the leftmost quantile can be identified as 1 while the rightmost quantile can be identified as 10. Any other numerical base can be used (instead of base 10), as long as the order is preserved.
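As a rough illustration of Equations (5)-(7), the following Python sketch (ours, not part of the paper; the function name and array layout are our own assumptions) computes the per-dimension interval edges that define the n-dimensional quantiles:

```python
import numpy as np

def quantile_edges(Y, N):
    """Per-dimension interval edges for N quantiles.
    Y is an (m, n) array of m instances in n dimensions; the width of each
    interval in dimension k is |max(Y_k) - min(Y_k)| / N (Eq. 5).
    Returns an (n, N + 1) array: row k holds the N + 1 edges of dimension k."""
    Y = np.asarray(Y, dtype=float)
    mins, maxs = Y.min(axis=0), Y.max(axis=0)
    deltas = np.abs(maxs - mins) / N
    return np.stack([mins[k] + deltas[k] * np.arange(N + 1)
                     for k in range(Y.shape[1])])

# Example: 100 random instances in R^3, divided into N = 5 intervals per dimension.
edges = quantile_edges(np.random.rand(100, 3), N=5)
print(edges.shape)  # (3, 6)
```

Consecutive edges in row k delimit the half-open intervals of Equation (7); only the last interval of each dimension is treated as closed so that the maximum value is covered.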
Now consider Figure 2, which illustrates a possible encoding of three-dimensional quantiles. The quantile code is formed by concatenating the sequence numbers of the intervals that define q_i in each dimension. This means that a three-dimensional quantile encoded as 111 is defined by the leftmost intervals per dimension. Likewise, a three-dimensional quantile encoded as 333 is defined by the rightmost intervals per dimension. In such an encoding, we can see that the quantile encoded as 113 precedes the quantile encoded as 323. The precedence order that is evident in one-dimensional space is thus preserved in higher-dimensional spaces.

Fig. 2: Possible encoding of three-dimensional quantiles.

3 Mapping High-dimensional Data to One-dimensional Space

So far, we have shown how to divide the data space into a set of n-dimensional quantiles. Now, the idea is to map the dataset instances to the quantile to which they belong. Those instances that belong to q_i can be represented (encoded) through the code assigned to it. Thus, given a set of quantiles Q defined from an n-dimensional dataset Y, a one-dimensional dataset Y', composed only of encoded instances, can be obtained.

Table 1 shows a set of quantiles Q for a hypothetical dataset Y ∈ R. Possible instances of Y have been encoded, according to the quantile code to which they belong, as shown in Table 2. From this example the encoded dataset Y' is:

Y' = {01, 01, 02, 02, 03, 03, 04, 04, 05, 05, 07, 07, 08, 08, 09, 09, 10, 10, 10, 06}   (8)

Code  Interval
01    [-0.63, -0.60)
02    [-0.60, -0.58)
03    [-0.58, -0.55)
04    [-0.55, -0.53)
05    [-0.53, -0.50)
06    [-0.50, -0.48)
07    [-0.48, -0.45)
08    [-0.45, -0.43)
09    [-0.43, -0.40)
10    [-0.40, -0.38]

Table 1: Example of quantile encoding for a one-dimensional dataset.

Y      Code    Y      Code
-0.63  01      -0.48  07
-0.62  01      -0.46  07
-0.60  02      -0.45  08
-0.59  02      -0.44  08
-0.58  03      -0.43  09
-0.57  03      -0.41  09
-0.55  04      -0.40  10
-0.54  04      -0.39  10
-0.53  05      -0.38  10
-0.52  05
-0.50  06

Table 2: Example of an encoding of a one-dimensional dataset.

The above idea can be extended to the n-dimensional case. For instance, let Y be a dataset in R^3. Assume that the space of the dataset in the k-th dimension (Y_k) is divided into five intervals (for illustrative purposes), where min(Y_1) = -1.22, max(Y_1) = 1.94, min(Y_2) = -1.21, max(Y_2) = 2.70, min(Y_3) = -1.26 and max(Y_3) = 1.12. Also, assume that δ_1 = 0.64, δ_2 = 0.78 and δ_3 = 0.48. In this way, the leftmost quantile is comprised of the intervals [-1.22, -0.58), [-1.21, -0.43), [-1.26, -0.78), which correspond to the leftmost intervals per dimension. Using a two-digit decimal code to identify every interval, the corresponding quantile can be encoded as 01-01-01, or merely 010101. In Table 3, this and other quantiles are illustrated, including their corresponding boundaries per dimension.

Quantile Code  Y1              Y2              Y3
010101         [-1.22, -0.58)  [-1.21, -0.43)  [-1.26, -0.78)
010103         [-1.21, -0.57)  [-1.21, -0.43)  [-0.79, -0.31)
010104         [-1.20, -0.56)  [-1.21, -0.43)  [ 0.17,  0.65)
010201         [-1.21, -0.57)  [-0.43,  0.35)  [-1.26, -0.78)

Table 3: Sample of quantiles for a hypothetical three-dimensional dataset.

Given the set of quantiles Q, assume an instance y ∈ Y of the form [-1.15, -0.5, -0.4]. Since this instance lies in the quantile 010103, it can be represented by that quantile's code. Based on this representation process, we can obtain a one-dimensional dataset denoted as Y'. Note that the above representation method may involve a loss of information, depending on the number of quantiles.
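A possible implementation of this encoding step is sketched below in Python (ours, not from the paper); it assumes per-dimension interval edges such as those produced by the earlier quantile_edges sketch and concatenates the zero-padded interval indices into a code:

```python
import numpy as np

def encode_instance(y, edges):
    """Map an n-dimensional instance to its quantile code: the (1-based)
    interval index of each dimension, zero-padded to two digits and
    concatenated, e.g. '010305'.  edges[k] holds the N + 1 interval edges
    of dimension k."""
    code = ""
    for k, value in enumerate(y):
        # searchsorted locates the half-open interval [edge_j, edge_{j+1})
        # containing `value`; clipping keeps out-of-range values in the first
        # or last interval, so the maximum falls into the closed last interval.
        idx = int(np.searchsorted(edges[k], value, side="right"))
        idx = int(np.clip(idx, 1, len(edges[k]) - 1))
        code += f"{idx:02d}"
    return code

# Toy usage: three dimensions, five equal intervals per dimension on [0, 1].
edges = np.array([np.linspace(0.0, 1.0, 6)] * 3)
print(encode_instance([0.05, 0.42, 0.99], edges))  # -> '010305'
```

Reading the concatenated digits as a number yields the one-dimensional value that stands in for the original instance, which is the representation used in the experiments below.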
The number of quantiles is implicitly associated with the number of bins into which the space of Y_k (the data in the k-th dimension) is divided. In this regard, an appropriate selection of this value is compulsory. Typically, Sturges's rule is used [3,4]. Alternative rules, which attempt to improve on the performance of Sturges's rule without a normality assumption, are Doane's formula [2] and the Rice rule [5]. In this paper, we prefer the Rice rule, which sets the number of intervals to twice the cube root of the number of instances. In the case of 1000 instances, the Rice rule yields 20 intervals instead of the 11 recommended by Sturges's rule. One of our main goals in the near future is to carry out experiments to minimize the loss of information. Having defined the way to represent an n-dimensional data instance, in the next sections we show the experimental results, which allow us to confirm that our proposal is promising.

4 Experiments and Results

In this section, we first show, in Subsection 4.1, some preliminary results on the effectiveness of our method at preserving the properties of the original dataset in clustering analysis. Subsequently, in Subsection 4.2, we evaluate the effectiveness through a wide sample of systematic experiments that allow us to generalize the results.

4.1 Preliminary Results: preserving clustering properties

We applied an unsupervised classification algorithm to the well-known Iris dataset (available at https://archive.ics.uci.edu/ml/datasets/Iris). This dataset (in what follows Y_iris) contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

Figure 3 shows the iris data classes obtained by the k-means algorithm for three centers, using Y_iris and the encoded dataset Y'_iris. The plot was obtained using the plotcluster function of the fpc R package with the default parameters. Table 4 is the confusion matrix obtained for three classes. We can see that in this case the ratio of well-classified instances is 80%.

Fig. 3: Classification of k-means using Y_iris vs. the encoded dataset Y'_iris.

Predicted class (Y'_iris)   Reference class (Y_iris)
                            Class 1   Class 2   Class 3
class 1                     39        6         0
class 2                     11        35        4
class 3                     0         9         46

Table 4: Confusion matrix of k-means.

The above results show that the encoded dataset manages to retain the information conveyed by the original data. However, these results are not enough to generalize this observation. In the following subsection, we present an experimental methodology that allows us to generalize the effectiveness of our proposal.

4.2 General Results

In order to generalize the above results, we generated a wide set of synthetic datasets (about 5000) in R^n for n = 2, 3, ..., 100. To generate each dataset, a number of classes was defined a priori. For each dataset an encoded dataset (based on our proposal) was obtained. Subsequently, the k-means algorithm was applied to both the non-encoded (Y) and the encoded dataset (Y'). As mentioned, the effectiveness is given by the percentage matching between the class labels found with the Y dataset and the class labels found with Y'. The average result is shown in Table 5. For completeness, we show the confidence interval of the results at a significance level of 0.05.

Table 5: Average effectiveness.

                     Matching (%)
Average              0.8895
Standard Deviation   0.2181
Lower Limit          0.8806
Upper Limit          0.8983
Confidence Level     95%

The experiments show that, on average, the encoded data allows us to preserve the properties of the dataset (at least those associated with the proximity of instances) by more than 80%.
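The paper does not spell out how the matching percentage between the two sets of cluster labels is computed; one reasonable reading, sketched below in Python (our assumption, not the authors' code), is to align cluster identities by solving the assignment problem on the contingency matrix and then report the fraction of instances with agreeing labels:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_percentage(labels_a, labels_b):
    """Agreement between two clusterings of the same instances: cluster ids
    are aligned by solving the assignment problem on their contingency
    matrix, then the fraction of instances with matching labels is returned."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    k = int(max(labels_a.max(), labels_b.max())) + 1
    contingency = np.zeros((k, k), dtype=int)
    for a, b in zip(labels_a, labels_b):
        contingency[a, b] += 1
    rows, cols = linear_sum_assignment(-contingency)  # maximize total agreement
    return contingency[rows, cols].sum() / len(labels_a)

# labels_Y : k-means labels obtained on the original dataset Y
# labels_Yp: k-means labels obtained on the encoded dataset Y'
labels_Y = np.array([0, 0, 1, 1, 2, 2, 2, 1])
labels_Yp = np.array([2, 2, 0, 0, 1, 1, 1, 1])
print(matching_percentage(labels_Y, labels_Yp))  # 0.875 in this toy example
```

The alignment step matters because k-means assigns arbitrary cluster identifiers, so raw label equality would understate the actual agreement between the two clusterings.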
5 Conclusions and future work

In this paper we have described a method to encode multivariate data. Such an encoding can be interpreted as a data reduction method, in the sense that by using the codes we are able to apply data mining methods and obtain results similar to those obtained with the original data. We applied the reduction method to the iris data, and the results show good classification performance. We also showed that, statistically, our method preserves about 80% of the information conveyed by the original data. However, there are further hypotheses to test in order to generalize the reduction method. For instance, we suspect that precision in classification tasks could improve if a finer number of quantiles were used; more experiments are needed in order to explore this idea. Another line of future research is the inclusion of non-numerical data types.

6 Acknowledgments

The authors acknowledge the support of Consejo Nacional de Ciencia y Tecnología (CONACyT), Centro de Investigación y Estudios Avanzados (CINVESTAV) and the Autonomous University of Tamaulipas.

References

1. K. A. Cox, H. M. Dante, and R. J. Maher. Product appearance inspection methods and apparatus employing low variance filter, Aug. 17, 1993. US Patent 5,237,621.
2. D. P. Doane. Aesthetic frequency classifications. The American Statistician, 30(4):181–183, 1976.
3. R. J. Hyndman. The problem with Sturges' rule for constructing histograms. Monash University, 1995.
4. R. J. Hyndman and Y. Fan. Sample quantiles in statistical packages. The American Statistician, 50(4):361–365, 1996.
5. D. M. Lane. Online statistics education: An interactive multimedia course of study. http://onlinestatbook.com/2/graphing_distributions/histograms.html, 2015. Accessed: 2015-12-03.
6. J. Shlens. A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100, 2014.
7. L. Yu and H. Liu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In ICML, volume 3, pages 856–863, 2003.