          Data Type Detection for Choosing an Appropriate
            Correlation Coefficient in the Bivariate Case

                                            Anastasiia Yu. Timofeeva
                                  Novosibirsk State Technical University, Russia
                                            a.timofeeva@corp.nstu.ru




                                                            Abstract
                         Data scientists usually define a data type based on the nature of the
                         variables and select an appropriate correlation measure. However, this
                         is inconvenient and very time-consuming in data-intensive domains.
                         I propose to detect the types of variables and choose the appropriate
                         correlation coefficient in order to automate the statistical procedure of
                         correlation estimation from mixed data. This should reduce the time
                         spent on correlation analysis and increase the accuracy of the estimated
                         correlation coefficients. A continuity index is used to detect whether
                         a variable is continuous or ordered categorical. Based on a simulation
                         study, I have estimated the cutoff level for the continuity index at
                         which to choose the Pearson, the polychoric, or the polyserial
                         correlation coefficient.




1     Introduction
As a measure of linear dependence, the Pearson correlation coefficient is most commonly used due to its computational
ease and the good statistical properties of its estimate under the standard assumption of normality. Applications
of the Pearson correlation coefficient are limited to quantitative data measured on a continuous scale. In practice,
the analyst has to deal with data measured on different scales. Consequently, mixed data is subjected
to statistical procedures such as calculating the correlation coefficient. Therefore it is necessary to choose an
appropriate correlation measure that can handle such data.
   Types of input variables can be various, e.g. binary, integer, ordered categorical (e.g. item response), and
continuous. For each combination of measurement scales a certain bivariate correlation coefficient is used:

    • tetrachoric correlation between two binary variables,
    • polychoric correlation between two ordered categorical variables,
    • biserial correlation between a continuous variable and a dichotomous variable,
    • polyserial correlation between a continuous variable and an ordered categorical variable.

  The use of different correlation coefficients on the same set of data may lead to significantly different conclusions.
The authors of [Hol2010] have shown that when construct validity is analysed using ordinal data obtained
Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes.
In: S. Hölldobler, A. Malikov, C. Wernhard (eds.): YSIP2 – Proceedings of the Second Young Scientist's International Workshop
on Trends in Information Processing, Dombai, Russian Federation, May 16–20, 2017, published at http://ceur-ws.org.




from Likert scales, the factor results show a better fit to the theoretical model when the factorization is carried
out using the polychoric rather than the Pearson correlation matrix.
   In carrying out a correlation analysis, the scientist usually defines the data type and selects an appropriate
correlation measure by hand. However, this is inconvenient and very time-consuming in data-intensive domains. I
propose to detect the types of variables and choose the appropriate correlation coefficient in order to automate
the statistical procedure of correlation estimation from mixed data. This should reduce the time spent
on correlation analysis and increase the accuracy of the estimated correlation coefficients.
   It is further assumed that the data-generating process is based on a multivariate normal distribution. The
continuous data is discretized, i.e. converted to ordered categorical variables. The number of categories provides
the information needed to automatically detect whether a variable is discrete or continuous.

2     Measures of Bivariate Association
In some cases it is difficult to measure the value of a variable precisely on a quantitative scale, but very easy to
place an observation into ordered categories [Dra1988]. So, in addition to the well-known measure of correlation
between continuous variables, Karl Pearson proposed the polychoric and polyserial correlations [Pea1913]. Let us
take a closer look at their definitions and methods of estimation.

2.1   Pearson Correlation Coefficient
The Pearson correlation coefficient is widely used in both applied and scientific work. It measures the linear
dependence between two variables, which should be measured on an interval scale. When analysing
ordinal data, a simple and naive plug-in strategy is to use the discrete values as if they were continuous
and to calculate the Pearson correlation coefficient. However, this approach is inferior to other methods for analyzing
discrete data, such as using the polychoric correlation [Hol2010], [Kol2004].

2.2   Polychoric Correlation Coefficient
Briefly, let us suppose that x1 and x2 are two ordinal items with n1 and n2 categories. It can be assumed that
underlying these items are variables ξ1 and ξ2 . Their joint distribution is a bivariate standard normal distribution
with a correlation ρ between random variables ξ1 and ξ2 .
    The discrete random variables x1 and x2 are obtained by grouping, i.e. the partition of the range of values
of random variables ξ1 and ξ2 into intervals. It is assumed that x1 takes values from 1 to n1 , x2 — from 1 to
n2 . The bounds of these intervals αi1 , i = 0, 1, . . . , n1 , αj2 , j = 0, 1, . . . , n2 are called discretizing thresholds.
They are unknown and α0k = −∞, αnk k = +∞. Then the relation between xk and ξk is given by the
expression
                                       xk = i if α(i−1)k < ξk < αik , k = 1, 2.                                            (1)
The sample distribution of x1 and x2 is given by the contingency table. It contains the relative frequencies dij ,
i.e. the ratio of the number of cases in category i of item 1 and category j of item 2 to the sample size.
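The construction of this table can be sketched directly. The helper below is a Python sketch (the paper's implementation is in R), and the function name relative_freq_table is hypothetical:

```python
import numpy as np

def relative_freq_table(x1, x2):
    """Relative-frequency contingency table d_ij for two ordinal samples.

    Categories are assumed to be coded 1..n1 and 1..n2
    (a hypothetical helper, not the paper's code)."""
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    n1, n2 = x1.max(), x2.max()
    d = np.zeros((n1, n2))
    for a, b in zip(x1, x2):
        d[a - 1, b - 1] += 1          # count cases in cell (a, b)
    return d / len(x1)                # divide by the sample size
```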
    The theoretical probability pij = P (x1 = i, x2 = j) corresponding to dij is defined as

                             pij = P (x1 = i, x2 = j) = P (α(i−1)1 < ξ1 < αi1 , α(j−1)2 < ξ2 < αj2 ) =
                = Φ2 (αi1 , αj2 , ρ) − Φ2 (α(i−1)1 , αj2 , ρ) − Φ2 (αi1 , α(j−1)2 , ρ) + Φ2 (α(i−1)1 , α(j−1)2 , ρ)        (2)

where Φ2 (z1 , z2 , ρ) is bivariate standard normal distribution function with correlation ρ between random variables
ξ1 and ξ2 .
   The problem is to estimate the unknown parameters of the bivariate distribution of the random variables x1 and
x2 based on the observed values dij . The estimate of ρ in this model is called the polychoric correlation coefficient.
   In this study I consider a two-step approach [Ols1979]. The first step is to find estimates for the thresholds αik
as quantiles of the corresponding marginal distributions:
              α̂i1 = Φ−1( ∑_{l=1}^{i} ∑_{j=1}^{n2} dlj ),  i = 1, . . . , n1 − 1,        α̂j2 = Φ−1( ∑_{l=1}^{j} ∑_{i=1}^{n1} dil ),  j = 1, . . . , n2 − 1


where Φ(·) is the standard normal distribution function, Φ−1 (·) is the quantile function.




  In the second step the estimates of thresholds are substituted into (2) and theoretical probabilities are
considered as a function of unknown parameter ρ. For its estimation the maximum likelihood method is used.
For the joint discrete distribution of random variables x1 and x2 under the assumption of independence of
observations the average log-likelihood of the sample [Ols1979] is
                                                 ℓ̂ = ∑_{i,j∈U} dij ln pij                                               (3)


where the finite set U = {i, j : dij ̸= 0 and pij ̸= 0} serves to avoid an infinite value of the function ℓ̂ when
dij = pij = 0. In (3) each dij is a fixed value for the given sample.
   The Pearson and polychoric correlation coefficients have similar properties. The correlation coefficient takes
values between −1 and +1 inclusive, and it is symmetric: the correlation between x1 and x2 is the same as the
correlation between x2 and x1 .
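The two-step procedure above can be sketched end to end. The following is a Python sketch of Olsson's two-step method under the paper's assumptions (thresholds from the cumulative marginals, then maximization of (3) over ρ); it is not the paper's R code, and the function name is hypothetical:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def polychoric_two_step(d):
    """Two-step polychoric estimate from a relative-frequency table d (n1 x n2).

    Step 1: thresholds as normal quantiles of the cumulative marginals.
    Step 2: maximize the average log-likelihood (3) with respect to rho."""
    a1 = norm.ppf(np.cumsum(d.sum(axis=1)))[:-1]   # thresholds for item 1
    a2 = norm.ppf(np.cumsum(d.sum(axis=0)))[:-1]   # thresholds for item 2
    t1 = np.concatenate(([-np.inf], a1, [np.inf]))
    t2 = np.concatenate(([-np.inf], a2, [np.inf]))

    def neg_loglik(rho):
        cov = [[1.0, rho], [rho, 1.0]]

        def Phi2(z1, z2):
            # bivariate standard normal CDF; clip infinities for the integrator
            z1 = np.clip(z1, -10, 10)
            z2 = np.clip(z2, -10, 10)
            return multivariate_normal.cdf([z1, z2], mean=[0, 0], cov=cov)

        ll = 0.0
        for i in range(d.shape[0]):
            for j in range(d.shape[1]):
                if d[i, j] == 0:                   # the set U of formula (3)
                    continue
                p = (Phi2(t1[i + 1], t2[j + 1]) - Phi2(t1[i], t2[j + 1])
                     - Phi2(t1[i + 1], t2[j]) + Phi2(t1[i], t2[j]))
                ll += d[i, j] * np.log(max(p, 1e-12))
        return -ll

    return minimize_scalar(neg_loglik, bounds=(-0.999, 0.999),
                           method="bounded").x
```

The rectangle probability inside the loop is exactly formula (2), evaluated cell by cell; a production implementation would vectorize this step, as discussed in Section 4.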

2.3   Polyserial Correlation Coefficient
If a latent correlation between a continuous variable and an ordered categorical variable is assumed, then the
polyserial correlation coefficient is the most appropriate correlation measure. In this case one variable x1 , with
underlying standard normal ξ1 , is assumed to be discrete; it is expressed by (1) with k = 1. The other observed
variable x2 is considered to be continuous and standard normally distributed. In practice the variable defined as
continuous should be standardized, so that its mean becomes zero and its standard deviation becomes one.
   According to [Dra1988] the log-likelihood function for the joint distribution of random vector (x1 , x2 ) from a
sample of n observations (xi1 , xi2 ) is
                                   log L = ∑_{i=1}^{n} [ log ϕ(xi2 ) + log P (x1 = xi1 | x2 = xi2 ) ]               (4)

where ϕ(·) is the standard normal density function.
   The conditional distribution of ξ1 given x2 = xi2 is normal with mean ρxi2 and variance 1 − ρ2 . Then if
xi1 = j with categories j = 1, . . . , n1 , the resulting conditional probability is
                         P (x1 = j | x2 = xi2 ) = Φ( (αj1 − ρxi2 ) / √(1 − ρ²) ) − Φ( (α(j−1)1 − ρxi2 ) / √(1 − ρ²) ).          (5)

A two-step approach to estimating the polyserial correlation assumes that the discretizing thresholds in (5) can be
computed by the formula

                                    α̂j1 = Φ−1( ∑_{l=1}^{j} dl ), j = 1, . . . , n1 − 1

where dl is the relative frequency, i.e. the ratio of the number of cases in category l of item 1 to the sample size.
   Maximizing the log-likelihood function (4) with respect to ρ yields the polyserial correlation estimate. Note
that the polyserial correlation coefficient is not symmetric: for estimating the polyserial correlation it matters
which of the variables is assumed to be continuous and which discrete.
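The two-step polyserial estimate described above can be sketched as follows. This is a Python sketch under the section's assumptions (x1 ordinal coded 1..n1, x2 continuous), not the paper's R implementation, and the function name is hypothetical:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def polyserial_two_step(x1, x2):
    """Two-step polyserial estimate: x1 ordinal (coded 1..n1), x2 continuous.

    Thresholds come from the marginal distribution of x1; rho maximizes
    the log-likelihood (4)-(5)."""
    x1 = np.asarray(x1)
    x2 = np.asarray(x2, dtype=float)
    x2 = (x2 - x2.mean()) / x2.std()       # standardize the continuous item
    n1 = x1.max()
    d = np.array([(x1 == j).mean() for j in range(1, n1 + 1)])
    # alpha_0 = -inf, alpha_{n1} = +inf, interior thresholds from cum. freqs
    a = np.concatenate(([-np.inf], norm.ppf(np.cumsum(d))[:-1], [np.inf]))

    def neg_loglik(rho):
        s = np.sqrt(1.0 - rho ** 2)
        upper = norm.cdf((a[x1] - rho * x2) / s)        # Phi at alpha_j
        lower = norm.cdf((a[x1 - 1] - rho * x2) / s)    # Phi at alpha_{j-1}
        p = np.clip(upper - lower, 1e-12, None)
        # the log phi(x_i2) term of (4) does not depend on rho, so it is dropped
        return -np.log(p).sum()

    return minimize_scalar(neg_loglik, bounds=(-0.999, 0.999),
                           method="bounded").x
```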

3     Data Type Detection
All actual sample spaces are discrete, and all observable random variables have discrete distributions [Pit1979].
To detect whether a variable is continuous or discrete, you need to understand the nature of the data. A
variable can be considered continuous if there is an infinite number of possible values it can take between
any two different points in its range. Any measurement of such a variable will nevertheless be discrete. In actual
practice, a variable is often treated as continuous when it can take a sufficiently large number of different
values. It sometimes makes sense to treat a continuous variable as ordered categorical; this is usually just another
kind of binning.
   Discrete data can only take particular values. If a variable can take one of a limited number of possible
values, referred to as levels, it is a categorical variable. A categorical data type where the variable has natural,




ordered categories is ordinal (ordered categorical) data. The distance between the categories is considered
unknown. It is generally not correct to treat ordered categorical data as continuous.
   In data-intensive analysis it is almost impossible to determine the nature of each variable. It is necessary to
formulate a simple rule that detects whether a variable can be considered continuous or ordered
categorical. It seems logical to count the number of unique values of the variable and relate it to the sample size.
If the number of different values is sufficiently large, the variable can be considered continuous.
   Let me introduce the continuity index of the k-th variable, defined as the ratio of the number nk of categories
(unique values) to the sample size n:
                                                       γk = nk / n.
The problem is to define the cutoff at which a discrete variable will be considered continuous. The author did
not find detailed recommendations on this subject. The documentation of the package ‘treeplyr’ for the statistical
environment R gives a default value of cutoff = 0.1 for deciding whether numeric data might actually be discrete. The
continuity index γk should exceed the cutoff, or the data will be classified as discrete. However, this cutoff value is
not justified. It is therefore necessary to carry out simulation studies to identify the most appropriate coefficient
for different values of the continuity indices γ1 and γ2 .
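The resulting selection rule is simple enough to state in a few lines of code. The following Python sketch uses the cutoff of 0.05 supported by the simulation study in Section 5; the function names are hypothetical:

```python
import numpy as np

def continuity_index(x):
    """gamma_k = (number of unique values) / (sample size)."""
    x = np.asarray(x)
    return np.unique(x).size / x.size

def choose_coefficient(x1, x2, cutoff=0.05):
    """Select a correlation measure from the continuity indices."""
    g1, g2 = continuity_index(x1), continuity_index(x2)
    if g1 > cutoff and g2 > cutoff:
        return "pearson"
    if g1 > cutoff:   # x1 continuous, x2 ordered categorical
        return "polyserial(x2 ordinal, x1 continuous)"
    if g2 > cutoff:   # x2 continuous, x1 ordered categorical
        return "polyserial(x1 ordinal, x2 continuous)"
    return "polychoric"
```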

4      Software Implementation
Unfortunately, neither the polyserial nor the polychoric correlation is typically used in statistical analysis.
Nevertheless, functions (or packages) to calculate them are available in many statistical programs such as SPSS,
SAS, Stata, and R. The first three of these programs are proprietary software, so I decided to focus on the
free software for statistical computing, R.
    There are two R packages which provide functions to calculate polychoric and polyserial correlations: polycor
and psych. The functions polyserial{psych} and polychoric{psych} have the drawback that the calculation is
not performed if there are more than 8 categories for any item. This is a very substantial limitation: it is not
possible to use these functions for a simulation study.
    The functions polyserial{polycor} and polychor{polycor} do not have explicit restrictions on the dimen-
sion of contingency tables. However, polychor works very slowly with a large number of categories. The most
computationally expensive stage is the calculation of the bivariate normal distribution function at each
iteration of the optimization algorithm. Here it is performed by the internal function binBvn, which uses the func-
tion pmvnorm{mvtnorm}. The function pmvnorm returns a scalar value for a given interval in p-dimensional space
(in our case, two-dimensional), meaning it is not vectorized. To calculate the probabilities for each
combination α̂i1 , α̂j2 the function binBvn uses a nested loop over the indices i, j. This is unfortunate and makes
polychor slow for a large number of categories, since loops are very slow in the interpreted language R.
    For this reason, I have implemented the calculation of the polychoric correlation coefficient using a standard set
of R packages. To do this, several user functions have been written.
    The function FuncCalcPolychor takes as inputs the current value of the correlation coefficient ρt , the threshold values
α̂i1 , α̂j2 , and the frequency table dij . The algorithm can be divided into two steps.

    1. Based on the values of ρt , α̂i1 , α̂j2 , a two-dimensional array is calculated containing the values of the
       bivariate standard normal distribution function for all combinations α̂i1 , α̂j2 . A last row with the values
       Φ(α̂j2 ) and a last column with the values Φ(α̂i1 ) are appended, with a one placed at their intersection.
       At the beginning of the array a zero row and a zero column are added. On the basis of this array an
       array of probabilities pij is constructed according to (2). Negative probability values that may occur
       due to inaccuracies in the calculation are reset to zero.

    2. Based on the values of dij and the values pij obtained in step 1, the value ℓ̂ is calculated and returned.

   In step 1, the values of the bivariate standard normal distribution function can be calculated with the pmnorm
function. This requires installation of the package mnormt, which works by making a suitable call to a Fortran-77
routine written by Alan Genz. The function pmnorm is vectorized: it returns a vector of values for an input matrix
of size N × 2, where N is the number of points at which probabilities are calculated, in our case N = n1 n2 . To create
all combinations, the standard function expand.grid is used.
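The expand.grid plus vectorized-CDF approach described above has a direct analogue in other languages. The following Python sketch (not the R code itself; the function name is hypothetical) evaluates the bivariate standard normal CDF at all threshold combinations in one batched call:

```python
import numpy as np
from scipy.stats import multivariate_normal

def bvn_cdf_grid(a1, a2, rho):
    """Bivariate standard normal CDF at every combination of the thresholds
    a1 (length n1 - 1) and a2 (length n2 - 1), batched instead of looped."""
    g1, g2 = np.meshgrid(a1, a2, indexing="ij")        # expand.grid analogue
    pts = np.column_stack([g1.ravel(), g2.ravel()])    # N x 2, N = n1 * n2
    vals = multivariate_normal.cdf(pts, mean=[0, 0],
                                   cov=[[1.0, rho], [rho, 1.0]])
    return np.reshape(vals, (len(a1), len(a2)))
```

As in the R implementation, the point is that a single call over an N × 2 matrix of points replaces a nested loop over i, j, which matters in interpreted languages.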




    Alternatively (without additional packages), I have implemented a number of user functions for calculating
the two-dimensional matrix of values of the bivariate standard normal distribution function. They are written
on the basis of the algorithm proposed in [Mey2013]. I took the C code presented in [Mey2013] and rewrote it
in R with the addition of vectorization. In other words, all functions processing scalar values (or vectors) are
replaced by functions processing vectors (or matrices), such as ifelse, apply, outer, pmin. Some errors were
also corrected, such as division by zero when any of the values α̂i1 , α̂j2 is zero; zero is replaced by 10−5 .
    Finally, the function PolychorEst takes as inputs vectors of sampled values of the observed variables x1 and
x2 . Its work can also be divided into two steps.

  1. Filling a table of relative frequencies dij for the given vectors x1 and x2 . Calculation of the cumulative sums
     ∑_{l=1}^{i} ∑_{j=1}^{n2} dlj and ∑_{l=1}^{j} ∑_{i=1}^{n1} dil , and from them of α̂i1 , α̂j2 , using the standard
     function qnorm{stats} to calculate quantiles of the normal distribution.

  2. Optimization of the FuncCalcPolychor function with respect to ρt for the values α̂i1 , α̂j2 , dij calculated in
     step 1. The one-parameter optimization is carried out using the basic function optimize{stats} on the interval
     ρ ∈ (−1, 1).

   In addition, I have implemented the calculation of the polyserial correlation coefficient using a standard set of
R packages. The user function PolyserialEst is quite simple. It standardizes the variable which is assumed to
be continuous and then optimizes the FuncCalcPolyserial function with respect to ρt for the given sample values
of x1 and the standardized values of x2 . The function FuncCalcPolyserial calculates the log-likelihood function L
according to (4), using the standard functions pnorm and qnorm for the distribution function and quantile function
of the normal distribution.

5      Simulation Study
For the simulation study the following model example was used. The random variable ξ1 was simulated from a
standard normal distribution. The random variable ξ2 was defined as

                                                       ξ2 = ξ1 + ε

where ε is a normally distributed random variable with mean zero and standard deviation σε . The value of σε
depended on the value of the correlation coefficient ρ specified by the experimental design; these values are
related by

                                             σε = σx √(1/ρ² − 1).
   Then the variables ξ1 and ξ2 were grouped. Intervals equidistant in probability were used, with boundaries
defined by the sample quantiles at probabilities 0, 1/nk , . . . , (nk − 1)/nk , 1. As the value of the variable xk at
the i-th level, the sample mean of all values in the i-th interval is taken.
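The generation and grouping scheme above can be sketched as follows. This is a Python sketch of the simulation design (the study itself was run in R), with σx = 1 since ξ1 is standard normal; the function name is hypothetical:

```python
import numpy as np

def simulate_discretized_pair(rho, n, n1, n2, rng):
    """Generate (xi1, xi2) with correlation rho, then discretize each into
    n_k quantile-based bins, replacing each value by its bin's sample mean."""
    xi1 = rng.standard_normal(n)
    sigma_eps = np.sqrt(1.0 / rho ** 2 - 1.0)     # sigma_x = 1 here
    xi2 = xi1 + sigma_eps * rng.standard_normal(n)

    def group(xi, nk):
        # bin boundaries: sample quantiles at probabilities 0, 1/nk, ..., 1
        qs = np.quantile(xi, np.linspace(0.0, 1.0, nk + 1))
        idx = np.clip(np.searchsorted(qs, xi, side="right") - 1, 0, nk - 1)
        means = np.array([xi[idx == i].mean() for i in range(nk)])
        return means[idx]                          # bin mean as level value

    return group(xi1, n1), group(xi2, n2)
```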
   The value of ρ was set to 0.5 and the sample size n to 500. The values of nk were taken from 5, 10, 25, 50, 100, 250,
and the correlation coefficients were estimated. Results were averaged over 1,000 repetitions. Figure 1 shows the
average value ρ̂ of the Pearson correlation coefficient. Figure 2 shows the average value ρ̂ of the polyserial correlation
coefficient when the first variable is considered ordered categorical and the second is assumed to be
continuous. Figure 3 shows the average value ρ̂ of the polyserial correlation coefficient when the second variable is
considered ordered categorical and the first is assumed to be continuous.
   The average values of the polychoric correlation are very close to 0.5 for all combinations of the continuity index
values; the polychoric correlation coefficient thus gives the estimate closest to the true value.
However, when the continuity index γk is high the calculation is very slow. Table 1 shows the user time of
running the R functions to calculate one value of the polychoric correlation. The developed user function
PolychorEst runs about six times faster than the function polychor{polycor} when γk ≥ 0.05. However,
when γ2 = 0.05 and γ1 = 0.50 the calculation takes nearly a minute. Therefore it is recommended to use a simpler
correlation coefficient (in particular, the Pearson correlation) if it provides acceptable accuracy.
   Figure 1 shows that, for a fixed value of γ2 and γ1 exceeding 0.05, changes in γ1 do not significantly
decrease the bias of the Pearson correlation coefficient. The values γ1 > 0.05 and γ2 > 0.05 lead to a small bias




                    Table 1: The total user CPU time of the R process, in seconds, γ2 = 0.05

            Function            γ1 = 0.01         γ1 = 0.02      γ1 = 0.05     γ1 = 0.10   γ1 = 0.20   γ1 = 0.50
            polychor{polycor}        0.22              0.78           3.07         15.01       53.75      319.21
            PolychorEst              0.17              0.22           0.52          1.95        7.92       52.75




                     [Plot: ρ̂ versus γ1 (0.0–0.5) for γ2 = 0.01, 0.02, 0.05; vertical axis from 0.46 to 0.50]

                                           Figure 1: Pearson correlation coefficient

in the Pearson correlation coefficient. Thereby, exceeding the continuity index cutoff of 0.05 makes it possible to
estimate the Pearson correlation coefficient.
   Figure 2 shows that the values of the polyserial correlation depend only weakly on γ1 . This is because the
first variable x1 is assumed categorical, so the quality of the correlation estimate depends on the continuity
of the second variable. The results shown in Fig. 3 do not depend on the values of γ2 . In these cases too, if the
continuity index exceeds the cutoff of 0.05, the bias of the correlation coefficient is quite small.

6   Conclusions
It is proposed to choose an appropriate correlation measure based on data type detection. The continuity index γk
makes it possible to detect whether a variable is continuous or ordered categorical. The simulation study revealed
that exceeding the γk cutoff of 0.05 for both variables allows the variables to be classified as continuous, so the
Pearson correlation coefficient can be calculated. If only γ1 exceeds the cutoff of 0.05, you can use the polyserial
correlation of ordered categorical x2 and continuous x1 . If only γ2 exceeds the cutoff of 0.05, it makes sense to
estimate the polyserial correlation of ordered categorical x1 and continuous x2 . The polychoric correlation
coefficient provides the best quality of estimation regardless of the continuity index, but its calculation is
very slow if the number of categories is large. Therefore it is recommended to use the polychoric correlation
only if both variables have a continuity index of less than 0.05.

Acknowledgements
The reported study was funded by Russian Ministry of Education and Science, according to the research project
No. 2.2327.2017/ПЧ.

References
[Hol2010]    F. P. Holgado-Tello, et al. Polychoric versus Pearson correlations in exploratory and confirmatory
             factor analysis of ordinal variables. Quality & Quantity, 44(1):153–166, 2010.

[Dra1988] F. Drasgow. Encyclopedia of statistical sciences / Polychoric and polyserial correlations. John Wiley
          & Sons, 1988.




                     [Plot: ρ̂ versus γ1 (0.0–0.5) for γ2 = 0.01, 0.02, 0.05; vertical axis from 0.47 to 0.50]

                   Figure 2: Polyserial correlation of ordered categorical x1 and continuous x2
                     [Plot: ρ̂ versus γ1 (0.0–0.5); vertical axis from 0.47 to 0.50]

                   Figure 3: Polyserial correlation of ordered categorical x2 and continuous x1

[Pea1913] K. Pearson, and D. Heron. On theories of association. Biometrika, 9(1/2):159–315, 1913.
[Ols1979]   U. Olsson. Maximum Likelihood Estimation of the Polychoric Correlation Coefficient. Psychometrika,
            44(4):443–460, 1979.
[Kol2004]   S. Kolenikov, and G. Angeles. The use of discrete data in PCA: theory, simulations, and applications
            to socioeconomic indices. Chapel Hill, Carolina Population Center, University of North Carolina,
            2004.

[Pit1979]   E. J. G. Pitman. Some basic theory for statistical inference. London, Chapman and Hall, 1979.
[Mey2013] C. Meyer. Recursive Numerical Evaluation of the Cumulative Bivariate Normal Distribution. Journal
          of Statistical Software, 52(10):1–14, 2013.



