          Data Type Detection for Choosing an Appropriate
            Correlation Coefficient in the Bivariate Case

                                            Anastasiia Yu. Timofeeva
                                  Novosibirsk State Technical University, Russia
                                            a.timofeeva@corp.nstu.ru




                                                            Abstract
                         Data scientists usually define a data type based on the nature of the
                         variables and select an appropriate correlation measure. However, this
                         is inconvenient and very time-consuming in data-intensive domains.
                         I propose to detect the types of variables and choose the appropriate
                         correlation coefficient in order to automate the statistical procedure of
                         correlation estimation from mixed data. This should reduce the time
                         spent on correlation analysis and increase the accuracy of the estimated
                         correlation coefficients. A continuity index is used to detect whether
                         a variable is continuous or ordered categorical. Based on a simulation
                         study, I have estimated the cutoff level for the continuity index at
                         which to choose the Pearson, the polychoric, or the polyserial
                         correlation coefficient.




1     Introduction
As a measure of linear dependence, the Pearson correlation coefficient is most commonly used due to its computational
ease and the good statistical properties of its estimate under the standard assumption of normality. Applications
of the Pearson correlation coefficient are limited to quantitative data measured on a continuous scale. In practice,
the analyst has to deal with data measured on different scales. Consequently, mixed data is subjected
to statistical procedures such as calculating the correlation coefficient. Therefore it is necessary to choose an
appropriate correlation measure that can handle such data.
   Types of input variables can be various, e.g. binary, integer, ordered categorical (e.g. item response), and
continuous. For each combination of measurement scales a certain bivariate correlation coefficient is used:

    • tetrachoric correlation between two binary variables,
    • polychoric correlation between two ordered categorical variables,
    • biserial correlation between a continuous variable and a dichotomous variable,
    • polyserial correlation between a continuous variable and an ordered categorical variable.

  The use of different correlation coefficients on the same set of data may lead to significantly different conclusions.
The authors of [Hol2010] have shown that when construct validity is analysed using ordinal data obtained
Copyright © 2017 by the paper's authors. Copying permitted for private and academic purposes.
In: S. Hölldobler, A. Malikov, C. Wernhard (eds.): YSIP2 – Proceedings of the Second Young Scientist's International Workshop
on Trends in Information Processing, Dombai, Russian Federation, May 16–20, 2017, published at http://ceur-ws.org.




from Likert scales, the factor results show a better fit to the theoretical model when the factorization is carried
out using the polychoric rather than the Pearson correlation matrix.
   In carrying out a correlation analysis, the scientist usually defines the data type and selects an appropriate
correlation measure by hand. However, this is inconvenient and very time-consuming in data-intensive domains. I
propose to detect the types of variables and choose the appropriate correlation coefficient in order to automate
the statistical procedure of correlation estimation from mixed data. This should reduce the time spent
on correlation analysis and increase the accuracy of the estimated correlation coefficients.
   It is further assumed that the data-generating process is based on a multivariate normal distribution. The
continuous data is discretized, i.e. converted to ordered categorical variables. The number of categories provides
the information needed to automatically detect whether a variable is discrete or continuous.

2     Measures of Bivariate Association
In some cases it is difficult to measure the value of a variable precisely on a quantitative scale, but very easy to
place an observation into ordered categories [Dra1988]. So, in addition to the well-known measure of correlation
between continuous variables, Karl Pearson proposed the polychoric and polyserial correlations [Pea1913]. Let us
take a closer look at their definitions and methods of estimation.

2.1   Pearson Correlation Coefficient
The Pearson correlation coefficient is widely used in both applied and scientific work. It measures the linear
dependence between two variables, which should be measured on an interval scale. When analysing
ordinal data, a simple and naive plug-in strategy is to use the discrete values as if they were continuous
and to calculate the Pearson correlation coefficient. However, this approach is inferior to other methods for analyzing
discrete data, such as using the polychoric correlation [Hol2010], [Kol2004].

2.2   Polychoric Correlation Coefficient
Briefly, let us suppose that x1 and x2 are two ordinal items with n1 and n2 categories. It can be assumed that
underlying these items are variables ξ1 and ξ2 . Their joint distribution is a bivariate standard normal distribution
with a correlation ρ between random variables ξ1 and ξ2 .
    The discrete random variables x1 and x2 are obtained by grouping, i.e. the partition of the range of values
of random variables ξ1 and ξ2 into intervals. It is assumed that x1 takes values from 1 to n1 , x2 — from 1 to
n2 . The bounds of these intervals αi1 , i = 0, 1, . . . , n1 , αj2 , j = 0, 1, . . . , n2 are called discretizing thresholds.
They are unknown and α0k = −∞, αnk k = +∞. Then the relation between xk and ξk is given by the
expression
                                       xk = i if α(i−1)k < ξk < αik , k = 1, 2.                                            (1)
The sample distribution of x1 and x2 is given by the contingency table. It contains the relative frequencies dij ,
i.e. the ratio of the number of cases in category i of item 1 and category j of item 2 to the sample size.
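The construction of this table can be sketched directly. The helper below is a Python sketch (the paper's implementation is in R), and the function name relative_freq_table is hypothetical:

```python
import numpy as np

def relative_freq_table(x1, x2):
    """Relative-frequency contingency table d_ij for two ordinal samples.

    Categories are assumed to be coded 1..n1 and 1..n2
    (a hypothetical helper, not the paper's code)."""
    x1 = np.asarray(x1)
    x2 = np.asarray(x2)
    n1, n2 = x1.max(), x2.max()
    d = np.zeros((n1, n2))
    for a, b in zip(x1, x2):
        d[a - 1, b - 1] += 1          # count cases in cell (a, b)
    return d / len(x1)                # divide by the sample size
```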
    The theoretical probability pij = P (x1 = i, x2 = j) corresponding to dij is defined as

                             pij = P (x1 = i, x2 = j) = P (α(i−1)1 < ξ1 < αi1 , α(j−1)2 < ξ2 < αj2 ) =
                = Φ2 (αi1 , αj2 , ρ) − Φ2 (α(i−1)1 , αj2 , ρ) − Φ2 (αi1 , α(j−1)2 , ρ) + Φ2 (α(i−1)1 , α(j−1)2 , ρ)        (2)

where Φ2 (z1 , z2 , ρ) is bivariate standard normal distribution function with correlation ρ between random variables
ξ1 and ξ2 .
   The problem is to estimate the unknown parameters of the bivariate distribution of the random variables x1 and
x2 based on the observed values dij . The estimate of ρ in this model is called the polychoric correlation coefficient.
   In this study I consider a two-step approach [Ols1979]. The first step is to find estimates for the thresholds αik
as quantiles of the corresponding marginal distributions:
              α̂i1 = Φ−1( ∑_{l=1}^{i} ∑_{j=1}^{n2} dlj ),  i = 1, . . . , n1 − 1,        α̂j2 = Φ−1( ∑_{l=1}^{j} ∑_{i=1}^{n1} dil ),  j = 1, . . . , n2 − 1


where Φ(·) is the standard normal distribution function, Φ−1 (·) is the quantile function.




  In the second step the estimates of thresholds are substituted into (2) and theoretical probabilities are
considered as a function of unknown parameter ρ. For its estimation the maximum likelihood method is used.
For the joint discrete distribution of random variables x1 and x2 under the assumption of independence of
observations the average log-likelihood of the sample [Ols1979] is
                                                 ℓ̂ = ∑_{i,j∈U} dij ln pij                                               (3)


where the finite set U = {i, j : dij ̸= 0 and pij ̸= 0} serves to avoid an infinite value of the function ℓ̂ when
dij = pij = 0. In (3) each dij is a fixed value for the given sample.
   The Pearson and polychoric correlation coefficients have similar properties. The correlation coefficient takes
values between −1 and +1 inclusive, and it is symmetric: the correlation between x1 and x2 is the same as the
correlation between x2 and x1 .
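The two-step procedure above can be sketched end to end. The following is a Python sketch of Olsson's two-step method under the paper's assumptions (thresholds from the cumulative marginals, then maximization of (3) over ρ); it is not the paper's R code, and the function name is hypothetical:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def polychoric_two_step(d):
    """Two-step polychoric estimate from a relative-frequency table d (n1 x n2).

    Step 1: thresholds as normal quantiles of the cumulative marginals.
    Step 2: maximize the average log-likelihood (3) with respect to rho."""
    a1 = norm.ppf(np.cumsum(d.sum(axis=1)))[:-1]   # thresholds for item 1
    a2 = norm.ppf(np.cumsum(d.sum(axis=0)))[:-1]   # thresholds for item 2
    t1 = np.concatenate(([-np.inf], a1, [np.inf]))
    t2 = np.concatenate(([-np.inf], a2, [np.inf]))

    def neg_loglik(rho):
        cov = [[1.0, rho], [rho, 1.0]]

        def Phi2(z1, z2):
            # bivariate standard normal CDF; clip infinities for the integrator
            z1 = np.clip(z1, -10, 10)
            z2 = np.clip(z2, -10, 10)
            return multivariate_normal.cdf([z1, z2], mean=[0, 0], cov=cov)

        ll = 0.0
        for i in range(d.shape[0]):
            for j in range(d.shape[1]):
                if d[i, j] == 0:                   # the set U of formula (3)
                    continue
                p = (Phi2(t1[i + 1], t2[j + 1]) - Phi2(t1[i], t2[j + 1])
                     - Phi2(t1[i + 1], t2[j]) + Phi2(t1[i], t2[j]))
                ll += d[i, j] * np.log(max(p, 1e-12))
        return -ll

    return minimize_scalar(neg_loglik, bounds=(-0.999, 0.999),
                           method="bounded").x
```

The rectangle probability inside the loop is exactly formula (2), evaluated cell by cell; a production implementation would vectorize this step, as discussed in Section 4.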

2.3   Polyserial Correlation Coefficient
If a latent correlation between a continuous variable and an ordered categorical variable is assumed, then the
polyserial correlation coefficient is the most appropriate correlation measure. In this case one variable x1 , with
underlying standard normal ξ1 , is assumed to be discrete; it is expressed by (1) with k = 1. The other observed
variable x2 is considered to be continuous and standard normally distributed. In practice the variable defined as
continuous should be standardized, so that its mean becomes zero and its standard deviation becomes one.
   According to [Dra1988] the log-likelihood function for the joint distribution of random vector (x1 , x2 ) from a
sample of n observations (xi1 , xi2 ) is
                                   log L = ∑_{i=1}^{n} [ log ϕ(xi2 ) + log P (x1 = xi1 | x2 = xi2 ) ]               (4)

where ϕ(·) is the standard normal density function.
   The conditional distribution of ξ1 given x2 = xi2 is normal with mean ρxi2 and variance 1 − ρ2 . Then if
xi1 = j with categories j = 1, . . . , n1 , the resulting conditional probability is
                         P (x1 = j | x2 = xi2 ) = Φ( (αj1 − ρxi2 ) / √(1 − ρ²) ) − Φ( (α(j−1)1 − ρxi2 ) / √(1 − ρ²) ).          (5)

A two-step approach to estimating the polyserial correlation assumes that the discretizing thresholds in (5) can be
computed by the formula

                                    α̂j1 = Φ−1( ∑_{l=1}^{j} dl ), j = 1, . . . , n1 − 1

where dl is the relative frequency, i.e. the ratio of the number of cases in category l of item 1 to the sample size.
   Maximizing the log-likelihood function (4) with respect to ρ yields the polyserial correlation estimate. Note
that the polyserial correlation coefficient is not symmetric: for estimating the polyserial correlation it matters
which of the variables is assumed to be continuous and which discrete.
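The two-step polyserial estimate described above can be sketched as follows. This is a Python sketch under the section's assumptions (x1 ordinal coded 1..n1, x2 continuous), not the paper's R implementation, and the function name is hypothetical:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def polyserial_two_step(x1, x2):
    """Two-step polyserial estimate: x1 ordinal (coded 1..n1), x2 continuous.

    Thresholds come from the marginal distribution of x1; rho maximizes
    the log-likelihood (4)-(5)."""
    x1 = np.asarray(x1)
    x2 = np.asarray(x2, dtype=float)
    x2 = (x2 - x2.mean()) / x2.std()       # standardize the continuous item
    n1 = x1.max()
    d = np.array([(x1 == j).mean() for j in range(1, n1 + 1)])
    # alpha_0 = -inf, alpha_{n1} = +inf, interior thresholds from cum. freqs
    a = np.concatenate(([-np.inf], norm.ppf(np.cumsum(d))[:-1], [np.inf]))

    def neg_loglik(rho):
        s = np.sqrt(1.0 - rho ** 2)
        upper = norm.cdf((a[x1] - rho * x2) / s)        # Phi at alpha_j
        lower = norm.cdf((a[x1 - 1] - rho * x2) / s)    # Phi at alpha_{j-1}
        p = np.clip(upper - lower, 1e-12, None)
        # the log phi(x_i2) term of (4) does not depend on rho, so it is dropped
        return -np.log(p).sum()

    return minimize_scalar(neg_loglik, bounds=(-0.999, 0.999),
                           method="bounded").x
```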

3     Data Type Detection
All actual sample spaces are discrete, and all observable random variables have discrete distributions [Pit1979].
To detect whether a variable is continuous or discrete, you need to understand the nature of the data. A
variable can be considered continuous if there is an infinite number of possible values it can take between
any two different points in its range. Any measurement of such a variable will nevertheless be discrete. In actual
practice, a variable is often treated as continuous when it can take a sufficiently large number of different
values. It sometimes makes sense to treat a continuous variable as ordered categorical; this is usually just another
kind of binning.
   Discrete data can only take particular values. If a variable can take one of a limited number of possible
values, referred to as levels, it is a categorical variable. A categorical data type where the variable has natural,




ordered categories is ordinal (ordered categorical) data. The distance between the categories is considered
unknown. It is generally not correct to treat ordered categorical data as continuous.
   In data-intensive analysis it is almost impossible to determine the nature of each variable. It is necessary to
formulate a simple rule that detects whether a variable can be considered continuous or ordered
categorical. It seems logical to count the number of unique values of the variable and relate it to the sample size.
If the number of different values is sufficiently large, the variable can be considered continuous.
   Let me introduce the continuity index of the k-th variable, defined as the ratio of the number nk of categories
(unique values) to the sample size n:
                                                       γk = nk / n.
The problem is to define the cutoff at which a discrete variable will be considered continuous. The author did
not find detailed recommendations on this subject. The documentation of the package ‘treeplyr’ for the statistical
environment R gives a default value of cutoff = 0.1 for deciding whether numeric data might actually be discrete. The
continuity index γk should exceed the cutoff, or the data will be classified as discrete. However, this cutoff value is
not justified. It is therefore necessary to carry out simulation studies to identify the most appropriate coefficient
for different values of the continuity indices γ1 and γ2 .
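The resulting selection rule is simple enough to state in a few lines of code. The following Python sketch uses the cutoff of 0.05 supported by the simulation study in Section 5; the function names are hypothetical:

```python
import numpy as np

def continuity_index(x):
    """gamma_k = (number of unique values) / (sample size)."""
    x = np.asarray(x)
    return np.unique(x).size / x.size

def choose_coefficient(x1, x2, cutoff=0.05):
    """Select a correlation measure from the continuity indices."""
    g1, g2 = continuity_index(x1), continuity_index(x2)
    if g1 > cutoff and g2 > cutoff:
        return "pearson"
    if g1 > cutoff:   # x1 continuous, x2 ordered categorical
        return "polyserial(x2 ordinal, x1 continuous)"
    if g2 > cutoff:   # x2 continuous, x1 ordered categorical
        return "polyserial(x1 ordinal, x2 continuous)"
    return "polychoric"
```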

4      Software Implementation
Unfortunately, neither the polyserial nor the polychoric correlation is typically used in statistical analysis.
Nevertheless, functions (or packages) to calculate them are available in many statistical programs such as SPSS,
SAS, Stata, and R. The first three of these programs are proprietary software, so I decided to focus on the
free software for statistical computing, R.
    There are two R packages which provide functions to calculate polychoric and polyserial correlations: polycor
and psych. The functions polyserial{psych} and polychoric{psych} have the drawback that the calculation is
not performed if there are more than 8 categories for any item. This is a very substantial limitation: it is not
possible to use these functions for a simulation study.
    The functions polyserial{polycor} and polychor{polycor} do not have explicit restrictions on the dimen-
sion of contingency tables. However, polychor works very slowly with a large number of categories. The most
computationally expensive stage is the calculation of the bivariate normal distribution function at each
iteration of the optimization algorithm. Here it is performed by the internal function binBvn, which uses the func-
tion pmvnorm{mvtnorm}. The function pmvnorm returns a scalar value for a given interval in p-dimensional space
(in our case, two-dimensional), meaning it is not vectorized. To calculate the probabilities for each
combination α̂i1 , α̂j2 the function binBvn uses a nested loop over the indices i, j. This is unfortunate and makes
polychor slow for a large number of categories, since loops are very slow in the interpreted language R.
    For this reason, I have implemented the calculation of the polychoric correlation coefficient using a standard set
of R packages. To do this, several user functions have been written.
    The function FuncCalcPolychor takes as inputs the current value of the correlation coefficient ρt , the threshold values
α̂i1 , α̂j2 , and the frequency table dij . The algorithm can be divided into two steps.

    1. Based on the values of ρt , α̂i1 , α̂j2 , a two-dimensional array is calculated containing the values of the
       bivariate standard normal distribution function for all combinations α̂i1 , α̂j2 . A last row with the values
       Φ(α̂j2 ) and a last column with the values Φ(α̂i1 ) are appended, with a one placed at their intersection.
       At the beginning of the array a zero row and a zero column are added. On the basis of this array an
       array of probabilities pij is constructed according to (2). Negative probability values that may occur
       due to inaccuracies in the calculation are reset to zero.

    2. Based on the values of dij and the values pij obtained in step 1, the value ℓ̂ is calculated and returned.

   In step 1, the values of the bivariate standard normal distribution function can be calculated with the pmnorm
function. This requires installation of the package mnormt, which works by making a suitable call to a Fortran-77
routine written by Alan Genz. The function pmnorm is vectorized: it returns a vector of values for an input matrix
of size N × 2, where N is the number of points at which probabilities are calculated, in our case N = n1 n2 . To create
all combinations, the standard function expand.grid is used.
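The expand.grid plus vectorized-CDF approach described above has a direct analogue in other languages. The following Python sketch (not the R code itself; the function name is hypothetical) evaluates the bivariate standard normal CDF at all threshold combinations in one batched call:

```python
import numpy as np
from scipy.stats import multivariate_normal

def bvn_cdf_grid(a1, a2, rho):
    """Bivariate standard normal CDF at every combination of the thresholds
    a1 (length n1 - 1) and a2 (length n2 - 1), batched instead of looped."""
    g1, g2 = np.meshgrid(a1, a2, indexing="ij")        # expand.grid analogue
    pts = np.column_stack([g1.ravel(), g2.ravel()])    # N x 2, N = n1 * n2
    vals = multivariate_normal.cdf(pts, mean=[0, 0],
                                   cov=[[1.0, rho], [rho, 1.0]])
    return np.reshape(vals, (len(a1), len(a2)))
```

As in the R implementation, the point is that a single call over an N × 2 matrix of points replaces a nested loop over i, j, which matters in interpreted languages.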




    Alternatively (without additional packages), I have implemented a number of user functions for calculating
the two-dimensional matrix of values of the bivariate standard normal distribution function. They are written
on the basis of the algorithm proposed in [Mey2013]. I took the C code presented in [Mey2013] and rewrote it
in R with the addition of vectorization. In other words, all functions processing scalar values (or vectors) are
replaced by functions processing vectors (or matrices), such as ifelse, apply, outer, pmin. Some errors were
also corrected, such as division by zero when any of the values α̂i1 , α̂j2 is zero; zero is replaced by 10−5 .
    Finally, the function PolychorEst takes as inputs vectors of sampled values of the observed variables x1 and
x2 . Its work can also be divided into two steps.

  1. Filling a table of relative frequencies dij for the given vectors x1 and x2 . Calculation of the cumulative sums
     ∑_{l=1}^{i} ∑_{j=1}^{n2} dlj and ∑_{l=1}^{j} ∑_{i=1}^{n1} dil , and from them of α̂i1 , α̂j2 , using the standard
     function qnorm{stats} to calculate quantiles of the normal distribution.

  2. Optimization of the FuncCalcPolychor function with respect to ρt for the values α̂i1 , α̂j2 , dij calculated in
     step 1. The one-parameter optimization is carried out using the basic function optimize{stats} on the interval
     ρ ∈ (−1, 1).

   In addition, I have implemented the calculation of the polyserial correlation coefficient using a standard set of
R packages. The user function PolyserialEst is quite simple. It standardizes the variable which is assumed to
be continuous and then optimizes the FuncCalcPolyserial function with respect to ρt for the given sample values
of x1 and the standardized values of x2 . The function FuncCalcPolyserial calculates the log-likelihood function L
according to (4), using the standard functions pnorm and qnorm for the distribution function and quantile function
of the normal distribution.

5      Simulation Study
For the simulation study the following model example was used. The random variable ξ1 was simulated from a
standard normal distribution. The random variable ξ2 was defined as

                                                       ξ2 = ξ1 + ε

where ε is a normally distributed random variable with mean zero and standard deviation σε . The value of σε
depended on the value of the correlation coefficient ρ specified by the experimental design; these values are
related by

                                             σε = σx √(1/ρ² − 1).
   Then the variables ξ1 and ξ2 were grouped. Intervals equidistant in probability were used, with boundaries
defined by the sample quantiles at probabilities 0, 1/nk , . . . , (nk − 1)/nk , 1. As the value of the variable xk at
the i-th level, the sample mean of all values in the i-th interval is taken.
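The generation and grouping scheme above can be sketched as follows. This is a Python sketch of the simulation design (the study itself was run in R), with σx = 1 since ξ1 is standard normal; the function name is hypothetical:

```python
import numpy as np

def simulate_discretized_pair(rho, n, n1, n2, rng):
    """Generate (xi1, xi2) with correlation rho, then discretize each into
    n_k quantile-based bins, replacing each value by its bin's sample mean."""
    xi1 = rng.standard_normal(n)
    sigma_eps = np.sqrt(1.0 / rho ** 2 - 1.0)     # sigma_x = 1 here
    xi2 = xi1 + sigma_eps * rng.standard_normal(n)

    def group(xi, nk):
        # bin boundaries: sample quantiles at probabilities 0, 1/nk, ..., 1
        qs = np.quantile(xi, np.linspace(0.0, 1.0, nk + 1))
        idx = np.clip(np.searchsorted(qs, xi, side="right") - 1, 0, nk - 1)
        means = np.array([xi[idx == i].mean() for i in range(nk)])
        return means[idx]                          # bin mean as level value

    return group(xi1, n1), group(xi2, n2)
```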
   The value of ρ was set to 0.5 and the sample size n to 500. The values of nk were taken from 5, 10, 25, 50, 100, 250,
and the correlation coefficients were estimated. Results were averaged over 1,000 repetitions. Figure 1 shows the
average value ρ̂ of the Pearson correlation coefficient. Figure 2 shows the average value ρ̂ of the polyserial correlation
coefficient when the first variable is considered ordered categorical and the second is assumed to be
continuous. Figure 3 shows the average value ρ̂ of the polyserial correlation coefficient when the second variable is
considered ordered categorical and the first is assumed to be continuous.
   The average values of the polychoric correlation are very close to 0.5 for all combinations of the continuity index
values; the polychoric correlation coefficient thus gives the estimate closest to the true value.
However, when the continuity index γk is high the calculation is very slow. Table 1 shows the user time of
running the R functions to calculate one value of the polychoric correlation. The developed user function
PolychorEst runs about six times faster than the function polychor{polycor} when γk ≥ 0.05. However,
when γ2 = 0.05 and γ1 = 0.50 the calculation takes nearly a minute. Therefore it is recommended to use a simpler
correlation coefficient (in particular, the Pearson correlation) if it provides acceptable accuracy.
   Figure 1 shows that, for a fixed value of γ2 and γ1 exceeding 0.05, changes in γ1 do not significantly
decrease the bias of the Pearson correlation coefficient. The values γ1 > 0.05 and γ2 > 0.05 lead to a small bias




                    Table 1: The total user CPU time of the R process, in seconds, γ2 = 0.05

            Function            γ1 = 0.01         γ1 = 0.02      γ1 = 0.05     γ1 = 0.10   γ1 = 0.20   γ1 = 0.50
            polychor{polycor}        0.22              0.78           3.07         15.01       53.75      319.21
            PolychorEst              0.17              0.22           0.52          1.95        7.92       52.75




                     [Plot: ρ̂ versus γ1 (0.0–0.5) for γ2 = 0.01, 0.02, 0.05; vertical axis from 0.46 to 0.50]

                                           Figure 1: Pearson correlation coefficient

in the Pearson correlation coefficient. Thereby, exceeding the continuity index cutoff of 0.05 makes it possible to
estimate the Pearson correlation coefficient.
   Figure 2 shows that the values of the polyserial correlation depend only weakly on γ1 . This is because the
first variable x1 is assumed categorical, so the quality of the correlation estimate depends on the continuity
of the second variable. The results shown in Fig. 3 do not depend on the values of γ2 . In these cases too, if the
continuity index exceeds the cutoff of 0.05, the bias of the correlation coefficient is quite small.

6   Conclusions
It is proposed to choose an appropriate correlation measure based on data type detection. The continuity index γk
makes it possible to detect whether a variable is continuous or ordered categorical. The simulation study revealed
that exceeding the γk cutoff of 0.05 for both variables allows the variables to be classified as continuous, so the
Pearson correlation coefficient can be calculated. If only γ1 exceeds the cutoff of 0.05, you can use the polyserial
correlation of ordered categorical x2 and continuous x1 . If only γ2 exceeds the cutoff of 0.05, it makes sense to
estimate the polyserial correlation of ordered categorical x1 and continuous x2 . The polychoric correlation
coefficient provides the best quality of estimation regardless of the continuity index, but its calculation is
very slow if the number of categories is large. Therefore it is recommended to use the polychoric correlation
only if both variables have a continuity index of less than 0.05.

Acknowledgements
The reported study was funded by Russian Ministry of Education and Science, according to the research project
No. 2.2327.2017/ПЧ.

References
[Hol2010]    F. P. Holgado-Tello, et al. Polychoric versus Pearson correlations in exploratory and confirmatory
             factor analysis of ordinal variables. Quality & Quantity, 44(1):153–166, 2010.

[Dra1988] F. Drasgow. Encyclopedia of statistical sciences / Polychoric and polyserial correlations. John Wiley
          & Sons, 1988.




                     [Plot: ρ̂ versus γ1 (0.0–0.5) for γ2 = 0.01, 0.02, 0.05; vertical axis from 0.47 to 0.50]

                   Figure 2: Polyserial correlation of ordered categorical x1 and continuous x2
                     [Plot: ρ̂ versus γ1 (0.0–0.5); vertical axis from 0.47 to 0.50]

                   Figure 3: Polyserial correlation of ordered categorical x2 and continuous x1

[Pea1913] K. Pearson, and D. Heron. On theories of association. Biometrika, 9(1/2):159–315, 1913.
[Ols1979]   U. Olsson. Maximum Likelihood Estimation of the Polychoric Correlation Coefficient. Psychometrika,
            44(4):443–460, 1979.
[Kol2004]   S. Kolenikov, and G. Angeles. The use of discrete data in PCA: theory, simulations, and applications
            to socioeconomic indices. Chapel Hill, Carolina Population Center, University of North Carolina,
            2004.

[Pit1979]   E. J. G. Pitman. Some basic theory for statistical inference. London, Chapman and Hall, 1979.
[Mey2013] C. Meyer. Recursive Numerical Evaluation of the Cumulative Bivariate Normal Distribution. Journal
          of Statistical Software, 52(10):1–14, 2013.



