Measurement Error and Causal Discovery

Richard Scheines & Joseph Ramsey
Department of Philosophy
Carnegie Mellon University
Pittsburgh, PA 15217, USA

1 Introduction

Algorithms for causal discovery emerged in the early 1990s and have since proliferated [4, 10]. After directed acyclic graphical representations of causal structures (causal graphs) were connected to conditional independence relations (the Causal Markov Condition [10] and d-separation [4]), graphical characterizations of Markov equivalence classes of causal graphs (patterns) soon followed, along with pointwise consistent algorithms to search for patterns. Researchers in Philosophy, Statistics, and Computer Science have produced constraint-based algorithms, score-based algorithms, information-theoretic algorithms, algorithms for linear models with non-Gaussian errors, algorithms for systems that involve causal feedback, algorithms for equivalence classes that contain unmeasured common causes, algorithms for time series, algorithms for handling both experimental and non-experimental data, algorithms for dealing with datasets that overlap on a proper subset of their variables, and algorithms for discovering the measurement model structure for psychometric models involving dozens of "indicators". In many cases we have proofs of the asymptotic reliability of these algorithms, and in almost all cases we have simulation studies that give us some sense of their finite-sample accuracy. The FGES algorithm (Fast Greedy Equivalence Search [6]), which we feature here, is highly accurate in a wide variety of circumstances and is computationally tractable on a million variables for sparse graphs. Many algorithms have been applied to serious scientific problems, such as distinguishing between autistic and neurotypical subjects from fMRI data [2], and interest in the field seems to be exploding.

Amazingly, work to assess the finite-sample reliability of causal discovery algorithms has proceeded under the assumption that the variables given are measured without error. In almost any empirical context, however, some of the variance of any measured variable comes from instrument noise, recording errors, or worse. Historically, the development of statistics as a discipline was partly spurred by the need to handle measurement error in astronomy. In many cases measurement error can be substantial. In epidemiology, for example, cumulative exposure to environmental pollutants or toxic chemicals is often measured by proxies only loosely correlated with exposure, like "distance from an industrial pollutant emitter" or an air quality monitor within a few miles of the subject.

In its simplest form, measurement error is random noise: X_measured = X + ε, where ε is an aggregate representing many small but independent sources of error and thus, by the central limit theorem, at least approximately Gaussian. In other cases measurement error is systematic. For example, it is well known that people under-report socially undesirable activities like cheating, and since the more they engage in the activity the more they under-report, this type of error is not random. In this paper we concern ourselves only with random noise error, and we explore its impact on the overall accuracy of causal discovery algorithms.
2 Parameterizing Measurement Error

We consider linear structural equation models (SEMs) in which each variable V is a linear combination of its direct causes and an "error" term ε_V that represents an aggregate of all other causes of V. In Figure 1 we show a simple model involving a causal chain from X to Z to Y. Each variable has a structural equation, and the model can be parameterized by assigning real values to β1 and β2 and a joint normal distribution to {ε_X, ε_Z, ε_Y} ~ N(0, Σ²), with Σ² diagonal to reflect the independence among the "error terms" ε_X, ε_Z, and ε_Y.

Equations:
    X := ε_X
    Z := β1 X + ε_Z
    Y := β2 Z + ε_Y

Figure 1: Causal Model for X, Z, Y.

For any values of its free parameters, the model in Fig. 1 entails the vanishing partial correlation ρXY.Z = 0, which in the Gaussian case is also the conditional independence X ⊥⊥ Y | Z. In Fig. 2 we show the same model, but with Z "measured" by Zm, with "measurement error" ε_Zm.

Equations:
    X := ε_X
    Z := β1 X + ε_Z
    Y := β2 Z + ε_Y
    Zm := Z + ε_Zm

Figure 2: Measurement Error on Z.

The variance of Zm is just the sum of the variances of Z and the measurement error ε_Zm, i.e., var(Zm) = var(Z) + var(ε_Zm), so we can parameterize the "amount" of measurement error by the fraction of var(Zm) that comes from var(ε_Zm). If var(ε_Zm) = 0.5 var(Zm), then half of the variation in the measure is from measurement noise.

For simplicity, and without loss of generality, consider the re-parameterization of Fig. 2 in which all the variables {X, Z, Zm, Y} are standardized to have mean 0 and variance 1. In that case Zm = λZ + ε_Zm, where ρZ,Zm = λ and var(ε_Zm) = 1 − λ², since var(Zm) = λ² var(Z) + var(ε_Zm) = λ² + var(ε_Zm) = 1.
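To make the parameterization concrete, here is a minimal Python sketch (our illustration, not code from the paper; the coefficient values and names such as beta1 and error_fraction are assumptions chosen for the example). It simulates the standardized model of Fig. 2 for a chosen measurement-error fraction and checks the variance decomposition numerically.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000                  # sample size
    beta1, beta2 = 0.6, 0.7      # structural coefficients (arbitrary choices)
    error_fraction = 0.5         # desired fraction of var(Zm) due to measurement noise

    # Standardized chain X -> Z -> Y: every variable has variance 1.
    X = rng.normal(0, 1, n)
    Z = beta1 * X + rng.normal(0, np.sqrt(1 - beta1**2), n)
    Y = beta2 * Z + rng.normal(0, np.sqrt(1 - beta2**2), n)

    # Standardized measure Zm = lam*Z + error, with var(error) = 1 - lam^2.
    lam = np.sqrt(1 - error_fraction)   # so that 1 - lam^2 equals error_fraction
    Zm = lam * Z + rng.normal(0, np.sqrt(1 - lam**2), n)

    print(np.var(Zm))                   # ~1.0: total variance of the measure
    print(np.corrcoef(Z, Zm)[0, 1])     # ~lam: correlation between Z and its measure
    print(1 - lam**2)                   # fraction of var(Zm) that is measurement noise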
3 The Problem for Causal Discovery

Measurement error presents at least two sorts of problems for causal discovery. First, for two variables X and Y that are measured with error by Xm and Ym, if X and Y have correlation ρXY, then this correlation is attenuated in the measures Xm and Ym; that is, |ρXY| > |ρXm,Ym|. Thus a sample correlation for X and Y might lead us to believe that X and Y are correlated in the population, i.e., |ρXY| > 0, while the sample correlation for Xm and Ym misleads us into accepting independence, i.e., ρXm,Ym = 0. So although an algorithm would make X and Y adjacent because a statistical inference concludes that |ρXY| > 0, the same procedure might conclude that Xm and Ym are not adjacent because a statistical inference concludes that ρXm,Ym = 0. Worse, this decision will in many cases affect other decisions that will in turn affect the output of the procedure.

Second, if a variable Z "separates" two other variables X and Y, that is, X and Y are d-separated by Z and thus X ⊥⊥ Y | Z, then a measure of Z with error, Zm, does not separate X and Y; X and Y are not independent given Zm. In Fig. 2, for example, the model implies that X ⊥⊥ Y | Z, but it does not imply that X ⊥⊥ Y | Zm. If the measurement error is small, we still might be able to statistically infer that X ⊥⊥ Y | Zm, but in general measurement error in a separator fails to preserve conditional independence. Again, judgments regarding X ⊥⊥ Y | Zm will affect decisions involving relationships among X, Y, and Zm, but they will also have non-local consequences involving other variables in the graph.

For example, consider a case in which we simulated data from a model with six variables, only two of which were measured with error. In Fig. 3, the generating model on the left is a standardized SEM (all variables mean 0 and variance 1), with L1 and L2 measured with 1% error as L1m and L2m. Running FGES on a sample of size 1,000 drawn from the measured variables {L1m, L2m, X1, X2, X3, X4}, we obtained the pattern output on the right, which is optimal even if we had the population distribution for {L1, L2, X1, X2, X3, X4}.

Figure 3: FGES Output with 1% Measurement Error.

Measuring L1 and L2 with 20% error produced a pattern nothing like the true one (Fig. 4), and the errors in the output are not restricted to relations involving L1m and L2m as potential separators. For example, FGES found, in error, that X1 and X4 are adjacent, because L2m did not separate them where L2 would have. Similarly, X1 and X2 were made adjacent because L1m failed to separate them where L1 would have. X2 and X4 were found to be separated, but not by X1, so the algorithm oriented the false adjacencies X2 — X1 and X4 — X1 as X2 → X1 and X4 → X1, which in turn caused it to orient the X1 — L1m adjacency incorrectly as X1 → L1m. Thus errors made as a result of measurement error are not contained to local decisions. Because of this non-locality, judging the overall problem that measurement error poses for causal discovery algorithms is almost impossible to handle purely theoretically.

Figure 4: FGES Output with 20% Measurement Error.
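Both problems can be verified numerically. For standardized measures Xm = λ_X X + ε and Ym = λ_Y Y + ε' with independent errors, ρXm,Ym = λ_X λ_Y ρXY, which is strictly smaller in magnitude whenever there is any measurement error; and the partial correlation of X and Y given Zm does not vanish even though ρXY.Z = 0. The following Python sketch is our own illustration (assumed coefficients, plus a simple regression-residual helper for the partial correlation), not part of the paper's simulations.

    import numpy as np

    def partial_corr(a, b, c):
        # Correlation of a and b after regressing each on the single conditioning variable c.
        ra = a - np.polyval(np.polyfit(c, a, 1), c)
        rb = b - np.polyval(np.polyfit(c, b, 1), c)
        return np.corrcoef(ra, rb)[0, 1]

    rng = np.random.default_rng(1)
    n = 200_000
    beta1, beta2, lam = 0.6, 0.7, np.sqrt(0.5)   # 50% measurement error

    X = rng.normal(0, 1, n)
    Z = beta1 * X + rng.normal(0, np.sqrt(1 - beta1**2), n)
    Y = beta2 * Z + rng.normal(0, np.sqrt(1 - beta2**2), n)
    Zm = lam * Z + rng.normal(0, np.sqrt(1 - lam**2), n)

    # Problem 1, attenuation: measuring X and Y with 50% error scales their correlation by lam^2.
    Xm = lam * X + rng.normal(0, np.sqrt(1 - lam**2), n)
    Ym = lam * Y + rng.normal(0, np.sqrt(1 - lam**2), n)
    print(np.corrcoef(X, Y)[0, 1], np.corrcoef(Xm, Ym)[0, 1])   # second ~ lam^2 times the first

    # Problem 2, failed separation: rho_XY.Z is ~0, but rho_XY.Zm is clearly nonzero.
    print(partial_corr(X, Y, Z))
    print(partial_corr(X, Y, Zm))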
4 Causal Discovery Accuracy

Causal researchers in several fields recognize that measurement error is a serious problem, and strategies have emerged for dealing with the problem theoretically. In multivariate regression, it is known that (a) measurement error in the response variable Y inflates the standard errors of the coefficient estimates relating independent variables to Y but does not change their expectation, (b) measurement error in an independent variable X attenuates the coefficient estimate for X toward 0, and (c) measurement error in a "covariate" Z produces partial omitted-variable bias in the estimate for any variable X when Z is also included in the regression. For cases (b) and (c) there is a literature, primarily in economics, known as "errors-in-variables", for handling measurement error in the independent variables and covariates [1]. If the precise amount of measurement error is known, then parameter estimates can be adjusted accordingly. If the amount is unknown, one can still use sensitivity analysis or a Bayesian approach [7]. In general, in cases in which researchers are confident in the specification of a model, the effect of measurement error on parameter estimation has been well studied (see, for example, Pearl [5] and Kuroki and Pearl [3]).

In causal discovery, however, the input is typically assumed to be an i.i.d. sample for a set of measured variables V drawn from a population over variables V ∪ L that satisfies some general assumptions (e.g., the Causal Markov and/or Faithfulness axioms) and/or some parametric assumption (e.g., linearity). The output is often an equivalence class of causal structures, any representative of which might be an identified statistical model whose parameters can be estimated from these same data. In discussing the reliability of a causal discovery algorithm, we care about how the equivalence class output by the algorithm compares to the equivalence class of which the true causal graph is a member.

For example, if the causal graph in the upper left of Fig. 5 is the true model, then the pattern on the upper right represents the Markov equivalence class of which the true graph is a member. It represents all that can be discovered about the causal structure under the assumptions that the evidence is non-experimental and confined to independence relations, and that the connection between the graph and the data is given by the Causal Markov and Faithfulness assumptions. On the bottom of Fig. 5 we show the patterns output by the FGES algorithm on a sample of 50 (bottom left) and on a sample of 1,000 (bottom right). The pattern on the bottom right matches the pattern for the true graph, so this output is maximally accurate, even though it did not "orient" the edge between X2 and X4.

Figure 5: Graphs, Patterns, and Search Accuracy.

Patterns output by search procedures can be scored on accuracy with respect to adjacencies and accuracy with respect to arrowhead orientation. The true pattern has three adjacencies, {X1 — X3, X2 — X3, X2 — X4}, and two arrowhead orientations, {X1 → X3, X2 → X3}. The pattern on the bottom left contains two of the three adjacencies, but it missed the adjacency between X2 and X3. The adjacency precision (AP) is the proportion of correct adjacencies among those guessed to be adjacencies. In this case the search algorithm guessed three adjacencies, two of which were correct, so AP = 2/3 = .67. The adjacency recall (AR) is the proportion of true adjacencies found by the algorithm. In this case the search output two of the three, so AR = .67.

We can also compute the "arrowhead precision" (AHP) and "arrowhead recall" (AHR). In the bottom left of Fig. 5 the algorithm output two arrowheads, but we only count the one that also corresponds to a true adjacency, so its AHP is 1/1 = 1.0. As there were two arrowheads in the true pattern that it could have found, one of which it missed, its AHR = 1/2 = .5.
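These four scores are straightforward to compute from edge sets. The sketch below is our own helper, not the paper's scoring code: a pattern is represented as a set of undirected adjacencies plus a set of directed edges, and the example reproduces the numbers above. The specific false adjacency and false arrowhead in the estimated pattern are hypothetical, since the text does not list them.

    def adjacency_scores(true_adj, est_adj):
        """AP and AR from sets of frozenset adjacencies, e.g. {frozenset({'X1', 'X3'}), ...}."""
        correct = true_adj & est_adj
        ap = len(correct) / len(est_adj) if est_adj else 1.0
        ar = len(correct) / len(true_adj) if true_adj else 1.0
        return ap, ar

    def arrowhead_scores(true_dir, est_dir, true_adj):
        """AHP and AHR from sets of directed edges (tail, head). Following the text,
        estimated arrowheads on false adjacencies are excluded from the precision count."""
        counted = {e for e in est_dir if frozenset(e) in true_adj}
        correct = counted & true_dir
        ahp = len(correct) / len(counted) if counted else 1.0
        ahr = len(correct) / len(true_dir) if true_dir else 1.0
        return ahp, ahr

    # Worked example from Fig. 5 (bottom-left output); the false edges are hypothetical.
    true_adj = {frozenset(e) for e in [('X1', 'X3'), ('X2', 'X3'), ('X2', 'X4')]}
    true_dir = {('X1', 'X3'), ('X2', 'X3')}
    est_adj = {frozenset(e) for e in [('X1', 'X3'), ('X2', 'X4'), ('X1', 'X4')]}
    est_dir = {('X1', 'X3'), ('X1', 'X4')}

    print(adjacency_scores(true_adj, est_adj))             # ~(0.67, 0.67)
    print(arrowhead_scores(true_dir, est_dir, true_adj))   # (1.0, 0.5)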
5 The Simulation Study

As predicting the global effects of measurement error on search accuracy is hopelessly complicated by the non-locality of the problem, we instead performed a somewhat systematic simulation study. We generated causal graphs randomly, parameterized them randomly, drew pseudo-random samples of varying size, added varying degrees of measurement error, and then ran the FGES causal discovery algorithm and computed the four measures of accuracy discussed above: AP, AR, AHP, and AHR.

We generated random causal graphs, each with 20 variables. In half of our simulations the "average degree" of the graph (the average number of adjacencies for a variable) was 2, and in half it was 6. A graph with 20 variables and an average degree of 2 is fairly sparse; with an average degree of 6 it is fairly dense. We randomly chose edge coefficients, and for each SEM we generated 10 samples of size 100, 500, 1,000, and 5,000. For each sample we standardized the variables to have mean 0 and variance 1, and we then applied varying degrees of measurement error to all of the measured variables. For each variable X, we replaced X with Xm = X + ε_x, where ε_x ~ N(0, σ²), for values of σ² ∈ {0, .01, .1, .2, .3, .4, .5, 1, 2, 3, 4, 5}. Since the variance of all the original variables is 1.0, when σ² = .1 approximately 10% of the variance of Xm was from measurement error; when σ² = 1, 50% of the variance of Xm was from measurement error; and when σ² = 5, over 83% of the variance of Xm was from measurement error.
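For readers who want to set up a study of this kind, the sketch below generates a random 20-variable linear SEM, standardizes the sample, and adds measurement error over the σ² grid above. It is our own sketch under stated assumptions: the graph generator, the coefficient range, and the function names are not from the paper, and the search step itself is not shown — the noisy data would be handed to an FGES implementation (for example, the one distributed with Tetrad) and scored with AP/AR/AHP/AHR.

    import numpy as np

    def random_dag(p, avg_degree, rng):
        """Random DAG over p variables as an upper-triangular adjacency matrix,
        with edge probability chosen to hit the target average degree in expectation."""
        prob = avg_degree / (p - 1)
        return np.triu(rng.random((p, p)) < prob, k=1)

    def simulate_sem(adj, n, rng, coef_low=0.2, coef_high=1.0):
        """Linear-Gaussian SEM data generated in causal order, then standardized.
        The coefficient range is an assumption for illustration."""
        p = adj.shape[0]
        coefs = adj * rng.uniform(coef_low, coef_high, (p, p)) * rng.choice([-1, 1], (p, p))
        data = np.zeros((n, p))
        for j in range(p):                      # columns are already in causal order
            data[:, j] = data @ coefs[:, j] + rng.normal(0, 1, n)
        return (data - data.mean(0)) / data.std(0)

    def add_measurement_error(data, sigma2, rng):
        """Replace each X with Xm = X + e, where e ~ N(0, sigma2)."""
        return data + rng.normal(0, np.sqrt(sigma2), data.shape)

    rng = np.random.default_rng(42)
    adj = random_dag(20, avg_degree=2, rng=rng)
    clean = simulate_sem(adj, n=1000, rng=rng)
    for sigma2 in [0, .01, .1, .2, .3, .4, .5, 1, 2, 3, 4, 5]:
        noisy = add_measurement_error(clean, sigma2, rng)
        print(sigma2, round(sigma2 / (1 + sigma2), 2))   # nominal fraction of var(Xm) from noise
        # `noisy` would then be passed to an FGES implementation and scored.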
6 Results

First consider the effect of sample size on FGES accuracy when there is no measurement error at all. In Fig. 6 we show the performance for sparse graphs (average degree = 2) and for fairly dense graphs (average degree = 6). For sparse graphs, accuracy is high even for samples of only 100, and by N = 1,000 it is almost maximal. For denser graphs, which are much harder to discover, performance is still not ideal even at sample size 5,000.

Figure 6: Baseline: Zero Measurement Error.

Fig. 7 shows the effects of measurement error on accuracy for both sparse and dense graphs at sample sizes 100, 500, and 5,000. For sparse graphs, at sample size 100 the accuracy of FGES decays severely even for measurement error below 17% of the variance (ME = .2), especially for orientation accuracy. The results are quite poor at 50% (ME = 1) and are near 0 when two thirds of the variance of each variable is noise.

Figure 7: Accuracy and Measurement Error.

Surprisingly, the decay in the performance of FGES from measurement error is much slower at larger sample sizes. For N = 500, the algorithm is still reasonably accurate at 50% measurement error, and adjacency precision remains reasonably high even when 2/3 of the variance of all variables is just noise. If FGES outputs an adjacency at sample size 500, it is still over 70% likely to actually be an adjacency in the true pattern, even with 67% measurement error. On the other hand, the same cannot be said about non-adjacencies (AR) in the 67% measurement error case: if FGES outputs a pattern in which X and Y are not adjacent, there is a lower than 40% chance that X and Y are not adjacent in the true pattern, even in patterns with average degree 2.

Does sample size help FGES's performance in the presence of measurement error? Roughly, yes. For dense graphs the performance of FGES improves somewhat with sample size at almost all levels of measurement error. Notably, even for cases in which 5/6 of the variance in the measured variables was due to measurement error (ME = 5), most measures of FGES accuracy improved dramatically when the sample size was increased from N = 500 to N = 5,000.

Figure 8: Sample Size vs. Accuracy in the presence of measurement error.

7 Discussion, Conclusions and Future Research

We explored the effect of the simplest form of Gaussian measurement error on FGES, a simple causal discovery algorithm, in a simple causal context: no latent confounders, all relations linear, and all variables distributed normally. In this context, even small amounts of measurement error seriously diminish the algorithm's accuracy on small samples, especially its accuracy with respect to orientation. Surprisingly, at large sample sizes (N = 5,000) FGES is still somewhat accurate even when over 80% of the variance in each variable is random noise.

One way to conceive of the situation we describe is as latent variable model discovery, and to use discovery algorithms built to search for PAGs (partial ancestral graphs, which represent models that include latent confounding) instead of patterns. As the variables being measured with error are in some sense latent variables, this is technically appropriate. The kind of latent confounding for which PAGs are meant, however, is not the kind typically produced by measurement error on individual variables. If the true graph, for example, is X ← Y → Z, and we measure X, Y, and Z with error, then we still want the output of a PAG search to be Xm o–o Ym o–o Zm, which we would interpret to mean that any connection between Xm and Zm goes through Ym (and by proxy Y). If measurement error caused a problem, however, then the output would likely be a PAG in which every pair of variables is connected by an "o–o" adjacency, a result that is technically accurate but carries almost no information.

There are two strategies that might work and that we will try on this problem in future research. First, prior knowledge about the degree of measurement error on each variable might be used in a Bayesian approach in which we compute a posterior over each independence decision. This would involve serious computational challenges, and we do not know how many variables would be feasible with this approach. It is reasonable in principle, and a variant of it was tried on the effect of lead on IQ among children [8].

Second, if each variable studied is measured by at least two "measurement" variables with independent measurement error, then the independence relations among the "latent" variables in a linear system can be directly tested by estimating a structural equation model and performing a hypothesis test on a particular parameter, which is 0 in the population if and only if the independence holds in the population (see chapter 12, section 12.5 of Spirtes et al. [11]). This strategy depends on the measurement error in each of the measures being independent, an assumption that is often false. Techniques to find measurement variables that do have independent error have been developed, however [9], and these can be used to help identify measures with independent error in situations where measurement is a serious worry.

A different measurement problem that is likely to produce results similar to Gaussian measurement noise is discretization. We know that if, for continuous variables X, Y, and Z, X ⊥⊥ Y | Z, then for discrete projections Xd, Yd, and Zd of X, Y, and Z, the independence fails in almost every case. That is, it is often the case that X ⊥⊥ Y | Z but Xd and Yd are not independent given Zd. This means that using discrete measures of variables that are more plausibly continuous likely leads to the same problems for causal discovery as measuring continuous variables with error. As far as we are aware, no one has a clear idea of the severity of the problem, even though using discrete measures of plausibly continuous variables is commonplace.
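The discretization point is easy to check by simulation. In the sketch below (our own illustration, using a median split as the discrete projection), X ⊥⊥ Y | Z holds exactly for the continuous chain, yet within each level of Zd the discretized Xd and Yd remain clearly associated.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000
    # Continuous chain X -> Z -> Y, so X _||_ Y | Z holds exactly.
    X = rng.normal(0, 1, n)
    Z = 0.8 * X + rng.normal(0, 0.6, n)
    Y = 0.8 * Z + rng.normal(0, 0.6, n)

    # Discrete projections by median split (one of many possible discretizations).
    Xd, Yd, Zd = (v > np.median(v) for v in (X, Y, Z))

    # If Xd were independent of Yd given Zd, these two conditional proportions would match.
    for z in (False, True):
        mask = Zd == z
        p1 = Yd[mask & Xd].mean()    # P(Yd = 1 | Xd = 1, Zd = z)
        p0 = Yd[mask & ~Xd].mean()   # P(Yd = 1 | Xd = 0, Zd = z)
        print(z, round(p1, 3), round(p0, 3))   # clearly unequal: independence fails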
Acknowledgments

This research was supported by NIH grant #U54HG008540 (the Center for Causal Discovery) and by NSF grant #1317428.

References

[1] Wayne A. Fuller. Measurement Error Models. John Wiley & Sons, 1987.

[2] C. Hanson, S. J. Hanson, J. Ramsey, and C. Glymour. Atypical effective connectivity of social brain networks in individuals with autism. Brain Connectivity, 3(6):578–589, 2013.

[3] M. Kuroki and J. Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.

[4] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[5] J. Pearl. Measurement bias in causal inference. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 425–432, 2010.

[6] J. Ramsey. Scaling up greedy causal search for continuous variables. Technical report, Center for Causal Discovery, 2015. arXiv:1507.07749.

[7] P. Rosenbaum. Observational Studies. Springer-Verlag, 1995.

[8] R. Scheines. Estimating latent causal influences: Tetrad III variable selection and Bayesian parameter estimation: the effect of lead on IQ. In Handbook of Data Mining. Oxford University Press, 2000.

[9] R. Silva, C. Glymour, R. Scheines, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191–246, 2006.

[10] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Lecture Notes in Statistics 81. Springer-Verlag, 1993.

[11] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.