Measurement Error and Causal Discovery

Richard Scheines & Joseph Ramsey
Department of Philosophy
Carnegie Mellon University
Pittsburgh, PA 15217, USA

1 Introduction

Algorithms for causal discovery emerged in the early 1990s and have since proliferated [4, 10]. After directed acyclic graphical representations of causal structures (causal graphs) were connected to conditional independence relations (the Causal Markov Condition [10] and d-separation [4]), graphical characterizations of Markov equivalence classes of causal graphs (patterns) soon followed, along with pointwise consistent algorithms to search for patterns. Researchers in Philosophy, Statistics, and Computer Science have produced constraint-based algorithms, score-based algorithms, information-theoretic algorithms, algorithms for linear models with non-Gaussian errors, algorithms for systems that involve causal feedback, algorithms for equivalence classes that contain unmeasured common causes, algorithms for time series, algorithms for handling both experimental and non-experimental data, algorithms for dealing with datasets that overlap on a proper subset of their variables, and algorithms for discovering the measurement model structure for psychometric models involving dozens of "indicators". In many cases we have proofs of the asymptotic reliability of these algorithms, and in almost all cases we have simulation studies that give us some sense of their finite-sample accuracy. The FGES algorithm (Fast Greedy Equivalence Search [6]), which we feature here, is highly accurate in a wide variety of circumstances and is computationally tractable on a million variables for sparse graphs. Many algorithms have been applied to serious scientific problems, such as distinguishing between autistic and neurotypical subjects from fMRI data [2], and interest in the field seems to be exploding.

Amazingly, work to assess the finite-sample reliability of causal discovery algorithms has proceeded under the assumption that the variables given are measured without error. In almost any empirical context, however, some of the variance of any measured variable comes from instrument noise, recording errors, or worse. Historically, the development of statistics as a discipline was partly spurred by the need to handle measurement error in astronomy. In many cases measurement error can be substantial. In epidemiology, for example, cumulative exposure to environmental pollutants or toxic chemicals is often measured by proxies only loosely correlated with exposure, like "distance from an industrial pollutant emitter" or an air quality monitor within a few miles of the subject.

In its simplest form, measurement error is random noise: X_measured = X + ε, where ε is an aggregate representing many small but independent sources of error and thus, by the central limit theorem, at least approximately Gaussian. In other cases measurement error is systematic. For example, it is well known that people under-report socially undesirable activities like cheating, and since the more they engage in the activity the more they under-report, this type of error is not random. In this paper we concern ourselves only with random noise error, and we explore its impact on the overall accuracy of causal discovery algorithms.
2 Parameterizing Measurement Error

We consider linear structural equation models (SEMs) in which each variable V is a linear combination of its direct causes and an "error" term ε_V that represents an aggregate of all other causes of V. In Figure 1 we show a simple model involving a causal chain from X to Z to Y. Each variable has a structural equation, and the model can be parameterized by assigning real values to β1 and β2 and a joint normal distribution to {ε_X, ε_Z, ε_Y} ~ N(0, Σ²), with Σ² diagonal to reflect the independence among the "error terms" ε_X, ε_Z, and ε_Y.

Equations:
    X := ε_X
    Z := β1 X + ε_Z
    Y := β2 Z + ε_Y

Figure 1: Causal Model for X, Z, Y.

For any values of its free parameters, the model in Fig. 1 entails the vanishing partial correlation ρXY.Z = 0, which in the Gaussian case is also the conditional independence X ⊥⊥ Y | Z. In Fig. 2 we show the same model, but with Z "measured" by Zm, with "measurement error" ε_Zm.

Equations:
    X := ε_X
    Z := β1 X + ε_Z
    Y := β2 Z + ε_Y
    Zm := Z + ε_Zm

Figure 2: Measurement Error on Z.

The variance of Zm is just the sum of the variances of Z and the measurement error ε_Zm, i.e., var(Zm) = var(Z) + var(ε_Zm), so we can parameterize the "amount" of measurement error by the fraction of var(Zm) that comes from var(ε_Zm). If var(ε_Zm) = 0.5 var(Zm), then half of the variation in the measure is from measurement noise.

For simplicity, and without loss of generality, consider the re-parameterization of Fig. 2 in which all the variables {X, Z, Zm, Y} are standardized to have mean 0 and variance 1. In that case Zm = λZ + ε_Zm, where ρZ,Zm = λ and var(ε_Zm) = 1 − λ², since var(Zm) = λ² var(Z) + var(ε_Zm) = λ² + var(ε_Zm) = 1.
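To make the parameterization concrete, here is a minimal Python sketch (our illustration, not code from the paper; the coefficient values and names such as beta1 and error_fraction are assumptions chosen for the example). It simulates the standardized model of Fig. 2 for a chosen measurement-error fraction and checks the variance decomposition numerically.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000                  # sample size
    beta1, beta2 = 0.6, 0.7      # structural coefficients (arbitrary choices)
    error_fraction = 0.5         # desired fraction of var(Zm) due to measurement noise

    # Standardized chain X -> Z -> Y: every variable has variance 1.
    X = rng.normal(0, 1, n)
    Z = beta1 * X + rng.normal(0, np.sqrt(1 - beta1**2), n)
    Y = beta2 * Z + rng.normal(0, np.sqrt(1 - beta2**2), n)

    # Standardized measure Zm = lam*Z + error, with var(error) = 1 - lam^2.
    lam = np.sqrt(1 - error_fraction)   # so that 1 - lam^2 equals error_fraction
    Zm = lam * Z + rng.normal(0, np.sqrt(1 - lam**2), n)

    print(np.var(Zm))                   # ~1.0: total variance of the measure
    print(np.corrcoef(Z, Zm)[0, 1])     # ~lam: correlation between Z and its measure
    print(1 - lam**2)                   # fraction of var(Zm) that is measurement noise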
3 The Problem for Causal Discovery

Measurement error presents at least two sorts of problems for causal discovery. First, for two variables X and Y that are measured with error by Xm and Ym, if X and Y have correlation ρXY, then this correlation is attenuated in the measures Xm and Ym; that is, |ρXY| > |ρXm,Ym|. Thus a sample correlation for X and Y might lead us to believe that X and Y are correlated in the population, i.e., |ρXY| > 0, while the sample correlation for Xm and Ym misleads us into accepting independence, i.e., ρXm,Ym = 0. So although an algorithm would make X and Y adjacent because a statistical inference concludes that |ρXY| > 0, the same procedure might conclude that Xm and Ym are not adjacent because a statistical inference concludes that ρXm,Ym = 0. Worse, this decision will in many cases affect other decisions that will in turn affect the output of the procedure.

Second, if a variable Z "separates" two other variables X and Y, that is, X and Y are d-separated by Z and thus X ⊥⊥ Y | Z, then a measure of Z with error, Zm, does not separate X and Y; X and Y are not independent given Zm. In Fig. 2, for example, the model implies that X ⊥⊥ Y | Z, but it does not imply that X ⊥⊥ Y | Zm. If the measurement error is small, we still might be able to statistically infer that X ⊥⊥ Y | Zm, but in general measurement error in a separator fails to preserve conditional independence. Again, judgments regarding X ⊥⊥ Y | Zm will affect decisions involving relationships among X, Y, and Zm, but they will also have non-local consequences involving other variables in the graph.

For example, consider a case in which we simulated data from a model with six variables, only two of which were measured with error. In Fig. 3, the generating model on the left is a standardized SEM (all variables mean 0 and variance 1), with L1 and L2 measured with 1% error as L1m and L2m. Running FGES on a sample of size 1,000 drawn from the measured variables {L1m, L2m, X1, X2, X3, X4}, we obtained the pattern output on the right, which is optimal even if we had the population distribution for {L1, L2, X1, X2, X3, X4}.

Figure 3: FGES Output with 1% Measurement Error.

Measuring L1 and L2 with 20% error produced a pattern nothing like the true one (Fig. 4), and the errors in the output are not restricted to relations involving L1m and L2m as potential separators. For example, FGES found, in error, that X1 and X4 are adjacent, because L2m did not separate them where L2 would have. Similarly, X1 and X2 were made adjacent because L1m failed to separate them where L1 would have. X2 and X4 were found to be separated, but not by X1, so the algorithm oriented the false adjacencies X2 — X1 and X4 — X1 as X2 → X1 and X4 → X1, which in turn caused it to orient the X1 — L1m adjacency incorrectly as X1 → L1m. Thus errors made as a result of measurement error are not contained to local decisions. Because of this non-locality, judging the overall problem that measurement error poses for causal discovery algorithms is almost impossible to handle purely theoretically.

Figure 4: FGES Output with 20% Measurement Error.
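Both problems can be verified numerically. For standardized measures Xm = λ_X X + ε and Ym = λ_Y Y + ε' with independent errors, ρXm,Ym = λ_X λ_Y ρXY, which is strictly smaller in magnitude whenever there is any measurement error; and the partial correlation of X and Y given Zm does not vanish even though ρXY.Z = 0. The following Python sketch is our own illustration (assumed coefficients, plus a simple regression-residual helper for the partial correlation), not part of the paper's simulations.

    import numpy as np

    def partial_corr(a, b, c):
        # Correlation of a and b after regressing each on the single conditioning variable c.
        ra = a - np.polyval(np.polyfit(c, a, 1), c)
        rb = b - np.polyval(np.polyfit(c, b, 1), c)
        return np.corrcoef(ra, rb)[0, 1]

    rng = np.random.default_rng(1)
    n = 200_000
    beta1, beta2, lam = 0.6, 0.7, np.sqrt(0.5)   # 50% measurement error

    X = rng.normal(0, 1, n)
    Z = beta1 * X + rng.normal(0, np.sqrt(1 - beta1**2), n)
    Y = beta2 * Z + rng.normal(0, np.sqrt(1 - beta2**2), n)
    Zm = lam * Z + rng.normal(0, np.sqrt(1 - lam**2), n)

    # Problem 1, attenuation: measuring X and Y with 50% error scales their correlation by lam^2.
    Xm = lam * X + rng.normal(0, np.sqrt(1 - lam**2), n)
    Ym = lam * Y + rng.normal(0, np.sqrt(1 - lam**2), n)
    print(np.corrcoef(X, Y)[0, 1], np.corrcoef(Xm, Ym)[0, 1])   # second ~ lam^2 times the first

    # Problem 2, failed separation: rho_XY.Z is ~0, but rho_XY.Zm is clearly nonzero.
    print(partial_corr(X, Y, Z))
    print(partial_corr(X, Y, Zm))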
4 Causal Discovery Accuracy

Causal researchers in several fields recognize that measurement error is a serious problem, and strategies have emerged for dealing with the problem theoretically. In multivariate regression, it is known that (a) measurement error in the response variable Y inflates the standard errors of the coefficient estimates relating independent variables to Y but does not change their expectation, (b) measurement error in an independent variable X attenuates the coefficient estimate for X toward 0, and (c) measurement error in a "covariate" Z produces partial omitted-variable bias in the estimate for any variable X when Z is also included in the regression. For cases (b) and (c) there is a literature, primarily in economics, known as "errors-in-variables", for handling measurement error in the independent variables and covariates [1]. If the precise amount of measurement error is known, then parameter estimates can be adjusted accordingly. If the amount is unknown, one can still use sensitivity analysis or a Bayesian approach [7]. In general, in cases in which researchers are confident in the specification of a model, the effect of measurement error on parameter estimation has been well studied (see, for example, Pearl [5] and Kuroki and Pearl [3]).

In causal discovery, however, the input is typically assumed to be an i.i.d. sample for a set of measured variables V drawn from a population over variables V ∪ L that satisfies some general assumptions (e.g., the Causal Markov and/or Faithfulness axioms) and/or some parametric assumption (e.g., linearity). The output is often an equivalence class of causal structures, any representative of which might be an identified statistical model whose parameters can be estimated from these same data. In discussing the reliability of a causal discovery algorithm, we care about how the equivalence class output by the algorithm compares to the equivalence class of which the true causal graph is a member.

For example, if the causal graph in the upper left of Fig. 5 is the true model, then the pattern on the upper right represents the Markov equivalence class of which the true graph is a member. It represents all that can be discovered about the causal structure under the assumptions that the evidence is non-experimental and confined to independence relations, and that the connection between the graph and the data is given by the Causal Markov and Faithfulness assumptions. On the bottom of Fig. 5 we show the patterns output by the FGES algorithm on a sample of 50 (bottom left) and on a sample of 1,000 (bottom right). The pattern on the bottom right matches the pattern for the true graph, so this output is maximally accurate, even though it did not "orient" the edge between X2 and X4.

Figure 5: Graphs, Patterns, and Search Accuracy.

Patterns output by search procedures can be scored on accuracy with respect to adjacencies and accuracy with respect to arrowhead orientation. The true pattern has three adjacencies, {X1 — X3, X2 — X3, X2 — X4}, and two arrowhead orientations, {X1 → X3, X2 → X3}. The pattern on the bottom left contains two of the three adjacencies, but it missed the adjacency between X2 and X3. The adjacency precision (AP) is the proportion of correct adjacencies among those guessed to be adjacencies. In this case the search algorithm guessed three adjacencies, two of which were correct, so AP = 2/3 = .67. The adjacency recall (AR) is the proportion of true adjacencies found by the algorithm. In this case the search output two of the three, so AR = .67.

We can also compute the "arrowhead precision" (AHP) and "arrowhead recall" (AHR). In the bottom left of Fig. 5 the algorithm output two arrowheads, but we only count the one that also corresponds to a true adjacency, so its AHP is 1/1 = 1.0. As there were two arrowheads in the true pattern that it could have found, one of which it missed, its AHR = 1/2 = .5.
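These four scores are straightforward to compute from edge sets. The sketch below is our own helper, not the paper's scoring code: a pattern is represented as a set of undirected adjacencies plus a set of directed edges, and the example reproduces the numbers above. The specific false adjacency and false arrowhead in the estimated pattern are hypothetical, since the text does not list them.

    def adjacency_scores(true_adj, est_adj):
        """AP and AR from sets of frozenset adjacencies, e.g. {frozenset({'X1', 'X3'}), ...}."""
        correct = true_adj & est_adj
        ap = len(correct) / len(est_adj) if est_adj else 1.0
        ar = len(correct) / len(true_adj) if true_adj else 1.0
        return ap, ar

    def arrowhead_scores(true_dir, est_dir, true_adj):
        """AHP and AHR from sets of directed edges (tail, head). Following the text,
        estimated arrowheads on false adjacencies are excluded from the precision count."""
        counted = {e for e in est_dir if frozenset(e) in true_adj}
        correct = counted & true_dir
        ahp = len(correct) / len(counted) if counted else 1.0
        ahr = len(correct) / len(true_dir) if true_dir else 1.0
        return ahp, ahr

    # Worked example from Fig. 5 (bottom-left output); the false edges are hypothetical.
    true_adj = {frozenset(e) for e in [('X1', 'X3'), ('X2', 'X3'), ('X2', 'X4')]}
    true_dir = {('X1', 'X3'), ('X2', 'X3')}
    est_adj = {frozenset(e) for e in [('X1', 'X3'), ('X2', 'X4'), ('X1', 'X4')]}
    est_dir = {('X1', 'X3'), ('X1', 'X4')}

    print(adjacency_scores(true_adj, est_adj))             # ~(0.67, 0.67)
    print(arrowhead_scores(true_dir, est_dir, true_adj))   # (1.0, 0.5)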
5 The Simulation Study

As predicting the global effects of measurement error on search accuracy is hopelessly complicated by the non-locality of the problem, we instead performed a somewhat systematic simulation study. We generated causal graphs randomly, parameterized them randomly, drew pseudo-random samples of varying size, added varying degrees of measurement error, and then ran the FGES causal discovery algorithm and computed the four measures of accuracy discussed above: AP, AR, AHP, and AHR.

We generated random causal graphs, each with 20 variables. In half of our simulations the "average degree" of the graph (the average number of adjacencies for a variable) was 2, and in half it was 6. A graph with 20 variables and an average degree of 2 is fairly sparse; with an average degree of 6 it is fairly dense. We randomly chose edge coefficients, and for each SEM we generated 10 samples of size 100, 500, 1,000, and 5,000. For each sample we standardized the variables to have mean 0 and variance 1, and we then applied varying degrees of measurement error to all of the measured variables. For each variable X, we replaced X with Xm = X + ε_x, where ε_x ~ N(0, σ²), for values of σ² ∈ {0, .01, .1, .2, .3, .4, .5, 1, 2, 3, 4, 5}. Since the variance of all the original variables is 1.0, when σ² = .1 approximately 10% of the variance of Xm was from measurement error; when σ² = 1, 50% of the variance of Xm was from measurement error; and when σ² = 5, over 83% of the variance of Xm was from measurement error.
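For readers who want to set up a study of this kind, the sketch below generates a random 20-variable linear SEM, standardizes the sample, and adds measurement error over the σ² grid above. It is our own sketch under stated assumptions: the graph generator, the coefficient range, and the function names are not from the paper, and the search step itself is not shown — the noisy data would be handed to an FGES implementation (for example, the one distributed with Tetrad) and scored with AP/AR/AHP/AHR.

    import numpy as np

    def random_dag(p, avg_degree, rng):
        """Random DAG over p variables as an upper-triangular adjacency matrix,
        with edge probability chosen to hit the target average degree in expectation."""
        prob = avg_degree / (p - 1)
        return np.triu(rng.random((p, p)) < prob, k=1)

    def simulate_sem(adj, n, rng, coef_low=0.2, coef_high=1.0):
        """Linear-Gaussian SEM data generated in causal order, then standardized.
        The coefficient range is an assumption for illustration."""
        p = adj.shape[0]
        coefs = adj * rng.uniform(coef_low, coef_high, (p, p)) * rng.choice([-1, 1], (p, p))
        data = np.zeros((n, p))
        for j in range(p):                      # columns are already in causal order
            data[:, j] = data @ coefs[:, j] + rng.normal(0, 1, n)
        return (data - data.mean(0)) / data.std(0)

    def add_measurement_error(data, sigma2, rng):
        """Replace each X with Xm = X + e, where e ~ N(0, sigma2)."""
        return data + rng.normal(0, np.sqrt(sigma2), data.shape)

    rng = np.random.default_rng(42)
    adj = random_dag(20, avg_degree=2, rng=rng)
    clean = simulate_sem(adj, n=1000, rng=rng)
    for sigma2 in [0, .01, .1, .2, .3, .4, .5, 1, 2, 3, 4, 5]:
        noisy = add_measurement_error(clean, sigma2, rng)
        print(sigma2, round(sigma2 / (1 + sigma2), 2))   # nominal fraction of var(Xm) from noise
        # `noisy` would then be passed to an FGES implementation and scored.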
6 Results

First consider the effect of sample size on FGES accuracy when there is no measurement error at all. In Fig. 6 we show the performance for sparse graphs (average degree = 2) and for fairly dense graphs (average degree = 6). For sparse graphs, accuracy is high even for samples of only 100, and by N = 1,000 it is almost maximal. For denser graphs, which are much harder to discover, performance is still not ideal even at sample size 5,000.

Figure 6: Baseline: Zero Measurement Error.

Fig. 7 shows the effects of measurement error on accuracy for both sparse and dense graphs at sample sizes 100, 500, and 5,000. For sparse graphs, at sample size 100 the accuracy of FGES decays severely even for measurement error below 17% of the variance (ME = .2), especially for orientation accuracy. The results are quite poor at 50% (ME = 1) and are near 0 when two thirds of the variance of each variable is noise.

Figure 7: Accuracy and Measurement Error.

Surprisingly, the decay in the performance of FGES from measurement error is much slower at larger sample sizes. For N = 500, the algorithm is still reasonably accurate at 50% measurement error, and adjacency precision remains reasonably high even when 2/3 of the variance of all variables is just noise. If FGES outputs an adjacency at sample size 500, it is still over 70% likely to actually be an adjacency in the true pattern, even with 67% measurement error. On the other hand, the same cannot be said about non-adjacencies (AR) in the 67% measurement error case: if FGES outputs a pattern in which X and Y are not adjacent, there is a lower than 40% chance that X and Y are not adjacent in the true pattern, even in patterns with average degree 2.

Does sample size help FGES's performance in the presence of measurement error? Roughly, yes. For dense graphs the performance of FGES improves somewhat with sample size at almost all levels of measurement error. Notably, even for cases in which 5/6 of the variance in the measured variables was due to measurement error (ME = 5), most measures of FGES accuracy improved dramatically when the sample size was increased from N = 500 to N = 5,000.

Figure 8: Sample Size vs. Accuracy in the presence of measurement error.

7 Discussion, Conclusions and Future Research

We explored the effect of the simplest form of Gaussian measurement error on FGES, a simple causal discovery algorithm, in a simple causal context: no latent confounders, all relations linear, and all variables distributed normally. In this context, even small amounts of measurement error seriously diminish the algorithm's accuracy on small samples, especially its accuracy with respect to orientation. Surprisingly, at large sample sizes (N = 5,000) FGES is still somewhat accurate even when over 80% of the variance in each variable is random noise.

One way to conceive of the situation we describe is as latent variable model discovery, and to use discovery algorithms built to search for PAGs (partial ancestral graphs, which represent models that include latent confounding) instead of patterns. As the variables being measured with error are in some sense latent variables, this is technically appropriate. The kind of latent confounding for which PAGs are meant, however, is not the kind typically produced by measurement error on individual variables. If the true graph, for example, is X ← Y → Z, and we measure X, Y, and Z with error, then we still want the output of a PAG search to be Xm o–o Ym o–o Zm, which we would interpret to mean that any connection between Xm and Zm goes through Ym (and by proxy Y). If measurement error caused a problem, however, then the output would likely be a PAG in which every pair of variables is connected by an "o–o" adjacency, a result that is technically accurate but carries almost no information.

There are two strategies that might work and that we will try on this problem in future research. First, prior knowledge about the degree of measurement error on each variable might be used in a Bayesian approach in which we compute a posterior over each independence decision. This would involve serious computational challenges, and we do not know how many variables would be feasible with this approach. It is reasonable in principle, and a variant of it was tried on the effect of lead on IQ among children [8].

Second, if each variable studied is measured by at least two "measurement" variables with independent measurement error, then the independence relations among the "latent" variables in a linear system can be directly tested by estimating a structural equation model and performing a hypothesis test on a particular parameter, which is 0 in the population if and only if the independence holds in the population (see chapter 12, section 12.5 of Spirtes et al. [11]). This strategy depends on the measurement error in each of the measures being independent, an assumption that is often false. Techniques to find measurement variables that do have independent error have been developed, however [9], and these can be used to help identify measures with independent error in situations where measurement is a serious worry.

A different measurement problem that is likely to produce results similar to Gaussian measurement noise is discretization. We know that if, for continuous variables X, Y, and Z, X ⊥⊥ Y | Z, then for discrete projections Xd, Yd, and Zd of X, Y, and Z, the independence fails in almost every case. That is, it is often the case that X ⊥⊥ Y | Z but Xd and Yd are not independent given Zd. This means that using discrete measures of variables that are more plausibly continuous likely leads to the same problems for causal discovery as measuring continuous variables with error. As far as we are aware, no one has a clear idea of the severity of the problem, even though using discrete measures of plausibly continuous variables is commonplace.
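The discretization point is easy to check by simulation. In the sketch below (our own illustration, using a median split as the discrete projection), X ⊥⊥ Y | Z holds exactly for the continuous chain, yet within each level of Zd the discretized Xd and Yd remain clearly associated.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000
    # Continuous chain X -> Z -> Y, so X _||_ Y | Z holds exactly.
    X = rng.normal(0, 1, n)
    Z = 0.8 * X + rng.normal(0, 0.6, n)
    Y = 0.8 * Z + rng.normal(0, 0.6, n)

    # Discrete projections by median split (one of many possible discretizations).
    Xd, Yd, Zd = (v > np.median(v) for v in (X, Y, Z))

    # If Xd were independent of Yd given Zd, these two conditional proportions would match.
    for z in (False, True):
        mask = Zd == z
        p1 = Yd[mask & Xd].mean()    # P(Yd = 1 | Xd = 1, Zd = z)
        p0 = Yd[mask & ~Xd].mean()   # P(Yd = 1 | Xd = 0, Zd = z)
        print(z, round(p1, 3), round(p0, 3))   # clearly unequal: independence fails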
Acknowledgments

This research was supported by NIH grant #U54HG008540 (the Center for Causal Discovery) and by NSF grant #1317428.

References

[1] Wayne A. Fuller. Measurement Error Models. John Wiley & Sons, 1987.

[2] C. Hanson, S. J. Hanson, J. Ramsey, and C. Glymour. Atypical effective connectivity of social brain networks in individuals with autism. Brain Connectivity, 3(6):578–589, 2013.

[3] M. Kuroki and J. Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.

[4] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[5] J. Pearl. Measurement bias in causal inference. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 425–432, 2010.

[6] J. Ramsey. Scaling up greedy causal search for continuous variables. Technical report, Center for Causal Discovery, 2015. arXiv:1507.07749.

[7] P. Rosenbaum. Observational Studies. Springer-Verlag, 1995.

[8] R. Scheines. Estimating latent causal influences: Tetrad III variable selection and Bayesian parameter estimation: the effect of lead on IQ. In Handbook of Data Mining. Oxford University Press, 2000.

[9] R. Silva, C. Glymour, R. Scheines, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191–246, 2006.

[10] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Lecture Notes in Statistics 81. Springer-Verlag, 1993.

[11] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.