Type-II Errors of Independence Tests Can Lead to Arbitrarily Large Errors in Estimated Causal Effects: An Illustrative Example

Nicholas Cornia & Joris M. Mooij
Informatics Institute
University of Amsterdam, The Netherlands
{n.cornia,j.m.mooij}@uva.nl

Abstract

Estimating the strength of causal effects from observational data is a common problem in scientific research. A popular approach is based on exploiting observed conditional independences between variables. It is well known that this approach relies on the assumption of faithfulness. In our opinion, a more important practical limitation of this approach is that it relies on the ability to distinguish independences from (arbitrarily weak) dependences. We present a simple analysis, based on purely algebraic and geometrical arguments, of how the estimation of the causal effect strength, based on conditional independence tests and background knowledge, can have an arbitrarily large error due to the uncontrollable type II error of a single conditional independence test. The scenario we are studying here is related to the LCD algorithm by Cooper [1] and to the instrumental variable setting that is popular in epidemiology and econometrics. It is one of the simplest settings in which causal discovery and prediction methods based on conditional independences arrive at non-trivial conclusions, yet for which the lack of uniform consistency can result in arbitrarily large prediction errors.

Introduction

Inferring causation from observational data is a common problem in several fields, such as biology and economics. To deal with the presence of unmeasured confounders of observed random variables, the so-called instrumental variable technique [2] has found applications in genetics [3], epidemiology [4, 5] and economics [6]. Given two observable random variables possibly influenced by a hidden confounder, an instrumental variable is a third observed variable which is assumed to be independent of the confounder. In practice it is difficult to decide whether the instrumental variable definition is satisfied, and the method has aroused some skepticism [7]. In this paper, we study a setting that is similar in spirit to the instrumental variable model, but where all conditional independence assumptions are directly testable on the observed data. A similar scenario was first studied by Cooper [1] and independently rediscovered in the context of genome biology by Chen et al. [8].

An important assumption in causal discovery methods based on conditional independences is faithfulness, which means that the observed joint distribution does not contain any additional (conditional) independences beyond those induced by the causal structure. Usually, faithfulness is justified by the assumption that unfaithful distributions form a set of Lebesgue measure zero in the set of the model parameters. By showing that one can create a sequence of faithful distributions which converges to an unfaithful one, Robins et al. proved the lack of uniform consistency of causal discovery algorithms [9]. Zhang and Spirtes [10] then introduced the "Strong Faithfulness" assumption to recover the uniform consistency of causal discovery. Using geometric and combinatorial arguments, Uhler et al. [11] addressed the question of how restrictive the Strong Faithfulness assumption is in terms of the volume of distributions that do not satisfy it. Even for a modest number of nodes and for sparse graphs, the "not strongly faithful" regions can be surprisingly large, and Uhler et al. argue that this result should discourage the use of large-scale causal algorithms based on conditional independence tests, such as the PC and FCI algorithms [12].
In this work, we analyse in the context of the LCD setting how an error in a single conditional independence test may already lead to arbitrarily large errors in predicted causal effect strengths, even when the faithfulness assumption is not violated. Our results may not be surprising for those familiar with the work of [9], but we believe that the analysis we present here may be easier to understand for those without a background in statistics, as we separate statistical issues (the possibility of type II errors in a conditional independence test on a finite sample) from a rather straightforward analysis of the problem in the population setting. We use an algebraic approach, showing how causal predictions may already go wrong in the simple context of linear structural equation models with a multivariate Gaussian distribution.

In Section 1, we begin with a brief formal description of the problem setting, giving the definitions of the causal effect, the instrumental variable setting, the LCD algorithm and the toy model we study. We consider three observed random variables (X1, X2, X3), which is the minimal number for which a non-trivial conditional independence test can be obtained. In Section 2, we show how an (arbitrarily weak) conditional dependence that goes undetected can influence our estimation of the causal effect of X2 on X3 from the observed covariance matrix, when a confounder between X2 and X3 is almost offset by a direct effect from X1 to X3. In fact, we show that this phenomenon can lead to an arbitrarily large error in the estimated causal effect as the noise variance of X2 approaches zero. We finish with conclusions in Section 3.

1 Problem setting

1.1 LCD algorithm

The model we are interested in arises from the work of Cooper [1], who proposed the "LCD" algorithm for causal discovery in observational databases, and the more recent paper of Chen et al. [8], who proposed the "Trigger" algorithm to infer transcriptional regulatory networks among genes.

Throughout this section we will assume:

• Acyclicity;
• No selection bias.

Definition 1.1. (LCD setting) Given three random variables X1, X2, X3 such that the following statistical properties and prior assumptions are satisfied:

Statistical dependences:
• X1 ⊥̸⊥ X2
• X2 ⊥̸⊥ X3
• X1 ⊥⊥ X3 | X2

Prior assumptions:
• An(X1) ∩ {X2, X3} = ∅
• Faithfulness

where An(X) is the set of the causal ancestors of X (which includes X itself); the first prior assumption thus means that we assume that X1 is not caused by the other observed variables X2, X3.

Cooper [1] proved that:

Theorem 1.1. Under the assumptions in Definition 1.1, the causal structure must be a subgraph of the graph

X1 → X2 → X3,   X1 ↔ X2.

Here, the directed arrows indicate a direct causal relationship and the bidirected edge denotes an unobserved confounder.

Our primary interest is to predict p(X3 | do(X2)), the distribution of X3 after an intervention on X2. In general, this quantity may differ from p(X3 | X2), the conditional distribution of X3 given X2 [13]. In the linear-Gaussian case, the quantity

∂E(X3 | do(X2)) / ∂X2

measures the causal effect of X2 on X3. It is easy to show that in the LCD setting, the interventional and conditional quantities are equal:

Corollary 1.1. Under the LCD assumptions in Definition 1.1,

p(X3 | do(X2)) = p(X3 | X2).

Therefore, in the linear-Gaussian case, the quantity

∂E(X3 | do(X2)) / ∂X2 = ∂E(X3 | X2) / ∂X2 = Cov(X3, X2) / Var(X2)   (1)

is a valid estimator for the causal effect of X2 on X3.
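As a concrete illustration of estimator (1) (not part of the original analysis; all coefficient values below are hypothetical), the following minimal Python sketch simulates a faithful linear-Gaussian LCD triple X1 → X2 → X3 and compares the regression slope Cov(X3, X2)/Var(X2) with the true causal coefficient:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Faithful linear-Gaussian LCD triple: X1 -> X2 -> X3, no confounders.
    a12, a23 = 0.8, 1.5                       # hypothetical causal strengths
    X1 = rng.normal(size=n)
    X2 = a12 * X1 + rng.normal(size=n)
    X3 = a23 * X2 + rng.normal(size=n)

    # Estimator (1): Cov(X3, X2) / Var(X2) estimates dE(X3|do(X2))/dX2 here.
    effect_hat = np.cov(X3, X2)[0, 1] / np.var(X2, ddof=1)
    print(effect_hat, a23)                    # agree up to sampling noise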
1.2 Relationship with instrumental variables

The other model relevant for our discussion is the so-called instrumental variable model. Following Pearl [13], we define:

Definition 1.2. (Instrumental variable setting) Given three random variables X1, X2, X3, we call X1 an instrumental variable if the following conditions are satisfied:

Statistical dependences:
• X1 ⊥̸⊥ X2

Prior assumptions:
• X1 ⊥⊥ X3 | do(X2)
• Faithfulness

The second assumption says that X1 and X3 are independent after an intervention on the variable X2. In terms of the causal graph, this means that all the unblocked paths between X1 and X3 contain an arrow that points to X2. Unfortunately, the instrumental variable property cannot be directly tested from observed data. The causal graph for the IV setting is a subgraph of the graph

X1 → X2 → X3,   X1 ↔ X2,   X2 ↔ X3.

So, a possible confounder between X2 and X3 is allowed, in contrast with the LCD setting. Note that the LCD setting is a special case of the IV model.

Lemma 1.1. Under the IV assumptions in Definition 1.2 and for the linear-Gaussian case, the quantity

Cov(X1, X3) / Cov(X1, X2)

is a valid estimator for the causal effect of X2 on X3.
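To make Lemma 1.1 concrete, here is a small sketch (ours, with invented coefficients) in which X2 and X3 share a hidden confounder: the ratio Cov(X1, X3)/Cov(X1, X2) still recovers the causal strength of X2 on X3, whereas the regression slope of (1) is biased:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000

    a12, a23 = 0.8, 1.5                       # hypothetical causal strengths
    b2, b3 = 1.0, -1.0                        # hypothetical confounding strengths
    U  = rng.normal(size=n)                   # hidden confounder of X2 and X3
    X1 = rng.normal(size=n)                   # instrument, independent of U
    X2 = a12 * X1 + b2 * U + rng.normal(size=n)
    X3 = a23 * X2 + b3 * U + rng.normal(size=n)

    iv_est  = np.cov(X1, X3)[0, 1] / np.cov(X1, X2)[0, 1]   # Lemma 1.1
    reg_est = np.cov(X3, X2)[0, 1] / np.var(X2, ddof=1)     # estimator (1)
    print(iv_est, reg_est, a23)               # iv_est ~ a23; reg_est is biased by U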
1.3 Type II errors in LCD

In practice, the confidence in the result of the conditional independence test X1 ⊥⊥ X3 | X2 in the LCD setting depends on the sample size. Indeed, it can be hard to distinguish a weak conditional dependence

X1 ⊥̸⊥ X3 | X2

from a conditional independence using a sample of finite size. Here we study the question of what happens to our prediction of the causal effect of X2 on X3 if the conditional independence test encounters a type II error (i.e., erroneously accepts the null hypothesis of independence).

Note that a type I error (i.e., erroneously rejecting the null hypothesis of independence) in the tests X1 ⊥̸⊥ X2 and X2 ⊥̸⊥ X3 is not as dangerous as a type II error in the conditional independence test. Indeed, the probability of a type I error can be made arbitrarily small by tuning the significance level appropriately. In addition, a type I error would merely let the LCD algorithm reject a valid triple, i.e., lower the recall instead of leading to wrong predictions.

For these reasons we study the model described in the following definition, which allows the presence of a hidden confounder X4 and a direct effect from X1 on X3 (not mediated via X2). We assume that these additional features result in a possibly weak conditional dependence between X1 and X3 given X2. For simplicity we consider only the linear-Gaussian case. We also assume no confounders between X1 and X2, between X1 and X3, or between X1, X2, X3. This simplification will not influence the final result of the paper, because we will prove that unboundedness of the causal effect estimation error is already achieved for this special case.

Definition 1.3. We assume that the "true" causal model has the causal graph

X1 → X2 → X3,   X1 → X3,   X4 → X2,   X4 → X3,

with X4 a hidden confounder of X2 and X3. This is one of the possible causal structures that is compatible with the following conditions:

Statistical dependences:
• X1 ⊥̸⊥ X2
• X2 ⊥̸⊥ X3
• A weak conditional dependence X1 ⊥̸⊥ X3 | X2

Prior assumptions:
• Faithfulness
• An(X1) ∩ {X2, X3} = ∅

The observed random variables are X1, X2, X3, while X4 is a hidden confounder, assumed to be independent of X1. The joint distribution of the observed variables is assumed to be a multivariate Gaussian distribution with covariance matrix Σ and zero mean vector. We also assume that the structural equations of the model are linear:

X = AX + E,   (2)

where X = (X1, ..., X4)^T is the vector of the extended system, E = (E1, ..., E4)^T is the vector of the independent noise terms, such that

E ∼ N(0, ∆),   ∆ = diag(δi²),

and A = (αij) ∈ M4(R) is (up to a permutation of the indices) a real upper triangular matrix in the space M4(R) of real 4 × 4 matrices that defines the causal strengths between the random variables of the system.

Remark 1.1. In [14], an implicit representation of the confounder X4 is used, by allowing a non-zero covariance between the noise variables E2, E3. It can be shown that for our purposes, the two representations are equivalent and yield the same conclusions.

In the Gaussian case, a conditional independence is equivalent to a vanishing partial correlation:

Lemma 1.2. Given three random variables (X1, X2, X3) with a multivariate Gaussian distribution, the conditional independence

X1 ⊥⊥ X3 | X2

is equivalent to a vanishing partial correlation

ρ13·2 = (ρ13 − ρ12 ρ23) / √((1 − ρ12²)(1 − ρ23²)) = 0,   (3)

where ρij is the correlation coefficient of Xi and Xj.

In the model described in Definition 1.3,

∂E(X3 | do(X2)) / ∂X2 = α23.   (4)

In contrast with the LCD model in Definition 1.1, the equality (1) no longer holds. We are interested in the error in the estimation of the effect of X2 on X3 that would be due to a type II error of the conditional independence test in the LCD algorithm. The next section is dedicated to the analysis of the difference between the true value (4) and the estimate in (1):

|E(X3 | X2) − E(X3 | do(X2))| = |g(A, Σ)| |X2|,

where the "causal effect estimation error" is given by

g(A, Σ) = Σ32/Σ22 − α23.   (5)
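For reference, the partial correlation in (3) — the population quantity that a Gaussian conditional independence test for X1 ⊥⊥ X3 | X2 effectively thresholds — can be computed from a covariance matrix as in the following sketch (ours; the example matrix is arbitrary and encodes a weak conditional dependence):

    import numpy as np

    def partial_corr_13_2(S):
        """rho_{13.2} of equation (3), computed from a 3x3 covariance matrix S."""
        D = np.sqrt(np.diag(S))
        R = S / np.outer(D, D)                # correlation matrix
        r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
        return (r13 - r12 * r23) / np.sqrt((1 - r12**2) * (1 - r23**2))

    # Example covariance matrix with a weak dependence of X1 and X3 given X2.
    S = np.array([[1.00, 0.80, 0.73],
                  [0.80, 1.00, 0.90],
                  [0.73, 0.90, 1.00]])
    print(partial_corr_13_2(S))               # close to, but not exactly, zero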
2 Estimation of the causal effect error from the observed covariance matrix

The following proposition gives a set of equations for the observed covariance matrix Σ, given the model parameters (A, ∆) and the linear structural equation model (2).

Proposition 2.1. The mapping Φ : (A, ∆) ↦ Σ that maps model parameters (A, ∆) to the observed covariance matrix Σ according to the model in Definition 1.3 is given by:

Σ11 = δ1²   (6)
Σ12 = α12 δ1²   (7)
Σ13 = (α13 + α23 α12) δ1²   (8)
Σ11 Σ23 = Σ12 Σ13 + Σ11 [δ2² α23 + δ4² α42 (α43 + α23 α42)]   (9)
Σ11 Σ22 = Σ12² + Σ11 (δ2² + δ4² α42²)   (10)
Σ11 Σ33 = Σ13² + Σ11 [δ2² α23² + δ3² + δ4² (α43 + α23 α42)²]   (11)

Proof. It is possible to express the covariance matrix Σ̄ of the joint distribution of X1, ..., X4 in terms of the model parameters as

Σ̄ = (I − A)^{-T} ∆ (I − A)^{-1}.

The individual components in (6)–(11) can now be obtained by straightforward algebraic calculations.

Remark 2.1. (Instrumental variable estimator) From equation (8) it follows immediately that for α13 = 0 we have

α23 = Σ13 / Σ12,

which corresponds to the usual causal effect estimator in the instrumental variable setting [3].

The lemma we present now reflects the fact that we are always free to choose the scale for the unobserved confounder X4:

Lemma 2.1. The equations of Proposition 2.1 are invariant under the transformation

ᾱ4j = √(δ4²) α4j,   δ̄4² = 1,   for j ∈ {2, 3}.

Proof. This invariance follows from the fact that α42 and α43 always appear in a homogeneous polynomial of degree 2, and they are always coupled with a δ4² term.

Without loss of generality we can assume from now on that δ4² = 1.

Remark 2.2. (Geometrical interpretation) From a geometrical point of view, the joint system of equations for the observed covariance matrix defines a manifold MΣ in the space of the model parameters M4(R) × Dδ², where M4(R) is the space of the possible causal strengths αij and

Dδ² = ∏_{i=1}^{3} [0, Σii]

is the compact hypercube of the noise variances. Note that we have used the symmetry Σ̄44 = δ4² = 1 and that

δi² ≤ Σii

from equations (6), (10) and (11). Note that the map Φ : (A, ∆) ↦ Σ is not injective. This means that given an observed covariance matrix Σ, it is not possible to identify the model parameters in a unique way. Indeed, the number of equations is six, while the number of model parameters is eight. Geometrically, this means that the manifold MΣ does not reduce to a single point in the space of model parameters.

[Figure: the map Φ sends every parameter vector (A, ∆) on the manifold MΣ to the same observed covariance matrix Σ; a well-defined inverse Φ⁻¹ does not exist.]

Nevertheless, it is still an interesting question whether the function g is bounded on MΣ or not, i.e., whether we can give any guarantees on the estimated causal effect. Indeed, for the instrumental variable case with binary variables, such bounds can be derived (see, e.g., [13]).
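The identities (6)–(11) are straightforward to check numerically. The sketch below (ours; all parameter values are arbitrary) builds Σ̄ = (I − A)^{-T} ∆ (I − A)^{-1} for the model of Definition 1.3 and verifies, for instance, equations (8) and (10):

    import numpy as np

    # Hypothetical parameter values for the model of Definition 1.3.
    a12, a13, a23, a42, a43 = 0.8, -0.5, 1.2, 0.7, -0.6
    d1, d2, d3, d4 = 1.0, 0.4, 0.9, 1.0          # noise variances delta_i^2

    # A[i, j] = alpha_{ij}: causal strength of X_{i+1} on X_{j+1}.
    A = np.zeros((4, 4))
    A[0, 1], A[0, 2], A[1, 2] = a12, a13, a23    # X1->X2, X1->X3, X2->X3
    A[3, 1], A[3, 2] = a42, a43                  # X4->X2, X4->X3
    Delta = np.diag([d1, d2, d3, d4])

    # Sigma_bar = (I - A)^{-T} Delta (I - A)^{-1}, as in the proof of Proposition 2.1.
    I = np.eye(4)
    Sigma_bar = np.linalg.inv(I - A).T @ Delta @ np.linalg.inv(I - A)
    S = Sigma_bar[:3, :3]                        # observed covariance matrix Sigma

    # Equation (8): Sigma_13 = (alpha_13 + alpha_23 alpha_12) delta_1^2.
    print(np.isclose(S[0, 2], (a13 + a23 * a12) * d1))
    # Equation (10): Sigma_11 Sigma_22 = Sigma_12^2 + Sigma_11 (delta_2^2 + delta_4^2 alpha_42^2).
    print(np.isclose(S[0, 0] * S[1, 1], S[0, 1]**2 + S[0, 0] * (d2 + d4 * a42**2)))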
The following Theorem and its Corollary are the main results of this paper. We will prove that there still remain degrees of freedom in the noise variances δ2², δ3² and the signs s1, s2, given the observed covariance matrix Σ, and that these lead to an unbounded causal effect estimation error g(A, Σ).

Theorem 2.1. Given the causal model in Definition 1.3, there exists a map

Ψ : M3(R) × D(Σ) × {−1, +1}² → M4(R)   (12)

such that for all (A, ∆):

Ψ(Φ(A, ∆), δ2², δ3², s1, s2) = A,   (13)

with δ2², δ3² the corresponding noise variances and s1, s2 the signs of α42, α43. Here D(Σ) = [0, m/Σ11] × [0, det Σ/m] ⊂ R² is the rectangle in which the noise variances of X2 and X3 live, with m defined below in (19). The map Ψ gives explicit solutions for the causal strengths αij, given the observed covariance matrix Σ, the noise variances δ2², δ3² and the signs si = ±1. The components of Ψ are given by:

α12 = Σ12 / Σ11   (14)
α42 = s1 √(m/Σ11 − δ2²)   (15)
α43 = s2 √(det Σ − m δ3²) / √(Σ11 δ2²)   (16)
α13 = s1 s2 Σ12 √(det Σ − m δ3²) √(m − Σ11 δ2²) / (m Σ11 √(δ2²)) + ϑ/m   (17)

and, most importantly for our purpose,

α23 = γ/m − s1 s2 √(det Σ − m δ3²) √(m − Σ11 δ2²) / (m √(δ2²)).   (18)

Here,

m = Σ11 Σ22 − Σ12² > 0   (19)
η = Σ11 Σ33 − Σ13² > 0
ω = Σ22 Σ33 − Σ23² > 0
ϑ = Σ13 Σ22 − Σ12 Σ23
γ = Σ11 Σ23 − Σ12 Σ13.

Proof. The proof proceeds by explicitly solving the system of equations (6)–(11). Some useful identities are:

α13 = Σ12 α42 α43 / m + ϑ/m,
α42 α43 = (γ − α23 m) / Σ11,
ρ13·2 = ϑ / √(ω m),
η m − γ² = Σ11 det Σ.

The signs in the equations are a consequence of the second-degree polynomial equations.

Corollary 2.1. It is possible to express the error in the estimated causal effect as

g(Ψ(Σ, δ2², δ3², s1, s2), Σ) = ϑ Σ12 / (m Σ22) + s1 s2 √(det Σ − m δ3²) √(m − Σ11 δ2²) / (m √(δ2²)).   (20)

By optimizing over δ3² we get

α23 ∈ [b−, b+] ⊂ R,

with

b±(δ2²) = γ/m ± √(det Σ) √(m − Σ11 δ2²) / (m √(δ2²)).   (21)

The length of the interval [b−, b+] is a function of (Σ, δ2²) and satisfies

∂|b+ − b−| / ∂δ2² < 0.

Proof. Equation (20) follows from (18) and the identity

Σ23 / Σ22 = γ/m + ϑ Σ12 / (m Σ22).

From equation (11), combined with the results of Theorem 2.1 and the fact that δ3² Σ11 > 0, we can obtain the inequality

m α23² − 2γ α23 + η − Σ11 α43² ≥ 0.

The two solutions of the corresponding quadratic equation define the interval [b−, b+]. Its length is a decreasing function of δ2².
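The divergence expressed by (20) is easy to reproduce numerically. The following sketch (ours; the covariance matrix is the same arbitrary example used earlier) evaluates α23 via (18) and the error g via (5) for a decreasing sequence of noise variances δ2², with δ3² and the signs held fixed:

    import numpy as np

    S = np.array([[1.00, 0.80, 0.73],
                  [0.80, 1.00, 0.90],
                  [0.73, 0.90, 1.00]])        # observed covariance matrix Sigma

    m     = S[0, 0] * S[1, 1] - S[0, 1]**2    # quantities of (19)
    gamma = S[0, 0] * S[1, 2] - S[0, 1] * S[0, 2]
    detS  = np.linalg.det(S)

    d3sq, s1, s2 = 0.0, 1.0, 1.0              # delta_3^2 and the signs, held fixed
    for d2sq in [1e-1, 1e-2, 1e-3, 1e-4]:     # delta_2^2 -> 0 within [0, m/Sigma_11]
        k = np.sqrt(detS - m * d3sq) * np.sqrt(m - S[0, 0] * d2sq)
        alpha23 = gamma / m - s1 * s2 * k / (m * np.sqrt(d2sq))   # equation (18)
        g = S[2, 1] / S[1, 1] - alpha23                           # equation (5), i.e. (20)
        print(d2sq, g)                        # |g| grows like 1/sqrt(d2sq) = 1/delta_2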
Unfortunately, the causal effect strength α23 in equation (18) is unbounded. This means that for all choices of the observed covariance matrix Σ that are in accordance with the model assumptions in Definition 1.3, the set of model parameters (A, ∆) ∈ MΣ that would explain Σ leads to an unbounded error g. Indeed, a singularity is reached in the hyperplane δ2² = 0, which corresponds to making the random variable X2 deterministic with respect to its parents X1, X4. Figure 1 shows the singularity of the function |g(Σ, δ2², δ3²)| in the limit δ2² → 0. The rate of growth is proportional to the inverse of the standard deviation of the noise variable E2:

|g| ∝ 1/δ2   as δ2 → 0.   (22)

Figure 1: Causal effect estimation error |g| as a function of δ2², for fixed δ3², Σ and s1 s2 = 1.

Remark 2.3. (Lower bound for δ2²) Corollary 2.1 is the main result of our analysis. The right-hand side of (20) consists of two terms: the first one, through ϑ, represents the contribution of the partial correlation, and is small if ρ13·2 is small. The second term is a fundamental, intrinsic quantity that cannot be controlled through the conditional independence test and the sample size. However, in situations where one is willing to assume a lower bound on δ2²,

δ2² ≥ δ̂2²,

it is possible to give a confidence interval [b−, b+] for the function g, depending on the choice of the lower bound δ̂2².

Remark 2.4. (IV estimation error) In the instrumental variable literature the IV estimator of Lemma 1.1 is used. Unfortunately, the error function of this estimator,

h(Σ, A) = Σ13/Σ12 − α23,   (23)

is proportional to α13, and from (17) one can deduce a similar growth rate of the function h in terms of the variance of the noise term E2:

|h| ∝ 1/δ2   as δ2 → 0.   (24)

Remark 2.5. (Singularity analysis) Figure 2 shows a contour plot of |g| on the rectangle D(Σ) ∋ (δ2², δ3²). The singularity of the causal effect function g is reached in the degenerate case, when the conditional distribution of X2 given X1 and X4 approaches a Dirac delta function. This cannot be detected empirically, as we can still have well-defined covariance matrices Σ of the observed system even if the covariance matrix Σ̄ of the extended one is degenerate.

Figure 2: The function |g| has a singularity in the hyperplane δ2² = 0.

Let us investigate in detail the limit δ2² → 0 from the point of view of the causal model. The following proposition shows a simple example of how the causal strengths can be arbitrarily large while the entries of the observed covariance matrix Σij remain finite.

Proposition 2.2. Assume that the observed covariance matrix Σ is positive definite. Then, in the limit δ2² → 0 we have the following scenario for the causal strength parameters:

α23 ≈ ± δ2⁻¹,
α43 ≈ ∓ sgn(α42) δ2⁻¹,
α13 ≈ ∓ sgn(α12) δ2⁻¹.

This limit, in which our error in the estimated causal effect strength of X2 on X3 diverges, is illustrated in Figure 3.

Figure 3: Scenarios in which the error in the causal effect strength of X2 on X3 based on the LCD algorithm may become infinitely large: in the graph of Definition 1.3, the edges X1 → X2 and X4 → X2 keep finite strengths α12 and α42, while the strengths of X2 → X3, X4 → X3 and X1 → X3 approach ±∞, ∓∞ and ∓∞, respectively.

3 Conclusions and future work

Corollary 2.1 shows how the causal effect estimation error can be extremely sensitive to small perturbations of our model assumptions. Equation (20) holds for any value of ϑ (which is proportional to the partial correlation ρ13·2), and the second term vanishes when the confounder is not present. This shows that with a finite sample, a type II error in the conditional independence test may lead to an arbitrarily large error in the estimated causal effect. Even in the infinite-sample limit, this error could be arbitrarily large if faithfulness is violated. The result is in agreement with the results in [9], and it shows in a clear algebraic way how type II errors of conditional independence tests can lead to wrong conclusions.

We believe that this conclusion holds more generally: even when we increase the complexity and the number of observed variables, the influence of confounders will still remain hidden, mixing their contribution with the visible parameters, thereby potentially leading to arbitrarily large errors. This means that for individual cases, we cannot give any guarantees on the estimation error without making further assumptions. An interesting question for future research is whether this negative worst-case analysis can be supplemented with a more positive average-case analysis of the estimation error. Indeed, this is what one would hope for if Occam's razor is to be of any use for causal inference problems.

Other possible directions for future work are:

• Study more complex models, in terms of the number of nodes, edges and cycles.

• Bayesian model selection: We hope that the Bayesian approach will automatically prefer a simpler model that excludes a possible weak conditional dependence even though the partial correlation in the data is not exactly zero.

• Bayesian Information Criterion: We could directly assign a score based on the likelihood function of the data given the model parameters (A, ∆) and the model complexity, without assuming any prior distribution for the model parameters.

• Nonlinear structural causal equations: To deal with nonlinearity it is possible to consider Spearman's correlation instead of the usual one, using the following relationships (a small rank-based sketch follows this list):

  m = Σ11 Σ22 (1 − ρ12²)
  η = Σ11 Σ33 (1 − ρ13²)
  ω = Σ22 Σ33 (1 − ρ23²)
  γ = Σ11 √(Σ22 Σ33) (ρ23 − ρ12 ρ13)
  ϑ = Σ22 √(Σ11 Σ33) (ρ13 − ρ12 ρ23)

• "Environment" variable: In many applications in biology, for example where X1 is genotype, X2 gene expression and X3 phenotype, the observed random variables X2 and X3 are strongly dependent on the environmental conditions of the experiment. It might be reasonable to assume that most of the external variability is carried by the covariance between the environment variable W and the other measured variables, including possible confounders. This leads to the following graphical model, which could be useful in deriving some type of guarantees for this scenario:

  [Figure: the causal graph of Definition 1.3, extended with an environment variable W.]
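A minimal rank-based sketch of the quantities listed in the bullet on nonlinear structural causal equations (ours; the data-generating mechanism and all coefficients are invented for illustration), with Spearman's correlation computed as the Pearson correlation of ranks:

    import numpy as np

    def spearman_corr(x, y):
        """Spearman's rank correlation: Pearson correlation of the ranks."""
        rx = np.argsort(np.argsort(x))
        ry = np.argsort(np.argsort(y))
        return np.corrcoef(rx, ry)[0, 1]

    # Hypothetical nonlinear data: X1 -> X2 -> X3 with monotone nonlinearities.
    rng = np.random.default_rng(2)
    n = 10_000
    X1 = rng.normal(size=n)
    X2 = np.tanh(X1) + 0.5 * rng.normal(size=n)
    X3 = X2**3 + 0.5 * rng.normal(size=n)

    S = np.cov(np.vstack([X1, X2, X3]))       # variances Sigma_ii on the diagonal
    r12 = spearman_corr(X1, X2)
    r13 = spearman_corr(X1, X3)
    r23 = spearman_corr(X2, X3)

    # Rank-based versions of the quantities in (19).
    m     = S[0, 0] * S[1, 1] * (1 - r12**2)
    eta   = S[0, 0] * S[2, 2] * (1 - r13**2)
    omega = S[1, 1] * S[2, 2] * (1 - r23**2)
    gamma = S[0, 0] * np.sqrt(S[1, 1] * S[2, 2]) * (r23 - r12 * r13)
    theta = S[1, 1] * np.sqrt(S[0, 0] * S[2, 2]) * (r13 - r12 * r23)
    print(m, eta, omega, gamma, theta)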
Acknowledgements

We thank Tom Heskes for posing the problem, and Jonas Peters for inspiring discussions. We thank the reviewers for their comments that helped us improve the manuscript.

References

[1] G. F. Cooper. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 1:203–224, 1997.

[2] R. J. Bowden and D. A. Turkington. Instrumental Variables. Cambridge University Press, 1984.

[3] V. Didelez and N. Sheehan. Mendelian randomization as an instrumental variable approach to causal inference. Statistical Methods in Medical Research, 16:309–330, 2007.

[4] S. Greenland. An introduction to instrumental variables for epidemiologists. International Journal of Epidemiology, 29:722–729, 2000.

[5] D. A. Lawlor, R. M. Harbord, J. A. C. Sterne, N. Timpson, and G. D. Smith. Mendelian randomization: Using genes as instruments for making causal inferences in epidemiology. Statistics in Medicine, 27:1133–1163, 2008.

[6] J. D. Angrist, G. W. Imbens, and D. B. Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91:444–455, 1996.

[7] J. Bound, D. A. Jaeger, and R. M. Baker. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90:443–450, 1995.

[8] L. S. Chen, F. Emmert-Streib, and J. D. Storey. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology, 8, 2007.

[9] J. M. Robins, R. Scheines, P. Spirtes, and L. Wasserman. Uniform consistency in causal inference. Biometrika, 90:491–515, 2003.

[10] J. Zhang and P. Spirtes. Strong faithfulness and uniform consistency in causal inference. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI 2003), pages 632–639, 2003.

[11] C. Uhler, G. Raskutti, P. Bühlmann, and B. Yu. Geometry of the faithfulness assumption in causal inference. The Annals of Statistics, 41:436–463, 2013.

[12] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, Prediction, and Search. The MIT Press, 2000.

[13] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

[14] M. Drton, R. Foygel, and S. Sullivant. Global identifiability of linear structural equation models. The Annals of Statistics, 39:865–886, 2011.