Marginal causal consistency in constraint-based causal learning Anna Roumpelaki Giorgos Borboudakis Sofia Triantafillou Ioannis Tsamardinos Computer Science Dept. University of Crete Voutes University Campus, Heraklion 70013, Crete Abstract of m-separation. MAGs have several attractive properties: They are closed under marginalization, and they are pair- wise Markov: Every missing edge in the graph corresponds Maximal Ancestral Graphs (MAGs) are proba- to a conditional independence in the distribution. bilistic graphical models that can model the dis- tribution and causal properties of a set of vari- The independence model of a joint probability distribu- ables in the presence of latent confounders. They tion is the set of conditional independencies entailed by the are closed under marginalization. Invariant pair- distribution. The set of MAGs that entail the same inde- wise features of a class of Markov equivalent pendence model define a Markov equivalence class. All MAGs can be learnt from observational data sets invariant pairwise features of a Markov equivalence class using the FCI algorithm and its variations (such of MAGs can be represented using a Partial Ancestral as conservative FCI and order independent FCI). Graph (PAG) [14]. FCI [14, 15] is the first sound and We investigate the consistency of causal features complete algorithm that identifies a PAG given a data set (causal ancestry relations) obtained by FCI in over a set of possibly confounded variables V and a test of different marginals of a single data set. In prin- conditional independence. ciple, the causal relationships identified by FCI Since MAGs are closed under marginalization, FCI can be on a data set D measuring a set of variables V also used in any subset V \ L of V to obtain the corre- should not conflict the output of FCI on marginal sponding marginal PAG. Naturally, the causal features of data sets including only subsets of V. In prac- marginal PAGs should not contradict those of the PAG on tice, however, FCI is prone to error propagation, the full set of variables: a disagreement between a PAG and and running FCI in different marginals results a marginal PAG can only be a result of the sensitivity of in inconsistent causal predictions. We introduce FCI to statistical errors. We use the term marginal causal the term of marginal causal consistency to de- consistency to describe the degree of agreement of causal note the consistency of causal relationships when relationships among a PAG and its marginals. To the best learning marginal distributions, and investigate of our knowledge, this type of consistency of constraint- the marginal causal consistency of different FCI based causal discovery has not been examined before. We variations.Results indicate that marginal causal examine the outputs of FCI [14], order-independent FCI consistency varies for different algorithms, and [3], and conservative FCI with the majority rule heuristic is also sensitive to network density and marginal for collider orientations [11, 3]. size. In a simulated setting, we found that the algorithms’ con- sistency is sensitive to network density and number of vari- ables that are marginalized out. To examine whether the 1 INTRODUCTION consistency of causal relationships correlates with the cor- rectness of the induced causal features, we ranked them ac- Maximal Ancestral Graphs (MAGs) [12] can represent the cording to their frequency of appearance in randomly se- causal relationships among a set of measured variables, as lected marginals, and compared the resulting AUCs with well as the conditional independence of their joint proba- bootstrapping. While marginal consistency measures the bility distribution, in the presence of latent confounders. sensitivity of an algorithm to a specific choice of mea- Under the causal Markov and Faithfulness assumptions sured variables, bootstrapping measures the sensitivity of [14], every conditional independence that holds in the dis- an algorithm to the specific sample. Results show that tribution can be identified in the graph using the criterion marginal consistency can help identify accurate causal fea- tures. However, bootstrapping outperforms marginal-based circles on the path into appropriate tails or arrowheads. ranking in all cases. Under the causal Markov and Faithfulness assumptions The rest of this paper is structured as follows: Section 2 [14], the conditional independencies that hold in a joint introduces basic MAG notions and notation. Section 3 de- probability distribution can be identified in the correspond- fines marginal causal consistency in FCI outputs, presents ing causal graph according to the graphical criterion of m- a sound and complete method for identifying all pairwise separation [12]. Constraint-based methods query the data causal ancestry relationships in a PAG, and compares in- to identify the independence model, and then try to find ternal consistency of outputs of different FCI variations. the causal MAGs that satisfy all and only the observed in- Section 4.1 describes work related to obtaining confidence dependencies. A class of Markov equivalent MAGs, that estimates for causal ancestry relations. Section 4 describes differ in a subset of edge orientations, will in general sat- an algorithm for ranking causal relationships for a given isfy the observed constraints. PAGs have the same adjacen- data set, and compares the AUC of the proposed approach cies and all invariant orientations shared by all MAGs in a to confidence estimates obtained with bootstrapping. Con- Markov equivalence class. Specifically, an edge end-point clusions and future directions are discussed in Section 5. is oriented as an arrowhead (’>’) or tail (’-’) in a PAG if and only if it is invariant in all MAGs represented by it, 2 PRELIMINARIES and is left as a circle (’o’) otherwise. FCI is a sound and complete algorithm for discovering We use V to denote random variables, and V to denote a PAGs from observational data, but it is prone to statisti- set of variables. A graph G, denoted as G = (V, E) is cal errors. Several extensions have been proposed that aim defined over a set of variables V with edges E. A path p is a to tackle the sensitivity of FCI to error propagation: or- sequence of adjacent edges, without repetition. A directed der independent FCI, conservative FCI and majority rule path is a path where all edges are directed and have the FCI among others. Order independent FCI (denoted iFCI same direction. We use X 99K Y to denote that there exists in the rest of this paper), proposed by [3], outputs a PAG a directed path from X to Y (X is an ancestor of Y in G). that does not depend on the order the variables are given. We use X ⊥ ⊥Y |Z to denote variables X and Y to be inde- Conservative FCI [11] checks all unshielded triplets in the pendent given a set Z. In a graph G a vertex V is a collider following way: for every unshielded triple X − Y − Z on a path u if and only if there are two distinct edges on u check all subsets of X’s potential parents and all subsets of containing V as an endpoint and both are into V. Otherwise, Z’s potential parents. If Y is not in any subsets, then ori- V is a non-collider on U. In a graph G, a triplet X − V − Y ent the triplet as a collider; if Y is in all subsets leave the is unshielded if X and Y are not adjacent in G. triplet as a non-collider; otherwise tag the triple as unfaith- ful. Majority rule FCI (denoted mFCI in the rest of this A Bayesian network is represented by a Directed Acyclic paper) is slightly less strict than conservative FCI. In this Graph (DAG) G over a set of variables V and a joint proba- extension, an unshielded triple is oriented as a collider if Y bility distribution P. A directed edge denotes a direct causal is in less than 50% of the subsets and as a non-collider if relationship. We assume that the following two conditions it is in more than 50% of the subsets. To avoid unfaithful hold: the Causal Markov Condition (CMC), which states triples in case of ties, we leave the triple as a non-collider. that every variable is independent of its non-descendants We examine and compare the marginal consistency of FCI, given it’s parents, and the Faithfulness Condition (FC), order-independent FCI and majority rule FCI in simulated which states that all the conditional independencies that data. hold in P stem from G and the CMC. MAGs are mixed graphs, i.e. they can have both directed 3 MARGINAL CAUSAL CONSISTENCY and bi-directed edges. A directed edge X → Y indicates that X causes Y , while a bi-directed edge X ↔ Y indi- IN PAGS cates that X and Y are confounded. Two nodes can be con- nected with only one type of edge, and ancestry has prece- We now define the problem of marginal causal consistency dence over confounding: if X causes Y and the two are of a constraint-based algorithm. Intuitively, we are inter- also confounded, only the directed edge X → Y is present ested in how much marginalizing out variables and rerun- in the MAG. MAGs can also be used to represent selection ning FCI (or a variation) preserves causal relationships. bias via undirected edges. For this paper, we assume no To help define the problem, we use AnP to be the set of all selection bias and only consider MAGs with directed and ancestral relationships that hold in the Markov equivalence bi-directed edges. class [P] entailed by P. Notice that this set may be differ- A path is called uncovered if every consecutive triple on the ent than the set of ancestral relationships T RP that can be path is unshielded [15]. Also, a path is potentially directed identified directly from P by taking the transitive closure if it can be oriented into a directed path by changing the of the directed edges: It may include additional pairs that are connected by a different causal path in every member path p2 = hX, V, V2 , . . . Vm , Y i of length m with U 6= V of [P], so that there is no fully oriented path in P. Fig- must also exist. Assume there is no such path. If for every ure 1 shows an example where an invariant causal relation- MAG in [P], X → U , then p1 is a directed path in P (since ship does not correspond to a single directed path in the every triple on the path is a definite non collider), and X ∈ PAG: while T RP for the PAG of Figure 1a is the empty AnP (Y ), which is a contradiction. Thus, X U in P, set, AnP = {(X, Y )}. and there exists a MAG in [P] where X U and p1 is not a directed path. If p1 , is the only p.d. path from X to Y in Identifying AnP is not trivial. [1] use a method to im- P, then X 99K Y 6∈ [P], which is a contradiction. Thus, plicitly enumerate all MAGs in a Markov equivalence class ∃p2 = hX, V, V2 , . . . Vm , Y i with U 6= V . represented by P. The same techniques can be used to identify AnP , by identifying the invariant ancestral rela- Next we show that hV, X, U i form an unshielded definite tions present in all such MAGs1 . non-collider in P: hV, X, U i is not a collider in P (trivially since p1 and p2 are However, in this work we show that all causal relationships potentially directed paths). We must also show that U, V that are invariant in P but are not in T RP correspond to are not adjacent. If U V in , then there exists a MAG a specific pattern, illustrated in Figure 1. The pattern can in [P] such that U → X in P, which is inconsistent since be easily found in P to identify all additional relationships p1 is a p.d. path.  that are causal in P but not present in T RP . Theorem 3.1 proves soundness and completeness of this rule. Apart from (positive) causal relations, a PAG can also have negative and ambiguous causal relations. X and Y share Theorem 3.1 Let P be the PAG over a set of variables V a negative causal relationship in P if X cannot be a cause representing the Markov equivalence class of MAGs [P]. of Y in P. This happens if (X, Y ) 6∈ AnP , and there can Then, if (X, Y ) 6∈ T RP , X 99K Y ∈ AnP if and only if be no directed path from X to Y in P (i.e. X Y in ∃U, V, U 6= V such that P, or there is no potentially directed path from X to Y in P). Naturally, this does not mean that Y causes X. An 1. hX, U, . . . Y i and hX, V, . . . Y i form uncovered po- ambiguous causal relationship occurs when a relationship tentially directed paths and is neither positive nor negative. 2. hU, X, V i is an unshielded definite non collider in P. We use the notation N AnP to denote the set of negative causal relationships in a PAG P. The conditions mentioned Proof (⇐) Let U, V be distinct variables such that above for membership in N AnP can easily be tested in P: hX, U, . . . Y i and hX, V, . . . Y i are uncovered p.d paths, To rule out the existence of a possible directed path, only with hU, X, V i an unshielded definite non-collider in P. uncovered possibly directed paths need to be checked [15]. Let U X in some MAG in [P], where ? is used as Notice that the set of ancestral and non-ancestral relations a meta-symbol denoting any plausible orientation. Then in a PAG are by no means complementary, since ambiguous X → V (since U, X, V form a definite non collider) and relations also exist. V 99K Y (since every triple on the path is a definite non We are interested in constraint-based algorithms’ consis- collider). Else if U ← X, then U 99K Y (since every tency to marginal ancestral sets: Let D be a data set over triple on the path is a definite non collider). Thus, (X, Y ) a set of possibly confounded variables V, and let G be the in AnP . PAG output of a sound and complete constraint-based algo- (⇒) We will first prove that if there is a directed path from rithm. P defines an ancestral set AnP and a non-ancestral X to Y in all MAGs in [P], and there is no such directed set N AnP . Also, let P[L , L ⊂ V, be the marginal PAG path in P then there exist at least two potentially directed obtained using the same algorithm (with the same hyper- paths (and thus, two uncovered potentially directed paths) parameters) on the restriction of data set D on variables in in P: L. Each marginal PAG P[L defines a marginal ancestral set AnP[L and a marginal negative ancestral set N AnP[L . If there is a directed path from X to Y in every MAG in [P], then there is a p.d. path from X to Y in P. By Lemma Assuming no statistical errors, any marginal ancestral set is B.1 in [15] there is an uncovered p.d. path from X to Y a subset of the ancestral set. Thus, there is no pair (X, Y ) in P. Let p1 = hX, U, U2 , . . . Un , Y i, be such a path of present in any AnP [L that is not present in AnP . On the length n. We want to show that another uncovered p.d. contrary, some ancestral relationships that can be identified in the full data set may not be identifiable in a marginal, 1 Although theoretically possible, the algorithm assumes that due to (a) the fact that some variables are not included in the input PAG and separating sets are correct, that is, that the de- the marginal and (b) the loss of information from exclud- pendencies and independencies encoded in the PAG are the ones implied by the separating sets. In practice however, as suggested ing variables. Therefore, members of AnP are possibly by anecdotal experiments, FCI and its variants rarely produce not present in some AnP[L . Similarly, any marginal non- such output. ancestral set is a subset of the non-ancestral set, while some (a) (b) (c) (d) Figure 1: (a) Example of a PAG in which each Markov equivalent MAG contains a directed path from X to Y . (b,c) Orienting the edge between U and X as U X creates the directed path X → V → Y , while orienting it as U ← X creates the directed path X → U → Y . (d) The general pattern as described in Theorem 3.1. Dashed edges correspond to potentially directed paths. E F F A B C D A B C D (a) (b) Figure 2: An example of ancestral and non-ancestral sets for a PAG P and a marginal PAG P[E . Ancestral and non-ancestral sets: AnP = {(B, {C, D}), (C, D)}, N AnP = {(A, {E, F }), (B, {A, E, F }), (C, {A, B, E, F }), (D, {A, B, C, E, F }), (E, {A, F }), (F, {A, B, E})}. The corresponding marginal sets are AnP[E = {(C, D)}, N AnP[E = {(A, F ), (B, F ), (C, {A, B, F }), (D, {A, B, C, F }), (F, {A, B})} members of N AnP may not be present in some N AnP[L . Marginal PAG P[L In addition, the PAGs should not encode any conflicting causal information: Members of AnP cannot also be mem- Ancestral Non ancestral bers of N AnP[L , and members of N AnP cannot also be Ancestral p c members of AnP[L . PAG P Non ancestral d n For finite samples, however, statistical errors propagate in both the skeleton identification and the orientation steps of Ambiguous e f constraint-based algorithms, and can result in conflicting |AnP[L | |N AnP[L | orientations. Since the algorithms are run on marginal ver- sions of the same data set, conflicts in the marginal ances- Table 1: Confusion matrix for (non) ancestral relationships tral sets can be viewed as a measure of robustness of the for a marginal of P. algorithm to statistical errors. We are therefore interested in comparing the marginal causal consistency of different FCI variations. Marginal consistency can also been as a Marginal PAG P[L measure of the sensitivity of an algorithm to the specific Ancestral Non ancestral choice of observed variables. Ancestral 2 0 For a given algorithm, we are interested in how often pos- itive and negative ancestral relationships entailed by the PAG P Non ancestral 0 11 marginal outputs P[L agree or disagree with the output P Ambiguous 0 0 over the full set of variables. This information is entailed in the confusion matrix described in Table 1. 2 11 Some remarks: Table 2: Confusion matrix for the PAGs in Figure 2. • For perfect statistical knowledge, c = d = e = f = 0, Due to statistical errors, the output of the FCI algorithm p = |AnP[L |, n = |N AnP[L |. Notice that an- is also not necessarily a valid PAG. A very common prob- cestral and non-ancestral relationships identified in a lem is the creation of cycles or almost cycles. To tackle marginal PAG are not expected to be ambiguous in the this problem, we added the option to aggressively prevent PAG over the complete set of variables V. cycles, as implemented in TETRAD [13]. This functional- ity is applied in the phase of orienting edges, where every • The sum p + d + c + n + e + f is different (smaller) attempted orientation is checked for creating an (almost) than the number of possible causal relationships in cycle. If that is the case, then that specific orientation is not the marginal data set, since P[L also has ambiguous performed, and the orientation rules move on to the next edges. We do not take these edges into account here, possible orientation. We have to note that we do not use because they are not an indication of consistency or the orientation rules that aim at recovering undirected edges inconsistency of the marginal. Edges that are (non) (selection bias). ancestral in P can often be ambiguous in P[L , even if the endpoints are present in the marginal. The results of the experiments are shown in Figures 3 and 4. The ratios were computed by summing over all numer- ators and dividing by the sum of all denominators (for ex- An example of (non) ancestral relations in a PAG and a cor- ample, for |AnpP[ | we summed over all correctly identified responding marginal PAG is shown in Figure 2. The corre- L ancestral relations p and divided by the total number of pre- sponding confusion matrix is shown in Table 2 (assuming dicted ancestral relations |AnP[L | over all marginals and perfect statistical knowledge). datasets). This guarantees that each bar sums to one and Notice however, that the confusion matrix in Table 1 only avoids divisions by zero, in case no relation is predicted. measures the robustness of an algorithm in terms of causal For all algorithms, the consistency drops for both denser predictions. Other characteristics and uncertainties of FCI networks and smaller marginals. For networks with den- outputs are not taken into account. sity 0.1 all algorithms have more than 50% consistent pre- dictions for both 18 and 15-variable marginals. For denser 3.1 Marginal consistency of FCI variations in networks however, the performance of iFCI and FCI drops simulated data. below 0.2. mFCI has the largest ratio of consistent rela- tionships, and retains a consistency ratio of 0.60 for 18- We used simulated data to examine the internal consistency variable marginals. However, its performance drops to 0.38 of the FCI algorithm. We used the pcalg package [8] to for 15-variable marginals. For all algorithms, the majority simulate random DAGs with 20 variables. We tried two of causal relationships are found non-ancestral in P, and a different graph densities, 0.1 and 0.2 (corresponding to 1.9 small ratio is found ambiguous. It is worth noting, how- and 3.8 neighbors per variable on average, respectively). ever, that the algorithms typically output very few positive We use linear Gaussian parametrization, with coefficient causal relationships. of each model sampled using the default parameters in the pcalg package. For each graph density we generated 50 Non-ancestral causal relationships on the other hand are DAGs and for each such network we simulated data with much more consistent, as shown in Figures 5 and 6. The 1000 samples. majority of non-ancestral relationships are consistent (blue bars, |AnnP[ | ). Overall, mFCI has the highest ratio of con- We used different variations of the FCI algorithm to retrieve L sistent negative relationships. a causal network with significance threshold α = 0.05 and unconstrained maximum conditioning set. We also created Overall, results show that (a) the performance of constraint- 100 randomly selected marginal data sets of size 18 and based algorithms heavily depends on the graph density, par- 15 for each DAG, and ran all FCI variations with the same ticularly for algorithms that are less conservative, and thus parameters. more sensitive to error propagation and (b) The causal pre- dictions of the algorithms are sensitive to marginalization. The variations of the FCI algorithm we used are the follow- Even for mFCI, removing two out of 20 variables results in ing: order independent FCI (iFCI), FC, FCI with majority 30-40% relations that are not validated in the marginal data rule (mFCI). We did not include the conservative FCI, as set. the existence of ambiguous triples results in outputs that are not complete PAGs. Hence, Theorem 3.1 can not be ap- plied. Instead, we included the majority rule FCI, which is inspired by conservative FCI, but in which the triplet’s ori- entation is dictated by a majority vote on the corresponding conditional independence tests. To guarantee that the out- put is a valid PAG, ambiguous triplets (occurring in a tie vote) are marked as definite non-colliders. (a) (b) Figure 3: Barplots showing the ratios of positive relations in (a) 18-variable marginals, (b) 15-variable marginals in the graphs of density 0.1. Blue: Consistent ancestral relationships. Red: Inconsistent ancestral relationships (found non- ancestral in P). Pink: Inconsistent ancestral relationships (found ambiguous in P). (a) (b) Figure 4: Barplots showing the ratios of positive relations in (a) 18-variable marginals, (b) 15-variable marginals in the graphs of density 0.2. Blue: Consistent ancestral relationships. Red: Inconsistent ancestral relationships (found non- ancestral in P). Pink: Inconsistent ancestral relationships (found ambiguous in P). (a) (b) Figure 5: Barplots showing the ratios of negative relations in (a) 18-variable marginals, (b) 15-variable marginals in the graphs of density 0.1. Blue: Consistent non-ancestral relationships. Red: Inconsistent non-ancestral relationships (found ancestral in P). Pink: Inconsistent non-ancestral relationships (found ambiguous in P). The majority of non-causal predictions are consistent (blue). 4 RANKING CAUSAL RELATIONSHIPS they are not consistent with the output of the algorithm on BASED ON MARGINAL CAUSAL the whole data set. CONSISTENCY 4.1 Related Work Calculating the marginal ancestral sets indicates a way of Alternative approaches for ranking causal ancestry rela- ranking pairwise causal relationships according to their fre- tions can be categorized into (a) Bayesian model averaging quency of appearance in AnP[L for different marginals. methods and (b) resampling-based methods. The idea is that causal relationships that frequently appear in marginal PAGs will tend to be true more often, even if Bayesian model averaging methods compute the posterior (a) (b) Figure 6: Barplots showing the ratios of negative relations in (a) 18-variable marginals, (b) 15-variable marginals in the graphs of density 0.2. Blue: Consistent non-ancestral relationships. Red: Inconsistent ancestral relationships (found ancestral in P). Pink: Inconsistent non-ancestral relationships (found ambiguous in P). The majority of non-causal predictions are consistent (blue). (a) (b) (c) Figure 7: ROC curves of the variations of the FCI algorithm in graphs with originally 20 variables and density 0.1. (a) 18-variable marginals (b) 15-variable marginals (c) bootstrapping. (a) (b) (c) Figure 8: ROC curves of the variations of the FCI algorithm in graphs with originally 20 variables and density 0.2. (a) 18-variable marginals (b) 15-variable marginals (c) bootstrapping. probability of network features by averaging over network Resampling-based methods repeatedly apply a learning structures. This can be either approximated using MCMC method on resampled datasets and estimate the confidence [6] or by using exact methods [10, 2]. Apart from their high of network features as the proportion of induced networks computational cost, such methods are not very general and in which they appeared. Such methods include parametric are only applicable to cases where network scores can be and non-parametric bootstraps [5], as well as stability se- computed. Although various scores exist for Bayesian net- lection [9]. The main advantage is that they are general and work structures [7], scores for MAGs have only recently thus also applicable in the case of PAGs. been explored [12] and require more expensive fitting pro- In this section we compare our approach with the non- cedures [4]. parametric bootstrap by [5]. sible unmeasured variables. It is also possible that confi- dence metrics computed based on marginals could be more 4.2 Experiments helpful in situations where you have ”outlier” variables that do not satisfy the algorithm’s assumptions (e.g. they create We produced rankings for all possible pairs of variables cycles, do not satisfy the distributional assumptions of con- according to (a) the frequency of appearance in the cor- ditional independence tests etc). In such cases, it is pos- responding marginal ancestral sets AnP[L and (b) the fre- sible that taking random marginals could improve the al- quency of appearance in AnP for FCI outputs on differ- gorithms’ performance (similar to the way bootstrapping is ent bootstrap samples of the initial data set D. Since we beneficial when you have outlier samples). only use 100 different marginals per iteration, for each pair We must also point out that PAGs encode much richer in- (X, Y ), the frequency of appearance in AnP[L was divided formation than the causal ancestry relations examined here. by the number of data sets in which X and Y are both Exploring different types of marginal consistency is also an present (i.e. we excluded from the calculation marginals area of interest. in which the causal ancestry could not be found). For ev- ery pair of variables the true label is 1 if the relationship Acknowledgments is ancestral in the DAG the data was sampled from, and 0 otherwise. We would like to thank the anonymous reviewers for their Based on these ranking and the status of the relationship comments, particularly reviewer 2 who helped us identify in the ground truth network (used to simulate the data), we a problem in the simulations in the submitted version of calculated ROC curves and the corresponding AUCs. Fig- this paper. This work was funded by European Research ures 7(a,b) and 8(a,b) show the performance of the algo- Council (ERC) and is part of the CAUSALPATH - Next rithms using marginals of 15 and 18 variables for densities Generation Causal Analysis project, No 617393. 0.1 and 0.2 respectively. For all settings, ROC curves are significantly better than random guessing. Again, the re- References sults are better for sparser networks, where AUC ranges from 82.25-91.84%. The AUCs drop for denser networks, [1] G. Borboudakis and I. Tsamardinos. Incorporat- ranging from 60.78-83.57%. mFCI has the highest AUCs ing causal prior knowledge as path-constraints in for all settings. Bayesian networks and maximal ancestral graphs. Proceedings of the 29th International Conference on We compared our method against bootstrapping. For each Machine Learning, pages 1799–1806, 2012. DAG, 100 data sets with 1000 samples were resampled with replacement from the original data set. FCI was ran [2] Y. Chen, L. Meng, and J. Tian. Exact bayesian learn- on the new data sets, each time calculating the ancestral set ing of ancestor relations in bayesian networks. In Pro- AnP on the output PAG. Each order pair of variables was ceedings of the Eighteenth International Conference then ranked according to the frequency of appearance in on Artificial Intelligence and Statistics, 2015. the ancestral sets. The results are shown in figures 7(c) and [3] D. Colombo and M. H. Maathuis. Order-independent 8(c), where we can see that bootstrapping outperforms the constraint-based causal structure learning. Journal of marginal-based ranking in all cases. Again, the majority Machine Learning Research, 15(1):3741–3782, 2014. rule FCI outperforms the rest of the algorithms. [4] M. Drton, M. Eichler, and T. S. Richardson. Comput- ing maximum likelihood estimates in recursive linear 5 DISCUSSION models with correlated errors. Journal of Machine Learning Research, 10:2329–2348, 2009. We defined and examined the problem of marginal causal [5] N. Friedman, M. Goldszmidt, and A. Wyner. Data consistency in constraint-based causal discovery. We pre- analysis with bayesian networks: A bootstrap ap- sented a way to identify all invariant causal relationships proach. In Proceedings of the Fifteenth Conference entailed in a complete PAG. on Uncertainty in Artificial Intelligence, 1999. We examined how well causal and non-causal relationships [6] N. Friedman and D. Koller. Being Bayesian about predicted by three different variations of FCI are preserved network structure. In Proceedings of the Sixteenth when you marginalize out variables. Results indicate that Conference Annual Conference on Uncertainty in Ar- constraint-based learning methods for causal networks are tificial Intelligence, 2000. sensitive to the selected marginal, particularly for dense [7] D. Heckerman, D. Geiger, and D. M. Chickering. networks. Learning Bayesian networks: The combination of The results are important because in most real-life applica- knowledge and statistical data. Machine Learning, tions researchers may have limited knowledge on the pos- 20(3):197–243, 1995. [8] M. Kalisch, M. Mächler, D. Colombo, M. H. Maathuis, and P. Bühlmann. Causal inference using graphical models with the r package pcalg. Journal of Statistical Software, 47(11):1–26, 2012. [9] N. Meinshausen and P. Bühlmann. Stability selec- tion. Journal of the Royal Statistical Society, Series B, 72:417–473, 2010. [10] P. Parviainen and M. Koivisto. Ancestor relations in the presence of unobserved variables. In Machine Learning and Knowledge Discovery in Databases, volume 6912 of Lecture Notes in Computer Science, pages 581–596. Springer, 2011. [11] J. Ramsey, J. Zhang, and P. Spirtes. Adjacency- faithfulness and conservative causal inference. In Proceedings of the Twenty-Second Conference on Un- certainty in Artificial Intelligence, 2006. [12] T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30(4):962–1030, 2002. [13] R. Scheines, P. Spirtes, C. Glymour, C. Meek, and T. Richardson. The tetrad project: Constraint based aids to causal model specification. Multivariate Be- havioral Research, 33(1):65–117, 1998. [14] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, Cambridge, MA, 2nd edition, 2000. [15] J. Zhang. On the completeness of orientation rules for causal discovery in the presence of latent con- founders and selection bias. Artificial Intelligence, 172(16-17):1873–1896, 2008.