Pairwise Cluster Comparison for Learning Latent Variable Models

Nuaman Asbeh and Boaz Lerner
Department of Industrial Engineering & Management
Ben-Gurion University of the Negev, Israel
asbeh@post.bgu.ac.il; boaz@bgu.ac.il

Abstract

Learning a latent variable model (LVM) exploits values of the measured variables as manifested in the data for causal discovery. Because the challenge in learning an LVM is similar to that faced in unsupervised learning, where the number of clusters and the classes that are represented by these clusters are unknown, we link causal discovery and clustering. We propose the concept of pairwise cluster comparison (PCC), by which clusters of data points are compared pairwise to associate changes in the observed variables with changes in their ancestor latent variables and thereby to reveal these latent variables and their causal paths of influence, and the learning PCC (LPCC) algorithm that identifies PCCs and uses them to learn an LVM. LPCC is not limited to linear or latent-tree models. It returns a pattern of the true graph or the true graph itself if the graph has serial connections or not, respectively. The complete theoretical foundation to PCC, the LPCC algorithm, and its experimental evaluation are given in [Asbeh and Lerner, 2016a,b], whereas, here, we only introduce and promote them. The LPCC code and evaluation results are available online.

1 LVM LEARNING

Latent variables cannot usually be observed directly from data but only inferred from observed variables (indicators) [Spirtes, 2013]. Concepts such as "life quality," "mental states," and "psychological stress" play a key role in scientific models, and yet are latent [Klee, 1997].

Latent variable models (LVMs) represent latent variables and causal relations among them that are manifested in the observed variables. These models are vital in, e.g., economics, social sciences, natural language processing, and machine learning and have become the focus of many studies. By aggregating observed variables into a few latent variables, each of which represents a "concept" explaining some aspects of the domain, LVMs reduce dimensionality and facilitate interpretation.

Learning an LVM exploits values of the measured variables to infer causal relationships among latent variables and to predict these variables' values. Although statistical methods, such as factor analysis, effectively reduce dimensionality and may fit the data reasonably well, the resulting models might not have any correspondence to real causal mechanisms [Silva et al., 2006]. Learning Bayesian networks (BNs) focuses on causal relations among observed variables, but the detection of latent variables and their interrelations among themselves and with the observed variables has received little attention. Learning an LVM using Inductive Causation* (IC*) [Pearl, 2000, Pearl and Verma, 1991] and Fast Causal Inference (FCI) [Spirtes et al., 2000] returns partial ancestral graphs, which indicate for each link whether it is a (potential) manifestation of a hidden common cause for the two linked variables. The FindHidden algorithm [Elidan et al., 2000] detects substructures that suggest the presence of latents in the form of dense subnetworks, but it cannot always find a pure measurement sub-model [Silva et al., 2006]. Also, the recovery of latent trees of binary and Gaussian variables has been suggested [Pearl, 2000]. Hierarchical latent class (HLC) models, which are rooted trees where the leaf nodes are observed while all other nodes are latent, have been proposed [Zhang, 2004]. Two greedy algorithms are suggested [Harmeling and Williams, 2011] to expedite learning of both the structure of a binary HLC and the cardinalities of the latents. Latent-tree models are also used to speed approximate inference in BNs, trading the approximation accuracy with inferential complexity [Wang et al., 2008].
Models in which multiple latents may have multiple indicators (observed children), i.e., multiple indicator models (MIMs) [Bartholomew et al., 2002], are an important subclass of structural equation models (SEM), which are widely used in applied and social sciences to analyze causal relations [Pearl, 2000, Shimizu et al., 2011]. For these models, and others that are not tree-constrained, most of the mentioned algorithms may lead to unsatisfactory results. An algorithm that fills the gap between learning latent-tree models and learning MIMs is BuildPureClusters [BPC; Silva et al., 2006]. It searches for the set of MIMs (an equivalence class) that best matches the set of vanishing tetrad differences [Scheines et al., 1995], but it is limited to linear models [Spirtes, 2013].

We target the goal of Silva et al. [2006], but concentrate on the discrete case. Interestingly, the same difficulty in learning MIMs is also faced in unsupervised learning that confronts questions such as: (1) How many clusters are there in the observed data? and (2) Which classes do the clusters really represent? Due to this similarity, we link learning an LVM and clustering and propose a concept and an algorithm that combine the two disciplines. According to the pairwise cluster comparison (PCC) concept, we compare pairwise clusters of data points representing instantiations of the observed variables to identify those pairs of clusters that exhibit major changes in the observed variables due to changes in their ancestor latent variables. Changes in a latent variable that are manifested in changes in the observed variables reveal this latent variable and its causal paths of influence. The learning PCC (LPCC) algorithm uses PCCs to learn an LVM by identifying latent variables – exogenous and endogenous (the latter may be either colliders or non-colliders) – and their causal interrelationships, as well as their children (latent and observed variables) and causal paths from latent variables to observed variables.

The complete theoretical foundation to PCC is given in Asbeh and Lerner [2016a], and the description of the LPCC algorithm and its experimental evaluation are provided in Asbeh and Lerner [2016b], whereas, here, we only briefly introduce, motivate, and promote them. In the following, we give preliminaries to LVM learning. In Section 2, we introduce and motivate PCC, and in Section 3, we provide an overview of the LPCC algorithm. In Section 4, we experimentally compare LPCC to other learning algorithms using synthetic and real-world databases. Finally, in Section 5, we summarize the contribution of LPCC.

First, we present two assumptions that LPCC makes. A1: The underlying model is a Bayesian network, BN=⟨G, Θ⟩, encoding a discrete joint probability distribution P for a set of random variables V=L∪O, where G=⟨V, E⟩ is a directed acyclic graph whose nodes V correspond to latents L and observed variables O, E is the set of edges between nodes in G, and Θ is the set of parameters; and A2: No observed variable in O is an ancestor of any latent variable in L (the measurement assumption [Spirtes et al., 2000]). We define [following Silva et al., 2006]: D1: A model satisfying A1 and A2 is a latent variable model; D2: Given an LVM G with a variable set V, the subgraph containing all variables in V and all and only those edges directed into variables in O is called the measurement model of G; D3: Given an LVM G, the subgraph containing all and only G's latent nodes and their respective edges is called the structural model of G; and D4: A pure measurement model is a measurement model in which each observed variable has only one latent parent and no observed parent. Then, we also assume that: A3: The measurement model of G is pure; and A4: Each latent in the true model G has at least two observed children and may have latent parents.
As Silva et al. [2006] pointed out, factor analysis, principal component analysis, and regression analysis adapted to learning LVMs are well understood, but have not been proven under any general assumptions to learn the true causal LVM, calling for better learning methods. Causal structure discovery – learning the number of latent variables, their interconnections, and connections to the observed variables, as well as the interconnections among the observed variables – is difficult and requires making assumptions about the problem. By assuming that the true model manifests local influence of each latent variable on at least a small number of observed variables, Silva et al. [2006] show that learning the complete Markov equivalence class of MIMs is feasible. Similarly, we assume the true model is pure (A3). When it is pure, LPCC will identify it correctly (or find its pattern that represents the equivalence class of the true graph), and when it is not, LPCC will learn a pure submodel of the true model, in both cases using only two indicators per latent (compared to three indicators per latent that are required by BPC [Silva et al., 2006]).

Note the tradeoff between the structural and parametric assumptions that an algorithm for learning an LVM has to make; the fewer parametric assumptions it makes, the more structural assumptions it has to make and vice versa. While BPC needs to make a parametric assumption about the linearity of the model, and the latent-tree algorithms [Zhang, 2004, Harmeling and Williams, 2011, Wang et al., 2008] restrict the learned structure to a tree,¹ LPCC assumes that the model structure is pure, and A5: A latent collider has no latent descendants.

¹LPCC is not limited to a tree because it allows latent variables to be colliders.

2 PRELIMINARIES TO PCC

Figure 1 sketches a range of pure measurement models, from basic to more complex. G1 is a basic MIM of two unconnected latents, and G2 shows a structural model having a latent collider. Note that such an LVM cannot be learned by latent-tree algorithms such as in Zhang [2004]. G3 and G4 demonstrate serial and diverging structural models, respectively, that together with G2 cover the three basic structural models. G5 and G6 manifest more complex structural models comprising a latent collider and a combination of serial and diverging connections. As the structural model becomes more complicated, the learning task becomes more challenging; hence, G1–G6 present a spectrum of challenges.²

²In Section 4, we compare LPCC with BPC and exploratory factor analysis (EFA) using these six LVMs. Since BPC requires three indicators per latent to identify a latent, we determined from the beginning three indicators per latent for all true models to recover. Nevertheless, in Section 4, we evaluate the learning algorithms for increasing numbers of indicators.

Section 2.1 builds the infrastructure to PCC that relies on understanding the influence of the exogenous latent variables on the observed variables. This influence is divided into major and minor effects that are introduced in Section 2.2. Section 2.3 links this structural influence to data clustering and introduces the pairwise cluster comparison concept for learning an LVM.

2.1 INFLUENCE OF EXOGENOUS LATENTS ON OBSERVED VARIABLES

We distinguish between observed (O) and latent (L) variables and between exogenous (EX) and endogenous (EN) variables. EX variables have zero in-degree, and are autonomous and unaffected by the values of the other variables (e.g., L1 in G6 in Figure 1), whereas EN are all non-exogenous variables in G (e.g., L2 in G2 and G6, and X1 in all graphs in Figure 1). We identify three types of variables: (1) Exogenous latents, EX⊂(L∩NC) [all exogenous variables are latent non-colliders (NC)]; (2) Endogenous latents, EL⊂(L∩EN), which are divided into latent colliders C⊂EL (e.g., L2 in G5) and latent non-colliders S⊂(EL∩NC) (e.g., L3 in G6), thus NC=(EX∪S); and (3) Observed variables, O⊂EN, which are always endogenous and childless, and are divided into children of exogenous latents OEX⊂O (e.g., X9 in G2), children of latent colliders OC⊂O (e.g., X5 in G2), and children of endogenous latent non-colliders OS⊂O (e.g., X4 in G3). We denote value configurations of EX, EN (when we do not know whether the endogenous variables are latent or observed), EL, C, NC (when we do not know whether the non-collider variables are exogenous or endogenous), S, O, OEX, OC, and OS by ex, en, el, c, nc, s, o, oex, oc, and os, respectively.

Since the underlying model is a BN, the joint probability over V, which is represented by the BN, is factored according to the local Markov assumption for G to a product of products of 1) prior probabilities for the exogenous variables (EX), 2) probabilities of endogenous (collider and non-collider) latent variables (EL) conditioned on their latent parents, and 3) probabilities of observed variables (O) conditioned on their latent parents.
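For concreteness, this three-way factorization can be written as follows (the notation pa(·) for a variable's configuration of latent parents is ours, not the paper's):

    P(\mathbf{V}) \;=\; \prod_{EX_i \in \mathbf{EX}} P(ex_i)\; \prod_{EL_j \in \mathbf{EL}} P\bigl(el_j \mid \mathrm{pa}(EL_j)\bigr)\; \prod_{O_k \in \mathbf{O}} P\bigl(o_k \mid \mathrm{pa}(O_k)\bigr).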
To demonstrate the influence of exogenous (latent) variables on observed variables and its relation to learning an LVM, we show that changes in values of the observed variables are due to changes in values of the exogenous variables, and thus the identification of the former indicates the existence of the latter. To do that, we analyze the propagation of influence along directed paths [Pearl, 1988] connecting both variables, first among latents, remembering that the paths may contain latent colliders and latent non-colliders, and then along paths ending in their sinks (i.e., the observed variables).

We prove in Asbeh and Lerner [2016a] that a) a latent non-collider has only a single exogenous latent ancestor, and there is only a single directed path between them; and b) a latent collider is connected to a set of exogenous latent ancestors via a set of directed paths. By separating the influence of exogenous variables into distinct paths of influence, we can determine the joint probability over V due to value assignment ex to the exogenous set EX using this assignment and the BN conditional probabilities.

2.2 MAJOR AND MINOR EFFECTS/VALUES

After analyzing the structural influences (paths of hierarchies) of the latents on the observed variables, we now complement this analysis with the parametric influences, which we divide into major and minor effects.

We define a local effect on an endogenous variable EN as the influence of a configuration of EN's direct latent parents on any of EN's values. We distinguish between: 1) a major local effect, which is the largest local effect on EN that is identified by the maximal conditional probability of a specific value en of EN given a configuration pa of EN's latent parents; and 2) a minor local effect, which is any non-major local effect on EN that is identified by a conditional probability of any other value of EN given pa that is smaller than that of the major effect. Accordingly, we define: 1) a major local value as the value en corresponding to the major effect; and 2) a minor local value as an en corresponding to any minor local effect.
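In symbols, and writing Pa(EN) for EN's set of direct latent parents (our notation), the major local value for a parent configuration pa can be expressed as

    en^{*}(\mathbf{pa}) \;=\; \arg\max_{en} P\bigl(EN = en \mid \mathbf{Pa}(EN) = \mathbf{pa}\bigr),

and every other value of EN is a minor local value for pa.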
We assume the most probable explanation [Pearl, 1988], which is that for every endogenous variable and every configuration of its parents, there exists a certain value that is major, and prove in Asbeh and Lerner [2016a] that this value is unique for every parent configuration.

Figure 1: MIMs with Structural Models of Different Complexities Challenging a Learning Algorithm Differently.

Second, assuming that for every endogenous variable, different parent values (or parent value configurations for a latent collider) yield different major values, together with using the Markov property and BN parameters, we can aggregate all local influences and generalize local effects on specific endogenous variables to an effect on all endogenous variables in the graph. Consequently, a major effect (MAE) is the largest effect of ex on EN and a minor effect (MIE) is any non-MAE effect of ex on EN. Also, a major value configuration (MAV) is the configuration en of EN corresponding to MAE (i.e., the most probable en due to ex), and a minor value configuration is a configuration en corresponding to any MIE. In a MAV, each variable in EN takes on the major local value, and in a MIE, at least one EN takes on a minor local value.
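One compact way to write the MAV induced by an exogenous configuration ex (our formalization of "the most probable en due to ex"; not the paper's notation) is

    \mathbf{en}^{*}(\mathbf{ex}) \;=\; \arg\max_{\mathbf{en}} P\bigl(\mathbf{EN} = \mathbf{en} \mid \mathbf{EX} = \mathbf{ex}\bigr),

with every other configuration of EN being a minor value configuration for ex.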
Third, we represent the influence on a subset of the endogenous variables of the subset of exogenous variables that impact these endogenous variables. This partial representation of MAE enables LPCC to recover the relationships between exogenous ancestors and only the descendants that are affected by them. We separately analyze the effect of each exogenous variable on each observed variable for which the exogenous is an ancestor and on all the latent variables along the path connecting them. We show in Asbeh and Lerner [2016a] the existence and uniqueness of the value a latent non-collider and its observed child get under the influence of an exogenous ancestor; one and only one value of the latent non-collider (observed child) changes with a change in the value of the exogenous. We also show that a latent collider and its observed child, both descendants of a set of exogenous variables, change their values between any two major configurations only if at least one of the exogenous variables has changed its value in the corresponding two configurations.

2.3 PCC BY DATA CLUSTERING

Practically, we use observational data that were generated from an unknown LVM and measured over the observed variables. We define an observed value configuration, observed major value configuration, and observed minor value configuration due to ex as the parts of en, MAV, and a minor value configuration, respectively, that correspond to the observed variables. We show in Asbeh and Lerner [2016a] that there is only a single observed major value configuration for each exogenous configuration ex of EX, and there are different observed major value configurations for different exogenous configurations. But, due to the probabilistic nature of BNs, each observed value configuration due to ex may be represented by several data points. Clustering these data points may produce several clusters for each ex, and each cluster corresponds to another observed value configuration. However, one and only one of the clusters corresponds to each of the observed major value configurations for a specific ex, whereas the other clusters correspond to observed minor value configurations.

We define the single cluster that corresponds to the observed major value configuration, and thus also represents the major effect due to configuration ex of EX, as the major cluster for ex, and all the clusters that correspond to the observed minor value configurations due to minor effects as minor clusters. However, we distinguish between different types of minor effects/clusters. A k-order minor effect is a minor effect in which exactly k endogenous variables in EN correspond to minor local effects. An en corresponding to a k-order minor effect is a k-order minor value configuration. In addition, minor clusters that correspond to k-order minor effects are k-order minor clusters. The set of all major clusters (corresponding to all observed major value configurations) reflects the effect of all possible exs, and thus the number of major clusters is expected to be equal to the number of EX configurations. Therefore, the identification of all major clusters is a key to the discovery of exogenous variables and their causal interrelations. For this purpose, we introduce the concept of pairwise cluster comparison (PCC). PCC measures the differences between two clusters, each of which represents the response of the LVM to another ex.

Pairwise cluster comparison is a procedure by which pairs of clusters are compared, e.g., through a comparison of their centroids. The result of PCC between a pair of centroids of dimension |O|, where O is the set of observed variables, is a binary vector of size |O| in which each element is 1 or 0 depending, respectively, on whether or not there is a difference between the corresponding elements in the compared centroids. When PCC is between clusters that represent observed major value configurations (i.e., PCC between major clusters), an element of 1 identifies an observed variable that has changed its value between the compared clusters due to a change in ex. Thus, the 1s in a major–major PCC provide evidence of causal relationships between EX and O. Practically, LPCC always identifies all observed variables that are represented by 1s together in all PCCs as the observed descendants of the same exogenous variable (Section 3.1). However, due to the probabilistic nature of the BN and the existence of endogenous latents (mediating the connections from EX to O), some of the clusters are k-order minor clusters (of different orders), representing k-order minor configurations/effects. Minor clusters are more difficult to identify than major clusters because the latter reflect the major effects of EX on EN and, therefore, are considerably more populated by data points than the former. Nevertheless, minor clusters are important in causal discovery by LPCC, even though a major–minor PCC cannot tell the effect of EX on EN because an observed variable in two compared (major and minor) clusters does not necessarily change its value as a result of a change in ex. Their importance stems from the fact that a major cluster cannot indicate (when compared with another major cluster) the existence of minor values. On the contrary, a PCC between major and minor clusters shows (through the number of 1s) the number of minor values represented in the minor cluster, and this is exploited by LPCC for identifying the endogenous latents and the interrelations among them (Section 3.4). That is, PCC is the source for identifying causal relationships in the unknown LVM; major–major PCCs are used for identifying the exogenous variables and their descendants, and major–minor PCCs are used for identifying the endogenous latents, their interrelations, and their observed children.
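As a minimal illustration of this centroid comparison (the function name pcc and the list-based centroid representation are our own, not taken from the LPCC code), the PCC of two centroids over O can be computed as:

    def pcc(centroid_a, centroid_b):
        """Binary PCC vector: 1 where the two centroids differ, 0 where they agree."""
        return [int(a != b) for a, b in zip(centroid_a, centroid_b)]

    # Example with two 6-dimensional binary centroids:
    print(pcc([0, 0, 0, 1, 1, 1], [1, 1, 1, 1, 1, 1]))   # -> [1, 1, 1, 0, 0, 0]

When both centroids belong to major clusters, the 1s mark exactly the observed variables whose values changed with ex, which is how the vectors are used in Section 3.1.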
3 OVERVIEW OF LPCC

LPCC is fed by samples of the observed variables. It clusters the data using the self-organizing map (SOM) algorithm (although any algorithm that does not require a prior setting of the number of clusters is appropriate) and selects an initial set of major clusters. Then LPCC learns an LVM in two stages. First, it identifies (Section 3.1) latent variables (and their observed descendants) without distinguishing exogenous from endogenous latents, before distinguishing latent colliders from exogenous variables (Section 3.2). LPCC iteratively improves the selection of major clusters (Section 3.3), and the entire stage is repeated until convergence. Second, LPCC identifies endogenous latent non-colliders among the previously identified latent variables and splits these two types of latent variables from each other before finding the links between them (Section 3.4).

3.1 IDENTIFICATION OF LATENT VARIABLES

We demonstrate the relations between PCC and learning an LVM by an example. G1 in Figure 1 has two exogenous variables, L1 and L2, each having three children, X1, X2, X3 and X4, X5, X6, respectively.³ For the example, assume all variables are binary, i.e., L1 and L2 have four possible exs (L1L2 = 00, 01, 10, 11). We used a uniform distribution over L1 and L2 and set the probabilities of an observed child, Xi, i = 1, ..., 6, given its latent parent, Lk, k = 1, 2 (only if Lk is a direct parent of Xi, e.g., L1 and X1), to be P(Xi = v | Lk = v) = 0.8, v = 0, 1. First, we generated a synthetic data set of 1,000 patterns from G1. Second, using SOM [Kohonen, 1997], we clustered the data and found 16 clusters, of which four were major (Section 3.3 gives details on how to identify major clusters). This meets our expectation of four major clusters corresponding to the four possible exs. These clusters are presented in Table 1a by their centroids, which are the most prevalent patterns in the clusters, and their corresponding PCCs are given in Table 1b. For example, PCC1,2, comparing clusters C1 and C2, shows that when moving from C1 to C2, only the values corresponding to variables X1–X3 have changed (i.e., δX1 = δX2 = δX3 = 1 in Table 1b). Also, PCC1,4, PCC2,3, and PCC3,4 show that X1–X3 always change their values together. This may be evidence that X1–X3 are descendants of the same exogenous latent, which, as we know from the true graph G1, is L1. Note, however, that PCC1,4 and PCC2,3 show that the values corresponding to X4–X6 have changed together too, whereas these values did not change in PCC1,2 and PCC3,4. Because X4–X6 changed their values only in PCC1,4 and PCC2,3 but not in PCC1,2 and PCC3,4, they cannot be descendants of L1. This strengthens the evidence that X1–X3 are the only descendants of L1. A similar analysis using PCC1,3 and PCC2,4 will identify that X4–X6 are descendants of another latent variable (L2, as we know).

³We determined three indicators per latent in all true models we learn (Figure 1) because this is required by BPC, making the experimental evaluation in Section 4 fair.

Table 1: (a) Centroids of Major Clusters for G1 and (b) PCCs between These Major Clusters

(a)
Centroid  X1 X2 X3 X4 X5 X6
C1         0  0  0  1  1  1
C2         1  1  1  1  1  1
C3         0  0  0  0  0  0
C4         1  1  1  0  0  0

(b)
PCC  δX1 δX2 δX3 δX4 δX5 δX6
1-2   1   1   1   0   0   0
1-3   0   0   0   1   1   1
1-4   1   1   1   1   1   1
2-3   1   1   1   1   1   1
2-4   0   0   0   1   1   1
3-4   1   1   1   0   0   0
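The G1 example can be reproduced in a few lines. The sketch below is our own; it sidesteps SOM clustering and simply treats the most frequent observed patterns as stand-ins for the major-cluster centroids, under the parameters stated above:

    import random
    from collections import Counter

    def sample_g1(n, p=0.8, seed=0):
        """Draw n samples of (X1,...,X6) from G1: two independent, uniformly
        distributed binary latents, L1 (parent of X1-X3) and L2 (parent of
        X4-X6), with P(Xi = v | parent = v) = p."""
        rng = random.Random(seed)
        def child(parent):
            return parent if rng.random() < p else 1 - parent
        data = []
        for _ in range(n):
            l1, l2 = rng.randint(0, 1), rng.randint(0, 1)
            data.append(tuple(child(l1) for _ in range(3)) + tuple(child(l2) for _ in range(3)))
        return data

    counts = Counter(sample_g1(1000))
    print(counts.most_common(4))
    # With p = 0.8 and 1,000 samples, the four most frequent patterns are
    # typically (0,0,0,1,1,1), (1,1,1,1,1,1), (0,0,0,0,0,0), and (1,1,1,0,0,0),
    # i.e., the major-cluster centroids of Table 1a; applying pcc() from the
    # sketch in Section 2.3 to pairs of them reproduces the vectors of Table 1b.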
Therefore, we define a maximal set of observed (MSO) variables as the set of variables that always changes its values together in each major–major PCC in which at least one of the variables changes value. For example, X1 (Table 1b) changes its value in PCC1,2, PCC1,4, PCC2,3, and PCC3,4 and always together with X2 and X3 (and vice versa). Thus, {X1, X2, X3} (and similarly {X4, X5, X6}) is an MSO. Each MSO includes descendants of the same latent variable L, and once LPCC identifies this MSO, it introduces L to the learned graph as a new latent variable, together with all the observed variables that are included in this MSO as L's children. We prove in Asbeh and Lerner [2016a] that every observed variable belongs to one and only one MSO, i.e., MSOs corresponding to the learned latents are disjoint, which means that LPCC learns a pure measurement model. We also prove that variables of a particular MSO are children of a particular exogenous latent variable EX or its latent non-collider descendant, or children of a particular latent collider C. This guarantees that each of multiple latent variables (either an exogenous or any of its non-collider descendants or a collider) is identified by its own MSO.
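Reading MSOs off a set of major–major PCC vectors amounts to grouping variables whose change patterns coincide. A minimal sketch (our own function name and representation, assuming every observed variable changes in at least one PCC):

    def msos(pccs):
        """Group observed-variable indices whose change-indicator columns agree
        across all major-major PCC vectors (each a binary list of length |O|)."""
        groups = {}
        for i in range(len(pccs[0])):
            column = tuple(p[i] for p in pccs)
            groups.setdefault(column, []).append(i)
        return list(groups.values())

    table_1b = [
        [1, 1, 1, 0, 0, 0],   # PCC 1-2
        [0, 0, 0, 1, 1, 1],   # PCC 1-3
        [1, 1, 1, 1, 1, 1],   # PCC 1-4
        [1, 1, 1, 1, 1, 1],   # PCC 2-3
        [0, 0, 0, 1, 1, 1],   # PCC 2-4
        [1, 1, 1, 0, 0, 0],   # PCC 3-4
    ]
    print(msos(table_1b))   # -> [[0, 1, 2], [3, 4, 5]], i.e., {X1,X2,X3} and {X4,X5,X6}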
LPCC cannot yet distinguish between exogenous latents and latent colliders because the main goal at this stage was to identify latent variables and their relations with the observed variables. In Section 3.2, we distinguish these two types of variables, whereas in Section 3.4, we use major–minor PCCs rather than major–major PCCs to distinguish latent non-colliders from exogenous latents.

3.2 DISTINGUISHING LATENT COLLIDERS

To demonstrate distinguishing latent colliders from exogenous variables, we use graph G2 in Figure 1. It shows two exogenous latent variables, L1 and L3, that collide in an endogenous latent variable, L2, each having three observed children: X1–X3 (L1), X4–X6 (L2), and X7–X9 (L3). We assume for the example that all variables are binary. Having two exogenous variables, we expect to find four major clusters in the data; each will correspond to one of the four possible exs (L1L3 = 00, 01, 10, 11). As for G1 (Section 3.1), we expect the values of X1–X3 to change together in all PCCs following a change in the value of L1, and the values of X7–X9 to change together in all the PCCs following a change in the value of L3. However, the values of X4–X6 will change together with those of X1–X3 in part of the PCCs and together with those of X7–X9 in the remaining PCCs, but always together with each other in all of the PCCs. This will be evidence that X4–X6 are descendants of the same latent collider (L2, as we know). That is, to learn that an already learned latent variable L is a collider for a set of other already learned (exogenous) latent ancestor variables LA⊂EX, LPCC requires that: (1) The values of the children of L will change with the values of descendants of different latent variables in LA in different parts of the major–major PCCs; and (2) The values of the children of L will not change in any PCC unless the values of descendants of at least one of the variables in LA change. This ensures that L does not change independently of latents in LA that are L's ancestors.
3.3 CHOOSING MAJOR CLUSTERS

Due to a lack of prior information regarding the distribution of latent variables, LPCC, first, assumes a uniform distribution over a latent and selects the major clusters based on their size, i.e., the number of patterns clustered. Clusters that are larger than the average cluster size are selected as majors. However, this initial selection may generate: 1) false negative errors (i.e., deciding a major cluster is minor), when a latent variable L has a skewed distribution over its values, and then a rare value can be represented only by small clusters that could not be chosen as majors; and 2) false positive errors (i.e., deciding a minor cluster is major), when, due to similar probabilities of different values of an observed child given its parent L and a large sample size, a cluster that is supposed to be minor becomes too large and is selected as major.

To avoid such errors, LPCC learns iteratively. Following the selection of major clusters based on their sizes and learning a graph, it becomes possible to find the cardinalities of the latent variables and all possible exs. For each ex, we can use the most probable cluster given the data as the major cluster and, using an EM-style procedure [Dempster et al., 1977], update the set of major clusters iteratively and probabilistically to augment LPCC to learn more accurate graphs. The final graph may not be optimal since it depends on the initial graph, but it is an improved version of the initial graph.
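The initial, size-based selection is straightforward; a toy sketch (the helper name and the cluster sizes below are ours and are not taken from the paper's tables):

    def select_initial_majors(cluster_sizes):
        """Keep every cluster whose size exceeds the average cluster size."""
        average = sum(cluster_sizes.values()) / len(cluster_sizes)
        return [cid for cid, size in cluster_sizes.items() if size > average]

    print(select_initial_majors({"c1": 120, "c2": 110, "c3": 30, "c4": 25, "c5": 15}))
    # -> ['c1', 'c2']  (the average cluster size here is 60)

The EM-style refinement described above then replaces this heuristic by taking, for each ex, the most probable cluster given the data.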
3.4 IDENTIFICATION OF LATENT NON-COLLIDER VARIABLES

So far, the latent non-colliders that are descendants of an exogenous variable EX were temporarily combined with it, and all their observed children were temporarily combined with the direct children of EX. To exemplify that, check G3 in Figure 1, showing a serial connection of L1, L2, and L3. Assume each latent has three observed children, and all variables are binary. L1 is the only EX, with two possible exs (L1 = 0, 1), and L2 and L3 are NCs; L2 is a child of L1 and a parent of L3. We set the probabilities of: 1) L1 uniformly; 2) an observed child Xi, i = 1, ..., 9, given its latent parent Lk, k = 1, 2, 3 (if this is a direct parent), as P(Xi = v | Lk = v) = 0.8, v = 0, 1; and 3) an endogenous latent Lj, j = 2, 3, given its latent parent Lk, k = 1, 2 (if this is a direct parent), as P(Lj = v | Lk = v) = 0.8, v = 0, 1. We generated 1,000 patterns from G3 over the nine observed variables. Table 2 presents ten of the seventeen largest clusters by their centroids and sizes, from which C1 and C2 were selected as major. This meets our expectation of two major clusters corresponding to the two possible exs of L1. However, because all the elements in PCC1,2 are 1s (compare C1 and C2 in Table 2), the nine observed variables establish a single MSO and thus are considered descendants of the same exogenous variable. That is, the model learned in the first stage of LPCC has only one exogenous latent variable (i.e., L1) with direct children that are the nine observed descendants, contrary to G3. Since L2 and L3, which are latent non-colliders that are descendants of L1, were combined with L1, LPCC should now split them from L1 along with their observed children.

Table 2: Ten of the Seventeen Largest Clusters for G3

Cluster  X1 X2 X3 X4 X5 X6 X7 X8 X9  Size
C1        1  1  1  1  1  1  1  1  1   49
C2        0  0  0  0  0  0  0  0  0   47
C3        1  1  1  1  1  1  1  1  0   28
C4        0  0  0  0  0  0  0  1  0   24
C5        0  1  0  0  0  0  0  0  0   22
C6        1  1  1  1  1  1  0  0  0   22
C7        0  0  1  0  0  0  0  0  0   21
C8        0  0  0  1  1  1  1  1  1   19
C9        0  0  0  0  0  0  1  1  1   18
C10       1  1  1  0  0  0  0  0  0   16

Such a split is based on major–minor (rather than major–major) PCCs. First, we define a first-order minor cluster (1-MC) as a cluster that corresponds to a 1-order minor value configuration, which exists when exactly one endogenous variable in EN (either latent or observed) has a minor local value as a response to a value that EX has obtained.⁴ To reveal the existence of latent non-colliders that were previously combined with EX and to split them from EX, we analyze, for each EX, PCCs between 1-MCs and the major clusters that identified EX.

⁴Based on the identification of a 1-MC [Asbeh and Lerner, 2016a], we find, e.g., in learning G3 that C2 is the minimal major cluster, and all other clusters (Table 2) are 1-MCs.
Thus, second, we define PCCs between 1-MCs and major clusters that show two sets of two or more observed variables, each set having the same value, different than that of the other set, as 2S-PCCs (i.e., PCCs of "two sets") and the corresponding 1-MCs as 2S-MCs. To identify a latent non-collider that was combined with an exogenous latent EX, we consider only 2S-PCCs; these PCCs are the result of comparing all the 2S-MCs among the 1-MCs for EX with the major clusters that revealed EX. For example, Table 3 presents all 2S-PCCs for G3. While a PCC that is not a 2S-PCC identifies a minor value of an observed descendant of EX (e.g., PCC1,3, where C1 is major and C3 is a 1-MC, identifies a minor value of X9; see Table 4), a 2S-PCC identifies a minor value in a latent non-collider descendant of EX and thereby the existence of this latent. The latter is seen, e.g., in Table 3, in PCC1,6, showing the change between C1 (major cluster) and C6 (1-MC) in the values of X7–X9, and in PCC2,6, showing the change between C2 (major cluster) and C6 in the values of X1–X6, thus identifying the existence of a latent non-collider (L3, as we know). This identification yields a split of exogenous L1 into two latents: one (L3) is a parent of X7–X9, and the other is a parent of X1–X6 (which later will split again into L1 and L2, each with its own three children).

Table 3: All 2S-PCCs for G3

PCC   δX1 δX2 δX3 δX4 δX5 δX6 δX7 δX8 δX9
1-6    0   0   0   0   0   0   1   1   1
2-6    1   1   1   1   1   1   0   0   0
1-8    1   1   1   0   0   0   0   0   0
2-8    0   0   0   1   1   1   1   1   1
1-9    1   1   1   1   1   1   0   0   0
2-9    0   0   0   0   0   0   1   1   1
1-10   0   0   0   1   1   1   1   1   1
2-10   1   1   1   0   0   0   0   0   0

Table 4: PCCs for C3 with C1 and C2 in Learning G3

PCC   δX1 δX2 δX3 δX4 δX5 δX6 δX7 δX8 δX9
1-3    0   0   0   0   0   0   0   0   1
2-3    1   1   1   1   1   1   1   1   0
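Under our reading of this definition, a PCC vector is a 2S-PCC exactly when both its changed set and its unchanged set contain at least two variables; a minimal sketch:

    def is_2s_pcc(pcc_vec):
        """True if the comparison splits the observed variables into a changed set
        and an unchanged set of two or more variables each."""
        ones = sum(pcc_vec)
        return ones >= 2 and len(pcc_vec) - ones >= 2

    print(is_2s_pcc([0, 0, 0, 0, 0, 0, 0, 0, 1]))   # PCC 1-3 in Table 4: False
    print(is_2s_pcc([1, 1, 1, 1, 1, 1, 1, 1, 0]))   # PCC 2-3 in Table 4: False
    print(is_2s_pcc([0, 0, 0, 0, 0, 0, 1, 1, 1]))   # PCC 1-6 in Table 3: True

This matches Tables 3 and 4: the PCCs of C3 with the major clusters change, or leave unchanged, only a single variable and therefore do not qualify, so C3 is not a 2S-MC.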
However, relying only on part of the 2S-PCCs may be inadequate to conclude on all possible splits (e.g., PCC1,8 and PCC2,8 show that X1–X3 and X4–X9 are children of different latents, but do not suggest the split of X7–X9 as PCC1,6 and PCC2,6 do). Thus, it is necessary to introduce, for 2S-PCCs, a maximal set of observed variables (2S-MSO) that always change their values together in all 2S-PCCs. For example, X1 in Table 3 changes its value in PCC2,6, PCC1,8, PCC1,9, and PCC2,10 and always together with X2 and X3 (and the other way around). Thus, {X1, X2, X3} and similarly {X4, X5, X6} and {X7, X8, X9} are 2S-MSOs. Each 2S-MSO includes children of the same latent non-collider, which is a descendant of EX, or of EX itself. LPCC detects 2S-MSOs for each EX and thereby identifies its possible splits.

Finally, we show in Asbeh and Lerner [2016a] how to identify links between split latents. For a serial connection, LPCC finds the source (EX) and the latent sink on the path between them, but not which is which, and thus it can only learn this connection as undirected. For a diverging connection, LPCC learns all directed links among the latents. That is, LPCC learns a pattern over the structural model of G, which represents a Markov equivalence class of models among the latents, where in the special case in which G has no serial connection, LPCC learns the true graph.

Based on the concepts outlined in Section 3, we formally introduce the LPCC algorithm in Asbeh and Lerner [2016b] and thoroughly evaluate it experimentally using synthetic and real-world data compared with other algorithms. The complete results are given in Asbeh and Lerner [2016b], and a snapshot of them is provided in Section 4.

4 EXPERIMENTAL EVALUATION

Simulated data: We used Tetrad IV to construct graphs G2 and G4 of Figure 1 for three parametrization levels that differ by the conditional probabilities pj = 0.7, 0.75, and 0.8 between a latent and each of its children. Each such level imposes a different complexity on the model and thereby affects the task of learning the latent model and the causal relations (i.e., pj = 0.7 poses a larger challenge to learning than pj = 0.75) [full details about the types of variables and parametrization schemes are omitted; see Asbeh and Lerner, 2016b]. We drew data sets of between 250 and 2,000 samples and report on the structural Hamming distance (SHD) [Tsamardinos et al., 2006] as a performance measure for learning the LVM structure. SHD is a structural measure that accounts for all the possible learning errors (the lower the value, the better): addition and deletion of an undirected edge, and addition, removal, and reversal of edge orientation.
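For reference, a compact and simplified rendering of this distance over partially directed graphs, in our own edge encoding (ordered pairs for directed edges, frozensets for undirected ones), is sketched below; it is not the exact implementation of Tsamardinos et al. [2006]:

    def shd(true_edges, learned_edges):
        """Simplified structural Hamming distance: one unit per missing or extra
        adjacency plus one unit per shared but differently oriented edge."""
        def skeleton(edges):
            return {frozenset(e) for e in edges}
        def orientation(edges):
            return {frozenset(e): e for e in edges}
        t_skel, l_skel = skeleton(true_edges), skeleton(learned_edges)
        distance = len(t_skel ^ l_skel)          # added or deleted adjacencies
        t_or, l_or = orientation(true_edges), orientation(learned_edges)
        for pair in t_skel & l_skel:
            if t_or[pair] != l_or[pair]:         # same adjacency, wrong orientation
                distance += 1
        return distance

    print(shd({("L1", "X1"), ("L1", "X2")},
              {("L1", "X1"), ("X2", "L1"), frozenset({"L1", "X3"})}))
    # -> 2: the L1-X2 edge is reversed, and L1-X3 is an extra (undirected) edge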
(a) SHD of LPCC/BPC/EFA vs. parametrization level. For G2/1,000 samples, LPCC/BPC learn perfectly.
(b) SHD of LPCC/BPC/EFA vs. number of indicators per latent in G2, pj = 0.75, and four sample sizes.
Figure 2: Structural Correctness of Learning Algorithms.

Figure 2a shows SHD values for the LPCC, BPC, and EFA algorithms for increasing parametrization levels for four combinations of learned graphs and sample sizes. It shows that LPCC and BPC improve performance, as expected, with increased levels of latent-observed variable correlation (pj). LPCC never falls behind BPC, and its advantage over BPC is especially vivid for a small sample size. EFA, besides falling behind LPCC and BPC, also demonstrates worsening performance with an increasing parametrization level, especially for large sample sizes. Larger parametrization levels increase the chances of EFA learning links between latent variables and observed variables – some of which are not between a latent and its real child – to compensate for the algorithm's inability to identify links among latents (EFA assumes latents are uncorrelated). EFA is inferior to LPCC for all parametrization levels, sample sizes, and graphs.

Figure 2b shows SHD values of the LPCC, BPC, and EFA algorithms for increasing numbers of binary indicators per latent variable in G2, a parametrization level of 0.75, and four sample sizes. The figure exhibits the superiority of LPCC over BPC and EFA for all scenarios. While LPCC hardly worsens its performance with the increase of complexity, both BPC and EFA are affected by this increase. They also have difficulty in learning an LVM for which latent variables have exactly two indicators.

Real-world data – the political action survey (PAS): We evaluated LPCC using a simplified PAS data set over six variables [Joreskog, 2004]: NOSAY, VOTING, COMPLEX, NOCARE, TOUCH, and INTEREST. These variables represent political efficacy and correspond to questions to which the respondents have to give their degree of agreement on a discrete ordinal scale of four values. This data set contains the responses of a sample of 1,076 US respondents. A model consisting of two latents that correspond to the previously established theoretical traits of Efficacy and Responsiveness (Figure 3a) discards VOTING [Joreskog, 2004] based on the argument that the question for VOTING is not clearly phrased. Similar to the theoretical model, LPCC finds two latents (Figure 3b): one corresponds to NOSAY and VOTING, and the other corresponds to NOCARE, TOUCH, and INTEREST. Compared with the theoretical model, LPCC misses the edge between Efficacy and NOCARE and the bidirectional edge between the latents. Both edges are not supposed to be discovered by LPCC or BSPC/BPC; the former because these algorithms learn a pure measurement model in which each observed variable has only one latent parent, and the latter because no cycles are assumed. Nevertheless, in contrast to the theoretical model, LPCC makes no use of prior knowledge. The BSPC output (Figure 3c) is very similar to the LPCC output, except for NOCARE, which was not identified by BSPC. Both algorithms identify VOTING as a child of Efficacy (at the expense of COMPLEX). The outputs of the BPC algorithm (Figures 3d,e) are poorer than those of LPCC and BSPC and are sensitive to the significance level. The output of the FCI algorithm (not shown here) using any significance level is not sufficient (showing, e.g., that NOSAY and INTEREST potentially have a latent common cause, whereas these two variables are indicators of different latents in the theoretical model).

Figure 3: PAS: outputs of (a) a true model, (b) LPCC, (c) BSPC, and BPC with (d) α = 0.01, 0.05, and (e) α = 0.1.

5 SUMMARY AND CONCLUSIONS

In this work, we introduced the PCC concept and the LPCC algorithm for learning discrete LVMs: 1) the PCC concept to learn an LVM ties graphical models with data clustering; 2) LPCC learns MIMs; 3) LPCC is not limited to latent-tree models and does not assume linearity; 4) LPCC assumes that the measurement model of the true graph is pure, but, if the true graph is not, it learns a pure sub-model of the true model, if one exists. LPCC also assumes that a latent collider does not have any latent descendants; 5) LPCC is a two-stage algorithm that exploits PCC. First, it learns the exogenous latents and latent colliders, as well as their observed descendants, and second, it learns the endogenous latent non-colliders and their children by splitting these latents from their previously learned latent ancestors; and 6) LPCC learns an equivalence class of the structural model of the true graph.

Using simulated and real-world data, we show in Asbeh and Lerner [2016b] that LPCC improves accuracy with sample size, that it can learn large LVMs, and that it has consistently good results compared to models that are expert-based or learned by state-of-the-art algorithms. Applied to two original domains, LPCC helped identify possible causes of young drivers' involvement in road accidents and cell subpopulations in the immune system.

References

N. Asbeh and B. Lerner. Learning latent variable models by pairwise cluster comparison: Part I – Theory and overview. Journal of Machine Learning Research, 17(224):1–52, 2016a.

N. Asbeh and B. Lerner. Learning latent variable models by pairwise cluster comparison: Part II – Algorithm and evaluation. Journal of Machine Learning Research, 17(233):1–45, 2016b.

D. J. Bartholomew, F. Steele, I. Moustaki, and J. I. Galbraith. The Analysis and Interpretation of Multivariate Data for Social Scientists (Texts in Statistical Science Series). Chapman & Hall/CRC Press, Boca Raton, Florida, USA, 2002.

A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B 39:1–39, 1977.

G. Elidan, N. Lotner, N. Friedman, and D. Koller. Discovering hidden variables: A structure-based approach. In Advances in Neural Information Processing Systems, pages 13:479–485, 2000.

S. Harmeling and C. K. I. Williams. Greedy learning of binary latent trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1087–1097, 2011.

K. Joreskog. Structural equation modeling with ordinal variables using LISREL. Technical report, Scientific Software International Inc, 2004.

R. Klee. Introduction to the Philosophy of Science: Cutting Nature at its Seams. Oxford University Press, New York, New York, 1997.

T. Kohonen. Self-Organizing Maps. Springer-Verlag, New York, New York, 1997.

J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Press, San Mateo, California, 1988.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, New York, 2000.

J. Pearl and T. Verma. A theory of inferred causation. In Proceedings of the 2nd International Conference on Principles of Knowledge Representation and Reasoning, pages 441–452, Cambridge, MA, 1991.

R. Scheines, P. Spirtes, C. Glymour, C. Meek, and T. Richardson. The TETRAD project: Constraint based aids to causal model specification. Technical report, Department of Philosophy, Carnegie-Mellon University, Pittsburgh, Pennsylvania, 1995.

S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvarinen, Y. Kawahara, T. Washio, P. Hoyer, and K. Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12:1225–1248, 2011.

R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191–246, 2006.

P. Spirtes. Calculation of entailed rank constraints in partially non-linear and cyclic models. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, pages 606–615, Bellevue, Washington, 2013.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. MIT Press, New York, New York, 2nd edition, 2000.

I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65:31–78, 2006.

Y. Wang, N. L. Zhang, and T. Chen. Latent-tree models and approximate inference in Bayesian networks. Journal of Artificial Intelligence Research, 32:879–900, 2008.

N. Zhang. Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, 5:697–723, 2004.