Pairwise Cluster Comparison for Learning Latent Variable Models

Nuaman Asbeh and Boaz Lerner
Department of Industrial Engineering & Management
Ben-Gurion University of the Negev, Israel
asbeh@post.bgu.ac.il; boaz@bgu.ac.il

Abstract

Learning a latent variable model (LVM) exploits values of the measured variables as manifested in the data for causal discovery. Because the challenge in learning an LVM is similar to that faced in unsupervised learning, where the number of clusters and the classes that are represented by these clusters are unknown, we link causal discovery and clustering. We propose the concept of pairwise cluster comparison (PCC), by which clusters of data points are compared pairwise to associate changes in the observed variables with changes in their ancestor latent variables and thereby to reveal these latent variables and their causal paths of influence, and the learning PCC (LPCC) algorithm that identifies PCCs and uses them to learn an LVM. LPCC is not limited to linear or latent-tree models. It returns a pattern of the true graph or the true graph itself if the graph has serial connections or not, respectively. The complete theoretical foundation to PCC, the LPCC algorithm, and its experimental evaluation are given in [Asbeh and Lerner, 2016a,b], whereas, here, we only introduce and promote them. The LPCC code and evaluation results are available online.

1 LVM LEARNING

Latent variables cannot usually be observed directly from data but only inferred from observed variables (indicators) [Spirtes, 2013]. Concepts such as "life quality," "mental states," and "psychological stress" play a key role in scientific models, and yet are latent [Klee, 1997].

Latent variable models (LVMs) represent latent variables and causal relations among them that are manifested in the observed variables. These models are vital in, e.g., economics, social sciences, natural language processing, and machine learning and have become the focus of many studies. By aggregating observed variables into a few latent variables, each of which represents a "concept" explaining some aspects of the domain, LVMs reduce dimensionality and facilitate interpretation.

Learning an LVM exploits values of the measured variables to infer causal relationships among latent variables and to predict these variables' values. Although statistical methods, such as factor analysis, effectively reduce dimensionality and may fit the data reasonably well, the resulting models might not have any correspondence to real causal mechanisms [Silva et al., 2006]. Learning Bayesian networks (BNs) focuses on causal relations among observed variables, but the detection of latent variables and their interrelations among themselves and with the observed variables has received little attention. Learning an LVM using Inductive Causation* (IC*) [Pearl, 2000, Pearl and Verma, 1991] and Fast Causal Inference (FCI) [Spirtes et al., 2000] returns partial ancestral graphs, which indicate for each link whether it is a (potential) manifestation of a hidden common cause for the two linked variables. The FindHidden algorithm [Elidan et al., 2000] detects substructures that suggest the presence of latents in the form of dense subnetworks, but it cannot always find a pure measurement sub-model [Silva et al., 2006]. Also, the recovery of latent trees of binary and Gaussian variables has been suggested [Pearl, 2000]. Hierarchical latent class (HLC) models, which are rooted trees where the leaf nodes are observed while all other nodes are latent, have been proposed [Zhang, 2004]. Two greedy algorithms are suggested [Harmeling and Williams, 2011] to expedite learning of both the structure of a binary HLC and the cardinalities of the latents. Latent-tree models are also used to speed approximate inference in BNs, trading the approximation accuracy with inferential complexity [Wang et al., 2008].
Models in which multiple latents may have multiple indicators (observed children), i.e., multiple indicator models (MIMs) [Bartholomew et al., 2002], are an important subclass of structural equation models (SEM), which are widely used in applied and social sciences to analyze causal relations [Pearl, 2000, Shimizu et al., 2011]. For these models, and others that are not tree-constrained, most of the mentioned algorithms may lead to unsatisfactory results. An algorithm that fills the gap between learning latent-tree models and learning MIMs is BuildPureClusters [BPC; Silva et al., 2006]. It searches for the set of MIMs (an equivalence class) that best matches the set of vanishing tetrad differences [Scheines et al., 1995], but it is limited to linear models [Spirtes, 2013].

We target the goal of Silva et al. [2006], but concentrate on the discrete case. Interestingly, the same difficulty in learning MIMs is also faced in unsupervised learning that confronts questions such as: (1) How many clusters are there in the observed data? and (2) Which classes do the clusters really represent? Due to this similarity, we link learning an LVM and clustering and propose a concept and an algorithm that combine the two disciplines. According to the pairwise cluster comparison (PCC) concept, we compare pairwise clusters of data points representing instantiations of the observed variables to identify those pairs of clusters that exhibit major changes in the observed variables due to changes in their ancestor latent variables. Changes in a latent variable that are manifested in changes in the observed variables reveal this latent variable and its causal paths of influence. The learning PCC (LPCC) algorithm uses PCCs to learn an LVM by identifying latent variables – exogenous and endogenous (the latter may be either colliders or non-colliders) – and their causal interrelationships, as well as their children (latent and observed variables) and causal paths from latent variables to observed variables.

The complete theoretical foundation to PCC is given in Asbeh and Lerner [2016a], and the description of the LPCC algorithm and its experimental evaluation are provided in Asbeh and Lerner [2016b], whereas, here, we only briefly introduce, motivate, and promote them. In the following, we give preliminaries to LVM learning. In Section 2, we introduce and motivate PCC, and in Section 3, we provide an overview of the LPCC algorithm. In Section 4, we experimentally compare LPCC to other learning algorithms using synthetic and real-world databases. Finally, in Section 5, we summarize the contribution of LPCC.

First, we present two assumptions that LPCC makes. A1: The underlying model is a Bayesian network, BN=⟨G, Θ⟩, encoding a discrete joint probability distribution P for a set of random variables V=L∪O, where G=⟨V, E⟩ is a directed acyclic graph whose nodes V correspond to latents L and observed variables O, E is the set of edges between nodes in G, and Θ is the set of parameters; and A2: No observed variable in O is an ancestor of any latent variable in L (the measurement assumption [Spirtes et al., 2000]). We define [following Silva et al., 2006]: D1: A model satisfying A1 and A2 is a latent variable model; D2: Given an LVM G with a variable set V, the subgraph containing all variables in V and all and only those edges directed into variables in O is called the measurement model of G; D3: Given an LVM G, the subgraph containing all and only G's latent nodes and their respective edges is called the structural model of G; and D4: A pure measurement model is a measurement model in which each observed variable has only one latent parent and no observed parent. Then, we also assume that: A3: The measurement model of G is pure; and A4: Each latent in the true model G has at least two observed children and may have latent parents.
As Silva et al. [2006] pointed out, factor analysis, principal component analysis, and regression analysis adapted to learning LVMs are well understood, but have not been proven under any general assumptions to learn the true causal LVM, calling for better learning methods. Causal structure discovery – learning the number of latent variables, their interconnections, and connections to the observed variables, as well as the interconnections among the observed variables – is difficult and requires making assumptions about the problem. By assuming that the true model manifests local influence of each latent variable on at least a small number of observed variables, Silva et al. [2006] show that learning the complete Markov equivalence class of MIMs is feasible. Similarly, we assume the true model is pure (A3). When it is pure, LPCC will identify it correctly (or find its pattern that represents the equivalence class of the true graph), and when it is not, LPCC will learn a pure submodel of the true model, in both cases using only two indicators per latent (compared to three indicators per latent that are required by BPC [Silva et al., 2006]).

Note the tradeoff between the structural and parametric assumptions that an algorithm for learning an LVM has to make; the fewer parametric assumptions it makes, the more structural assumptions it has to make and vice versa. While BPC needs to make a parametric assumption about the linearity of the model, and the latent-tree algorithms [Zhang, 2004, Harmeling and Williams, 2011, Wang et al., 2008] restrict the learned structure to a tree,¹ LPCC assumes that the model structure is pure, and A5: A latent collider has no latent descendants.

¹LPCC is not limited to a tree because it allows latent variables to be colliders.

2 PRELIMINARIES TO PCC

Figure 1 sketches a range of pure measurement models, from basic to more complex. G1 is a basic MIM of two unconnected latents, and G2 shows a structural model having a latent collider. Note that such an LVM cannot be learned by latent-tree algorithms such as in Zhang [2004]. G3 and G4 demonstrate serial and diverging structural models, respectively, that together with G2 cover the three basic structural models. G5 and G6 manifest more complex structural models comprising a latent collider and a combination of serial and diverging connections. As the structural model becomes more complicated, the learning task becomes more challenging; hence, G1–G6 present a spectrum of challenges.²

²In Section 4, we compare LPCC with BPC and exploratory factor analysis (EFA) using these six LVMs. Since BPC requires three indicators per latent to identify a latent, we determined from the beginning three indicators per latent for all true models to recover. Nevertheless, in Section 4, we evaluate the learning algorithms for increasing numbers of indicators.

Section 2.1 builds the infrastructure to PCC that relies on understanding the influence of the exogenous latent variables on the observed variables. This influence is divided into major and minor effects that are introduced in Section 2.2. Section 2.3 links this structural influence to data clustering and introduces the pairwise cluster comparison concept for learning an LVM.

2.1 INFLUENCE OF EXOGENOUS LATENTS ON OBSERVED VARIABLES

We distinguish between observed (O) and latent (L) variables and between exogenous (EX) and endogenous (EN) variables. EX variables have zero in-degree, and are autonomous and unaffected by the values of the other variables (e.g., L1 in G6 in Figure 1), whereas EN are all non-exogenous variables in G (e.g., L2 in G2 and G6, and X1 in all graphs in Figure 1). We identify three types of variables: (1) Exogenous latents, EX⊂(L∩NC) [all exogenous variables are latent non-colliders (NC)]; (2) Endogenous latents, EL⊂(L∩EN), which are divided into latent colliders C⊂EL (e.g., L2 in G5) and latent non-colliders S⊂(EL∩NC) (e.g., L3 in G6), thus NC=(EX∪S); and (3) Observed variables, O⊂EN, which are always endogenous and childless, and are divided into children of exogenous latents OEX⊂O (e.g., X9 in G2), children of latent colliders OC⊂O (e.g., X5 in G2), and children of endogenous latent non-colliders OS⊂O (e.g., X4 in G3). We denote value configurations of EX, EN (when we do not know whether the endogenous variables are latent or observed), EL, C, NC (when we do not know whether the non-collider variables are exogenous or endogenous), S, O, OEX, OC, and OS by ex, en, el, c, nc, s, o, oex, oc, and os, respectively.

Since the underlying model is a BN, the joint probability over V, which is represented by the BN, is factored according to the local Markov assumption for G to a product of products of 1) prior probabilities for the exogenous variables (EX), 2) probabilities of endogenous (collider and non-collider) latent variables (EL) conditioned on their latent parents, and 3) probabilities of observed variables (O) conditioned on their latent parents.
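For concreteness, this three-way factorization can be written as follows (the notation pa(·) for a variable's configuration of latent parents is ours, not the paper's):

    P(\mathbf{V}) \;=\; \prod_{EX_i \in \mathbf{EX}} P(ex_i)\; \prod_{EL_j \in \mathbf{EL}} P\bigl(el_j \mid \mathrm{pa}(EL_j)\bigr)\; \prod_{O_k \in \mathbf{O}} P\bigl(o_k \mid \mathrm{pa}(O_k)\bigr).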
To demonstrate the influence of exogenous (latent) variables on observed variables and its relation to learning an LVM, we show that changes in values of the observed variables are due to changes in values of the exogenous variables, and thus the identification of the former indicates the existence of the latter. To do that, we analyze the propagation of influence along directed paths [Pearl, 1988] connecting both variables, first among latents, remembering that the paths may contain latent colliders and latent non-colliders, and then along paths ending in their sinks (i.e., the observed variables).

We prove in Asbeh and Lerner [2016a] that a) a latent non-collider has only a single exogenous latent ancestor, and there is only a single directed path between them; and b) a latent collider is connected to a set of exogenous latent ancestors via a set of directed paths. By separating the influence of exogenous variables into distinct paths of influence, we can determine the joint probability over V due to value assignment ex to the exogenous set EX using this assignment and the BN conditional probabilities.

2.2 MAJOR AND MINOR EFFECTS/VALUES

After analyzing the structural influences (paths of hierarchies) of the latents on the observed variables, we now complement this analysis with the parametric influences, which we divide into major and minor effects.

We define a local effect on an endogenous variable EN as the influence of a configuration of EN's direct latent parents on any of EN's values. We distinguish between: 1) a major local effect, which is the largest local effect on EN that is identified by the maximal conditional probability of a specific value en of EN given a configuration pa of EN's latent parents; and 2) a minor local effect, which is any non-major local effect on EN that is identified by a conditional probability of any other value of EN given pa that is smaller than that of the major effect. Accordingly, we define: 1) a major local value as the value en corresponding to the major effect; and 2) a minor local value as an en corresponding to any minor local effect.
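In symbols, and writing Pa(EN) for EN's set of direct latent parents (our notation), the major local value for a parent configuration pa can be expressed as

    en^{*}(\mathbf{pa}) \;=\; \arg\max_{en} P\bigl(EN = en \mid \mathbf{Pa}(EN) = \mathbf{pa}\bigr),

and every other value of EN is a minor local value for pa.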
We assume the most probable explanation [Pearl, 1988], which is that for every endogenous variable and every configuration of its parents, there exists a certain value that is major, and prove in Asbeh and Lerner [2016a] that this value is unique for every parent configuration.

Figure 1: MIMs with Structural Models of Different Complexities Challenging a Learning Algorithm Differently.

Second, assuming that for every endogenous variable, different parent values (or parent value configurations for a latent collider) yield different major values, together with using the Markov property and BN parameters, we can aggregate all local influences and generalize local effects on specific endogenous variables to an effect on all endogenous variables in the graph. Consequently, a major effect (MAE) is the largest effect of ex on EN and a minor effect (MIE) is any non-MAE effect of ex on EN. Also, a major value configuration (MAV) is the configuration en of EN corresponding to MAE (i.e., the most probable en due to ex), and a minor value configuration is a configuration en corresponding to any MIE. In a MAV, each variable in EN takes on the major local value, and in a MIE, at least one EN takes on a minor local value.
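One compact way to write the MAV induced by an exogenous configuration ex (our formalization of "the most probable en due to ex"; not the paper's notation) is

    \mathbf{en}^{*}(\mathbf{ex}) \;=\; \arg\max_{\mathbf{en}} P\bigl(\mathbf{EN} = \mathbf{en} \mid \mathbf{EX} = \mathbf{ex}\bigr),

with every other configuration of EN being a minor value configuration for ex.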
Third, we represent the influence on a subset of the endogenous variables of the subset of exogenous variables that impact these endogenous variables. This partial representation of MAE enables LPCC to recover the relationships between exogenous ancestors and only the descendants that are affected by them. We separately analyze the effect of each exogenous variable on each observed variable for which the exogenous is an ancestor and on all the latent variables along the path connecting them. We show in Asbeh and Lerner [2016a] the existence and uniqueness of the value a latent non-collider and its observed child get under the influence of an exogenous ancestor; one and only one value of the latent non-collider (observed child) changes with a change in the value of the exogenous. We also show that a latent collider and its observed child, both descendants of a set of exogenous variables, change their values between any two major configurations only if at least one of the exogenous variables has changed its value in the corresponding two configurations.

2.3 PCC BY DATA CLUSTERING

Practically, we use observational data that were generated from an unknown LVM and measured over the observed variables. We define an observed value configuration, observed major value configuration, and observed minor value configuration due to ex as the parts of en, MAV, and a minor value configuration, respectively, that correspond to the observed variables. We show in Asbeh and Lerner [2016a] that there is only a single observed major value configuration for each exogenous configuration ex of EX, and there are different observed major value configurations for different exogenous configurations. But, due to the probabilistic nature of BNs, each observed value configuration due to ex may be represented by several data points. Clustering these data points may produce several clusters for each ex, and each cluster corresponds to another observed value configuration. However, one and only one of the clusters corresponds to each of the observed major value configurations for a specific ex, whereas the other clusters correspond to observed minor value configurations.

We define the single cluster that corresponds to the observed major value configuration, and thus also represents the major effect due to configuration ex of EX, as the major cluster for ex, and all the clusters that correspond to the observed minor value configurations due to minor effects as minor clusters. However, we distinguish between different types of minor effects/clusters. A k-order minor effect is a minor effect in which exactly k endogenous variables in EN correspond to minor local effects. An en corresponding to a k-order minor effect is a k-order minor value configuration. In addition, minor clusters that correspond to k-order minor effects are k-order minor clusters. The set of all major clusters (corresponding to all observed major value configurations) reflects the effect of all possible exs, and thus the number of major clusters is expected to be equal to the number of EX configurations. Therefore, the identification of all major clusters is a key to the discovery of exogenous variables and their causal interrelations. For this purpose, we introduce the concept of pairwise cluster comparison (PCC). PCC measures the differences between two clusters, each of which represents the response of the LVM to another ex.

Pairwise cluster comparison is a procedure by which pairs of clusters are compared, e.g., through a comparison of their centroids. The result of PCC between a pair of centroids of dimension |O|, where O is the set of observed variables, is a binary vector of size |O| in which each element is 1 or 0 depending, respectively, on whether or not there is a difference between the corresponding elements in the compared centroids. When PCC is between clusters that represent observed major value configurations (i.e., PCC between major clusters), an element of 1 identifies an observed variable that has changed its value between the compared clusters due to a change in ex. Thus, the 1s in a major–major PCC provide evidence of causal relationships between EX and O. Practically, LPCC always identifies all observed variables that are represented by 1s together in all PCCs as the observed descendants of the same exogenous variable (Section 3.1). However, due to the probabilistic nature of the BN and the existence of endogenous latents (mediating the connections from EX to O), some of the clusters are k-order minor clusters (of different orders), representing k-order minor configurations/effects. Minor clusters are more difficult to identify than major clusters because the latter reflect the major effects of EX on EN and, therefore, are considerably more populated by data points than the former. Nevertheless, minor clusters are important in causal discovery by LPCC, even though a major–minor PCC cannot tell the effect of EX on EN because an observed variable in two compared (major and minor) clusters does not necessarily change its value as a result of a change in ex. Their importance stems from the fact that a major cluster cannot indicate (when compared with another major cluster) the existence of minor values. On the contrary, a PCC between major and minor clusters shows (through the number of 1s) the number of minor values represented in the minor cluster, and this is exploited by LPCC for identifying the endogenous latents and the interrelations among them (Section 3.4). That is, PCC is the source for identifying causal relationships in the unknown LVM; major–major PCCs are used for identifying the exogenous variables and their descendants, and major–minor PCCs are used for identifying the endogenous latents, their interrelations, and their observed children.
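As a minimal illustration of this centroid comparison (the function name pcc and the list-based centroid representation are our own, not taken from the LPCC code), the PCC of two centroids over O can be computed as:

    def pcc(centroid_a, centroid_b):
        """Binary PCC vector: 1 where the two centroids differ, 0 where they agree."""
        return [int(a != b) for a, b in zip(centroid_a, centroid_b)]

    # Example with two 6-dimensional binary centroids:
    print(pcc([0, 0, 0, 1, 1, 1], [1, 1, 1, 1, 1, 1]))   # -> [1, 1, 1, 0, 0, 0]

When both centroids belong to major clusters, the 1s mark exactly the observed variables whose values changed with ex, which is how the vectors are used in Section 3.1.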
3 OVERVIEW OF LPCC

LPCC is fed by samples of the observed variables. It clusters the data using the self-organizing map (SOM) algorithm (although any algorithm that does not require a prior setting of the number of clusters is appropriate) and selects an initial set of major clusters. Then LPCC learns an LVM in two stages. First, it identifies (Section 3.1) latent variables (and their observed descendants) without distinguishing exogenous from endogenous latents, before distinguishing latent colliders from exogenous variables (Section 3.2). LPCC iteratively improves the selection of major clusters (Section 3.3), and the entire stage is repeated until convergence. Second, LPCC identifies endogenous latent non-colliders among the previously identified latent variables and splits these two types of latent variables from each other before finding the links between them (Section 3.4).

3.1 IDENTIFICATION OF LATENT VARIABLES

We demonstrate the relations between PCC and learning an LVM by an example. G1 in Figure 1 has two exogenous variables, L1 and L2, each having three children, X1, X2, X3 and X4, X5, X6, respectively.³ For the example, assume all variables are binary, i.e., L1 and L2 have four possible exs (L1L2 = 00, 01, 10, 11). We used a uniform distribution over L1 and L2 and set the probabilities of an observed child, Xi, i = 1, ..., 6, given its latent parent, Lk, k = 1, 2 (only if Lk is a direct parent of Xi, e.g., L1 and X1), to be P(Xi = v | Lk = v) = 0.8, v = 0, 1. First, we generated a synthetic data set of 1,000 patterns from G1. Second, using SOM [Kohonen, 1997], we clustered the data and found 16 clusters, of which four were major (Section 3.3 gives details on how to identify major clusters). This meets our expectation of four major clusters corresponding to the four possible exs. These clusters are presented in Table 1a by their centroids, which are the most prevalent patterns in the clusters, and their corresponding PCCs are given in Table 1b. For example, PCC1,2, comparing clusters C1 and C2, shows that when moving from C1 to C2, only the values corresponding to variables X1–X3 have changed (i.e., δX1 = δX2 = δX3 = 1 in Table 1b). Also, PCC1,4, PCC2,3, and PCC3,4 show that X1–X3 always change their values together. This may be evidence that X1–X3 are descendants of the same exogenous latent, which, as we know from the true graph G1, is L1. Note, however, that PCC1,4 and PCC2,3 show that the values corresponding to X4–X6 have changed together too, whereas these values did not change in PCC1,2 and PCC3,4. Because X4–X6 changed their values only in PCC1,4 and PCC2,3 but not in PCC1,2 and PCC3,4, they cannot be descendants of L1. This strengthens the evidence that X1–X3 are the only descendants of L1. A similar analysis using PCC1,3 and PCC2,4 will identify that X4–X6 are descendants of another latent variable (L2, as we know).

³We determined three indicators per latent in all true models we learn (Figure 1) because this is required by BPC, making the experimental evaluation in Section 4 fair.

Table 1: (a) Centroids of Major Clusters for G1 and (b) PCCs between These Major Clusters

(a)
Centroid  X1 X2 X3 X4 X5 X6
C1         0  0  0  1  1  1
C2         1  1  1  1  1  1
C3         0  0  0  0  0  0
C4         1  1  1  0  0  0

(b)
PCC  δX1 δX2 δX3 δX4 δX5 δX6
1-2   1   1   1   0   0   0
1-3   0   0   0   1   1   1
1-4   1   1   1   1   1   1
2-3   1   1   1   1   1   1
2-4   0   0   0   1   1   1
3-4   1   1   1   0   0   0
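The G1 example can be reproduced in a few lines. The sketch below is our own; it sidesteps SOM clustering and simply treats the most frequent observed patterns as stand-ins for the major-cluster centroids, under the parameters stated above:

    import random
    from collections import Counter

    def sample_g1(n, p=0.8, seed=0):
        """Draw n samples of (X1,...,X6) from G1: two independent, uniformly
        distributed binary latents, L1 (parent of X1-X3) and L2 (parent of
        X4-X6), with P(Xi = v | parent = v) = p."""
        rng = random.Random(seed)
        def child(parent):
            return parent if rng.random() < p else 1 - parent
        data = []
        for _ in range(n):
            l1, l2 = rng.randint(0, 1), rng.randint(0, 1)
            data.append(tuple(child(l1) for _ in range(3)) + tuple(child(l2) for _ in range(3)))
        return data

    counts = Counter(sample_g1(1000))
    print(counts.most_common(4))
    # With p = 0.8 and 1,000 samples, the four most frequent patterns are
    # typically (0,0,0,1,1,1), (1,1,1,1,1,1), (0,0,0,0,0,0), and (1,1,1,0,0,0),
    # i.e., the major-cluster centroids of Table 1a; applying pcc() from the
    # sketch in Section 2.3 to pairs of them reproduces the vectors of Table 1b.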
Therefore, we define a maximal set of observed (MSO) variables as the set of variables that always changes its values together in each major–major PCC in which at least one of the variables changes value. For example, X1 (Table 1b) changes its value in PCC1,2, PCC1,4, PCC2,3, and PCC3,4 and always together with X2 and X3 (and vice versa). Thus, {X1, X2, X3} (and similarly {X4, X5, X6}) is an MSO. Each MSO includes descendants of the same latent variable L, and once LPCC identifies this MSO, it introduces L to the learned graph as a new latent variable, together with all the observed variables that are included in this MSO as L's children. We prove in Asbeh and Lerner [2016a] that every observed variable belongs to one and only one MSO, i.e., MSOs corresponding to the learned latents are disjoint, which means that LPCC learns a pure measurement model. We also prove that variables of a particular MSO are children of a particular exogenous latent variable EX or its latent non-collider descendant, or children of a particular latent collider C. This guarantees that each of multiple latent variables (either an exogenous or any of its non-collider descendants or a collider) is identified by its own MSO.
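Reading MSOs off a set of major–major PCC vectors amounts to grouping variables whose change patterns coincide. A minimal sketch (our own function name and representation, assuming every observed variable changes in at least one PCC):

    def msos(pccs):
        """Group observed-variable indices whose change-indicator columns agree
        across all major-major PCC vectors (each a binary list of length |O|)."""
        groups = {}
        for i in range(len(pccs[0])):
            column = tuple(p[i] for p in pccs)
            groups.setdefault(column, []).append(i)
        return list(groups.values())

    table_1b = [
        [1, 1, 1, 0, 0, 0],   # PCC 1-2
        [0, 0, 0, 1, 1, 1],   # PCC 1-3
        [1, 1, 1, 1, 1, 1],   # PCC 1-4
        [1, 1, 1, 1, 1, 1],   # PCC 2-3
        [0, 0, 0, 1, 1, 1],   # PCC 2-4
        [1, 1, 1, 0, 0, 0],   # PCC 3-4
    ]
    print(msos(table_1b))   # -> [[0, 1, 2], [3, 4, 5]], i.e., {X1,X2,X3} and {X4,X5,X6}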
LPCC cannot yet distinguish between exogenous latents and latent colliders because the main goal at this stage was to identify latent variables and their relations with the observed variables. In Section 3.2, we distinguish these two types of variables, whereas in Section 3.4, we use major–minor PCCs rather than major–major PCCs to distinguish latent non-colliders from exogenous latents.

3.2 DISTINGUISHING LATENT COLLIDERS

To demonstrate distinguishing latent colliders from exogenous variables, we use graph G2 in Figure 1. It shows two exogenous latent variables, L1 and L3, that collide in an endogenous latent variable, L2, each having three observed children: X1–X3 (L1), X4–X6 (L2), and X7–X9 (L3). We assume for the example that all variables are binary. Having two exogenous variables, we expect to find four major clusters in the data; each will correspond to one of the four possible exs (L1L3 = 00, 01, 10, 11). As for G1 (Section 3.1), we expect the values of X1–X3 to change together in all PCCs following a change in the value of L1, and the values of X7–X9 to change together in all the PCCs following a change in the value of L3. However, the values of X4–X6 will change together with those of X1–X3 in part of the PCCs and together with those of X7–X9 in the remaining PCCs, but always together with each other in all of the PCCs. This will be evidence that X4–X6 are descendants of the same latent collider (L2, as we know). That is, to learn that an already learned latent variable L is a collider for a set of other already learned (exogenous) latent ancestor variables LA⊂EX, LPCC requires that: (1) The values of the children of L will change with the values of descendants of different latent variables in LA in different parts of the major–major PCCs; and (2) The values of the children of L will not change in any PCC unless the values of descendants of at least one of the variables in LA change. This ensures that L does not change independently of latents in LA that are L's ancestors.
3.3 CHOOSING MAJOR CLUSTERS

Due to a lack of prior information regarding the distribution of latent variables, LPCC, first, assumes a uniform distribution over a latent and selects the major clusters based on their size, i.e., the number of patterns clustered. Clusters that are larger than the average cluster size are selected as majors. However, this initial selection may generate: 1) false negative errors (i.e., deciding a major cluster is minor), when a latent variable L has a skewed distribution over its values, and then a rare value can be represented only by small clusters that could not be chosen as majors; and 2) false positive errors (i.e., deciding a minor cluster is major), when, due to similar probabilities of different values of an observed child given its parent L and a large sample size, a cluster that is supposed to be minor becomes too large and is selected as major.

To avoid such errors, LPCC learns iteratively. Following the selection of major clusters based on their sizes and learning a graph, it becomes possible to find the cardinalities of the latent variables and all possible exs. For each ex, we can use the most probable cluster given the data as the major cluster and, using an EM-style procedure [Dempster et al., 1977], update the set of major clusters iteratively and probabilistically to augment LPCC to learn more accurate graphs. The final graph may not be optimal since it depends on the initial graph, but it is an improved version of the initial graph.
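The initial, size-based selection is straightforward; a toy sketch (the helper name and the cluster sizes below are ours and are not taken from the paper's tables):

    def select_initial_majors(cluster_sizes):
        """Keep every cluster whose size exceeds the average cluster size."""
        average = sum(cluster_sizes.values()) / len(cluster_sizes)
        return [cid for cid, size in cluster_sizes.items() if size > average]

    print(select_initial_majors({"c1": 120, "c2": 110, "c3": 30, "c4": 25, "c5": 15}))
    # -> ['c1', 'c2']  (the average cluster size here is 60)

The EM-style refinement described above then replaces this heuristic by taking, for each ex, the most probable cluster given the data.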
3.4 IDENTIFICATION OF LATENT NON-COLLIDER VARIABLES

So far, the latent non-colliders that are descendants of an exogenous variable EX were temporarily combined with it, and all their observed children were temporarily combined with the direct children of EX. To exemplify that, check G3 in Figure 1, showing a serial connection of L1, L2, and L3. Assume each latent has three observed children, and all variables are binary. L1 is the only EX, with two possible exs (L1 = 0, 1), and L2 and L3 are NCs; L2 is a child of L1 and a parent of L3. We set the probabilities of: 1) L1 uniformly; 2) an observed child Xi, i = 1, ..., 9, given its latent parent Lk, k = 1, 2, 3 (if this is a direct parent), as P(Xi = v | Lk = v) = 0.8, v = 0, 1; and 3) an endogenous latent Lj, j = 2, 3, given its latent parent Lk, k = 1, 2 (if this is a direct parent), as P(Lj = v | Lk = v) = 0.8, v = 0, 1. We generated 1,000 patterns from G3 over the nine observed variables. Table 2 presents ten of the seventeen largest clusters by their centroids and sizes, from which C1 and C2 were selected as major. This meets our expectation of two major clusters corresponding to the two possible exs of L1. However, because all the elements in PCC1,2 are 1s (compare C1 and C2 in Table 2), the nine observed variables establish a single MSO and thus are considered descendants of the same exogenous variable. That is, the model learned in the first stage of LPCC has only one exogenous latent variable (i.e., L1) with direct children that are the nine observed descendants, contrary to G3. Since L2 and L3, which are latent non-colliders that are descendants of L1, were combined with L1, LPCC should now split them from L1 along with their observed children.

Table 2: Ten of the Seventeen Largest Clusters for G3

Cluster  X1 X2 X3 X4 X5 X6 X7 X8 X9  Size
C1        1  1  1  1  1  1  1  1  1   49
C2        0  0  0  0  0  0  0  0  0   47
C3        1  1  1  1  1  1  1  1  0   28
C4        0  0  0  0  0  0  0  1  0   24
C5        0  1  0  0  0  0  0  0  0   22
C6        1  1  1  1  1  1  0  0  0   22
C7        0  0  1  0  0  0  0  0  0   21
C8        0  0  0  1  1  1  1  1  1   19
C9        0  0  0  0  0  0  1  1  1   18
C10       1  1  1  0  0  0  0  0  0   16

Such a split is based on major–minor (rather than major–major) PCCs. First, we define a first-order minor cluster (1-MC) as a cluster that corresponds to a 1-order minor value configuration, which exists when exactly one endogenous variable in EN (either latent or observed) has a minor local value as a response to a value that EX has obtained.⁴ To reveal the existence of latent non-colliders that were previously combined with EX and to split them from EX, we analyze, for each EX, PCCs between 1-MCs and the major clusters that identified EX.

⁴Based on the identification of a 1-MC [Asbeh and Lerner, 2016a], we find, e.g., in learning G3 that C2 is the minimal major cluster, and all other clusters (Table 2) are 1-MCs.
Thus, second, we define PCCs between 1-MCs and major clusters that show two sets of two or more observed variables, each set having the same value, different than that of the other set, as 2S-PCCs (i.e., PCCs of "two sets") and the corresponding 1-MCs as 2S-MCs. To identify a latent non-collider that was combined with an exogenous latent EX, we consider only 2S-PCCs; these PCCs are the result of comparing all the 2S-MCs among the 1-MCs for EX with the major clusters that revealed EX. For example, Table 3 presents all 2S-PCCs for G3. While a PCC that is not a 2S-PCC identifies a minor value of an observed descendant of EX (e.g., PCC1,3, where C1 is major and C3 is a 1-MC, identifies a minor value of X9; see Table 4), a 2S-PCC identifies a minor value in a latent non-collider descendant of EX and thereby the existence of this latent. The latter is seen, e.g., in Table 3, in PCC1,6, showing the change between C1 (major cluster) and C6 (1-MC) in the values of X7–X9, and in PCC2,6, showing the change between C2 (major cluster) and C6 in the values of X1–X6, thus identifying the existence of a latent non-collider (L3, as we know). This identification yields a split of exogenous L1 into two latents: one (L3) is a parent of X7–X9, and the other is a parent of X1–X6 (which later will split again into L1 and L2, each with its own three children).

Table 3: All 2S-PCCs for G3

PCC   δX1 δX2 δX3 δX4 δX5 δX6 δX7 δX8 δX9
1-6    0   0   0   0   0   0   1   1   1
2-6    1   1   1   1   1   1   0   0   0
1-8    1   1   1   0   0   0   0   0   0
2-8    0   0   0   1   1   1   1   1   1
1-9    1   1   1   1   1   1   0   0   0
2-9    0   0   0   0   0   0   1   1   1
1-10   0   0   0   1   1   1   1   1   1
2-10   1   1   1   0   0   0   0   0   0

Table 4: PCCs for C3 with C1 and C2 in Learning G3

PCC   δX1 δX2 δX3 δX4 δX5 δX6 δX7 δX8 δX9
1-3    0   0   0   0   0   0   0   0   1
2-3    1   1   1   1   1   1   1   1   0
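Under our reading of this definition, a PCC vector is a 2S-PCC exactly when both its changed set and its unchanged set contain at least two variables; a minimal sketch:

    def is_2s_pcc(pcc_vec):
        """True if the comparison splits the observed variables into a changed set
        and an unchanged set of two or more variables each."""
        ones = sum(pcc_vec)
        return ones >= 2 and len(pcc_vec) - ones >= 2

    print(is_2s_pcc([0, 0, 0, 0, 0, 0, 0, 0, 1]))   # PCC 1-3 in Table 4: False
    print(is_2s_pcc([1, 1, 1, 1, 1, 1, 1, 1, 0]))   # PCC 2-3 in Table 4: False
    print(is_2s_pcc([0, 0, 0, 0, 0, 0, 1, 1, 1]))   # PCC 1-6 in Table 3: True

This matches Tables 3 and 4: the PCCs of C3 with the major clusters change, or leave unchanged, only a single variable and therefore do not qualify, so C3 is not a 2S-MC.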
However, relying only on part of the 2S-PCCs may be inadequate to conclude on all possible splits (e.g., PCC1,8 and PCC2,8 show that X1–X3 and X4–X9 are children of different latents, but do not suggest the split of X7–X9 as PCC1,6 and PCC2,6 do). Thus, it is necessary to introduce, for 2S-PCCs, a maximal set of observed variables (2S-MSO) that always change their values together in all 2S-PCCs. For example, X1 in Table 3 changes its value in PCC2,6, PCC1,8, PCC1,9, and PCC2,10 and always together with X2 and X3 (and the other way around). Thus, {X1, X2, X3} and similarly {X4, X5, X6} and {X7, X8, X9} are 2S-MSOs. Each 2S-MSO includes children of the same latent non-collider, which is a descendant of EX, or of EX itself. LPCC detects 2S-MSOs for each EX and thereby identifies its possible splits.

Finally, we show in Asbeh and Lerner [2016a] how to identify links between split latents. For a serial connection, LPCC finds the source (EX) and the latent sink on the path between them, but not which is which, and thus it can only learn this connection as undirected. For a diverging connection, LPCC learns all directed links among the latents. That is, LPCC learns a pattern over the structural model of G, which represents a Markov equivalence class of models among the latents, where in the special case in which G has no serial connection, LPCC learns the true graph.

Based on the concepts outlined in Section 3, we formally introduce the LPCC algorithm in Asbeh and Lerner [2016b] and thoroughly evaluate it experimentally using synthetic and real-world data compared with other algorithms. The complete results are given in Asbeh and Lerner [2016b], and a snapshot of them is provided in Section 4.

4 EXPERIMENTAL EVALUATION

Simulated data: We used Tetrad IV to construct graphs G2 and G4 of Figure 1 for three parametrization levels that differ by the conditional probabilities pj = 0.7, 0.75, and 0.8 between a latent and each of its children. Each such level imposes a different complexity on the model and thereby affects the task of learning the latent model and the causal relations (i.e., pj = 0.7 poses a larger challenge to learning than pj = 0.75) [full details about the types of variables and parametrization schemes are omitted; see Asbeh and Lerner, 2016b]. We drew data sets of between 250 and 2,000 samples and report on the structural Hamming distance (SHD) [Tsamardinos et al., 2006] as a performance measure for learning the LVM structure. SHD is a structural measure that accounts for all the possible learning errors (the lower the value, the better): addition and deletion of an undirected edge, and addition, removal, and reversal of edge orientation.
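For reference, a compact and simplified rendering of this distance over partially directed graphs, in our own edge encoding (ordered pairs for directed edges, frozensets for undirected ones), is sketched below; it is not the exact implementation of Tsamardinos et al. [2006]:

    def shd(true_edges, learned_edges):
        """Simplified structural Hamming distance: one unit per missing or extra
        adjacency plus one unit per shared but differently oriented edge."""
        def skeleton(edges):
            return {frozenset(e) for e in edges}
        def orientation(edges):
            return {frozenset(e): e for e in edges}
        t_skel, l_skel = skeleton(true_edges), skeleton(learned_edges)
        distance = len(t_skel ^ l_skel)          # added or deleted adjacencies
        t_or, l_or = orientation(true_edges), orientation(learned_edges)
        for pair in t_skel & l_skel:
            if t_or[pair] != l_or[pair]:         # same adjacency, wrong orientation
                distance += 1
        return distance

    print(shd({("L1", "X1"), ("L1", "X2")},
              {("L1", "X1"), ("X2", "L1"), frozenset({"L1", "X3"})}))
    # -> 2: the L1-X2 edge is reversed, and L1-X3 is an extra (undirected) edge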
(a) SHD of LPCC/BPC/EFA vs. parametrization level. For G2/1,000 samples, LPCC/BPC learn perfectly.
(b) SHD of LPCC/BPC/EFA vs. number of indicators per latent in G2, pj = 0.75, and four sample sizes.
Figure 2: Structural Correctness of Learning Algorithms.

Figure 2a shows SHD values for the LPCC, BPC, and EFA algorithms for increasing parametrization levels for four combinations of learned graphs and sample sizes. It shows that LPCC and BPC improve performance, as expected, with increased levels of latent-observed variable correlation (pj). LPCC never falls behind BPC, and its advantage over BPC is especially vivid for a small sample size. EFA, besides falling behind LPCC and BPC, also demonstrates worsening performance with an increasing parametrization level, especially for large sample sizes. Larger parametrization levels increase the chances of EFA learning links between latent variables and observed variables – some of which are not between a latent and its real child – to compensate for the algorithm's inability to identify links among latents (EFA assumes latents are uncorrelated). EFA is inferior to LPCC for all parametrization levels, sample sizes, and graphs.

Figure 2b shows SHD values of the LPCC, BPC, and EFA algorithms for increasing numbers of binary indicators per latent variable in G2, a parametrization level of 0.75, and four sample sizes. The figure exhibits the superiority of LPCC over BPC and EFA for all scenarios. While LPCC hardly worsens its performance with the increase of complexity, both BPC and EFA are affected by this increase. They also have difficulty in learning an LVM for which latent variables have exactly two indicators.

Real-world data – the political action survey (PAS): We evaluated LPCC using a simplified PAS data set over six variables [Joreskog, 2004]: NOSAY, VOTING, COMPLEX, NOCARE, TOUCH, and INTEREST. These variables represent political efficacy and correspond to questions to which the respondents have to give their degree of agreement on a discrete ordinal scale of four values. This data set contains the responses of a sample of 1,076 US respondents. A model consisting of two latents that correspond to the previously established theoretical traits of Efficacy and Responsiveness (Figure 3a) discards VOTING [Joreskog, 2004] based on the argument that the question for VOTING is not clearly phrased. Similar to the theoretical model, LPCC finds two latents (Figure 3b): one corresponds to NOSAY and VOTING, and the other corresponds to NOCARE, TOUCH, and INTEREST. Compared with the theoretical model, LPCC misses the edge between Efficacy and NOCARE and the bidirectional edge between the latents. Both edges are not supposed to be discovered by LPCC or BSPC/BPC; the former because these algorithms learn a pure measurement model in which each observed variable has only one latent parent, and the latter because no cycles are assumed. Nevertheless, in contrast to the theoretical model, LPCC makes no use of prior knowledge. The BSPC output (Figure 3c) is very similar to the LPCC output, except for NOCARE, which was not identified by BSPC. Both algorithms identify VOTING as a child of Efficacy (at the expense of COMPLEX). The outputs of the BPC algorithm (Figures 3d,e) are poorer than those of LPCC and BSPC and are sensitive to the significance level. The output of the FCI algorithm (not shown here) using any significance level is not sufficient (showing, e.g., that NOSAY and INTEREST potentially have a latent common cause, whereas these two variables are indicators of different latents in the theoretical model).

Figure 3: PAS: outputs of (a) a true model, (b) LPCC, (c) BSPC, and BPC with (d) α = 0.01, 0.05, and (e) α = 0.1.

5 SUMMARY AND CONCLUSIONS

In this work, we introduced the PCC concept and the LPCC algorithm for learning discrete LVMs: 1) the PCC concept to learn an LVM ties graphical models with data clustering; 2) LPCC learns MIMs; 3) LPCC is not limited to latent-tree models and does not assume linearity; 4) LPCC assumes that the measurement model of the true graph is pure, but, if the true graph is not, it learns a pure sub-model of the true model, if one exists. LPCC also assumes that a latent collider does not have any latent descendants; 5) LPCC is a two-stage algorithm that exploits PCC. First, it learns the exogenous latents and latent colliders, as well as their observed descendants, and second, it learns the endogenous latent non-colliders and their children by splitting these latents from their previously learned latent ancestors; and 6) LPCC learns an equivalence class of the structural model of the true graph.

Using simulated and real-world data, we show in Asbeh and Lerner [2016b] that LPCC improves accuracy with sample size, that it can learn large LVMs, and that it has consistently good results compared to models that are expert-based or learned by state-of-the-art algorithms. Applied to two original domains, LPCC helped identify possible causes of young drivers' involvement in road accidents and cell subpopulations in the immune system.

References

N. Asbeh and B. Lerner. Learning latent variable models by pairwise cluster comparison: Part I – Theory and overview. Journal of Machine Learning Research, 17(224):1–52, 2016a.

N. Asbeh and B. Lerner. Learning latent variable models by pairwise cluster comparison: Part II – Algorithm and evaluation. Journal of Machine Learning Research, 17(233):1–45, 2016b.

D. J. Bartholomew, F. Steele, I. Moustaki, and J. I. Galbraith. The Analysis and Interpretation of Multivariate Data for Social Scientists (Texts in Statistical Science Series). Chapman & Hall/CRC Press, Boca Raton, Florida, USA, 2002.

A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B 39:1–39, 1977.

G. Elidan, N. Lotner, N. Friedman, and D. Koller. Discovering hidden variables: A structure-based approach. In Advances in Neural Information Processing Systems, pages 13:479–485, 2000.

S. Harmeling and C. K. I. Williams. Greedy learning of binary latent trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1087–1097, 2011.

K. Joreskog. Structural equation modeling with ordinal variables using LISREL. Technical report, Scientific Software International Inc, 2004.

R. Klee. Introduction to the Philosophy of Science: Cutting Nature at its Seams. Oxford University Press, New York, New York, 1997.

T. Kohonen. Self-Organizing Maps. Springer-Verlag, New York, New York, 1997.

J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann Press, San Mateo, California, 1988.

J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, New York, 2000.

J. Pearl and T. Verma. A theory of inferred causation. In Proceedings of the 2nd International Conference on Principles of Knowledge Representation and Reasoning, pages 441–452, Cambridge, MA, 1991.

R. Scheines, P. Spirtes, C. Glymour, C. Meek, and T. Richardson. The TETRAD project: Constraint based aids to causal model specification. Technical report, Department of Philosophy, Carnegie-Mellon University, Pittsburgh, Pennsylvania, 1995.

S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvarinen, Y. Kawahara, T. Washio, P. Hoyer, and K. Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12:1225–1248, 2011.

R. Silva, R. Scheines, C. Glymour, and P. Spirtes. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7:191–246, 2006.

P. Spirtes. Calculation of entailed rank constraints in partially non-linear and cyclic models. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, pages 606–615, Bellevue, Washington, 2013.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. MIT Press, New York, New York, 2nd edition, 2000.

I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65:31–78, 2006.

Y. Wang, N. L. Zhang, and T. Chen. Latent-tree models and approximate inference in Bayesian networks. Journal of Artificial Intelligence Research, 32:879–900, 2008.

N. Zhang. Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, 5:697–723, 2004.