Exploratory network analysis of large social science questionnaires


Robert J. B. Goudie
Department of Statistics, University of Warwick, Coventry, UK

Sach Mukherjee
Department of Statistics & Centre for Complexity Science, University of Warwick, Coventry, UK

Frances Griffiths
Health Sciences Research Institute, University of Warwick, Coventry, UK


Abstract

There are now many large surveys of individuals that include questions covering a wide range of behaviours. We investigate longitudinal data from the Add Health survey of adolescents in the US. We describe how structural inference for (dynamic) Bayesian networks can be used to explore relationships between variables in such data and present this information in an interpretable format for subject-matter practitioners. Surveys such as this often have a large sample-size, which, whilst increasing the precision of inference, may mean that the posterior distribution over Bayesian networks (or graphs) is concentrated on disparate graphs. In such situations, the standard MC³ sampler converges very slowly to the posterior distribution. Instead, we use a Gibbs sampler (1), which moves more freely through graph space. We present and discuss the resulting Bayesian network, focusing on depression, and provide estimates of how different variables affect the probability of depression via the overall probabilistic structure given by the Bayesian network.

1    INTRODUCTION

Hypotheses of multifactorial causes of symptoms and outcomes play an important role in the social sciences and in public health. Regression-based approaches are widely-used in these fields to explore such hypotheses. A great deal of insight can be gained through such approaches, but it is sometimes overly constraining to fix a particular quantity as the dependent variable, especially if the goal is to explore the possibility of unexpected relationships between the data. Instead, we can consider a number of variables on an equal footing, and study the possibility of unexpected relationships in the data.

Graphical models provide a statistical framework within which the relationship between variables can be studied. These models enable complex multivariate distributions to be decomposed into simpler local distributions. This can reveal a great deal about the relationships between the variables, as well as provide a statistical and computationally tractable description of their (often large) joint distribution. The decomposition is formed by the conditional independence structure, which can be represented by a graph. The use of graphs helps to make the interpretation of the model simpler. In this paper, we focus on the structure of the model, as given by the graph. We aim to make inference about this using statistical model selection. The structure of the model suggests how the different components of the system interact, which may be helpful in understanding the system as a whole. These methods have been widely adopted in molecular biology (2, 3), and have been used in some areas of medical sciences (4).

Consideration of unexpected relationships between factors requires datasets that incorporate a wide range of topics. Such data is now widely available for representative samples of populations in many countries, and for many sub-groups of interest. Many of these datasets are derived from surveys that are general in scope, and are not collected to study any one particular question. For example, in the US, the health of the whole population is representatively sampled annually for the Behavioral Risk Factor Surveillance System (BRFSS) survey, and the Add Health study, which we use here, followed a cohort of young people from 1994 until 2008. Data from both of these have been used in scores of studies, but these commonly focus on one specific aspect, often using the data to evaluate existing hypotheses. Given the wide scope inherent in the design of these studies and the large samples available in many cases, it is possible to broaden the scope of the analysis by considering richer
structures. In this paper, we discuss the potential that such a more explorative approach yields. We do not seek to make conclusive causal claims, but instead suggest that a broader approach may uncover important aspects that have been neglected.

Our focus will be on depression among adolescents in the US, drawing on data from the National Longitudinal Study of Adolescent Health (Add Health). It is estimated that around 1–6% of adolescents each year are affected by depression (5, 6). The effects of depression in this age-group are wide-ranging (7), and include the stigma associated with poor mental health more generally (8). There is considerable evidence that there are a wide range of causal factors for depression amongst adolescents, spanning biological, psychological and social domains. Understanding these causal factors and separating them from the consequences of depression has been recognised as an important aim (9). Some of the relevant causal factors may interact and the approach taken here accounts for this.

The remainder of this paper is organised as follows. We first introduce the Add Health dataset and describe the Bayesian network framework. Inference for Bayesian networks is performed using Markov Chain Monte Carlo (MCMC), but the large sample size of the dataset we consider makes achieving convergence difficult because the posterior distribution may be concentrated on disparate graphs, and so we describe an alternative sampler that has superior properties in this situation. Whilst the PC-algorithm (10, 11) has properties that often make it attractive in such contexts, we found that the results in this situation were not robust (see Discussion). We then present and discuss the results for the Add Health dataset.

2     MATERIALS AND METHODS

2.1   Add Health

The data that we use are drawn from the National Longitudinal Study of Adolescent Health (Add Health) that explores health-related behavior of adolescents (12) in the US. The questionnaire contains over 2000 questions that cover many aspects of adolescent behaviours and attitudes. We consider the representative sample of adolescents from Waves I and II of the in-home section, and the parental questionnaire from Wave I of the study. The analysis we perform is not feasible when the data is not complete (see Discussion), and so individuals with missing data were removed from the study. Removing incomplete samples leaves 5975 individuals in the study.

Our measure of depression is a self-assessed scale based upon the Centre for Epidemiologic Studies Depression Scale (CES-D) (13). Two questions from the 20-item scale are omitted from Add Health, and two are modified, and so we scale the score given by the available questions (14). A Receiver Operating Characteristic (ROC) analysis showed that thresholds of 24 for females and 22 for males provided the best agreement with clinical assessments of depression (15). We use these thresholds to create a binary indicator of depression status.

Many of the remainder of the variables that we consider (Table 1) are drawn from the risk factors described in the depression literature, and the mental health literature more generally. A recent review (8) described a wide range of factors that are associated with poor mental health in young people, including gender, poverty, violence and the absence of social networks in the local neighbourhood. The quality of relationships with parents is also thought to be important, especially with the mother (16), as are parental alcohol problems (17) and parental discord (16). The individual’s use of alcohol, drugs, smoking and HIV/AIDS are all also associated with depression (18, 19). Physical exercise has been proposed in some studies as a useful intervention for the management of depression, but many of these studies have been deemed to be poor quality (20).

2.2   Bayesian Networks

Our study uses Bayesian networks to explore the relationships between variables in the Add Health study. Bayesian networks are a particular type of graphical model that enable classes of probability distributions to be specified using a directed acyclic graph (DAG). A Bayesian network $G$ is represented using a DAG with vertices $V = (V_1, \dots, V_p)$, and directed edges $E \subset V \times V$. The vertices correspond to the components of a random vector $X = [X_1, \dots, X_p]^T$, subsets of which will be denoted by $X_A$ for sets $A \subseteq \{1, \dots, p\}$. For $1 \le i, j \le p$, we define the parents $G_j$ of each node $V_j$ to be the subset of vertices $V$ such that $V_i \in G_j \Leftrightarrow (V_i, V_j) \in E$. Specifying the parents of the vertices determines the edges $E$ of the graph $G$. We denote by $\mathcal{G}$ the space of all possible directed acyclic graphs with $p$ vertices. We will use $X_{G_i}$ to refer to the random variables that are parents of $X_i$ in the graph $G$.

The graph specifies that the joint distribution for $X$, with parameters $\theta = (\theta_1, \dots, \theta_p)$, can be written as a product of conditional distributions $p(X_i \mid X_{G_i}, \theta_i)$, given the variables $X_{G_i}$ corresponding to the parents of $X_i$ in the graph.

$$
p(X \mid G, \theta) = \prod_{i=1}^{p} p(X_i \mid X_{G_i}, \theta_i)
$$
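As a minimal sketch of the factorisation above, the following Python fragment evaluates the joint probability of a small network as a product of local conditionals and checks that the result sums to one over all configurations. The three-variable network A -> C <- B, the probability values and the function name `joint_probability` are illustrative assumptions and do not come from the Add Health analysis.

```python
from itertools import product

# Hypothetical network A -> C <- B with binary variables.
# parents[v] lists the parent set G_v of vertex v.
parents = {"A": (), "B": (), "C": ("A", "B")}

# cpt[v][parent_values] = P(v = 1 | parents = parent_values).
# All numbers are made up for illustration.
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9},
}


def joint_probability(x):
    """Return p(x | G, theta) as the product of the local conditionals."""
    prob = 1.0
    for v, pa in parents.items():
        p_one = cpt[v][tuple(x[u] for u in pa)]
        prob *= p_one if x[v] == 1 else 1.0 - p_one
    return prob


# Summing the factorised joint over every configuration should give 1.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
example = joint_probability({"A": 1, "B": 0, "C": 1})
print(example, total)
```

Each vertex needs a table indexed only by the configurations of its own parents, which is what makes the local distributions simpler than the full joint distribution.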
We will need to be able to evaluate the marginal likelihood $p(X \mid G)$ easily, and so we consider only a conjugate analysis in which the conditional distributions $p(X_i \mid X_{G_i}, \theta_i)$ are multinomial, with Dirichlet priors $p(\theta_i)$ for each $\theta_i$. In this case, the marginal likelihood can be evaluated analytically. Suppose each $X_i$ takes one of $r_i$ values, and define $q_i$ as the number of levels of the sample space of $X_{G_i}$, each element of which we call a configuration. For each configuration $j$ of $X_{G_i}$, let $N_{ijk}$ be the number of observations in which $X_i$ takes value $k$ while $X_{G_i}$ is in configuration $j$. We assume the Dirichlet priors for each $\theta_i$, each with hyperparameters $N'_{ijk}$, are independent. We define $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$ and $N'_{ij} = \sum_{k=1}^{r_i} N'_{ijk}$, and the local score $p(X_i \mid X_{G_i})$ to be

$$
p(X_i \mid X_{G_i}) = \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N_{ij} + N'_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N_{ijk} + N'_{ijk})}{\Gamma(N'_{ijk})}.
$$

The marginal likelihood can be shown to equal the product $p(X \mid G) = \prod_{i=1}^{p} p(X_i \mid X_{G_i})$ of these local scores (21).

2.3   Structural inference for Bayesian Networks

We aim to make inference about the DAG $G$, given data $X$ and so our interest focuses on the posterior distribution $\Pr(G \mid X)$ on Bayesian networks. Under the assumptions we have made, this can be written in terms of the marginal likelihood $p(X \mid G)$, and a prior $\pi(G)$ for the Bayesian network structure.

$$
\Pr(G \mid X) \propto \pi(G) \prod_{i=1}^{p} p(X_i \mid X_{G_i})
$$

The priors $\pi(G)$ can be chosen to encode domain information (3). For the analyses in this paper, we choose an improper prior $\pi(G) \propto 1$ that is flat across the space of graphs.

The posterior distribution $\Pr(G \mid X)$ is difficult to evaluate, because the cardinality of $\mathcal{G}$ grows super-exponentially in $p$. This motivates the use of approximations to $\Pr(G \mid X)$, which are usually based on Markov chain Monte Carlo (MCMC).

2.4   Approximate inference for Bayesian Networks

The standard form of MCMC that is used for structural inference for Bayesian networks is MC³ (22). This is a Metropolis-Hastings sampler that explores $\mathcal{G}$ by proposing to add or remove a single edge from the current graph $G$. This sampler works surprisingly well in many situations, but if the posterior distribution is not unimodal, the local moves may fail to explore the space fully because the sampler may become ‘trapped’ in one mode. This issue becomes more severe as the sample size increases because the posterior distribution becomes more concentrated. A natural approach in such situations is to use the PC-algorithm (10, 11), which has been shown to be asymptotically consistent (23), but we found in this case that the results were not robust (see Discussion).

Our analyses in this paper were performed using a Gibbs sampler (1), which we found to converge rapidly to its equilibrium state. A naïve Gibbs sampler for structural inference that proposes single-edge additions and removals can easily be constructed, but this sampler offers no advantages over the analogous MC³. This naïve scheme, however, can be improved by ‘blocking’ together a number of components, and sampling from their joint conditional distribution. In theory, any group of components can be taken as a block, but sampling from their joint conditional distribution needs to be possible and, ideally, computationally quick.

For Bayesian networks, the most natural blocks are those consisting of parent sets $G_1, \dots, G_p$. This is natural because the marginal likelihood $p(X \mid G)$ for a graph $G$ factorises across vertices into conditionals $p(X_j \mid X_{G_j})$ and these conditionals depend on the parent set of the vertex. Therefore, since any graph $G \in \mathcal{G}$ can be specified by a vector $G = (G_1, \dots, G_p)$ of parent sets, the posterior distribution on Bayesian networks $G \in \mathcal{G}$ can be written as a function of $G_1, \dots, G_p$ in the following way.

$$
\Pr(G_1, \dots, G_p \mid X) \propto \pi(G_1, \dots, G_p) \prod_{i=1}^{p} p(X_i \mid X_{G_i})
$$

In the following, we will denote subsets of the vector $G = (G_1, \dots, G_p)$ by $G_A = \{G_k : k \in A\}$, and the subset given by the complement $A^C = \{1, \dots, p\} \setminus A$ of a set $A$ will be denoted by $G_{-A} = \{G_k : k \in A^C\}$. In particular, the complete graph can be specified by $G = (G_1, \dots, G_p) = (G_i, G_{-i})$ for any $i \in \{1, \dots, p\}$.

To be able to construct a Gibbs sampler using parent sets, we need to find their conditional distribution, given the other parent sets $G_{-j} = \{G_1, \dots, G_{j-1}, G_{j+1}, \dots, G_p\}$. Parent sets $G_j$ for which $G = (G_j, G_{-j})$ is cyclic will have no probability mass in the conditional distribution. Let $K_j^{\star}$ be the set of parent sets $G_j$ such that $G = (G_j, G_{-j})$ is acyclic. The conditional posterior distribution of $G_j$ is multinomial, with weights given by the posterior distribution of $G = (G_j, G_{-j})$. When the cardinality of $K_j^{\star}$ is constrained (for example, by restricting the maximum number of parents of each node) the conditional posterior distribution for $G_j \in K_j^{\star}$ can be evaluated
[Figure 1: two scatter plots of posterior edge probabilities, with Run 1 on the horizontal axis and Run 2 on the vertical axis, both from 0.0 to 1.0. Left panel: MC³ sampler. Right panel: Gibbs sampler.]

Figure 1: Diagnostic runs for MC³ (left) and the Gibbs sampler (right). The posterior edge probabilities given by two independent runs are plotted against each other. When the two runs give the same estimates of the posterior edge probabilities, all of the points appear on the line y = x. We observe that the two Gibbs runs give similar posterior edge probabilities, but the MC³ runs do not. (5 runs of 750,000 samples (MC³) or 100,000 samples (Gibbs) of each sampler were performed; the first half of the samples were discarded as burn-in; mean Pearson correlation between runs was 0.9999 ± 0.0002 (standard deviation) for Gibbs and 0.6322 ± 0.0477 for MC³.)

[Figure 3: dot plot with "Prob. of depression, time point 2" on the horizontal axis (0.10 to 0.35) and one row per variable: Depressed (time point 1), Didn't present to doctor (2), Female, Didn't present to doctor (1), Good health (2), Good health (1), Victim of violence (2), Strong academically (2), Drug user (1).]

Figure 3: Conditional probability of depression. The conditional probability of being depressed at Wave II given that the variable indicated is changed to the level indicated by the colours, conditional on the DAG shown in Figure 2. For binary variables, one marker colour denotes true and another denotes false; shades of grey indicate intermediate levels. Wave number (time point) is indicated in parentheses. Only variables for which the conditional probability differed between levels by at least 0.005 are displayed.
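The run-to-run comparison described in the Figure 1 caption can be sketched as follows: estimate each edge's posterior probability as the fraction of post-burn-in samples that contain it, then compute the Pearson correlation of these estimates between two runs. This is an illustrative sketch only. The representation of a sampled graph as a set of (parent, child) pairs, the function names, and the synthetic "runs" (independent random edge draws, not sampler output and not constrained to be acyclic) are all assumptions made for the example.

```python
import random
from math import sqrt


def edge_probabilities(samples, burn_in_fraction=0.5):
    """Estimate posterior edge probabilities from a list of sampled graphs.

    Each sample is a set of (parent, child) edges; the first
    burn_in_fraction of the samples is discarded as burn-in.
    """
    kept = samples[int(len(samples) * burn_in_fraction):]
    counts = {}
    for graph in kept:
        for edge in graph:
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: n / len(kept) for edge, n in counts.items()}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def run_agreement(run_a, run_b, edges):
    """Correlation between the edge-probability estimates of two runs."""
    pa, pb = edge_probabilities(run_a), edge_probabilities(run_b)
    return pearson([pa.get(e, 0.0) for e in edges],
                   [pb.get(e, 0.0) for e in edges])


# Synthetic stand-in for sampler output (hypothetical, for illustration).
random.seed(0)
nodes = range(5)
edges = [(i, j) for i in nodes for j in nodes if i != j]
true_prob = {e: random.random() for e in edges}


def fake_run(n_samples):
    return [{e for e in edges if random.random() < true_prob[e]}
            for _ in range(n_samples)]


run_1, run_2 = fake_run(2000), fake_run(2000)
agreement = run_agreement(run_1, run_2, edges)
print(round(agreement, 3))
```

A correlation close to 1 corresponds to points lying near the line y = x in a plot such as Figure 1, whereas runs that settle on different graphs give a visibly lower value.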
exactly.

$$
\Pr(G_j \mid G_{-j}, X) = \frac{\Pr(G_j, G_{-j} \mid X)}{\Pr(G_{-j} \mid X)} = \frac{\Pr(G_j, G_{-j} \mid X)}{\sum_{G_j \in K_j^{\star}} \Pr(G_j, G_{-j} \mid X)} \qquad (1)
$$

3    RESULTS

The variables that we consider are detailed in Table 1. As is common when using graphical models (24), all of these variables were grouped, initially into ‘Background’, ‘Wave I’ and ‘Wave II’, and then refined into whether the question asked about the long- or short-term, as shown in Table 2. These groups define constraints on the Bayesian networks that are considered. Specifically, no edges can be directed backwards through the groups. Edges, however, are allowed within groups. For example, no edge is allowed
We can improve the speed of convergence of this sam-                                                                                           to be directed into ‘Gender’, and no edge can pass
pler by allowing pairs of parent sets to be sampled                                                                                            backwards in time, for example, from Depression at
together. At each step of the Gibbs sampler we                                                                                                 Wave II to Depression at Wave I. Additionally, no
conditionally sample pairs of parent sets (Gj1 , Gj2 ),                                                                                        edge can pass from a short-term variable to a long-
given the remainder of the graph G−{j1 ,j2 } . Parent                                                                                          term variable, for example, from Depressed at Wave I
sets G−{j1 ,j2 } such that G = (Gj1 , Gj2 , G−{j1 ,j2 } ) is                                                                                   to Have HIV/AIDS at Wave I.
cyclic have no probability mass in the conditional dis-
                                                                                                                                               We precomputed the local scores, and then drew
tribution. Let Kj?1 ,j2 be the set of pairs of parent
                                                                                                                                               100,000 samples (the first half of which were discarded
sets (Gj1 , Gj2 ) such that G = (Gj1 , Gj2 , G−{j1 ,j2 } )
                                                                                                                                               as burn-in) using the Gibbs sampler (Section 2.3),
is acyclic. For (Gj1 , Gj2 ) ∈ Kj?1 ,j2 , the conditional
                                                                                                                                               which took 30 minutes (on a single core of a cluster
posterior distribution is multinomial, by analogy with
                                                                                                                                               computer). The graph space was constrained such that
(1), with weights given by posterior distribution of
                                                                                                                                               no node had more than 3 parents, to ensure Equation
G = (Gj1 , Gj2 , G−{j1 ,j2 } ).
                                                                                                                                               1 could be evaluated.
        Pr(Gj1 , Gj2 | G−{j1 ,j2 } , X)                                                                                                        We ran 5 independent samplers, with disparate initial
                  Pr(Gj1 , Gj2 , G−{j1 ,j2 } | X)                                                                                              states. This enables a simple test of convergence to
         =P
           (Gj ,Gj )∈K ?   Pr(Gj1 , Gj2 , G−{j1 ,j2 } | X)                                                                                     be performed that compares the posterior edge prob-
                         1            2        j1 ,j2
                                                                                                                                               abilities obtained from each of the independent runs
Similarly, sets of three parent sets can be conditionally                                                                                      (25). The agreement between runs can be examined
sampled. Full technical details are presented in (1).                                                                                          graphically by plotting the edge probabilities against
[Figure 2: network diagram. Nodes are the variables listed in Table 1, with the wave number in
parentheses.]

Figure 2: Summary network for the Add Health variables considered. The edge colours are given by the Kendall
correlation coefficients between the two variables, with green edges corresponding to positive correlation, and
red edges to negative correlation. The strength of the correlation is indicated by the transparency of the line,
with greater transparency indicating weaker correlation. The variables ‘Depressed (1)’, ‘Depressed (2)’ and their
parents are shown in bold.
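
The summary network shows only edges whose estimated posterior probability is at least 0.5, with
the first half of each run discarded as burn-in. A minimal sketch of that estimate is given below.
The adjacency-matrix representation, the function names and the toy samples are assumptions made
for illustration, not the authors' implementation.

```python
import numpy as np


def posterior_edge_probabilities(sampled_graphs, burn_in_fraction=0.5):
    """Estimate P(e | X) as the fraction of retained samples containing each edge.

    sampled_graphs: sequence of (p x p) 0/1 adjacency matrices, one per MCMC
    sample, where entry [i, j] = 1 means the graph contains the edge i -> j.
    """
    samples = np.asarray(sampled_graphs, dtype=float)
    n_burn = int(len(samples) * burn_in_fraction)  # discard the initial samples
    return samples[n_burn:].mean(axis=0)


def summary_edges(edge_probs, labels, threshold=0.5):
    """List (parent, child, probability) for edges at or above the threshold."""
    rows, cols = np.nonzero(edge_probs >= threshold)
    return [(labels[i], labels[j], float(edge_probs[i, j])) for i, j in zip(rows, cols)]


# Toy chain of four sampled graphs on three variables (hypothetical values).
labels = ["Female", "Didn't present to doctor (2)", "Depressed (2)"]
empty = np.zeros((3, 3), dtype=int)
both = empty.copy()
both[0, 2] = 1  # Female -> Depressed (2)
both[1, 2] = 1  # Didn't present to doctor (2) -> Depressed (2)
one = empty.copy()
one[0, 2] = 1   # Female -> Depressed (2) only
chain = [empty, empty, both, one]

probs = posterior_edge_probabilities(chain)
edges = summary_edges(probs, labels)
print(edges)
```

In this toy chain the two empty graphs fall in the burn-in, so the estimates rest on the last two
samples only.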


each other (Figure 1). Mean Pearson correlation coefficients between edge probabilities from pairs
of runs were 0.9999 ± 0.0002 (standard deviation) for the Gibbs sampler and 0.6322 ± 0.0477 for
MC3. The agreement between the independent runs of the Gibbs sampler gave us confidence in our
results, in contrast to the large disagreements between MC3 runs. In addition, cumulative edge
probability plots for each edge showed regular excursions around the mean (26), and a numerical
diagnostic (27) monitoring the number of edges in the sampled graph also clearly suggested that
sufficient samples had been drawn (R̂ ≈ 1.0).

The samples drawn using MCMC allow the posterior distribution of Bayesian networks to be
approximated. In particular, the samples can be used to estimate the posterior edge probability
P(e | X) with e ∈ E. Figure 2 displays all edges with posterior probability of at least 0.5.

Our focus is on depression, the parents of which in Figure 2 we observe are “Didn’t present to
doctor” and “Gender”. It is important, however, to note that the model does not say that these are
the only factors that are important. For example, “Drug user” at Wave I is related to depression
through “Didn’t present to doctor” at Wave I and II (Figure 2).

This is shown in Figure 3, which gives the conditional probability of being depressed at Wave II
when a particular variable is set to a specific value. We see that general health, violence,
academic performance and drug use all affect the conditional probability of depression at Wave II.
Note that to compute this probability, links from the parents of the variable in which we
‘intervene’ are removed; this is equivalent to the ‘do-operator’ in the terminology of Pearl (28).

The analysis reveals the interaction between the many aspects of life that have an impact on
depression. The connections between depression and its two parents in Figure 2 have been
previously discussed in the literature. The importance of gender in depression is particularly
extensively documented in the literature (8). The connection to a failure in seeking medical care
even when the individual thinks they should has also
been discussed in the literature, often in terms of poor accessibility of health care services for
young people (29, 8). Several decades of research have revealed the complex causation of
depression in young people, as suggested by this study (8).

4   DISCUSSION

There is a large amount of information held in large social science questionnaires. In this paper
we have examined a graphical model approach to inferring structure amongst the variables in such
questionnaires. In contrast to the standard regression-based approaches, a graphical model
approach forgoes the need to specify a particular variable as the response. Instead, a more
comprehensive estimate of the entire structure of the underlying system can be obtained.
Regression approaches posit a particular conditional-independence structure, while graphical
approaches allow consideration of more general structures.

The limitations of this study include those of all similar studies using observational data that
are collected for multiple audiences. These forms of data, including the longitudinal data used
here, do not permit strong causal conclusions to be drawn. In particular there may be important
variables that we have not included in the analysis. However, the results are consistent with
studies that have used other research approaches including experimental designs. The connection
between an individual not seeking medical care when they think they should and depression supports
current practice guidance in the UK (30), where there is an emphasis on providing access to health
care through the school system rather than expecting young people to seek health care themselves.
Not seeking medical care despite believing it should be sought is a complex factor because it
captures both barriers to getting medical care within the individual, such as lacking motivation
to seek care, and barriers within the individual’s environment, such as poor access to care. This
may mean that the variable encapsulates a number of different characteristics related to
depression, and thus may form a ‘marker’ for depression. However, the use of a form of the
question “Has there been any time over the past year when you thought you should get medical care,
but you did not?” as a screening question in different contexts needs further consideration.

This method of analysis clarifies the complexity of depression and suggests why, when using
traditional methods of analysis, it can be difficult to clarify whether or not factors such as
experiences in the family, in the wider community and at school impact on the experience of
depression for young people. It may also suggest why interventions for prevention of depression
have not yet been demonstrated to be cost effective (31).

We performed structural inference for the Bayesian network using a Gibbs sampler (1), because MC3
did not mix in a reasonable time. We have also found (1) this algorithm to be superior to the REV
sampler (32), and it has the advantage of avoiding the need to consider an order prior as required
by order MCMC methods (33, 34), which induces a bias that can only be corrected exactly by NP-hard
computation of a correction factor.

An alternative to the MCMC method used here is the PC-algorithm (10, 11). This method is
computationally efficient and is asymptotically consistent. However, to test whether the sample
size available here is sufficient to reach the asymptotic regime, we applied the PC-algorithm
(without constraints) to 10 different subsamples, each containing 90% of the data. We found that
these results differed substantially, with a mean structural Hamming distance of 84 between the
pairs of completed partially directed acyclic graphs (CPDAGs) given for the subsamples.

We used a Multinomial-Dirichlet model for the local conditional distributions, which yields a
closed-form marginal likelihood. This model posits an entirely general discrete distribution,
allowing its form to be guided by the data. However, the number of parameters in the local
distributions for this model increases exponentially with the number of parents, which may mean
that overly-sparse models are preferred. This is problematic when the sample size of the available
data is small, because models with many parameters cannot be assessed adequately without a large
dataset. The large sample size of the dataset used here minimises this issue, but it would
nonetheless be worthwhile to consider more compact parameterisations. However, estimating such
models (35) significantly increases the complexity of the model space, which makes such an
approach computationally challenging in this setting.

For this paper, we removed samples with missing data. It is possible to handle missing data
formally, for example by using structural EM (36), and similarly consider latent variables (e.g.
shared genetics driving both child and parent behaviour). However, at present, doing so whilst
robustly exploring large model spaces remains an open challenge. Tackling these computational and
inferential issues is a key area for future research.

References

 [1] Goudie, R. J. B. and Mukherjee, S. M. (2011). An Efficient Gibbs Sampler for Structural
     Inference in Bayesian Networks. CRiSM Working Paper 11-21 (Dept. of Statistics, University
     of Warwick).
 [2] Friedman, N. (2004) Science, 303, 5659, 799–805.
 [3] Mukherjee, S. and Speed, T. P. (2008) Proc Natl Acad Sci USA, 105, 38, 14313–14318.
 [4] Acid, S., de Campos, L. M., Fernández-Luna, J. M., Rodríguez, S., Rodríguez, J. M. and
     Salcedo, J. L. (2004) Artif Intell in Med, 30, 3, 215–232.
 [5] Costello, E. J., Mustillo, S., Erkanli, A., Keeler, G. and Angold, A. (2003) Arch Gen Psych,
     60, 8, 837–844.
 [6] Costello, E. J., Erkanli, A. and Angold, A. (2006) J Child Psychol Psych, 47, 12, 1263–1271.
 [7] Thapar, A., Collishaw, S., Potter, R. and Thapar, A. K. (2010) Br Med J, 340, c209.
 [8] Patel, V., Flisher, A. J., Hetrick, S. and McGorry, P. (2007) Lancet, 369, 9569, 1302–1313.
 [9] Barnett, P. A. and Gotlib, I. H. (1988) Psych Bull, 104, 1, 97–126.
[10] Spirtes, P., Glymour, C. and Scheines, R. (2000) Causation, Prediction, and Search (The MIT
     Press, Cambridge, MA).
[11] Korb, K. B. and Nicholson, A. E. (2011) Bayesian Artificial Intelligence (CRC Press, Boca
     Raton, FL).
[12] Harris, K. M., Halpern, C. T., Whitsel, E. A., Hussey, J., Tabor, J., Entzel, P. and Udry,
     J. R. (2009) The National Longitudinal Study of Adolescent Health: Research Design.
[13] Radloff, L. (1977) App Psych Meas, 1, 3, 385–401.
[14] Goodman, E. (1999) Am J Pub Health, 89, 10, 1522–1528.
[15] Roberts, R. E., Lewinsohn, P. M. and Seeley, J. R. (1991) J Am Acad Child Adolesc Psych, 30,
     1, 58–66.
[16] Holt, S., Buckley, H. and Whelan, S. (2008) Child Abuse & Neglect, 32, 8, 797–810.
[17] Obot, I. S. and Anthony, J. C. (2004) J Child Adolesc Subst Abuse, 13, 4, 83–96.
[18] Brown, R. A., Lewinsohn, P. M., Seeley, J. R. and Wagner, E. F. (1996) J Am Acad Child
     Adolesc Psych, 35, 12, 1602–1610.
[19] Battles, H. B. and Wiener, L. S. (2002) J Adoles Health, 30, 3, 161–168.
[20] Larun, L., Nordheim, L. V., Ekeland, E., Hagen, K. B. and Heian, F. (2006) Cochrane Database
     Syst Rev, 3, CD004691.
[21] Heckerman, D., Geiger, D. and Chickering, D. M. (1995) Mach Learn, 20, 197–243.
[22] Madigan, D. and York, J. C. (1995) Int Stat Rev, 63, 2, 215–232.
[23] Kalisch, M. and Bühlmann, P. (2007) J Mach Learn Res, 8, 613–636.
[24] Cox, D. and Wermuth, N. (1996) Multivariate Dependencies: Models, Analysis and
     Interpretation (Chapman & Hall, London).
[25] Robert, C. P. and Casella, G. (2004) Monte Carlo Statistical Methods (Springer, New York).
[26] Yu, B. and Mykland, P. (1998) Statistics and Computing, 8, 3, 275–286.
[27] Gelman, A. and Rubin, D. B. (1992) Statistical Science, 7, 4, 457–472.
[28] Pearl, J. (2000) Causality: Models, Reasoning, and Inference (Cambridge University Press,
     New York).
[29] Rickwood, D. J., Deane, F. P. and Wilson, C. J. (2007) Med J Aus, 187, 7 Suppl, S35–S39.
[30] National Institute for Health and Clinical Excellence (2005) Depression in Children and
     Young People (NICE, London).
[31] Merry, S. N. (2007) Curr Opin Psych, 20, 4, 325–329.
[32] Grzegorczyk, M. and Husmeier, D. (2008) Mach Learn, 71, 2-3, 265–305.
[33] Ellis, B. and Wong, W. H. (2008) J Am Stat Assoc, 103, 482, 778–789.
[34] Friedman, N. and Koller, D. (2003) Mach Learn, 50, 1-2, 95–125.
[35] Friedman, N. and Goldszmidt, M. (1996) In Proc. Twelfth Conference on Uncertainty in
     Artificial Intelligence (UAI-96) (Morgan Kaufmann Publishers Inc.), 252–260.
[36] Friedman, N. (1998) In Proc. Fourteenth Conference on Uncertainty in Artificial Intelligence
     (UAI-98) (Morgan Kaufmann Publishers Inc.), 129–138.

Table 1: The table shows the label used in the plots above, the number of levels (r), and the exact
wording of the question. The ID(s) of the relevant variables in the Add Health dataset are in parentheses.
See www.cpc.unc.edu/projects/addhealth for full details of all of these questions.

 Label                    r   Question
 Female                   2   Interviewer, please confirm that R’s sex is (male) female. (BIO SEX)
 Hisp/Latino              2   Are you of Hispanic or Latino origin? (H1GI4)
 White                    2   What is your race? [White] You may give more than one answer (H1GI6A)
 Black/Af Am              2   What is your race? [Black or African American] You may give more than one
                              answer (H1GI6B)
 Am Ind/Nat Am            2   What is your race? [American Indian or Native American] You may give more
                              than one answer (H1GI6C)
 Asian/Pac Isl.           2   What is your race? [Asian or Pacific Islander] You may give more than one
                              answer (H1GI6D)
 Other race               2   What is your race? [Other] You may give more than one answer (H1GI6E)
 Skips school             4   [If SCHOOL YEAR:] During this school year [If SUMMER:] During the 1994-
                              1995 school year how many times HAVE YOU SKIPPED/DID YOU SKIP
                              school for a full day without an excuse? (H1ED2; H2ED2)
 Experiences prejudice    3   [If SCHOOL YEAR:] Students at your school are prejudiced [If SUMMER:] Last
                              year, the students at your school were prejudiced. (H1ED21; H2ED17)
 In physical fights       4   In the past 12 months, how often did you get into a serious physical fight?
                              (H1DS5; H2FV16)
 Didn’t present to doctor 2   Has there been any time over the past year when you thought you should get
                              medical care, but you did not? (H1GH26; H2GH28)
 Severely injured         3   Which of these best describes your worst injury during the past year? (H1GH54;
                              H2GH47)
 Have HIV/AIDS            2   Have you ever been told by a doctor or a nurse that you had... HIV/AIDS
                              (H1CO16D; H2CO19D)
 Seen shooting            3   During the past 12 months, how often did each of the following things happen?
                              You saw someone shoot or stab another person. (H1FV1; H2FV1)
 Mother warm/loving       4   Most of the time, your mother is warm and loving toward you. (H1PF1; H2PF1)
 Been suspended           2   Have you ever received an out-of-school suspension from school? (H1ED7;
                              H2ED3)
 Been expelled            2   Have you ever been expelled from school? (H1ED9; H2ED5)
 Good health              3   In general, how is your health? Would you say... (H1GH1; H2GH1)
 Talks to neighbours      2   In the past month, you have stopped on the street to talk with someone who
                              lives in your neighborhood? (H1NB2; H2NB2)
 Age                      5   Age at interview, computed from date of birth, and date of interview (Con-
                              structed from IYEAR, IMONTH, IDAY, H1GI1Y, H1GI1M)
 Live with mother         2   Indicator variable (Constructed from H1HR3A-T; H2HR4A-Q)
 Live with father         2   Indicator variable (Constructed from H1HR3A-T; H2HR4A-Q)
 Smoker                   4   Frequency of smoking (Constructed from H1TO1/2/5; H2TO1/5)
 Drinks alcohol           4   Frequency and amount of drinking alcohol (Constructed from H1TO12/15/18;
                              H2TO15/19/22)
 Exercises                3   Amount of exercise (Constructed from H1DA4/5/6; H2DA4-6)
 Depressed                2   Rescaled CES-D, following (14) (Constructed from H1FS1-18; H2FS1-18)
 Victim of violence       2   Indicator variable (Constructed from H1FV2-6; H2FV2-5)
 Family bereavement       3   Number of bereavements (Constructed from H1NM2/F2, H1FP24A1-5;
                              H2NM4/F4, H2FP28A1-3)
 Strong academically      4   Quartiles (Constructed from H1ED11-4; H2ED7-10)
 Drug user                2   Indicator variable (Constructed from H1TO30/34/37/41; H2TO44/50/54/58)
 Family poor              5   Census Bureau measure of poverty (Constructed from H1HR2/3/7/8, PA55)
 Parents unhappy together 4   (Parent asked.) Do you and your partner argue/talk of separating? (Constructed
                              from PB19/20)
 Parent drinks            4   (Parent asked.) Number/frequency of drinks (Constructed from PA61/2)
 Householder smokes       3   (Parent asked.) Either parent or others in household smokes (Constructed from
                              PA63/4)
 Has learning disability  2   (Parent asked.) Does (he/ she) have a specific learning disability, such as
                              difficulties with attention, dyslexia, or some other reading, spelling, writing, or
                              math disability? (PC38)
 Parents aid decisions    5   (Parent asked.) How often would it be true for you to make each of the following
                              statements about {child’s name}? {Child’s name} and you make decisions about
                              (his/ her) life together. (PC34B)
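
The number of levels r reported in Table 1 determines how quickly the Multinomial-Dirichlet local
distributions grow with the number of parents. The sketch below counts free parameters for one
node under the usual assumption of a separate multinomial for every joint parent configuration;
the function name and the example parent sets are hypothetical, chosen only to illustrate the
growth.

```python
from math import prod


def n_free_parameters(r_child, r_parents):
    """Free parameters of a fully general discrete conditional distribution.

    r_child: number of levels of the node.
    r_parents: list with the number of levels of each parent (empty if none).
    One multinomial with (r_child - 1) free probabilities is needed for every
    joint configuration of the parents.
    """
    return (r_child - 1) * prod(r_parents)


# A binary node such as 'Depressed' with two binary parents.
small = n_free_parameters(2, [2, 2])
# A five-level node with three five-level parents (a hypothetical parent set
# at the limit of three parents per node).
large = n_free_parameters(5, [5, 5, 5])
print(small, large)
```

Each additional parent multiplies the count by that parent's number of levels, which is the
exponential growth referred to in the discussion of model sparsity.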

Table 2: The groupings of the variables that were used to determine constraints on the Bayesian networks. Each
variable in the analysis is either a Background variable, or from Wave I or Wave II of the Add Health study.
Within each wave of the study, variables were further classified into whether they asked about the short- or
long-term.

 Background          Wave I Long-term          Wave I Short-term      Wave II Long-term         Wave II Short-term
 Female              Skips school              Househol. smokes       Seen shooting             Smoker
 Age                 Experiences prejudice     Smoker                 Alcohol                   Live with mother
 Hisp/Latino         In physical fights        Live with mother       Drug user                 Live with father
 White               Didn’t pres. to doctor    Live with father       Mother warm/loving        Talks neighbours
 Black/Af Am         Severely injured          Parent drinks          Have HIV/AIDS             Exercises
 Am Ind/Nat Am       Have HIV/AIDS             Talks neighbours       Family bereavement        Depressed
 Asian/Pac Isl.      Seen shooting             Exercises              Experiences prejudice
 Other race          Mother warm/loving        Depressed              Been expelled
 Has learning dis.   Been suspended                                   Been suspended
                     Been expelled                                    Victim of violence
                     Good health                                      In physical fights
                     Alcohol                                          Strong academically
                     Victim of violence                               Didn’t pres. to doctor
                     Family bereavement                               Skips school
                     Strong academically                              Severely injured
                     Drug user                                        Good health
                     Family poor
                     Parents unhappy togth.
                     Parents aid decisions
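
The groupings in Table 2 translate into a simple rule for which directed edges the sampler may
consider. The sketch below assumes that the five groups are totally ordered as the columns of
Table 2 are laid out, and that an edge is permitted whenever the parent's group is not later than
the child's group. That ordering is an interpretation of the constraints described in the text,
and the names `GROUP_ORDER`, `GROUP_OF` and `edge_allowed` are hypothetical.

```python
# Assumed linear ordering of the Table 2 groups (earliest first).
GROUP_ORDER = [
    "Background",
    "Wave I Long-term",
    "Wave I Short-term",
    "Wave II Long-term",
    "Wave II Short-term",
]
RANK = {name: i for i, name in enumerate(GROUP_ORDER)}

# A handful of variables from Table 2, with the wave number in parentheses.
GROUP_OF = {
    "Female": "Background",
    "Have HIV/AIDS (1)": "Wave I Long-term",
    "Depressed (1)": "Wave I Short-term",
    "Have HIV/AIDS (2)": "Wave II Long-term",
    "Depressed (2)": "Wave II Short-term",
    "Exercises (2)": "Wave II Short-term",
}


def edge_allowed(parent, child):
    """True if the edge parent -> child does not point backwards through the groups."""
    if parent == child:
        return False  # no self-loops in a DAG
    return RANK[GROUP_OF[parent]] <= RANK[GROUP_OF[child]]


print(edge_allowed("Depressed (2)", "Depressed (1)"))      # backwards in time
print(edge_allowed("Depressed (1)", "Have HIV/AIDS (1)"))  # short-term to long-term
print(edge_allowed("Female", "Depressed (1)"))             # background to Wave I
```

A mask of this kind can be applied when enumerating candidate parent sets, so that disallowed
edges are never scored.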