A Comparison of Two MCMC Algorithms for Hierarchical Mixture Models


                                                      Russell G. Almond⇤
                                                     Florida State University


                         Abstract                                    1   Introduction

                                                                     Mixture models (McLachlan & Peel, 2000) are a fre-
     Mixture models form an important class of
                                                                     quently used method of unsupervised learning. They
     models for unsupervised learning, allowing
                                                                     sort data points into clusters based on just their val-
     data points to be assigned labels based on
                                                                     ues. One of the most frequently used mixture models
     their values. However, standard mixture
                                                                     is a mixture of normal distributions. Often the mean
     models procedures do not deal well with rare
                                                                     and variance of each cluster is learned along with the
     components. For example, pause times in
                                                                     classification of each data point.
     student essays have di↵erent lengths depend-
     ing on what cognitive processes a student                       As an example, Almond, Deane, Quinlan, Wagner, and
     engages in during the pause. However, in-                       Sydorenko (2012) fit a mixture of lognormal distribu-
     stances of student planning (and hence very                     tions to the pause time of students typing essays as
     long pauses) are rare, and thus it is dif-                      part of a pilot writing assessment. (Alternatively, this
     ficult to estimate those parameters from a                      model can be described as a mixture of normals fit to
     single student’s essays. A hierarchical mix-                    the log pause times.) Almond et al. found that mix-
     ture model eliminates some of those prob-                       ture models seems to fit the data fairly well. The mix-
     lems, by pooling data across several of the                     ture components could correspond to di↵erent cogni-
     higher level units (in the example students)                    tive process used in writing (Deane, 2012) where each
     to estimate parameters of the mixture com-                      cognitive process takes di↵erent amounts of time (i.e.,
     ponents. One way to estimate the parame-                        students pause longer when planning, than when sim-
     ters of a hierarchical mixture model is to use                  ply typing).
     MCMC. But these models have several issues
                                                                     Mixture models are difficult to fit because they dis-
     such as non-identifiability under label switch-
                                                                     play a number of pathologies. One problem is com-
     ing that make them difficult to estimate just
                                                                     ponent identification. Simply swapping the labels of
     using o↵-the-shelf MCMC tools. This paper
                                                                     Components 1 and 2 produces a model with identical
     looks at the steps necessary to estimate these
                                                                     likelihoods. Even if a prior distribution is placed on
     models using two popular MCMC packages:
                                                                     the mixture component parameters, the posterior is
     JAGS (random walk Metropolis algorithm)
                                                                     multimodal. Second, it is easy to get a pathological
     and Stan (Hamiltonian Monte Carlo). JAGS,
                                                                     solution in which a mixture component consists of a
     Stan and R code to estimate the models and
                                                                     single point. These solutions are not desirable, and
     model fit statistics are published along with
                                                                     some estimation tools constrain the minimum size of
     the paper.
                                                                     a mixture component (Gruen & Leisch, 2008). Fur-
                                                                     thermore, if a separate variance is to be estimated for
                                                                     each mixture component, several data points must be
                                                                     assigned to each component in order for the variance
                                                                     estimate to have a reasonable standard error. There-
Key words: Mixture Models, Markov Chain Monte
                                                                     fore, fitting a model with rare components requires a
Carlo, JAGS, Stan, WAIC
                                                                     large data size.
    ⇤
     Paper presented at the Bayesian Application Workshop at Un-
certainty in Artificial Intelligence Conference 2014, Quebec City,
                                                                     Almond et al. (2012) noted these limitations in
Canada.                                                              their conclusions. First, events corresponding to the

                                                                1
highest-level planning components in Deane (2012)’s
cognitive model would be relatively rare, and hence
would lumped in with other mixture components due
to size restrictions. Second, some linguistic contexts
(e.g., between Sentence pauses) were rare enough that
fitting a separate mixture model to each student would
not be feasible.
One work-around is a hierarchical mixture model. As
with all hierarchical models, it requires units at two
di↵erent levels (in this example, students or essays
are Level 2 and individual pauses are Level 1). The
assumption behind the hierarchical mixture model is
that the mixture components will look similar across
the second level units. Thus, the mean and variance of
Mixture Component 1 for Student 1 will look similar
                                                                    Figure 1: Non-hierarchical Mixture Model
to those for Student 2. Li (2013) tried this on some of
the writing data.
                                                              tion 2.4 describes prior distributions and constraints
One problem that frequently arises in estimating mix-
                                                              on parameters which keep the Markov chain away from
ture models is determining how many mixture compo-
                                                              those points.
nents to use. What is commonly done is to estimate
models for K = 2, 3, 4, . . . up to some small maximum
number of components (depending on the size of the            2.1    Mixture of Normals
data). Then a measure of model–data fit, such as AIC,
                                                              Consider a collection observations,              Yi       =
DIC or WAIC (see Gelman et al., 2013, Chapter 7), is
                                                              (Yi,1 , . . . , Yi,Ji ) for a single student, i.  Assume
calculated for each model and the model with the best
                                                              that the process that generated these data is a mix-
fit index is chosen. These methods look at the deviance
                                                              ture of K normals. Let Zi,j ⇠ cat(⇡ i ) be a categorical
(minus twice the log likelihood of the data) and adjust
                                                              latent index variable indicating which component Ob-
it with a penalty for model complexity. Both DIC and                                                  ⇤
                                                              servation j comes from and let Yi,j,k       ⇠ N (µi,k , i,k )
WAIC require Markov chain Monte Carlo (MCMC) to
                                                              be the value of Yi,j which would be realized when
compute, and require some custom coding for mixture
                                                              Zi,j = k.
models because of the component identification issue.
                                                              Figure 1 shows this model graphically. The plates in-
This paper is a tutorial for replicating the method used
                                                              dicate replication over categories (k), Level 1 (pauses,
by Li (2013). The paper walks through a script writ-
                                                              j) and Level 2 (students, i) units. Note that there is
ten in the R language (R Core Team, 2014) which per-
                                                              no connection across plates, so the model fit to each
forms most of the steps. The actual estimation is done
                                                              Level 2 unit is independent of all the other Level 2
using MCMC using either Stan (Stan Development
                                                              units. This is what Gelman et al. (2013) call the no
Team, 2013) or JAGS (Plummer, 2012). The R scripts
                                                              pooling case.
along with the Stan and JAGS models and some sam-
ple data are available at http://pluto.coe.fsu.edu/           The latent variables Z and Y ⇤ can be removed from
mcmc-hierMM/.                                                 the likelihood for Yi,j by summing over the possible
                                                              values of Z. The likelihood for one student’s data, Yi ,
2    Mixture Models                                           is
                                                                                             Ji X
                                                                                             Y  K
                                                                                                                Yi,j     µi,k
Let i 2 {1, . . . , I} be a set of indexes over the sec-         Li (Yi |⇡ i , µi ,   i) =             ⇡i,k (                   )   (1)
                                                                                                                       i,k
ond level units (students in the example) and let                                            j=1 k=1
j 2 {1, . . . , Ji } be the first level units (pause events
                                                              where (·) is the unit normal density.
in the example). A hierarchical mixture model is by
adding a Level 2 (across student) distribution over the       Although conceptually simple, there are a number of
parameters of the Level 1 (within student) mixture            issues that mixture models can have. The first is-
model. Section 2.1 describes the base Level 1 mixture         sue is that the component labels cannot be identified
model, and Section 2.2 describes the Level 2 model.           from data. Consider the case with two components.
Often MCMC requires reparameterization to achieve             The model created by swapping the labels for Compo-
better mixing (Section 2.3). Also, there are certain pa-      nents 1 and 2 with new parameters ⇡ 0i = (⇡i,2 , ⇡i,1 ),
rameter values which result in infinite likelihoods. Sec-     µ0i = (µi,2 , µi,1 ), and 0i = ( i,2 , i,1 ) has an identical

                                                         2
likelihood. For the K component model, any permu-
tation of the component labels produces a model with
identical likelihood. Consequently, the likelihood sur-
face is multimodal.
A common solution to the problem is to identify the
components by placing an ordering constraint on one
of the three parameter vectors: ⇡, µ or . Section 4.1
returns to this issue in practice.
A second issue involves degenerate solutions which
contain only a single data point. If a mixture com-
ponent has only a single data point, its standard devi-
ation, i,k will go to zero, and ⇡i,k will approach 1/Ji ,
which is close to zero for large Level 1 data sets. Note
that if either ⇡i,k ! 0 or i,k > 0 for any k, then the
likelihood will become singular.                                      Figure 2: Hierarchical Mixture Model
Estimating i,k requires a minimum number of data
points from Component k (⇡i,k Ji > 5 is a rough mini-        ing mixture model. To make the hierarchical model,
mum). If ⇡i,k is believed to be small for some k, then a     add across-student prior distributions for the student-
large (Level 1) sample is needed. As K increases, the        specific parameters parameters, ⇡ i , µi and i . Be-
smallest value of ⇡i,k becomes smaller so the minimum        cause JAGS parameterizes the normal distribution
sample size increases.                                       with precisions (reciprocal variances) rather than stan-
                                                             dard deviations, ⌧ i are substituted for the standard
2.2   Hierarchical mixtures                                  deviations i . Figure 2 shows the hierarchical model.

Gelman et al. (2013) explains the concept of a hier-         Completing the model requires specifying distributions
archical model using a SAT coaching experiment that          for the three Level 1 (student-specific) parameters. In
took place at 8 di↵erent schools. Let Xi ⇠ N (µi , i )       particular, let
be the observed e↵ect at each school, where the school
                                                                 ⇡ i = (⇡i,1 , . . . , ⇡i,K ) ⇠ Dirichlet(↵1 , . . . , ↵k )   (2)
specific standard error i is known (based mainly on
the sample size used at that school). There are three                                 µi,k ⇠ N (µ0,k ,       0,k )            (3)
ways to approach this problem: (1) No pooling. Es-                              log(⌧i,k ) ⇠ N (log(⌧0,k ), 0,k )             (4)
timate µi separately with no assumption about about
the similarity of µ across schools. (2) Complete pool-       This introduces new Level 2 (across student) param-
ing. Set µi = µ0 for all i and estimate µ0 (3) Par-          eters: ↵, µ0 , 0 , ⌧ 0 , and 0 . The likelihood for a
tial pooling. Let µi ⇠ N (µ0 , ⌫) and now jointly es-        single student, Li (Yi |⇡ i , µi , ⌧ i ) is given (with a suit-
timate µ0 , µ1 , . . . , µ8 . The no pooling approach pro-   able chance of variable) by Equation 1. To get the
duces unbiased estimates for each school, but it has         complete data likelihood, multiply across the students
the largest standard errors, especially for the smallest     units (or to avoid numeric overflow, sum the log likeli-
schools. The complete pooling approach ignores the           hoods). If we let ⌦ be the complete parameter (⇡ i ,µi ,
school level variability, but has much smaller standard      ⌧ i for each student, plus ↵, µ0 , 0 , ⌧ 0 , and 0 ), then
errors. In the partial pooling approach, the individual                               I
school estimates are shrunk towards the grand mean,                                   X
                                                                        L(Y|⌦) =            log Li (Yi |⇡ i , µi , ⌧ i ) .    (5)
µ0 , with the amount of shrinkage related to the size of
                                                                                      i=1
the ratio ⌫ 2 /(⌫ 2 + i 2 ); in particular, there is more
shrinkage for the schools which were less precisely mea-
                                                             Hierarchical models have their own pathologies which
sured. Note that the higher level standard deviation,
                                                             require care during estimation. If either of the stan-
⌫ controls the amount of shrinkage: the smaller ⌫ is
                                                             dard deviation parameters, 0,k or 0,k , gets too close
the more the individual school estimates are pulled to-
                                                             to zero or infinity, then this could cause the log poste-
wards the grand mean. At the limits, if ⌫ = 0, the par-
                                                             rior to go to infinity. These cases correspond to the no
tial pooling model is the same as the complete pooling
                                                             pooling and complete pooling extremes of the hierar-
model and if ⌫ = 1 then the partial pooling model is
                                                             chical model. Similarly, the varianceP   of the Dirichlet
the same as the no pooling model.                                                                      K
                                                             distribution is determined by ↵N = k=1 ↵k . If ↵N
Li (2013) builds a hierarchical mixture model for            is too close to zero, this produces no pooling in the
the essay pause data. Figure 1 shows the no pool-            estimates of ⇡ i and if ↵N is too large, then there is

                                                        3
nearly complete pooling in those estimates. Again,
those values of ↵N can cause the log posterior to be-
come infinite.                                                                    ↵k = ↵0,k ⇤ ↵N                     (6)
                                                                                 µi,k = µ0,k + ✓i,k 0,k              (7)
                                                                                 ✓i,k ⇠ N (0, 1)                     (8)
                                                                            log(⌧i,k ) = log(⌧0,k ) + ⌘i,k 0,k       (9)
                                                                                 ⌘i,k ⇠ N (0, 1)                    (10)
2.3   Reparameterization
                                                                2.4   Prior distributions and parameter
                                                                      constraints

It is often worthwhile to reparameterize a model in or-         Rarely does an analyst have enough prior informa-
der to make MCMC estimation more efficient. Both                tion to completely specify a prior distribution. Usu-
random walk Metropolis (Section 3.2) and Hamilto-               ally, the analyst chooses a convenient functional form
nian Monte Carlo (Section 3.3) work by moving a point           and chooses the hyperparameters of that form based
around the parameter space of the model. The geom-              on available prior information (if any). One popular
etry of that space will influence how fast the Markov           choice is the conjugate distribution of the likelihood.
chain mixes, that is, moves around the space. If the            For example, if the prior for a multinomial probability,
geometry is unfavorable, the point will move slowly             ⇡ i follows a Dirichlet distribution with hyperparame-
through the space and the autocorrelation will be high          ter, ↵, then the posterior distribution will also be a
(Neal, 2011). In this case a very large Monte Carlo             Dirichlet with a hyperparameter equal to the sum of
sample will be required to obtain reasonable Monte              the prior hyperparameter and the data. This gives the
Carlo error for the parameter estimates.                        hyperparameter of a Dirichlet prior a convenient in-
                                                                terpretation as pseudo-data. A convenient functional
Consider once more the Eight Schools problem where
                                                                form is one based on a normal distribution (sometimes
µi ⇠ N (µ0 , ⌫). Assume that we have a sampler that
                                                                on a transformed parameter, such as the log of the
works by updating the value of the µi ’s one at a time
                                                                precision) whose mean can be set to a likely value of
and then updating the values of µ0 and ⌫. When up-
                                                                the parameter and whose standard deviation can be
dating µi , if ⌫ is small then values of µi close to µ0 will
                                                                set so that all of the likely values for the parameter
have the highest conditional probability. When updat-
                                                                are within two standard deviations of the mean. Note
ing ⌫ if all of the values of µi are close to µ0 , then small
                                                                that for location parameters, the normal distribution
values of ⌫ will have the highest conditional probabil-
                                                                is often a conjugate prior distribution.
ity. The net result is a chain in which the movement
from states with small ⌫ and the µi ’s close together           Proper prior distributions are also useful for keeping
to states with large ⌫ and µi ’s far apart takes many           parameter values away from degenerate or impossible
steps.                                                          solutions. For example, priors for standard deviations,
                                                                  0,k and 0,k must be strictly positive. Also, the group
A simple trick produces a chain which moves much
                                                                precisions for each student, ⌧ i , must be strictly pos-
more quickly. Introduce a series of auxiliary variables
                                                                itive. The natural conjugate distribution for a preci-
✓i ⇠ N (0, 1) and set µi = µ0 + ⌫✓i . Note that the
                                                                sion is a gamma distribution, which is strictly positive.
marginal distribution of µi has not changed, but the
                                                                The lognormal distribution, used in Equations 4 and 9,
geometry of the ✓, µ0 , ⌫ space is di↵erent from the ge-
                                                                has a similar shape, but its parameters can be inter-
ometry of the µ, µ0 , ⌫ space, resulting in a chain that
                                                                preted as a mean and standard deviation on the log
moves much more quickly.
                                                                scale. The mixing probabilities ⇡ i must be defined
A similar trick works for modeling the relationships            over the unit simplex (i.e., they must all be between 0
between ↵ and ⇡ i . Setting ↵k = ↵0,k ⇤ ↵N , where              and 1 and they must sum to 1), as must the expected
↵0 has a Dirichlet distribution and ↵N has a gamma              mixing probabilities ↵0 ; the Dirichlet distribution sat-
distribution seems to work well for the parameters ↵ of         isfies this constraint and is also a natural conjugate.
the Dirichlet distribution. In this case the parameters         Finally, ↵N must be strictly positive; the choice of the
have a particularly easy to interpret meaning: ↵0 is            chi-squared distribution as a prior ensures this.
the expected value of the Dirichlet distribution and ↵N
                                                                There are other softer restraints on the parameters. If
is the e↵ective sample size of the Dirichlet distribution.
                                                                  0,k or 0,k gets too high or too low for any value of
Applying these two tricks to the parameters of the              k, the result is a no pooling or complete pooling so-
hierarchical model yields the following augmented pa-           lution on that mixture component. For both of these
rameterization.                                                 parameters (.01, 100) seems like a plausible range. A

                                                           4
lognormal distribution which puts the bulk of its prob-              comes up with a single parameter estimate but does
ability mass in that range should keep the model away                not explore the entire space of the distribution. In con-
from those extremes. Values of ⌧i,k that are too small               trast, MCMC algorithms explore the entire posterior
represent the collapse of a mixture component onto                   distribution by taking samples from the space of possi-
a single point. Again, if ⌧0,k is mostly in the range                ble values. Although the MCMC samples are not inde-
(.01, 100) the chance of an extreme value of ⌧i,k should             pendent, if the sample is large enough it converges in
be small. This yields the following priors:                          distribution to the desired posterior. Two approaches
                                                                     to MCMC estimation are the random walk Metropolis
                     log( 0k ) ⇠ N (0, 1)                    (11)    algorithm (RWM; used by JAGS, Section 3.2) and the
                     log( 0k ) ⇠ N (0, 1)                    (12)    Hamiltonian Monte Carlo algorithm (HMC; used by
                      log(⌧0k ) ⇠ N (0, 1)                   (13)    Stan, Section 3.3).

                                                                     3.1   EM Algorithm
High values of ↵N also correspond to a degenerate so-
lution. In general, the gamma distribution has about                 McLachlan and Krishnan (2008) provides a review of
the right shape for the prior distribution of ↵N , and we            the EM algorithm with emphasis on mixture models.
expect it to be about the same size as I, the number                 The form of the EM algorithm is particularly simple
of Level-2 units. The choice of prior is a chi-squared               for the special case of a non-hierarchical mixture of
distribution with 2 ⇤ I degrees of freedom.                          normals. It alternates between an E-step where the
                                 2                                   p(Zi,j = k) = pi,j,k is calculated for every observation,
                       ↵N ⇠          (I ⇤ 2)                 (14)
                                                                     j, and every component, k and an M-step where the
                                                                     maximum likelihood values for ⇡i,k , µi,k and i,k are
The two remaining parameters we don’t need to con-                   found by taking moments of the data set Yi weighted
strain too much. For µ0,k we use a di↵use normal                     by the component probabilities, pi,j,k .
prior (one with a high standard deviation), and for ↵0
we use a Je↵rey’s prior (uniform on the logistic scale)              A number of di↵erent problems can crop up with using
which is the Dirichlet distribution with all values set              EM to fit mixture models. In particular, if ⇡i,k goes to
to 1/2.                                                              zero for any k, that component essentially disappears
                                                                     from the mixture. Also, if i,k goes to zero the mix-
               µ0,k ⇠ N (0, 1000)                            (15)    ture component concentrates on a single point. Fur-
                 ↵0 ⇠ Dirichlet(0.5, . . . , 0.5)            (16)    thermore, if µi,k = µi,k0 and i,k = i,k0 for any pair
                                                                     of components the result is a degenerate solution with
                                                                     K 1 components.
The informative priors above are not sufficient to al-
ways keep the Markov chain out of trouble. In partic-                As the posterior distribution for the mixture model is
ular, it can still reach places where the log posterior              multimodal, the EM algorithm only finds a local max-
distribution is infinite. There are two di↵erent places              imum. Running it from multiple starting points may
where these seem to occur. One is associated with high               help find the global maximum; however, in practice it
values of ↵N . Putting a hard limit of ↵N < 500 seems                is typically run once. If the order of the components
to avoid this problem (when I = 100). Another possi-                 is important, the components are typically relabeled
ble degenerate spot is when ↵k ⇡ 0 for some k. This                  after fitting the model according to a predefined rule
means that ⇡i,k will be essentially zero for all students.           (e.g., increasing values of µi,k ).
Adding .01 to all of the ↵k values in Equation 2 seems               Two packages are available for fitting non-hierarchical
to fix this problem.                                                 mixture models using the EM algorithm in R (R Core
                                                                     Team, 2014): FlexMix (Gruen & Leisch, 2008) and
                                                                     mixtools (Benaglia, Chauveau, Hunter, & Young,
⇡ i = (⇡i,1 , . . . , ⇡i,K ) ⇠ Dirichlet(↵1 +.01, . . . , ↵k +.01)   2009). These two packages take di↵erent approaches
                                                                     to how they deal with degenerate solutions. FlexMix
                                                                     will combine two mixture components if they get too
3    Estimation Algorithms
                                                                     close together or the probability of one component gets
                                                                     too small (by default, if ⇡i,k < .05). Mixtools, on the
There are generally two classes of algorithms used
                                                                     other hand, retries from a di↵erent starting point when
with both hierarchical and mixture models. The ex-
                                                                     the the EM algorithm converges on a degenerate so-
pectation maximization (EM) algorithm (Section 3.1)
                                                                     lution. If it exceeds the allowed number of retries, it
searches for a set of parameter values that maximizes
                                                                     gives up.
the log posterior distribution of the data (or if the prior
distribution is flat, it maximizes the log likelihood). It           Neither mixtools nor FlexMix provides standard er-

                                                                5
rors for the parameter estimates. The mixtools pack-       ter for the correctness of the algorithm, the most com-
age recommends using the bootstrap (resampling from        mon method is to go one parameter at a time and add
the data distribution) to calculate standard errors, and   a random o↵set (a step) to its value, accepting or re-
provides a function to facilitate this.                    jecting it according to the Metropolis rule. As this
                                                           distribution is essentially a random walk over the pa-
                                                           rameter space, this implementation of MCMC is called
3.2   Random-walk Metropolis Algorithm
                                                           random walk Metropolis (RWM). The step size is a
      (RWM; used by JAGS)
                                                           critical tuning parameter. If the average step size is
Geyer (2011) gives a tutorial summary of MCMC. The         too large, the value will be rejected nearly every cycle
basic idea is that a mechanism is found for construct-     and the autocorrelation will be high. If the step size
ing a Markov chain whose stationary distribution is the    is too small, the chain will move very slowly through
desired posterior distribution. The chain is run until     the space and the autocorrelation will be high. Gibbs
the analyst is reasonably sure it has reach the station-   sampling, where the step is chosen using a conjugate
ary distribution (these early draws are discarded as       distribution so the Metropolis-Hastings ratio always
burn-in). Then the the chain is run some more until it     accepts, is not necessarily better. Often the e↵ective
is believed to have mixed throughout the entire poste-     step size of the Gibbs sampler is small resulting in high
rior distribution. At this point it has reached pseudo-    autocorrelation.
convergence (Geyer calls this pseudo-convergence, be-      Most packages that use RWM do some adaptation on
cause without running the chain for an infinite length     the step size, trying to get an optimal rejection rate.
of time, there is no way of telling if some part of the    During this adaptation phase, the Markov chain does
parameter space was never reached.) At this point          not have the correct stationary distribution, so those
the mean and standard error of the parameters are          observations must be discarded, and a certain amount
estimated from the the observed mean and standard          of burn-in is needed after the adaptation finishes.
deviation of the parameter values in the MCMC sam-
ple.                                                       MCMC and the RWM algorithm were made popular
                                                           by their convenient implementation in the BUGS soft-
There are two sources of error in estimates made from      ware package (Thomas, Spiegelhalter, & Gilks, 1992).
the MCMC sample. The first arises because the ob-          With BUGS, the analyst can write down the model
served data are a sample from the universe of potential    in a form very similar to the series of equations used
observations. This sampling error would be present         to describe the model in Section 2, with a syntax de-
even if the posterior distribution could be computed       rived from the R language. BUGS then compiles the
exactly. The second is the Monte Carlo error that          model into pseudo-code which produces the Markov
comes from the estimation of the posterior distribu-       chain, choosing to do Gibbs sampling or random walk
tion with the Monte Carlo sample. Because the draws        Metropolis for each parameter depending on whether
from the Markov chain are not statistically indepen-       or not a convenient conjugate proposal distribution
dent,pthis Monte Carlo error does not fall at the rate     was available. The output could be exported in a
of 1/ R (where R is the number of Monte Carlo sam-         form that could be read by R, and the R package coda
ples). It is also related to the autocorrelation (corre-   (Plummer, Best, Cowles, & Vines, 2006) could be used
lation between successive draws) of the Markov chain.      to process the output. (Later, WinBUGS would build
The higher the autocorrelation, the lower the e↵ective     some of that output processing into BUGS.)
sample size of the Monte Carlo sample, and the higher
the Monte Carlo error.                                     Although BUGS played an important role in encourag-
                                                           ing data analysts to use MCMC, it is no longer actively
Most methods for building the Markov chain are based       supported. This means that the latest developments
on the Metropolis algorithm (see Geyer, 2011, for de-      and improvements in MCMC do not get incorporated
tails). A new value for one or more parameter is pro-      into its code. Rather than use BUGS, analysts are
posed and the new value is accepted or rejected ran-       advised to use one of the two successor software pack-
domly according to the ratio of the posterior distribu-    ages: OpenBUGS (Thomas, O’Hara, Ligges, & Sturtz,
tion and the old and new points, with a correction fac-    2006) or JAGS (just another Gibbs sampler; Plummer,
tor for the mechanism used to generate the new sam-        2012). The R package rjags allows JAGS to be called
ple. This is called a Metropolis or Metropolis-Hastings    from R, and hence allows R to be used as a scripting
update (the latter contains a correction for asymmet-      language for JAGS, which is important for serious an-
ric proposal distributions). Gibbs sampling is a special   alytic e↵orts. (Similar packages exist for BUGS and
case in which the proposal is chosen in such a way that    OpenBUGS.)
it will always be accepted.
As the form of the proposal distribution does not mat-

                                                      6
3.3   Hamiltonian Monte Carlo (HMC; used                    3.4   Parallel Computing and Memory Issues
      by Stan)
                                                            As most computers have multiple processors, paral-
                                                            lel computing can be used to speed up MCMC runs.
Hamiltonian Monte Carlo (HMC) (Neal, 2011) is a
                                                            Often multiple Markov chains are run and the results
variant on the Metropolis Algorithm which uses a dif-
                                                            are compared to assess pseudo-convergence and then
ferent proposal distribution than RWM. In HMC, the
                                                            combined for inference (Gelman & Shirley, 2011). This
current draw from the posterior is imagined to be a
                                                            is straightforward using the output processing package
small particle on a hilly surface (the posterior distri-
                                                            coda. It is a little bit trickier using the rstan package,
bution). The particle is given a random velocity and is
                                                            because many of the graphics require a full stanfit
allowed to move for several discrete steps in that direc-
                                                            object. However, the conversion from Stan to coda
tion. The movement follows the laws of physics, so the
                                                            format for the MCMC samples is straightforward.
particle gains speed when it falls down hills and loses
speed when it climbs back up the hills. In this manner      In the case of hierarchical mixture models, there is
a proposal is generated that can be a great distance        an even easier way to take advantage of multiple pro-
from the original starting point. The proposed point is     cesses. If the number of components, K, is unknown,
then accepted or rejected according to the Metropolis       the usual procedure is to take several runs with dif-
rule.                                                       ferent values of K and compare the fit. Therefore, if
                                                            4 processors were available, one could run all of the
The software package Stan (Stan Development Team,
                                                            chains for K = 2, one for K = 3, and one for K = 4,
2013) provides support for HMC. As with BUGS and
                                                            leaving one free to handle operating system tasks.
JAGS, the model of Section 2 is written in pseudo-
code, although this time the syntax looks more like         In most modern computers, the bottleneck is usually
C++ than R. Rather than translate the model into            not available CPU cycles, but available memory. For
interpreted code, Stan translates it into C++ then          running 3 chains with I = 100 students and R = 5000
compiles and links it with existing Stan code to run        MCMC samples in each chain, the MCMC sample can
the sampler. This has an initial overhead for the com-      take up to 0.5GB of memory! Consequently, it is crit-
pilation, but afterwards, each cycle of the sampler runs    ically important to monitor memory usage when run-
faster. Also, as HMC generally has lower autocorrela-       ning multiple MCMC runs. If the computer starts re-
tion than random walk Metropolis, smaller run lengths       quiring swap (disk-based memory) in addition to phys-
can be used, making Stan considerably faster than           ical memory, then running fewer processes will proba-
JAGS in some applications. A package rstan is avail-        bly speed up the computations.
able to link Stan to R, but because of the compilation
                                                            Another potential problem occurs when storing the re-
step, it requires that the user have the proper R de-
                                                            sult of each run between R sessions in the .RData file.
velopment environment set up.
                                                            Note that R attempts to load the entire contents of
HMC has more tuning parameters than random walk             that data file into memory when starting R. If there
Metropolis: the mass of the particle, the distribution      are the results of several MCMC runs in there, the
of velocities and the number of steps to take in each       .RData file can grow to several GB in size, and R can
direction must be selected to set up the algorithm.         take several minutes to load. (In case this problem
Stan uses a warm-up phase to do this adaptation. The        arises, it is recommended that you take a make a copy
recommended procedure is to use approximately 1/2           of the .RData after the data have been cleaned and all
the samples for warm-up as a longer warm-up produces        the auxiliary functions loaded but before starting the
lower autocorrelations when actively sampling.              MCMC runs, and put it someplace safe. If the .RData
                                                            file gets to be too large it can be simply replaced with
Stan has some interesting features that are not present
                                                            the backup.)
in BUGS or JAGS. For one, it does not require every
parameter to have a proper prior distribution (as long      In order to prevent the .RData file from growing
as the posterior is proper). It will simply put a uni-      unmanageably large, it recommended that the work
form prior over the space of possible values for any        space not be saved at the end of each run. Instead,
parameter not given a prior distribution. However,          run a block of code like this
using explicit priors has some advantages for the ap-
plication to student pause data. In particular, when        assign(runName,result1)
data for a new student become available, the poste-         outfile <-
rior parameters for the previous run can be input into           gzfile(paste(runName,"R","gz",sep="."),
Stan (or JAGS) and the original calibration model can                open="wt")
be reused to estimate the parameters for new student        dump(runName,file=outfile)
(Mislevy, Almond, Yan, & Steinberg, 1999).                  close(outfile)

                                                       7
after all computations are completed. Here result1             5. Identify the mixture components. Ideally, the
is a variable that gathers together the portion of the            Markov chains have visited all of the modes of the
results to keep, and runName is a character string that           posterior distribution, including the ones which
provides a unique name for the run.                               di↵er only by a permutation of the component la-
                                                                  bels. Section 4.1 describes how to permute the
Assume that all of the commands necessary to perform
                                                                  component labels are permuted so that the com-
the MCMC analysis are in a file script.R. To run this
                                                                  ponent labels are the same in each MCMC cycle.
from a command shell use the command:
                                                               6. Check pseudo-convergence. Several statistics and
R CMD BATCH --slave script.R                                      plots are calculated to see if the chains have
                                                                  reached pseudo-convergence and the sample size
This will run the R code in the script, using the .RData
                                                                  is adequate. If the sample is inadequate, then ad-
file in the current working directory, and put the out-
                                                                  ditional samples are collected (Section 4.3).
put into script.Rout. The --slave switch performs
two useful functions. First, it does not save the .RData       7. Draw Inferences. Summaries of the posterior dis-
file at the end of the run (avoiding potential memory             tribution for the the parameters of interest are
problems). Second, it suppresses echoing the script to            computed and reported. Note that JAGS o↵ers
the output file, making the output file easier to read.           some possibilities here that Stan does not. In par-
Under Linux and Mac OS X, the command                             ticular, JAGS can monitor just the cross-student
                                                                  parameters (↵0 , ↵N , µ0 , 0 , log(⌧ 0 ), and 0 ) for
nohup R CMD BATCH --slave script.R &                              a much longer time to check pseudo-convergence,
                                                                  and then a short additional run can be used to
runs R in the background. The user can log out of the
                                                                  draw inferences about the student specific param-
computer, but as long as the computer remains on, the
                                                                  eters, ⇡ i , µi and ⌧ i (for a considerable memory
process will continue to run.
                                                                  savings).

4      Model Estimation                                        8. Data point labeling. In mixture models, it is some-
                                                                  times of interest to identify which mixture com-
A MCMC analysis always follows a number of similar                ponent each observation Yi,j comes from (Sec-
steps, although the hierarhical mixture model requires            tion 4.5).
a couple of additional steps. Usually, it is best to write     9. Calculate model fit index. If the goal is to compare
these as a script because the analysis will often need            models for several di↵erent values of K, then a
to be repeated several times. The steps are as follows:           measure of model fit such as DIC or WAIC should
                                                                  be calculated (Section 5).
    1. Set up parameters for the run. This includes
       which data to analyze, how many mixture compo-         4.1   Component Identification
       nents are required (i.e., K), how long to run the
       Markov chain, how many chains to run, what the         The multiple modes in the posterior distribution for
       prior hyperparameters are.                             mixture models present a special problem for MCMC.
                                                              In particular, it is possible for a Markov chain to get
    2. Clean the data. In the case of hierarchical mixture    stuck in a mode corresponding to a single component
       models, student data vectors which are too short       labeling and never mix to the other component la-
       should be eliminated. Stan does not accept miss-       belings. (This especially problematic when coupled
       ing values, so NAs in the data need to be replaced     with the common practice of starting several paral-
       with a finite value. For both JAGS and Stan the        lel Markov chains.) Früthwirth-Schnatter (2001) de-
       data are bundled with the prior hyperparameters        scribes this problem in detail.
       to be passed to the MCMC software.
                                                              One solution is to constrain the parameter space to fol-
    3. Set initial values. Initial values need to be chosen   low some canonical ordering (e.g., µi,1  µi,2  · · · 
       for each Markov chain (see Section 4.2).               µi, K). Stan allows parameter vectors to be specified
                                                              as ordered, that is restricted to be in increasing order.
    4. Run the Markov Chain. For JAGS, this consists
                                                              This seems tailor-made for the identification issue. If
       of four substeps: (a) run the chain in adaptation
                                                              an order based on µ0 is desired, the declaration:
       mode, (b) run the chain in normal mode for the
       burn-in, (c) set up monitors on the desired param-       ordered[K] mu0;
       eters, and (d) run the chain to collect the MCMC
       sample. For Stan, the compilation, warm-up and         enforces the ordering constraint in the MCMC sam-
       sampling are all done in a single step.                pler. JAGS contains a sort function which can achieve

                                                         8
a similar e↵ect. Früthwirth-Schnatter (2001) rec-              for each Markov chain c: do
ommends against this solution because the resulting               for each MCMC sample r in Chain c: do
Markov chains often mix slowly. Some experimenta-                    Find a permutation of indexes k10 , . . . , kK       0
                                                                                                                             so
tion with the ordered restriction in Stan confirmed this             that !c,r,k10  · · ·  !c,r,k10 .
finding; the MCMC runs took longer and did not reach                 for ⇠ in the Level 2 parameters
pseudo-convergence as quickly when the ordered re-                   {↵0 , µ0 , 0 , ⌧ 0 , 0 }: do
striction was not used.                                                 Replace ⇠ c,r with (⇠c,r,k10 , . . . , ⇠c,r,kK
                                                                                                                     0 ).

                                                                     end for{Level 2 parameter}
Früthwirth-Schnatter (2001) instead recommends let-                 if inferences about Level 1 parameters are de-
ting the Markov chains mix across all of the possi-                  sired then
ble modes, and then sorting the values according to                     for ⇠ in the Level 1 parameters {⇡, µ, ⌧ , }
the desired identification constraints post hoc. JAGS                   do
provides special support for this approach, o↵ering a                     for each Level 2 unit (student), i do
special distribution dnormmix for mixtures of normals.                       Replace                        ⇠ c,r,i        with
This uses a specially chosen proposal distribution to                        (⇠c,r,i,k10 , . . . , ⇠c,r,i,kK
                                                                                                           0 ).
encourage jumping between the multiple modes in the                       end for{Level 1 unit}
posterior. The current version of Stan does not pro-                    end for{Level 1 parameter}
vide a similar feature.                                              end if
One result of this procedure is that the MCMC sample              end for{Cycle}
will have a variety of orderings of the components; each        end for{Markov Chain}
draw could potentially have a di↵erent labeling. For
example, the MCMC sample for the parameter µi,1                  Figure 3: Component Identification Algorithm
will in fact be a mix of samples from µi,1 , . . . , µi,K .
The average of this sample is not likely to be a good
                                                              reach convergence more quickly. Gelman et al. (2013)
estimate of µi,1 . Similar problems arise when looking
                                                              recommend starting at the maximum likelihood esti-
at pseudo-convergence statistics which are related to
                                                              mate when that is available. Fortunately, o↵-the shelf
the variances of the parameters.
                                                              software is available for finding the maximum likeli-
To fix this problem, the component labels need to be          hood estimate of the individual student mixture mod-
permuted separately for each cycle of the MCMC sam-           els. These estimates can be combined to find starting
ple. With a one-level mixture model, it is possible           values for the cross-student parameters. The initial
to identify the components by sorting on one of the           values can be set by the following steps:
student-level parameters, ⇡i,k , µi,k or ⌧i,k . For the
hierarchical mixture model, one of the cross-student           1. Fit a separate mixture model for students using
parameters, ↵0,k , µ0,k , or ⌧0,k , should be used. Note          maximum likelihood. The package mixtools is
that in the hierarchical models some of the student-              slightly better for this purpose than FlexMix as
level parameters might not obey the constraints. In               it will try multiple times to get a solution with
the case of the pause time analysis, this is acceptable           the requested number of components. If the EM
as some students may behave di↵erently from most of               algorithm does not converge in several tries, set
their peers (the ultimate goal of the mixture modeling            the values of the parameters for that student to
is to identify students who may benefit from di↵erent             NA and move on.
approaches to instruction). Choosing an identification
rule involves picking which parameter should be used           2. Identify Components. Sort the student-level pa-
for identification at Level 2 (cross-student) and using           rameters, ⇡ i , µi , i and ⌧ i according to the de-
the corresponding parameter is used for student-level             sired criteria (see Section 4.1.
parameter identification.
                                                               3. Calculate the Level 2 initial values. Most of the
Component identification is straightforward. Let ! be             cross-student parameters are either means or vari-
the parameter chosen for model identification. Fig-               ances of the student-level parameters. The initial
ure 3 describes the algorithm.                                    value for ↵0 is the mean of ⇡ i across the students
                                                                  i (ignoring NAs). The initial values for µ0 and 0
4.2   Starting Values                                             are the mean and standard deviation of µi . The
                                                                  initial values for log(⌧ 0 ) and 0 are the mean and
Although the Markov chain should eventually reach                 standard deviation of log(⌧ i ).
its stationary distribution no matter where it starts,
starting places that are closer to the center of the dis-      4. Impute starting values for maximum likelihood es-
tribution are better in the sense that the chain should           timates that did not converge. These are the NAs

                                                         9
      from Step 1. For each student i that did not          the size of the Monte Carlo error could still be an is-
      converge in Step 1, set ⇡ i = ↵0 , µi = µ0 and        sue. Both rstan and coda compute an e↵ective sample
      ⌧ i = ⌧ 0.                                            size for the Monte Carlo sample. This is the size of
                                                            a simple random sample from the posterior distribu-
 5. Compute initial values for ✓ i and ⌘ i . These can      tion that would have the same Monte Carlo error as
    be computed by solving Equations 7 and 9 for ✓ i        the obtained dependent sample. This is di↵erent for
    and ⌘ i .                                               di↵erent parameters. If the e↵ective sample size is too
                                                            small, then additional samples are needed.
 6. Set the initial value of ↵N = I. Only one param-
    eter was not given an initial value in the previous     Note that there are K(3I + 5) parameters whose con-
    steps, that is the e↵ective sample size for ↵. Set      vergence needs to be monitored. It is difficult to
    this equal to the number of Level 2 units, I. The       achieve pseudo-convergence for all of these parame-
    initial values of ↵ can now be calculated as well.      ters, and exhausting to check them all. A reasonable
                                                            compromise seems to be to monitor only the 5K cross-
This produces a set of initial values for a single chain.   student parameters, ↵, µ0 , 0 , log(⌧ 0 ) and 0 . JAGS
Common practice for checking for pseudo-convergence         makes this process easier by allowing the analyst to
involves running multiple chains and seeing if they ar-     pick which parameters to monitor. Using JAGS, only
rive the same stationary distribution. Di↵erent start-      the cross-student parameters can be monitored dur-
ing values for the second and subsequent chains can         ing the first MCMC run, and then a second shorter
be found by sampling some fraction of the Level 1           sample of student-level parameters can be be obtained
units from each Level 2 unit. When Ji is large, = .5        through an additional run. The rstan package always
seems to work well. When Ji is small, = .8 seems            monitors all parameters, including the ✓ i and ⌘i pa-
to work better. An alternative would be to take a           rameters which are not of particular interest.
bootstrap sample (a new sample of size Ji drawn with        When I and Ji are large, a run can take several hours
replacement).                                               to several days. As the end of a run might occur dur-
                                                            ing the middle of the night, it is useful to have a au-
4.3    Automated Convergence Criteria                       tomated test of convergence. The rule I have been
                                                                                            b are less than a certain
                                                            using is to check to see if all R
Gelman and Shirley (2011) describe recommended              maximum (by default 1.1) and if all e↵ective sample
practice for assessing pseudo-convergence of a Markov       sizes are greater than a certain minimum (by default
chain. The most common technique is to run several          100) for all cross-student parameters. If these criteria
Markov chains and look at the ratio of within-chain         are not met, new chains of twice the length of the old
variance to between-chain variance. This ratio is called    chain can be run. The traceplots are saved to a file for
 b and it comes in both a univariate (one parameter
R,                                                          later examination, as are some other statistics such
at a time) and a multivariate version. It is natural to     as the mean value of each chain. (These traceplots
look at parameters with the same name together (e.g.,       and outputs are available at the web site mentioned
µ0 , 0 , ⌧ 0 , and 0 ). Using the multivariate version of   above.) It is generally not worthwhile to restart the
Rb with ↵0 and ⇡ i requires some care because calcu-        chain for a third time. If pseudo-convergence has not
lating the multivariate R  b involves inverting a matrix    been achieved after the second longer run, usually a
that does not have full rank when the parameters are        better solution is to reparameterize the model.
restricted to a simplex. The work-around is to only
look at the first K 1 values of these parameters.           It is easier to extend the MCMC run using JAGS than
                                                            using Stan. In JAGS, calling update again samples
The other commonly used test for pseudo-convergence         additional points from the Markov chain, starting at
is to look at a trace plot for each parameter: a time       the previous values, extending the chains. The current
series plot of the sampled parameter value against the      version of Stan (2.2.0)1 saves the compiled C++ code,
MCMC cycle number. Multiple chains can be plot-             but not the warm-up parameter values. So the sec-
ted on top of each other using di↵erent colors. If the      ond run is a new run, starting over from the warm-up
chain is converged and the sample size is adequate, the     phase. In this run, both the warm-up phase and the
traceplot should look like white noise. If this is not      sampling phase should be extended, because a longer
the case, a bigger MCMC sample is needed. While
it is not particularly useful for an automated test of          1
                                                                  The Stan development team have stated that the abil-
pseudo-convergence, the traceplot is very useful for di-    ity to save and reuse the warm-up parameters are a high
agnosing what is happening when the chains are not          priority for future version of Stan. Version 2.3 of Stan was
converging. Some sample traceplots are given below.         released between the time of first writing and publication
                                                            of the paper, but the release notes do not indicate that the
Even if the chains have reached pseudo-convergence,         restart issue was addressed.

                                                      10
warm-up may produce a lower autocorrelation. In ei-         timated number of components) was odd or vice versa.
ther case, the data from both runs can be pooled for        The values of Rb for the constrained Stan model were
inferences.                                                 substantially worse, basically confirming the findings
                                                            of Früthwirth-Schnatter (2001).
4.4   A simple test                                         The e↵ective sample sizes were much better for Stan
                                                            and HMC. For the K 0 = 2 case, the smallest e↵ective
To test the scripts and the viability of the model, I       sample size ranged from 17–142, while for the uncon-
created some artificial data consistent with the model      strained Stan model it ranged from 805–3084. Thus,
in Section 2. To do this, I took one of the data sets       roughly 3 times the CPU time is producing a 5-fold
use by Li (2013) and ran the initial parameter algo-        decrease in the Monte Carlo error.
rithm described in Section 4.2 for K = 2, 3 and 4.
I took the cross-student parameters from this exer-         The following graphs compare the JAGS and Stan
cise, and generated random parameters and data sets         (Unconstrained model) outputs for four selected cross-
for 10 students. This produced three data sets (for         student parameters. The output is from coda and pro-
K = 2, 3 and 4) which are consistent with the model         vides a traceplot on the left and a density plot for the
and reasonably similar to the observed data. All            corresponding distribution on the right. Recall that
three data sets and the sample parameters are avail-        the principle di↵erence between JAGS and Stan is the
able on the web site, http://pluto.coe.fsu.edu/             proposal distribution: JAGS uses a random walk pro-
mcmc-hierMM/.                                               posal with a special distribution for the mixing param-
                                                            eters and Stan uses the Hamiltonian proposal. Also,
For each data set, I fit three di↵erent hierarchical mix-   note that for Stan the complete run of 15,000 samples
ture models (K 0 = 2, 3 and 4, where K 0 is the value       is actually made up of a sort run of length 5,000 and
used for estimation) using three di↵erent variations        a longer run of length 10,000; hence there is often a
of the algorithm: JAGS (RWM), Stan (HMC) with               discontinuity at iteration 5,000.
the cross-student means constrained to be increasing,
and Stan (HMC) with the means unconstrained. For            Although not strictly speaking a parameter, the de-
the JAGS and Stan unconstrained run the chains were         viance (twice the negative log likelihood) can easily
sorted (Section 4.1) on the basis of the cross-students     be monitored in JAGS. Stan does not o↵er a deviance
means (µ0 ) before the convergence tests were run. In       monitor, but instead monitors the log posterior, which
each case, the initial run of 3 chains with 5,000 ob-       is similar. Both give an impression of how well the
servations was followed by a second run of 3 chains         model fits the data. Figure 4 shows the monitors for
with 10,000 observations when the first one did not         the deviance or log posterior. This traceplot is nearly
converge. All of the results are tabulated at the web       ideal white noise, indicating good convergence for this
                                                            value. The value of R b is less than the heuristic thresh-
site, including links to the complete output files and
all traceplots.                                             old of 1.1 for both chains, and the e↵ective sample size
                                                            is about 1/6 of the 45,000 total Monte Carlo observa-
Unsurprisingly, the higher the value of K 0 (number of      tions.
components in the estimated model), the longer the
run took. The runs for K 0 = 2 to between 10–15 min-        Figure 5 shows the grand mean of the log pause times
utes in JAGS and from 30–90 minutes in Stan, while          for the first component. The JAGS output (upper
the runs for K 0 = 4 too just less than an hour in JAGS     row) shows a classic slow mixing pattern: the chains
and between 2–3 hours in Stan. The time di↵erence           are crossing but moving slowly across the support of
between JAGS and Stan is due to the di↵erence in            the parameter space. Thus, for JAGS even though
                                                            the chains have nearly reached pseudo-convergence (R  b
the algorithms. HMC uses a more complex proposal
distribution than RWM, so naturally each cycle takes        is just slightly greater than 1.1), the e↵ective sample
longer. The goal of the HMC is to trade the longer          size is only 142 observations. A longer chain might
run times for a more efficient (lower autocorrelation)      be needed for a good estimate of this parameter. The
sample, so that the total run time is minimized. The        Stan output (bottom row) looks much better, and the
constrained version of Stan seemed to take longer than      e↵ective sample size is a respectable 3,680.
the unconstrained.                                          The Markov chains for ↵0,1 (the proportion of pause
In all 27 runs, chain lengths of 15,000 cycles were not     times in the first component) have not yet reached
sufficient to reach pseudo-convergence. The maximum         pseudo convergence, but they are close (a longer run
                                                b varied
(across all cross-student parameters) value for R           might get them there). Note the black chain often
between 1.3 and 2.2, depending on the run. It seemed        ranges above the values in the red and green chains
to be slightly worse for cases in which K (number of        in the JAGS run (upper row). This is an indication
components for data generation) was even and K 0 (es-       that the chains may be stuck in di↵erent modes; the

                                                      11
                           Figure 4: Deviance (JAGS) and Log Posterior (Stan) plots
Note: To reduce the file size of this paper, this is a bitmap picture of the traceplot. The original pdf version is
available at http://pluto.coe.fsu.edu/mcmc-hierMM/DeviancePlots.pdf.


                                                     12
                                  Figure 5: Traceplots and Density plots for µ0,1
Note: To reduce the file size of this paper, this is a bitmap picture of the traceplot. The original pdf version is
available at http://pluto.coe.fsu.edu/mcmc-hierMM/mu0Plots.pdf.


                                                     13
                                  Figure 6: Traceplots and Density plots for ↵0,1
Note: To reduce the file size of this paper, this is a bitmap picture of the traceplot. The original pdf version is
available at http://pluto.coe.fsu.edu/mcmc-hierMM/alpha0Plots.pdf.


                                                     14
shoulder on the density plot also points towards mul-                 4.6   Additional Level 2 Units (students)
tiple modes. The plot for Stan looks even worse, after
iteration 5,000 (i.e., after the chain was restarted), the            In the pause time application, the goal is to be able
red chain has moved into the lower end of the support                 to estimate fairly quickly the student-level parameters
of the distribution. This gives a large shoulder in the               for a new student working on the same essay. As the
corresponding density plot.                                           MCMC run is time consuming, a fast method for an-
                                                                      alyzing data from new student would be useful.
The chains for the variance of the log precisions for the
first component, 0,1 (Figure 7), are far from pseudo-                 One approach would be to simply treat Equations 2,
convergence with R b = 1.73. In the JAGS chains (top                  3 and 4 as prior distributions, plugging in the point
row) the black chain seems to prefer much smaller val-                estimates for the cross-student parameters as hyper-
ues for this parameter. In the Stan output, the green                 parameters, and find the posterior mode of the ob-
chain seems to be restricted to the smaller values in                 servations from the new data. There are two problems
both the first and second run. However, between the                   with this approach. First, the this method ignores any
first and second run the black and red chains swap                    remaining uncertainty about the cross-student param-
places. Clearly this is the weakest spot in the model,                eters. Second, the existing mixture fitting software
and perhaps a reparameterization here would help.                     packages (FlexMix and mixtools) do not provide any
                                                                      way to input the prior information.
Looking at Figures 6 and 7 together, a pattern
emerges. There seems to be (at least) two distinct                    A better approach is to do an additional MCMC run,
modes. One mode (black chain JAGS, green chain in                     using the posterior distributions for the cross-student
Stan) has higher value for ↵0,1 and a lower value for                 parameters from the previous run as priors from the
                                                                      previous run (Mislevy et al., 1999). If the hyperpa-
  0,1 , and the other one has the reverse pattern. This
is an advantage of the MCMC approach over an EM                       rameters for the cross-student priors are passed to
approach. In particular, if the EM algorithm was only                 the MCMC sampler as data, then the same JAGS or
run once from a single starting point, the existence of               Stan model can be reused for additional student pause
the other mode might never be discovered.                             records. Note that in addition to the MCMC code
                                                                      Stan provides a function that will find the maximum
                                                                      of the posterior. Even if MCMC is used at this step,
4.5   Data Point Labeling
                                                                      the number of Level 2 units, I, will be smaller and the
For some applications, the analysis can stop after es-                chains will reach pseudo-convergence more quickly.
timating the parameters for the students units, ⇡ i , µi              If the number of students, I, is large in the original
and ⌧ i (or i as desired). In other applications it is                sample (as is true in the original data set), then a
interesting to assign the individual pauses to compo-                 sample of students can be used in the initial model
nents, that is, to assign values to Zi,j .                            calibration, and student-level parameters for the oth-
When ⇡ i , µi and i are known, it is simple to calculate              ers can be estimated later using this method. In par-
                             (c,r)   (c,r)       (c,r)                ticular, there seems to be a paradox estimating mod-
pi,j,k = Pr(Zi,j = k). Let ⇡ i , µi        and i       be
                                                                      els using large data sets with MCMC. The larger the
the draws of the Level 1 parameters from Cycle r of
                                                                      data, the slower the run. This is not just because
Chain c (after the data have been sorted to identify
                                                                      each data point needs to be visited within each cycle
the components). Now define,
                                                                      of each Markov chain. It appears, especially with hi-
                                                  (c,r)               erarchical models, that more Level 2 units cause the
                  (c,r)    (c,r)       Yi,j     µi,k
               qi,j,k = ⇡i,k       (                      ),   (17)   chain to have higher autocorrelation, meaning larger
                                              (c,r)
                                              i,k                     runs are needed to achieve acceptable Monte Carlo er-
                                                                      ror. For the student keystroke logs, the “initial data”
          (c,r)       PK
                     (c,r)                                            used for calibration was a sample of 100 essays from
and set Qi,j      =
               k=1 qi,j,k . The distribution for Zi,j
                               (c,r)     (c,r)  (c,r)                 the complete collection. There is a obvious trade-o↵
for MCMC draw (c, r) is then pi,j,k = qi,j,k /Qi,j .
                                                                      between working with smaller data sets and being able
The MCMC estimate for pi,j,k is just the aver-
                                  (c,r)
                                                                      to estimate rare events: I = 100 seemed like a good
age over all the MCMC draws of pi,j,k , that is                       compromise.
 1
   PC PR        (c,r)
CR   c=1   r=1 pi,j,k .

If desired, Zi,j can be estimated as the maximum over                 5     Model Selection
k of pi,j,k , but for most purposes simply looking at the
probability distribution pi,j provides an adequate, if                A big question in hierarchical mixture modeling is
not better summary of the available information about                 “What is the optimal number of components, K?”
the component from which Yi,j was drawn.                              Although there is a way to add a prior distribution

                                                                 15
                                  Figure 7: Traceplots and Density plots for 0,1
Note: To reduce the file size of this paper, this is a bitmap picture of the traceplot. The original pdf version is
available at http://pluto.coe.fsu.edu/mcmc-hierMM/gamma0Plots.pdf.


                                                     16
over K and learn this as part of the MCMC pro-              Gelman et al. (2013) advocate using a new measure of
cess, it is quite tricky and the number of parameters       model selection called WAIC. WAIC makes two sub-
varies for di↵erent values of K. (Geyer, 2011, describes    stitutions in the definition of AIC above. First, it uses
the Metropolis-Hastings-Green algorithm which is re-        D(⌦) in place of D(⌦). Second, it uses a new way of
quired for variable dimension parameter spaces). In         calculating the dimensionality of the model. There are
practice, analysts often believe that the optimal value     two possibilities:
of K is small (less than 5), so the easiest way to de-                      Ji
                                                                          I X
                                                                          X
termine a value of K is to fit separate models for
                                                            pWAIC1 = 2              (log E[Qi,j (⌦)]   E[log Qi, j(⌦)]) ,
K = 2, 3, 4, . . . and compare the fit of the models.
                                                                          i=1 j=1
Note that increasing K should always improve the fit                                                              (21)
of the model, even if the extra component is just used                      Ji
                                                                          I X
                                                                          X
to trivially fit one additional data point. Consequently,   pWAIC2 = 2              Var(Qi,j (⌦)) .               (22)
analysts use penalized measures of model fit, such as                     i=1 j=1
AIC to chose among the models.
                                                            The expectation and variance are theoretically defined
Chapter 7 of Gelman et al. (2013) gives a good              over the posterior distribution and are approximated
overview of the various information criteria, so the        using the MCMC sample. The final expression for
current section will provide only brief definitions. Re-    WAIC is:
call Equation 5 that described the log likelihood of
the complete data Y as a function of the complete                       WAIC = D(⌦) + 2 ⇤ mWAIC .                 (23)
parameter ⌦. By convention, the deviance is twice
the negative log likelihood: D(⌦) = 2L(Y|⌦). The            For all of the model fit statistics discussed here: AIC,
lower the deviance, the better the fit of the data to the   WAIC1 and WAIC2 , the model with the lowest value
model, evaluated at that parameter. As students are         is the best. One way to chose the best value of K is to
independent given the cross-student parameters, and         calculate all of these statistics for multiple values of K
pause times are independent given the student-level         and choose the value of K which has the lowest value
                e for some value of the parameters
parameters, D(⌦),                                           for all 5 statistics. Hopefully, all of the statistics will
e
⌦ can be written as a sum:                                  point at the same model. If there is not true, that is
                                                            often evidence that the fit of two di↵erent models is
                        Ji
                      I X
                      X
               e =                        e ,               too close to call.
             D(⌦)               log Qi,j (⌦)        (18)
                      i=1 j=1                               In the 27 runs using the data from known values of K,
                                                            simply using the lowest WAIC value was not sufficient
            e is the likelihood of data point Yi,j , that
where Qi,j (⌦)                                              to recover the model. As the Markov chains have not
is:                                                         yet reached pseudo-convergence, the WAIC values may
                      Xk                                    not be correct, but there was fairly close agreement on
                 e =              Yi,j µ gi,k
           Qi,j (⌦)       g
                          ⇡ i,k (             ).     (19)   both WAIC1 and WAIC2 for both the JAGS and Stan
                                      g
                                      i,k
                      k=1                                   (unconstrained) models. However, the values of both
Note that the cross-student parameters drop out of          WAIC1 and WAIC2 were also fairly close for all values
these calculations.                                         of K 0 (number of estimated components). For example
                                                            when K = 2 the WAIC1 value was 1132, 1132 and
The Akaike information criterion (AIC) takes the de-        1135 for K 0 = 2, 3 and 4 respectively when estimated
viance as a measure of model fit and penalizes it for       with JAGS and 1132, 1131, and 1132 when estimated
                                           b be the
the number of parameters in the model. Let ⌦                with Stan. The results for WAIC2 were similar. It
maximum likelihood estimate of the parameter. Then          appears as if the common practice of just picking the
AIC is defined as:                                          best value of WAIC to determine K is not sufficient.
                         b +2⇤m                             In particular, it may be possible to approximate the
                 AIC = D(⌦)                         (20)
                                                            data quite well with a model which has an incorrect
where m = K(2IK + 4) + (K 1)(I + 1) + 1 is the              value of K.
number of free parameters in the model. Note that
the MCMC run does not provide a maximum likeli-             6    Conclusions
hood estimate. A pseudo-AIC can be computed by
substituting ⌦, the posterior mean of the parameters        The procedure described above is realized as code in
⌦. Note that ⌦ must be calculated after the parame-         R, Stan and JAGS and available for free at http://
ters in each MCMC cycle are permuted to identify the        pluto.coe.fsu.edu/mcmc-hierMM/. Also available,
components.                                                 are the simulated data for trying out the models.

                                                      17
These scripts contain the functions necessary to auto-      for large data sets indicates that there may be addi-
matically set starting parameters for these problems,       tional transformation of this model that are necessary
identify the components, and to calculate WAIC model        to make it work properly. Hopefully, the publication of
fit statistics.                                             the source code will allow other to explore this model
                                                            more fully.
The scripts seem to work fairly well for a small number
of students units (5  I  10). With this sample size,
although JAGS has the faster run time, Stan has a           References
larger e↵ective sample size. Hamiltonian Monte Carlo.
With large sample sizes I = 100, Ji ⇡ 100 even the          Almond, R. G., Deane, P., Quinlan, T., Wagner,
HMC has high autocorrelation and both JAGS and                    M., & Sydorenko, T. (2012). A prelimi-
Stan are extremely slow. Here the ability of JAGS to              nary analysis of keystroke log data from a
allow the user to selectively monitor parameters and              timed writing task (Research Report No.
to extend the chains without starting over allows for             RR-12-23). Educational Testing Service. Re-
better memory management.                                         trieved from http://www.ets.org/research/
                                                                  policy research reports/publications/
The code supplied at the web site solves two key tech-
                                                                  report/2012/jgdg
nical problems: the post-run sorting of the mixture
                                                            Benaglia, T., Chauveau, D., Hunter, D. R., & Young,
components (Früthwirth-Schnatter, 2001) and the cal-
                                                                  D. (2009). mixtools: An R package for ana-
culation of the WAIC model-fit indexes. Hopefully,
                                                                  lyzing finite mixture models. Journal of Sta-
this code will be useful for somebody working on a
                                                                  tistical Software, 32 (6), 1–29. Retrieved from
similar application.
                                                                  http://www.jstatsoft.org/v32/i06/
The model proposed in this paper has two problems,          Brooks, S., Gelman, A., Jones, G. L., & Meng, X.
even with data simulated according to the model. The              (Eds.). (2011). Handbook of markov chain monte
first is the slow mixing. This appears to be a prob-              carlo. CRC Press.
lem with multiple modes, and indicates some possible        Deane, P. (2012). Rethinking K-12 writing assessment.
non-identifiability in the model specifications. Thus,            In N. Elliot & L. Perelman (Eds.), Writing as-
further work is needed on the form of the model. The              sessment in the 21st century. essays in honor of
second problem is that the WAIC test does not ap-                 Edward M. white (pp. 87–100). Hampton Press.
pear to be sensitive enough to recover the number of        Früthwirth-Schnatter, S. (2001). Markov chain
components in the model. As the original purpose of               Monte Carlo estimation of classical and dynamic
moving to the hierarchical mixture model was to dis-              switching and mixture models. Journal of the
cover if there were rare components that could not be             American Statistical Association, 96 (453), 194–
detected in the single-level hierarchical model, the cur-         209.
rent model does not appear to be a good approach for        Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B.,
that purpose.                                                     Vehtari, A., & Rubin, D. B. (2013). Bayesian
In particular, returning to the original application of           data analysis (3rd ed.). Chapman and Hall. (The
fitting mixture models to student pauses while writing,           third edition is revised and expanded and has
the hierarchical part of the model seems to be creat-             material that the earlier editions lack.)
ing as many problems as it solves. It is not clearly        Gelman, A., & Shirley, K. (2011). Inference from
better than fitting the non-hierarchical mixture model            simulations and monitoring convergence. In
individually to each student. Another problem it has              S. Brooks, A. Gelman, G. L. Jones, & X. Meng
for this application is the assumption that each pause            (Eds.), Handbook of markov chain monte carlo
is independent. This does not correspond with what                (pp. 163–174). CRC Press.
is known about the writing process: typically writ-         Geyer, C. J.       (2011).    Introduction to Markov
ers spend long bursts of activities simply doing tran-            chain Monte Carlo. In S. Brooks, A. Gelman,
scription (i.e., typing) and only rarely pause for higher         G. L. Jones, & X. Meng (Eds.), Handbook of
level processing (e.g., spelling, grammar, word choice,           markov chain monte carlo (pp. 3–48). CRC
planning, etc). In particular, rather than getting this           Press.
model to work for this application, it may be better to     Gruen, B., & Leisch, F. (2008). FlexMix version
look at hidden Markov models for individual student               2: Finite mixtures with concomitant variables
writings.2                                                        and varying and constant parameters. Journal
                                                                  of Statistical Software, 28 (4), 1–35. Retrieved
The slow mixing of the hierarchical mixture model                 from http://www.jstatsoft.org/v28/i04/
                                                            Li, T. (2013). A Bayesian hierarchical mixture ap-
   2
       Gary Feng and Paul Deane, private communication.           proach to model timing data with application to

                                                      18
      writing assessment (Unpublished master’s the-
      sis). Florida State University.
McLachlan, G., & Krishnan, T. (2008). The EM al-
      gorithm and extensions (2nd ed.). Wiley.
McLachlan, G., & Peel, D. (2000). Finite mixture
      models. Wiley.
Mislevy, R. J., Almond, R. G., Yan, D., & Stein-
      berg, L. S. (1999). Bayes nets in educational
      assessment: Where the numbers come from. In
      K. B. Laskey & H. Prade (Eds.), Uncertainty in
      artificial intelligence ’99 (p. 437-446). Morgan-
      Kaufmann.
Neal, R. M. (2011). MCMC using Hamiltonian dynam-
      ics. In MCMCHandbook (Ed.), (pp. 113–162).
Plummer, M. (2012, May). JAGS version 3.2.0
      user manual (3.2.0 ed.) [Computer software
      manual]. Retrieved from http://mcmc-jags
      .sourceforge.net/
Plummer, M., Best, N., Cowles, K., & Vines, K.
      (2006). Coda: Convergence diagnosis and out-
      put analysis for mcmc. R News, 6 (1), 7–11.
      Retrieved from http://CRAN.R-project.org/
      doc/Rnews/
R Core Team. (2014). R: A language and environ-
      ment for statistical computing [Computer soft-
      ware manual]. Vienna, Austria. Retrieved from
      http://www.R-project.org/
Stan Development Team. (2013). Stan: A C++ library
      for probability and sampling (Version 2.2.0 ed.)
      [Computer software manual]. Retrieved from
      http://mc-stan.org/
Thomas, A., O’Hara, B., Ligges, U., & Sturtz, S.
      (2006). Making BUGS open. R News, 6 , 12-17.
      Retrieved from http://www.rni.helsinki.fi/
      ~boh/publications/Rnews2006-1.pdf
Thomas, A., Spiegelhalter, D. J., & Gilks, W. R.
      (1992).       BUGS: A program to perform
      Bayesian inference using Gibbs sampling. In
      J. M. Bernardo, J. O. Berger, A. P. Dawid, &
      A. F. M. Smith (Eds.), Bayesian statistics 4. (pp.
      837–842). Clarendon Press.

Acknowledgments

Tingxuan Li wrote the first version of much of the
R code as part of her Master’s thesis. Paul Deane
and Gary Feng of Educational Testing Service have
been collaborators on the essay keystroke log analysis
work which is the inspiration for the present research.
Various members of the JAGS and Stan users groups
and development teams have been helpful in answer-
ing emails and postings related to these models, and
Kathy Laskey and several anonymous reviewers have
given me useful feedback (to which I should have paid
closer attention) on the draft of this paper.

                                                     19