Using POMDPs to Forecast Kindergarten Students' Reading Comprehension


Russell G. Almond
Educational Psychology and Learning Systems
Florida State University
Tallahassee, FL 32306
ralmond@fsu.edu

Umit Tokac
Educational Psychology and Learning Systems
Florida State University
Tallahassee, FL 32306
ut08@my.fsu.edu

Stephanie Al Otaiba*
Department of Teaching and Learning
Southern Methodist University
Dallas, TX 75275
salotaiba@smu.edu

* Some of the work took place while she was at the Florida Center for Reading Research, Tallahassee, FL.
Abstract

Summative assessment of student abilities typically comes at the end of the instructional period, too late for educators to use the information for planning instruction. This paper explores the possibility of using Hierarchical Linear Models to forecast students' end-of-year performance. Because these models are closely related to partially observed Markov decision processes (POMDPs), they should support extensions to instructional planning to meet educational goals. Despite the new notation, the POMDP models are subject to a familiar problem from the educational context: scale identifiability. This paper describes how this problem manifests itself and looks at one potential solution.


1   INTRODUCTION

There is a long tradition in education of separating instruction and assessment: summative assessment of what a student learns comes at the end of the unit/semester/year. As limited time is allocated for assessment, such assessments are typically limited in their reliability (accuracy of measurement) and content validity (coverage of the targeted knowledge, skills, and abilities). Because summative assessment comes at the end of instruction, instructors are not able to make changes to their instruction to maximize student learning (Almond, 2010).

Bennett (2007) suggested breaking the summative assessment into four or six periodic assessments. First, spreading the cost (student time taken away from direct instruction) over multiple measurement occasions allows for longer testing, providing both greater content coverage and reliability. Moreover, a proper model for student growth allows forecasting of the students' eventual status at the end of the year. Consequently, teachers and administrators can form plans for students which maximize learning outcomes, and can identify students for whom the goals are unreachable so that they can be given special instruction. In this sense, the periodic assessments play a role somewhere between traditional summative assessment and formative assessment, that is, assessing student learning for the purpose of improving instruction (Black & Wiliam, 1998; Wiggins, 1998; Pellegrino, Glaser, & Chudowsky, 2001).

Almond (2007) noted that the forecasting could be done using a partially observed Markov decision process (POMDP; Boutilier, Dean, & Hanks, 1999): the latent variables describing student proficiency form an unobserved Markov process, and the periodic assessments provide observable evidence about the state of those latent variables. The instructional activities chosen between time points are the measurement space, and in fact, the students' response to instruction often provides important clues about their proficiency and specific learning problems (Marcotte & Hintze, 2009).

Almond (2009) notes the similarity between POMDPs and other frameworks more commonly used in education, such as latent growth modeling (Singer & Willett, 2003) and hierarchical linear modeling (HLM; Raudenbush & Bryk, 2002). The principal difference is one of emphasis: in the POMDP framework, the emphasis is usually on estimating the individual's latent state for the purpose of planning; in HLM and multilevel growth models, the emphasis is usually on estimating the effectiveness of various activities. This paper looks at the problem of forecasting using HLM models both directly and through conversion to POMDP parameterizations.

The purpose of our study is to fit a POMDP-based latent growth model using Bayesian methods to a set of data documenting the development of Reading skills in a number of Kindergarten students. Once the model is successfully fit, we will use it to predict the end-of-year status of the students.

2   THE DATA

This study uses longitudinal data about reading development originally collected by the Florida Center for Reading Research (Al Otaiba et al., 2011). The reading skills of this initial cohort of students were measured three times (Fall, Winter, and Spring) during Kindergarten, and follow-up measurements were taken at the end of 1st, 2nd, and 3rd grade. There were 247 students in the initial sample, but only 224 were still in the area at the end of the first year.

During Kindergarten, children rapidly develop Reading and pre-Reading skills (e.g., oral vocabulary and letter identification). Consequently, not all measures are appropriate for all time points, and different measures were collected at different time points. Table 1 shows the measures that were collected during Kindergarten.

      Table 1: Measures Collected By Occasion
        Measure    Fall   Winter   Spring
        LW          X       X        X
        PV          X       X        X
        ISF         X       X
        PSF                 X        X
        NWF                 X        X
        LNF         X       X        X

The measures are taken from the Woodcock-Johnson III Cognitive Test (WJ-III; Woodcock, McGrew, & Mather, 2001) and the Dynamic Indicators of Basic Early Literacy Skills (DIBELS; Good & Kaminski, 2002). The measures used were:

LW – Letter-Word Identification (WJ-III)
PV – Picture Vocabulary (WJ-III)
LNF – Letter Naming Fluency (DIBELS)
ISF – Initial Sound Fluency (DIBELS)
PSF – Phoneme Segmentation Fluency (DIBELS)
NWF – Nonsense Word Fluency (DIBELS)

The Woodcock-Johnson measures are available in several forms. We used the "W" scale (which is scaled to an item response theory model), as it showed more variation than the scale scores.

Additionally, teacher and school identifiers are available for each child. For this cohort, teachers were not given special instructions nor a prescribed curriculum, although most of them used the same curriculum.


3   THE POMDP FRAMEWORK

Almond (2007) provides a generalized model for how a POMDP can represent measurement of a developing proficiency across multiple time points (Figure 1).

[Figure 1: Measurement across time as POMDP. Latent proficiency nodes S at t = 1, 2, 3 are linked by Growth edges labeled with the intervening Activity; each S is linked to an observable node O by an Assessment edge.]

In this figure, the nodes marked S represent the latent student proficiency as it evolves over time. At each time slice, there is generally some kind of measurement of student progress represented by the observable outcomes O. Note that these may be different for different time slices (cf. Table 1). Following the terminology of evidence-centered assessment design (ECD; Mislevy, Steinberg, & Almond, 2003), we call this an evidence model. In general, both the proficiency variables at Measurement Occasion m, S_m, and the observable outcome variables on that occasion, O_m, are multivariate.

Extending the ECD terminology, Almond (2007) calls the model for the S_m's the proficiency growth model. Following the normal logic of POMDPs, this is expressed in two parts: the first is the initial proficiency model, which gives the population distribution for proficiency at the first measurement occasion. The second is an action model, which gives a probability distribution for change in proficiency over time that depends on the instructional activity chosen between measurement occasions.

3.1   PROFICIENCY GROWTH MODEL

For the data from the Al Otaiba et al. (2011) study, the latent proficiency is obviously Reading. The question immediately arises as to how many dimensions to use to represent the reading construct. As the students are entering the study in Kindergarten, components of the reading skill, such as oral vocabulary and phonemic awareness, are less tightly correlated than they are with older children. (In the fall of the Kindergarten year the correlation between the LW and PV scores in the Al Otaiba et al. study was r = .46, n = 247, while in the spring it had increased to r = .56, n = 224.) As a starting point, we will fit a unidimensional model of Reading, representing it with a single continuous variable: R_nm, the reading ability of Individual n on Measurement Occasion m.
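The correlations quoted above are ordinary Pearson correlations computed on the cases observed at each occasion; in R this is a one-liner (the data frame and column names below are illustrative, not the actual variable names in the data set):

    # Fall and Spring LW-PV correlations (sketch; illustrative names)
    cor(kdata$LW.fall,   kdata$PV.fall,   use = "complete.obs")
    cor(kdata$LW.spring, kdata$PV.spring, use = "complete.obs")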
3.1.1   Model for Growth

In the first cohort of the Al Otaiba et al. (2011) study, teachers were not given specific instructions about curriculum or activity between the time points. We therefore do not have a dependency on activity to measure here. However, we do expect there to be some classroom-to-classroom differences, so we will make the growth parameters dependent on the classroom. (The teacher effect is part of the classroom effect; however, aspects of the peer group and environment are captured as well.) Let c(n) be the classroom to which Student n belongs. Note also that classrooms are nested within schools, so school effects are considered part of the general classroom effect.

Following this logic, for Measurement Occasion m > 1, define:

    R_nm = R_n(m-1) + (γ_c(n)m + γ_0m) ΔT_nm + η_nm,    η_nm ~ N(0, σ_c(n)m √ΔT_nm)    (1)

Here ΔT_nm is the time between Measurement Occasions m and m−1 for Individual n, γ_0m is an average growth rate, and γ_c(n)m is a classroom-specific growth rate. Note that the residual standard deviation depends on both a classroom-specific rate, σ_c(n)m, and the time elapsed between measurements. This is consistent with the model that student ability is growing according to a nonstationary Wiener (Brownian motion) process.
3.1.2   Model For Initial Proficiency

Children entering Kindergarten have very diverse language and early literacy backgrounds. There are considerable differences in the amount of experience with print material the child has at home and in the breadth and depth of vocabulary used with the child, as well as a wide variety of preschool experiences. As a child's preschool and early home experiences are at least partially dependent on their parents' social and economic status, and within-school socio-economic status tends to be more homogeneous than across-school status, we model the initial status as dependent on the school. Let s(n) be the school attended (during Kindergarten) by Student n.

There is also considerable variation in the age at entry. In the Al Otaiba et al. (2011) study, 95% of the children were between the ages of 5 years 2 months and 6 years 4 months at the time of the first testing (with a few students 7 years or older). This represents a considerable variation in maturity, and potentially in initial ability.

We define the following model for Measurement Occasion 1:

    R_n1 ~ N(μ_s(n), υ_s(n))    (2)

3.2   EVIDENCE MODELS

Because we are assuming that Reading proficiency is unidimensional, we do not need to specify which of the measures in Table 1 are relevant to which proficiencies. Thus, the evidence model is a collection of simple regressions; for each observation Y_nmi for Individual n at Measurement Occasion m on Instrument i, we have:

    Y_nmi = a_i + b_i R_nm + ε_nmi    (3)
    ε_nmi ~ N(0, ω_i)    (4)

Note that the slope parameter b_i actually encodes a relative importance for the various measures.

One advantage of this structure is that we do not need to explicitly specify the data collection structure (Table 1). Instead, we can simply set the values of measures not recorded in each wave to missing values.

Because each of the instruments is well established (Woodcock et al., 2001; Good & Kaminski, 2002), we know some of their critical psychometric properties. In particular, the reliability of Instrument i, ρ_i, is documented in the handbooks for the measures. In classical test theory, the reliability is the squared correlation between the true score of an examinee and the observed score. With a bit of algebra, this definition is equivalent to:

    ρ_i = 1 − Var_n(ε_nmi) / Var_n(Y_nmi)    (5)

Here the notation Var_n(·) indicates that the variance is taken over individuals (with measurement occasion and instrument held constant). Solving Equation 5 for Var_n(ε_nmi) yields an estimate of ω_i² for each measurement occasion. We took the median of the three estimates as our base estimate ω̃_i² for ω_i².

One drawback of the classical test theory concept of reliability is that it is dependent on the population being measured. Thus, as the sample in the Al Otaiba et al. (2011) study is slightly different from the norming samples used in the development of the WJ-III and DIBELS measures, we expect our observed reliability to differ slightly from the published values. What we do is set up priors for ω_i using ω̃_i as the prior mean. In particular,

    1/ω_i² ~ Gamma(α, α ω̃_i²),    (6)
where Gamma(α, β) is a gamma distribution with shape parameter α and rate parameter β. We note that any gamma distribution with β = α ω̃_i² will have the proper mean. The shape parameter α is then effectively a tuning parameter giving the strength of the prior distribution, or equivalently the relative weight of the published reliabilities and the observed error distribution. We initially chose a value of α = 100, which weights the prior knowledge as equivalent to 100 observations, but later increased it to 1000 when we were experiencing convergence problems.

3.3   SCALE IDENTIFICATION

A problem that frequently arises in educational models using latent variables is the identifiability of the scale. In particular, suppose we replaced R_nm with R′_nm = R_nm + c for an arbitrary constant c, and replaced a_i with a′_i = a_i − b_i c. The likelihood of the observed data Y_nmi (implicit in Equation 3) would be identical. A similar problem arises if we replace R_nm with R″_nm = c R_nm and b_i with b″_i = b_i / c. Additional constraints must be added to the model to identify the scale and location of the latent variable R.

A frequently used convention in psychometrics is to identify the scale and location of the latent variable by assuming that the population mean and variance of the latent variable are 0 and 1 (i.e., that the latent variable has an approximately unit normal distribution). In this case we can identify the scale for R_n1 by constraining Σ_s μ_s = 0 and (1/S) Σ_s υ_s = 1, where S is the total number of schools in the study.
Because this is a temporal model, there exists another complication: we need to identify the scale of R_nm for m > 1. In particular, the means and variances of the innovations, γ_0m and σ_c(n)m, can cause identifiability problems for the scale and location of R_nm similar to those that the initial mean and variance caused for R_n1. In this case we apply a different solution. We assume that the properties of the instruments, and their relationships to the latent reading proficiency, do not vary across time (at least for the time points at which they are in use). Note that in Equation 3, the slope b_i and intercept a_i do not vary across time. This establishes a common scale for all time points.

Our initial thinking was that this would be enough to identify the model. Unfortunately, because of the structural missing data, additional constraints are needed. These are described below.
Bafumi, Gelman, Park, and Kaplan (2005) present a different approach to enforcing identifiability. They let the model be unidentified while fitting the data, but then transform the estimates when evaluating the output (i.e., they enforce the constraint by manipulating the samples in R and coda (R Development Core Team, 2007; Plummer, Best, Cowles, & Vines, 2006) rather than in BUGS or JAGS). For example, rather than constraining Σ_s μ_s = 0, they would estimate the μ_s freely, but post hoc would adjust the sample from the rth cycle, μ_s^(r)′ = μ_s^(r) − (1/S) Σ_s μ_s^(r), making appropriate adjustments to the other parameters. They claim that the resulting model mixes better; however, there is some difficulty in figuring out how the post hoc adjustments will affect other parameters in the model.
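A minimal sketch of this post hoc recentering in R with coda, assuming the school means were monitored as mu[1], ..., mu[S] (the function and object names are our own illustration, not Bafumi et al.'s code):

    # Post hoc recentering of unidentified school means (sketch).
    library(coda)
    recenter <- function(chain) {                    # chain: one mcmc object
      mu.cols <- grep("^mu\\[", colnames(chain))
      shift   <- rowMeans(as.matrix(chain)[, mu.cols, drop = FALSE])
      chain[, mu.cols] <- chain[, mu.cols] - shift   # subtract the cycle-wise mean
      # The same shift would also have to be applied to parameters on the R
      # scale (e.g., the latent R[n,m] draws) to leave the likelihood unchanged.
      chain
    }
    samp.centered <- do.call(mcmc.list, lapply(samp, recenter))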
4   PROBLEMS WITH MODEL FITTING

We attempted to fit the model described in the previous section with MCMC using JAGS (Plummer, 2012).[1] After some initial difficulties we removed the teacher and school effects (intending to add them again after we fit the simpler model). This also allowed us to restrict the prior distribution for R_n1 to be a unit normal distribution (zero mean, variance one). This is a common identifiability constraint imposed in psychometric models.

[1] Actually, we did some of our early model fitting using WinBUGS (D. J. Lunn, Thomas, Best, & Spiegelhalter, 2000). Some of the identification problems we were having in WinBUGS we are not having in JAGS. JAGS may be using slightly better samplers, which may take care of issues that occur when the predictor variables in regressions are not centered (Plummer, 2012). Similar improvements may have been made in OpenBUGS (D. Lunn, Spiegelhalter, Thomas, & Best, 2009), the successor to WinBUGS, but we have not tested this model using OpenBUGS.
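For reference, a fitting run of this kind with the rjags interface looks roughly like the following; the model file name, data list, and monitored nodes are illustrative, not the authors' actual script:

    # Compile the model, burn in, draw samples, and check convergence.
    library(rjags)
    jm <- jags.model("reading-growth.bug", data = jags.data,
                     n.chains = 3, n.adapt = 1000)
    update(jm, n.iter = 2000)                          # burn-in
    samp <- coda.samples(jm, variable.names = c("a", "b", "omega", "gamma0", "R"),
                         n.iter = 5000)
    gelman.diag(samp[, c("a[5]", "b[5]")])             # potential scale reduction factors
    plot(samp[, "b[5]"])                               # trace and density plots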
4.1   FIVE MEASURE MODEL

Our initial experiments involved five of the six measures (the PSF measure was left out due to a mistake in the model setup). We ran three Markov chains using random starting positions and found that the models did not converge. Or, more properly, the evidence model parameters (a_i, b_i, and ω_i) for the DIBELS NWF (Nonsense Word Fluency) measure did not converge. Table 2 shows the posterior means of the evidence model parameters for the five measures (because the MCMC chain did not reach the stationary state, these may not be the true posterior).

      Table 2: Evidence Model Parameters, 5 Measure Model
               LW        PV       LNF      ISF      NWF
        a    105.37     99.90    25.76    13.97    -4.27*
        b      0.15      0.05     0.49     0.32     0.87*
        ω      6.15      4.92     6.31     4.38     0.09
      * indicates parameter did not converge

Note in Table 2 that the estimated residual variance for NWF is extremely low, indicating a nearly perfect correlation between the latent Reading variable and the NWF measure. In this case, the MCMC chain looks like it is somehow using that measure to identify the scale of the latent variable. Furthermore, the slope for that variable is twice as high as the slopes for the other variables in the model. Table 3 shows some of the difficulty. The NWF measure is the only one showing a large increase between the Winter and Spring testing periods. So naturally, there is a tendency to track that measure.

      Table 3: Mean Scores on Each Measure at Each Administration
                 LW      PV     LNF    ISF    NWF
        Fall    108.5   100.6   27.3   14.2     —
        Winter  110.8   102.3   42.9   25.5    27.9
        Spring  111.2   101.7   51.3     —     43.2
Trace plots of the evidence models show the problem. Figure 2 shows an example of the extremely slow mixing that is characteristic of identifiability problems. Depending on the values of the other variables in the system (particularly the latent reading variables), higher or lower slopes may be sensible. Looking at the trace plots of R_nm for several students shows similar poor mixing for m > 1. We would expect similar problems with the trace plots for γ_0m, but the mixing looks good on those chains.

[Figure 2: Trace plots of evidence model parameters for measure NWF (trace and density panels for a[5], b[5], and itau[5]).]

It is likely that the problem is some complex interaction between using γ_0m and the b's to identify mean growth, or the a's which define the starting point for growth. Note that the problematic measure, NWF, was not measured at the first time point. Thus the constraint on the distribution of R_n1 will not define its scale at the second or third measurement occasions.
4.2   THREE MEASURE MODEL

As the problematic measures may be the ones which were not recorded at all three time points, we ran the model again, dropping the ISF and NWF measures (the ones not observed at the first or third measurement occasion). The new model also did not converge, although the focus of the problem has now moved from the NWF measure to the LNF measure.

Table 4 shows the new estimates from the unconverged posterior. Again, the variance for the measure that did not converge is substantially smaller than that of the other measures, and the slope is substantially higher. Again the trace plots (Figure 3) show poor mixing, as do similar plots for the R_nm measures for m > 1. There is also an indication of a trend, which suggests that the chains have not covered the whole of the posterior distribution.

      Table 4: Evidence Model Parameters, 3 Measure Model
               LW        PV      LNF
        a    103.74     98.95    23.02*
        b      1.59      0.64     4.55*
        ω      5.55      4.78     0.11
      * indicates parameter did not converge

[Figure 3: Trace plots of evidence model parameters for measure LNF, Three Measure Model.]

4.3   MISSING IDENTIFICATION CONSTRAINT

Looking back at the problems in the model fit in Section 4.1, note that the lack of fit could be explained by the interaction between a_5 (the intercept for the NWF measure) and γ_1 (the average proficiency change between the first and second time points). As NWF is not measured at the first time point, any arbitrary change between the first and second time points can be created by changing a_5, b_5, and γ_1. The other four measures were all collected in the fall, so in these cases, a_i should have been fixed by the constraint that E[R_n1] = 0.

What is required is a method for fixing the value of a_i for measures that were not collected at the initial time point. One possible way to do this would be simply to set a_i = 0. This is not unreasonable if all of the variables are on a standardized scale: it implies that the average trajectory of the average student will pass through the average of the scores.


This required that the scores all be on the same scale (especially problematic with the WJ-III and DIBELS scores, which are based on different development and norming samples). Fortunately, for these data all six measures were collected in the Winter time period. Subtracting the mean of the Winter scores and dividing by the standard deviation for each measure produced standardized scores. This standardization, together with the constraint a_i = 0, caused the models to converge.
4.4   SIX MEASURE MODEL

Using the standardized data and the additional constraint of a_i = 0, we again fit the model using MCMC. This time, we got convergence on all of the evidence model parameters (Figure 4).

[Figure 4: Trace plots of evidence model parameters for measure NWF, Six Measure Model with a_i = 0.]

Table 5 shows the mean of the latent Reading variable for the first five students in the sample. This appears to be well behaved, with all of the students showing growth across the three time points.

      Table 5: Mean values for Reading for first five students, Six Measure Model with a_i = 0
              S1       S2       S3       S4       S5
        F   -0.394   -0.351   -0.375   -1.556   -0.773
        W    0.029    0.104    0.031   -1.278   -0.415
        S    0.864    1.016    0.852   -0.514    0.419
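Values like those in Table 5 come straight from the MCMC output; for example, continuing the earlier monitoring sketch (the node names are illustrative):

    # Posterior means of the latent Reading variable for one student (sketch).
    post.mean <- summary(samp)$statistics[, "Mean"]
    post.mean[c("R[1,1]", "R[1,2]", "R[1,3]")]   # Student 1: Fall, Winter, Spring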
5   FUTURE DIRECTIONS AND CHALLENGES

The key to getting this model to converge was the standardization of the measure scales. Fortunately, this data set had a time period at which all six measures were applied to the same population. Consequently, standardizing the scales at this time point put the measures on a comparable scale, which then made the fixed intercept constraint meaningful.

It is difficult to see how this generalizes to cases in which there is not a single time point at which all measures are collected. This is a problem with the cohort examined in this study when we look at the data gathered in first and second grades. As the students' reading abilities develop, new and more difficult measures of reading become appropriate. Linking these back to the old scale is a difficult problem. This problem is well known in the educational literature under the name "vertical scaling" (von Davier, Carstensen, and von Davier, 2006, provide a review of the literature).

Now that the model without teacher or school effects converges, the next step is to add those back into the model. Also, we should use cross-validation to evaluate how well the model predicts students' scores. The Al Otaiba et al. (2011) data set has long-term follow-up for a substantial portion of the students, so we can see how well the model can predict First and Second grade reading scores as well. Finally, we can look at the rules for classification into special instruction, to see whether integrating the data across multiple measures provides a better picture of the student than looking at one measure alone.


Acknowledgments

We would like to thank the Florida Center for Reading Research for allowing us access to the data used in this paper. The data were originally collected as part of a larger National Institute of Child Health and Human Development Early Child Care Research Network study.


References

Almond, R. G. (2007). Cognitive modeling to represent growth (learning) using Markov decision processes. Technology, Instruction, Cognition and Learning (TICL), 5, 313–324. Available from http://www.oldcitypublishing.com/TICL/TICL.html

Almond, R. G. (2009). Estimating parameters of periodic assessment models (Research Report No. To appear). Educational Testing Service.

Almond, R. G. (2010). Using evidence centered design to think about assessments. In V. J. Shute & B. J. Becker (Eds.), Innovative assessment for the 21st century: Supporting educational needs (pp. 75–100). Springer.

Al Otaiba, S., Folsom, J. S., Schatschneider, C., Wanzek, J., Greulich, L., Meadows, J., et al. (2011). Predicting first-grade reading performance from kindergarten response to tier 1 instruction. Exceptional Children, 77(4), 453–470.

Bafumi, J., Gelman, A., Park, D. K., & Kaplan, N. (2005). Practical issues in implementing and understanding Bayesian ideal point estimation. Political Analysis, 13, 171–187.

Bennett, R. E. (2007, May). Assessment of, for, and as learning: Can we have all three? Paper presented at the Institute of Educational Assessors National Conference, London, England. Available from http://www.ioea.org.uk/Home/news and events/annual conference/day1/randy bennett.aspx

Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy, and Practice, 5(1), 7–74.

Boutilier, C., Dean, T., & Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11, 1–94. Available from citeseer.ist.psu.edu/boutilier99decisiontheoretic.html

Good, R. H., & Kaminski, R. A. (Eds.). (2002). Dynamic indicators of basic early literacy skills (6th ed.) [Computer software manual]. Available from https://dibels.uoregon.edu/

Lunn, D., Spiegelhalter, D., Thomas, A., & Best, N. (2009). The BUGS project: Evolution, critique and future directions (with discussion). Statistics in Medicine, 28, 3049–3082.

Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS: A Bayesian modeling framework: concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.

Marcotte, A. M., & Hintze, J. M. (2009). Incremental and predictive utility of formative assessment methods of reading comprehension. Journal of School Psychology, 47, 315–335.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessment (with discussion). Measurement: Interdisciplinary Research and Perspective, 1(1), 3–62.

Pellegrino, J., Glaser, R., & Chudowsky, N. (Eds.). (2001). Knowing what students know: The science and design of educational assessment. National Research Council.

Plummer, M. (2012, May). JAGS version 3.2.0 user manual (3.2.0 ed.) [Computer software manual]. Available from http://mcmc-jags.sourceforge.net/

Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). coda: Output analysis and diagnostics for MCMC [Computer software manual]. (R package version 0.10-7)

R Development Core Team. (2007). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Available from http://www.R-project.org

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models (2nd ed.). Sage Publications.

Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence (1st ed.). Oxford University Press, USA.

von Davier, M., Carstensen, C. H., & von Davier, A. A. (2006). Linking competencies in educational settings and measuring growth (Research Report No. RR-06-12). ETS.

Wiggins, G. P. (1998). Educative assessment: Designing assessments to inform and improve student performance. Jossey-Bass.

Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). WJ-III tests of cognitive abilities and achievement [Computer software manual].