=Paper=
{{Paper
|id=None
|storemode=property
|title=Using POMDPs to Forecast Kindergarten Students' Reading Comprehension
|pdfUrl=https://ceur-ws.org/Vol-962/paper01.pdf
|volume=Vol-962
|dblpUrl=https://dblp.org/rec/conf/uai/AlmondTO12
}}
==Using POMDPs to Forecast Kindergarten Students' Reading Comprehension==
Russell G. Almond (Educational Psychology and Learning Systems, Florida State University, Tallahassee, FL 32306; ralmond@fsu.edu), Umit Tokac (Educational Psychology and Learning Systems, Florida State University, Tallahassee, FL 32306; ut08@my.fsu.edu), Stephanie Al Otaiba* (Department of Teaching and Learning, Southern Methodist University, Dallas, TX 75275; salotaiba@smu.edu)

(* Some of the work took place while she was at the Florida Center for Reading Research, Tallahassee, FL.)

===Abstract===

Summative assessment of student abilities typically comes at the end of the instructional period, too late for educators to use the information for planning instruction. This paper explores the possibility of using Hierarchical Linear Models to forecast students' end-of-year performance. Because these models are closely related to partially observed Markov decision processes (POMDPs), they should support extensions to instructional planning to meet educational goals. Despite the new notation, the POMDP models are subject to a familiar problem from the educational context: scale identifiability. This paper describes how this problem manifests itself and looks at one potential solution.

===1 INTRODUCTION===

There is a long tradition in education of separating instruction and assessment: summative assessment of what a student learns comes at the end of the unit, semester, or year. As limited time is allocated for assessment, such assessments are typically limited in their reliability (accuracy of measurement) and content validity (coverage of the targeted knowledge, skills and abilities). Because summative assessment comes at the end of instruction, instructors are not able to make changes to their instruction to maximize student learning (Almond, 2010).

Bennett (2007) suggested breaking the summative assessment into four or six periodic assessments. First, spreading the cost (student time taken away from direct instruction) over multiple measurement occasions allows for longer testing, providing both greater content coverage and reliability. Moreover, a proper model for student growth allows forecasting of the student's eventual status at the end of the year. Consequently, teachers and administrators can form plans for students which maximize learning outcomes, and can identify for special instruction those students for whom the goals are otherwise unreachable. In this sense, the periodic assessments play a role somewhere between traditional summative assessment and formative assessment — assessing student learning for the purpose of improving instruction (Black & Wiliam, 1998; Wiggins, 1998; Pellegrino, Glaser, & Chudowsky, 2001).

Almond (2007) noted that the forecasting could be done using a partially observed Markov decision process (POMDP; Boutilier, Dean, & Hanks, 1999): the latent variables describing student proficiency form an unobserved Markov process, and the periodic assessments provide observable evidence about the state of those latent variables. The instructional activities chosen between time points are the action space, and in fact, the students' response to instruction often provides important clues about their proficiency and specific learning problems (Marcotte & Hintze, 2009).

Almond (2009) notes the similarity between POMDPs and other frameworks more commonly used in education, such as latent growth modeling (Singer & Willett, 2003) and hierarchical linear modeling (HLM; Raudenbush & Bryk, 2002). The principal difference is one of emphasis: in the POMDP framework, the emphasis is usually on estimating the individual's latent state for the purpose of planning. In HLM and multilevel growth models, the emphasis is usually on estimating the effectiveness of various activities.
This paper looks at the problem of forecasting using HLM models, both directly and through conversion to POMDP parameterizations.

The purpose of our study is to try to fit a POMDP-based latent growth model using Bayesian methods to a set of data documenting the development of Reading skills in a number of Kindergarten students. Once the model is successfully fit, we will use it to predict the end-of-year status of the students.

===2 THE DATA===

This study uses longitudinal data about reading development originally collected by the Florida Center for Reading Research (Al Otaiba et al., 2011). The reading skills for this initial cohort of students were measured three times (Fall, Winter and Spring) during Kindergarten, and follow-up measurements were taken at the end of 1st, 2nd and 3rd grade. There were 247 students in the initial sample, but only 224 were still in the area at the end of the first year.

During Kindergarten, children rapidly develop Reading and pre-Reading skills (e.g., oral vocabulary and letter identification). Consequently, not all measures are appropriate for all time points, and different measures were collected at different time points. Table 1 shows the measures that were collected during Kindergarten:

{| class="wikitable"
|+ Table 1: Measures Collected by Occasion
! Measure !! Fall !! Winter !! Spring
|-
| LW || X || X || X
|-
| PV || X || X || X
|-
| ISF || X || X ||
|-
| PSF || || X || X
|-
| NWF || || X || X
|-
| LNF || X || X || X
|}

The measures are taken from the Woodcock-Johnson III Cognitive Test (WJ-III; Woodcock, McGrew, & Mather, 2001) and the Dynamic Indicators of Basic Early Literacy Skills (DIBELS; Good & Kaminski, 2002). The measures used were:

* LW – Letter-Word Identification (WJ-III)
* PV – Picture Vocabulary (WJ-III)
* LNF – Letter Naming Fluency (DIBELS)
* ISF – Initial Sound Fluency (DIBELS)
* PSF – Phoneme Segmentation Fluency (DIBELS)
* NWF – Nonsense Word Fluency (DIBELS)

The Woodcock-Johnson measures are available in several forms. We used the "W" scale (which is scaled with an item response theory model), as it showed more variation than the scale scores.

Additionally, teacher and school identifiers are available for each child. For this cohort, teachers were given neither special instructions nor a prescribed curriculum, although most of them used the same curriculum.

===3 THE POMDP FRAMEWORK===

Almond (2007) provides a generalized model for how a POMDP can represent measurement of a developing proficiency across multiple time points (Figure 1).

''Figure 1: Measurement across time as a POMDP. Latent proficiency nodes S at t = 1, 2, 3 are linked by Growth edges labeled with the intervening Activity; each S is measured through an Assessment yielding observables O.''

In this figure, the nodes marked S represent the latent student proficiency as it evolves over time. At each time slice, there is generally some kind of measurement of student progress, represented by the observable outcomes O. Note that these may be different for different time slices (cf. Table 1). Following the terminology of evidence-centered assessment design (ECD; Mislevy, Steinberg, & Almond, 2003), we call this an evidence model. In general, both the proficiency variables at Measurement Occasion m, S_m, and the observable outcome variables on that occasion, O_m, are multivariate.

Extending the ECD terminology, Almond (2007) calls the model for the S_m's the proficiency growth model. Following the normal logic of POMDPs, this is expressed in two parts: the first is the initial proficiency model, which gives the population distribution for proficiency at the first measurement occasion. The second is an action model, which gives a probability distribution for the change in proficiency over time that depends on the instructional activity chosen between measurement occasions.
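To make this structure concrete, the sketch below shows how the three components (initial proficiency model, growth model, evidence model) could be written as a JAGS model. This is a minimal illustration under the simplifications we eventually adopt in Section 4 (no classroom or school effects, unit normal initial distribution); the variable names (R, gamma0, sigma, a, b, omega, dT) are our own illustrative choices, not taken from any released script.

<pre>
model {
  for (n in 1:N) {
    # Initial proficiency model: unit normal identifiability constraint
    R[n, 1] ~ dnorm(0, 1)

    # Growth model (Equation 1, without classroom effects):
    # Wiener-process increments, variance proportional to elapsed time
    for (m in 2:M) {
      R[n, m] ~ dnorm(R[n, m - 1] + gamma0[m] * dT[n, m],
                      1 / (sigma[m]^2 * dT[n, m]))
    }

    # Evidence model (Equations 3 and 4): one regression per instrument.
    # Entries of Y left as NA are treated as unobserved by JAGS, so the
    # data collection design of Table 1 needs no explicit coding.
    for (m in 1:M) {
      for (i in 1:I) {
        Y[n, m, i] ~ dnorm(a[i] + b[i] * R[n, m], 1 / omega[i]^2)
      }
    }
  }

  # Vague priors, for illustration only
  for (m in 2:M) {
    gamma0[m] ~ dnorm(0, 0.01)
    sigma[m] ~ dunif(0, 10)
  }
  for (i in 1:I) {
    a[i] ~ dnorm(0, 0.01)
    b[i] ~ dnorm(0, 0.01)
    omega[i] ~ dunif(0, 10)  # Section 3.2 replaces this with a reliability-based prior
  }
}
</pre>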
====3.1 PROFICIENCY GROWTH MODEL====

For the data from the Al Otaiba et al. (2011) study, the latent proficiency is obviously Reading. The question immediately arises as to how many dimensions to use to represent the reading construct. As the students are entering the study in Kindergarten, components of the reading skill, such as oral vocabulary and phonemic awareness, are less tightly correlated than they are for older children. (In the fall of the Kindergarten year, the correlation between the LW and PV scores in the Al Otaiba et al. study was r = .46, n = 247, while by the spring it had increased to r = .56, n = 224.) As a starting point, we will fit a unidimensional model of Reading, representing it with a single continuous variable: R_nm, the reading ability of Individual n on Measurement Occasion m.

====3.1.1 Model for Growth====

In the first cohort of the Al Otaiba et al. (2011) study, teachers were not given specific instructions about curriculum or activity between the time points. We therefore do not have a dependency on activity to measure here. However, we do expect there to be some classroom-to-classroom differences, so we will make the growth parameters dependent on the classroom. (The teacher effect is part of the classroom effect; however, aspects of the peer group and environment are captured as well.) Let c(n) be the classroom to which Student n belongs. Note also that classrooms are nested within schools, so school effects are considered part of the general classroom effect.

Following this logic, for Measurement Occasion m > 1, define:

:<math>R_{nm} = R_{n(m-1)} + (\gamma_{c(n)m} + \gamma_{0m})\,\Delta T_{nm} + \eta_{nm}, \qquad \eta_{nm} \sim N\bigl(0,\ \sigma_{c(n)m}\sqrt{\Delta T_{nm}}\bigr) \qquad (1)</math>

Here ΔT_nm is the time between Measurement Occasions m − 1 and m for Individual n, γ_0m is an average growth rate, and γ_cm is a classroom-specific growth rate. Note that the residual standard deviation depends on both a classroom-specific rate, σ_cm, and the time elapsed between measurements. This is consistent with a model in which student ability grows according to a nonstationary Wiener (Brownian motion) process.

====3.1.2 Model for Initial Proficiency====

Children entering Kindergarten have very diverse language and early literacy backgrounds. There are considerable differences in the amount of experience with print material the child has at home and in the breadth and depth of vocabulary used with the child, as well as a wide variety of preschool experiences. As a child's preschool and early home experiences are at least partially dependent on their parents' social and economic status, and within-school socio-economic status tends to be more homogeneous than across-school status, we model the initial status as dependent on the school. Let s(n) be the school attended (during Kindergarten) by Student n.

There is also considerable variation in age at entry. In the Al Otaiba et al. (2011) study, 95% of the children were between the ages of 5 years 2 months and 6 years 4 months at the time of the first testing (with a few students 7 years or older). This represents considerable variation in maturity, and potentially in initial ability.

We define the following model for Measurement Occasion 1:

:<math>R_{n1} \sim N(\mu_{s(n)}, \upsilon_{s(n)}) \qquad (2)</math>

====3.2 EVIDENCE MODELS====

Because we are assuming that Reading proficiency is unidimensional, we do not need to specify which of the measures in Table 1 are relevant to which proficiencies. Thus, the evidence model is a collection of simple regressions: for each observation Y_nmi for Individual n at Measurement Occasion m on Instrument i, we have

:<math>Y_{nmi} = a_i + b_i R_{nm} + \epsilon_{nmi} \qquad (3)</math>
:<math>\epsilon_{nmi} \sim N(0, \omega_i) \qquad (4)</math>

Note that the slope parameter b_i effectively encodes the relative importance of the various measures.

One advantage of this structure is that we do not need to explicitly specify the data collection structure (Table 1). Instead, we can simply set the values of measures not recorded in each wave to missing values.
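For example, the response data can be assembled as an N × M × I array with NA wherever a measure was not administered; JAGS then treats those cells as unobserved automatically. A small R sketch (the file and column names are hypothetical):

<pre>
# Hypothetical long-format file: one row per (student, occasion, instrument)
# with integer indices and a score column.
scores <- read.csv("kindergarten_scores.csv")

N <- max(scores$student)   # students
M <- 3                     # Fall, Winter, Spring
I <- 6                     # LW, PV, ISF, PSF, NWF, LNF

# All-NA array; cells stay NA for waves where an instrument was not given
Y <- array(NA_real_, dim = c(N, M, I))
Y[cbind(scores$student, scores$occasion, scores$instrument)] <- scores$score
</pre>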
Because each of the instruments is well established (Woodcock et al., 2001; Good & Kaminski, 2002), we know some of their critical psychometric properties. In particular, the reliability of Instrument i, ρ_i, is documented in the handbooks for the measures. In classical test theory, the reliability is the squared correlation between the true score of an examinee and the observed score. With a bit of algebra, this definition is equivalent to:

:<math>\rho_i = 1 - \frac{\operatorname{Var}_n(\epsilon_{nmi})}{\operatorname{Var}_n(Y_{nmi})} \qquad (5)</math>

Here the notation Var_n(·) indicates that the variance is taken over individuals (with measurement occasion and instrument held constant). Solving Equation 5 for Var_n(ε_nmi) yields an estimate of ω_i² for each measurement occasion. We took the median of the three estimates as our base estimate ω̃_i² for ω_i².

One drawback of the classical test theory concept of reliability is that it is dependent on the population being measured. Thus, as the sample in Al Otaiba et al. (2011) is slightly different from the norming samples used in the development of the WJ-III and DIBELS measures, we expect our observed reliability to differ slightly from the published values. What we do is set up priors for ω_i using ω̃_i as the prior mean. In particular,

:<math>1/\omega_i^2 \sim \operatorname{Gamma}(\alpha,\ \alpha\tilde\omega_i^2), \qquad (6)</math>

where Gamma(α, β) is a gamma distribution with shape parameter α and rate parameter β. We note that any gamma distribution with β = αω̃_i² will have the proper mean. The shape parameter α is then effectively a tuning parameter giving the strength of the prior distribution, or equivalently the relative weight of the published reliabilities versus the observed error distribution. We initially chose a value of α = 100, which weights the prior knowledge as equivalent to 100 observations, but later increased it to 1000 when we were experiencing convergence problems.
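Continuing with the array Y above, the calculation behind Equations 5 and 6 can be sketched in R as follows (the reliability values shown are placeholders, not the handbook figures):

<pre>
# Published reliabilities rho_i (placeholder values, one per instrument)
rho <- c(LW = 0.90, PV = 0.85, ISF = 0.88, PSF = 0.88, NWF = 0.92, LNF = 0.93)

# Observed score variance at each occasion (rows) for each instrument (columns)
var_Y <- apply(Y, c(2, 3), var, na.rm = TRUE)

# Equation 5 rearranged: Var(eps) = (1 - rho_i) * Var(Y)
var_eps <- sweep(var_Y, 2, 1 - rho, `*`)

# Median over the occasions at which each measure was given
omega2_tilde <- apply(var_eps, 2, median, na.rm = TRUE)

# Equation 6: Gamma(alpha, alpha * omega2_tilde) prior on the precision
# 1 / omega_i^2, so that the prior mean is 1 / omega2_tilde
alpha <- 1000
prior_shape <- rep(alpha, I)
prior_rate  <- alpha * omega2_tilde
</pre>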
====3.3 SCALE IDENTIFICATION====

A problem that frequently arises in educational models using latent variables is the identifiability of the scale. In particular, suppose we replaced R_nm with R′_nm = R_nm + c for an arbitrary constant c, and replaced a_i with a′_i = a_i − b_i c. The likelihood of the observed data Y_nmi (implicit in Equation 3) would be identical. A similar problem arises if we replace R_nm with R″_nm = cR_nm and b_i with b″_i = b_i/c. Additional constraints must be added to the model to identify the scale and location of the latent variable R.

A frequently used convention in psychometrics is to identify the scale and location of the latent variable by assuming that the population mean and variance of the latent variable are 0 and 1 (i.e., that the latent variable has an approximately unit normal distribution). In this case we can identify the scale for R_n1 by constraining <math>\textstyle\sum_s \mu_s = 0</math> and <math>\textstyle\frac{1}{S}\sum_s \upsilon_s = 1</math>, where S is the total number of schools in the study.

Because this is a temporal model, there is another complication: we need to identify the scale of R_nm for m > 1. In particular, the mean and variance of the innovations, γ_0m and σ_cm, can cause the same identifiability problems for the scale and location of R_nm that the initial mean and variance caused for R_n1. In this case we apply a different solution. We assume that the properties of the instruments, and their relationships to the latent reading proficiency, do not vary across time (at least for the time points at which they are in use). Note that in Equation 3 the slope b_i and intercept a_i do not vary across time. This establishes a common scale for all time points.

Our initial thinking was that this would be enough to identify the model. Unfortunately, because of the structural missing data, additional constraints are needed. These are described below.

Bafumi, Gelman, Park, and Kaplan (2005) present a different approach to enforcing identifiability. They let the model be unidentified while fitting the data, but then transform the estimates when evaluating the results (i.e., they enforce the constraint by manipulating the samples in R and coda (R Development Core Team, 2007; Plummer, Best, Cowles, & Vines, 2006) rather than in BUGS or JAGS). For example, rather than constraining <math>\textstyle\sum_s \mu_s = 0</math>, they would estimate the μ_s freely, but post hoc would adjust the sample from the rth cycle, <math>\mu_s^{(r)\prime} = \mu_s^{(r)} - \tfrac{1}{S}\sum_s \mu_s^{(r)}</math>, making appropriate adjustments to the other parameters. They claim that the resulting model mixes better; however, it can be difficult to work out how the post hoc adjustments will affect other parameters in the model.
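A sketch of this post hoc adjustment using coda (the column names are hypothetical): each sampled cycle's school means are recentered, and the shift is absorbed into the intercepts so that a_i + b_i R_nm is unchanged.

<pre>
library(coda)

# samples: an mcmc object whose columns include the school means mu[1..S],
# intercepts a[1..I], and slopes b[1..I] (hypothetical column names)
recenter <- function(samples, mu_cols, a_cols, b_cols) {
  out <- as.matrix(samples)
  for (r in seq_len(nrow(out))) {
    shift <- mean(out[r, mu_cols])              # mean of mu_s at cycle r
    out[r, mu_cols] <- out[r, mu_cols] - shift  # recenter the school means
    # Keep a_i + b_i * R invariant when R shifts down by `shift`
    out[r, a_cols] <- out[r, a_cols] + out[r, b_cols] * shift
    # Note: any monitored R[n, m] samples would need the same shift,
    # which is exactly the bookkeeping difficulty noted above
  }
  mcmc(out)
}
</pre>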
===4 PROBLEMS WITH MODEL FITTING===

We attempted to fit the model described in the previous section with MCMC using JAGS (Plummer, 2012).¹ After some initial difficulties we removed the teacher and school effects (intending to add them again after we fit the simpler model). This also allowed us to restrict the prior distribution for R_n1 to be a unit normal distribution (zero mean, variance one). This is a common identifiability constraint imposed in psychometric models.

¹ Actually, we did some of our early model fitting using WinBUGS (D. J. Lunn, Thomas, Best, & Spiegelhalter, 2000). Some of the identification problems we were having in WinBUGS we are not having in JAGS. JAGS may be using slightly better samplers, which may take care of issues that occur when the predictor variables in regressions are not centered (Plummer, 2012). Similar improvements may have been made in OpenBUGS (D. Lunn, Spiegelhalter, Thomas, & Best, 2009), the successor to WinBUGS, but we have not tested this model using OpenBUGS.

====4.1 FIVE MEASURE MODEL====

Our initial experiments involved five of the six measures (the PSF measure was left out due to a mistake in the model setup). We ran three Markov chains from random starting positions and found that the models did not converge. More properly, the evidence model parameters (a_i, b_i, and ω_i) for the DIBELS NWF (Nonsense Word Fluency) measure did not converge. Table 2 shows the posterior means of the evidence model parameters for the five measures (because the MCMC chain did not reach the stationary state, these may not reflect the true posterior).

{| class="wikitable"
|+ Table 2: Evidence Model Parameters, 5 Measure Model
! !! LW !! PV !! LNF !! ISF !! NWF
|-
! a
| 105.37 || 99.90 || 25.76 || 13.97 || −4.27*
|-
! b
| 0.15 || 0.05 || 0.49 || 0.32 || 0.87*
|-
! ω
| 6.15 || 4.92 || 6.31 || 4.38 || 0.09
|}
(* indicates parameter did not converge)

Note in Table 2 that the estimated residual variance for NWF is extremely low, indicating a nearly perfect correlation between the latent Reading variable and the NWF measure. The MCMC chain looks like it is somehow using that measure to identify the scale of the latent variable. Furthermore, the slope for that variable is roughly twice as high as the slopes for the other variables in the model. Table 3 shows some of the difficulty: the NWF measure is the only one showing a large increase between the Winter and Spring testing periods, so naturally there is a tendency to track that measure.

{| class="wikitable"
|+ Table 3: Mean Scores on Each Measure at Each Administration
! !! LW !! PV !! LNF !! ISF !! NWF
|-
! Fall
| 108.5 || 100.6 || 27.3 || 14.2 ||
|-
! Winter
| 110.8 || 102.3 || 42.9 || 25.5 || 27.9
|-
! Spring
| 111.2 || 101.7 || 51.3 || || 43.2
|}

Trace plots of the evidence model parameters show the problem. Figure 2 shows an example of the extremely slow mixing that is characteristic of identifiability problems: depending on the values of the other variables in the system (particularly the latent reading variables), higher or lower slopes may be sensible. Trace plots of R_nm for several students show similarly poor mixing for m > 1. We would expect similar problems with the trace plots for γ_0m, but the mixing looks good on those chains.

''Figure 2: Trace and density plots of the evidence model parameters (a[5], b[5], itau[5]) for measure NWF, showing slow, drifting mixing.''

It is likely that the problem is some complex interaction between using γ_0m and the b's to identify mean growth, or the a's, which define the starting point for growth. Note that the problematic measure, NWF, was not measured at the first time point. Thus the constraint on the distribution of R_n1 will not define its scale at the second or third measurement occasions.

====4.2 THREE MEASURE MODEL====

As the problematic measures may be the ones which were not recorded at all three time points, we ran the model again, dropping the ISF and NWF measures (the ones not observed at the first or third measurement occasion). The new model also did not converge, although the focus of the problem moved from the NWF measure to the LNF measure.

Table 4 shows the new estimates from the unconverged posterior. Again, the variance for the measure that did not converge is substantially smaller than that of the other measures, and the slope is substantially higher. Again the trace plots (Figure 3) show poor mixing, as do similar plots for the R_nm values for m > 1. There is also an indication of a trend, suggesting that the chains have not covered the whole of the posterior distribution.

{| class="wikitable"
|+ Table 4: Evidence Model Parameters, 3 Measure Model
! !! LW !! PV !! LNF
|-
! a
| 103.74 || 98.95 || 23.02*
|-
! b
| 1.59 || 0.64 || 4.55*
|-
! ω
| 5.55 || 4.78 || 0.11
|}
(* indicates parameter did not converge)

''Figure 3: Trace and density plots of the evidence model parameters (a[3], b[3], itau[3]) for measure LNF, Three Measure Model, again showing poor mixing.''
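Non-convergence of this sort can be checked numerically as well as visually. A sketch with coda, assuming fit is an mcmc.list holding the three chains (e.g., from rjags::coda.samples) and that the parameters are named as in our earlier JAGS sketch:

<pre>
library(coda)

# Potential scale reduction factors: values well above 1 for the NWF
# parameters signal that the three chains have not converged
gelman.diag(fit[, c("a[5]", "b[5]", "omega[5]")])

# Slow, drifting traces like Figure 2, and tiny effective sample sizes
traceplot(fit[, "b[5]"])
effectiveSize(fit[, "b[5]"])
</pre>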
====4.3 MISSING IDENTIFICATION CONSTRAINT====

Looking back at the problems with the model fit in Section 4.1, note that the lack of fit could be explained by the interaction between a_5 (the intercept for the NWF measure) and γ_1 (the average proficiency change between the first and second time points). As NWF is not measured at the first time point, any arbitrary change between the first and second time points can be created by changing a_5, b_5 and γ_1. The other four measures were all collected in the fall, so in those cases a_i should have been fixed by the constraint that E[R_n1] = 0.

What is required is a method for fixing the value of a_i for measures that were not collected at the initial time point. One possible way to do this is simply to set a_i = 0. This is not unreasonable if all of the variables are on a standardized scale: it implies that the average trajectory of the average student will pass through the average of the scores.

This requires that the scores all be on the same scale (especially problematic as the WJ-III and DIBELS scores are based on different development and norming samples). Fortunately, for these data all six measures were collected in the Winter time period. Subtracting the mean of the Winter scores and dividing by the Winter standard deviation for each measure produced standardized scores. This standardization, together with the constraint a_i = 0, caused the models to converge.
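Concretely, the standardization can be done as in the R sketch below (again using the hypothetical array Y, with occasion 2 = Winter); the Winter moments define the transformation, which is then applied at every occasion so all waves stay on the same scale:

<pre>
# Winter is the only wave in which all six measures were administered,
# so its mean and SD define the common scale for each instrument
winter_mean <- apply(Y[, 2, ], 2, mean, na.rm = TRUE)
winter_sd   <- apply(Y[, 2, ], 2, sd,   na.rm = TRUE)

Y_std <- Y
for (i in seq_len(dim(Y)[3])) {
  Y_std[, , i] <- (Y[, , i] - winter_mean[i]) / winter_sd[i]
}
# On this scale, fixing a_i = 0 says the average student's trajectory
# passes through the average Winter score on every measure
</pre>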
====4.4 SIX MEASURE MODEL====

Using the standardized data and the additional constraint that a_i = 0, we again fit the model using MCMC. This time, we got convergence on all of the evidence model parameters (Figure 4).

''Figure 4: Trace and density plots of the evidence model parameters for measure NWF, Six Measure Model with a_i = 0, showing good mixing.''

Table 5 shows the posterior mean of the latent Reading variable for the first five students in the sample. This appears to be well behaved, with all of the students showing growth across the three time points.

{| class="wikitable"
|+ Table 5: Mean values of Reading for the first five students, Six Measure Model with a_i = 0
! !! S1 !! S2 !! S3 !! S4 !! S5
|-
! F
| −0.394 || −0.351 || −0.375 || −1.556 || −0.773
|-
! W
| 0.029 || 0.104 || 0.031 || −1.278 || −0.415
|-
! S
| 0.864 || 1.016 || 0.852 || −0.514 || 0.419
|}

===5 FUTURE DIRECTIONS AND CHALLENGES===

The key to getting this model to converge was the standardization of the measure scales. Fortunately, this data set had a time period in which all six measures were applied to the same population. Consequently, standardizing the scales at this time point put the measures on a comparable scale, which in turn made the fixed intercept constraint meaningful.

It is difficult to see how this generalizes to cases in which there is not a single time point at which all measures are collected. This is a problem for the cohort examined in this study when we look at the data gathered in first and second grades. As the students' reading abilities develop, new and more difficult measures of reading become appropriate. Linking these back to the old scale is a difficult problem, well known in the educational literature under the name "vertical scaling" (von Davier, Carstensen, and von Davier, 2006, provide a review of the literature).

Now that the model without teacher or school effects converges, the next step is to add those effects back into the model. Also, we should use cross-validation to evaluate how well the model predicts students' scores. The Al Otaiba et al. (2011) data set has long-term follow-up for a substantial portion of the students, so we can also see how well the model can predict First and Second grade reading scores (a sketch of such a forecast appears below). Finally, we can look at the rules for classification into special instruction, to see whether integrating the data across multiple measures provides a better picture of the student than looking at one measure alone.
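For the forecasting step, one option is to push posterior draws of the latent Reading variable forward through Equation 1 and summarize the resulting predictive distribution. A sketch under our simplified (no classroom effect) model; all object names and the elapsed-time value are hypothetical:

<pre>
# R_last: posterior draws of R[n, M] for one student at the last fitted
# occasion; gamma0_next, sigma_next: draws of the growth parameters for
# the forecast interval; dT: elapsed time to the forecast point
forecast_ahead <- function(R_last, gamma0_next, sigma_next, dT) {
  # One forward step of Equation 1 per posterior draw
  R_last + gamma0_next * dT +
    rnorm(length(R_last), mean = 0, sd = sigma_next * sqrt(dT))
}

spring_draws <- forecast_ahead(draws$R, draws$gamma0, draws$sigma, dT = 0.33)
quantile(spring_draws, c(0.05, 0.50, 0.95))  # forecast interval for the student
</pre>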
===Acknowledgments===

We would like to thank the Florida Center for Reading Research for allowing us access to the data used in this paper. The data were originally collected as part of a larger National Institute of Child Health and Human Development Early Child Care Research Network study.

===References===

* Almond, R. G. (2007). Cognitive modeling to represent growth (learning) using Markov decision processes. Technology, Instruction, Cognition and Learning (TICL), 5, 313–324. Available from http://www.oldcitypublishing.com/TICL/TICL.html
* Almond, R. G. (2009). Estimating parameters of periodic assessment models (Research Report, to appear). Educational Testing Service.
* Almond, R. G. (2010). Using evidence centered design to think about assessments. In V. J. Shute & B. J. Becker (Eds.), Innovative assessment for the 21st century: Supporting educational needs (pp. 75–100). Springer.
* Al Otaiba, S., Folsom, J. S., Schatschneider, C., Wanzek, J., Greulich, L., Meadows, J., et al. (2011). Predicting first-grade reading performance from kindergarten response to tier 1 instruction. Exceptional Children, 77(4), 453–470.
* Bafumi, J., Gelman, A., Park, D. K., & Kaplan, N. (2005). Practical issues in implementing and understanding Bayesian ideal point estimation. Political Analysis, 13, 171–187.
* Bennett, R. E. (2007, May). Assessment of, for, and as learning: Can we have all three? Paper presented at the Institute of Educational Assessors National Conference, London, England. Available from http://www.ioea.org.uk/Home/news_and_events/annual_conference/day1/randy_bennett.aspx
* Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy, and Practice, 5(1), 7–74.
* Boutilier, C., Dean, T., & Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11, 1–94. Available from citeseer.ist.psu.edu/boutilier99decisiontheoretic.html
* Good, R. H., & Kaminski, R. A. (Eds.). (2002). Dynamic indicators of basic early literacy skills (6th ed.) [Computer software manual]. Available from https://dibels.uoregon.edu/
* Lunn, D., Spiegelhalter, D., Thomas, A., & Best, N. (2009). The BUGS project: Evolution, critique and future directions (with discussion). Statistics in Medicine, 28, 3049–3082.
* Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS – a Bayesian modeling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
* Marcotte, A. M., & Hintze, J. M. (2009). Incremental and predictive utility of formative assessment methods of reading comprehension. Journal of School Psychology, 47, 315–335.
* Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments (with discussion). Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62.
* Pellegrino, J., Glaser, R., & Chudowsky, N. (Eds.). (2001). Knowing what students know: The science and design of educational assessment. National Research Council.
* Plummer, M. (2012, May). JAGS version 3.2.0 user manual [Computer software manual]. Available from http://mcmc-jags.sourceforge.net/
* Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). coda: Output analysis and diagnostics for MCMC [Computer software manual]. (R package version 0.10-7)
* R Development Core Team. (2007). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Available from http://www.R-project.org
* Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models (2nd ed.). Sage Publications.
* Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence (1st ed.). Oxford University Press.
* von Davier, M., Carstensen, C. H., & von Davier, A. A. (2006). Linking competencies in educational settings and measuring growth (Research Report No. RR-06-12). ETS.
* Wiggins, G. P. (1998). Educative assessment: Designing assessments to inform and improve student performance. Jossey-Bass.
* Woodcock, R. W., McGrew, K. S., & Mather, N. (2001). WJ-III tests of cognitive abilities and achievement [Computer software manual].