       Assessments That Care About Student Learning

                           Stephen E. Fancsali & Steven Ritter

                   Carnegie Learning, Inc., Pittsburgh PA 15219, USA
                {sfancsali, sritter}@carnegielearning.com



       Abstract. We argue that an important requirement of assessments that care is
       that they focus on student learning. Intelligent tutoring systems (ITSs) are a ba-
       sis for such assessments; they provide a means by which to continually assess
       what students know as they learn. Given widespread dissatisfaction with high-
       stakes assessments, we present a review of recent work targeted at replacing
       high-stakes exams with regular use of an ITS. We conclude by discussing some
       areas for future research and development.

       Keywords: Intelligent Tutoring Systems, Mathematics Education, High-Stakes
       Testing, Formative Assessment, Summative Assessment, Instruction-Embedded
       Assessment.


1      Introduction

1.1    Characteristics of Assessments that “Care”

   John Self’s [1] description of ITSs as systems that “care” about students focused on
the way that the personalization in such systems allows them to care about students in
a way that other systems cannot. With respect to caring assessments, we agree with
Zapata-Rivera [2] that personalization can enable assessments to address students at
their individual level of understanding. Personalization in caring assessments might
also enable students to demonstrate their knowledge in different ways and, perhaps, at
different times. However, the most important characteristic of a caring assessment is
not a result of personalization but of the goal of the assessment. For an assessment to
be “caring,” the experience must be beneficial to the student. Summative assessments
are typically, though not always, designed to benefit institutions by providing them
with information about the effectiveness of some aspect of instruction (e.g., the teach-
er, institution, or materials). Students are merely measurement instruments in this
process. In contrast, caring assessments are fundamentally formative and directly
assist the students in learning.
    We posit that an exciting opportunity exists wherein ITSs, augmented by several
tools and affordances that still need to be developed, are used as caring assessments.
Such assessments are fundamentally formative, focused on student learning, and
adaptive to student differences, but they also can serve a summative purpose to the
institution.




   In what follows, we argue that the time has come, both technologically and politically, to push forward with innovative approaches to assessment that use technologies like ITSs, embedded within the learning process, to provide continual, ongoing formative assessment while students learn, replacing high-stakes, end-of-year summative assessment approaches. Accomplishing this goal, relying on systems like ITSs
that attend to Self’s notion of “caring” about students (e.g., by having a student model
of what learners know and do not know during the learning process), will better allow
a broad swath of educators, courseware and ITS developers, and others to (eventually)
bask “in the positive glow associated with the term” caring [1]. More importantly,
innovative approaches will increase instructional time, provide better measures of
what students actually know, and improve learning outcomes. We detail recent work
in developing statistical models that predict students’ end-of-year test scores in math-
ematics using data from an “ITS that cares,” namely Carnegie Learning’s MATHia
ITS, based on its Cognitive Tutor technology [3].

1.2    The Problem(s) with High-Stakes, Summative Assessments

High-stakes summative assessments, by design and implementation, often contradict
what we know to be beneficial to instruction [4]. The fact that only the student’s
knowledge on the particular day of the test is important leads to cramming, which
optimizes short-term performance at the expense of long-term memory [5, 6]. Item
Response Theory (IRT) assumes that student knowledge is fixed for the period of the
exam, and so the examination environment is set up to minimize student learning
(even though we do know that prompted memory retrieval, as practiced in tests, does
improve learning [7]).
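   To make this contrast concrete, consider the one-parameter (Rasch) IRT model, shown below as a standard textbook formulation rather than any particular test's scoring model: the probability that student i answers item j correctly depends on a single, static ability parameter and an item difficulty, and nothing in the model indexes time or learning within the testing occasion.

```latex
% Rasch (1PL) IRT model: \theta_i is a single fixed ability for student i,
% b_j is the difficulty of item j; no term models learning over time.
P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}
```

Knowledge-tracing approaches, by contrast, update an estimate of what a student knows after every observed step (see the sketch in Section 1.4).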
    Most high-stakes assessments only provide coarse measures of learning like multi-
ple-choice items, which, even when well designed (e.g., with demonstrated validity
and reliability), provide minimal opportunities to illuminate student misconceptions or
the extent to which learners have mastered particular micro-competencies, skills, or
knowledge components (KCs [8]).
    In addition to the aforementioned shortcomings, standardized, high-stakes, sum-
mative assessments crowd out instructional time. Not only does taking the tests take
time, but also teachers often spend several instructional periods (and in many cases
weeks’ worth of instructional time) preparing for such high-stakes assessments. Fur-
ther, there are often numerous tests given. The Council of Great City Schools reports
that, among large school districts recently surveyed in the U.S., the typical eighth
grader, in a typical academic year, spends 25.3 hours taking 10.3 district-administered
tests, which alone would consume 2% of instructional time in a 180 instructional-day
academic year, without accounting for preparation time and other summative assess-
ments [9].
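   For rough context, and assuming (purely for illustration; this figure is not taken from [9]) about seven hours of instruction per school day, the arithmetic behind the roughly 2% figure is:

```latex
% Back-of-the-envelope arithmetic, assuming ~7 instructional hours per day
180 \text{ days} \times 7\ \tfrac{\text{h}}{\text{day}} = 1{,}260 \text{ h},
\qquad \frac{25.3 \text{ h}}{1{,}260 \text{ h}} \approx 2\%
```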
1.3    Responses to the Problem(s)
Public backlash to perceived and actual shortcomings of high-stakes, standardized
testing reflects perceptions that testing takes up too much instructional time while not
being well-aligned to such instruction [10]. On a national level in the U.S., the Every
Student Succeeds Act (ESSA) encourages innovative assessment approaches, demon-
strating recognition that the existing framework is less than satisfactory. At a state and
local level, so-called “opt-out” movements [11] have led parents and students to
exercise their right to refuse certain high-stakes, standardized assessments. As we
noted in [12], in 2017, 27% of students in the U.S. state of New York opted out of
high-stakes math testing [13], and so many students in Minneapolis recently opted out
of state exams for 10th and 11th grade math that the state does not believe the exam
results are reliable [14]. Officials and legislators in Georgia (and elsewhere) are
presently pursuing alternatives to high-stakes, end-of-year assessments, such as more
frequent formative assessments delivered via short quizzes [15]. What these responses
tend to have in common is a recognition that accountability and assessment of
learning and knowledge are important but that the methods presently employed to
assess such learning and knowledge are inadequate.

1.4    MATHia & Cognitive Tutor

MATHia is an ITS for middle school and high school mathematics, based on Carnegie
Learning’s Cognitive Tutor technology, that typically is a part of a blended mathemat-
ics curriculum. Carnegie Learning generally recommends a 60%-40% split for this
blended curriculum: instructor-facilitated, student-centered classroom activities that
foster collaborative learning and deep conceptual understanding (60% of the time) and
individual student work in a computer lab or classroom with the MATHia ITS (40% of
the time).
   MATHia is based on an adaptive, mastery learning [16] approach and relies on a
fine-grained model of KCs (e.g., Grade 6 mathematics comprises approximately
700 KCs) that students must master to make progress through content. Content is
presented to students in topical “workspaces,” each of which focuses on a set of KCs
that must be mastered to move on to the next workspace. Within each workspace,
students work on multi-step, complex, real-world problems (see Fig. 1), and student
responses at each step provide rich data about student problem-solving strategies and
a fine-grained understanding of what students know and do not know.
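   Mastery judgments of this kind are typically driven by knowledge tracing (cf. Section 3). As an assumption-laden sketch only, not MATHia's production model and with placeholder parameter values, a standard Bayesian Knowledge Tracing update for a single KC can be written as follows.

```python
# Minimal Bayesian Knowledge Tracing (BKT) sketch for a single knowledge component (KC).
# Parameter values are illustrative placeholders, not MATHia's actual parameters.

def bkt_update(p_known, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15):
    """Return the updated P(KC known) after observing one correct/incorrect step."""
    if correct:
        # Posterior P(known | correct response)
        num = p_known * (1 - p_slip)
        den = num + (1 - p_known) * p_guess
    else:
        # Posterior P(known | incorrect response)
        num = p_known * p_slip
        den = num + (1 - p_known) * (1 - p_guess)
    posterior = num / den
    # Allow for the chance of learning on this practice opportunity.
    return posterior + (1 - posterior) * p_learn


# Example: start at P(known) = 0.3 and observe three problem-solving steps.
MASTERY_THRESHOLD = 0.95  # a common convention; thresholds used in practice may differ
p_known = 0.3
for correct in (True, True, False):
    p_known = bkt_update(p_known, correct)
print(f"P(known) = {p_known:.3f}; mastered: {p_known >= MASTERY_THRESHOLD}")
```

An estimate of this sort, maintained per KC, is what allows a system like MATHia to decide when a student has mastered a workspace's KCs and can move on.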


2      Using MATHia Data to Predict Standardized Test Scores

Recent efforts [12, 17] have focused on using student MATHia performance data to
predict standardized test scores in large school districts in the U.S. states of Virginia
(VA) and Florida (FL). This work follows in the tradition of work using data from the
ASSISTments system [18] and considers the relative contributions of various
measures of MATHia performance and transformations thereof (e.g., workspaces
mastered per hour, hints requested, errors made), prior-year test performance or a pre-
test score (i.e., prior knowledge), and socio-demographic data (e.g., socio-economic
status via free/reduced-price lunch status, English language learner status, etc.).
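   The sketch below illustrates only the general shape of such a model; the file name, column names, and feature definitions are hypothetical stand-ins rather than the actual specifications in [12, 17].

```python
# Hypothetical sketch: combining MATHia-style process features with a pre-test
# score to predict an end-of-year test score via ordinary least squares.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("students.csv")  # hypothetical file: one row per student

# Process features of the kind described in [12, 17] (names are illustrative).
df["workspaces_per_hour"] = df["workspaces_mastered"] / df["hours_in_mathia"]
df["hints_per_problem"] = df["hints_requested"] / df["problems_completed"]
df["errors_per_problem"] = df["errors_made"] / df["problems_completed"]

process = ["workspaces_per_hour", "hints_per_problem", "errors_per_problem"]
features = ["pretest_score"] + process  # an M6-style specification (no demographics)

model = LinearRegression().fit(df[features], df["state_test_score"])
print(dict(zip(features, model.coef_.round(3))))
```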




Fig. 1. A screenshot of problem-solving in the MATHia platform.

   Specifics of model construction, specification, and selection are beyond the scope
of the present discussion (see [12, 17]), but Table 1 provides a brief summary of re-
sults to demonstrate our success so far. While various model goodness-of-fit metrics
are considered in detail in the original work reporting these results, we rely on the
relatively simple-to-interpret adjusted R2 values of the best models for particular
academic years in Table 1.
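   For reference, adjusted R2 discounts ordinary R2 for the number of predictors p relative to the sample size n:

```latex
% Adjusted R^2 for a model with n observations and p predictors
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```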
   In FL, the Florida Comprehensive Assessment Test (FCAT) was used in 2013-14,
and the Florida Standards Assessment (FSA) was used in 2014-15 and 2015-16.
Results reported for FL are for the best model learned on another academic year’s
data, so in each case an academic year’s data served as a held-out test set for the
statistical model learned [12]. In VA,
models were learned to predict scores on the Standards of Learning (SOL) exam for
mathematics [17], but data were only available for a single academic year. R2 values
reflect the proportion of variance in SOL exam scores explained by a model learned
on data for 7th graders. Cross-validation results indicated that these values do not seem
to reflect substantial over-fitting.
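   A minimal sketch of this cross-year evaluation setup, again with hypothetical file and column names rather than the actual pipeline from [12], is:

```python
# Hypothetical sketch: fit on one academic year, evaluate on a held-out year.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

train = pd.read_csv("year_2014_15.csv")  # hypothetical training-year data
test = pd.read_csv("year_2015_16.csv")   # hypothetical held-out-year data

features = ["pretest_score", "workspaces_per_hour", "hints_per_problem"]
model = LinearRegression().fit(train[features], train["state_test_score"])

r2 = r2_score(test["state_test_score"], model.predict(test[features]))
print(f"held-out adjusted R^2 = {adjusted_r2(r2, n=len(test), p=len(features)):.3f}")
```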
   Table 1 shows that we can account for up to 73% of the variation in FSA scores,
and we see the relative contribution of different categories of variables, starting with a
model including pre-test scores (M1) and progressively increasing the complexity of
models through M5. Importantly, we see that there are relatively small differences
between M5 and M6 (which does not include demographics), so demographic variables
do not provide substantial additional predictive power. Ideally, we would be able to rely
on process variables (i.e., MATHia performance) only, and especially for predicting
FCAT/FSA, we explain over 50% of the variation in these scores with pro-
cess/performance data alone.




  Table 1. Adjusted R2 values for best linear regression models reported in [12, 17]. Variable
categories are pre-test performance (pre-test), MATHia process data (process), and demograph-
                   ic data (demog). An M6 model was not considered by [17].

      Model   Variables                       VA SOL     FL FCAT/FSA
                                              2011-12    2013-14   2014-15   2015-16
                                              n=940      n=7,491   n=7,368   n=8,065
      M1      pre-test                        .5         .6001     .6035     .6528
      M2      process                         .43        .5271     .5393     .593
      M3      process + demog                 .45        .5443     .5656     .6185
      M4      pre-test + demog                .51        .6059     .629      .6684
      M5      pre-test + demog + process      .57        .6642     .689      .7349
      M6      pre-test + process              —          .6707     .6326     .7258



3      Future R&D

Being able to predict standardized test scores with reasonable success using perfor-
mance data from systems like MATHia is insufficient for such systems to replace
such tests. Further, systems like MATHia are designed to be used, and generally
(though not exclusively) are used, as part of a blended curriculum. To transition to
using such systems in an assessment role, we see several important areas of R&D to
pursue both for Carnegie Learning and the broader community of ITS and assessment
researchers working on developing caring assessments. In addition to improving mod-
els like those for which we have here briefly reported results, we need to identify
minimally sufficient sets of content that contribute to successful predictive models.
Doing so will help determine which subsets of content should be used as a part of
assessments in ITSs like MATHia (see the sketch at the end of this section). Content
management, assessment design, and editing
tools will be required to allow for state-by-state and possibly local customization.
Security tools will be required to ensure that students do their own work. More work
needs to be done to establish the validity and reliability of this approach to assess-
ment, likely by continuing to build bridges between traditional IRT approaches and
the knowledge tracing approaches of systems like MATHia.
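   As one possible, untested way to search for such minimally sufficient content, sketched here under the assumption of workspace-level predictor columns in a student-by-feature table (all names hypothetical), greedy forward selection could rank workspaces by the cross-validated predictive value they add:

```python
# Hypothetical sketch: greedy forward selection of workspace-level features to find
# a small content subset that preserves predictive accuracy for end-of-year scores.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_select(df, candidate_cols, target_col, max_features=10):
    """Greedily add the feature that most improves mean cross-validated R^2."""
    selected, remaining, best_score = [], list(candidate_cols), float("-inf")
    while remaining and len(selected) < max_features:
        # Score each remaining candidate when added to the current selection.
        scores = {
            col: cross_val_score(
                LinearRegression(), df[selected + [col]], df[target_col],
                cv=5, scoring="r2",
            ).mean()
            for col in remaining
        }
        best_col = max(scores, key=scores.get)
        if scores[best_col] <= best_score:
            break  # no candidate improves the model further
        best_score = scores[best_col]
        selected.append(best_col)
        remaining.remove(best_col)
    return selected, best_score
```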


References
 1. Self, J.A.: The distinctive characteristics of intelligent tutoring systems research: ITSs care,
    precisely. International Journal of Artificial Intelligence in Education 10, 350–364 (1999).
 2. Zapata-Rivera, D.: Toward caring assessment systems. In: Tkalcic, M., Thakker, D., Ger-
    manakos, P., Yacef, K., Paris, C., Santos, O. (eds.) Adjunct Publication of the 25th Conf.
    on User Modeling, Adaptation and Personalization, UMAP ’17, pp. 97–100. ACM, New
    York (2017).
 3. Ritter, S., Anderson, J.R., Koedinger, K.R., Corbett, A.T.: Cognitive Tutor: applied re-
    search in mathematics education. Psychonomic Bulletin & Review 14, 249–255 (2007).
 4. Snow, R.E., Lohman, D.F.: Implications of cognitive psychology for educational meas-
    urement. In: Linn, R.L. (ed.) Educational Measurement, 3rd ed., pp. 263–331. American
    Council on Education/Macmillan, New York (1989).
 5. Bloom, K.C., Shuell, T.J.: Effects of massed and distributed practice on the learning and
    retention of second-language vocabulary. Journal of Educational Research 74(4), 245–248
    (1981).
 6. Rea, C.P., Modigliani, V.: The effect of expanded versus massed practice on the retention
    of multiplication facts and spelling lists. Human Learning: Journal of Practical Research &
    Applications 4(1), 11–18 (1985).
 7. Roediger, H.L., Karpicke, J.D.: The power of testing memory: Basic research and implica-
    tions for educational practice. Perspectives on Psychological Science 1, 181–210 (2006).
 8. Koedinger, K.R., Corbett, A.T., Perfetti, C.: The Knowledge-Learning-Instruction (KLI)
    framework: Bridging the science-practice chasm to enhance robust student learning. Cog-
    nitive Science 36(5), 757–798 (2012).
 9. Hart, R., Casserly, M., Uzzell, R., Palacios, M., Corcoran, A., Spurgeon, A.: Student test-
    ing in America’s great city schools: An inventory and preliminary analysis. Council of
    Great City Schools, Washington, DC (2015).
10. PDK/Gallup: 47th annual PDK/Gallup poll of the public’s attitudes toward the public
    schools: Testing doesn’t measure up for Americans. Phi Delta Kappan 97(1) (2015).
11. Bennett, R.E.: Opt out: An examination of issues. ETS Research Report No. RR-16-13
    (ETS Research Report Series). Educational Testing Service, Princeton, NJ (2016)
    doi:10.1002/ets2.12101
12. Fancsali, S.E., Zheng, G., Tan, Y., Ritter, S., Berman, S.R., Galyardt, A.: Using embedded
    formative assessment to predict state summative test scores. In: Proceedings of the 8th In-
    ternational Conf. on Learning Analytics and Knowledge, pp. 161–170. ACM, New York
    (2018).
13. Moses, S.: State testing starts today; opt out CNY leader says changes are 'smoke and
    mirrors.' Syracuse.com (28 March 2017).
    http://www.syracuse.com/schools/index.ssf/2017/03/opt-out_movement_ny_teacher_union_supports_parents_right_to_refuse_state_tests.html,
    last accessed 2018/03/29.
14. State of Minnesota, Office of the Legislative Auditor: Standardized student testing: 2017
    evaluation report. State of Minnesota, Office of the Legislative Auditor, St. Paul, MN
    (2017).
15. Tagami, T.: Smaller tests could replace state’s big Milestones exams. The Atlanta Journal-
    Constitution (02 February 2018).
    https://www.myajc.com/news/local-education/smaller-tests-could-replace-state-big-milestones-exams/xbdXop4VvI2Tmf6EFl7fVN/,
    last accessed 2018/03/29.
16. Bloom, B.S.: Learning for mastery. Evaluation Comment 1(2), (1968).
17. Ritter, S., Joshi, A., Fancsali, S.E., Nixon, T.: Predicting standardized test scores from
    Cognitive Tutor interactions. In: Proceedings of the 6th International Conf. on Educational
    Data Mining, pp. 169–176 (2013).
18. Junker, B.W.: Using on-line tutoring records to predict end-of-year exam scores: experi-
    ence with the ASSISTments project and MCAS 8th grade mathematics. In: Lissitz, R.W.
    (ed.) Assessing and modeling cognitive development in school: intellectual growth and
    standard settings. JAM, Maple Grove, MN (2006).



