<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Measuring Programming Skill via Online Practice</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hongwen Guo</string-name>
          <email>hguo@ets.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mo Zhang</string-name>
          <email>mzhang@ets.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amy J Ko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Min Li</string-name>
          <email>minli@uw.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Zhou</string-name>
          <email>benzhou@uw.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jared Lim</string-name>
          <email>jlim419@gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Pham</string-name>
          <email>pkdpham@uw.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chen Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Psychometrics, Python coding task, Scoring, Item Bias, Fairness</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CS1 course, Davidson and coauthors applied DIF methods to</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ETS Research Institute</institution>
          ,
          <addr-line>Princeton, New Jersey</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>Atlanta, Georgia</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>It is particularly challenging to measure student's</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Washington</institution>
          ,
          <addr-line>Seattle, Washington</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>and across students during practice. For this</institution>
          ,
          <addr-line>it is necessary</addr-line>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>of attempts within individual students. It is critical as well</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Allowing students to practice a programming task multiple times fosters resilience and improves learning. However, it is challenging to measure their programming skills in a dynamic and adaptive learning environment, in terms of determining the maximum allowed number of attempts, measuring progress during learning, and producing a fair performance score for different student groups. It is particularly challenging to do so when different students adaptively practice different sets of programming tasks. In this study, we leveraged data collected from an online learning platform in a pilot study and applied psychometric models to address two research questions: 1) How can we measure students' progress in an adaptive practice setting that allows multiple attempts and inform grading policies? and 2) How do different scoring rules affect bias analysis of the programming tasks? From the log data, we extracted two practice features (the number of attempts and the number of passed test cases) and created six different scoring rules for scoring students' intermediate and final responses based on these practice features. We then used psychometric models and best practices to create a common scale to measure dynamic performance in a way that was comparable within individual students who made different attempts and across students who practiced different sets of programming tasks. This common scale ensured the comparability of performance within and across different student groups. It further enabled us to evaluate potential task biases between gender groups using differential item functioning (DIF) analysis. Our preliminary results suggest that the final-attempt-based scoring rule not only boosted students' performance but also reduced the potential bias of the programming tasks. This study contributes to methodologies for using log data to measure dynamic programming skills and evaluate task biases in adaptive online practice settings.</p>
      </abstract>
      <kwd-group>
        <kwd>Psychometrics</kwd>
        <kwd>Python coding task</kwd>
        <kwd>Scoring</kwd>
        <kwd>Item Bias</kwd>
        <kwd>Fairness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As with any skill, practice makes perfect for programming
skills in computer science (CS) education. Decades of
research in learning theory have demonstrated the importance
of deliberate practice and having a "coach" who provides
feedback for ways of optimizing performance, no matter
whether it is learning a new language, a new math topic, or
a new programming skill [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. With advances in
technologies, many online learning platforms and computer science
courses have been developed to align with the learning
theory by allowing learners to practice a problem multiple
times while providing immediate feedback, which
encourage students to learn from their mistakes and strive for
success [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A recent large-scale randomized controlled study
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] showed that students who used such a math learning
platform learned significantly more, and the impact was
greater for students with lower prior mathematics
achievement. The benefit of timely feedback for programming
assignments in CS education is also evident in a recent review
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>However, repeated attempts in online practice pose challenges for assessing students' skills. For example, how many attempts should be allowed? How should progress during practice be measured? How can we generate fair scores for diverse student groups in CS learning contexts? These questions are especially important for assessing programming skills, since there are many different potential approaches to assessing students' performance when multiple attempts are allowed, while some methods may unintentionally further marginalize students from non-dominant groups in CS education. In this study, we leveraged log data from a pilot study on an online learning platform and applied psychometric models to build a common scale, so that performance scores are comparable across students who practiced different sets of programming tasks. We created and examined six scoring rules based on different maximum numbers of allowed attempts. Under the different scoring rules, we applied rigorous statistical and psychometric methods [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to create comparable performance scores, which further allowed for psychometric analysis of task biases [
        <xref ref-type="bibr" rid="ref10 ref11 ref7 ref8">7, 10, 11, 8</xref>
        ] between the men (the dominant CS student group) and the non-men groups.
      </p>
      <p>The following sections provide a more in-depth
description of our approaches. We first introduce the study design
and data collection, and present our proposed
methodologies to create a common scale for producing comparable
performance scores and for detecting potential task bias.
We conclude the study with a discussion section on our
preliminary findings, the study limitations, as well as our
future direction.</p>
      <p>The current work explores an emerging terrain in
adaptive learning and assessment in CS education. Our work
is pioneering in integrating educational data mining (i.e.,
creation of programming practice features extracted from
log data) and the innovative applications of psychometric
models to build a common scale, measure processes, and
evaluate potential task bias on online practice platforms.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodologies</title>
      <p>This study was approved by an Institutional Review Board
(IRB) prior to engagement with participants for data
collection.</p>
      <sec id="sec-2-1">
        <title>2.1. Study Design and Data Collection</title>
        <p>
          The online learning platform we used for data collection
is capable of providing immediate feedback on test cases
(showing pass or fail) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. More importantly, in this study,
the research team implemented an adaptive
learning/assessment approach to recommend programming task sets that
are suitable to students with different programming skills
for effective learning and meaningful engagement [13].
        </p>
        <p>To answer our research questions, we conducted a pilot study and recruited college students on a compensated, volunteer basis from universities in North America who were enrolled in introductory Python programming courses at the time of participation or prior to it. The participating students majored in different fields and varied in the number of programming-related courses taken and years of programming experience. For the bias analysis, we focused on self-described gender, which included categories such as men, women, non-binary, and free-response options. In order to produce reliable bias analysis results, large samples are recommended; thus, we collapsed gender into two student groups: men (N = 91), the dominant gender group in computer science (CS) learning contexts who tend to be privileged in CS, and non-men (N = 63), including women, non-binary students, and students who reported other gender identities (excluding missing responses). This allowed us to study whether any programming task favored men over students with other gender identities, but did not allow for more fine-grained analysis in the pilot study.</p>
        <p>
          Students practiced Python programming tasks on an
online learning platform designed by researchers to facilitate
research on programming language learning [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. CS
content experts on our team developed 21 Python coding tasks
with 7 to 8 test cases per task. The tasks vary in their measurement specifics and difficulty level, which allowed for the implementation of an adaptive two-stage test design with a routing stage (Stage 1) followed by an easy or a hard task set at Stage 2 (refer to Figure 1).
        </p>
          <p>This adaptive design has been gaining attention and popularity in learning and, particularly, in the assessment field, mainly because it takes into account the differences in student skill levels and provides different tasks or different sets of tasks to suit students' skill levels ([14, 15, 13]). All students answered a common task set of medium difficulty at the first stage and were routed to either an easy or a hard task set at the second stage depending on how well they did in the first stage (refer to Figure 1). Based on their final submitted responses on the common task set, students who scored in the lower half of the score distribution were routed to the easy task set, and those who scored in the upper half were routed to the hard task set. The task sets at the second stage were better aligned to students' skill levels, and hopefully students would be more engaged with the practice.</p>
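          <p>As a concrete illustration of this routing rule, the sketch below shows one way the second-stage assignment could be computed from the common-set scores; the function name, the data layout, and the tie handling at the median are illustrative assumptions rather than the platform's actual implementation.</p>
          <preformat>
import statistics

def route_students(common_scores):
    """Assign each student to the easy or hard second-stage task set.

    common_scores: dict mapping student_id to the final submitted score
    on the common (medium-difficulty) task set. Students above the
    median go to the hard set; the rest go to the easy set.
    """
    median_score = statistics.median(common_scores.values())
    routing = {}
    for student_id, score in common_scores.items():
        routing[student_id] = "hard" if score > median_score else "easy"
    return routing

# Example: four students with common-set scores
print(route_students({"s1": 3, "s2": 8, "s3": 5, "s4": 7}))
          </preformat>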
          <p>The task delivery platform provides students with
immediate feedback on which test cases they passed. Students
were allowed to attempt a task as many times as they liked, or until they passed all the test cases. Students were given a 2-week window to complete all the tasks and were given $80 in recognition of their time and effort upon completing the tasks. Students who invested a mean of less than 3 minutes of effort into each problem set were not compensated, as this indicated either no effort to seek correct solutions or the use of external aids to generate solutions. This was roughly 5% of the initial set of participants, and some of these participants confirmed that they were just seeking compensation.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data Preparation</title>
        <p>
          2.2.1. Practice Features
Log data were collected from the online platform, which contained fine-grained information on what code students produced, what actions they took, for how long, and so on. Such fine-grained data have been shown to be very useful in understanding students' problem-solving processes and performance [16]. In this preliminary study, we focused on two practice features that capture students' intermediate steps before they submitted their final code: the number of attempts (i.e., the number of times they ran/tested their code) and the number of test cases passed in each attempt (please also refer to Table 1 below for the coding rules). More comprehensive analyses of the log data will be conducted in further studies and in the next phase, when larger samples are collected. Note that in the following, a programming task is also referred to as an item in psychometric modeling.
2.2.2. Scoring Rules
The pilot sample was relatively small, which could only support the use of the simplest psychometric model with dichotomous responses (i.e., the Rasch model; [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]), and the empirical data showed that the numbers of passed cases concentrated at the two ends (either a very low number or a very high number); please refer to the Results section and Figure 3. In this study, we therefore experimented with six scoring/grading rules based on the practice features, each producing a dichotomous score on an item, where the item score equals 1 when the student passed at least half of the test cases within the attempts counted by the rule, and 0 otherwise.
        </p>
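        <p>To make the six rules concrete, the sketch below computes a dichotomous item score from the two practice features; it assumes that rules S1 to S5 use the best result within the first one to five attempts and that SF uses the final attempt, which is one plausible reading of the rules in Table 1, and the function name and data layout are illustrative, not the study's actual scoring code.</p>
        <preformat>
def item_score(passed_per_attempt, n_test_cases, rule):
    """Dichotomous item score under one of the six scoring rules.

    passed_per_attempt: numbers of passed test cases at each attempt,
        in chronological order (extracted from the log data).
    n_test_cases: total number of test cases for the task (7 or 8 here).
    rule: "S1".."S5" use the first k attempts; "SF" uses the final attempt.
    Returns 1 if the counted attempt(s) pass at least half of the test
    cases, and 0 otherwise.
    """
    if not passed_per_attempt:
        return 0
    if rule == "SF":
        best = passed_per_attempt[-1]
    else:
        k = int(rule[1:])               # e.g. "S3" counts the first 3 attempts
        best = max(passed_per_attempt[:k])
    return int(best >= n_test_cases / 2)

# Example: one student's attempts on an 8-case task
attempts = [2, 3, 5, 8]
for rule in ["S1", "S2", "S3", "S4", "S5", "SF"]:
    print(rule, item_score(attempts, 8, rule))
        </preformat>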
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Data Modeling</title>
        <p>
          In this section, we provide concise, conceptual descriptions of the psychometric and statistical methods used. Interested readers are encouraged to refer to the cited papers for technical details.
2.3.1. Item Response Models
Because of the adaptive study design, there were missing data by design. The total sum of passed cases over all tasks is not comparable across students, as some students took the harder task set and some took the easier one. It was therefore necessary to create a common (base) scale for comparability. We applied Item Response Theory (IRT) models [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] from psychometrics, which use the information in the observed task responses for task difficulty calibration (i.e., estimation). Due to the relatively small sample size (N = 159), we applied the Rasch model, the IRT model with the fewest parameters, to obtain reliable parameter estimates [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In the Rasch model (i.e., a latent logistic regression model), the probability that Student i answers Task j correctly (i.e., y_ij = 1) depends on the student's latent ability θ_i and the item difficulty d_j:
        </p>
        <p>P(y_ij = 1 | θ_i, d_j) = 1 / (1 + exp{−(θ_i + d_j)}).   (1)</p>
        <p>A student with higher programming proficiency (i.e., a larger θ_i) has a higher probability of getting the item correct, whereas a harder item (indicated by a lower value of d_j) decreases that chance. Note that in the following, to be consistent with psychometric terminology, a programming task is also referred to as an item.</p>
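        <p>As a minimal numerical illustration of Equation 1 (not the calibration software used in the study), the probability of a correct response can be computed as follows; the ability and difficulty values are made up for the example.</p>
        <preformat>
import math

def rasch_probability(theta, d):
    """Probability of a correct response under the Rasch model (Eq. 1),
    given ability theta and item difficulty d (a lower d means a harder item)."""
    return 1.0 / (1.0 + math.exp(-(theta + d)))

# A more able student (theta = 1.0) and a less able one (theta = -1.0)
# on the same item with difficulty d = -0.5:
print(rasch_probability(1.0, -0.5))   # about 0.62
print(rasch_probability(-1.0, -0.5))  # about 0.18
        </preformat>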
        <sec id="sec-2-3-1">
          <title>Common set</title>
        </sec>
        <sec id="sec-2-3-2">
          <title>Easy set</title>
        </sec>
        <sec id="sec-2-3-3">
          <title>Hard set</title>
        </sec>
        <sec id="sec-2-3-4">
          <title>Sample</title>
        </sec>
        <sec id="sec-2-3-5">
          <title>Every Student</title>
        </sec>
        <sec id="sec-2-3-6">
          <title>Lower half scorers</title>
        </sec>
        <sec id="sec-2-3-7">
          <title>Upper half scorers</title>
        </sec>
        <sec id="sec-2-3-8">
          <title>Item dificulty</title>
          <p>ℎ
and accurate estimates of parameters  s (refer to [18] and
estimates were
references therein).</p>
          <p>Note that the item calibrations were conducted on the
ifrst attempt only; that is, items were scored using S1 in
as our common (based) scale for the subsequent analyses,
which ensured that all subsequent performance scores based
on diferent scoring rules were comparable.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Performance Scores</title>
        <p>
          For a test form consisting of J = 21 items, the true score [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for Student i is defined as
        </p>
        <p>τ_i = Σ_{j=1}^{J} P(y_ij = 1 | θ_i, d_j) = Σ_{j=1}^{J} 1 / (1 + exp{−(θ_i + d_j)}).   (2)</p>
        <p>However, again, because different students took different test forms, either Form 1 (the common item set plus the easy item set; J_1 = 15) or Form 2 (the common item set plus the hard item set; J_2 = 15), true scores on different test forms are not comparable either. Thus, it is necessary to "equate" true scores from one form to the other to produce comparable scores. The equating process was realized through the IRT equating procedure [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to produce an equated score for each student on the common scale determined by the item parameters d_j obtained in the item calibration step above.</p>
        <p>In our study, for each of the six scoring rules, two IRT
equatings were conducted: one from Form 1 to Full Form
(the full set; J=21), and the other from Form 2 to Full Form.
As such, the IRT equating acted as an imputation method, and
the equated score was set on the Full Form to reflect a student's
true score as if the student took the full set of 21 items.</p>
        <p>Applying each of the six scoring rules in Table 1 to the item scores y_ij in Equation 1, we produced six equated scores for each student. Because the item parameters d_j defined the base scale and were used in all IRT equatings, the equated scores were comparable across students and across scoring rules. The equated scores can, therefore, be used as performance scores for comparison, and the changes in these performance scores (based on different scoring rules) reflect skill progress during the programming processes on the tasks.</p>
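        <p>The sketch below illustrates the logic of Equations 1 and 2 behind the IRT true-score equating described above: the ability implied by a student's score on the form they took is located numerically, and the expected score on the full 21-item form is reported as the equated score. The item difficulty values and the bisection routine are illustrative assumptions, not the operational equating software.</p>
        <preformat>
import math

def true_score(theta, difficulties):
    """Expected (true) score on a form: the sum of Rasch probabilities (Eq. 2)."""
    return sum(1.0 / (1.0 + math.exp(-(theta + d))) for d in difficulties)

def equate_to_full_form(observed_score, form_difficulties, full_difficulties):
    """IRT true-score equating: find the ability whose true score on the
    taken form matches the observed score, then return the corresponding
    true score on the full form. Uses bisection because the true score
    is increasing in theta."""
    lo, hi = -6.0, 6.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if observed_score > true_score(mid, form_difficulties):
            lo = mid
        else:
            hi = mid
    theta_hat = (lo + hi) / 2.0
    return true_score(theta_hat, full_difficulties)

# Toy example: a 15-item form (common + easy) nested in a 21-item full form
form1 = [0.5] * 9 + [1.0] * 6               # assumed difficulty values
full = [0.5] * 9 + [1.0] * 6 + [-1.0] * 6   # the full set adds 6 hard items
print(round(equate_to_full_form(10.0, form1, full), 2))
        </preformat>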
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Item Bias Detection</title>
        <p>
          It is sensible to evaluate biases at the item level, since the performance score is a function of aggregated item scores (using the psychometric models described above). For each item, we conducted differential item functioning (DIF) analyses comparing the average item scores of the non-men (focal) group with those of the men (reference) group [19]. There are various statistical approaches for DIF analyses [
          <xref ref-type="bibr" rid="ref6">6, 19</xref>
          ]. In this study, we applied a DIF approach that can evaluate the difference in the average item scores (standardized mean difference, SMD), as well as the difference in the average number of attempts (differential number of attempts), conditioning on students who have similar programming skills.
        </p>
        <p>More specifically, let y be the item score (0 or 1) and S be the total performance score [20]. To assess whether an item functioned differently for students in two different groups, a studied/focal group (Group F) and a comparison/reference group (Group R), a comparison is made between the expected item scores given the total score, E_F(y | S) and E_R(y | S). The SMD is a weighted sum of the differences in these conditional expectations between the focal and reference groups for an item; that is,</p>
        <p>SMD = Σ_s p_Fs [E_F(y | S = s) − E_R(y | S = s)],
where p_Fs is the proportion of focal group members in score group s. In practice, E(y | S = s) is estimated by the average item score in score group s. In the DIF context, S is often called the matching variable [
          <xref ref-type="bibr" rid="ref10 ref11">11, 10, 20</xref>
          ].</p>
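        <p>The SMD defined above can be estimated directly from the matched score groups. The sketch below is an illustrative computation that assumes each student's 0/1 item score, total performance score, and group label are available; it is not the operational DIF software used in the study.</p>
        <preformat>
from collections import defaultdict

def smd(item_scores, total_scores, groups, focal="non-men", reference="men"):
    """Standardized mean difference (SMD) for one item.

    item_scores, total_scores, groups: equal-length lists giving each
    student's 0/1 item score, total performance score (the matching
    variable), and group label. The weights are the focal group's
    proportions across score groups.
    """
    by_score = defaultdict(lambda: {focal: [], reference: []})
    for y, s, g in zip(item_scores, total_scores, groups):
        if g in (focal, reference):
            by_score[s][g].append(y)

    n_focal = sum(len(at_s[focal]) for at_s in by_score.values())
    value = 0.0
    for at_s in by_score.values():
        if at_s[focal] and at_s[reference]:
            p_fs = len(at_s[focal]) / n_focal
            e_f = sum(at_s[focal]) / len(at_s[focal])
            e_r = sum(at_s[reference]) / len(at_s[reference])
            value += p_fs * (e_f - e_r)
    return value

# Tiny example with two matched score groups
print(smd([1, 0, 1, 1, 0, 1],
          [1, 1, 2, 1, 1, 2],
          ["men", "non-men", "men", "non-men", "men", "non-men"]))
        </preformat>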
        <p>A statistically significant and negative SMD may indicate that the studied item favors the reference group, and a statistically significant and positive SMD that it favors the focal group. The choice of this DIF method was based on the small sample sizes in our pilot data and the intuitive interpretation of the DIF effect size (i.e., the SMD). Similarly, the differential number of attempts is the difference in the weighted average numbers of attempts between the two groups after matching on the performance score. Technical details are also available in [18].</p>
        <p>[Figure 3. Frequency distributions of the number of passed test cases based on the first one, two, three, four, and five attempts, respectively, on one task with eight test cases. In the first two attempts, no students passed all 8 cases.]</p>
        <p>[Figure 4. Scree plot: variance explained (%) by the principal components of the common item set responses.]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <sec id="sec-3-1">
        <title>3.1. Data Summary</title>
        <p>As discussed earlier, if students kept trying a programming
task, they were most likely to pass all the test cases. Figure
3 shows the typical data pattern on one task. Most students’
numbers of passed cases were either close to 1 or close to
7, and the counts around 1 shifted to 7 or 8 as they made
more attempts on the task. The same pattern was observed
on most of the tasks, and thus they supported the choice of
scoring the tasks dichotomously.</p>
        <p>The preliminary PCA on the common item set showed that the dichotomously scored responses had one dominant dimension (i.e., unidimensionality was acceptable; refer to Figure 4).</p>
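        <p>The dimensionality check can be reproduced with a standard principal component analysis of the dichotomously scored common-set responses. The sketch below, using numpy, is an illustrative version of such a check rather than the exact analysis pipeline used in the study; the toy response matrix is made up.</p>
        <preformat>
import numpy as np

def variance_explained(responses):
    """Proportion of variance explained by each principal component.

    responses: 2-D array (students x items) of 0/1 item scores on the
    common item set. A single dominant proportion suggests that one
    dimension (overall programming skill) drives the responses.
    """
    X = np.asarray(responses, dtype=float)
    X = X - X.mean(axis=0)                        # center each item
    cov = np.cov(X, rowvar=False)                 # item covariance matrix
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]   # largest eigenvalue first
    return eigenvalues / eigenvalues.sum()

# Toy data: 6 students by 4 items
demo = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 0],
        [1, 0, 1, 1]]
print(np.round(variance_explained(demo), 2))
        </preformat>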
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Item Dificulty Distribution</title>
        <p>Figure 5 presents the item difficulties, calibrated on the first attempt (S1) with the Rasch model. The item difficulties of the 21 programming tasks had a good spread: Tasks 8, 19, 20 and 21 were challenging to the participants, Task 10 was very easy, and the rest of the tasks were somewhere in between (Figure 5 plots the item difficulty parameters d).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Performance Score Distributions</title>
        <p>Performance score distributions based on different scoring
rules (S1 to S5, and SF as in Table 1) are shown in Figure 6. As
expected, as students practiced more on the programming
tasks, they made progress and their performance became
better (except for a few students on the left tails of the
distributions who were likely to give up early). Overall,
students’ performance scores based on the final attempts
(SF) are clearly better than the other intermediate scores
(the solid black curve as shown in Figure 6).</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Potential Item Bias</title>
        <p>
          Item bias analysis results show that there was an overall
trend of reduced bias as students attempted more times on
the tasks. Using an effect size of 0.1 as a threshold, the
commonly used cut point to flag a meaningful difference in SMD
(refer to [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and references therein), Figure 7 shows that,
in the first attempts (denoted as the solid orange circles),
Items 12 and 14 were easier, but Items 2, 11, 16 and 21 were
harder for the non-men group. However, the magnitude of
SMD dropped close to zero after the final attempt (denoted
as pink stars). In fact, the mean absolute error (MAE) of
the SMDs based on the final attempt had the smallest value
(0.036) compared to those based on the first 1 to 5 attempts
(0.07 ∼ 0.08).
        </p>
        <p>Note that, because of the relatively small sample sizes, these SMDs are not statistically significant, except for Item 21 (which had a sample size of 46 for the men group and 19 for the non-men group). Item 21 (a task that involved programming skills such as loops, nested branching, and string manipulation) seems to be much harder for the non-men group in all of the first 5 attempts, particularly when we only count the first 4 attempts. However, similar to the other items, the SMD of Item 21 was reduced to almost zero after the final attempt.</p>
        <p>Figure 8 shows the differential number of attempts between the two groups. The non-men group attempted more times on Items 1, 3, 4, 16, 17, 19, and 21. Particularly on Item 21, the non-men group attempted about 5 more times on average (with statistical significance) than the men group with comparable programming skills. In other words, the diminished SMD on Item 21 between the two groups might have resulted from significantly more effort and more attempts on this item by the non-men group.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>In this pilot study, we attempted to address an emerging
research topic on how to measure learning progress in the
adaptive learning and practice platform. We leveraged log
data collected from the online learning platform to extract
programming practice features and developed a general
approach to integrate statistical and psychometric methods
and best practice with log data features to measure students’
progress in programming practice. The statistical and
psychometric methods helped to create dynamic performance
scores that captured students’ progress during practice and
were comparable within individual students who made
different attempts and across students who practiced diferent
sets of programming tasks. Such comparability of
performance scores not only helps to measure progress in
programming practice, but it also helps to evaluate potential
task biases between diverse student groups. The current
study contributes to methodologies in measuring learning
progress and evaluating programming tasks in the dynamic
and adaptive online practice setting, which is likely to be
applicable to other practice and assessment scenarios.</p>
      <p>Our preliminary results may have implications for
assessing programming tasks in online practice. First, even in a
complex situation where multiple attempts are allowed and
different programming tasks are adaptively recommended
to students, we can still use features extracted from log
files to capture students' programming processes and take
advantage of, and develop if necessary, statistical and
psychometric methods to assess students’ progress and produce
comparable performance scores. Most importantly, these
practice features and rigorous methods can help identify
what tasks may have potential bias and favor one student
group over the other. Overall, our preliminary results show
that repeated practice with immediate feedback improved
all students' performance and reduced item biases.</p>
      <p>These preliminary findings also suggest assessing
students’ performance on their final attempt, which gives
control back to students and allows them to decide how many
times they want to practice a task. This may boost students’
confidence and improve their performance. As long as
students keep trying on a task with feedback, they may
eventually pass all test cases, and their effort may help mitigate
potential task bias. In addition, our preliminary findings
indicate that programming tasks that showed initial bias need
more investigation from content perspectives, especially if
they are used in contexts that do not allow repeated practice
and immediate feedback and that have high stakes.</p>
      <p>In terms of recommending tasks for students with
different skill levels to practice, performance scores based on
the first attempt, or the first few attempts, may have better
differentiability than those based on the final attempt; thus
they are recommended for targeted classroom instruction,
if needed.</p>
      <p>One major limitation of the current study was the
relatively small sample sizes of the pilot data and limited
numbers of programming tasks, which makes it challenging to
generalize the preliminary findings to broader CS
education scenarios. The team is preparing for a larger scale data
collection in the coming year to reexamine the research
questions and evaluate whether similar findings remain. In
particular, follow-up studies will investigate why some tasks
may be biased from content perspectives and aim to provide
guidelines for developing fair programming tasks. Further
studies will also make use of the fine-grained log data and
experiment with machine learning techniques in better
understanding students’ learning behaviors and programming
styles, as well as their association with characteristics of
programming tasks.</p>
      <p>Finally, we end with implications for CS educators. For
CS educators, these preliminary results clearly support
formative assessment policies that allow for repeated practice
and feedback, to encourage learning and mitigate potential
biases in task design that might advantage some groups in
CS education. They also suggest that restricting attempts
may limit students' opportunities to demonstrate their skills
and learn from feedback. Future work
should further explore these recommendations, particularly
in other learning contexts (e.g., with different forms of
feedback, such as auto-graders with hints), other identities (e.g.,
race, ethnicity, ability), and items (e.g., more complex
programming assignments often found in formal education).</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This material is based upon work supported by the National
Science Foundation under Grant Nos. 2055550 and 2100296.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Bransford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Brown</surname>
          </string-name>
          , R. R. Cocking (Eds.),
          <article-title>How people learn: Brain, mind, experience, and school</article-title>
          , National Academy Press, Washington, DC,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Smith</surname>
          </string-name>
          <string-name>
            <surname>IV</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ottaway</surname>
          </string-name>
          , J. Wilson, T. Greer,
          <article-title>Towards understanding the effective design of automated formative feedback for programming assignments</article-title>
          ,
          <source>Computer Science Education</source>
          <volume>32</volume>
          (
          <year>2022</year>
          )
          <fpage>105</fpage>
          -
          <lpage>127</lpage>
          . URL: https://doi.org/10.1080/08993408.2020.1860408. doi:10.1080/08993408.2020.1860408.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Roschelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Mason</surname>
          </string-name>
          ,
          <article-title>Investigating efficacy, moderators and mediators for an online mathematics homework intervention</article-title>
          .
          <source>Journal of Research on Educational Effectiveness</source>
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>235</fpage>
          -
          <lpage>270</lpage>
          . URL: https://doi.org/10.1080/19345747.2019.1710885.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Luxton-Reilly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Augar</surname>
          </string-name>
          , et al.,
          <article-title>Gender equity in computing: International faculty perceptions and current practices</article-title>
          ,
          <source>in: ITiCSE '16: Proceedings of the 2016 ITiCSE</source>
          , Working Group Reports,
          <year>2016</year>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Oleson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Salac</surname>
          </string-name>
          , et al.,
          <article-title>A decade of demographics in computing education research: A critical review of trends in collection, reporting, and use</article-title>
          ,
          <source>in: ICER '22: Proceedings of the 2022 ACM Conference on International Computing Education Research</source>
          ,
          <year>2022</year>
          . URL: https://faculty.washington.edu/ajko/papers/OlesonXie2022Demographics.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>R. J. De Ayala</surname>
          </string-name>
          ,
          <article-title>The theory and practice of item response theory</article-title>
          , Guilford Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wortzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Investigating item bias in a CS1 exam with differential item functioning</article-title>
          ,
          <source>in: Proceedings of the 52nd ACM Technical Symposium on Computer Science Education</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1142</fpage>
          -
          <lpage>1148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <article-title>Domain experts' interpretations of assessment bias in a scaled, online computer science curriculum</article-title>
          ,
          <source>in: Proceedings of the 52nd ACM Technical Symposium on Computer Science Education (SIGCSE '21)</source>
          , ACM, Virtual Event, USA,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Kolen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Brennan</surname>
          </string-name>
          ,
          <article-title>Test equating, scaling, and linking: Methods and practices</article-title>
          , 2nd. ed.,
          <source>Springer-Verlag</source>
          , New York, NY,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ercikan</surname>
          </string-name>
          ,
          <article-title>Comparing test-taking behaviors of English language learners (ELLs) to non-ELL students: Use of response time in measurement comparability research</article-title>
          ,
          <source>ETS Research Report Series</source>
          (
          <year>2021</year>
          ). URL: https://doi.org/10.1002/ets2.12340.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ercikan</surname>
          </string-name>
          ,
          <article-title>Differential rapid responding across language and cultural groups</article-title>
          ,
          <source>Educational Research and Evaluation</source>
          <volume>26</volume>
          (
          <year>2021</year>
          ). URL: https://doi.org/10.1080/13803611.2021.1963941.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K. D.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <article-title>Developing novice programmers' self-regulation skills with code replays</article-title>
          ,
          <source>in: Proceedings of the 2023 ACM Conference on International Computing Education Research</source>
          , Volume
          <volume>1</volume>
          .,
          <year>2023</year>
          , pp.
          <fpage>298</fpage>
          -
          <lpage>313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] K. Yamamoto, H. J. Shin, L. Khorramdel, Multistage adaptive testing design in international large-scale assessments, Educational Measurement: Issues and Practice 37 (2018) 16–27. doi:10.1111/emip.12226.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] H. Wainer, R. J. Mislevy, Item response theory, item calibration, and proficiency estimation, in: H. Wainer (Ed.), Computerized adaptive testing: A primer, Erlbaum, 1990, pp. 65–102.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] D. Mead, An introduction to multistage testing, Applied Measurement in Education 19 (2006) 185–187. URL: https://doi.org/10.1207/s15324818ame1903_1. doi:10.1207/s15324818ame1903_1.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] K. Ercikan, H. Guo, H.-H. Por, Uses of process data in advancing the practice and science of technology-rich assessments, OECD Publishing, 2023. URL: https://www.oecd-ilibrary.org/content/component/7b3123f1-en. doi:10.1787/7b3123f1-en.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] S. Kim, A comparative study of IRT fixed parameter calibration methods, Journal of Educational Measurement 43 (2006) 353–381. URL: https://doi.org/10.1111/j.1745-3984.2006.00021.x. doi:10.1111/j.1745-3984.2006.00021.x.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] H. Guo, M. S. Johnson, D. F. McCaffrey, L. Gu, Practical considerations in item calibration with small samples under multistage test design: A case study, Technical Report RR-24-03, ETS, 2024. URL: https://doi.org/10.1002/ets2.12376.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] R. Zwick, A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement, Research Report No. RR-12-08, 2012. URL: https://doi.org/10.1002/j.2333-8504.2012.tb02290.x.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] P. W. Holland, D. T. Thayer, Differential item functioning and the Mantel-Haenszel procedure, in: H. Wainer, H. I. Braun (Eds.), Test validity, Erlbaum, Hillsdale, NJ, 1988, pp. 129–145.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] R. P. Chalmers, mirt: A multidimensional item response theory package for the R environment, Journal of Statistical Software 48 (2012) 1–29. URL: https://doi.org/10.18637/jss.v048.i06. doi:10.18637/jss.v048.i06.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>