=Paper= {{Paper |id=Vol-2734/paper10 |storemode=property |title=A Data-Driven Approach to Automatically Assessing Concept-Level CS Competencies Based on Student Programs |pdfUrl=https://ceur-ws.org/Vol-2734/paper10.pdf |volume=Vol-2734 |authors=Bita Akram,Hamoon Azizolsoltani,Wookhee Min,Eric Wiebe,Anam Navied,Bradford Mott,Kristy Elizabeth Boyer,James Lester |dblpUrl=https://dblp.org/rec/conf/edm/AkramAMWNMBL20 }} ==A Data-Driven Approach to Automatically Assessing Concept-Level CS Competencies Based on Student Programs== https://ceur-ws.org/Vol-2734/paper10.pdf
                    A Data-Driven Approach to Automatically
                   Assessing Concept-Level CS Competencies
                          Based on Student Programs

            Bita Akram1, Hamoon Azizolsoltani1, Wookhee Min1, Eric Wiebe1, Anam Navied1,
                         Bradford Mott1, Kristy Elizabeth Boyer2, James Lester1

          1North Carolina State University, Raleigh, North Carolina                     2University of Florida, Gainesville, Florida

         {bakram, wmin, hazizso, wiebe, anavied, bwmott,                                             keboyer@ufl.edu
ABSTRACT
The rapid increase in demand for CS education has given rise to increased efforts to develop data-driven tools to support adaptive CS education. Automated assessment and personalized feedback are among the most important tools for facilitating effective learning experiences for novice students. An important first step in providing effective feedback tailored to individual students is assessing their areas of strength and weakness with regard to core CS concepts such as loops and conditionals. In this work, we propose a hypothesis-driven analytics approach to assessing students’ competencies in core CS concepts at a fine-grained level. We first label programs obtained from middle grades students’ interactions with a game-based CS learning environment featuring block-based programming, based on a rubric that was designed to assess students' competency in core CS concepts from their submitted programs. Then, we train a variety of regression models, including linear, ridge, lasso, and support vector regression models, as well as Gaussian process regression models, to infer students’ scores for each of the identified CS concepts. The evaluation results suggest that Gaussian process regression models often outperform other baseline models for predicting student competencies in core CS concepts with respect to mean squared error and adjusted coefficient of determination. Our approach shows significant potential to provide students with detailed, personalized feedback based on their inferred CS competency levels.

KEYWORDS
Automated Program Assessment, Concept-Level CS Assessment, Gaussian Process Regression, Evidence-Centered Assessment Design

1.    INTRODUCTION
As programming has become a fundamental skill in the digital economy, the interest in learning how to program at early ages is rapidly growing [15,37]. However, the complexity of syntax in text-based programming has been found to be a barrier for novice learners [13,14]. To address this challenge, block-based programming environments have replaced textual syntax with visual and elaborative blocks that utilize descriptive text, color, and shape to facilitate programming for novice learners [13,25]. This is particularly beneficial for traditionally underrepresented groups in computer science [6]. Although these environments remove the syntax barrier, effective and tailored scaffolding and feedback are still required to support students’ mastery of the computer science (CS) concepts that are essential for programming. Providing students with effective scaffolding and feedback would significantly benefit from reliable assessments that can evaluate student competencies with respect to core CS concepts [12,20]. Effective assessment can inform adaptive pedagogical strategies such as offering hints and feedback and performing tailored problem selection. Automated assessments can bridge the gap between the growth in demand for CS education and the limited supply of qualified teachers.

While research on conducting automated assessment of student-generated programs is gaining momentum, limited previous work has yielded methods to infer students’ mastery of fine-grained CS concepts exercised in a particular computational problem. Previous work has focused on predicting an overall score to represent students’ general level of mastery in programming [17,19]. However, identifying students’ strengths and weaknesses on specific CS concepts could enable instructors to provide students with the adaptive scaffolding and tailored practice needed to master those concepts. In addition, fine-grained assessment of CS competencies can inform intervention strategies for intelligent learning environments to perform student-adaptive hint and feedback generation as well as problem selection. Furthermore, using this information in an open learner model could enable students to focus on areas in which they need more practice by monitoring mastery in CS concepts [7, 35].

We follow a hypothesis-driven learning analytic approach [14] based on Evidence-Centered Design (ECD) [22] to identify core CS concepts highlighted in a learning environment and to assess students’ competencies in relation to each of the target concepts. We explore this in the context of a bubble-sort challenge within the ENGAGE game-based learning environment (Figure 1) that requires

implementing a program using a block-based programming interface. Based on the hypothesis-driven learning analytic approach, we first identify the CS concepts that are targeted by this activity. We also collect students’ submitted solutions to this challenge in the form of snapshots of their submitted programs. Content area experts then use this information to devise a rubric that can identify students’ mastery of targeted CS concepts based on evidence captured from their program artifacts. Examples of CS concepts and practices for the bubble sort challenge are developing appropriate algorithms and programs, and appropriate use of computer science constructs, such as loops and conditionals.

        Figure 1. ENGAGE game-based learning environment: students write a program to filter a data set and loop over it.

We use the rubric to label our training dataset. We further extract structural and semantic information from program snapshots by encoding them as structural n-grams following the approach in [1]. A variety of regression models, including lasso, ridge, support vector regression, and Gaussian process regression (GPR) models, are applied to the generated feature set to infer students’ competencies for their overall grade and for each of the identified target CS concepts. We hypothesize that GPR models are particularly suitable for this type of inference task as they are capable of handling the noise resulting from the subjective process of grading programs. The results demonstrate the effectiveness of these models for predicting students’ CS competencies.

2.    RELATED WORK
Two approaches to automatic program assessment can be distinguished: dynamic assessment and static assessment [5, 16]. Dynamic assessment is used to assess the correctness of completed programs using pre-defined test data [16, 18, 31]. Static assessments, on the other hand, can assess partial programs for partial correctness. To perform this latter form of assessment, an important step is transforming the program into an intermediate representation such as abstract syntax trees, control flow graphs, and program dependence graphs. The intermediate representation is then evaluated to determine its degree of correctness, efficiency, and quality. Since all programming languages, including block-based programming languages, can be represented using the same intermediate representations, static assessment techniques are syntax-free and can be adapted to assess any programming language.

In static assessments, correctness is usually assessed through character analysis, string analysis, syntax analysis, and semantic analysis. Quality is assessed by software metrics such as the number of lines of code, the number of variable statements, and the number of expressions. For example, work by Wang and colleagues presented a semantic similarity-based approach to assess the correctness of a C program by comparing it against a correct program model [34]. In this work, they first reduced the state space of programs by conducting a set of program standardizations, including expression, control structure, and function invocation standardization. Then, they calculated a similarity factor based on size, structure, and statement similarity subfactors weighted by grading criteria.

Most work on static assessment utilizes a variety of similarity measurements to calculate the relative correctness of a program in reference to other scored programs [30, 33]. However, the approaches described above typically yield an overall score rather than a fine-grained analysis of student competencies on specific CS concepts, skills, and knowledge. In an educational context, this detailed diagnostic information is essential for providing students with the targeted scaffolding and support they need. In this paper, we propose an automated assessment framework that can provide students and their instructors with an assessment that is both detailed and interpretable.

3.    LEARNING ENVIRONMENT AND DATASET
In this study, we collect data from middle-grade students’ interactions with the ENGAGE game-based learning environment
      Figure 2. ENGAGE game-based learning environment. (Left) The bubble sort task in the game-based learning environment.
   (Right) Program for the bubble sort task: the read-only code for opening the door and an example of a correct implementation
                                                   of the bubble sort written by a student.
that is a computational thinking (CT)-focused educational game for middle school students (ages 11-13) [2]. The CS content of the ENGAGE game is based on the AP CS Principles curriculum [9]. Students learned CS competencies ranging from abstraction and algorithmic thinking to computational problem solving and programming. The computational challenges within the game were designed to prepare students for computer science work in high school, and to promote positive attitudes toward computer science. This game features an underwater research station that has lost connection with the outside world, and students are sent as computer science specialists to investigate the issue (Figure 1, Left) [21]. To successfully complete the game, students need to move around the station and solve different computational thinking problems with block-based programming.

The focus of this work is on a particular CT challenge where students need to implement a bubble sort algorithm using block-based programming (Figure 1, Right) to escape a room. Students implement an algorithm to sort six randomly positioned containers within a containment device. Once the containers are sorted, students can open the door and escape the room. To write the bubble-sort algorithm, students have access to a limited number of necessary blocks, including: a repeat block that repeats every nested block a number of times specified by students; a conditional block that checks whether the right container is smaller than the left container; a conditional block that checks whether the cart has hit the right wall; a swap containers block that swaps a container with its adjacent right container; a move block that moves the cart one position toward the right; and a reset block that brings the cart to the leftmost position.

On average, students played the game over the course of two weeks. As students interacted with the game, all of their interactions with the game were logged, such as dragging programming blocks to write a program and executing programs. For this study, we collected data from middle grade students’ interactions with the bubble sort challenge. Data were collected from five classrooms in three schools in the United States. The data used for this study are from 69 consented students.

4.   METHOD
A supervised learning approach is utilized to infer students’ competency for each of the CS concepts identified for the bubble-sort algorithm task. To infer students’ grades, we first label their program snapshots utilizing the rubric presented in Table 1. We then transform students’ program snapshots to a feature set utilizing a novel n-gram encoding approach following [1]. Finally, we infer students’ scores by applying regression models on the structural n-gram-based feature set.

In comparison to [1], our work focuses on assessing students’ mastery of identified, individual CS concepts underlying the bubble sort challenge. By utilizing this assessment framework, we can train separate regression models utilizing the n-gram feature set to assess student programs based on each individual CS concept. In this study, we utilize a variety of regression models, including linear, lasso, ridge, support vector regression (SVR), and Gaussian process (GP) regression models, to predict the algorithmic quality of students’ programs.

4.1    Rubric Design
Students’ programs are labeled utilizing a rubric that is devised following an evidence-centered assessment design (ECD) approach [27]. An important first step in ECD is domain modeling, where relevant CS concepts are identified through the collaborative work of domain experts and teachers [11, 22]. The CS concepts are then used to develop the specifications of an assessment of the CS concepts. The conceptual assessment framework consists of the following: 1) the student model, which represents what students know or can do; 2) the evidence model, which contains evidence that drives the student model; and 3) the task model, which contains tasks, interactions with which can generate evidence [22, 29].

Evidence is derived from students’ actions during the learning tasks to predict their mastery of CS concepts. An important requirement is to match evidence derived from student programs to proficiency in each CS concept covered in the assessment. In this work, the student model represents students’ knowledge of particular CS concepts, the evidence model is based on evidence rules that extract the program structures from their programs representing their knowledge of each identified CS concept, and the task model is the bubble sort challenge in the ENGAGE game-based learning environment. Following this approach, we design evidence rules specific to the task at hand to provide assessment arguments for the proficiency of the CS concepts in our student model. Table 1 shows the rubric for assessing the CS concepts identified through the domain modeling phase [3, 4].

4.2    Data Annotation
Our training dataset contains 1,570 programs submitted by 69 students when solving the bubble sort challenge. The algorithmic “Effectiveness” and “Conciseness” scores are two mutually exclusive metrics designed to capture core qualities of programs. For example, a program that contains all the necessary coding constructs to receive full points for “appropriate use of conditional statements” might contain redundant copies of the same coding constructs that interfere with the correctness of the algorithm. This deficiency is captured in the “Conciseness” score. A similar program might have the wrong ordering of the coding constructs that negatively affects the correctness of the algorithm. This is captured by the “Effectiveness” score. In this rubric, the range of possible scores for the “design and implementation of effective and generalizable algorithm” is 0 to 10. Similarly, the range is 0 to 3 for “Appropriate use of loop statements,” 0 to 6 for “Appropriate use of conditional statements,” and 0 to 3 for “Appropriate combination of loops and conditional statements.” The overall score (overall algorithmic quality score) ranges from 0 to 22.

Two annotators with CS backgrounds annotated 20% of the submissions for algorithmic effectiveness and conciseness scores. Using Cohen’s kappa [8], an inter-rater agreement of 0.848 for “effectiveness” and 0.865 for “conciseness” was achieved. The two annotators discussed their disagreements, and one annotator tagged the remaining 80% of the dataset. These annotations serve as the ground truth for our data corpus. It is important to note that the annotation process introduces noise into the training dataset [23]. This is because different scorers may have different perceptions of a program’s algorithmic “Effectiveness” and “Conciseness.” As a result, the dataset is inherently noisy, which must be taken into account when designing the models for the automated assessment framework. To handle this uncertainty, we adopt a Gaussian process regression model that returns a distribution for the inferred values, including a mean and a standard deviation.

4.3    Feature Engineering
We use abstract syntax trees (ASTs) as the intermediate representation for our automated assessment task. After transforming students’ program snapshots to their corresponding ASTs [28], we encode them as structural n-grams to extract features that are representative of the semantic information in students’ programs following the previous work [1]. Hierarchical and ordinal n-grams are two important structures in an AST. The parent-child relationships between different blocks are encoded in hierarchical structures, and the placement order of blocks is encoded in ordinal structures. To enable the proposed automated assessment to assign partial scores to incomplete solutions, we need to extract n-grams with varying lengths of n to capture the most fine-grained structural information present in an AST.

An AST generated from a sample program is shown in Figure 3. A partial hierarchical (left) and ordinal (right) n-gram encoding is also shown in this figure. In Figure 3, each colored oval shows the n-gram encoding of a specific n. In this example, n-grams of size one are marked with green ovals, n-grams of size two with blue ovals, and n-grams of size three with purple ovals. The frequency values for each n-gram encoded feature are shown beside the AST. All of the other n-gram feature values are zero since they are not in this AST. We then merge the two feature sets together to build the final feature set containing both hierarchical and ordinal n-gram encodings corresponding to each program. Note that unigrams are repeated in both hierarchical and ordinal n-gram encodings of the ASTs, and thus only one copy of the unigram features is used in the final feature set. The occurrence of similar n-grams for n greater than one in both hierarchical and ordinal encodings indicates the presence of different structures in the AST, and thus both are preserved.

Preliminary explorations revealed that including sequences of lengths larger than 4 for hierarchical n-grams and 3 for ordinal n-grams exponentially increases the sparsity of the dataset. To address the sparsity issue, we capped the n-gram size at 4 for the hierarchical n-gram encoding and at 3 for the ordinal n-gram encoding. The final feature set consists of sequences of length one (i.e., unigrams) up to length four for hierarchical n-grams (i.e., 4-grams) and up to length three for ordinal n-grams (i.e., 3-grams) that are repeated at least three times (again to address the data sparsity issue) throughout the dataset, resulting in 184 distinct features.

Table 1. Assessment items and detailed rubrics for each item.

CS Concepts and Practices        Detailed Rubric

Design and implementation of     • The program contains all necessary code elements.
effective and generalizable      • The code elements have the correct order and hierarchy (Effectiveness).
algorithms                       • The program does not contain redundant code elements that falsify the
                                   logic of the algorithm (Conciseness).

Appropriate use of loop          • The repeat block is present.
statements                       • The iteration value is set to a positive number.
                                 • It encompasses at least one block.

Appropriate use of               • Both necessary conditional statements are used.
conditional statements           • A conditional statement checks the size of two adjacent containers and
                                   swaps them if they are not ordered properly.
                                 • A conditional statement checks if the arm has reached the right wall
                                   and resets it to the left wall.

Appropriate combination of       • There is at least one instance of each conditional nested under a
loops and conditional              repeat statement.
statements                       • There is at least one instance of two conditionals at the same level.

4.4    Inferring Program Scores
We infer students’ overall program scores in addition to their scores for each of the essential CS concepts by training a variety of regression models on the structural n-gram-encoded feature set. As
  Figure 3: AST generated from a sample program submitted for the bubble sort challenge and its hierarchical and ordinal n-gram
    encoding. (Left) An AST and its partial hierarchical unigrams, bigrams, and 3-grams marked by green, blue and purple ovals
    respectively on the left and the partial feature set generated from hierarchical n-gram encoding of the AST along with feature-
  level frequencies on the right. (Right) An AST and its ordinal unigrams, bigrams, and 3-grams marked by green, blue and purple
    ovals respectively on the left and the partial feature set generated from partial ordinal n-gram encoding of the AST along with
                                                   feature-level frequencies on the right.
our baseline model, we use linear regression. We use four additional regression models: lasso regression [32], ridge regression, support vector regression (SVR) [10], and Gaussian process (GP) regression [26]. Lasso and ridge regression are utilized since they can reduce overfitting and variance issues in comparison with linear regression. SVR and GP regression, on the other hand, are used since kernel methods can do well with datasets with a proportionally large number of features. More importantly, GP regression can handle the noise resulting from the subjective nature of human grading [6, 36]. To infer students’ overall program scores as well as their scores for “Design and implementation of effective and generalizable algorithms,” “Appropriate use of loop statements,” “Appropriate use of conditional statements,” and “Appropriate combination of loops and conditional statements,” we train each regression model utilizing the n-gram encoded feature set mentioned above, while predicting the scores of each core concept.

To infer students’ grades using the n-gram encoded feature set, we use the Python scikit-learn library [24] to perform linear, lasso, ridge, SVR, and GP regression. We first split our dataset into 80% training and 20% held-out test sets. We then use a 5-fold cross-validation approach to tune the hyperparameters of lasso, ridge, and SVR based on the training set. We also use the 5-fold cross-validation approach to identify the appropriate kernel for the GP regression model. The Gaussian process regression model uses an internal limited-memory BFGS approach to tune its other hyperparameters, such as length scale and noise level. After tuning the hyperparameters of each regression model, we train the models on the training set and evaluate them on the held-out test set. This process is repeated to infer each CS concept score separately. The results of applying each of the regression models to infer each of the CS concept scores are presented in Table 2.

4.4.1 Ridge Regression
We used the set [0.05, 0.1, 0.5, 1.0, 10] to tune the value for λ, the penalty coefficient, and found λ=10 to be the best value for inferring the “Overall grade” and the “Design and implementation of effective and generalizable algorithm” scores based on cross-validation. Furthermore, we found λ=0.5 to be the best value for the “Appropriate use of loop statements” score, and λ=1 the best value for the “Appropriate use of conditional statements” and “Appropriate combination of loops and conditional statements” scores.

4.4.2 Lasso Regression
We used the same set [0.05, 0.1, 0.5, 1.0, 10] as in ridge regression to tune the value for λ and found λ=0.05 to be the best value for all the inferred scores.

4.4.3 Support Vector Regression
For our regression task, we explored linear, polynomial, and radial basis function (RBF) kernels. For each kernel, we tuned the penalty parameter (C), epsilon, and the kernel coefficient (gamma). For polynomial kernels, we also tuned the kernel projection parameter (coef0) and the degree. Utilizing cross-validation, we found the polynomial kernel with a degree of four to be the best kernel for our dataset when inferring the “Overall Score,” “Appropriate use of loop statements,” and “Appropriate combination of loops and conditional statements” scores. The grid search returned C=1, coef0=10, epsilon=0.2, and gamma=0.0001 as the best parameters for this kernel. For inferring the “Appropriate use of conditional statements” score, we found the radial basis function kernel with C=100, epsilon=0.1, and gamma=0.001 to be the best configuration. Finally, we found the radial basis function kernel with C=100, epsilon=0.2, and gamma=0.001 to be the best configuration when inferring the “Design and implementation of effective and generalizable algorithm” score.

4.4.4 Gaussian Process Regression
We expect GP regression to outperform the other regression techniques due to its capability of handling noise and its suitability for our dataset. After conducting hyperparameter tuning over kernels such as radial basis function (RBF), rational quadratic, and Matérn kernels, we found RBF to perform the best on our dataset for all the inferred scores. Utilizing a limited-memory BFGS optimization technique, the GP regression model tuned its other hyperparameters, including the length scale and noise level, during the training process.
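The pipeline of Sections 4.3 and 4.4 (structural n-gram counts fed to a GP regressor) can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the nested-tuple AST format, the block names, the toy programs, and their scores are all hypothetical, and modeling the label noise with a `WhiteKernel` term added to the RBF kernel is one common scikit-learn choice that the paper does not confirm.

```python
from collections import Counter

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical AST format: (block_name, [child_nodes]).


def hierarchical_ngrams(node, max_n=4, path=()):
    """Count parent-to-child chains of 1..max_n blocks ending at every node."""
    name, children = node
    path = (path + (name,))[-max_n:]
    counts = Counter(("H",) + path[-n:] for n in range(1, len(path) + 1))
    for child in children:
        counts += hierarchical_ngrams(child, max_n, path)
    return counts


def ordinal_ngrams(node, max_n=3):
    """Count runs of 2..max_n consecutive sibling blocks.

    Unigrams are skipped here because the hierarchical encoding already
    contains them (only one copy of the unigrams is kept).
    """
    _, children = node
    names = [child[0] for child in children]
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(names) - n + 1):
            counts[("O",) + tuple(names[i:i + n])] += 1
    for child in children:
        counts += ordinal_ngrams(child, max_n)
    return counts


def featurize(asts, min_count=3):
    """Build a count matrix, keeping n-grams seen at least min_count times overall."""
    per_program = [hierarchical_ngrams(a) + ordinal_ngrams(a) for a in asts]
    totals = sum(per_program, Counter())
    vocab = sorted(k for k, v in totals.items() if v >= min_count)
    X = np.array([[c[k] for k in vocab] for c in per_program], dtype=float)
    return X, vocab


# Toy programs with made-up block names and made-up overall scores (0-22 scale).
full = ("program", [("repeat", [
    ("if_smaller", [("swap", [])]),
    ("if_wall", [("reset", [])]),
    ("move", []),
])])
loop_only = ("program", [("repeat", [("move", [])])])
empty = ("program", [])
asts = [full, loop_only, empty]
scores = np.array([22.0, 6.0, 0.0])

# min_count=1 only because the toy corpus is tiny; the paper uses a threshold of 3.
X, vocab = featurize(asts, min_count=1)

# RBF kernel plus an explicit noise term; both are fitted by the internal optimizer.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(X, scores)
mean, std = gpr.predict(X, return_std=True)  # predictive mean and standard deviation
```

In a real setting one such model would be fitted per rubric item on the 80% training split, and the returned standard deviation could be used to flag programs whose inferred score is uncertain.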
Table 2. Average predictive performance of regression models trained with the structural n-gram feature set.

Grade             Overall Grade             Design and implementation Appropriate use of        Appropriate use of        Appropriate combination
                                            of effective and          loop statements           conditional               of loops and conditional
                                            generalizable algorithm                                                       statements
Regression        MSE        R2             MSE        R2             MSE        R2             MSE        R2             MSE        R2
Linear            7.88E+10   -2.58E+9       1.8E+10    -3.9E+9        2.8E+9     -2.1E+9        2.43E+9    -4.9E+8        2.0E+9     -1.6E+9
Ridge             4.84       0.84           1.62       0.64           0.36       0.74           0.59       0.88           0.09       0.93
Lasso             5.67       0.81           2.38       0.49           0.88       0.35           0.73       0.85           0.48       0.62
SVR               3.74       0.88           1.56       0.66           0.57       0.57           0.49       0.9            0.08       0.93
Gaussian Process  1.67       0.94           1.02       0.78           0.18       0.86           0.28       0.94           0.02       0.98
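For readers less familiar with the two metrics reported in Table 2, the following minimal sketch shows how an 80-20 split, mean squared error, and R2 can be computed. The function names and the toy scores are hypothetical and chosen only for illustration; they are not taken from the study's data. The sketch also shows why R2 can be strongly negative, as in the Linear row: R2 drops below zero whenever a model's squared error exceeds that of always predicting the mean score.

```python
import random


def split_80_20(items, seed=0):
    """Shuffle a copy of items and return (train, test) in an 80-20 split."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out rubric scores and two sets of predictions.
y_true = [2.0, 4.0, 6.0, 8.0]
close_pred = [2.5, 3.5, 6.5, 7.5]          # small errors
wild_pred = [200.0, -400.0, 600.0, -800.0]  # unstable, unregularized-style errors

train, test = split_80_20(range(10))
print(len(train), len(test))
print(mean_squared_error(y_true, close_pred), r_squared(y_true, close_pred))
print(mean_squared_error(y_true, wild_pred), r_squared(y_true, wild_pred))
```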

5. DISCUSSION
Effective automated assessment of students’ programming efforts
has become increasingly important. This work investigates an
n-gram encoding approach to encode students’ programs into their
essential structural and semantic features. Utilizing the n-gram
encoding approach, we can extract structural information with
varying levels of granularity. Utilizing this feature set labeled by
the ECD-based designed rubric enables our models to learn
evidence from programs that are representative of students’
mastery of identified CS concepts.

After extracting an n-gram encoded feature set from students’
programs, we apply a variety of regression models to infer their
scores for each of the targeted CS concepts. We conduct an 80-20
split on our dataset to generate training and held-out test sets. We
train our models on the training set and evaluate the trained models
on the held-out test set. This process is repeated for each CS concept.
The results of our prediction demonstrate the effectiveness of the
n-gram encoded feature set in capturing important semantic and
structural information in students’ programs, as all other regression
models outperformed the linear regression model. GP regression
also outperformed the other baseline models in terms of both mean
squared error and R-squared across all prediction tasks. This is
expected, since GP regression is well-equipped to handle noise in
the dataset and is particularly appropriate for datasets with a large
number of features relative to the number of data points.

We utilized an evidence-centered assessment design (ECD)
approach to label the training dataset. ECD holds significant
promise for guiding educators in designing mindful assignments for
learners by focusing on key conceptual ideas rather than
surface-level features of the program. This means that an
ECD-derived rubric can provide granular information structured
around core CS concepts, which not only guides development of
robust automated assessment models but also provides immediate
formative data to instructors. Thus, as new problems and activities
are introduced into a course, the first-pass human scoring with the
rubric provides immediate actionable formative information while
also training automated assessment tools that can provide ongoing,
future adaptive support.

Though we show the application of this automated assessment
framework on one particular task, it can be generalized to assess
any well-structured programs as the feature representations are
readily scalable to other programming tasks. Furthermore, our
rubric design approach can be used as a guideline for rubric design
and assessment for non-expert CS teachers. A teacher dashboard
incorporating the automated assessment framework can further be
utilized to analyze and aggregate the results and inform teachers
about students’ learning and the quality of their instruction.

6. CONCLUSION AND FUTURE WORK
Effective scaffolding of programming efforts for novice
programmers requires accurate automated assessment of their
competency in each core CS concept. In this paper, we presented
an automated assessment framework for assessing programs’
algorithmic quality following a hypothesis-driven learning analytic
approach. We investigated a hierarchical, ordinal feature
representation method based on n-gram-encoded hierarchical and
ordinal coding constructs that extract two-dimensional structural
information from students’ programs, and investigated Gaussian
process regression to induce models that can accurately predict
students’ grades for individual CS concepts based on their
submitted programs. Evaluation results suggest that Gaussian
process regression models utilizing n-gram-encoded features that
extract salient semantic and structural information from programs
achieved the highest predictive performance with respect to mean
squared error and R-squared. These results suggest that Gaussian
process regression models are robust in dealing with noise that
underlies our human-annotated dataset.

In the future it will be important to investigate the potential for
utilizing a data-driven approach for devising a rubric based on
identified correct solutions. Furthermore, the effectiveness of the
n-gram encoded feature set can be further evaluated by performing
an automatic feature-selection process and comparing the results with
expert-selected features. Finally, it will be instructive to explore the
potential of the n-gram encoded feature set for creating an
unsupervised learning approach to accurately inferring students’
program scores without requiring labeled training data.

7. ACKNOWLEDGMENTS
This research was supported by the National Science Foundation
under Grant DRL-1640141. Any opinions, findings, and
conclusions expressed in this material are those of the authors and
do not necessarily reflect the views of the National Science
Foundation.

8. REFERENCES
[1]    Akram, B., Azizolsoltani, H., Min, W., Wiebe, E., Navied,
       A., Mott, B., Boyer, K., and Lester, J. 2020. Automated
       Assessment of Computer Science Competencies from
       Student Programs with Gaussian Process Regression. To
       appear in Proceedings of the 13th Conference on
       Educational Data Mining.
[2]    Akram, B., Min, W., Wiebe, E., Mott, B., Boyer, K., and
       Lester, J. 2018. Improving Stealth Assessment in
       Game-based Learning with LSTM-based Analytics. In
       Proceedings of the 11th International Conference on
       Educational Data Mining, 208–218.
[3]    Akram, B., Min, W., Wiebe, E., Navied, A., Mott, B.,
       Boyer, K., and Lester, J. 2020. A conceptual assessment
       framework for k-12 computer science rubric design. In
       Proceedings of the 51st ACM Technical Symposium on
       Computer Science Education, 1328–1328.
[4]    Ala-Mutka, K. 2005. A Survey of Automated Assessment
       Approaches for Programming Assignments. Computer
       Science Education 15, 2, 83–102.
[5]    Amershi, S., and Conati, C. 2009. Combining
       Unsupervised and Supervised Classification to Build User
       Models for Exploratory Learning Environments. Journal
       of Educational Data Mining, Article 1, 1, 18–71.
[6]    Bouchet, F., Harley, J., Trevors, G., and Azevedo, R.
       2013. Clustering and profiling students according to their
       interactions with an intelligent tutoring system fostering
       self-regulated learning. Journal of Educational Data
       Mining 05, 01, 104–146.
[7]    Cohen, J. 1960. A Coefficient of Agreement for Nominal
       Scales. Educational and Psychological Measurement 20,
       1, 37–46.
[8]    College Board. 2017. AP Computer Science Principles
       Including the Curriculum Framework. In AP Computer
       Science Principles: The Course. New York.
[9]    Cortes, C., and Vapnik, V. 1995. Support-vector
       networks. Machine Learning 20, 3, 273–297.
[10]   Cui, Y., Chu, M., and Chen, F. 2019. Analyzing Student
       Process Data in Game-Based Assessments with Bayesian
       Knowledge Tracing and Dynamic Bayesian Networks.
       Journal of Educational Data Mining 11, 01, 80–100.
[11]   Fields, D., Giang, M., and Kafai, Y. 2014. Programming
       in the wild: trends in youth computational participation in
       the online scratch community. In Proceedings of the 9th
       Workshop in Primary and Secondary Computing
       Education, ACM, 2–11.
[12]   Grover, S., and Basu, S. 2017. Measuring Student
       Learning in Introductory Block-Based Programming. In
       Proceedings of the 48th ACM SIGCSE Technical
       Symposium on Computer Science Education, 267–272.
[13]   Grover, S., Basu, S., Bienkowski, M., Eagle, M., Diana,
       N., and Stamper, J. 2017. A Framework for Using
       Hypothesis-Driven Approaches to Support Data-Driven
       Learning Analytics in Measuring Computational Thinking
       in Block-Based Programming Environments. ACM
       Transactions on Computing Education 17, 3, 1–25.
[14]   Hansen, A., Dwyer, H., Iveland, A., Talesfore, M.,
       Wright, L., Harlow, D., and Franklin, D. 2017. Assessing
       Children’s Understanding of the Work of Computer
       Scientists: The Draw-a-Computer-Scientist Test. In
       Proceedings of the 48th ACM SIGCSE Technical
       Symposium on Computer Science Education, 279–284.
[15]   Ihantola, P., Ahoniemi, T., Karavirta, V., and Seppälä, O.
       2010. Review of Recent Systems for Automatic
       Assessment of Programming Assignments. In
       Proceedings of the 10th Koli Calling International
       Conference on Computing Education Research, 86–93.
[16]   Kerr, D., and Chung, G. 2012. Identifying Key Features
       of Student Performance in Educational Video Games and
       Simulations through Cluster Analysis. Journal of
       Educational Data Mining 4, 1, 144–182.
[17]   Lajis, A., Baharudin, S., Kadir, D., Ralim, N., Nasir, H.,
       and Aziz, N. 2018. A Review of Techniques in Automatic
       Programming Assessment for Practical Skill Test. Journal
       of Telecommunication, Electronic and Computer
       Engineering 10, 2, 109–113.
[18]   Mao, Y., Lin, C., and Chi, M. 2018. Deep Learning vs.
       Bayesian Knowledge Tracing: Student Models for
       Interventions. Journal of Educational Data Mining
       10, 02, 28–54.
[19]   Meerbaum-Salant, O., Armoni, M., and Ben-Ari, M.
       2013. Learning computer science concepts with scratch.
       Computer Science Education 23, 3, 239–364.
[21]   Min, W., Frankosky, M., Mott, B., Rowe, P., Wiebe, E.,
       Boyer, E., and Lester, J. 2015. DeepStealth: Leveraging
       Deep Learning Models for Stealth Assessment in
       Game-based Learning Environments. In International
       Conference on Artificial Intelligence in Education,
       277–286.
[22]   Mislevy, R., Haertel, G., Riconscente, M., Rutstein, D.,
       and Ziker, C. 2017. Evidence-Centered Assessment
       Design. In Assessing Model-Based Reasoning Using
       Evidence-Centered Design. SpringerBriefs in Statistics,
[23]   Multon, K. 2010. Interrater reliability. Encyclopedia of
       Research Design. SAGE, New York, 626–628.
[24]   Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
       Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
       Weiss, R., Dubourg, V., and Vanderplas, J. 2011.
       Scikit-learn: Machine learning in Python. Journal of
       Machine Learning Research 12, 2825–2830.
[25]   Price, T., Dong, Y., and Lipovac, D. 2017. iSnap: Towards
       Intelligent Tutoring in Novice Programming
       Environments. In Proceedings of the 48th ACM SIGCSE
       Technical Symposium on Computer Science Education,
       ACM, 483–488.
[26]   Rasmussen, C. 2004. Gaussian Processes in machine
       learning. Lecture Notes in Computer Science (including
       subseries Lecture Notes in Artificial Intelligence and
       Lecture Notes in Bioinformatics), 63–71.
[27]   Rupp, A., Pearson, K., Sweet, S., Crawford, A., Levy, I.,
       Fay, D., Kunze, K., Cisco, M., Mislevy, R., and Pearson,
       J. 2012. Putting ECD into Practice: The Interplay of
       Theory and Data in Evidence Models within a Digital
       Learning Environment. Journal of Educational Data
       Mining 4, 1, 49–110.
[28]   Shamsi, F., and Elnagar, A. 2012. An Intelligent
       Assessment Tool for Students’ Java Submissions in
       Introductory Programming Courses. Journal of Intelligent
       Learning Systems and Applications 04, 01, 59–69.
[29]   Snow, E., Haertel, G., Fulkerson, D. and Feng, M. 2010.
       Leveraging evidence-centered assessment design in large-
       scale and formative assessment practices. In Proceedings
       of the 2010 Annual Meeting of the National Council on
       Measurement in Education (NCME).
[30]   Striewe, M., and Goedicke, M. 2014. A review of static
       analysis approaches for programming exercises. In
       Computer Assisted Assessment. Research into E-
       Assessment. Springer, 100–113.
[31]   Taherkhani, A., and Malmi, L. 2013. Beacon- and
       Schema-Based Method for Recognizing Algorithms from
       Students’ Source Code. Journal of Educational Data
       Mining 5, 2, 69–101.
[32]   Tibshirani, R. 1996. Regression Shrinkage and Selection
       Via the Lasso. Journal of the Royal Statistical Society:
       Series B (Methodological) 58, 1, 267–288.
[33]   Truong, N., Roe, P., and Bancroft, P. 2004. Static Analysis
       of Students’ Java Programs. In Proceedings of the 6th
       Australasian Conference on Computing Education, 317-
[34]   Wang, T., Su, X., Wang, Y., and Ma, P. 2007. Semantic
       similarity-based grading of student programs. Information
       and Software Technology 49, 2, 99–107.
[35]   Winne, P., and Baker, R. 2013. The Potentials of
       Educational Data Mining for Researching Metacognition,
       Motivation and Self-Regulated Learning. Journal of
       Educational Data Mining 5, 1, 1–8.
[36]   Zen, K., Iskandar, D., and Linang, O. 2011. Using Latent
       Semantic Analysis for automated grading programming
       assignments. In International Conference on Semantic
       Technology and Information Retrieval, 82–88.
[37]   2016. K-12 Computer Science Framework. Retrieved
       August 25, 2018 from http://www.k12cs.org.