A Data-Driven Approach to Automatically Assessing Concept-Level CS Competencies Based on Student Programs

Bita Akram1, Hamoon Azizolsoltani1, Wookhee Min1, Eric Wiebe1, Anam Navied1, Bradford Mott1, Kristy Elizabeth Boyer2, James Lester1
1North Carolina State University, Raleigh, North Carolina
2University of Florida, Gainesville, Florida
{bakram, hazizso, wmin, wiebe, anavied, bwmott, lester}@ncsu.edu, keboyer@ufl.edu

ABSTRACT
The rapid increase in demand for CS education has given rise to increased efforts to develop data-driven tools to support adaptive CS education. Automated assessment and personalized feedback are among the most important tools for facilitating effective learning experiences for novice students. An important first step in providing effective feedback tailored to individual students is assessing their areas of strength and weakness with regard to core CS concepts such as loops and conditionals. In this work, we propose a hypothesis-driven analytics approach to assessing students' competencies in core CS concepts at a fine-grained level. We first label programs obtained from middle grades students' interactions with a game-based CS learning environment featuring block-based programming, based on a rubric that was designed to assess students' competency in core CS concepts from their submitted programs. Then, we train a variety of regression models, including linear, ridge, lasso, and support vector regression models as well as Gaussian process regression models, to infer students' scores for each of the identified CS concepts. The evaluation results suggest that Gaussian process regression models often outperform other baseline models for predicting student competencies in core CS concepts with respect to mean squared error and adjusted coefficient of determination. Our approach shows significant potential to provide students with detailed, personalized feedback based on their inferred CS competency levels.

KEYWORDS
Automated Program Assessment, Concept-Level CS Assessment, Gaussian Process Regression, Evidence-Centered Assessment Design

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION
As programming has become a fundamental skill in the digital economy, the interest in learning how to program at early ages is rapidly growing [15,37]. However, the complexity of syntax in text-based programming has been found to be a barrier for novice learners [13,14]. To address this challenge, block-based programming environments have replaced textual syntax with visual and elaborative blocks that utilize descriptive text, color, and shape to facilitate programming for novice learners [13,25]. This is particularly beneficial for traditionally underrepresented groups in computer science [6]. Despite the elimination of the syntax barrier, effective and tailored scaffolding and feedback are still required to support students' mastery of the computer science (CS) concepts that are essential for programming. Providing students with effective scaffolding and feedback would significantly benefit from reliable assessments that can evaluate student competencies with respect to core CS concepts [12,20]. Effective assessment can inform adaptive pedagogical strategies such as offering hints and feedback and performing tailored problem selection. Automated assessments can bridge the gap between the growth in demand for CS education and the limited supply of qualified teachers.

While research on automated assessment of student-generated programs is gaining momentum, limited previous work has yielded methods to infer students' mastery of fine-grained CS concepts exercised in a particular computational problem. Previous work has focused on predicting an overall score to represent students' general level of mastery in programming [17,19]. However, identifying students' strengths and weaknesses on specific CS concepts could enable instructors to provide students with adaptive scaffolding and the tailored practice needed to master those concepts. In addition, fine-grained assessment of CS competencies can inform intervention strategies for intelligent learning environments to perform student-adaptive hint and feedback generation as well as problem selection. Furthermore, using this information in an open learner model could enable students to focus on areas in which they need more practice by monitoring mastery of CS concepts [7, 35].

We follow a hypothesis-driven learning analytics approach [14] based on Evidence-Centered Design (ECD) [22] to identify core CS concepts highlighted in a learning environment and to assess students' competencies in relation to each of the target concepts. We explore this in the context of a bubble-sort challenge within the ENGAGE game-based learning environment (Figure 1) that requires implementing a program using a block-based programming interface. Based on the hypothesis-driven learning analytics approach, we first identify the CS concepts that are targeted by this activity. We also collect students' submitted solutions to this challenge in the form of snapshots of their submitted programs. Content area experts then use this information to devise a rubric that can identify students' mastery of targeted CS concepts based on evidence captured from their program artifacts. Examples of CS concepts and practices for the bubble sort challenge are developing appropriate algorithms and programs, and appropriate use of computer science constructs such as loops and conditionals.

Figure 1. ENGAGE game-based learning environment: students write a program to filter a data set and loop over it.

We use the rubric to label our training dataset. We further extract structural and semantic information from program snapshots by encoding them as structural n-grams following the approach in [1]. A variety of regression models including lasso, ridge, support vector regression, and Gaussian process regression (GPR) models are applied to the generated feature set to infer students' competencies for their overall grade and for each of the identified target CS concepts. We hypothesize that GPR models are particularly suitable for this type of inference task as they are capable of handling the noise resulting from the subjective process of grading programs. The results demonstrate the effectiveness of these models for predicting students' CS competencies.
2. RELATED WORK
Two approaches to automatic program assessment can be distinguished: dynamic assessment and static assessment [5, 16]. Dynamic assessment is used to assess the correctness of completed programs using pre-defined test data [16, 18, 31]. Static assessment, on the other hand, can assess partial programs for partial correctness. To perform this latter form of assessment, an important step is transforming the program into an intermediate representation such as an abstract syntax tree, control flow graph, or program dependence graph. The intermediate representation is then evaluated to determine its degree of correctness, efficiency, and quality. Since all programming languages, including block-based programming languages, can be represented using the same intermediate representations, static assessment techniques are syntax-free and can be adapted to assess any programming language.

In static assessments, correctness is usually assessed through character analysis, string analysis, syntax analysis, and semantic analysis. Quality is assessed by software metrics such as the number of lines of code, the number of variable statements, and the number of expressions. For example, work by Wang and colleagues presented a semantic similarity-based approach to assess the correctness of a C program by comparing it against a correct program model [34]. In this work, they first reduced the state space of programs by conducting a set of program standardizations including expression, control structure, and function invocation standardization. Then, they calculated a similarity factor based on size, structure, and statement similarity subfactors weighted by grading criteria.

Most work on static assessment utilizes a variety of similarity measurements to calculate the relative correctness of a program in reference to other scored programs [30, 33]. However, the approaches described above typically yield an overall score rather than a fine-grained analysis of student competencies on specific CS concepts, skills, and knowledge. In an educational context, this detailed diagnostic information is essential for providing students with the targeted scaffolding and support they need. In this paper, we propose an automated assessment framework that can provide students and their instructors with an automated assessment tool that is both detailed and interpretable.

3. LEARNING ENVIRONMENT AND DATASET
In this study, we collect data from middle-grade students' interactions with the ENGAGE game-based learning environment, a computational thinking (CT) focused educational game for middle school students (ages 11-13) [2]. The CS content of the ENGAGE game is based on the AP CS Principles curriculum [9]. Students learned CS competencies ranging from abstraction and algorithmic thinking to computational problem solving and programming. The computational challenges within the game were designed to prepare students for computer science work in high school, and to promote positive attitudes toward computer science. The game features an underwater research station that has lost connection with the outside world; students are sent as computer science specialists to investigate the issue (Figure 2, Left) [21]. To successfully complete the game, students need to move around the station and solve different computational thinking problems with block-based programming.

Figure 2. ENGAGE game-based learning environment. (Left) The bubble sort task in the game-based learning environment. (Right) Program for the bubble sort task: the read-only code for opening the door and an example of a correct implementation of the bubble sort written by a student.

The focus of this work is on a particular CT challenge in which students need to implement a bubble sort algorithm using block-based programming (Figure 2, Right) to escape a room. Students implement an algorithm to sort six randomly positioned containers within a containment device. Once the containers are sorted, students can open the door and escape the room. To write the bubble-sort algorithm, students have access to a limited number of necessary blocks: a repeat block that repeats every nested block a certain number of times specified by the student; a conditional block that checks whether the right container is smaller than the left container; a conditional block that checks whether the cart has hit the right wall; a swap containers block that swaps a container with its adjacent right container; a move block that moves the cart one position toward the right; and a reset block that brings the cart to the leftmost position (see the sketch at the end of this section).

On average, students played the game over the course of two weeks. As students interacted with the game, all of their interactions were logged, such as dragging programming blocks to write a program and executing programs. For this study, we collected data from middle grade students' interactions with the bubble sort challenge. Data was collected from five classrooms in three schools in the United States. The data used for this study is from 69 consented students.
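To make the challenge concrete, the following is a minimal Python sketch of the logic that a correct block-based solution expresses with the available blocks. It is not the ENGAGE block language itself: the function name, the list representation of the containers, and the repeat count of 30 (five full passes over six containers) are illustrative assumptions.

```python
# A sketch of the cart-based bubble sort that the blocks implement: on each
# iteration the program may swap two adjacent containers, then either moves
# the cart one step right or resets it to the leftmost position.
def bubble_sort_with_blocks(containers, repeat_count=30):
    cart = 0                                     # cart starts at the leftmost position
    for _ in range(repeat_count):                # "repeat" block with a student-chosen count
        if cart < len(containers) - 1 and containers[cart + 1] < containers[cart]:
            # conditional block: the right container is smaller than the left one -> swap
            containers[cart], containers[cart + 1] = containers[cart + 1], containers[cart]
        if cart == len(containers) - 1:
            cart = 0                             # conditional block: cart hit the right wall -> reset
        else:
            cart += 1                            # move block: one position toward the right
    return containers

print(bubble_sort_with_blocks([4, 1, 6, 3, 5, 2]))   # -> [1, 2, 3, 4, 5, 6]
```

Each full pass over six containers takes six iterations (five comparisons plus a reset), so a repeat count of 30 covers the five passes that a worst-case input requires.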
4. METHOD
A supervised learning approach is utilized to infer students' competency for each of the CS concepts identified for the bubble-sort algorithm task. To infer students' grades, we first label their program snapshots utilizing the rubric presented in Table 1. We then transform students' program snapshots to a feature set utilizing a novel n-gram encoding approach following [1]. Finally, we infer students' scores by applying regression models to the structural n-gram-based feature set.

In comparison to [1], our work focuses on assessing students' mastery of identified, individual CS concepts underlying the bubble sort challenge. By utilizing this assessment framework, we can train separate regression models on the n-gram feature set to assess student programs with respect to each individual CS concept. In this study, we utilize a variety of regression models including linear, lasso, ridge, support vector regression (SVR), and Gaussian process (GP) regression models to predict the algorithmic quality of students' programs.

4.1 Rubric Design
Students' programs are labeled utilizing a rubric that is devised following an evidence-centered assessment design (ECD) approach [27]. An important first step in ECD is domain modeling, where relevant CS concepts are identified through the collaborative work of domain experts and teachers [11, 22]. The CS concepts are then used to develop the specifications of an assessment of the CS concepts. The conceptual assessment framework consists of the following: 1) the student model, which represents what students know or can do; 2) the evidence model, which contains evidence that drives the student model; and 3) the task model, which contains tasks, interactions with which can generate evidence [22, 29]. Evidence is derived from students' actions during the learning tasks to predict their mastery of CS concepts. An important requirement is to match evidence derived from student programs to proficiency in each CS concept covered in the assessment. In this work, the student model represents students' knowledge of particular CS concepts, the evidence model is based on evidence rules that extract the program structures representing their knowledge of each identified CS concept, and the task model is the bubble sort challenge in the ENGAGE game-based learning environment. Following this approach, we design evidence rules specific to the task at hand to provide assessment arguments for proficiency in the CS concepts in our student model. Table 1 shows the rubric for assessing the CS concepts identified through the domain modeling phase [3, 4]; a sketch of how such evidence rules can be operationalized follows the table.

Table 1. Assessment items and detailed rubrics for each item.

CS Concepts and Practices — Detailed Rubric

Design and implementation of effective and generalizable algorithms:
• The program contains all necessary code elements.
• The code elements have the correct order and hierarchy (Effectiveness).
• The program does not contain redundant code elements that falsify the logic of the algorithm (Conciseness).

Appropriate use of loop statements:
• The repeat block is present.
• The iteration value is set to a positive number.
• It encompasses at least one block.

Appropriate use of conditional statements:
• Both necessary conditional statements are used.
• A conditional statement checks the size of two adjacent containers and swaps them if they are not ordered properly.
• A conditional statement checks if the arm has reached the right wall and resets it to the left wall.

Appropriate combination of loops and conditional statements:
• There is at least one instance of each conditional nested under a repeat statement.
• There is at least one instance of two conditionals at the same level.
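The sketch below illustrates, under stated assumptions, how evidence rules of the kind listed in Table 1 might be expressed as checks over a simplified AST. It is not the grader used in this work: the dict-based tree format and block-type names such as "repeat", "if_right_smaller", and "if_at_right_wall" are hypothetical stand-ins for the environment's real identifiers.

```python
# Hedged sketch of rubric evidence rules as boolean checks over a simplified AST,
# where each node is a dict with a "type", optional "times", and "children".

def walk(node):
    """Yield a node and all of its descendants."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)

def loop_evidence(ast):
    """'Appropriate use of loop statements': a repeat block with a positive
    iteration value that encompasses at least one nested block."""
    return any(n["type"] == "repeat" and n.get("times", 0) > 0 and n.get("children")
               for n in walk(ast))

def conditional_evidence(ast):
    """'Appropriate use of conditional statements': both required conditionals appear."""
    present = {n["type"] for n in walk(ast)}
    return {"if_right_smaller", "if_at_right_wall"} <= present

def combination_evidence(ast):
    """'Appropriate combination of loops and conditionals': a conditional nested under a repeat."""
    for n in walk(ast):
        if n["type"] == "repeat":
            nested = {d["type"] for d in walk(n)} - {"repeat"}
            if nested & {"if_right_smaller", "if_at_right_wall"}:
                return True
    return False
```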
4.2 Data Annotation
Our training dataset contains 1,570 programs submitted by 69 students when solving the bubble sort challenge. The algorithmic "Effectiveness" and "Conciseness" scores are two mutually exclusive metrics designed to capture core qualities of programs. For example, a program that contains all the necessary coding constructs to receive full points for "Appropriate use of conditional statements" might contain redundant copies of the same coding constructs that interfere with the correctness of the algorithm. This deficiency is captured in the "Conciseness" score. A similar program might have the wrong ordering of the coding constructs, which negatively affects the correctness of the algorithm. This is captured by the "Effectiveness" score. In this rubric, the range of possible scores for "Design and implementation of effective and generalizable algorithms" is 0 to 10. Similarly, this range is 0 to 3 for "Appropriate use of loop statements," 0 to 6 for "Appropriate use of conditional statements," and 0 to 3 for "Appropriate combination of loops and conditional statements." The overall score (overall algorithmic quality score) ranges from 0 to 22.

Two annotators with CS backgrounds annotated 20% of the submissions for algorithmic effectiveness and conciseness scores. Using Cohen's kappa [8], an inter-rater agreement of 0.848 for "Effectiveness" and 0.865 for "Conciseness" was achieved (a minimal sketch of this computation is shown below). The two annotators discussed their disagreements, and one annotator tagged the remaining 80% of the dataset. These annotations serve as the ground truth for our data corpus. It is important to note that the annotation process introduces noise into the training dataset [23]. This is because different scorers may have different perceptions of a program's algorithmic "Effectiveness" and "Conciseness." As a result, the dataset is inherently noisy, which must be taken into account when designing the models for the automated assessment framework. To handle this uncertainty, we adopt a Gaussian process regression model that returns a distribution for the inferred values, including a mean with a standard deviation.
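The agreement statistic reported above can be computed directly with scikit-learn. The two score arrays below are placeholder data, not the study's actual annotations.

```python
# Minimal sketch of the inter-rater agreement computation using Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

rater_a_effectiveness = [3, 0, 2, 3, 1, 2]   # hypothetical scores from annotator A
rater_b_effectiveness = [3, 0, 2, 2, 1, 2]   # hypothetical scores from annotator B

kappa = cohen_kappa_score(rater_a_effectiveness, rater_b_effectiveness)
print(f"Cohen's kappa (effectiveness): {kappa:.3f}")
```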
4.3 Feature Engineering
We use abstract syntax trees (ASTs) as the intermediate representation for our automated assessment task. After transforming students' program snapshots to their corresponding ASTs [28], we encode them as structural n-grams to extract features that are representative of the semantic information in students' programs, following previous work [1]. Hierarchical and ordinal n-grams capture two important structures in an AST: the parent-child relationship between blocks is encoded in hierarchical structures, and the placement order of blocks is encoded in ordinal structures. To enable the proposed automated assessment to assign partial scores to incomplete solutions, we extract n-grams with varying lengths of n to capture the most fine-grained structural information present in an AST.

An AST generated from a sample program is shown in Figure 3, together with a partial hierarchical (left) and ordinal (right) n-gram encoding. In Figure 3, each colored oval shows the n-gram encoding for a specific n: n-grams of size one are marked with green ovals, n-grams of size two with blue ovals, and n-grams of size three with purple ovals. The frequency values for each n-gram-encoded feature are shown beside the AST. All other n-gram feature values are zero since they do not occur in this AST. We then merge the two feature sets to build the final feature set containing both the hierarchical and ordinal n-gram encodings of each program. Note that unigrams are repeated in both the hierarchical and ordinal n-gram encodings of the ASTs, and thus only one copy of the unigram features is used in the final feature set. N-grams with n greater than one that occur in both the hierarchical and ordinal encodings represent different structures in the AST, and thus both are preserved. Preliminary explorations revealed that including sequences of length larger than 4 for hierarchical n-grams and 3 for ordinal n-grams exponentially increases the sparsity of the dataset. To address the sparsity issue, we capped the n-gram size at 4 for the hierarchical n-gram encoding and at 3 for the ordinal n-gram encoding. The final feature set consists of sequences of length one (i.e., unigrams) up to length four for hierarchical encodings (i.e., 4-grams) and length three for ordinal encodings (i.e., 3-grams) that are repeated at least three times throughout the dataset (again to address the data sparsity issue), resulting in 184 distinct features. A sketch of this encoding is given after the figure caption below.

Figure 3. AST generated from a sample program submitted for the bubble sort challenge and its hierarchical and ordinal n-gram encoding. (Left) An AST and its partial hierarchical unigrams, bigrams, and 3-grams marked by green, blue, and purple ovals respectively, alongside the partial feature set generated from the hierarchical n-gram encoding of the AST with feature-level frequencies. (Right) The same AST and its ordinal unigrams, bigrams, and 3-grams marked by green, blue, and purple ovals respectively, alongside the partial feature set generated from the ordinal n-gram encoding of the AST with feature-level frequencies.
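The following is a hedged sketch of the structural n-gram encoding described above, assuming the same simplified dict-based AST and hypothetical block names used in the earlier sketches. Hierarchical n-grams follow parent-child chains (capped at length 4) and ordinal n-grams follow the left-to-right order of sibling blocks (capped at length 3); duplicate unigrams from the two encodings are kept only once, as in the paper.

```python
# Sketch of hierarchical and ordinal n-gram extraction from a simplified AST.
from collections import Counter

def hierarchical_ngrams(node, max_n=4, prefix=()):
    """N-grams along parent-child chains, ending at each node."""
    path = prefix + (node["type"],)
    grams = [path[i:] for i in range(len(path))]          # all suffixes, length <= max_n
    for child in node.get("children", []):
        grams += hierarchical_ngrams(child, max_n, path[-(max_n - 1):])
    return grams

def ordinal_ngrams(node, max_n=3):
    """N-grams over the left-to-right order of sibling blocks."""
    grams = []
    siblings = [c["type"] for c in node.get("children", [])]
    for n in range(1, max_n + 1):
        grams += [tuple(siblings[i:i + n]) for i in range(len(siblings) - n + 1)]
    for child in node.get("children", []):
        grams += ordinal_ngrams(child, max_n)
    return grams

ast = {"type": "program", "children": [
    {"type": "repeat", "times": 30, "children": [
        {"type": "if_right_smaller", "children": [{"type": "swap", "children": []}]},
        {"type": "move", "children": []}]}]}

features = Counter(hierarchical_ngrams(ast))
features.update(g for g in ordinal_ngrams(ast) if len(g) > 1)   # unigrams counted once
print(features)
```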
4.4 Inferring Program Scores
We infer students' overall program scores, in addition to their scores for each of the essential CS concepts, by training a variety of regression models on the structural n-gram-encoded feature set. As our baseline model, we use linear regression. We use four additional regression models: lasso regression [32], ridge regression, support vector regression (SVR) [10], and Gaussian process (GP) regression [26]. Lasso and ridge regression are utilized since they can reduce overfitting and variance issues in comparison with linear regression. SVR and GP regression, on the other hand, are used since kernel methods can perform well on datasets with a proportionally large number of features. More importantly, GP regression can handle the noise resulting from the subjective nature of human grading [6, 36]. To infer students' overall program scores as well as their scores for "Design and implementation of effective and generalizable algorithms," "Appropriate use of loop statements," "Appropriate use of conditional statements," and "Appropriate combination of loops and conditional statements," we train each regression model on the n-gram encoded feature set described above while predicting the score of each core concept.

To infer students' grades using the n-gram encoded feature set, we use the Python scikit-learn library [24] to perform linear, lasso, ridge, SVR, and GP regression. We first split our dataset into an 80% training set and a 20% held-out test set. We then use a 5-fold cross-validation approach to tune the hyperparameters of the lasso, ridge, and SVR models on the training set. We also use the 5-fold cross-validation approach to identify the appropriate kernel for the GP regression model. The Gaussian process regression model uses an internal limited-memory BFGS approach to tune its other hyperparameters such as length scale and noise level. After tuning the hyperparameters of each regression model, we train the models on the training set and evaluate them on the held-out test set. This process is repeated to infer each CS concept score separately. The results of applying each of the regression models to infer each of the CS concept scores are presented in Table 2.

4.4.1 Ridge Regression
We used the set [0.05, 0.1, 0.5, 1.0, 10] to tune the value of the penalty coefficient λ and found λ=10 to be the best value for inferring the "Overall Grade" and "Design and implementation of effective and generalizable algorithms" scores based on cross-validation. Furthermore, we found λ=0.5 to be the best value for the "Appropriate use of loop statements" score, and λ=1 the best value for the "Appropriate use of conditional statements" and "Appropriate combination of loops and conditional statements" scores.

4.4.2 Lasso Regression
We used the same set [0.05, 0.1, 0.5, 1.0, 10] as in ridge regression to tune the value of λ and found λ=0.05 to be the best value for all the inferred scores.

4.4.3 Support Vector Regression
For our regression task, we explored linear, polynomial, and radial basis function (RBF) kernels. For each kernel, we tuned the penalty parameter (C), epsilon, and kernel coefficient (gamma) hyperparameters. For polynomial kernels, we also tuned the kernel projection parameter (coef0) and the degree hyperparameters. Utilizing cross-validation, we found the polynomial kernel with a degree of four to be the best kernel for our dataset when inferring the "Overall Grade," "Appropriate use of loop statements," and "Appropriate combination of loops and conditional statements" scores. The grid search returned C=1, coef0=10, epsilon=0.2, and gamma=0.0001 as the best parameters for this kernel. For inferring the "Appropriate use of conditional statements" score, we found the radial basis function kernel with C=100, epsilon=0.1, and gamma=0.001 to be the best parameter values. Finally, we found the radial basis function kernel with C=100, epsilon=0.2, and gamma=0.001 to be the best parameter values when inferring the "Design and implementation of effective and generalizable algorithms" score. A sketch of this tuning procedure is given below.
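The following is a hedged sketch of the model-fitting procedure described above: an 80/20 split, 5-fold grid search over the stated hyperparameter grids (the λ values map to scikit-learn's alpha parameter), and held-out evaluation. X and y are random placeholders for the 184-dimensional n-gram feature matrix and one of the concept scores.

```python
# Sketch of the cross-validated tuning and held-out evaluation for ridge, lasso, and SVR.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge, Lasso
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

X, y = np.random.rand(200, 184), np.random.rand(200) * 22   # placeholder data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

searches = {
    "ridge": GridSearchCV(Ridge(), {"alpha": [0.05, 0.1, 0.5, 1.0, 10]}, cv=5),
    "lasso": GridSearchCV(Lasso(), {"alpha": [0.05, 0.1, 0.5, 1.0, 10]}, cv=5),
    "svr": GridSearchCV(SVR(), {"kernel": ["rbf", "poly"], "degree": [2, 3, 4],
                                "C": [1, 100], "epsilon": [0.1, 0.2],
                                "gamma": [1e-4, 1e-3], "coef0": [0, 10]}, cv=5),
}
for name, search in searches.items():
    search.fit(X_train, y_train)                  # 5-fold grid search on the training set
    pred = search.predict(X_test)                 # evaluate the best model on held-out data
    print(name, search.best_params_,
          mean_squared_error(y_test, pred), r2_score(y_test, pred))
```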
4.4.4 Gaussian Process Regression
We expect GP regression to outperform the other regression techniques due to its capability of handling noise and its suitability for our dataset. After conducting hyperparameter tuning over kernels including the radial basis function (RBF), rational quadratic, and Matern kernels, we found the RBF kernel to perform the best on our dataset for all the inferred scores. Utilizing a limited-memory BFGS optimization technique, the GP regression model tuned its other hyperparameters, including the length scale and noise level, during the training process. A sketch of this model is given below, followed by the results in Table 2.
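The sketch below shows one way such a Gaussian process regressor can be set up in scikit-learn, under stated assumptions: an RBF kernel plus a WhiteKernel noise term whose length scale and noise level are optimized internally by L-BFGS-B. X_train, y_train, and X_test are random placeholders for the n-gram features and scores, not the study's data.

```python
# Sketch of a noise-aware Gaussian process regressor for predicting concept scores.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

X_train, y_train = np.random.rand(100, 184), np.random.rand(100) * 22   # placeholders
X_test = np.random.rand(20, 184)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(X_train, y_train)

# The GP returns a full predictive distribution, so each predicted score comes
# with an uncertainty estimate that reflects noise in the human-assigned labels.
mean, std = gpr.predict(X_test, return_std=True)
```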
Table 2. Average predictive performance of regression models trained with the structural n-gram feature set. Each cell reports MSE / R2 for the corresponding score.

Regression model | Overall Grade | Design and implementation of effective and generalizable algorithms | Appropriate use of loop statements | Appropriate use of conditional statements | Appropriate combination of loops and conditional statements
Linear | 7.88E+10 / -2.58E+9 | 1.8E+10 / -3.9E+9 | 2.8E+9 / -2.1E+9 | 2.43E+9 / -4.9E+8 | 2.0E+9 / -1.6E+9
Ridge | 4.84 / 0.84 | 1.62 / 0.64 | 0.36 / 0.74 | 0.59 / 0.88 | 0.09 / 0.93
Lasso | 5.67 / 0.81 | 2.38 / 0.49 | 0.88 / 0.35 | 0.73 / 0.85 | 0.48 / 0.62
SVR | 3.74 / 0.88 | 1.56 / 0.66 | 0.57 / 0.57 | 0.49 / 0.90 | 0.08 / 0.93
Gaussian Process | 1.67 / 0.94 | 1.02 / 0.78 | 0.18 / 0.86 | 0.28 / 0.94 | 0.02 / 0.98

5. DISCUSSION
Effective automated assessment of students' programming efforts has become increasingly important. This work investigates an n-gram encoding approach that encodes students' programs into their essential structural and semantic features. Utilizing the n-gram encoding approach, we can extract structural information with varying levels of granularity. Utilizing this feature set, labeled with the ECD-designed rubric, enables our models to learn evidence from programs that is representative of students' mastery of the identified CS concepts.

After extracting an n-gram encoded feature set from students' programs, we apply a variety of regression models to infer their scores for each of the targeted CS concepts. We conduct an 80-20 split on our dataset to generate training and held-out test sets, train our models on the training set, and evaluate the trained models on the held-out test set. This process is repeated for each CS concept. The results of our prediction demonstrate the effectiveness of the n-gram encoded feature set in capturing important semantic and structural information in students' programs, as all regression models outperformed the linear regression baseline. As expected, GP regression also outperformed the other models in terms of both mean squared error and R-squared across all prediction tasks. This is because GP regression is well-equipped to handle noise in the dataset and is particularly appropriate for datasets with a large number of features relative to the number of data points.

We utilized an evidence-centered assessment design (ECD) approach to label the training dataset. ECD holds significant promise for guiding educators in designing mindful assignments for learners by focusing on key conceptual ideas rather than surface-level features of the program. This means that an ECD-derived rubric can provide granular information structured around core CS concepts, which guides the development of robust automated assessment models but also provides immediate formative data to instructors. Thus, as new problems and activities are introduced into a course, the first-pass human scoring with the rubric provides immediate, actionable formative information while also training automated assessment tools that can provide ongoing adaptive support in the future.

Though we show the application of this automated assessment framework on one particular task, it can be generalized to assess any well-structured program, as the feature representations are readily scalable to other programming tasks. Furthermore, our rubric design approach can be used as a guideline for rubric design and assessment by non-expert CS teachers. A teacher dashboard incorporating the automated assessment framework can further be utilized to analyze and aggregate the results and inform teachers about students' learning and the quality of their instruction.

6. CONCLUSION AND FUTURE WORK
Effective scaffolding of programming efforts for novice programmers requires accurate automated assessment of their competency in each core CS concept. In this paper, we presented an automated assessment framework for assessing programs' algorithmic quality following a hypothesis-driven learning analytics approach. We investigated a feature representation method based on n-gram-encoded hierarchical and ordinal coding constructs that extracts two-dimensional structural information from students' programs, and we investigated Gaussian process regression to induce models that can accurately predict students' grades for individual CS concepts based on their submitted programs. Evaluation results suggest that Gaussian process regression models utilizing n-gram-encoded features, which extract salient semantic and structural information from programs, achieved the highest predictive performance with respect to mean squared error and R-squared. These results suggest that Gaussian process regression models are robust in dealing with the noise that underlies our human-annotated dataset.

In the future, it will be important to investigate the potential of a data-driven approach for devising a rubric based on identified correct solutions. Furthermore, the effectiveness of the n-gram encoded feature set can be further evaluated by performing an automatic feature-selection process and comparing the results with expert-selected features. Finally, it will be instructive to explore the potential of the n-gram encoded features for creating an unsupervised learning approach that accurately infers students' program scores without requiring labeled training data.
7. ACKNOWLEDGMENTS
This research was supported by the National Science Foundation under Grant DRL-1640141. Any opinions, findings, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

8. REFERENCES
[1] Akram, B., Azizolsoltani, H., Min, W., Wiebe, E., Navied, A., Mott, B., Boyer, K., and Lester, J. 2020. Automated Assessment of Computer Science Competencies from Student Programs with Gaussian Process Regression. To appear in Proceedings of the 13th International Conference on Educational Data Mining.
[2] Akram, B., Min, W., Wiebe, E., Mott, B., Boyer, K., and Lester, J. 2018. Improving Stealth Assessment in Game-based Learning with LSTM-based Analytics. In Proceedings of the 11th International Conference on Educational Data Mining, 208–218.
[3] Akram, B., Min, W., Wiebe, E., Navied, A., Mott, B., Boyer, K., and Lester, J. 2020. A Conceptual Assessment Framework for K-12 Computer Science Rubric Design. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, 1328–1328.
[4] Ala-Mutka, K. 2005. A Survey of Automated Assessment Approaches for Programming Assignments. Computer Science Education 15, 2, 83–102.
[5] Amershi, S., and Conati, C. 2009. Combining Unsupervised and Supervised Classification to Build User Models for Exploratory Learning Environments. Journal of Educational Data Mining 1, 1, 18–71.
[6] Bouchet, F., Harley, J., Trevors, G., and Azevedo, R. 2013. Clustering and Profiling Students According to Their Interactions with an Intelligent Tutoring System Fostering Self-Regulated Learning. Journal of Educational Data Mining 5, 1, 104–146.
[7] Cohen, J. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20, 1, 37–46.
[8] College Board. 2017. AP Computer Science Principles Including the Curriculum Framework. In AP Computer Science Principles: The Course. New York.
[9] Cortes, C., and Vapnik, V. 1995. Support-Vector Networks. Machine Learning 20, 3, 273–297.
[10] Cui, Y., Chu, M., and Chen, F. 2019. Analyzing Student Process Data in Game-Based Assessments with Bayesian Knowledge Tracing and Dynamic Bayesian Networks. Journal of Educational Data Mining 11, 1, 80–100.
[11] Fields, D., Giang, M., and Kafai, Y. 2014. Programming in the Wild: Trends in Youth Computational Participation in the Online Scratch Community. In Proceedings of the 9th Workshop in Primary and Secondary Computing Education, ACM, 2–11.
[12] Grover, S., and Basu, S. 2017. Measuring Student Learning in Introductory Block-Based Programming. In Proceedings of the 48th ACM SIGCSE Technical Symposium on Computer Science Education, 267–272.
[13] Grover, S., Basu, S., Bienkowski, M., Eagle, M., Diana, N., and Stamper, J. 2017. A Framework for Using Hypothesis-Driven Approaches to Support Data-Driven Learning Analytics in Measuring Computational Thinking in Block-Based Programming Environments. ACM Transactions on Computing Education 17, 3, 1–25.
[14] Hansen, A., Dwyer, H., Iveland, A., Talesfore, M., Wright, L., Harlow, D., and Franklin, D. 2017. Assessing Children's Understanding of the Work of Computer Scientists: The Draw-a-Computer-Scientist Test. In Proceedings of the 48th ACM SIGCSE Technical Symposium on Computer Science Education, 279–284.
[15] Ihantola, P., Ahoniemi, T., Karavirta, V., and Seppälä, O. 2010. Review of Recent Systems for Automatic Assessment of Programming Assignments. In Proceedings of the 10th Koli Calling International Conference on Computing Education Research, 86–93.
[16] Kerr, D., and Chung, G. 2012. Identifying Key Features of Student Performance in Educational Video Games and Simulations through Cluster Analysis. Journal of Educational Data Mining 4, 1, 144–182.
[17] Lajis, A., Baharudin, S., Kadir, D., Ralim, N., Nasir, H., and Aziz, N. 2018. A Review of Techniques in Automatic Programming Assessment for Practical Skill Test. Journal of Telecommunication, Electronic and Computer Engineering 10, 2, 109–113.
[18] Mao, Y., Lin, C., and Chi, M. 2018. Deep Learning vs. Bayesian Knowledge Tracing: Student Models for Interventions. Journal of Educational Data Mining 10, 2, 28–54.
[19] Meerbaum-Salant, O., Armoni, M., and Ben-Ari, M. 2013. Learning Computer Science Concepts with Scratch. Computer Science Education 23, 3, 239–264.
[21] Min, W., Frankosky, M., Mott, B., Rowe, J., Wiebe, E., Boyer, K., and Lester, J. 2015. DeepStealth: Leveraging Deep Learning Models for Stealth Assessment in Game-Based Learning Environments. In International Conference on Artificial Intelligence in Education, 277–286.
[22] Mislevy, R., Haertel, G., Riconscente, M., Rutstein, D., and Ziker, C. 2017. Evidence-Centered Assessment Design. In Assessing Model-Based Reasoning Using Evidence-Centered Design. SpringerBriefs in Statistics, 19–24.
[23] Multon, K. 2010. Interrater Reliability. In Encyclopedia of Research Design. SAGE, New York, 626–628.
[24] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., and Vanderplas, J. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830.
[25] Price, T., Dong, Y., and Lipovac, D. 2017. iSnap: Towards Intelligent Tutoring in Novice Programming Environments. In Proceedings of the 48th ACM SIGCSE Technical Symposium on Computer Science Education, ACM, 483–488.
[26] Rasmussen, C. 2004. Gaussian Processes in Machine Learning. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 63–71.
[27] Rupp, A., Pearson, K., Sweet, S., Crawford, A., Levy, I., Fay, D., Kunze, K., Cisco, M., Mislevy, R., and Pearson, J. 2012. Putting ECD into Practice: The Interplay of Theory and Data in Evidence Models within a Digital Learning Environment. Journal of Educational Data Mining 4, 1, 49–110.
[28] Shamsi, F., and Elnagar, A. 2012. An Intelligent Assessment Tool for Students' Java Submissions in Introductory Programming Courses. Journal of Intelligent Learning Systems and Applications 4, 1, 59–69.
[29] Snow, E., Haertel, G., Fulkerson, D., and Feng, M. 2010. Leveraging Evidence-Centered Assessment Design in Large-Scale and Formative Assessment Practices. In Proceedings of the 2010 Annual Meeting of the National Council on Measurement in Education (NCME).
[30] Striewe, M., and Goedicke, M. 2014. A Review of Static Analysis Approaches for Programming Exercises. In Computer Assisted Assessment: Research into E-Assessment. Springer, 100–113.
[31] Taherkhani, A., and Malmi, L. 2013. Beacon- and Schema-Based Method for Recognizing Algorithms from Students' Source Code. Journal of Educational Data Mining 5, 2, 69–101.
[32] Tibshirani, R. 1996. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 1, 267–288.
[33] Truong, N., Roe, P., and Bancroft, P. 2004. Static Analysis of Students' Java Programs. In Proceedings of the 6th Australasian Conference on Computing Education, 317–325.
[34] Wang, T., Su, X., Wang, Y., and Ma, P. 2007. Semantic Similarity-Based Grading of Student Programs. Information and Software Technology 49, 2, 99–107.
[35] Winne, P., and Baker, R. 2013. The Potentials of Educational Data Mining for Researching Metacognition, Motivation and Self-Regulated Learning. Journal of Educational Data Mining 5, 1, 1–8.
[36] Zen, K., Iskandar, D., and Linang, O. 2011. Using Latent Semantic Analysis for Automated Grading of Programming Assignments. In International Conference on Semantic Technology and Information Retrieval, 82–88.
[37] 2016. K-12 Computer Science Framework. Retrieved August 25, 2018 from http://www.k12cs.org.