Investigating Elements of Student Persistence in an Introductory Computer Science Course

Juan D. Pinto, Yingbin Zhang, Luc Paquette, Aysa Xuemo Fan
University of Illinois at Urbana-Champaign
jdpinto2@illinois.edu, yingbin2@illinois.edu, lpaq@illinois.edu, xuemof2@illinois.edu

ABSTRACT
We explore how different elements of student persistence on computer programming problems may be related to learning outcomes and inform us about which elements may distinguish between productive and unproductive persistence. We collected data from an introductory computer science course at a large midwestern university in the U.S. hosted on an open-source, problem-driven learning system. We defined a set of features quantifying various aspects of persistence during problem solving and used a predictive modeling approach to predict student scores on subsequent and related quiz questions. We focused on careful feature engineering and model interpretation to shed light on the intricacies of both productive and unproductive persistence. Feature importance was analyzed using SHapley Additive exPlanations (SHAP) values. We found that the most impactful features were persisting until solving the problem, rapid guessing, and taking a break, while those with the strongest correlation between their values and their impact on prediction were the number of submissions, total time, and (again) taking a break. This suggests that the former are important features for accurate prediction, while the latter are indicative of the differences between productive persistence and wheel spinning in a computer science context.

Keywords
Student modeling, persistence, wheel spinning, predictive modeling, behavior detection

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION
Research on student modeling has identified various behaviors and patterns related to learning outcomes and student success. One such construct has a history of research outside of Educational Data Mining (EDM) and is receiving renewed attention within the EDM community. Known by diverse names—grit [8], perseverance [25], academic tenacity [9], and persistence—this trait has been the focus of studies that measure it, identify when students are exhibiting it, and quantify its effects on various aspects of student learning. More traditional efforts on this front have focused on measuring persistence using questionnaires and testing its effect based on grades and test scores [8, 17, 41]. Efforts to identify persistence in log data of game-based learning systems [7, 27, 34] or intelligent tutoring systems (ITS) [15] have shown great promise. Many of these efforts have specifically focused on improving persistence detectors for on-the-fly student feedback systems or interventions.

One aspect of persistence that has gained particular interest in the EDM community is the distinction between productive and unproductive persistence. Persistence is typically characterized by a determination to stick with a problem for long durations despite facing obstacles, and it has often been portrayed as a positive trait. However, researchers have come to question this simplistic stance, noting that there seem to be two related but opposing sides to persistence. On one hand, persistence may produce productive results when it leads to consistent, long-term effort [8] or when students relish the opportunity to overcome challenges [9]. On the other hand, students who are "stuck" may be better off going back to learning more about the subject rather than continuing to spend time working on a problem they don't yet fully understand [3]. In such cases, the student's persistence might be characterized as unproductive.

Given the opposing academic outlook of this dichotomy, understanding what differentiates productive from unproductive persistence is of critical importance. The latter has been termed wheel spinning in the literature and has been defined as "a student who spends too much time struggling to learn a topic without achieving mastery" [3].
Recent research has specifically focused on creating and improving automatic detectors of wheel spinning in ITSs [11, 15, 24, 39, 42] and game-based learning systems [27].

In the context of computer science education, [23] have suggested that fostering grit can lead to higher retention among CS students. Other research has identified a weak correlation between grit and measures of academic success [17, 25, 41], especially when focusing on one of the two main components of grit—perseverance of effort—which most closely aligns with definitions of persistence [35].

In this paper, we add to the existing literature by exploring how different elements of persistence on computer programming problems may contribute to learning outcomes. We defined a set of features quantifying various aspects of persistence during problem solving and used predictive modeling approaches to predict student scores on subsequent and related quiz questions. We focus on careful feature engineering and model interpretation to shed light on the intricacies of both productive and unproductive persistence. By investigating these constructs within a computer science course, our study also aims to better understand their application in this context.

2. RELATED WORK
2.1 Modeling Productive Persistence vs. Wheel Spinning
The EDM community's interest in persistence was sparked by [3], who found that students who struggle to master a skill within a certain timeframe are unlikely to do so at all. Besides identifying wheel spinning and describing how it differs from productive persistence, the same study found a clear correlation between wheel spinning and other negative behaviors such as gaming the system and disengagement.

Subsequent studies have devised variations in criteria for differentiating between productive persistence and wheel spinning [42], with many models defining mastery based on the number of correct submissions in a row and others relying heavily on the stability of Bayesian knowledge tracing (BKT) student model predictions [16]. Despite differences in operationalization, however, predictive machine learning models have been found to serve as successful wheel-spinning detectors. Some of the algorithms that have been used include linear regression [3], logistic regression [11, 42], decision trees [15, 27, 39], random forest [27, 42], and neural networks [24]. Most of these studies calculated productive persistence or wheel spinning labels based solely on the data gathered rather than relying on human observers or coders. Two notable exceptions are [24, 27].

The goal of the most recent studies has been to identify wheel spinning in ITSs as early as possible. [42] compared different criteria and feature sets and have shown that it is possible to make predictions with acceptable accuracy as early as step four of a problem. They were also surprised to find that a logistic regression model trained on only one feature ("correct response percentage") resulted in prediction performance that was close to their best models. Relying on hint requests, submission correctness, and time per skill, [39] concluded that models can detect students who will wheel spin after only three questions.

The studies mentioned thus far have focused almost exclusively on ITSs, which are most commonly used to teach math. Detecting and studying persistence on computer programming problems requires first understanding how data from these tasks has been analyzed in past studies.
2.2 Using Action Logs to Study Programming Behaviors
There is growing interest in leveraging data analytic methods to study students' action logs produced during programming activities [13], including to better understand students' programming processes, behaviors, and strategies. Log data have been used to generate visualizations of student behaviors that can be manually inspected to better understand their programming approach [5, 10], explore how students progress through homework assignments [6, 30], understand the learning pathways of novice programmers [4], and analyze problem-solving behavior in a debugging game [20].

Generally, two broad categories of features have been used: 1) frequencies of behaviors and 2) similarity/distance between programs. The first category provides aggregated information related to the quantity of actions performed by the student. This includes the number of blocks used in a Scratch program [10], how often a program was compiled and how many characters it included [5], the number of actions and logic primitives used [4], and the number of lines added, deleted, and modified [6]. [20] leveraged expert judgments to identify meaningful behaviors, such as massive deletion and replacing loops with repetitive code.

Studies have also developed features to evaluate how similar or different two computer programs are. [30] used a combination of the differences in bag of words, abstract syntax tree (AST) edits, and similarity in calls to the application programming interface (API) to identify similar program states. [6], in addition to using this same method, considered the frequency of changes in a student's program and the magnitude of those changes.

As our goal was to focus on behaviors related to how students approach solving a problem, rather than investigating the content of the submitted solution, we used an approach in line with the first category to investigate elements of student persistence in a series of computer programming problems. This allowed us to focus specifically on the productive and unproductive behaviors of persistent students.

3. METHODS
3.1 Data Collection and Label Generation
We collected data from an introductory computer science course at a large midwestern university in the U.S. hosted on PrairieLearn, an open-source, web-based, problem-driven learning system [40]. Throughout the semester, 733 students used PrairieLearn to submit almost-daily programming homework problems, take weekly quizzes, and complete cumulative exams. In addition, students were free to practice past problems and questions as much as they desired. As our work aims to investigate the relationship between persistence during homework and subsequent assessment, we filtered the data to focus on attempts submitted towards solving a homework problem or a quiz question. After removing practice submissions and other non-credit assignments, our resulting data set consisted of 290,703 individual homework problem attempts and 313,097 quiz question attempts.
All homework assignments were programming problems with checkstyle, compiler, and problem-specific tests that students' code had to pass to receive full credit. Students had one day to successfully complete each homework problem. They were allowed to submit solution attempts as often as required until they successfully passed all the tests. After each submission, the system ran tests to check the correctness of the solution and provided feedback indicating mistakes. First, the system tested whether the solution had any checkstyle or compiler errors. If such errors existed, the system showed feedback about these errors and stopped. If there were no checkstyle or compiler errors, the system further used several problem-specific tests to examine whether the solution fulfilled the requirements. For example, given some random input, would the solution generate the correct output? If not, the system would return feedback about the problem-specific test error. Otherwise, the solution was regarded as correct.

We aggregated our dataset at the student-problem level using a series of features specifically related to persistence. While persistence can be studied at various grain sizes, we chose this level due to our interest in how students tackle difficulties within a particular programming problem. Similarly, we only kept instances that demonstrated struggling, as defined in section 3.2.1, since these were the cases that could elicit persistence from students.

Quizzes were conducted weekly as part of regular class activity to assess learning and consisted of both multiple-choice questions and programming tasks. Quizzes were made available at the end of the week and were designed to provide early assessment related to the content of the homework problems assigned earlier that week. We aligned the content of each homework problem to corresponding multiple-choice quiz questions to directly investigate the relationship between persistence in specific homework problems and outcome on related assessment questions. Once we had these alignments, we calculated for each student-problem instance the total number of points obtained on the relevant quiz questions and the maximum possible points. Using these values, we then calculated the point percentage as the indicator of learning. Only quiz questions that students attempted were considered for these calculations.
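To make the label construction concrete, the following is a minimal sketch (not the code used in the study) of how the point-percentage label could be computed with pandas from an alignment table between homework problems and quiz questions. All column names (student_id, problem_id, quiz_question_id, points, max_points) are hypothetical placeholders for whatever the actual PrairieLearn export contains.

import pandas as pd

def quiz_score_labels(quiz_attempts: pd.DataFrame, alignment: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (student, homework problem) with the point percentage
    earned on the quiz questions aligned with that problem."""
    # Keep only quiz questions that the student actually attempted.
    attempted = quiz_attempts.dropna(subset=["points"])

    # Map each attempted quiz question to the homework problem(s) it was aligned with.
    merged = attempted.merge(alignment, on="quiz_question_id", how="inner")

    # Sum points earned and points possible over the aligned questions, per student-problem pair.
    per_pair = merged.groupby(["student_id", "problem_id"], as_index=False).agg(
        points_earned=("points", "sum"),
        points_possible=("max_points", "sum"),
    )
    per_pair["score"] = per_pair["points_earned"] / per_pair["points_possible"]
    return per_pair[["student_id", "problem_id", "score"]]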
After these changes and calculations, our aggregated dataset consisted of 7,673 instances of student-problem pairs, submitted by a total of 710 students.

The resulting distribution of the score outcome variable had a strong negative skew, with most instances accumulated at higher scores, as shown in Figure 1. This is because students often managed to obtain a perfect score on their aligned quiz questions. Students were typically given two chances to select the right answer, the second time for half credit.

Figure 1. Distribution of score values

3.2 Feature Engineering
Given our goal to study how specific behaviors might be related to persistence, our feature engineering efforts focused on developing features based on an underlying rationale about their relationship to productive or unproductive persistence. Following the Carnegie Foundation for the Advancement of Teaching's definition of productive persistence—"tenacity plus the use of good strategies" [18]—we sought to identify good learning strategies and habits based on the available data. Other features were based on more generalized applications of the aspects of unproductive persistence that have been identified in the wheel-spinning literature. This process resulted in a total of 12 base features. We also standardized most of these at the problem level (by subtracting the problem's mean and dividing by the problem's standard deviation) to create an additional 10 features. The rest of this section describes each feature and our rationale behind it.

3.2.1 Struggling threshold features
Whether the student went beyond a problem's corresponding time or attempt threshold.

We defined students as struggling if they worked on a programming problem for a long time or if they submitted a high number of solutions to a problem. We considered that students could only show persistence in the context of problems for which they struggled.

This operationalization of struggling depends on identifying both a time and an attempt threshold, each specifically calculated for that homework problem. Thus, once we calculated the thresholds for each problem, we created two binary struggling threshold features: beyond time threshold and beyond attempt threshold. We only kept instances of students that satisfied at least one of these two criteria. We also created two numerical features that measured a student's deviation from each of these thresholds. Because the thresholds were already calculated at the problem level, standardizing these deviation features would result in perfectly collinear features, so we did not standardize them.

For the time threshold, we used the minimum value between the 75th quantile of students' total time on each problem and 15 minutes. We combined the 75th quantile and 15 minutes to determine the time threshold for several reasons. First, given that the course is only an introductory CS course, it is reasonable that one fourth of students struggled with difficult programming problems. Second, the proportion of students who struggled with unchallenging problems would be smaller. Using an absolute threshold would be better for these cases. Third, we used 15 minutes as the absolute threshold because 57.56% of problems had a 75th quantile of total time smaller than 15 minutes. It seems reasonable to regard close to half of problems as unchallenging.

Given that the number of attempts is an important indicator of persistence, many attempts on a problem might also be indicative of struggling, even when the total time spent on the problem falls under the time threshold. Analogous to deciding the time threshold, we used the minimum value between the 75th quantile of the number of attempts on a problem and 9 attempts to determine the attempt threshold. If the 75th quantile of the number of attempts on a problem was smaller than 9 attempts, the latter became the attempt threshold. We used 9 attempts as the absolute threshold because 56.06% of problems had a 75th quantile of the number of attempts no more than 9 attempts. This number was close to 57.56%, the proportion of problems with a 75th quantile of total time smaller than 15 minutes.
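The struggling thresholds and the problem-level standardization described above could be expressed roughly as follows. This is a minimal sketch under assumed column names (problem_id, total_time, num_submissions); it follows the "minimum of the 75th quantile and an absolute cap" wording literally and is not the authors' implementation.

import pandas as pd

TIME_CAP_SECONDS = 15 * 60  # 15-minute absolute cap on the time threshold
ATTEMPT_CAP = 9             # 9-attempt absolute cap on the attempt threshold

def add_struggling_features(df: pd.DataFrame) -> pd.DataFrame:
    per_problem = df.groupby("problem_id")
    time_q75 = per_problem["total_time"].transform(lambda s: s.quantile(0.75))
    attempt_q75 = per_problem["num_submissions"].transform(lambda s: s.quantile(0.75))

    # Per-problem thresholds: 75th quantile combined with the absolute cap.
    df["time_threshold"] = time_q75.clip(upper=TIME_CAP_SECONDS)
    df["attempt_threshold"] = attempt_q75.clip(upper=ATTEMPT_CAP)

    # Binary struggling indicators and deviation features.
    df["beyond_time_threshold"] = (df["total_time"] > df["time_threshold"]).astype(int)
    df["beyond_attempt_threshold"] = (df["num_submissions"] > df["attempt_threshold"]).astype(int)
    df["time_threshold_deviation"] = df["total_time"] - df["time_threshold"]
    df["attempt_threshold_deviation"] = df["num_submissions"] - df["attempt_threshold"]

    # Keep only instances that crossed at least one threshold (i.e., struggling students).
    return df[(df["beyond_time_threshold"] == 1) | (df["beyond_attempt_threshold"] == 1)]

def standardize_by_problem(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    # Problem-level z-scores: subtract each problem's mean and divide by its standard deviation.
    for col in cols:
        grp = df.groupby("problem_id")[col]
        df[f"{col}_std"] = (df[col] - grp.transform("mean")) / grp.transform("std")
    return df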
3.2.2 Solved
Whether the student successfully solved the programming problem before the deadline.

This is directly related to wheel spinning as defined by [11]: "problem solving without making progress towards mastery." While PrairieLearn is not suited for measuring mastery the way [11] did with the Cognitive Algebra Tutor and ASSISTments ITSs (three consecutive, correct responses within a specific skill), persistence while struggling that does not lead to an eventual correct solution can be considered a form of wheel spinning or unproductive persistence. Based on this, we hypothesized that solving a challenging problem (productive persistence) would lead to a higher quiz-question score than not solving the problem (wheel spinning).

3.2.3 Number of submissions
The count of how many times the student submitted an attempted solution for the problem.

This is a typical measure used in the persistence literature [15, 39, 42]. Since submissions on PrairieLearn typically end when a student successfully solves a problem, this feature is a count of the number of failed attempts + 1. In essence, this is one way of measuring the level of persistence demonstrated. We reasoned that more unsuccessful attempts would indicate more wheel spinning, resulting in lower quiz scores.

3.2.4 Total time on problem
The total amount of time (in seconds) spent solving the problem.

As with the number of submissions, the time that students spend on a challenging problem might indicate the amount of persistence being demonstrated. We again reasoned that more time (and thus more wheel spinning) may be predictive of more struggling and lower scores on the quiz questions.

Our platform only allowed us to measure the time between submissions, so we had no way of knowing with certainty how much time was spent working on a problem. If the time difference between a student's two consecutive submissions was beyond 15 minutes, we regarded this student as being away from this problem during that interval (see the feature taking a break below for the choice of 15 minutes as a threshold). In these cases, we replaced this time difference with the student's mean time difference between other consecutive submissions on this problem so that we could estimate the student's total time on the problem more accurately.
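A minimal sketch of this total-time estimate is shown below, assuming a submission log with student_id, problem_id, and a datetime timestamp column (all hypothetical names); gaps longer than 15 minutes are replaced by the student's mean within-problem gap, as described above.

import pandas as pd

BREAK_GAP_SECONDS = 15 * 60  # gaps longer than this are treated as time away from the problem

def estimate_total_time(submissions: pd.DataFrame) -> pd.Series:
    """Return estimated total time (seconds) per (student_id, problem_id) pair."""
    def per_pair_total(group: pd.DataFrame) -> float:
        gaps = group.sort_values("timestamp")["timestamp"].diff().dt.total_seconds().dropna()
        if gaps.empty:
            return 0.0
        short_gaps = gaps[gaps <= BREAK_GAP_SECONDS]
        # Replace "away" intervals with the mean of the remaining gaps (0 if none remain).
        fill_value = short_gaps.mean() if not short_gaps.empty else 0.0
        return float(gaps.where(gaps <= BREAK_GAP_SECONDS, fill_value).sum())

    return submissions.groupby(["student_id", "problem_id"]).apply(per_pair_total)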
3.2.5 Taking a break
Whether the student spent time away from the problem after passing one of the struggling thresholds.

We defined taking a break as a struggling student being away from the problem at least once. When the time between two consecutive submissions on the same problem went beyond 15 minutes, we regarded the student as away from the task. As discussed above, 15 minutes might be sufficient for solving unchallenging problems if students did not struggle. Moreover, 81.57% of pairs of consecutive submissions had a time difference of less than 15 minutes. This proportion only increased slightly, to 83.77%, when increasing the threshold from 15 minutes to 1 hour. Thus, it is reasonable to use 15 minutes as the threshold for being away from the problem. Note that if a student attempted other homework problems between two consecutive submissions on the same problem, we regarded this student as interleaving rather than taking a break.

Our rationale for measuring break taking is based on the idea that a wheel-spinning state may be overcome by time away from the task. Some of the cognitive benefits of breaks have been documented [1, 19, 26, 36] and seem to be especially impactful for intensive and prolonged tasks. The term wheel spinning itself was coined in reference to the imagery of a car spinning its wheels but not going anywhere, suggesting that the indiscriminate tactic of repeated attempts may not always be productive. In their article defining this new construct, [3] suggest devising ways to break up fruitless attempts at solving problems. Our feature tries to capture students who independently choose to break up their homework in this way.

3.2.6 Interleaving
Whether the student switches to a different problem for a time and then comes back to continue attempting the original problem.

Interleaved practice, as opposed to blocked practice, refers to a learning technique that mixes up the order of topics, lessons, or problems presented. Studies have shown that this practice usually improves learning outcomes [32, 38], though—to the best of our knowledge—this has not been explored in a CS context. For the purposes of our study, we measured interleaving as a student attempting a problem without solving it, attempting a different problem, and then returning to continue working on the original problem. We reasoned that such a practice could potentially serve to break up the monotony and potential frustration associated with wheel spinning and thereby lead to better learning. We considered this an alternative to taking a break and did not double count such instances in the features.

3.2.7 Rapid guessing
Whether the student submitted at least three quick submissions in a row.

Quick, consecutive submissions may indicate guessing or uncritical attempts to fix problems without much reflection. This behavior has been associated with students trying to game the system [2, 28] and with wheel spinning [3]. Given the nature of programming tasks as opposed to attempts in an ITS, we defined a quick submission as a gap between attempts of less than 15 seconds. If a student's submission stream for a problem contained three or more consecutive quick submissions, we labeled this student as performing rapid guessing on the problem. We hypothesized that rapid guessing would be associated with a lower score on related quiz questions.

3.2.8 Time interval between consecutive submissions
The student's mean and standard deviation of time intervals between consecutive attempts on the problem.

Shorter time between submissions may indicate more unproductive attempts to push through to an answer without stopping to think/work carefully or take breaks [39]. This is also similar to the common practice of cramming, as opposed to the more effective practice of spaced repetition. [15] found both the mean and standard deviation of time differences to be about equally as predictive of wheel spinning. We nevertheless chose to include both features in our initial model to test this claim. We did not count cases of break taking (intervals longer than 15 minutes) towards these features. Because of its association with wheel spinning, we hypothesized that we would find a positive correlation between these features and score.
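The gap-based behaviors above (taking a break, rapid guessing, and the interval features) all derive from the sequence of submission timestamps within one student-problem pair. The sketch below illustrates one possible reading of those definitions—rapid guessing is interpreted here as a run of at least three sub-15-second gaps—and omits the additional checks described above (the struggling requirement for breaks and the interleaving exclusion).

import pandas as pd

QUICK_GAP_SECONDS = 15        # a "quick" submission follows the previous one within 15 s
BREAK_GAP_SECONDS = 15 * 60   # a gap longer than 15 min counts as being away from the task

def gap_features(timestamps: pd.Series) -> dict:
    """timestamps: datetime series of one student's submissions to one problem."""
    gaps = timestamps.sort_values().diff().dt.total_seconds().dropna()

    # Rapid guessing: three or more consecutive quick submissions (runs of short gaps).
    quick = (gaps < QUICK_GAP_SECONDS).astype(int)
    run_lengths = quick.groupby((quick != quick.shift()).cumsum()).sum()
    rapid_guessing = int((run_lengths >= 3).any())

    # Taking a break: at least one gap beyond the 15-minute threshold.
    taking_break = int((gaps > BREAK_GAP_SECONDS).any())

    # Mean/SD of intervals, excluding break-length gaps as described above.
    working_gaps = gaps[gaps <= BREAK_GAP_SECONDS]
    return {
        "rapid_guessing": rapid_guessing,
        "taking_break": taking_break,
        "avg_time_diff": working_gaps.mean(),
        "sd_time_diff": working_gaps.std(),
    }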
3.3 Machine Learning and Interpretation
To test the importance of our various features, we created a random forest model using a shuffled 70/30 validation/testing split grouped by student, with 5,396 and 2,277 instances respectively. We conducted 500 iterations of Bayesian hyperparameter optimization on the validation set using 10-fold cross validation grouped by student. This hyperparameter tuning was set to optimize the R2 score.

We originally tested a wide array of models, including various linear, tree-based, and ensemble algorithms, and we further tuned some of the most promising ones. We found that variations of gradient boosting models performed best. However, we chose to focus on random forest for our feature interpretation for two reasons: (1) the performance gained by using the best models over random forest was negligible, and (2) random forest models have been shown to be useful for predictions related to persistence in other EDM research [27, 42].

Once we had constructed our final model, we re-trained it on the entire dataset in preparation for feature interpretation.
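A minimal sketch of this setup using the libraries named later in this section (scikit-learn and scikit-optimize): a student-grouped 70/30 split followed by BayesSearchCV with student-grouped 10-fold cross-validation optimizing R2. The hyperparameter ranges shown are illustrative assumptions, not the search space used in the study.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, GroupShuffleSplit
from skopt import BayesSearchCV
from skopt.space import Integer

def tune_random_forest(X: pd.DataFrame, y: pd.Series, students: pd.Series):
    # Grouped 70/30 validation/testing split so no student appears in both sets.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
    train_idx, test_idx = next(splitter.split(X, y, groups=students))

    search = BayesSearchCV(
        estimator=RandomForestRegressor(random_state=42),
        search_spaces={
            "n_estimators": Integer(100, 1000),
            "max_depth": Integer(3, 30),
            "min_samples_leaf": Integer(1, 20),
        },
        n_iter=500,                   # 500 iterations of Bayesian optimization
        cv=GroupKFold(n_splits=10),   # 10-fold cross validation grouped by student
        scoring="r2",
        random_state=42,
    )
    search.fit(X.iloc[train_idx], y.iloc[train_idx], groups=students.iloc[train_idx])

    test_r2 = search.score(X.iloc[test_idx], y.iloc[test_idx])
    return search.best_estimator_, test_r2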
We tested and tuned a variety of feature subset, a decision tree will typically ignore one in favor of models using PyCaret [31], performed Bayesian optimization [14] the other. We suspect that much of our model’s preference for the with scikit-optimize [33], and investigated feature importance standardized features over unstandardized ones is the added using SHAP [22]. problem-level information they contain, which could be interpreted as information regarding the difficulty of the problem. 4. RESULTS AND DISCUSSION However, while the predictive power of a random forest is not 4.1 Model Results and Preliminary Analysis affected by collinear features, model interpretability suffers, as we Our tuned random forest model attained an average cross-validated found through our preliminary analysis. Given our goal of better R2 of 0.133 and an average RMSE of 0.129 on the validation set. understanding the different aspects of persistence and their On the held-out testing set, the resulting R2 was 0.145 and the relationships, we decided to remove the original non-standardized RMSE was 0.130. Our persistence features accounted for roughly features. We also removed time_threshold_deviation and 14% of the variation in related quiz scores. attempt_threshold_deviation, which were very highly correlated with total_time_std and num_submissions_std respectively. We A preliminary analysis of our model uncovered certain important patterns. For one, our least impactful features were all binary then re-trained and re-tested our model. measures—such as whether interleaving, rapid guessing, or break- After removing these features, we found that our model’s average taking were observed—whereas our top features were the cross-validated R2 on the validation set increased slightly, from standardized measures of those binary features. Figure 2 shows the 0.133 to 0.134, while RMSE remained constant. On the held-out entire set of feature rankings based on mean absolute SHAP values. testing set, its R2 also increased, from 0.145 to 0.147, while RMSE remained constant. We then re-trained our model on the entire dataset in preparation for our in-depth feature analysis. 4.2 Feature Importance and Interpretation 4.2.1 Feature rankings Our analysis using SHAP values found that the solved_std and rapid_guessing_std features had the biggest effect, accounting for an average impact of 0.0215 and 0.0172 on the predicted score respectively. The third most important feature, taking_break_std, had an average impact less than half as strong at 0.0076. Together, these three features account for 75% of all features’ total impact on the predicted score. Figure 3 shows the feature rankings based on mean absolute SHAP values, while Table 1 allows for comparison with other methods such as Gini-impurity-based importance and permutation importance. Rankings based on these three different approaches yielded almost identical results with only minor variations, strengthening the reliability of our findings. Figure 2. Preliminary model feature rankings A detailed exploration of these features revealed what appears to be an opposing impact between some binary features and their standardized counterparts. For example, the feature solved has a negative correlation between its values and its SHAP values (r = -0.217, p < 0.0001), whereas its standardized version, solved_std, has a positive correlation (r = 0.33, p < 0.0001). Measuring this correlation between feature and SHAP values allows us to better understand how the model is using the feature. 
4. RESULTS AND DISCUSSION
4.1 Model Results and Preliminary Analysis
Our tuned random forest model attained an average cross-validated R2 of 0.133 and an average RMSE of 0.129 on the validation set. On the held-out testing set, the resulting R2 was 0.145 and the RMSE was 0.130. Our persistence features accounted for roughly 14% of the variation in related quiz scores.

A preliminary analysis of our model uncovered certain important patterns. For one, our least impactful features were all binary measures—such as whether interleaving, rapid guessing, or break-taking were observed—whereas our top features were the standardized measures of those binary features. Figure 2 shows the entire set of feature rankings based on mean absolute SHAP values.

Figure 2. Preliminary model feature rankings

A detailed exploration of these features revealed what appears to be an opposing impact between some binary features and their standardized counterparts. For example, the feature solved has a negative correlation between its values and its SHAP values (r = -0.217, p < 0.0001), whereas its standardized version, solved_std, has a positive correlation (r = 0.33, p < 0.0001). Measuring this correlation between feature and SHAP values allows us to better understand how the model is using the feature. Higher correlation, and thus a stronger linear relationship, suggests a more straightforward interpretation of the feature's role in the model. While the impact of solved is very small in the overall model (ranked 17th, mean absolute SHAP = 0.00003), solved_std is our top feature in terms of overall impact on the predicted score (mean absolute SHAP = 0.02129). We found this same inverted relationship between many other impactful standardized features and their original, binary, far less impactful counterparts.

Because we standardized features at the problem level, the correlation between each unstandardized feature and its corresponding standardized feature is never quite perfect, but some do come close. Random forest models typically do not suffer from collinear features the way more traditional statistical regression methods do. This is largely because of the way features are randomly sampled for each tree. Even when both collinear features are part of the feature subset, a decision tree will typically ignore one in favor of the other. We suspect that much of our model's preference for the standardized features over unstandardized ones is due to the added problem-level information they contain, which could be interpreted as information regarding the difficulty of the problem.

However, while the predictive power of a random forest is not affected by collinear features, model interpretability suffers, as we found through our preliminary analysis. Given our goal of better understanding the different aspects of persistence and their relationships, we decided to remove the original non-standardized features. We also removed time_threshold_deviation and attempt_threshold_deviation, which were very highly correlated with total_time_std and num_submissions_std respectively. We then re-trained and re-tested our model.

After removing these features, we found that our model's average cross-validated R2 on the validation set increased slightly, from 0.133 to 0.134, while RMSE remained constant. On the held-out testing set, its R2 also increased, from 0.145 to 0.147, while RMSE remained constant. We then re-trained our model on the entire dataset in preparation for our in-depth feature analysis.

4.2 Feature Importance and Interpretation
4.2.1 Feature rankings
Our analysis using SHAP values found that the solved_std and rapid_guessing_std features had the biggest effect, accounting for an average impact of 0.0215 and 0.0172 on the predicted score respectively. The third most important feature, taking_break_std, had an average impact less than half as strong, at 0.0076. Together, these three features account for 75% of all features' total impact on the predicted score. Figure 3 shows the feature rankings based on mean absolute SHAP values, while Table 1 allows for comparison with other methods such as Gini-impurity-based importance and permutation importance. Rankings based on these three different approaches yielded almost identical results with only minor variations, strengthening the reliability of our findings.

Figure 3. Final model feature rankings

Besides ranking the features by impact on the predicted score, SHAP values allow us to explore the nature of that impact more deeply, as well as the interactions between features. Figure 4 is a beeswarm plot of SHAP values by feature, with color indicating the value of each individual instance.

Figure 4. SHAP beeswarm plot

To further aid our interpretation, we also explored which features had the highest absolute correlation between their values and their corresponding SHAP values. In essence, this correlation is a measure of just how linear each feature's effect is on the predicted score. We calculated Pearson's r for all features (see Table 1) and found that all p values were below 0.0001, except for sd_time_diff. Throughout this analysis, we point out when a feature's correlation is indicative of a linear relationship.

Table 1. Feature impact measures (r is the correlation between feature values and corresponding SHAP values)

feature                        mean absolute SHAP  Gini importance  permutation importance       r         p
solved_std                                0.02149          0.26554                 0.17083   0.340  < 0.0001
rapid_guessing_std                        0.01724          0.20987                 0.17059  -0.076  < 0.0001
taking_break_std                          0.00758          0.10511                 0.04310  -0.608  < 0.0001
beyond_time_threshold_std                 0.00373          0.07514                 0.02790  -0.349  < 0.0001
beyond_attempt_threshold_std              0.00369          0.07028                 0.02760  -0.080  < 0.0001
num_submissions_std                       0.00326          0.07751                 0.03082  -0.833  < 0.0001
total_time_std                            0.00257          0.07871                 0.02550  -0.773  < 0.0001
avg_time_diff_std                         0.00099          0.05398                 0.01558  -0.158  < 0.0001
sd_time_diff_std                          0.00098          0.06166                 0.01856   0.008  > 0.5
interleaving_std                          0.00017          0.00219                 0.00028   0.126  < 0.0001
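The r column of Table 1 can, in principle, be reproduced by correlating each feature's values with its SHAP values; a minimal sketch (assuming the feature matrix X and a matching shap_values array, as in the earlier sketch) is shown below.

import pandas as pd
from scipy.stats import pearsonr

def feature_shap_correlations(X: pd.DataFrame, shap_values) -> pd.DataFrame:
    """Pearson's r (and p value) between each feature's values and its SHAP values."""
    rows = []
    for i, col in enumerate(X.columns):
        r, p = pearsonr(X[col], shap_values[:, i])
        rows.append({"feature": col, "r": r, "p": p})
    return pd.DataFrame(rows).sort_values("r", key=abs, ascending=False)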
4.2.2 Solved
We can see (Figure 4) that the bulk of solved_std is composed of high values (red color), indicating that most students managed to solve most homework problems. The long positive skew suggests that small, positive variations in this feature could potentially push the predicted quiz score up by about 0.1. The few lower values in this feature (blue color) are found on the left side of the plot, suggesting that not solving the problem tended to pull the predicted score down. Indeed, we found a moderate positive linear relationship between solved_std and its SHAP values (r = 0.34), further confirming our initial analysis. This confirms our hypothesis. It suggests that solving a challenging problem (productive persistence) may be related to a better understanding of the underlying concepts, whereas not solving the problem (wheel spinning) suggests a lack of understanding.

4.2.3 Rapid guessing
Our model's second most impactful feature, rapid_guessing_std, is in many ways the opposite. Most students did not engage in rapid guessing. Those who did, particularly on homework problems where few others did—identified by high rapid_guessing_std, or red color in the beeswarm plot (Figure 4)—generally saw their predicted score affected negatively by this feature. This effect can more clearly be seen when plotting the SHAP values for the feature against the values of the feature itself (Figure 5). This view allows us to get a better sense of how most instances with a higher rapid_guessing_std value impact the predicted score negatively. This aligns with our hypothesis: rapid guessing, with its potential implications of wheel spinning [3] and gaming the system [2, 28], is indicative of lower learning outcomes.

Figure 5. Correlation between rapid_guessing_std and its SHAP values, with solved_std as color

By adding the values of our top impactful feature, solved_std, as the color of the plot illustrated in Figure 5, we can also see an interesting interaction between the two features. It appears that the impact of the high rapid_guessing_std values is at least partly dependent on solved_std—instances where the student failed to solve the problem (in blue) were less negatively impacted by rapid_guessing_std (as indicated by their mostly positive SHAP values). One explanation may be that students who rely on rapid guessing and manage to solve the problem may come away with more misguided confidence in their mastery of the material than those who fail to solve the problem, and are thus less likely to consider reviewing before a quiz. However, this hypothesis was not investigated further.

4.2.4 Taking a break
Our model's third most impactful feature, taking_break_std, has a very clear pattern that is easily observable in Figure 4. Lower feature values generally lead to a positive impact on predicted score, whereas taking a break is more likely to have a negative impact on score. We found a negative linear relationship between taking_break_std and its SHAP values (Figure 6), with a Pearson's r of -0.608. The distribution of SHAP values for this feature indicates a potential negative impact about three times as large as the positive one.
Figure 6. Correlation between taking_break_std and its SHAP values

This result is the opposite of what we hypothesized. Since taking breaks during a difficult task has been shown to improve cognition [36], we hypothesized that students who took a break while struggling would ultimately be more productive. We specifically marked a student as taking a break only if there was a large gap between submissions (15 minutes) after they had passed one of the two struggling thresholds.

One possible explanation is that students who took a break did, in fact, perform better than they would have otherwise. Since our method does not directly test causation, our model may be using this feature as a proxy for students who struggled more than others. Another possibility is that this feature is not solely capturing intentional break-taking, but also interruptions to students' work, which may serve as distractions—certainly not an ideal learning situation. We did not calculate how many times students took a break, only whether there was at least one 15-minute gap between submissions when struggling. Finally, because homework problems were due at midnight on the day they became available, students may simply not have had sufficient time for effective break taking. Without additional information about learning context or calculating additional features, we have no way of knowing which of these explanations, if any, are the most likely.

4.2.5 Struggling threshold features
For beyond_time_threshold_std, we can see in Figure 4 that lower values generally lead to increases in the predicted score and vice versa. This is indicative of the underlying attribute this feature attempts to capture—going beyond the time threshold yields smaller (generally negative) SHAP values, whereas not going beyond the time threshold yields larger (generally positive) values, the exact value being heavily affected by how much other students crossed the threshold on the same problem. For students who take longer than the norm, this generally has a negative effect on their score. The relationship here is moderately linear, with an r of -0.349.

The fifth top feature that we identified, beyond_attempt_threshold_std, does not have such a clear pattern. The SHAP values seem to be widely spread irrespective of the feature's values. The feature's distribution is bimodal, as is the case with most of the features that standardize a binary variable, and we did find a small distinction in the SHAP values between the two modes (Figure 7). While the mean for each mode is essentially zero, higher instances of beyond_attempt_threshold_std, which correspond with student-problem instances that went beyond that problem's attempt threshold, have a moderate negative correlation with their SHAP values (r = -0.409, p < 0.0001), and lower instances, on the other hand, have a positive correlation about equally as strong (r = 0.382, p < 0.0001). This suggests that the impact of this feature on predicted score is highly dependent on how much one's status on the underlying binary variable (beyond_attempt_threshold) varies from the norm for that given homework problem.

Figure 7. Correlation between beyond_attempt_threshold_std and its SHAP values (with annotations)

4.2.6 Number of submissions
We found that num_submissions_std, our model's sixth top feature in terms of impact, has the strongest correlation between its feature values and SHAP values (r = -0.833). This fits with our hypothesis. The more attempts that students submit, the more likely they are to be struggling, and the less likely they are to perform well when tested on the same skills during their weekly quiz.

4.2.7 Time features
We found that our three time-related features—not including beyond_time_threshold_std, which is of a very different nature since its non-standardized version is a binary feature—had some of the weakest predictive power in our model. total_time_std had a still-moderate mean absolute SHAP of 0.00257 and a very strong correlation between its feature and SHAP values, with r = -0.773. avg_time_diff_std and sd_time_diff_std, by comparison, had a much lower mean absolute SHAP (respectively 0.00099 and 0.00098) and no correlation.

The strong, negative correlation between total_time_std and its SHAP values means that the model is interpreting longer time on a problem as being related to lower learning outcomes, or at the very least as a student struggling enough with a problem to lead to a lower score on the weekly quiz. This latter possibility is in line with our hypothesis and with what we found for beyond_time_threshold_std.
Interestingly, this pattern is far more pronounced for instances that went beyond the time threshold (red points in Figure 8), whereas the relationship is seemingly reversed for cases where students did not go beyond the time threshold (blue/purple points in Figure 8).

Figure 8. Correlation between total_time_std and its SHAP values, with beyond_time_threshold_std as color

As for the two features that specifically look at time between submissions (avg_time_diff_std and sd_time_diff_std), their weakness both in predictive impact and in correlation with SHAP values suggests, at face value, that this factor has little value for predicting learning success (or lack thereof) when students struggle with a problem. These features' impact may also have been affected by the high correlation between them (r = 0.76). Similar information may have also been captured by a combination of beyond_time_threshold_std and beyond_attempt_threshold_std.

4.2.8 Interleaving
Finally, our model's least impactful feature, interleaving_std, had by far the lowest mean absolute SHAP value (0.00017) and a low correlation between its feature values and its SHAP values (r = 0.126). We originally hypothesized that this feature would play a bigger role in predicting students' scores, considering that the practice of interleaving when struggling is generally considered a good learning practice [32, 38]. However, its low impact in our model is likely because we had so few instances of interleaving—only nine out of 7,673 instances. Most of these nine did lead to an increase in predicted score, but without more examples of the practice, we are unable to make any sound conclusions regarding its role.

4.3 Limitations
Our study suffers from limitations primarily related to the aligned-quiz-question scores we calculated for each student-problem instance. For one, the score distribution was heavily skewed due to the abundance of almost perfect quiz scores. Additionally, while the PrairieLearn platform allowed us to use the course's quizzes without requiring students to take an additional posttest, the scores did not take into account students' prior knowledge and skills. This made it difficult to measure the impact of students' productive vs. unproductive persistence directly.

These factors likely led to our model's limited predictive performance (R2 = 0.147 on the held-out test set). While we believe that our final model's performance was sufficient for our purposes of interpreting the relationship between elements of persistence and learning outcomes, it should be possible to create a more accurate model without severely sacrificing interpretability.

5. CONCLUSION
The most impactful features were those related to solving the problem, rapid guessing, and taking a break. Those with the most straightforward linear effect were the number of submissions, total time, and (again) taking a break. All three of the latter had a strong negative correlation between their feature values and their impact on prediction. In other words, more attempts, taking a longer time, and taking a break are all correlated with lower scores on related quiz questions. Solving the problem—our most impactful feature—had a moderate positive correlation, highlighting the positive nature of the relationship between successfully completing homework problems and score on subsequent related quiz questions. This all suggests that solving the problem and rapid guessing are important features for accurate prediction, while the number of submissions and total time are indicative of the differences between productive persistence and wheel spinning in a computer science context. Taking a break fits into both of these categories.

Perhaps most important, we were able to identify features that are directly related to learning strategies. Our findings suggest that students should avoid rapidly submitting subsequent programming attempts without actively trying to address problems in their code (rapid guessing). Taking a break may also be unproductive behavior, though this finding may be an artifact of the specific context in which students were able to submit homework in this course, as well as the particular way in which we calculated this feature. As for interleaving, its predictive strength in our model was low, but its effects nevertheless suggest that a future investigation should study whether it can be an effective practice when struggling on a problem.

In order to address the limitations of our study, we suggest that future research focus on devising a more robust measure of learning that takes into account students' individual starting points. Additionally, for the CS context of this study, a valid measure of programming proficiency that considers the problem-solving process would be superior to the quiz scores we used as a proxy.
6. ACKNOWLEDGMENTS
We would like to acknowledge NSF grant #DRL-1942962 for making this work possible.

7. REFERENCES
[1] Ariga, A. and Lleras, A. 2011. Brief and rare mental "breaks" keep you focused: Deactivation and reactivation of task goals preempt vigilance decrements. Cognition. 118, 3 (Mar. 2011), 439–443. DOI:https://doi.org/10/b8qg78.
[2] Baker, R.S., Corbett, A.T., Koedinger, K.R. and Wagner, A.Z. 2004. Off-task behavior in the cognitive tutor classroom: When students "game the system." CHI '04: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (2004), 8.
[3] Beck, J.E. and Gong, Y. 2013. Wheel-spinning: Students who fail to master a skill. Artificial Intelligence in Education (Berlin, Heidelberg, 2013), 431–440.
[4] Berland, M., Martin, T., Benton, T., Petrick Smith, C. and Davis, D. 2013. Using learning analytics to understand the learning pathways of novice programmers. Journal of the Learning Sciences. 22, 4 (Oct. 2013), 564–599. DOI:https://doi.org/10/gg7fkh.
[5] Blikstein, P. 2011. Using learning analytics to assess students' behavior in open-ended programming tasks. Proceedings of the 1st International Conference on Learning Analytics and Knowledge (Banff, Alberta, Canada, Feb. 2011), 110–116.
[6] Blikstein, P., Worsley, M., Piech, C., Sahami, M., Cooper, S. and Koller, D. 2014. Programming pluralism: Using learning analytics to detect patterns in the learning of computer programming. Journal of the Learning Sciences. 23, 4 (Oct. 2014), 561–599. DOI:https://doi.org/10.1080/10508406.2014.954750.
[7] DiCerbo, K.E. 2014. Game-based assessment of persistence. Journal of Educational Technology & Society. 17, 1 (2014), 17–28.
[8] Duckworth, A.L., Peterson, C., Matthews, M.D. and Kelly, D.R. 2007. Grit: Perseverance and passion for long-term goals. Journal of Personality and Social Psychology. 92, 6 (2007), 1087–1101. DOI:https://doi.org/10.1037/0022-3514.92.6.1087.
[9] Dweck, C.S., Walton, G.M. and Cohen, G.L. 2014. Academic tenacity: Mindsets and skills that promote long-term learning. Bill & Melinda Gates Foundation.
[10] Fields, D.A., Quirke, L., Amely, J. and Maughan, J. 2016. Combining big data and thick data analyses for understanding youth learning trajectories in a summer coding camp. Proceedings of the 47th ACM Technical Symposium on Computing Science Education (New York, NY, USA, Feb. 2016), 150–155.
[11] Gong, Y. and Beck, J.E. 2015. Towards detecting wheel-spinning: Future failure in mastery learning. Proceedings of the Second (2015) ACM Conference on Learning @ Scale (Vancouver, BC, Canada, Mar. 2015), 67–74.
[12] Hooker, G. and Mentch, L. 2019. Please stop permuting features: An explanation and alternatives. arXiv preprint arXiv:1905.03151. (May 2019).
[13] Ihantola, P. et al. 2015. Educational data mining and learning analytics in programming: Literature review and case studies. Proceedings of the 2015 ITiCSE on Working Group Reports (Vilnius, Lithuania, Jul. 2015), 41–63.
[14] Joy, T.T., Rana, S., Gupta, S. and Venkatesh, S. 2016. Hyperparameter tuning for big data using Bayesian optimisation. 2016 23rd International Conference on Pattern Recognition (ICPR) (Dec. 2016), 2574–2579.
[15] Kai, S., Almeda, M.V., Baker, R.S., Heffernan, C. and Heffernan, N. 2018. Decision tree modeling of wheel-spinning and productive persistence in skill builders. Journal of Educational Data Mining. 10, 1 (Jun. 2018), 36–71. DOI:https://doi.org/10.5281/zenodo.3344810.
[16] Käser, T., Klingler, S. and Gross, M. 2016. When to stop? Towards universal instructional policies. Proceedings of the Sixth International Conference on Learning Analytics & Knowledge - LAK '16 (Edinburgh, United Kingdom, 2016), 289–298.
[17] Kench, D., Hazelhurst, S. and Otulaja, F. 2016. Grit and growth mindset among high school students in a computer programming project: A mixed methods study. ICT Education (Cham, 2016), 187–194.
[18] Krumm, A.E., Beattie, R., Takahashi, S., D'Angelo, C., Feng, M. and Cheng, B. 2016. Practical measurement and productive persistence: Strategies for using digital learning system data to drive improvement. Journal of Learning Analytics. 3, 2 (Sep. 2016), 116–138. DOI:https://doi.org/10/ggxwxt.
[19] Kühnel, J., Zacher, H., Bloom, J. de and Bledow, R. 2017. Take a break! Benefits of sleep and short breaks for daily work engagement. European Journal of Work and Organizational Psychology. 26, 4 (Jul. 2017), 481–491. DOI:https://doi.org/10/gfzk8b.
[20] Liu, Z., Zhi, R., Hicks, A. and Barnes, T. 2017. Understanding problem solving behavior of 6–8 graders in a debugging game. Computer Science Education. 27, 1 (Jan. 2017), 1–29. DOI:https://doi.org/10/gftxxk.
[21] Lundberg, S.M., Erion, G., Chen, H., DeGrave, A., Prutkin, J.M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N. and Lee, S.-I. 2020. From local explanations to global understanding with explainable AI for trees. Nature Machine Intelligence. 2, 1 (Jan. 2020), 56–67. DOI:https://doi.org/10/ggjtp4.
[22] Lundberg, S.M. and Lee, S.-I. 2017. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 30, (2017).
[23] Mahatanankoon, P. and Sikolia, D.W. 2017. Intention to remain in a computing program: Exploring the role of passion and grit. Twenty-third Americas Conference on Information Systems (2017).
[24] Matsuda, N., Chandrasekaran, S. and Stamper, J. 2016. How quickly can wheel spinning be detected? Proceedings of the 9th International Conference on Educational Data Mining (EDM 2016) (2016), 607–608.
[25] McDermott, R., Daniels, M. and Cajander, Å. 2015. Perseverance measures and attainment in first year computing science students. Proceedings of the 2015 ACM Conference on Innovation and Technology in Computer Science Education (Vilnius, Lithuania, Jun. 2015), 302–307.
[26] McGinley, L. 2011. Test performance and study breaks. Fort Hays State University.
[27] Owen, V.E., Roy, M.-H., Thai, K.P., Burnett, V., Jacobs, D., Keylor, E. and Baker, R.S. 2019. Detecting wheel-spinning and productive persistence in educational games. Proceedings of the 12th International Conference on Educational Data Mining (EDM 2019) (Jul. 2019), 378–383.
[28] Paquette, L., de Carvalho, A.M.J.A. and Baker, R.S. 2014. Towards understanding expert coding of student disengagement in online learning. Proceedings of the 36th Annual Cognitive Science Conference (2014), 1126–1131.
[29] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A. and Cournapeau, D. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 12, (2011), 2825–2830.
[30] Piech, C., Sahami, M., Koller, D., Cooper, S. and Blikstein, P. 2012. Modeling how students learn to program. Proceedings of the 43rd ACM Technical Symposium on Computer Science Education (New York, NY, USA, Feb. 2012), 153–160.
[31] PyCaret: An open source, low-code machine learning library in Python: 2020. https://www.pycaret.org.
[32] Rohrer, D., Dedrick, R.F. and Stershic, S. 2015. Interleaved practice improves mathematics learning. Journal of Educational Psychology. 107, 3 (2015), 900–908. DOI:https://doi.org/10/gf7dfp.
[33] Scikit-optimize: Sequential model-based optimization in Python: 2020. https://scikit-optimize.github.io/.
[34] Shute, V.J., D'Mello, S., Baker, R., Cho, K., Bosch, N., Ocumpaugh, J., Ventura, M. and Almeda, V. 2015. Modeling how incoming knowledge, persistence, affective states, and in-game progress influence student learning from an educational game. Computers & Education. 86, (Aug. 2015), 224–235. DOI:https://doi.org/10.1016/j.compedu.2015.08.001.
[35] Sigurdson, N. and Petersen, A. 2018. An exploration of grit in a CS1 context. Proceedings of the 18th Koli Calling International Conference on Computing Education Research (Koli, Finland, Nov. 2018).
[36] Steinborn, M.B. and Huestegge, L. 2016. A walk down the lane gives wings to your brain: Restorative benefits of rest breaks on cognition and self-control. Applied Cognitive Psychology. 30, 5 (2016), 795–805. DOI:https://doi.org/10/ghcrj3.
[37] Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T. and Zeileis, A. 2008. Conditional variable importance for random forests. BMC Bioinformatics. 9, 1 (Dec. 2008). DOI:https://doi.org/10/d7p3rw.
[38] Taylor, K. and Rohrer, D. 2010. The effects of interleaved practice. Applied Cognitive Psychology. 24, 6 (2010), 837–848. DOI:https://doi.org/10/fkm7mp.
[39] Wang, Y., Kai, S. and Baker, R.S. 2020. Early detection of wheel-spinning in ASSISTments. Artificial Intelligence in Education. I.I. Bittencourt, M. Cukurova, K. Muldner, R. Luckin, and E. Millán, eds. Springer International Publishing. 574–585.
[40] West, M., Herman, G. and Zilles, C. 2015. PrairieLearn: Mastery-based online problem solving with adaptive scoring and recommendations driven by machine learning. 2015 ASEE Annual Conference and Exposition Proceedings (Seattle, Washington, Jun. 2015), 26.1238.1–26.1238.14.
[41] Wolf, J.R. and Jia, R. 2015. The role of grit in predicting student performance in introductory programming courses: An exploratory study. SAIS 2015 Proceedings. 21, (2015).
[42] Zhang, C., Huang, Y., Wang, J., Lu, D., Fang, W., Fancsali, S., Holstein, K. and Aleven, V. 2019. Early detection of wheel spinning: Comparison across tutors, models, features, and operationalizations. Proceedings of the 12th International Conference on Educational Data Mining (EDM 2019) (2019), 468–473.