How Long is Enough? Predicting Student Outcomes with Same-Day Gameplay Data in an Educational Math Game

Rachel Harred (North Carolina State University, rlharred@ncsu.edu)
Christa Cody (North Carolina State University, cncody@ncsu.edu)
Mehak Maniktala (North Carolina State University, mmanikt@ncsu.edu)
Preya Shabrina (North Carolina State University, pshabri@ncsu.edu)
Tiffany Barnes (North Carolina State University, tmbarnes@ncsu.edu)
Collin Lynch (North Carolina State University, cflynch@ncsu.edu)

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Curriculum-integrated games can provide teachers with data to help them decide when and how to intervene with individual students. Based on our prior work observing teachers using ST Math, teachers may not be able to attend to a dashboard or student screens to determine who might need intervention. We therefore set out to determine how much data we need from the current ST Math gameplay session to predict performance. Based on the available log data, which tracks student performance over sets of puzzles, we performed two experiments to predict performance. The first uses data from one game level, which is about 3 minutes long, to predict the performance on the next level, and the second uses the first 6 minutes of gameplay to predict how many levels a student can complete in 20 minutes, a typical class length. Our results show that our data are not fine-grained enough to allow for paired level prediction, but that 6 minutes of gameplay can be used to rank students in order of performance for a class session. These results can be used as a basis for an alert system that could help teachers prioritize their time in the classroom.

1. INTRODUCTION
Educational games can be a useful tool for teachers to provide additional practical learning for students [3]. As more educational games become curriculum-integrated, a significant portion of a student's time can be spent in these systems. However, teachers cannot monitor and assist each student at the same time, and they struggle to identify the students who need help the most. In previous work, we observed that teachers' assistance was often influenced by things such as classroom layout and disruptive behavior rather than learner proficiency or needs [13]. Furthermore, that work identified that students who "struggled quietly" often went unnoticed. In other work, the authors found that when students possibly need intervention but do not receive it, they might get frustrated and give up or replay an easier game instead [9]. Other research has also shown that teachers can often unintentionally favor or give assistance to certain types of students due to differences in perceptions or help-seeking behaviors [4, 22, 5]. Therefore, providing teachers with information to help them determine who needs assistance the most may be crucial for some low-performing students.
Despite the amount of data gathered with each playthrough, teachers in our system are only provided with a student's current progress in the curriculum and a feature that allows a student to "raise" their hand through the system. However, this is only visible on the student's screen via a purple hand indicator and often goes unnoticed. Therefore, we sought to determine if there was a way to provide teachers with knowledge regarding students' projected progress as fast as possible, so that the teachers can determine who to help from there. With machine learning techniques that can process such data and help predict outcomes, we wanted to find the correct technique to answer our question. Machine learning and educational data mining techniques have been successfully used in educational game research for many years [11, 21, 10, 18].

In this paper, we tried to determine the smallest amount of time needed to predict student outcomes for one gameplay session by investigating multiple feature selection algorithms and prediction models on student gameplay data for an educational game, Spatial Temporal Math (ST Math). We tried two methods of prediction using data analysis and machine learning: 1) trying to predict student outcomes for playing one level of a game using gameplay data from only the previous level, and 2) using the least amount of time of a student's gameplay data to predict the number of levels they will pass in the next twenty minutes of gameplay. To accomplish this, we tried various machine learning and feature selection methods to find the most significant features needed to predict student outcomes in this educational game. In this study, our intention was to give insight to the teachers of ST Math by indicating our best guess for which students could most benefit from teacher intervention on a single day, provided early enough in the gameplay session to allow the teacher to help as many students as possible.

1.1 Spatial Temporal Math (ST Math)
ST Math is a curriculum-integrated supplemental mathematics game for 2nd-4th-grade students that uses spatial puzzles to teach basic math concepts [19, 12, 14, 15]. The puzzles do not contain any textual instruction. The games are grouped at the highest level by objective, which indicates the broad math concept. Each objective contains a number of games; the gameplay under an objective varies but concerns the same content. The games usually have between 3 to 5 levels each, and the gameplay across levels is similar but increases in difficulty. There are usually between 6 and 8 puzzles per level. The puzzles are either randomly generated using a template or randomly selected from pre-designed puzzles, depending on the level. Each puzzle requires the student to perform the correct action to indicate their answer. Animated feedback is presented to the student following the puzzle-solving attempt that shows the student if they are correct or incorrect. For example, in the game "Fair Sharing" under "Division Concepts", a student is asked to distribute boxes equally among animals to construct a straight bridge; in case of an incorrect answer, the bridge is shown blocked off or with gaps that make it impossible to cross. A level begins with a set number of lives, usually 2, that resets at the beginning of each level. If a student's response is incorrect, the student loses a life. If the student loses all of their lives before completing the level, they do not pass the level and must retry it.
To pass a level, the student must complete all puzzles without losing all their lives. After a student passes a level, they may move on to the next level in the game or objective, or backtrack and play a previously passed level. We refer to this backtracking as replay. A level attempt includes passing a level, failing a level, and replay. Each student has the option to "raise their hand" if they want help from their teacher by clicking on a hand icon on the screen. The teacher also has access to see which objectives and levels a student has passed. See Figure 1 for a breakdown of ST Math.

Figure 1: ST Math

2. RELATED WORK
Educational games in classrooms are helpful to teachers because the students can receive individualized attention and learning from the game while the teacher gives one-on-one attention to students who need it [7]. However, teachers have limited time and try to prioritize their attention to the students who need it most. It has been shown that students who are given attention by teachers have increased learning [5]; therefore, teachers who are able to focus their attention on low-performing students should see them benefit.

Unfortunately, there are many reasons that students who need help do not receive it. One study found that middle-class students seek help more directly than working-class students and end up getting more help as a result [4]. In a recent study of classroom observations using ST Math, researchers found that differences in classroom format have an influence on who receives help [13]. Furthermore, this work found that teachers in free-seating classrooms could not easily see the raised hand indicator on student screens, and so those help-seeking students went unnoticed. In classrooms that use rotation seating, teacher attention was only given if the ST Math group was being disruptive. In general, the students who directly asked for help or were obviously off-task received more teacher intervention than the students who were less vocal [13].

The task of automatically identifying students who need help has been explored [2]. With machine learning methods that can process large amounts of data and make predictions, many have been using these methods in an attempt to solve the problem. Ahadi et al. [2] explored machine learning techniques and were able to use the first week of data in a programming course to predict student performance with accuracy ranging from 71-80%. Additionally, Jiang et al. employed logistic regression models to predict the type of certificate a learner received in Massive Open Online Courses (MOOCs) [8]. In a study at the Open University, decision tree models were implemented on users' current and previous activity to predict if they were at risk of failing a module [24]. Another Open University study explored using Bayesian models to build real-time predictive models from student data and found little difference among the types of models, but that the accuracy increased with the addition of data throughout the progression of the learning module [23]. These studies show that machine learning can be used to predict and understand student behavior, but these predictions are not being used to directly aid student learning.

Predictive models are now being integrated into teacher dashboards or alert systems to enable this aid. In a survey of K-12 teachers who used intelligent tutoring systems in class, Holstein et al. found that they were very interested in having real-time classroom monitoring tools that would help them decide which students most needed their attention [7]. In another study, Holstein et al. developed a teacher alert system using smart glasses that showed real-time indicators of student behaviors floating above their heads [6]. Their early findings suggest that this helped direct the teachers to the students who needed intervention the most [7].
3. DATA
The data were collected by MIND Research Institute, the creators of ST Math. This study was conducted on data from 3rd grade students who played ST Math during the 2016-2017 school year. The data contain 31 objectives, 154 games, and 669 levels, which equals 5,186,269 total level attempts by 8,983 students from 111 schools and 636 teachers. We excluded students who completed objectives not contained in the 3rd grade objectives, which removed 21,544 level attempts. These are students who might have been erroneously included from other grades, as the system is used for grades 3-5. For the purposes of our study, we also filtered out level attempts where the student completed only one level attempt within our 30-minute gameplay session cutoff. This removed 101,849 level attempts, leaving us with a final dataset of 5,062,876 unique level attempts. The initial data we were provided with had 6 features for each level attempt: STMathID (unique ID for each student), Level, Objective Code, Timestamp, Number of Correct Puzzles, and Total Number of Puzzles. For the purpose of our analysis, we created the additional features shown in Table 1.

Table 1: Features Created for the Analysis
  game - The name of the game this level is in, extrapolated from the Objective Code and Level
  performance - Number of puzzles correct before losing all lives divided by total puzzles in level
  isReplay - 1 if this level has been attempted before, otherwise 0
  isUnneceReplay - 1 if the level is unnecessary replay, otherwise 0
  prevPassAttemptSameLevel - For replay: 1 if it is on the same level played in the previous row, 0 if not, NA if not replay
  passedCurrentLevelB4Unnece - 1 if the student passed the level in the previous row (their "current" level) AND this row is replay, 0 if not, NA if not replay
  sameObj - 1 if the current level and the previous one are in the same objective
  levelPlayTimeSec - Timestamp minus the previous row's timestamp
  gameplaySession - Numbered starting at 1 and incrementing if levelPlayTimeSec > 30 minutes (1800 s); rows with the same numbered gameplaySession are assumed to happen during a single "play session"
  prevPerf - Average of the previous gameplay session's performance
  firstInGameplaySession - 1 if this level attempt is the first in a new gameplay session, otherwise 0
  lastInGameplaySession - 1 if this level attempt is the last in a gameplay session, otherwise 0

4. METHOD
We wanted to explore different ways of providing teachers with information about student progress, so that they could intervene and monitor progress according to their own classroom goals. Therefore, our intention is to predict the projected progress of a student using the least amount of information and using only same-day gameplay data. Here, we compare two methods of data segmentation: predicting the time spent passing the next level based on the time spent passing the current level, and using the least amount of gameplay to predict how many levels would be passed in the next 20 minutes of gameplay. We are attempting to see if a student's performance at the beginning of a gameplay session is a good prediction of their later performance. A gameplay session is defined as subsequent level attempts that are separated by less than 30 minutes. If two level attempts are separated by 30 minutes or longer, we count the second attempt as a new gameplay session. We decided on a 30-minute cutoff because a pause in the game of 30 minutes or more might indicate that the student was working on something else in between, and we cannot say that the previously played level will have any effect on the performance of the next level. However, we still want to include students who may be truly struggling, receiving help from a teacher, or giving help to another student during the playthrough of a level. Only 3.5% of the data had a gap between play sessions of between 30 minutes and 20 hours, while 11.6% of the data had a gap of 20 hours or longer between gameplay sessions.
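To make the session segmentation concrete, the sketch below derives levelPlayTimeSec and gameplaySession (as defined in Table 1) from a toy attempt log with pandas. The column layout and sample values are illustrative assumptions, not the actual ST Math log schema.

```python
import pandas as pd

# Toy attempt log: one row per level attempt; column names mirror Table 1,
# but the layout and values are assumptions for this sketch only.
log = pd.DataFrame({
    "STMathID": [1, 1, 1, 1, 2, 2],
    "Timestamp": pd.to_datetime([
        "2017-01-09 09:00", "2017-01-09 09:03", "2017-01-09 09:07",
        "2017-01-10 09:00", "2017-01-09 10:00", "2017-01-09 10:04",
    ]),
})

CUTOFF_SEC = 30 * 60  # a gap of 30 minutes or more starts a new gameplay session

log = log.sort_values(["STMathID", "Timestamp"])

# levelPlayTimeSec: current timestamp minus the previous row's timestamp, per student
log["levelPlayTimeSec"] = log.groupby("STMathID")["Timestamp"].diff().dt.total_seconds()

# gameplaySession: starts at 1 for each student and increments whenever the
# gap reaches the cutoff (a student's first attempt opens session 1)
new_session = (log["levelPlayTimeSec"] >= CUTOFF_SEC) | log["levelPlayTimeSec"].isna()
log["gameplaySession"] = new_session.groupby(log["STMathID"]).cumsum()

print(log)
```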
5. EXPERIMENT 1: PAIRWISE PREDICTION
This section details our attempt at using previous level data to predict student outcomes for the next level attempt.

5.1 Pairwise Method
We wanted to see if we could predict how well a student will do on the next level by using only the data from the previous level in the prediction. Our aim in this experiment was to investigate whether the features of a single level for a student can predict whether the student would have needed an intervention for the next level. This would be beneficial to teachers because it would give an alert immediately after a student finishes a level, telling them that the student might need help on the next level. We used pairwise prediction, so the data were grouped into Level A and Level B, with the constraint that Level A and Level B had to be in the same objective and the same gameplay session. We considered Level A to be the first attempt on a new level, all subsequent attempts (retries) until the level was passed, and any replayed levels that happened before or after the level was passed. Any gameplay that happened between the first attempts of two consecutive levels would be counted as Level A data. Level B is the next attempt on a new level after Level A, and contains the same information as Level A: number of attempts, retries, and replays before and after passing, up until the next new level attempt. We expected the number of attempts to pass a level to provide information about how many attempts a student will need for the next level, because the levels increase in difficulty inside objectives and this is consistent throughout ST Math. Also, research has shown that replay that happens before passing a level has a negative effect on performance, while replay that happens after passing a level has been shown to have a positive effect [12]; therefore, we expected this data to also be useful for the prediction.

Due to time constraints and the complexity of the feature creation, we used a subset of our total dataset for this analysis. The dataset includes 830 students and 665 unique objective-level pairs. Objective-level pairs are level pairs within an objective. However, some students did not complete all the objective-levels, resulting in a total of 277,975 unique student objective-level pairs.
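As a rough illustration of the pairing step described above, the following sketch joins each aggregated level block (Level A) with the next one for the same student (Level B) and keeps only pairs that share an objective and a gameplay session. The frame and its column names are hypothetical stand-ins for the engineered features, not the study's exact data.

```python
import pandas as pd

# One row per aggregated "level" block (first attempt plus its retries/replays),
# already ordered by time within each student. Columns are illustrative.
levels = pd.DataFrame({
    "STMathID":        [1, 1, 1, 1],
    "gameplaySession": [1, 1, 1, 2],
    "objective":       ["DIV", "DIV", "FRAC", "FRAC"],
    "level":           [1, 2, 1, 2],
    "avgTimePerLevel": [180.0, 240.0, 150.0, 210.0],
    "avgPerfTotal":    [1.0, 0.83, 1.0, 0.67],
})

# Pair each row (Level A) with the next row for the same student (Level B).
pairs = levels.join(levels.groupby("STMathID").shift(-1).add_suffix("_B"))

# Keep only pairs inside the same objective and the same gameplay session.
pairs = pairs[
    (pairs["objective"] == pairs["objective_B"])
    & (pairs["gameplaySession"] == pairs["gameplaySession_B"])
]

print(pairs[["STMathID", "level", "level_B", "avgTimePerLevel", "avgTimePerLevel_B"]])
```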
5.1.1 Pairwise Feature Generation
Since the raw data only included gameplay aspects per attempt, such as time taken, attempt performance, and the kind of attempt (retries, replay, etc.), we engineered 33 additional features for every student objective-level pair. There are 7 attempt categories and 5 metrics per category. The attempt categories are as follows: overall level attempts, total retry and replay attempts, retry attempts, total replay attempts, total replay attempts before passing the current level, total replay attempts after passing, replay attempts of the same level (current) after passing, and replay attempts of other levels after passing. The 5 metrics for each category are as follows: whether an attempt category occurred (except for the overall attempts category), total number of attempts (except for the overall attempts category), total time, average time, and average performance.

Next, for each student, we identified the consecutive levels (Level A, Level B) within each objective that were completed in the same session. We found a total of 222,258 such level pairs for 830 students. We explored four different ways to define an intervention: if the total time was greater than the 75th percentile (I-TotalTime), if the average time was greater than the 75th percentile (I-AvgTime), if the average performance was less than the 25th percentile (I-AvgPerf), and if the student could not finish a level in the first attempt (I-FirstAttempt). Each of these intervention types was intended to capture a different aspect of a student's ability to complete a level.
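A minimal sketch of how the four intervention targets could be computed from Level B summaries follows. The column names are hypothetical and the data are synthetic; the thresholds follow the 75th/25th percentile definitions above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical Level B summary for each (Level A, Level B) pair.
pairs = pd.DataFrame({
    "totalTime_B":     rng.exponential(300, 1000),   # seconds spent on Level B
    "avgTime_B":       rng.exponential(200, 1000),
    "avgPerf_B":       rng.uniform(0.5, 1.0, 1000),
    "firstAttempts_B": rng.integers(1, 4, 1000),     # attempts until the level was passed
})

# Four binary targets, one per intervention definition in Section 5.1.1.
pairs["I_TotalTime"]    = (pairs["totalTime_B"] > pairs["totalTime_B"].quantile(0.75)).astype(int)
pairs["I_AvgTime"]      = (pairs["avgTime_B"]   > pairs["avgTime_B"].quantile(0.75)).astype(int)
pairs["I_AvgPerf"]      = (pairs["avgPerf_B"]   < pairs["avgPerf_B"].quantile(0.25)).astype(int)
pairs["I_FirstAttempt"] = (pairs["firstAttempts_B"] > 1).astype(int)  # failed the first attempt

print(pairs[["I_TotalTime", "I_AvgTime", "I_AvgPerf", "I_FirstAttempt"]].mean())
```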
5.1.2 Pairwise Feature Selection and Prediction Models
The analysis was carried out in Python. We normalized the time-related features and then explored three feature selection techniques in the scikit-learn [16] package. We used a pipeline of a feature selection wrapper method, SelectFromModel, with models such as LinearSVC (Linear Support Vector Classifier with L1 loss), LassoCV (Lasso linear model with 3-fold cross validation), and Logistic Regression. We used seven classifiers: KNN (n = 3), LinearSVC (Linear Support Vector Classifier), Decision Tree (using the Gini index), Random Forest (using the Gini index), MLP (a multi-layer perceptron classifier with a 2-layer (100, 100) neural network using a learning rate of 0.001 and the ReLU activation function), AdaBoost, and Naive Bayes, and we measured the prediction accuracy using 10-fold cross validation.
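The sketch below wires together the pieces named in this subsection: SelectFromModel wrapping a LinearSVC (interpreting the "L1 loss" above as an L1 penalty, which is what scikit-learn's LinearSVC exposes) feeding a Random Forest, scored with 10-fold cross validation. The data are synthetic stand-ins and the hyperparameters are placeholders rather than the study's exact settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the normalized Level A feature matrix and one
# intervention target (e.g., I-TotalTime); shapes only, not real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 33))          # 33 engineered Level A features
y = (X[:, 0] + rng.normal(size=2000) > 1).astype(int)

pipeline = make_pipeline(
    StandardScaler(),                                               # normalize time features
    SelectFromModel(LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000)),
    RandomForestClassifier(criterion="gini", n_estimators=100, random_state=0),
)

# 10-fold cross-validated accuracy, as reported in Table 3.
scores = cross_val_score(pipeline, X, y, cv=10)
print(f"mean accuracy: {scores.mean():.3f}")
```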
5.2 Pairwise Results & Discussion
The feature selection based on LinearSVC with L1 loss and the Random Forest classifier provided the best prediction accuracy. Table 3 shows the results of using a Random Forest classifier for each intervention type. We observed that the average time spent on Level A was selected for every intervention-type target for Level B. The distribution parameters of the features selected are shown in Table 2. We observed that very few features related to replays were selected. This may be because the consecutive level pair dataset recorded very few rows with retries or replays (18.20%), and even fewer when considering only replays (0.12%). Such a high degree of sparsity in replay made the replay-related features not significant enough to contribute towards the predictions. Another interesting observation is that the performance over all the attempts (avgPerfTotal) in Level A was not a significant predictor of the intervention for Level B in any of the models, primarily because of the small variance recorded for this feature. The small variance is due to the granularity of the data, which records passed level attempts with failed puzzles as 100% performance. On the other hand, the average time in Level A (avgTimePerLevel) was a significant predictor of the intervention for Level B for every intervention-type target. We recorded few replays in general and a low variance in the performance-related features, so only the time-related features were varied enough to capture the relationship between Level A and Level B. The results suggest that the average time spent on an attempt in a level is the most significant predictor of whether a student may need assistance in the next level. However, the classifier models for each intervention type did not perform significantly better than a baseline classifier that would predict all observations to be the majority class (the class containing more students), as shown in Table 3.

Table 2: Features Selected using Linear SVC (with L1 loss) for Pairwise Prediction
  Feature                Description                               Mean (mode for binary)   SD (occurrence for binary)
  retryOrReplayBinary    1 if retryAndReplaySum > 0, otherwise 0   0                        0: 181,804; 1: 40,454
  replayBinary           1 if replaySum > 0, otherwise 0           0                        0: 221,981; 1: 277
  retryAndReplayTimeSum  Total time spent on retries and replays   65.78                    200.18
  avgTimePerLevel        Average overall time per attempt          225.49                   175.23
  avgPerfTotal           Average overall performance               0.94                     0.16
  avgRetryAndReplayPerf  Average retry and replay performance      0.16                     0.35
  avgRetryPerf           Average retry performance                 0.16                     0.35

Table 3: Features Selected based on Linear SVC with L1 loss and Prediction Accuracy with Significant Predictors using a Random Forest classifier for Each Intervention Type
  Target          Features Selected                                                    Significant Predictor(s)                            Majority Class   Prediction Accuracy (K=10)
  I-TotalTime     avgTimePerLevel, avgPerfTotal, avgRetryAndReplayPerf, avgRetryPerf   avgTimePerLevel: 0.7; retryAndReplayTimeSum: 0.17   76.64%           77.20%
  I-AvgTime       avgTimePerLevel, avgPerfTotal, retryOrReplayBinary                   avgTimePerLevel: 0.93                               76.18%           77.21%
  I-AvgPerf       avgTimePerLevel, avgPerfTotal, retryAndReplayTimeSum                 avgTimePerLevel: 0.98                               99.59%           99.60%
  I-FirstAttempt  avgTimePerLevel, retryOrReplayBinary, replayBinary                   avgTimePerLevel: 1.00                               75.32%           76.49%

This suggests that the relation between the behavior of students in two consecutive levels may be highly varied and that it is difficult to generalize whether an intervention is needed in a level based on only one previous level. It may also suggest that such a prediction may be dependent on how far along students are in their academic year. To investigate the first scenario, we added a feature for a student's previous performance average, an average of every level attempt up to that point, in an attempt to help distinguish low-performing students from the rest. The previous performance average was selected for each intervention type prediction but had low feature importance (0.03%) because of its low variance and, therefore, did not affect the prediction accuracy. To investigate whether the time of the academic year had any impact on the prediction, we added a feature to indicate the month in which the sessions occurred. Similar to the previous performance average, this feature was selected but had a low feature importance (0.05%), leading to an insignificant difference in the prediction accuracy.

Since only the time-related features were varied enough to capture the variance in student behavior across two consecutive level pairs, we explored ways other than feature generation to perform the pairwise prediction. We sliced the data based on aspects such as replay type or month of the year but, again, obtained similar prediction accuracies; however, the replay-related features did get selected and had high importance for the prediction in the data sliced by replay type. To investigate whether the variance in the content of the objectives may be affecting the prediction results, we performed the prediction for each intervention type within a single objective and observed that the prediction accuracy decreased slightly. This suggests that even within one objective, the behavior of a student in one level, as captured at its current granularity, may not accurately predict if they need an intervention in the next level.

The pairwise prediction models may not have generated desirable results because we may need more than just one previous level's data to predict if an intervention is needed. There is not sufficient data about each level in this dataset to accurately represent the student's performance and to create good predictions. Therefore, having more fine-grained details about level attempts, including knowing more about how the levels compare to each other within each objective, may improve the prediction accuracy.

6. EXPERIMENT 2: LEVELS COMPLETED IN 20 MINUTES
We chose to determine if we could predict the number of levels completed in 20 minutes using only information from the current session. Using only information from the current session will allow easier integration with the current system with minimal changes needed. Different schools and classrooms have unique ways of using ST Math [13]. As a result, there is a variety of session times, ranging from very short (less than 5 min) to sessions lasting over an hour, with an average of 23 minutes spent in a session. Therefore, we decided to predict how many levels a student would complete in a 20-minute session. This information could be used by teachers to identify students who will not be able to complete the number of levels the teacher expects for that session, and the teacher can intervene to assist or encourage. With this prediction, the system could provide a teacher with each student's predictions and order the students by the lowest predicted number of levels to complete in the next 20 minutes. Then, the teacher can easily look at the slowest students, make a judgment based on their knowledge of each student and the goals the teacher has for that lesson, and determine who they need to assist. Studies have shown that teachers may focus assistance on students with better help-seeking behaviors because they are often more persistent or better at requesting help [4, 20]. Providing this information early in the session could be crucial for low-performing students who are not asking for help or are not doing so effectively.

6.1 Levels Completed Prediction: Methods
The data we used consisted of 787,949 session observations from 8,978 unique students, 111 schools, and 636 teachers. A session represents a period of time that the student spends working on ST Math without taking longer than a 30-minute break (see Section 4 for the full definition). For accurate predictions, we chose to use the first 6 minutes of gameplay due to the average level attempt taking approximately 3 minutes. We refer to this segment of data used for prediction as a time "slice". Since our goal was to use the least amount of information to do the prediction, we wanted this time to be as short as possible. We initially attempted to use shorter time slices, but because a level attempt takes 3 minutes on average, this did not provide a sufficient amount of data to represent the students' gameplay behavior and, in some cases, eliminated slower students' data for that time slice.
Next, we removed sessions under 10 minutes (242,750 obs., 169,863 obs. under 6 min) and sessions over 75 minutes (10,257 obs.). We chose these cutoffs to eliminate short sessions where predictions would not be useful and long sessions, in some cases over 4 hours, that were likely anomalous.

Table 4 shows the statistics for the session and slice features. The average session time was 28 minutes and the average slice length was 4.5 minutes.

Table 4: Session and Time Slice Stats for the time, performance, and levels completed
           Feature             Mean (SD)      Mdn
  Session  Levels per 20 min   3.8 (2.5)      3.5
           Total Time (min)    28.9 (26.3)    816.2
           Avg Performance     74.9% (23.4)   81.3%
  Slice    Levels Completed    1.9 (1.1)      1
           Total Time (min)    4.5 (0.9)      4.6
           Avg Performance     72.6% (34.0)   100%

6.1.1 Levels Completed Prediction: Feature Generation
From the level data, the data were segmented into the first 6-minute time slice for prediction. Features were aggregated from this 6-minute time slice to capture what each student was able to do, such as complete a level, fail a level, retry a level, or engage in replay. The features generated are based on performance, time, level attempt/replay features, and the objective the student was in within that slice. The performance features were: average performance (numeric), percentage of levels passed out of all attempts (numeric), and percentage of levels completed out of all attempts (numeric). The time features from the slice were: the total time (numeric), the average level time (numeric), number of passed levels per time (numeric, scaled), number of completed levels per time (numeric, scaled), and the month of the session. The level attempt features were: total number of replays (numeric), total number of levels failed (numeric), total number of levels passed (numeric), total number of levels completed (numeric), total puzzles attempted (numeric), total puzzles completed (numeric), whether they engaged in replay (binary), and whether they re-attempted a level (binary). Then, there were 31 binary features representing the objective the student was playing when the session started.
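One way the slice features could be aggregated from the attempt-level data is sketched below: attempts falling in the first six minutes of a session are grouped per student and session, and counts, times, and performance are summarized. The column names and toy rows are assumptions for illustration and cover only a subset of the feature list above.

```python
import pandas as pd

SLICE_MIN = 6  # first six minutes of each gameplay session

# Illustrative attempt log for a few sessions; columns are assumptions.
attempts = pd.DataFrame({
    "STMathID":       [1, 1, 1, 2, 2],
    "session":        [1, 1, 1, 1, 1],
    "minIntoSession": [1.5, 4.0, 5.5, 2.0, 5.0],  # minutes since session start
    "levelTimeSec":   [150, 160, 90, 200, 180],
    "passed":         [1, 0, 1, 1, 1],
    "performance":    [1.0, 0.6, 1.0, 0.83, 1.0],
})

in_slice = attempts[attempts["minIntoSession"] <= SLICE_MIN]

slice_features = in_slice.groupby(["STMathID", "session"]).agg(
    levelsPassed=("passed", "sum"),
    levelsAttempted=("passed", "size"),
    avgPerformance=("performance", "mean"),
    totalTimeSec=("levelTimeSec", "sum"),
    avgLevelTimeSec=("levelTimeSec", "mean"),
)

# Rate feature scaled by the observed slice time (levels passed per minute).
slice_features["passedPerMin"] = (
    slice_features["levelsPassed"] / (slice_features["totalTimeSec"] / 60)
)

print(slice_features)
```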
6.1.2 Levels Completed Prediction: Model Selection and Feature Selection
For this prediction, we tried a variety of models, tuning of these models, and alteration of the target variable. Model and feature selection were accomplished using scikit-learn [16].

For the models, we tried both classification and regression. For classification, multiple groupings of the number of completed levels in 20 minutes were chosen using balanced classes, with the best accuracy (77%, using a 2-layer neural network) coming from a split on whether a student could complete at least an average number of levels in 20 minutes. However, we decided that regression, providing finer-grained predictions, would provide more useful information to the teachers and allow them to have more autonomy in deciding which students need help. For regression, we tried to predict how many levels a student could complete in 20 minutes, which was derived by taking the total number of levels completed and the total time of the session and scaling.

We tried multiple models, including decision trees, neural networks, and random forests, after normalizing the features. Interpretable machine learning methods are more important here because knowing which features are more influential in predicting performance can give insights into how students learn in games. In the results, we show the 2 best models compared to a baseline. The baseline model is created by always predicting the mean of the completed levels per 20 minutes. The best models were created by testing multiple models and fine-tuning the parameters. The two best models are created from scikit-learn: a 3-layer (50, 30, 20) neural network (MLPRegressor) using a learning rate of 0.001 and the ReLU activation function, and a Random Forest (RandomForestRegressor) using mean squared error as the criterion function and setting the minimum samples for a split to 20.

To evaluate each model, we used the following metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), adjusted R-squared score, and explained variance (EV). We chose these metrics to evaluate, on average, how accurate each prediction was, to determine if the error is small enough to still provide a good estimate of the students' projected progress. Both R2 and EV were used to evaluate the variance of these errors and check for biases within the models. All models were evaluated with 10-fold cross validation.

We attempted feature selection using scikit-learn filter methods, such as feature importance from tree regression, and a wrapper method (SelectFromModel) with each model. Feature selection did not improve any models and resulted in significantly worse predictions in most cases. This is most likely due to the limited number of features available. Because our data are not fine-grained, we have a limited amount of information about each student for a level attempt. This indicates that each feature could be providing key information regarding their current progress. Therefore, for the best models the whole feature set was used (see Feature Generation).
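A sketch of the target scaling and of the two regressors named above follows, evaluated with MAE, RMSE, R-squared (the plain rather than adjusted variant, for brevity), and explained variance under 10-fold cross validation. The data are synthetic and settings such as the number of trees are placeholders, not the study's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the slice features and the session outcomes.
X = rng.normal(size=(3000, 20))
levels_completed = np.clip(rng.poisson(4, 3000) + X[:, 0], 0, None)
session_minutes = rng.uniform(10, 75, 3000)

# Target: levels completed, rescaled to a common 20-minute window.
y = levels_completed / session_minutes * 20

models = {
    "MLP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(50, 30, 20), activation="relu",
                     learning_rate_init=0.001, max_iter=500, random_state=0),
    ),
    "Forest": RandomForestRegressor(min_samples_split=20, random_state=0),
}

for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=10)
    print(name,
          f"MAE={mean_absolute_error(y, pred):.2f}",
          f"RMSE={mean_squared_error(y, pred) ** 0.5:.2f}",
          f"R2={r2_score(y, pred):.2f}",
          f"EV={explained_variance_score(y, pred):.2f}")
```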
6.2 Levels Completed Prediction: Results & Discussion
This section discusses the results of the Experiment 2 regressions.

Table 5 shows the results of the evaluation metrics for the two best regression models compared to the baseline model. The NN and the Random Forest both perform similarly, both outperforming the baseline model. Although the MAE does not have a large difference, the RMSE is much lower. This indicates that the variance of the errors is significantly smaller for our predictive models. The MAE of 1.2 for our predictive models means that, on average, the prediction will only be around 1 level off for a specific student, which still provides a good estimation for the teacher to use. The adjusted R-squared and explained variance are almost identical for both models, which happens when the mean of the errors approaches zero. Although these scores are not perfect, in the context of educational data from a system used with multiple teaching styles, this is a highly meaningful result [1, 17].

Table 5: Results for the 2 best regression models compared to a baseline (mean)
  Model      MAE     RMSE    R2        EV
  Forest     1.24    1.59    0.58      0.58
  NN         1.22    1.58    0.59      0.59
  Baseline   1.96    2.48    -1.2E-5   0.0

Table 6 shows the top five most important features for decisions in the Random Forest. All of these top features focus on the number of completed levels, the total time, or a combination of these features. This is not surprising because the number of levels a student can complete in the first 6 minutes should be a good indication of how they will perform over the whole session. However, this assumes that the students remain seated and playing the game in the same manner.

Table 6: The top five most important features from the Random Forest
  Rank   Feature
  1      Total Levels Completed
  2      % of Levels Completed
  3      Completed Levels per Time
  4      Total Time
  5      Average Level Time

Figure 2: These density plots show the predicted vs. actual values for the two best regression models to predict the number of completed levels in 20 minutes. Note: the yellow/lightest areas represent the highest density of points.

Figure 2 shows a density plot of the predicted values versus the actual values, with the yellow/lightest color being the highest density. Both figures show that the highest density areas occur close to the actual values. The Neural Network appears to have a higher density closer to the line and the points appear to be more compact, although both models show similar predictions. Both figures are zoomed in to focus on the lower level-number predictions, although few points have values higher than 10. One note is that both models are less well fitted for the higher values and tend to predict around 10 after the actual value is 10+. However, we are mostly concerned with students who are completing very few levels. If a student falls into the 10+ range of levels completed, the actual value becomes less important due to how far above the average it is. A teacher will still be able to use this information to identify over-performing students and ensure they don't get too far ahead of the class.

With the variability of how this system is used, the models' evaluations are a positive result. For example, during field observations of the system, we found many teachers asked students who were ahead in the curriculum to help the students next to them during a gameplay session. Therefore, a student may spend part of the session working as normal, and then, after the teacher has identified a struggling student, the teacher may ask the student next to them to help. This could result in much higher predicted values than what the student actually completes. Furthermore, we observed students in some classrooms initially talking and working at a slower pace in the first few minutes of a session as they settled in, then shortly being asked to focus. This could result in much lower predictions of the projected number of levels that student could complete. Since the data do not only include sessions where students quietly work by themselves for a continuous period of time, accurate predictions are difficult.

Furthermore, the data we used focused only on same-day gameplay data, not containing any information regarding how a student has previously performed in other sessions. This decision was made to limit the changes required to implement this system in the game. However, including prior information may improve predictions. One possible way to control for the effect of the different teaching styles is to include teacher or school information in the model. However, this would create very sparse features due to the large number of teachers and schools that use the system. A future attempt could identify and categorize the teachers or schools based on similar styles and add those features to the models.

This prediction can be used in two main ways: identifying the lowest performing students who may need assistance but may not be requesting help, and identifying students who may be working too fast and getting ahead in the curriculum. The second usage may not seem like an issue, but having a large knowledge gap between students may make a classroom harder to manage and teach. This is a problem teachers seek to avoid in ST Math, and one they have remedied by asking those students to help others or by allowing them to play games while others catch up [13]. For ease of use, these predictions could be provided in a simple list with each student's name and the predicted number of levels they will complete in 20 minutes. Furthermore, the top 5 lowest and highest predictions could be presented at the top of the interface so teachers could quickly have an idea of who is struggling and who may need to be slowed down. Because teachers already have access to where each student is in the curriculum, the teacher can use their expertise and knowledge of the students to make judgement calls on what to do from there. A mock interface of how this could be presented can be seen in Figure 3.

Figure 3: Mock interface showing how teachers would view students' predictions
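As a sketch of the proposed teacher-facing list, the snippet below sorts hypothetical per-student predictions in ascending order and pulls out the five lowest and five highest, mirroring the mock interface idea in Figure 3. All names and numbers are invented.

```python
import pandas as pd

# Hypothetical per-student predictions for the next 20 minutes.
predictions = pd.DataFrame({
    "student": [f"Student {i}" for i in range(1, 13)],
    "predicted_levels_20min": [1.1, 4.8, 2.3, 6.9, 3.4, 0.8, 5.2, 2.9, 7.5, 3.1, 1.9, 4.1],
})

ranked = predictions.sort_values("predicted_levels_20min").reset_index(drop=True)

print("May need help (lowest predicted progress):")
print(ranked.head(5).to_string(index=False))

print("\nMay be getting too far ahead (highest predicted progress):")
print(ranked.tail(5).to_string(index=False))
```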
7. OVERALL DISCUSSION
The results for the levels completed experiment were more promising than the pairwise experiment. For the pairwise prediction, the lack of fine-grained puzzle-level data made it difficult to predict whether a student may need intervention based only on their previous level's data. We believe the results for this method of pairwise prediction might improve with more data about how the objectives, games, and levels relate to each other. On the other hand, the prediction model from the levels completed experiment had decent results, with the MAE and RMSE indicating that the predictions are generally within 1-2 levels of the actual completed levels for the 20-minute time period. Having additional information, including finer-grained puzzle-level data, should also improve this prediction.

Providing the teachers with a projected completed amount of levels allows us to give the teachers a list of the students ranked by the number of levels they are predicted to complete. This allows the teachers to use their expertise to distinguish the higher- and lower-performing students during that game session, and, importantly, the teachers have the ability to make judgments about interventions according to their discretion. Currently, the teachers only have information on student progress in the overall game curriculum (which objectives each student has finished and how many levels have been completed). Additionally, the only method currently used to support students in seeking help is the raised hand indicator, which has been shown to not always get the teachers' attention due to its location on the students' screens. We believe that incorporating this prediction into the system will be a valuable tool for teachers that will suggest which students are struggling and allow them to decide if they need intervention. Giving teachers these suggestions after only 6 minutes of gameplay time means that the teachers will have more control over the classroom progress, because they will have more time to help students get back on track instead of being behind for the entire session, and they will be able to slow down students who are getting too far ahead of the class.

7.1 Limitations
To reduce the amount of time spent processing the data, we used a representative subset for the pairwise prediction. However, we compared multiple numerical and categorical features between this subset and the entire dataset and determined that it contained almost identical distributions of data points. We created histograms for the distributions of performance, level play time, levels in session, time of session, and performance per session, and compared the number of schools and teachers represented in the subset to the totals. We were only missing 6 out of 111 schools, and we had students from almost half of the teachers (291 out of 636) included in our subset.

We do not have fine-grained interaction data, which means we cannot tell exactly how many puzzles a student gets wrong. This lack of information causes our data to be skewed by having many performance scores of 100%, without capturing the full gameplay. However, there are other features that we can use to tease out this information, like level time, as students who pass a level while also getting puzzles wrong will most likely take longer because they are doing more problems. We have finer-grained puzzle-level data, but it does not match up accurately with our level data. This means that while we can do studies on these datasets separately, we cannot combine them to have the full picture of what a student is doing during the level: which puzzles they see, if any puzzles are repeated during a level, how many puzzles they get right and wrong, and the time spent on each individual puzzle in a level. These finer granularities could offer valuable information on what a student is doing during a level and their performance compared to the whole student set.
CONCLUSION tions are generally within 1-2 levels of the actual completed This study aimed to use the least amount of student game- levels for the 20-minute time period. Having additional in- play data possible to predict which students would benefit formation, including finer-grained puzzle-level data, should from teacher intervention during the remainder of the game- also improve this prediction. play session. We tried two granularities of prediction for our analysis. We hypothesized that we could use one level’s Providing the teachers with a projected completed amount data (average of 3.5 minutes of gameplay) to predict the of levels allows us to give the teachers a list of the stu- next level’s outcomes, as this controls for content and dif- dents ranked by the number of levels they are predicted to ficulty, but this hypothesis was not confirmed. The lack of complete. This allows the teachers to use their expertise to fine-grained level attempt data might not allow us to make distinguish the higher- and lower-performing students dur- a good prediction. Our second hypothesis was that we could ing that game session, and, importantly, the teachers have use the first 6 minutes of gameplay (about 2 levels) to pre- the ability to make judgments about interventions accord- dict how many levels the student could complete in the next ing to their discretion. Currently, the teachers only have 20 minutes. This had a reasonable outcome with a MAE information on student progress in the overall game curricu- of 1.2 and RMSE error of 1.6, meaning that, on average, lum (which objectives each student has finished and how the prediction is only off by 1-2 levels, which is a good es- many levels have been completed). Additionally, the only timation of how many levels a student will complete. We method currently used to support students in seeking help believe this can provide a valuable resource for the teachers is the raised hand indicator, which has been shown to not who use ST Math in their classrooms, to help them con- always get the teachers’ attention due to its location on the centrate their time and energy on the students who need it students’ screens. We believe that incorporating this predic- the most. Furthermore, this method allows the teachers to tion into the system will be a valuable tool for teachers that have a certain level of judgment in regards to who needs the will suggest which students are struggling and allow them assistance, which is imperative in a system that is used in to decide if they need intervention. Giving teachers these multiple styles. Future work could investigate how this af- suggestions after only 6 minutes of gameplay time means fected the students’ performance if we gave this information that the teachers will have more control over the classroom to teachers. progress because they will have more time to help students get back on track instead of being behind for the entire ses- 9. ACKNOWLEDGMENTS sion and be able to slow down students who are getting too This research is made possible by support of the National far ahead of the class. Science Foundation under Grant No. 1544273. 10. ADDITIONAL AUTHORS analytics to inform digital curricular sequencing: Additional authors: Teomara Rutherford (North Carolina What math objective should students play next? In: State University, email: taruther@ncsu.edu). Proceedings of the Annual Symposium on Computer-Human Interaction in Play. pp. 195–204. 11. 
REFERENCES ACM (2017) [1] Abelson, R.P.: A variance explanation paradox: when [15] Peddycord-Liu, Z., Harred, R., Karamarkovich, S., a little is a lot. Psychological bulletin 97(1), 129 Barnes, T., Lynch, C., Rutherford, T.: Learning curve (1985) analysis in a large-scale, drill-and-practice serious [2] Ahadi, A., Lister, R., Haapala, H., Vihavainen, A.: math game: Where is learning support needed? In: Exploring machine learning methods to automatically International Conference on Artificial Intelligence in identify students in need of assistance. In: Proceedings Education. pp. 436–449. Springer (2018) of the eleventh annual International Conference on [16] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, International Computing Education Research. pp. V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, 121–130. ACM (2015) P., Weiss, R., Dubourg, V., et al.: Scikit-learn: [3] Backlund, P., Hendrix, M.: Educational games-are Machine learning in python. Journal of machine they worth the effort? a literature survey of the learning research 12(Oct), 2825–2830 (2011) effectiveness of serious games. In: 2013 5th [17] Prentice, D.A., Miller, D.T.: When small effects are international conference on games and virtual worlds impressive. Psychological bulletin 112(1), 160 (1992) for serious applications (VS-GAMES). pp. 1–8. IEEE [18] Romero, C., Ventura, S.: Educational data mining: a (2013) review of the state of the art. IEEE Transactions on [4] Calarco, J.M.: “I need help!” Social class and Systems, Man, and Cybernetics, Part C (Applications children’s help-seeking in elementary school. American and Reviews) 40(6), 601–618 (2010) Sociological Review 76(6), 862–882 (2011) [19] Rutherford, T., Farkas, G., Duncan, G., Burchinal, [5] Good, T.L.: Which pupils do teachers call on? The M., Kibrick, M., Graham, J., Richland, L., Tran, N., Elementary School Journal 70(4), 190–198 (1970) Schneider, S., Duran, L., et al.: A randomized trial of [6] Holstein, K., Hong, G., Tegene, M., McLaren, B.M., an elementary school mathematics software Aleven, V.: The classroom as a dashboard: intervention: Spatial-temporal math. Journal of co-designing wearable cognitive augmentation for k-12 Research on Educational Effectiveness 7(4), 358–383 teachers. In: Proceedings of the 8th International (2014) Conference on Learning Analytics and Knowledge. pp. [20] Ryan, A.M., Gheen, M.H., Midgley, C.: Why do some 79–88. ACM (2018) students avoid asking for help? an examination of the [7] Holstein, K., McLaren, B.M., Aleven, V.: Intelligent interplay among students’ academic efficacy, teachers’ tutors as teachers’ aides: exploring teacher needs for social–emotional role, and the classroom goal real-time analytics in blended classrooms. In: structure. Journal of educational psychology 90(3), Proceedings of the seventh international learning 528 (1998) analytics & knowledge conference. pp. 257–266. ACM [21] Sabourin, J.L., Shores, L.R., Mott, B.W., Lester, J.C.: (2017) Understanding and predicting student self-regulated [8] Jiang, S., Williams, A., Schenke, K., Warschauer, M., learning strategies in game-based learning O’dowd, D.: Predicting mooc performance with week environments. International Journal of Artificial 1 behavior. In: Educational data mining 2014 (2014) Intelligence in Education 23(1-4), 94–114 (2013) [9] Karumbaiah, S., Baker, R.S., Shute, V.: Predicting [22] Skinner, E.A., Belmont, M.J.: Motivation in the quitting in students playing a learning game. 
[13] Peddycord-Liu, Z., Cateté, V., Vandenberg, J., Barnes, T., Lynch, C.F., Rutherford, T.: A field study of teachers using a curriculum-integrated digital game. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. p. 428. ACM (2019)
[14] Peddycord-Liu, Z., Cody, C., Kessler, S., Barnes, T., Lynch, C.F., Rutherford, T.: Using serious game analytics to inform digital curricular sequencing: What math objective should students play next? In: Proceedings of the Annual Symposium on Computer-Human Interaction in Play. pp. 195–204. ACM (2017)
[15] Peddycord-Liu, Z., Harred, R., Karamarkovich, S., Barnes, T., Lynch, C., Rutherford, T.: Learning curve analysis in a large-scale, drill-and-practice serious math game: Where is learning support needed? In: International Conference on Artificial Intelligence in Education. pp. 436–449. Springer (2018)
[16] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)
[17] Prentice, D.A., Miller, D.T.: When small effects are impressive. Psychological Bulletin 112(1), 160 (1992)
[18] Romero, C., Ventura, S.: Educational data mining: a review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40(6), 601–618 (2010)
[19] Rutherford, T., Farkas, G., Duncan, G., Burchinal, M., Kibrick, M., Graham, J., Richland, L., Tran, N., Schneider, S., Duran, L., et al.: A randomized trial of an elementary school mathematics software intervention: Spatial-Temporal Math. Journal of Research on Educational Effectiveness 7(4), 358–383 (2014)
[20] Ryan, A.M., Gheen, M.H., Midgley, C.: Why do some students avoid asking for help? An examination of the interplay among students' academic efficacy, teachers' social-emotional role, and the classroom goal structure. Journal of Educational Psychology 90(3), 528 (1998)
[21] Sabourin, J.L., Shores, L.R., Mott, B.W., Lester, J.C.: Understanding and predicting student self-regulated learning strategies in game-based learning environments. International Journal of Artificial Intelligence in Education 23(1-4), 94–114 (2013)
[22] Skinner, E.A., Belmont, M.J.: Motivation in the classroom: Reciprocal effects of teacher behavior and student engagement across the school year. Journal of Educational Psychology 85(4), 571 (1993)
[23] Wolff, A., Zdrahal, Z., Herrmannova, D., Kuzilek, J., Hlosta, M.: Developing predictive models for early detection of at-risk students on distance learning modules. In: Machine Learning and Learning Analytics Workshop at the 4th International Conference on Learning Analytics and Knowledge (LAK14). pp. 24–28 (2014)
[24] Wolff, A., Zdrahal, Z., Nikolov, A., Pantucek, M.: Improving retention: predicting at-risk students by analysing clicking behaviour in a virtual learning environment. In: Proceedings of the Third International Conference on Learning Analytics and Knowledge. pp. 145–149. ACM (2013)