How Long is Enough? Predicting Student Outcomes with Same-Day Gameplay Data in an Educational Math Game

Rachel Harred (North Carolina State University, rlharred@ncsu.edu)
Christa Cody (North Carolina State University, cncody@ncsu.edu)
Mehak Maniktala (North Carolina State University, mmanikt@ncsu.edu)
Preya Shabrina (North Carolina State University, pshabri@ncsu.edu)
Tiffany Barnes (North Carolina State University, tmbarnes@ncsu.edu)
Collin Lynch (North Carolina State University, cflynch@ncsu.edu)

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Curriculum-integrated games can provide teachers with data to help them decide when and how to intervene with individual students. Based on our prior work observing teachers using ST Math, teachers may not be able to attend to a dashboard or student screens to determine who might need intervention. We therefore set out to determine how much data we need from the current ST Math gameplay session to predict performance. Based on the available log data, which tracks student performance over sets of puzzles, we performed two experiments to predict performance. The first uses data from one game level, which is about 3 minutes long, to predict the performance on the next level, and the second uses the first 6 minutes of gameplay to predict how many levels a student can complete in 20 minutes, a typical class length. Our results show that our data are not fine-grained enough to allow for paired level prediction, but that 6 minutes of gameplay can be used to rank students in order of performance for a class session. These results can be used as a basis for an alert system that could help teachers prioritize their time in the classroom.

1. INTRODUCTION
Educational games can be a useful tool for teachers to provide additional practical learning for students [3]. As more educational games become curriculum-integrated, a significant portion of a student's time can be spent in these systems. However, teachers cannot monitor and assist each student at the same time, and they struggle to identify the students who need help the most. In previous work, we observed that teachers' assistance was often influenced by things such as classroom layout and disruptive behavior rather than learner proficiency or needs [13]. Furthermore, that work identified that students who "struggled quietly" often went unnoticed. In other work, the authors found that when students possibly need intervention but do not receive it, they might get frustrated and give up or replay an easier game instead [9]. Other research has also shown that teachers can often unintentionally favor or give assistance to certain types of students due to differences in perceptions or help-seeking behaviors [4, 22, 5]. Therefore, providing teachers with information to help them determine who needs assistance the most may be crucial for some low-performing students.
Despite the amount of data gathered with each playthrough, teachers in our system are only provided with a student's current progress in the curriculum and a feature that allows a student to "raise" their hand through the system. However, this is only visible on the student's screen via a purple hand indicator and often goes unnoticed. Therefore, we sought to determine if there was a way to provide teachers with knowledge regarding students' projected progress as fast as possible, so that the teachers can determine who to help from there. With machine learning techniques that can process such data and help predict outcomes, we wanted to find the correct technique to answer our question. Machine learning and educational data mining techniques have been successfully used in educational game research for many years [11, 21, 10, 18].

In this paper, we tried to determine the smallest amount of time needed to predict student outcomes for one gameplay session by investigating multiple feature selection algorithms and prediction models on student gameplay data for an educational game, Spatial Temporal Math (ST Math). We tried two methods of prediction using data analysis and machine learning: 1) trying to predict student outcomes for playing one level of a game using gameplay data from only the previous level, and 2) using the least amount of time of a student's gameplay data to predict the number of levels they will pass in the next twenty minutes of gameplay. To accomplish this, we tried various machine learning and feature selection methods to find the most significant features needed to predict student outcomes in this educational game. In this study, our intention was to give insight to the teachers of ST Math by indicating our best guess for which students could most benefit from teacher intervention on a single day, provided early enough in the gameplay session to allow the teacher to help as many students as possible.

1.1 Spatial Temporal Math (ST Math)
ST Math is a curriculum-integrated supplemental mathematics game for 2nd-4th-grade students that uses spatial puzzles to teach basic math concepts [19, 12, 14, 15]. The puzzles do not contain any textual instruction. The games are grouped at the highest level by objective, which indicates the broad math concept. Each objective contains a number of games; the gameplay under an objective varies but concerns the same content. The games usually have between 3 to 5 levels each, and the gameplay across levels is similar but increases in difficulty. There are usually between 6 and 8 puzzles per level. The puzzles are either randomly generated using a template or randomly selected from pre-designed puzzles, depending on the level. Each puzzle requires the student to perform the correct action to indicate their answer. Animated feedback is presented to the student following the puzzle-solving attempt that shows the student if they are correct or incorrect. For example, in the game "Fair Sharing" under "Division Concepts", a student is asked to distribute boxes equally among animals to construct a straight bridge; in case of an incorrect answer, the bridge is shown blocked off or with gaps that make it impossible to cross. A level begins with a set number of lives, usually 2, that resets at the beginning of each level. If a student's response is incorrect, the student loses a life. If the student loses all of their lives before completing the level, they do not pass the level and must retry it.
To pass a level, the student must complete all puzzles without losing all their lives. After a student passes a level, they may move on to the next level in the game or objective, or backtrack and play a previously passed level. We refer to this backtracking as replay. A level attempt includes passing a level, failing a level, and replay. Each student has the option to "raise their hand" if they want help from their teacher by clicking on a hand icon on the screen. The teacher also has access to see which objectives and levels a student has passed. See Figure 1 for a breakdown of ST Math.

Figure 1: ST Math

2. RELATED WORK
Educational games in classrooms are helpful to teachers because the students can receive individualized attention and learning from the game while the teacher gives one-on-one attention to students who need it [7]. However, teachers have limited time and try to prioritize their attention to the students who need it most. It has been shown that students who are given attention by teachers have increased learning [5]; therefore, teachers who are able to focus their attention on low-performing students should see them benefit.

Unfortunately, there are many reasons that students who need help do not receive it. One study found that middle-class students seek help more directly than working-class students and end up getting more help as a result [4]. In a recent study of classroom observations using ST Math, researchers found that differences in classroom format have an influence on who receives help [13]. Furthermore, this work found that teachers in free-seating classrooms could not easily see the raised hand indicator on student screens, and so those help-seeking students went unnoticed. In classrooms that use rotation seating, teacher attention was only given if the ST Math group was being disruptive. In general, the students who directly asked for help or were obviously off-task received more teacher intervention than the students who were less vocal [13].

The task of automatically identifying students who need help has been explored [2]. With machine learning methods that can process large amounts of data and make predictions, many have been using these methods in an attempt to solve the problem. Ahadi et al. [2] explored machine learning techniques and were able to use the first week of data in a programming course to predict student performance with accuracy ranging from 71-80%. Additionally, Jiang et al. employed logistic regression models to predict the type of certificate a learner received in Massive Open Online Courses (MOOCs) [8]. In a study at the Open University, decision tree models were implemented on users' current and previous activity to predict if they were at risk of failing a module [24]. Another Open University study explored using Bayesian models to build real-time predictive models from student data and found little difference among the types of models, but that the accuracy increased with the addition of data throughout the progression of the learning module [23]. These studies show that machine learning can be used to predict and understand student behavior, but these predictions are not being used to directly aid student learning.

Predictive models are now being integrated into teacher dashboards or alert systems to enable this aid. In a survey of K-12 teachers who used intelligent tutoring systems in class, Holstein et al. found that they were very interested in having real-time classroom monitoring tools that would help them decide which students most needed their attention [7]. In another study, Holstein et al. developed a teacher alert system using smart glasses that showed real-time indicators of student behaviors floating above their heads [6]. Their early findings suggest that this helped direct the teachers to the students who needed intervention the most [7].
3. DATA
The data were collected by MIND Research Institute, the creators of ST Math. This study was conducted on data from 3rd grade students who played ST Math during the 2016-2017 school year. The data contain 31 objectives, 154 games, and 669 levels, which equals 5,186,269 total level attempts by 8,983 students from 111 schools and 636 teachers. We excluded students who completed objectives not contained in the 3rd grade objectives, which removed 21,544 level attempts. These are students who might have been erroneously included from other grades, as the system is used for grades 3-5. For the purposes of our study, we also filtered out level attempts where the student completed only one level attempt within our 30-minute gameplay session cutoff. This removed 101,849 level attempts, leaving us with a final dataset of 5,062,876 unique level attempts. The initial data we were provided with had 6 features for each level attempt: STMathID (unique ID for each student), Level, Objective Code, Timestamp, Number of Correct Puzzles, and Total Number of Puzzles. For the purpose of our analysis, we created the additional features shown in Table 1.

Table 1: Features Created for the Analysis
  game - The name of the game this level is in, extrapolated from the Objective Code and Level
  performance - Number of puzzles correct before losing all lives divided by total puzzles in level
  isReplay - 1 if this level has been attempted before, otherwise 0
  isUnneceReplay - 1 if the level is unnecessary replay, otherwise 0
  prevPassAttemptSameLevel - For replay: 1 if it is on the same level played in the previous row, 0 if not, NA if not replay
  passedCurrentLevelB4Unnece - 1 if the student passed the level in the previous row (their "current" level) AND this row is replay, 0 if not, NA if not replay
  sameObj - 1 if the current level and the previous one are in the same objective
  levelPlayTimeSec - Timestamp minus the previous row's timestamp
  gameplaySession - Numbered starting at 1 and incrementing if levelPlayTimeSec > 30 minutes (1800 s); rows with the same numbered gameplaySession are assumed to happen during a single "play session"
  prevPerf - Average of the previous gameplay session's performance
  firstInGameplaySession - 1 if this level attempt is the first in a new gameplay session, otherwise 0
  lastInGameplaySession - 1 if this level attempt is the last in a gameplay session, otherwise 0

4. METHOD
We wanted to explore different ways of providing teachers with information about student progress, so that they could intervene and monitor progress according to their own classroom goals. Therefore, our intention is to predict the projected progress of a student using the least amount of information and using only same-day gameplay data. Here, we compare two methods of data segmentation: predicting the time spent passing the next level based on the time spent passing the current level, and using the least amount of gameplay to predict how many levels would be passed in the next 20 minutes of gameplay. We are attempting to see if a student's performance at the beginning of a gameplay session is a good prediction of their later performance. A gameplay session is defined as subsequent level attempts that are separated by less than 30 minutes. If two level attempts are separated by 30 minutes or longer, we count the second attempt as a new gameplay session. We decided on a 30-minute cutoff because a pause in the game of 30 minutes or more might indicate that the student was working on something else in between, and we cannot say that the previously played level will have any effect on the performance of the next level. However, we still want to include students who may be truly struggling, receiving help from a teacher, or giving help to another student during the playthrough of a level. Only 3.5% of the data had a gap between play sessions of between 30 minutes and 20 hours, while 11.6% of the data had a gap of 20 hours or longer between gameplay sessions.
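To make the session segmentation concrete, the sketch below derives levelPlayTimeSec and gameplaySession (as defined in Table 1) from a toy attempt log with pandas. The column layout and sample values are illustrative assumptions, not the actual ST Math log schema.

```python
import pandas as pd

# Toy attempt log: one row per level attempt; column names mirror Table 1,
# but the layout and values are assumptions for this sketch only.
log = pd.DataFrame({
    "STMathID": [1, 1, 1, 1, 2, 2],
    "Timestamp": pd.to_datetime([
        "2017-01-09 09:00", "2017-01-09 09:03", "2017-01-09 09:07",
        "2017-01-10 09:00", "2017-01-09 10:00", "2017-01-09 10:04",
    ]),
})

CUTOFF_SEC = 30 * 60  # a gap of 30 minutes or more starts a new gameplay session

log = log.sort_values(["STMathID", "Timestamp"])

# levelPlayTimeSec: current timestamp minus the previous row's timestamp, per student
log["levelPlayTimeSec"] = log.groupby("STMathID")["Timestamp"].diff().dt.total_seconds()

# gameplaySession: starts at 1 for each student and increments whenever the
# gap reaches the cutoff (a student's first attempt opens session 1)
new_session = (log["levelPlayTimeSec"] >= CUTOFF_SEC) | log["levelPlayTimeSec"].isna()
log["gameplaySession"] = new_session.groupby(log["STMathID"]).cumsum()

print(log)
```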
5. EXPERIMENT 1: PAIRWISE PREDICTION
This section details our attempt at using previous level data to predict student outcomes for the next level attempt.

5.1 Pairwise Method
We wanted to see if we could predict how well a student will do on the next level by using only the data from the previous level in the prediction. Our aim in this experiment was to investigate whether the features of a single level for a student can predict whether the student would have needed an intervention for the next level. This would be beneficial to teachers because it would give an alert immediately after a student finishes a level, telling them that the student might need help on the next level. We used pairwise prediction, so the data were grouped into Level A and Level B, with the constraint that Level A and Level B had to be in the same objective and the same gameplay session. We considered Level A to be the first attempt on a new level, all subsequent attempts (retries) until the level was passed, and any replayed levels that happened before or after the level was passed. Any gameplay that happened between the first attempts of two consecutive levels would be counted as Level A data. Level B is the next attempt on a new level after Level A, and contains the same information as Level A: number of attempts, retries, and replays before and after passing, up until the next new level attempt. We expected the number of attempts to pass a level to provide information about how many attempts a student will need for the next level, because the levels increase in difficulty inside objectives and this is consistent throughout ST Math. Also, research has shown that replay that happens before passing a level has a negative effect on performance, while replay that happens after passing a level has been shown to have a positive effect [12]; therefore, we expected this data to also be useful for the prediction.

Due to time constraints and the complexity of the feature creation, we used a subset of our total dataset for this analysis. The dataset includes 830 students and 665 unique objective-level pairs. Objective-level pairs are level pairs within an objective. However, some students did not complete all the objective-levels, resulting in a total of 277,975 unique student objective-level pairs.
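As a rough illustration of the pairing step described above, the following sketch joins each aggregated level block (Level A) with the next one for the same student (Level B) and keeps only pairs that share an objective and a gameplay session. The frame and its column names are hypothetical stand-ins for the engineered features, not the study's exact data.

```python
import pandas as pd

# One row per aggregated "level" block (first attempt plus its retries/replays),
# already ordered by time within each student. Columns are illustrative.
levels = pd.DataFrame({
    "STMathID":        [1, 1, 1, 1],
    "gameplaySession": [1, 1, 1, 2],
    "objective":       ["DIV", "DIV", "FRAC", "FRAC"],
    "level":           [1, 2, 1, 2],
    "avgTimePerLevel": [180.0, 240.0, 150.0, 210.0],
    "avgPerfTotal":    [1.0, 0.83, 1.0, 0.67],
})

# Pair each row (Level A) with the next row for the same student (Level B).
pairs = levels.join(levels.groupby("STMathID").shift(-1).add_suffix("_B"))

# Keep only pairs inside the same objective and the same gameplay session.
pairs = pairs[
    (pairs["objective"] == pairs["objective_B"])
    & (pairs["gameplaySession"] == pairs["gameplaySession_B"])
]

print(pairs[["STMathID", "level", "level_B", "avgTimePerLevel", "avgTimePerLevel_B"]])
```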
5.1.1 Pairwise Feature Generation
Since the raw data only included gameplay aspects per attempt, such as time taken, attempt performance, and the kind of attempt (retries, replay, etc.), we engineered 33 additional features for every student objective-level pair. There are 7 attempt categories and 5 metrics per category. The attempt categories are as follows: overall level attempts, total retry and replay attempts, retry attempts, total replay attempts, total replay attempts before passing the current level, total replay attempts after passing, replay attempts of the same level (current) after passing, and replay attempts of other levels after passing. The 5 metrics for each category are as follows: whether an attempt category occurred (except for the overall attempts category), total number of attempts (except for the overall attempts category), total time, average time, and average performance.

Next, for each student, we identified the consecutive levels (Level A, Level B) within each objective that were completed in the same session. We found a total of 222,258 such level pairs for 830 students. We explored four different ways to define an intervention: if the total time was greater than the 75th percentile (I-TotalTime), if the average time was greater than the 75th percentile (I-AvgTime), if the average performance was less than the 25th percentile (I-AvgPerf), and if the student could not finish a level in the first attempt (I-FirstAttempt). Each of these intervention types was intended to capture a different aspect of a student's ability to complete a level.
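A minimal sketch of how the four intervention targets could be computed from Level B summaries follows. The column names are hypothetical and the data are synthetic; the thresholds follow the 75th/25th percentile definitions above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical Level B summary for each (Level A, Level B) pair.
pairs = pd.DataFrame({
    "totalTime_B":     rng.exponential(300, 1000),   # seconds spent on Level B
    "avgTime_B":       rng.exponential(200, 1000),
    "avgPerf_B":       rng.uniform(0.5, 1.0, 1000),
    "firstAttempts_B": rng.integers(1, 4, 1000),     # attempts until the level was passed
})

# Four binary targets, one per intervention definition in Section 5.1.1.
pairs["I_TotalTime"]    = (pairs["totalTime_B"] > pairs["totalTime_B"].quantile(0.75)).astype(int)
pairs["I_AvgTime"]      = (pairs["avgTime_B"]   > pairs["avgTime_B"].quantile(0.75)).astype(int)
pairs["I_AvgPerf"]      = (pairs["avgPerf_B"]   < pairs["avgPerf_B"].quantile(0.25)).astype(int)
pairs["I_FirstAttempt"] = (pairs["firstAttempts_B"] > 1).astype(int)  # failed the first attempt

print(pairs[["I_TotalTime", "I_AvgTime", "I_AvgPerf", "I_FirstAttempt"]].mean())
```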
5.1.2 Pairwise Feature Selection and Prediction Models
The analysis was carried out in Python. We normalized the time-related features and then explored three feature selection techniques in the scikit-learn [16] package. We used a pipeline of a feature selection wrapper method, SelectFromModel, with models such as LinearSVC (Linear Support Vector Classifier with L1 loss), LassoCV (Lasso linear model with 3-fold cross validation), and Logistic Regression. We used seven classifiers: KNN (n = 3), LinearSVC (Linear Support Vector Classifier), Decision Tree (using the Gini index), Random Forest (using the Gini index), MLP (a multi-layer perceptron classifier with a 2-layer (100, 100) neural network using a learning rate of 0.001 and the ReLU activation function), AdaBoost, and Naive Bayes, and we measured the prediction accuracy using 10-fold cross validation.
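The sketch below wires together the pieces named in this subsection: SelectFromModel wrapping a LinearSVC (interpreting the "L1 loss" above as an L1 penalty, which is what scikit-learn's LinearSVC exposes) feeding a Random Forest, scored with 10-fold cross validation. The data are synthetic stand-ins and the hyperparameters are placeholders rather than the study's exact settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the normalized Level A feature matrix and one
# intervention target (e.g., I-TotalTime); shapes only, not real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 33))          # 33 engineered Level A features
y = (X[:, 0] + rng.normal(size=2000) > 1).astype(int)

pipeline = make_pipeline(
    StandardScaler(),                                               # normalize time features
    SelectFromModel(LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000)),
    RandomForestClassifier(criterion="gini", n_estimators=100, random_state=0),
)

# 10-fold cross-validated accuracy, as reported in Table 3.
scores = cross_val_score(pipeline, X, y, cv=10)
print(f"mean accuracy: {scores.mean():.3f}")
```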
5.2 Pairwise Results & Discussion
The feature selection based on LinearSVC with L1 loss and the Random Forest classifier provided the best prediction accuracy. Table 3 shows the results of using a Random Forest classifier for each intervention type. We observed that the average time spent on Level A was selected for every intervention-type target for Level B. The distribution parameters of the features selected are shown in Table 2. We observed that very few features related to replays were selected. This may be because the consecutive level pair dataset recorded very few rows with retries or replays (18.20%), and even fewer when considering only replays (0.12%). Such a high degree of sparsity in replay made the replay-related features not significant enough to contribute towards the predictions. Another interesting observation is that the performance over all the attempts (avgPerfTotal) in Level A was not a significant predictor of the intervention for Level B in any of the models, primarily because of the small variance recorded for this feature. The small variance is due to the granularity of the data, which records passed level attempts with failed puzzles as 100% performance. On the other hand, the average time in Level A (avgTimePerLevel) was a significant predictor of the intervention for Level B for every intervention-type target. We recorded few replays in general and a low variance in the performance-related features, so only the time-related features were varied enough to capture the relationship between Level A and Level B. The results suggest that the average time spent on an attempt in a level is the most significant predictor of whether a student may need assistance in the next level. However, the classifier models for each intervention type did not perform significantly better than a baseline classifier that would predict all observations to be the majority class (the class containing more students), as shown in Table 3.

Table 2: Features Selected using Linear SVC (with L1 loss) for Pairwise Prediction
  Feature                Description                               Mean (mode for binary)   SD (occurrence for binary)
  retryOrReplayBinary    1 if retryAndReplaySum > 0, otherwise 0   0                        0: 181,804; 1: 40,454
  replayBinary           1 if replaySum > 0, otherwise 0           0                        0: 221,981; 1: 277
  retryAndReplayTimeSum  Total time spent on retries and replays   65.78                    200.18
  avgTimePerLevel        Average overall time per attempt          225.49                   175.23
  avgPerfTotal           Average overall performance               0.94                     0.16
  avgRetryAndReplayPerf  Average retry and replay performance      0.16                     0.35
  avgRetryPerf           Average retry performance                 0.16                     0.35

Table 3: Features Selected based on Linear SVC with L1 loss and Prediction Accuracy with Significant Predictors using a Random Forest classifier for Each Intervention Type
  Target          Features Selected                                                    Significant Predictor(s)                            Majority Class   Prediction Accuracy (K=10)
  I-TotalTime     avgTimePerLevel, avgPerfTotal, avgRetryAndReplayPerf, avgRetryPerf   avgTimePerLevel: 0.7; retryAndReplayTimeSum: 0.17   76.64%           77.20%
  I-AvgTime       avgTimePerLevel, avgPerfTotal, retryOrReplayBinary                   avgTimePerLevel: 0.93                               76.18%           77.21%
  I-AvgPerf       avgTimePerLevel, avgPerfTotal, retryAndReplayTimeSum                 avgTimePerLevel: 0.98                               99.59%           99.60%
  I-FirstAttempt  avgTimePerLevel, retryOrReplayBinary, replayBinary                   avgTimePerLevel: 1.00                               75.32%           76.49%

This suggests that the relation between the behavior of students in two consecutive levels may be highly varied and that it is difficult to generalize whether an intervention is needed in a level based on only one previous level. It may also suggest that such a prediction may be dependent on how far along students are in their academic year. To investigate the first scenario, we added a feature for a student's previous performance average, an average of every level attempt up to that point, in an attempt to help distinguish low-performing students from the rest. The previous performance average was selected for each intervention type prediction but had low feature importance (0.03%) because of its low variance and, therefore, did not affect the prediction accuracy. To investigate whether the time of the academic year had any impact on the prediction, we added a feature to indicate the month in which the sessions occurred. Similar to the previous performance average, this feature was selected but had a low feature importance (0.05%), leading to an insignificant difference in the prediction accuracy.

Since only the time-related features were varied enough to capture the variance in student behavior across two consecutive level pairs, we explored ways other than feature generation to perform the pairwise prediction. We sliced the data based on aspects such as replay type or month of the year but, again, obtained similar prediction accuracies; however, the replay-related features did get selected and had high importance for the prediction in the data sliced by replay type. To investigate whether the variance in the content of the objectives may be affecting the prediction results, we performed the prediction for each intervention type within a single objective and observed that the prediction accuracy decreased slightly. This suggests that even within one objective, the behavior of a student in one level, as captured at its current granularity, may not accurately predict if they need an intervention in the next level.

The pairwise prediction models may not have generated desirable results because we may need more than just one previous level's data to predict if an intervention is needed. There is not sufficient data about each level in this dataset to accurately represent the student's performance and to create good predictions. Therefore, having more fine-grained details about level attempts, including knowing more about how the levels compare to each other within each objective, may improve the prediction accuracy.

6. EXPERIMENT 2: LEVELS COMPLETED IN 20 MINUTES
We chose to determine if we could predict the number of levels completed in 20 minutes using only information from the current session. Using only information from the current session will allow easier integration with the current system with minimal changes needed. Different schools and classrooms have unique ways of using ST Math [13]. As a result, there is a variety of session times, ranging from very short (less than 5 min) to sessions lasting over an hour, with an average of 23 minutes spent in a session. Therefore, we decided to predict how many levels a student would complete in a 20-minute session. This information could be used by teachers to identify students who will not be able to complete the number of levels the teacher expects for that session, and the teacher can intervene to assist or encourage. With this prediction, the system could provide a teacher with each student's predictions and order the students by the lowest predicted number of levels to complete in the next 20 minutes. Then, the teacher can easily look at the slowest students, make a judgment based on their knowledge of each student and the goals the teacher has for that lesson, and determine who they need to assist. Studies have shown that teachers may focus assistance on students with better help-seeking behaviors because they are often more persistent or better at requesting help [4, 20]. Providing this information early in the session could be crucial for low-performing students who are not asking for help or are not doing so effectively.

6.1 Levels Completed Prediction: Methods
The data we used consisted of 787,949 session observations from 8,978 unique students, 111 schools, and 636 teachers. A session represents a period of time that the student spends working on ST Math without taking longer than a 30-minute break (see Section 4 for the full definition). For accurate predictions, we chose to use the first 6 minutes of gameplay due to the average level attempt taking approximately 3 minutes. We refer to this segment of data used for prediction as a time "slice". Since our goal was to use the least amount of information to do the prediction, we wanted this time to be as short as possible. We initially attempted to use shorter time slices, but because a level attempt takes 3 minutes on average, this did not provide a sufficient amount of data to represent the students' gameplay behavior and, in some cases, eliminated slower students' data for that time slice.
Next, we removed sessions under 10 minutes (242,750 obs., 169,863 obs. under 6 min) and sessions over 75 minutes (10,257 obs.). We chose these cutoffs to eliminate short sessions where predictions would not be useful and long sessions, in some cases over 4 hours, that were likely anomalous.

Table 4 shows the statistics for the session and slice features. The average session time was 28 minutes and the average slice length was 4.5 minutes.

Table 4: Session and Time Slice Stats for the time, performance, and levels completed
           Feature             Mean (SD)      Mdn
  Session  Levels per 20 min   3.8 (2.5)      3.5
           Total Time (min)    28.9 (26.3)    816.2
           Avg Performance     74.9% (23.4)   81.3%
  Slice    Levels Completed    1.9 (1.1)      1
           Total Time (min)    4.5 (0.9)      4.6
           Avg Performance     72.6% (34.0)   100%

6.1.1 Levels Completed Prediction: Feature Generation
From the level data, the data were segmented into the first 6-minute time slice for prediction. Features were aggregated from this 6-minute time slice to capture what each student was able to do, such as complete a level, fail a level, retry a level, or engage in replay. The features generated are based on performance, time, level attempt/replay features, and the objective the student was in within that slice. The performance features were: average performance (numeric), percentage of levels passed out of all attempts (numeric), and percentage of levels completed out of all attempts (numeric). The time features from the slice were: the total time (numeric), the average level time (numeric), number of passed levels per time (numeric, scaled), number of completed levels per time (numeric, scaled), and the month of the session. The level attempt features were: total number of replays (numeric), total number of levels failed (numeric), total number of levels passed (numeric), total number of levels completed (numeric), total puzzles attempted (numeric), total puzzles completed (numeric), whether they engaged in replay (binary), and whether they re-attempted a level (binary). Then, there were 31 binary features representing the objective the student was playing when the session started.
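One way the slice features could be aggregated from the attempt-level data is sketched below: attempts falling in the first six minutes of a session are grouped per student and session, and counts, times, and performance are summarized. The column names and toy rows are assumptions for illustration and cover only a subset of the feature list above.

```python
import pandas as pd

SLICE_MIN = 6  # first six minutes of each gameplay session

# Illustrative attempt log for a few sessions; columns are assumptions.
attempts = pd.DataFrame({
    "STMathID":       [1, 1, 1, 2, 2],
    "session":        [1, 1, 1, 1, 1],
    "minIntoSession": [1.5, 4.0, 5.5, 2.0, 5.0],  # minutes since session start
    "levelTimeSec":   [150, 160, 90, 200, 180],
    "passed":         [1, 0, 1, 1, 1],
    "performance":    [1.0, 0.6, 1.0, 0.83, 1.0],
})

in_slice = attempts[attempts["minIntoSession"] <= SLICE_MIN]

slice_features = in_slice.groupby(["STMathID", "session"]).agg(
    levelsPassed=("passed", "sum"),
    levelsAttempted=("passed", "size"),
    avgPerformance=("performance", "mean"),
    totalTimeSec=("levelTimeSec", "sum"),
    avgLevelTimeSec=("levelTimeSec", "mean"),
)

# Rate feature scaled by the observed slice time (levels passed per minute).
slice_features["passedPerMin"] = (
    slice_features["levelsPassed"] / (slice_features["totalTimeSec"] / 60)
)

print(slice_features)
```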
6.1.2 Levels Completed Prediction: Model Selection and Feature Selection
For this prediction, we tried a variety of models, tuning of these models, and alteration of the target variable. Model and feature selection were accomplished using scikit-learn [16].

For the models, we tried both classification and regression. For classification, multiple groupings of the number of completed levels in 20 minutes were chosen using balanced classes, with the best accuracy (77%, using a 2-layer neural network) coming from a split on whether a student could complete at least an average number of levels in 20 minutes. However, we decided that regression, providing finer-grained predictions, would provide more useful information to the teachers and allow them to have more autonomy in deciding which students need help. For regression, we tried to predict how many levels a student could complete in 20 minutes, which was derived by taking the total number of levels completed and the total time of the session and scaling.

We tried multiple models, including decision trees, neural networks, and random forests, after normalizing the features. Interpretable machine learning methods are more important here because knowing which features are more influential in predicting performance can give insights into how students learn in games. In the results, we show the 2 best models compared to a baseline. The baseline model is created by always predicting the mean of the completed levels per 20 minutes. The best models were created by testing multiple models and fine-tuning the parameters. The two best models are created from scikit-learn: a 3-layer (50, 30, 20) neural network (MLPRegressor) using a learning rate of 0.001 and the ReLU activation function, and a Random Forest (RandomForestRegressor) using mean squared error as the criterion function and setting the minimum samples for a split to 20.

To evaluate each model, we used the following metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), adjusted R-squared score, and explained variance (EV). We chose these metrics to evaluate, on average, how accurate each prediction was, to determine if the error is small enough to still provide a good estimate of the students' projected progress. Both R2 and EV were used to evaluate the variance of these errors and check for biases within the models. All models were evaluated with 10-fold cross validation.

We attempted feature selection using scikit-learn filter methods, such as feature importance from tree regression, and a wrapper method (SelectFromModel) with each model. Feature selection did not improve any models and resulted in significantly worse predictions in most cases. This is most likely due to the limited number of features available. Because our data are not fine-grained, we have a limited amount of information about each student for a level attempt. This indicates that each feature could be providing key information regarding their current progress. Therefore, for the best models the whole feature set was used (see Feature Generation).
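A sketch of the target scaling and of the two regressors named above follows, evaluated with MAE, RMSE, R-squared (the plain rather than adjusted variant, for brevity), and explained variance under 10-fold cross validation. The data are synthetic and settings such as the number of trees are placeholders, not the study's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the slice features and the session outcomes.
X = rng.normal(size=(3000, 20))
levels_completed = np.clip(rng.poisson(4, 3000) + X[:, 0], 0, None)
session_minutes = rng.uniform(10, 75, 3000)

# Target: levels completed, rescaled to a common 20-minute window.
y = levels_completed / session_minutes * 20

models = {
    "MLP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(50, 30, 20), activation="relu",
                     learning_rate_init=0.001, max_iter=500, random_state=0),
    ),
    "Forest": RandomForestRegressor(min_samples_split=20, random_state=0),
}

for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=10)
    print(name,
          f"MAE={mean_absolute_error(y, pred):.2f}",
          f"RMSE={mean_squared_error(y, pred) ** 0.5:.2f}",
          f"R2={r2_score(y, pred):.2f}",
          f"EV={explained_variance_score(y, pred):.2f}")
```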
6.2 Levels Completed Prediction: Results & Discussion
This section discusses the results of the Experiment 2 regressions.

Table 5 shows the results of the evaluation metrics for the two best regression models compared to the baseline model. The NN and the Random Forest both perform similarly, both outperforming the baseline model. Although the MAE does not have a large difference, the RMSE is much lower. This indicates that the variance of the errors is significantly smaller for our predictive models. The MAE of 1.2 for our predictive models means that, on average, the prediction will only be around 1 level off for a specific student, which still provides a good estimation for the teacher to use. The adjusted R-squared and explained variance are almost identical for both models, which happens when the mean of the errors approaches zero. Although these scores are not perfect, in the context of educational data from a system used with multiple teaching styles, this is a highly meaningful result [1, 17].

Table 5: Results for the 2 best regression models compared to a baseline (mean)
  Model      MAE     RMSE    R2        EV
  Forest     1.24    1.59    0.58      0.58
  NN         1.22    1.58    0.59      0.59
  Baseline   1.96    2.48    -1.2E-5   0.0

Table 6 shows the top five most important features for decisions in the Random Forest. All of these top features focus on the number of completed levels, the total time, or a combination of these features. This is not surprising because the number of levels a student can complete in the first 6 minutes should be a good indication of how they will perform over the whole session. However, this assumes that the students remain seated and playing the game in the same manner.

Table 6: The top five most important features from the Random Forest
  Rank   Feature
  1      Total Levels Completed
  2      % of Levels Completed
  3      Completed Levels per Time
  4      Total Time
  5      Average Level Time

Figure 2: These density plots show the predicted vs. actual values for the two best regression models to predict the number of completed levels in 20 minutes. Note: the yellow/lightest areas represent the highest density of points.

Figure 2 shows a density plot of the predicted values versus the actual values, with the yellow/lightest color being the highest density. Both figures show that the highest density areas occur close to the actual values. The Neural Network appears to have a higher density closer to the line and the points appear to be more compact, although both models show similar predictions. Both figures are zoomed in to focus on the lower level-number predictions, although few points have values higher than 10. One note is that both models are less well fitted for the higher values and tend to predict around 10 after the actual value is 10+. However, we are mostly concerned with students who are completing very few levels. If a student falls into the 10+ range of levels completed, the actual value becomes less important due to how far above the average it is. A teacher will still be able to use this information to identify over-performing students and ensure they don't get too far ahead of the class.

With the variability of how this system is used, the models' evaluations are a positive result. For example, during field observations of the system, we found many teachers asked students who were ahead in the curriculum to help the students next to them during a gameplay session. Therefore, a student may spend part of the session working as normal, and then, after the teacher has identified a struggling student, the teacher may ask the student next to them to help. This could result in much higher predicted values than what the student actually completes. Furthermore, we observed students in some classrooms initially talking and working at a slower pace in the first few minutes of a session as they settled in, then shortly being asked to focus. This could result in much lower predictions of the projected number of levels that student could complete. Since the data do not only include sessions where students quietly work by themselves for a continuous period of time, accurate predictions are difficult.

Furthermore, the data we used focused only on same-day gameplay data, not containing any information regarding how a student has previously performed in other sessions. This decision was made to limit the changes required to implement this system in the game. However, including prior information may improve predictions. One possible way to control for the effect of the different teaching styles is to include teacher or school information in the model. However, this would create very sparse features due to the large number of teachers and schools that use the system. A future attempt could identify and categorize the teachers or schools based on similar styles and add those features to the models.

This prediction can be used in two main ways: identifying the lowest performing students who may need assistance but may not be requesting help, and identifying students who may be working too fast and getting ahead in the curriculum. The second usage may not seem like an issue, but having a large knowledge gap between students may make a classroom harder to manage and teach. This is a problem teachers seek to avoid in ST Math, and one they have remedied by asking those students to help others or by allowing them to play games while others catch up [13]. For ease of use, these predictions could be provided in a simple list with each student's name and the predicted number of levels they will complete in 20 minutes. Furthermore, the top 5 lowest and highest predictions could be presented at the top of the interface so teachers could quickly have an idea of who is struggling and who may need to be slowed down. Because teachers already have access to where each student is in the curriculum, the teacher can use their expertise and knowledge of the students to make judgement calls on what to do from there. A mock interface of how this could be presented can be seen in Figure 3.

Figure 3: Mock interface showing how teachers would view students' predictions
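As a sketch of the proposed teacher-facing list, the snippet below sorts hypothetical per-student predictions in ascending order and pulls out the five lowest and five highest, mirroring the mock interface idea in Figure 3. All names and numbers are invented.

```python
import pandas as pd

# Hypothetical per-student predictions for the next 20 minutes.
predictions = pd.DataFrame({
    "student": [f"Student {i}" for i in range(1, 13)],
    "predicted_levels_20min": [1.1, 4.8, 2.3, 6.9, 3.4, 0.8, 5.2, 2.9, 7.5, 3.1, 1.9, 4.1],
})

ranked = predictions.sort_values("predicted_levels_20min").reset_index(drop=True)

print("May need help (lowest predicted progress):")
print(ranked.head(5).to_string(index=False))

print("\nMay be getting too far ahead (highest predicted progress):")
print(ranked.tail(5).to_string(index=False))
```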
7. OVERALL DISCUSSION
The results for the levels completed experiment were more promising than the pairwise experiment. For the pairwise prediction, the lack of fine-grained puzzle-level data made it difficult to predict whether a student may need intervention based only on their previous level's data. We believe the results for this method of pairwise prediction might improve with more data about how the objectives, games, and levels relate to each other. On the other hand, the prediction model from the levels completed experiment had decent results, with the MAE and RMSE indicating that the predictions are generally within 1-2 levels of the actual completed levels for the 20-minute time period. Having additional information, including finer-grained puzzle-level data, should also improve this prediction.

Providing the teachers with a projected completed amount of levels allows us to give the teachers a list of the students ranked by the number of levels they are predicted to complete. This allows the teachers to use their expertise to distinguish the higher- and lower-performing students during that game session, and, importantly, the teachers have the ability to make judgments about interventions according to their discretion. Currently, the teachers only have information on student progress in the overall game curriculum (which objectives each student has finished and how many levels have been completed). Additionally, the only method currently used to support students in seeking help is the raised hand indicator, which has been shown to not always get the teachers' attention due to its location on the students' screens. We believe that incorporating this prediction into the system will be a valuable tool for teachers that will suggest which students are struggling and allow them to decide if they need intervention. Giving teachers these suggestions after only 6 minutes of gameplay time means that the teachers will have more control over the classroom progress, because they will have more time to help students get back on track instead of being behind for the entire session, and they will be able to slow down students who are getting too far ahead of the class.

7.1 Limitations
To reduce the amount of time spent processing the data, we used a representative subset for the pairwise prediction. However, we compared multiple numerical and categorical features between this subset and the entire dataset and determined that it contained almost identical distributions of data points. We created histograms for the distributions of performance, level play time, levels in session, time of session, and performance per session, and compared the number of schools and teachers represented in the subset to the totals. We were only missing 6 out of 111 schools, and we had students from almost half of the teachers (291 out of 636) included in our subset.

We do not have fine-grained interaction data, which means we cannot tell exactly how many puzzles a student gets wrong. This lack of information causes our data to be skewed by having many performance scores of 100%, without capturing the full gameplay. However, there are other features that we can use to tease out this information, like level time, as students who pass a level while also getting puzzles wrong will most likely take longer because they are doing more problems. We have finer-grained puzzle-level data, but it does not match up accurately with our level data. This means that while we can do studies on these datasets separately, we cannot combine them to have the full picture of what a student is doing during the level: which puzzles they see, if any puzzles are repeated during a level, how many puzzles they get right and wrong, and the time spent on each individual puzzle in a level. These finer granularities could offer valuable information on what a student is doing during a level and their performance compared to the whole student set.
CONCLUSION tions are generally within 1-2 levels of the actual completed This study aimed to use the least amount of student game- levels for the 20-minute time period. Having additional in- play data possible to predict which students would benefit formation, including finer-grained puzzle-level data, should from teacher intervention during the remainder of the game- also improve this prediction. play session. We tried two granularities of prediction for our analysis. We hypothesized that we could use one level’s Providing the teachers with a projected completed amount data (average of 3.5 minutes of gameplay) to predict the of levels allows us to give the teachers a list of the stu- next level’s outcomes, as this controls for content and dif- dents ranked by the number of levels they are predicted to ficulty, but this hypothesis was not confirmed. The lack of complete. This allows the teachers to use their expertise to fine-grained level attempt data might not allow us to make distinguish the higher- and lower-performing students dur- a good prediction. Our second hypothesis was that we could ing that game session, and, importantly, the teachers have use the first 6 minutes of gameplay (about 2 levels) to pre- the ability to make judgments about interventions accord- dict how many levels the student could complete in the next ing to their discretion. Currently, the teachers only have 20 minutes. This had a reasonable outcome with a MAE information on student progress in the overall game curricu- of 1.2 and RMSE error of 1.6, meaning that, on average, lum (which objectives each student has finished and how the prediction is only off by 1-2 levels, which is a good es- many levels have been completed). Additionally, the only timation of how many levels a student will complete. We method currently used to support students in seeking help believe this can provide a valuable resource for the teachers is the raised hand indicator, which has been shown to not who use ST Math in their classrooms, to help them con- always get the teachers’ attention due to its location on the centrate their time and energy on the students who need it students’ screens. We believe that incorporating this predic- the most. Furthermore, this method allows the teachers to tion into the system will be a valuable tool for teachers that have a certain level of judgment in regards to who needs the will suggest which students are struggling and allow them assistance, which is imperative in a system that is used in to decide if they need intervention. Giving teachers these multiple styles. Future work could investigate how this af- suggestions after only 6 minutes of gameplay time means fected the students’ performance if we gave this information that the teachers will have more control over the classroom to teachers. progress because they will have more time to help students get back on track instead of being behind for the entire ses- 9. ACKNOWLEDGMENTS sion and be able to slow down students who are getting too This research is made possible by support of the National far ahead of the class. Science Foundation under Grant No. 1544273. 10. ADDITIONAL AUTHORS analytics to inform digital curricular sequencing: Additional authors: Teomara Rutherford (North Carolina What math objective should students play next? In: State University, email: taruther@ncsu.edu). Proceedings of the Annual Symposium on Computer-Human Interaction in Play. pp. 195–204. 11. 
REFERENCES ACM (2017) [1] Abelson, R.P.: A variance explanation paradox: when [15] Peddycord-Liu, Z., Harred, R., Karamarkovich, S., a little is a lot. Psychological bulletin 97(1), 129 Barnes, T., Lynch, C., Rutherford, T.: Learning curve (1985) analysis in a large-scale, drill-and-practice serious [2] Ahadi, A., Lister, R., Haapala, H., Vihavainen, A.: math game: Where is learning support needed? In: Exploring machine learning methods to automatically International Conference on Artificial Intelligence in identify students in need of assistance. In: Proceedings Education. pp. 436–449. Springer (2018) of the eleventh annual International Conference on [16] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, International Computing Education Research. pp. V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, 121–130. ACM (2015) P., Weiss, R., Dubourg, V., et al.: Scikit-learn: [3] Backlund, P., Hendrix, M.: Educational games-are Machine learning in python. Journal of machine they worth the effort? a literature survey of the learning research 12(Oct), 2825–2830 (2011) effectiveness of serious games. In: 2013 5th [17] Prentice, D.A., Miller, D.T.: When small effects are international conference on games and virtual worlds impressive. Psychological bulletin 112(1), 160 (1992) for serious applications (VS-GAMES). pp. 1–8. IEEE [18] Romero, C., Ventura, S.: Educational data mining: a (2013) review of the state of the art. IEEE Transactions on [4] Calarco, J.M.: “I need help!” Social class and Systems, Man, and Cybernetics, Part C (Applications children’s help-seeking in elementary school. American and Reviews) 40(6), 601–618 (2010) Sociological Review 76(6), 862–882 (2011) [19] Rutherford, T., Farkas, G., Duncan, G., Burchinal, [5] Good, T.L.: Which pupils do teachers call on? The M., Kibrick, M., Graham, J., Richland, L., Tran, N., Elementary School Journal 70(4), 190–198 (1970) Schneider, S., Duran, L., et al.: A randomized trial of [6] Holstein, K., Hong, G., Tegene, M., McLaren, B.M., an elementary school mathematics software Aleven, V.: The classroom as a dashboard: intervention: Spatial-temporal math. Journal of co-designing wearable cognitive augmentation for k-12 Research on Educational Effectiveness 7(4), 358–383 teachers. In: Proceedings of the 8th International (2014) Conference on Learning Analytics and Knowledge. pp. [20] Ryan, A.M., Gheen, M.H., Midgley, C.: Why do some 79–88. ACM (2018) students avoid asking for help? an examination of the [7] Holstein, K., McLaren, B.M., Aleven, V.: Intelligent interplay among students’ academic efficacy, teachers’ tutors as teachers’ aides: exploring teacher needs for social–emotional role, and the classroom goal real-time analytics in blended classrooms. In: structure. Journal of educational psychology 90(3), Proceedings of the seventh international learning 528 (1998) analytics & knowledge conference. pp. 257–266. ACM [21] Sabourin, J.L., Shores, L.R., Mott, B.W., Lester, J.C.: (2017) Understanding and predicting student self-regulated [8] Jiang, S., Williams, A., Schenke, K., Warschauer, M., learning strategies in game-based learning O’dowd, D.: Predicting mooc performance with week environments. International Journal of Artificial 1 behavior. In: Educational data mining 2014 (2014) Intelligence in Education 23(1-4), 94–114 (2013) [9] Karumbaiah, S., Baker, R.S., Shute, V.: Predicting [22] Skinner, E.A., Belmont, M.J.: Motivation in the quitting in students playing a learning game. 
[13] Peddycord-Liu, Z., Cateté, V., Vandenberg, J., Barnes, T., Lynch, C.F., Rutherford, T.: A field study of teachers using a curriculum-integrated digital game. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. p. 428. ACM (2019)
[14] Peddycord-Liu, Z., Cody, C., Kessler, S., Barnes, T., Lynch, C.F., Rutherford, T.: Using serious game analytics to inform digital curricular sequencing: What math objective should students play next? In: Proceedings of the Annual Symposium on Computer-Human Interaction in Play. pp. 195–204. ACM (2017)
[15] Peddycord-Liu, Z., Harred, R., Karamarkovich, S., Barnes, T., Lynch, C., Rutherford, T.: Learning curve analysis in a large-scale, drill-and-practice serious math game: Where is learning support needed? In: International Conference on Artificial Intelligence in Education. pp. 436–449. Springer (2018)
[16] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)
[17] Prentice, D.A., Miller, D.T.: When small effects are impressive. Psychological Bulletin 112(1), 160 (1992)
[18] Romero, C., Ventura, S.: Educational data mining: a review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40(6), 601–618 (2010)
[19] Rutherford, T., Farkas, G., Duncan, G., Burchinal, M., Kibrick, M., Graham, J., Richland, L., Tran, N., Schneider, S., Duran, L., et al.: A randomized trial of an elementary school mathematics software intervention: Spatial-Temporal Math. Journal of Research on Educational Effectiveness 7(4), 358–383 (2014)
[20] Ryan, A.M., Gheen, M.H., Midgley, C.: Why do some students avoid asking for help? An examination of the interplay among students' academic efficacy, teachers' social-emotional role, and the classroom goal structure. Journal of Educational Psychology 90(3), 528 (1998)
[21] Sabourin, J.L., Shores, L.R., Mott, B.W., Lester, J.C.: Understanding and predicting student self-regulated learning strategies in game-based learning environments. International Journal of Artificial Intelligence in Education 23(1-4), 94–114 (2013)
[22] Skinner, E.A., Belmont, M.J.: Motivation in the classroom: Reciprocal effects of teacher behavior and student engagement across the school year. Journal of Educational Psychology 85(4), 571 (1993)
[23] Wolff, A., Zdrahal, Z., Herrmannova, D., Kuzilek, J., Hlosta, M.: Developing predictive models for early detection of at-risk students on distance learning modules. In: Machine Learning and Learning Analytics Workshop at the 4th International Conference on Learning Analytics and Knowledge (LAK14). pp. 24–28 (2014)
[24] Wolff, A., Zdrahal, Z., Nikolov, A., Pantucek, M.: Improving retention: predicting at-risk students by analysing clicking behaviour in a virtual learning environment. In: Proceedings of the Third International Conference on Learning Analytics and Knowledge. pp. 145–149. ACM (2013)