=Paper=
{{Paper
|id=Vol-2592/paper8
|storemode=property
|title=How Long is Enough? Predicting Student Outcomes with Same-Day Gameplay Data in an Educational Math Game
|pdfUrl=https://ceur-ws.org/Vol-2592/paper8.pdf
|volume=Vol-2592
|authors=Rachel Harred,Christa Cody,Mehak Maniktala,Preya Shabrina,Tiffany Barnes,Collin Lynch
|dblpUrl=https://dblp.org/rec/conf/edm/HarredCMSBL19
}}
==How Long is Enough? Predicting Student Outcomes with Same-Day Gameplay Data in an Educational Math Game==
Rachel Harred, North Carolina State University, rlharred@ncsu.edu
Christa Cody, North Carolina State University, cncody@ncsu.edu
Mehak Maniktala, North Carolina State University, mmanikt@ncsu.edu
Preya Shabrina, North Carolina State University, pshabri@ncsu.edu
Tiffany Barnes, North Carolina State University, tmbarnes@ncsu.edu
Collin Lynch, North Carolina State University, cflynch@ncsu.edu
ABSTRACT
Curriculum-integrated games can provide teachers with data to help them decide when and how to intervene with individual students. Based on our prior work observing teachers using ST Math, teachers may not be able to attend to a dashboard or student screens to determine who might need intervention. We therefore set out to determine how much data we need from the current ST Math gameplay session to predict performance. Based on the available log data that tracks student performance over sets of puzzles, we performed two experiments to predict performance. The first uses data from one game level, which is about 3 minutes long, to predict the performance on the next level, and the second uses the first 6 minutes of gameplay to predict how many levels a student can complete in 20 minutes, a typical class length. Our results show that our data are not fine-grained enough to allow for paired level prediction, but that 6 minutes of gameplay can be used to rank students in order of performance for a class session. These results can be used as a basis for an alert system that could help teachers prioritize their time in the classroom.

1. INTRODUCTION
Educational games can be a useful tool for teachers to provide additional practical learning for students [3]. As more educational games become curriculum-integrated, a significant portion of a student’s time can be spent in these systems. However, teachers cannot monitor and assist each student at the same time, and they struggle to identify the students who need help the most. In previous work, we observed that teachers’ assistance was often influenced by things such as classroom layout and disruptive behavior rather than learner proficiency or needs [13]. Furthermore, the work identified that students who “struggled quietly” often went unnoticed. In other work, the authors found that when students possibly need intervention but do not receive it, they might get frustrated and give up or replay an easier game instead [9]. Other research has also shown that teachers can often unintentionally favor or give assistance to certain types of students due to differences in perceptions or help-seeking behaviors [4, 22, 5]. Therefore, providing teachers with information to help them determine who needs assistance the most may be crucial to some low-performing students.

Despite the amount of data gathered with each playthrough, teachers in our system are only provided with a student’s current progress in the curriculum and a feature that allows a student to “raise” their hand through the system. However, this is only visible on the student’s screen via a purple hand indicator and often goes unnoticed. Therefore, we sought to determine if there was a way to provide teachers with knowledge regarding students’ projected progress as fast as possible, so that the teachers can determine who to help from there. With the machine learning techniques that can process such data and help predict outcomes, we wanted to find the correct technique to answer our question. Machine learning and educational data mining techniques have been successfully used in educational game research for many years [11, 21, 10, 18].

In this paper, we tried to determine the smallest amount of time needed to predict student outcomes for one gameplay session by investigating multiple feature selection algorithms and prediction models on student gameplay data for an educational game, Spatial Temporal Math (ST Math). We tried two methods of prediction using data analysis and machine learning: 1) trying to predict student outcomes for playing one level of a game using gameplay data from only the previous level, and 2) using the least amount of time of a student’s gameplay data to predict the number of levels they will pass in the next twenty minutes of gameplay. To accomplish this, we tried various machine learning and feature selection methods to find the most significant features needed to predict student outcomes in this educational game. In this study our intention was to give insight to the teachers of ST Math by indicating our best guess for which students could most benefit from teacher intervention on a single day, provided early enough in the gameplay session to allow the teacher to help as many as possible.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1.1 Spatial Temporal Math (ST Math)
ST Math is a curriculum-integrated supplemental mathematics game for 2nd-4th-grade students that uses spatial puzzles to teach basic math concepts [19, 12, 14, 15]. The puzzles do not contain any textual instruction. The games are grouped at the highest level by objective, which indicates the broad math concept. Each objective contains a number of games; the gameplay varies across the games in an objective, but all of them concern the same content. The games usually have between 3 and 5 levels each, and the gameplay across levels is similar but increases in difficulty. There are usually between 6 and 8 puzzles per level. The puzzles are either randomly generated using a template or randomly selected from pre-designed puzzles, depending on the level. Each puzzle requires the student to do the correct action to indicate their answer. Animated feedback is presented to the student following the puzzle solving attempt that shows the student if they are correct or incorrect. For example, in the game “Fair Sharing” under “Division Concepts”, a student is asked to distribute boxes equally among animals to construct a straight bridge, and in case of an incorrect answer the game shows the bridge blocked off or with gaps that make it impossible to cross. A level begins with a set number of lives, usually 2, that resets at the beginning of each level. If a student’s response is incorrect, the student loses a life. If the student loses all of their lives before completing the level, they do not pass the level and must retry it. To pass a level, the student must complete all puzzles without losing all their lives. After a student passes a level, they may move on to the next level in the game or objective, or backtrack and play a previously passed level. We refer to this backtracking as replay. A level attempt includes passing a level, failing a level, and replay. Each student has the option to “raise their hand” if they want help from their teacher by clicking on a hand icon on the screen. The teacher can also see which objectives and levels a student has passed. See Figure 1 for a breakdown of ST Math.

Figure 1: ST Math

2. RELATED WORK
Educational games in classrooms are helpful to teachers because the students can receive individualized attention and learning from the game while the teacher gives one-on-one attention to students who need it [7]. However, teachers have limited time and try to prioritize their attention to the students who need it most. It has been shown that students who are given attention by teachers show increased learning [5]; therefore, teachers who are able to focus their attention on low-performing students should see them benefit.

Unfortunately, there are many reasons that students who need help do not receive it. One study found that middle-class students seek help more directly than working-class students and end up getting more help as a result [4]. In a recent study of classroom observations using ST Math, researchers found that differences in classroom format have an influence on who receives help [13]. Furthermore, this work found teachers in free-seating classrooms could not easily see the raised hand indicator on student screens and so those help-seeking students went unnoticed. In classrooms that use rotation-seating, teacher attention was only given if the ST Math group was being disruptive. In general, the students who directly asked for help or were obviously off-task received more teacher intervention than the students who were less vocal [13].

The task of automatically identifying students who need help has been explored [2]. With machine learning methods that can process large amounts of data and make predictions, many researchers have used these methods in an attempt to solve the problem. Ahadi et al. explored machine learning techniques and were able to use the first week of data in a programming course to predict student performance with accuracy ranging from 71-80%. Additionally, Jiang et al. employed logistic regression models to predict the type of certificate a learner received in Massive Open Online Courses (MOOCs) [8]. In a study at the Open University, decision tree models were implemented on users’ current and previous activity to predict if they were at risk of failing a module [24]. Another Open University study explored using Bayesian models to build real-time predictive models from student data and found little difference among the types of models but that the accuracy increased with the addition of data throughout the progression of the learning module [23]. These studies show that machine learning can be used to predict and understand student behavior, but such models are not being used to directly aid in student learning.

Predictive models are now being integrated into teacher dashboards or alert systems to enable this aid. In a survey of K-12 teachers who used intelligent tutoring systems in class, Holstein et al. found that they were very interested in having real-time classroom monitoring tools that would help them decide which students most needed their attention [7]. In another study, Holstein et al. developed a teacher alert system using smart glasses that showed real-time indicators of student behaviors floating above their heads [6]. Their early findings suggest that this helped direct the teachers to the students who needed intervention the most [7].
3. DATA
The data is collected by MIND Research Institute, who created ST Math. This study was conducted on data from 3rd grade students who played ST Math during the 2016-2017 school year. The data contains 31 objectives, 154 games, and 669 levels, with a total of 5,186,269 level attempts by 8,983 students from 111 schools and 636 teachers. We excluded students who completed objectives not contained in the 3rd grade objectives, which removed 21,544 level attempts. These are students who might have been erroneously included from other grades, as the system is used for grades 3-5. For the purposes of our study, we filtered out level attempts where the student only completed one level attempt in our 30 minute gameplay session cutoff. This removed 101,849 level attempts, leaving us with a final dataset of 5,062,876 unique level attempts. The initial data we were provided with had 6 features for each level attempt: STMathID (unique ID for each student), Level, Objective Code, Timestamp, Number of Correct Puzzles, and Total Number of Puzzles. For the purpose of our analysis, we created additional features shown in Table 1.

Table 1: Features Created for the Analysis
Feature | Description
game | The name of the game this level is in, extrapolated from the Objective Code and Level
performance | Number of puzzles correct before losing all lives divided by total puzzles in level
isReplay | 1 if this level has been attempted before, otherwise 0
isUnneceReplay | 1 if the level is replay, otherwise 0
prevPassAttemptSameLevel | For replay: 1 if it is on the same level played in previous row, 0 if not, NA if not replay
passedCurrentLevelB4Unnece | 1 if the student passed the level in the previous row (their “current” level) AND this row is replay, 0 if not, NA if not replay
sameObj | 1 if current level and previous are the same objective
levelPlayTimeSec | Timestamp - previous row’s timestamp
gameplaySession | Numbered starting at 1 and incrementing if the levelPlayTimeSec > 30 minutes (1800s). Rows with the same numbered gameplaySession are assumed to happen during a single “play session”
prevPerf | Average of previous gameplay session’s performance
firstInGameplaySession | 1 if this level attempt is the first in a new gameplay session, otherwise 0
lastInGameplaySession | 1 if this level attempt is the last in a gameplay session, otherwise 0

4. METHOD
We wanted to explore different ways of providing teachers with information about student progress, so that they could intervene and monitor progress according to their own classroom goals. Therefore, our intention is to predict the projected progress of a student using the least amount of information and using only same-day gameplay data. Here, we are comparing two methods of data segmentation: predicting the time spent passing the next level based on the time spent passing the current level, and using the least amount of gameplay to predict how many levels would be passed in the next 20 minutes of gameplay. We are attempting to see if a student’s performance in the beginning of a gameplay session is a good predictor of their later performance. A gameplay session is defined as subsequent level attempts that are separated by less than 30 minutes. If two level attempts are separated by 30 minutes or longer, we count the second attempt as a new gameplay session. We decided on a 30 minute cutoff because a pause in the game of 30 minutes or more might indicate the student was working on something else in between and we cannot say that the previously played level will have any effect on the performance of the next level. However, we still want to include students who may be truly struggling, receiving help from a teacher or giving help to another student during the playthrough of a level. Only 3.5% of the data for the gap between play sessions was between 30 minutes and 20 hours, while 11.6% of the data was 20 hours or longer between gameplay sessions.

5. EXPERIMENT 1: PAIRWISE PREDICTION
This section details our attempt at using previous level data to predict student outcomes for the next level attempt.

5.1 Pairwise Method
We wanted to see if we can predict how well a student will do on the next level by using only the data from the previous level in the prediction. Our aim in this experiment was to investigate if the features of a single level for a student can predict whether the student would have needed an intervention for the next level. This would be beneficial to teachers because it would give an alert immediately after a student finishes a level that would tell them that the student might need help on the next level. We used pairwise prediction, so the data was grouped into Level A and Level B, with the constraint that Level A and Level B had to be in the same objective and the same gameplay session. We considered Level A to be the first attempt on a new level, all subsequent attempts (retries) until the level was passed, and any replayed levels that happened before or after the level was passed. Any gameplay that happened between the first attempts of two consecutive levels would be counted as Level A data. Level B is the next attempt on a new level after Level A, and contains the same information as Level A: number of attempts, retries, replays before and after passing, up until the next new level attempt. We expected the number of attempts to pass a level to provide information about how many attempts they will need for the next level because the levels increase in difficulty inside objectives and this is consistent through ST Math. Also, research has shown that
replay that happens before passing a level results in a negative effect on performance while replay that happens after passing a level has been shown to have a positive effect [12]; therefore, we expected this data to also be useful for the prediction.

Due to time constraints and the complexity of the feature creation, we used a subset of our total dataset for this analysis. The dataset includes 830 students, and 665 unique objective-level pairs. Objective-level pairs are level pairs within an objective. However, some students did not complete all the objective-levels, resulting in a total of 277,975 unique student objective-level pairs.

5.1.1 Pairwise Feature Generation
Since the raw data only included gameplay aspects per attempt such as time taken, attempt performance and the kind of attempt (retries, replay, etc.), we engineered 33 additional features for every student objective-level pair. There are 7 attempt categories and 5 metrics per each category. The 7 attempt categories are as follows: overall level attempts, total retry and replay attempts, retry attempts, total replay attempts, total replay attempts before passing the current level, total replay attempts after passing, replay attempts of the same level (current) after passing, and replay attempts of other levels after passing. The 5 metrics for each category are as follows: whether an attempt category occurred (except for overall attempts category), total number of attempts (except for overall attempts category), total time, average time, and average performance.

Next, for each student, we identified the consecutive levels (Level A, Level B) within each objective that were completed in the same session. We found a total of 222,258 such level pairs for 830 students. We explored four different ways to define an intervention: if the total time was greater than the 75th percentile (I-TotalTime), if the average time was greater than the 75th percentile (I-AvgTime), if the average performance was less than the 25th percentile (I-AvgPerf), and if the student could not finish a level in the first attempt (I-FirstAttempt). Each of these intervention types was intended to capture a different aspect of a student’s ability to complete a level.

5.1.2 Pairwise Feature Selection and Prediction Models
The analysis was carried out in Python. We normalized the time-related features and then explored three feature selection techniques in the scikit-learn [16] package. We used a pipeline of a feature selection wrapper method, SelectFromModel, with models such as LinearSVC (Linear Support Vector Classifier with L1 loss), LassoCV (Lasso linear model with 3-fold Cross Validation), and Logistic Regression. We used 7 classifiers: KNN (n = 3), LinearSVC (Linear Support Vector Classifier), Decision Tree (using Gini index), Random Forest (using Gini index), MLP (Multi-layer Perceptron classifier with a 2-layer (100,100) neural network using a learning rate of 0.001 and ReLU activation function), AdaBoost, and Naive Bayes, and measured the prediction accuracy using 10-fold Cross Validation.

Table 2: Features Selected using Linear SVC (with L1 loss) for Pairwise Prediction
Feature | Description | Mean (mode for Binary) | SD (occurrence for Binary)
retryOrReplayBinary | 1 if retryAndReplaySum>0, otherwise 0 | 0 | 0: 181,804; 1: 40,454
replayBinary | 1 if replaySum>0, otherwise 0 | 0 | 0: 221,981; 1: 277
retryAndReplayTimeSum | Total Time Spent on Retries and Replays | 65.78 | 200.18
avgTimePerLevel | Average Overall time | 225.49 | 175.23
avgPerfTotal | Average Overall Performance | 0.94 | 0.16
avgRetryAndReplayPerf | Average Replay Performance | 0.16 | 0.35
avgRetryPerf | Average Retry Performance | 0.16 | 0.35

5.2 Pairwise Results & Discussion
The feature selection based on LinearSVC with L1 loss and the Random Forest classifier provided the best prediction accuracy. Table 3 shows the results of using a Random Forest classifier for each intervention type. We observed that the average time spent on Level A was selected for each intervention-type target for Level B. The distribution parameters of the features selected are shown in Table 2. We observed that very few features related to the replays were selected. This may be because the consecutive level pair dataset recorded very few rows with retries or replays (18.20%) and even fewer when considering only replays (0.12%). Such a high degree of sparsity in replay made any replay related features not significant enough to contribute towards the predictions. Another interesting observation is that the performance over all the attempts (avgPerfTotal) in Level A was not a significant predictor of the intervention for Level B in any of the models, primarily because of the small variance recorded for this feature. The small variance is due to the granularity of the data, which records passed level attempts that include failed puzzles as 100% performance. On the other hand, the average time in Level A (avgTimePerLevel) was a significant predictor of the intervention for Level B for every intervention-type target. We recorded few replays in general and a low variance in the performance related features, so only the time related features were varied enough to capture the relationship between Level A and Level B. The results suggest that the average time spent on an attempt in a level is the most significant predictor of whether a student may need assistance in the next level. However, the classifier models for each intervention type did not perform significantly better than a baseline classifier that would predict all the observations to be the Majority Class (the class
Table 3: Features Selected based on Linear SVC with L1 loss and Prediction Accuracy with Significant Predictors using a Random Forest classifier for Each Intervention Type Target
Row | I-TotalTime | I-AvgTime | I-AvgPerf | I-FirstAttempt
Features Selected | avgTimePerLevel | avgTimePerLevel | avgTimePerLevel | avgTimePerLevel
Features Selected (continued) | avgPerfTotal avgPerfTotal avgPerfTotal
Features Selected (continued) | retryOrReplayBinary avgRetryAndReplayPerf retryOrReplayBinary
Features Selected (continued) | retryAndReplayTimeSum replayBinary
Features Selected (continued) | avgRetryPerf
Significant Predictor | avgTimePerLevel: 0.7 | avgTimePerLevel: 0.93 | avgTimePerLevel: 0.98 | avgTimePerLevel: 1.00
Significant Predictor (continued) | retryAndReplayTimeSum: 0.17
Majority Class | 76.64% | 76.18% | 99.59% | 75.32%
Prediction Accuracy (K=10) | 77.20% | 77.21% | 99.60% | 76.49%
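
As a rough illustration of the comparison reported in Table 3, the following Python sketch chains scikit-learn's SelectFromModel (wrapping a LinearSVC) with a Random Forest and compares its 10-fold cross-validated accuracy against a majority-class baseline. It is not the authors' code: the data are synthetic, the variable names and the I-AvgTime-style target construction are illustrative assumptions, and the "L1 loss" mentioned in the text is assumed here to mean an L1 penalty (penalty="l1"), which is what produces sparse coefficients for selection.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_pairs = 2000  # synthetic (Level A, Level B) pairs, not the real dataset

# Hypothetical Level A features, named after the columns in Table 2.
avg_time_a = rng.gamma(shape=2.0, scale=110.0, size=n_pairs)  # avgTimePerLevel
avg_perf_a = rng.uniform(0.5, 1.0, size=n_pairs)              # avgPerfTotal
retry_or_replay = rng.integers(0, 2, size=n_pairs)            # retryOrReplayBinary
X = np.column_stack([avg_time_a, avg_perf_a, retry_or_replay])

# Synthetic Level B average time, only loosely tied to Level A time.
avg_time_b = 0.3 * avg_time_a + rng.gamma(shape=2.0, scale=80.0, size=n_pairs)

# I-AvgTime-style target: 1 if Level B time is above the 75th percentile.
y = (avg_time_b > np.percentile(avg_time_b, 75)).astype(int)

# Normalize, select features with an L1-penalized LinearSVC (assumption),
# then classify with a Gini-based Random Forest.
model = make_pipeline(
    StandardScaler(),
    SelectFromModel(LinearSVC(penalty="l1", dual=False)),
    RandomForestClassifier(n_estimators=50, criterion="gini", random_state=0),
)

# Baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")

model_acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
baseline_acc = cross_val_score(baseline, X, y, cv=10, scoring="accuracy").mean()
print(f"random forest: {model_acc:.3f}  majority class: {baseline_acc:.3f}")
```

Because a 75th-percentile target places about three quarters of the pairs in the negative class, the majority-class baseline in this sketch scores close to 75% by construction, which mirrors the Majority Class row of Table 3 for the percentile-based time targets.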
containing more students) as shown in Table 3. This suggests that the relation between the behavior of students in two consecutive levels may be highly varied and that it is difficult to generalize whether an intervention is needed in a level based on only one previous level. It may also suggest that such a prediction may be dependent on how far along students are in their academic year. To investigate the first scenario, we added a feature for a student’s previous performance average, an average of every level attempt until now, in an attempt to help distinguish low-performing students from the rest. Previous performance average was selected for each intervention type prediction but had lower feature importance (0.03%) because of the low variance and, therefore, did not affect the prediction accuracy. To investigate if the time of the academic year had any impact on the prediction, we added a feature to indicate the month in which the sessions occurred. Similar to the previous performance average, this feature was selected but had a low feature importance (0.05%), leading to an insignificant difference in the prediction accuracy.

Since only time related features were varied enough to capture the variance in the student behavior in two consecutive level pairs, we explored ways other than feature generation to perform the pairwise prediction. We sliced the data based on aspects such as replay type or month of the year but, again, obtained similar prediction accuracies; however, the replay related features did get selected and had high importance for the prediction in the data sliced by replay type. To investigate if the variance in the content of the objectives may be affecting the prediction results, we performed the prediction for each intervention type within a single objective and observed that the prediction accuracy decreased slightly. This suggests that even within one objective, the behavior of a student in one level, as captured in its current granularity, may not accurately predict if they need an intervention in the next level.

The pairwise prediction models may not have generated desirable results because we may need more than just one previous level’s data to predict if an intervention is needed. There is not sufficient data about each level in this dataset to accurately represent the student’s performance and to create good predictions. Therefore, having more fine-grained details about level attempts, including knowing more about how the levels compare to each other within each objective, may improve the prediction accuracy.

6. EXPERIMENT 2: LEVELS COMPLETED IN 20 MINUTES
We chose to determine if we could predict the number of levels completed in 20 minutes using only information from the current session. Using only information from the current session will allow an easier integration with the current system with minimal changes needed. Different schools and classrooms have unique ways of using ST Math [13]. As a result, there is a variety of session times ranging from very short (less than 5 min) to sessions lasting over an hour, with an average of 23 minutes spent in a session. Therefore, we decided to predict how many levels a student would complete in a 20 minute session. This information could be used by teachers to identify students who will not be able to complete the number of levels the teacher expects for that session, so that the teacher can intervene to assist or encourage. With this prediction, the system could provide a teacher with each student’s predictions and order the students by the lowest predicted number of levels to complete in the next 20 minutes. Then, the teacher can easily look at the slowest students and, based on their knowledge of each student and what goals the teacher has for that lesson, determine who they need to assist. Studies have shown that teachers may focus assistance on students with better help-seeking behaviors because they are often more persistent or better at requesting help [4, 20]. Providing this information this early in the session could be crucial for low-performing students who are not asking for help or not doing so effectively.

6.1 Levels Completed Prediction: Methods
The data we used consisted of 787,949 session observations from 8,978 unique students, 111 schools, and 636 teachers. A session represents a period of time that the student spends working on ST Math without taking longer than a 30 minute break (see Section 4 for full definition). For accurate predictions, we chose to use the first 6 minutes of gameplay due to the average level attempt taking approximately 3 minutes. We refer to this segment of data used for prediction as a
time “slice”. Since our goal was to use the least amount of information to do the prediction, we wanted this time to be as short as possible. We initially attempted to use shorter time slices, but due to a level attempt taking on average 3 minutes, this did not provide a sufficient amount of data to represent the students’ gameplay behavior and, in some cases, eliminated slower students’ data for that time slice.

Next, we removed sessions under 10 minutes (242,750 obs., 169,863 of them under 6 min) and sessions over 75 minutes (10,257 obs.). We chose these cutoffs to eliminate short sessions where predictions would not be useful and long sessions, in some cases over 4 hours, that were likely anomalous.

Table 4 shows the statistics for session and slice features. The average session time was 28 minutes and the average slice length was 4.5 minutes.

Table 4: Session and Time Slice Stats for the time, performance, and levels completed
Group | Feature | Mean (SD) | Mdn
Session | Levels per 20 min | 3.8 (2.5) | 3.5
Session | Total Time (min) | 28.9 (26.3) | 816.2
Session | Avg Performance | 74.9% (23.4) | 81.3%
Slice | Levels Completed | 1.9 (1.1) | 1
Slice | Total Time (min) | 4.5 (0.9) | 4.6
Slice | Avg Performance | 72.6% (34.0) | 100

6.1.1 Levels Completed Prediction: Feature Generation
From the level data, the data was segmented into the first 6 minute time slice for prediction. Features were aggregated from this 6 minute time slice to capture what each student was able to do, such as complete a level, fail a level, retry a level, or engage in replay. The features generated are based on performance, time, level attempts/replay features, and the objective the student was in within that slice. The performance features were: average performance (numeric), percentage of levels passed out of all attempts (numeric), and percentage of levels completed out of all attempts (numeric). The time features from the slice were: the total time (numeric), the average level time (numeric), number of passed levels per time (numeric, scaled), number of completed levels per time (numeric, scaled), and the month of the session. The level attempt features were: total number of replays (numeric), total number of levels failed (numeric), total number of levels passed (numeric), total number of levels completed (numeric), total puzzles attempted (numeric), total puzzles completed (numeric), whether they engaged in replay (binary), and whether they re-attempted a level (binary). Then, there were 31 binary features representing the objective the student was playing when the session started.

6.1.2 Levels Completed Prediction: Model Selection and Feature Selection
For this prediction, we tried a variety of models, tuning of these models, and alteration of the target variable. Model and feature selection was accomplished by using scikit-learn [16].

For the models, we tried both classification and regression. For classification, multiple groupings of the number of completed levels in 20 minutes were chosen using balanced classes; the best accuracy (77% using a 2-layer neural network) came from a split that determines if a student could complete at least an average number of levels in 20 minutes. However, we decided that regression, providing finer-grained predictions, would provide more useful information to the teachers and allow them to have more autonomy in deciding which students need help. For regression, we tried to predict how many levels a student could complete in 20 minutes, which was derived by taking the total number of levels completed and the total time of the session and scaling.

We tried multiple models, including decision trees, neural networks, and random forests after normalizing the features. Interpretable machine learning methods are particularly important here because knowing which features are most influential in predicting performance can give insights into how students learn in games. In the results, we show the 2 best models compared to a baseline. The baseline model is created by always predicting the mean of the completed levels per 20 minutes. The best models were created by testing multiple models and fine-tuning the parameters. The two best models are created from scikit-learn: a 3-layer (50,30,20) neural network (MLPRegressor) using a learning rate of 0.001 and ReLU activation function, and a Random Forest (RandomForestRegressor) using mean squared error as the criterion function and setting the minimum samples for a split to be 20.

To evaluate each model, we used the following metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), adjusted R-squared score, and explained variance (EV). We chose these metrics to evaluate, on average, how accurate each prediction was, in order to determine if the error is small enough to still provide a good estimate of the students’ projected progress. Both adjusted R-squared and EV were used to evaluate the variance of these errors and check for biases within the models. All models were evaluated with 10-fold cross validation.

We attempted feature selection using scikit-learn filter methods, such as feature importance from tree regression, and a wrapper method (SelectFromModel) with each model. Feature selection did not improve any models, and resulted in significantly worse predictions in most cases. This is most likely due to the limited number of features available. Because our data is not fine-grained, we have a limited amount of information about each student for a level attempt. This indicates that each feature could be providing key information regarding their current progress. Therefore, for the best models the whole feature set was used (see Section 6.1.1, Feature Generation).

6.2 Levels Completed Prediction: Results & Discussion
This section discusses the results of the Experiment 2 regressions.

Table 5 shows the results of the evaluation metrics for the two best regression models compared to the baseline model. The NN and the Random Forest both perform similarly, both outperforming the baseline model. Although the difference in MAE is not large, the RMSE is much lower. This indicates that the variance of the errors is significantly smaller for our predictive models. The MAE of 1.2 for our predictive models means that on average the prediction will only be around 1 level off for a specific student, which still provides a good estimation for the teacher to use. The adjusted R-squared and explained variance are almost identical for both models, which happens when the mean of the errors
is approaching zero. Although these scores are not perfect,
in the context of educational data from a system used with
multiple teaching styles, this is a highly meaningful result[1,
17].
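The link between R-squared, explained variance, and the mean error noted above can be made explicit with a short derivation. This is a sketch using the standard population-variance definitions of these scores; the notation is introduced here for illustration and does not appear in the original analysis.

```latex
% Notation (assumed for this sketch): y_i = actual levels completed,
% \hat{y}_i = predicted levels, e_i = y_i - \hat{y}_i, \bar{e} = mean error,
% n = number of sessions, p = number of features.
\begin{align*}
R^2 &= 1 - \frac{\tfrac{1}{n}\sum_{i} e_i^2}{\operatorname{Var}(y)}, \qquad
\mathrm{EV} = 1 - \frac{\operatorname{Var}(e)}{\operatorname{Var}(y)}, \\
\tfrac{1}{n}\sum_{i} e_i^2 &= \operatorname{Var}(e) + \bar{e}^{\,2}
\;\Longrightarrow\;
R^2 = \mathrm{EV} - \frac{\bar{e}^{\,2}}{\operatorname{Var}(y)}, \\
R^2_{\mathrm{adj}} &= 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1}.
\end{align*}
```

Under these definitions R-squared can never exceed EV, and the two coincide exactly when the mean error is zero, that is, when the predictions are unbiased on average. The adjusted score stays close to R-squared whenever n is large relative to p.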
Figure 2: These density plots show the predicted vs. actual values for the two best regression models to predict the number of completed levels in 20 minutes. Note: the yellow/lightest areas represent the highest density of points.

Table 5: Results for the 2 best regression models compared to a baseline (mean).
Model     MAE   RMSE  R2       EV
Forest    1.24  1.59  0.58     0.58
NN        1.22  1.58  0.59     0.59
Baseline  1.96  2.48  -1.2E-5  0.0
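The metrics reported in Table 5 can be computed from a vector of actual and predicted values as in the following minimal sketch. It uses only the Python standard library rather than the scikit-learn functions used in the study. The function name `regression_metrics`, the toy values, and the choice of two features are hypothetical and serve only as illustration; the baseline here predicts the mean of the actual values, which is one plausible reading of "baseline (mean)".

```python
import math
from statistics import fmean, pvariance


def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, RMSE, R2, adjusted R2, and explained variance (EV)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = fmean([abs(e) for e in errors])
    mse = fmean([e * e for e in errors])
    var_y = pvariance(y_true)
    r2 = 1.0 - mse / var_y
    # Adjusted R2 penalizes the score for the number of input features.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    # EV ignores any constant bias in the errors; R2 does not.
    ev = 1.0 - pvariance(errors) / var_y
    return {"MAE": mae, "RMSE": math.sqrt(mse), "R2": r2,
            "adj_R2": adj_r2, "EV": ev}


# Hypothetical toy data: levels completed in 20 minutes for ten students.
y_true = [2, 3, 5, 4, 8, 1, 6, 3, 7, 11]
y_pred = [3, 3, 4, 5, 7, 2, 6, 4, 6, 9]          # invented model output
y_base = [fmean(y_true)] * len(y_true)           # mean baseline

model_scores = regression_metrics(y_true, y_pred, n_features=2)
baseline_scores = regression_metrics(y_true, y_base, n_features=2)

for name, scores in (("Model", model_scores), ("Baseline", baseline_scores)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

A mean baseline scores R2 and EV of essentially zero by construction, so any positive value for a trained model reflects variance explained beyond simply predicting the average.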
Table 6 shows the top five most important features for decisions in the Random Forest. All of these top features focus on the number of completed levels, the total time, or a combination of these features. This is not surprising because the number of levels a student can complete in the first 6 minutes should be a good indication of how they will perform over the whole session. However, this assumes that the students remain seated and playing the game in the same manner.

Table 6: The top five most important features from the Random Forest
Rank  Feature
1     Total Levels Completed
2     % of Levels Completed
3     Completed Levels per Time
4     Total Time
5     Average Level Time

Figure 2 shows a density plot of the predicted values versus the actual values, the yellow/lightest color being the highest density. Both plots show that the highest-density areas lie close to the actual values. The Neural Network appears to have a higher density closer to the line and the points appear to be more compact, although both models show similar predictions. Both figures are zoomed in to focus on the lower level number predictions, although few points have values higher than 10. Note that both models fit the higher values less well and tend to predict around 10 when the actual value is 10 or more. However, we are mostly concerned with students who are completing very few levels. If a student falls into the 10+ range of levels completed, the actual value becomes less important because it is so far above the average. A teacher will still be able to use this information to identify over-performing students and ensure they don’t get too far ahead of the class.

With the variability of how this system is used, the models’ evaluations are a positive result. For example, during field observations of the system, we found many teachers asked students ahead in the curriculum to help students next to them during a gameplay session. Therefore, a student may spend part of the session working as normal, then, after the teacher has identified a struggling student, the teacher may ask the student next to them to help. This could result in much higher predicted values than what the student actually completes. Furthermore, we observed students in some classrooms initially talking and working at a slower pace in the first few minutes of a session as they settled in, then shortly being asked to focus. This could result in much lower predictions of the projected number of levels that student could complete. Since the data does not only include the sessions where students quietly work by themselves for a continuous period of time, accurate predictions are difficult.

Furthermore, the data we used focused on only the same-day gameplay data, not containing any information regarding how a student has previously performed in other sessions. This decision was made to limit the changes required to implement this system in the game. However, including prior information may improve predictions. One possible way to control for the effect of the different teaching styles is to include teacher or school information in the model. However, this would create very sparse features due to the large number of teachers and schools that use the system. A future attempt could identify and categorize the teachers or schools based on similar styles and add those features to the models.

This prediction can be used in two main ways: identifying the lowest performing students who may need assistance, but may not be requesting help, and identifying students who may be working too fast and getting ahead in the curriculum. The second usage may not seem like an issue, but having a large knowledge gap between students may make a classroom harder to manage and teach. This is a problem teachers seek to avoid in ST Math that they have remedied by asking those students to help others or by allowing them to play games while others catch up [13]. For ease of use, these predictions could be provided in a simple list with each student’s name and the predicted number of levels they will complete in 20 minutes. Furthermore, the top 5 lowest and highest predictions could be presented at the top of the interface so teachers could quickly have an idea of who is struggling and who may need to be slowed down. Because
teachers already have access to where each student is in the curriculum, the teacher can use their expertise and knowledge of the students to make judgement calls on what to do from there. A mock interface of how this could be presented can be seen in Figure 3.

Figure 3: Mock interface showing how teachers would view students’ predictions

7. OVERALL DISCUSSION
The results for the levels completed experiment were more promising than the pairwise experiment. For the pairwise prediction, the lack of fine-grained puzzle level data made it difficult to predict whether a student may need intervention based only on their previous level’s data. We believe the results for this method of pairwise prediction might improve with more data about how the objectives, games, and levels relate to each other. On the other hand, the prediction model from the levels completed experiment had decent results with the MAE and RMSE indicating that the predictions are generally within 1-2 levels of the actual completed levels for the 20-minute time period. Having additional information, including finer-grained puzzle-level data, should also improve this prediction.

Providing the teachers with a projected number of completed levels allows us to give the teachers a list of the students ranked by the number of levels they are predicted to complete. This allows the teachers to use their expertise to distinguish the higher- and lower-performing students during that game session, and, importantly, the teachers have the ability to make judgments about interventions according to their discretion. Currently, the teachers only have information on student progress in the overall game curriculum (which objectives each student has finished and how many levels have been completed). Additionally, the only method currently used to support students in seeking help is the raised hand indicator, which has been shown to not always get the teachers’ attention due to its location on the students’ screens. We believe that incorporating this prediction into the system will be a valuable tool for teachers that will suggest which students are struggling and allow them to decide if they need intervention. Giving teachers these suggestions after only 6 minutes of gameplay time means that the teachers will have more control over the classroom progress because they will have more time to help students get back on track instead of being behind for the entire session and be able to slow down students who are getting too far ahead of the class.

7.1 Limitations
To reduce the amount of time processing the data, we used a representative subset for the pairwise prediction. However, we compared multiple numerical and categorical features between this subset and the entire dataset and determined that it contained almost identical distributions of data points. We created histograms for the distributions of performance, level play time, levels in session, time of session, performance session, and compared the number of schools and teachers represented in the subset to the totals. We were only missing 6 out of 111 schools and we had students from almost half of the teachers (291 out of 636) included in our subset.

We do not have fine-grained interaction data, which means we cannot tell exactly how many puzzles a student gets wrong. This lack of information causes our data to be skewed by having many performance scores of 100%, without capturing the full gameplay. However, there are other features that we can use to tease out this information, like level time, as students who pass a level while also getting puzzles wrong will most likely take longer because they are doing more problems. We have finer-grained puzzle level data, but it does not match up accurately with our level data. This means that while we can do studies on these datasets separately, we cannot combine them to have the full picture of what a student is doing during the level: which puzzles they see, if any puzzles are repeated during a level, how many puzzles they get right and wrong, and the time spent on each individual puzzle in a level. These finer granularities could offer valuable information on what a student is doing during a level and their performance compared to the whole student set.

8. CONCLUSION
This study aimed to use the least amount of student gameplay data possible to predict which students would benefit from teacher intervention during the remainder of the gameplay session. We tried two granularities of prediction for our analysis. We hypothesized that we could use one level’s data (average of 3.5 minutes of gameplay) to predict the next level’s outcomes, as this controls for content and difficulty, but this hypothesis was not confirmed. The lack of fine-grained level attempt data might not allow us to make a good prediction. Our second hypothesis was that we could use the first 6 minutes of gameplay (about 2 levels) to predict how many levels the student could complete in the next 20 minutes. This had a reasonable outcome with an MAE of 1.2 and RMSE of 1.6, meaning that, on average, the prediction is only off by 1-2 levels, which is a good estimate of how many levels a student will complete. We believe this can provide a valuable resource for the teachers who use ST Math in their classrooms, to help them concentrate their time and energy on the students who need it the most. Furthermore, this method allows the teachers to have a certain level of judgment in regards to who needs the assistance, which is imperative in a system that is used in multiple styles. Future work could investigate how students’ performance is affected when this information is given to teachers.

9. ACKNOWLEDGMENTS
This research is made possible by support of the National Science Foundation under Grant No. 1544273.
10. ADDITIONAL AUTHORS
Additional authors: Teomara Rutherford (North Carolina State University, email: taruther@ncsu.edu).

11. REFERENCES
[1] Abelson, R.P.: A variance explanation paradox: when a little is a lot. Psychological Bulletin 97(1), 129 (1985)
[2] Ahadi, A., Lister, R., Haapala, H., Vihavainen, A.: Exploring machine learning methods to automatically identify students in need of assistance. In: Proceedings of the eleventh annual International Conference on International Computing Education Research. pp. 121–130. ACM (2015)
[3] Backlund, P., Hendrix, M.: Educational games - are they worth the effort? A literature survey of the effectiveness of serious games. In: 2013 5th international conference on games and virtual worlds for serious applications (VS-GAMES). pp. 1–8. IEEE (2013)
[4] Calarco, J.M.: “I need help!” Social class and children’s help-seeking in elementary school. American Sociological Review 76(6), 862–882 (2011)
[5] Good, T.L.: Which pupils do teachers call on? The Elementary School Journal 70(4), 190–198 (1970)
[6] Holstein, K., Hong, G., Tegene, M., McLaren, B.M., Aleven, V.: The classroom as a dashboard: co-designing wearable cognitive augmentation for K-12 teachers. In: Proceedings of the 8th International Conference on Learning Analytics and Knowledge. pp. 79–88. ACM (2018)
[7] Holstein, K., McLaren, B.M., Aleven, V.: Intelligent tutors as teachers’ aides: exploring teacher needs for real-time analytics in blended classrooms. In: Proceedings of the seventh international learning analytics & knowledge conference. pp. 257–266. ACM (2017)
[8] Jiang, S., Williams, A., Schenke, K., Warschauer, M., O’Dowd, D.: Predicting MOOC performance with week 1 behavior. In: Educational Data Mining 2014 (2014)
[9] Karumbaiah, S., Baker, R.S., Shute, V.: Predicting quitting in students playing a learning game. In: EDM (2018)
[10] Lee, S.J., Liu, Y.E., Popovic, Z.: Learning individual behavior in an educational game: a data-driven approach. In: Educational Data Mining 2014 (2014)
[11] Liu, Y.E., Mandel, T., Butler, E., Andersen, E., O’Rourke, E., Brunskill, E., Popovic, Z.: Predicting player moves in an educational game: A hybrid approach. In: EDM. pp. 106–113. Citeseer (2013)
[12] Liu, Z., Cody, C., Barnes, T., Lynch, C., Rutherford, T.: The antecedents of and associations with elective replay in an educational game: Is replay worth it? In: EDM (2017)
[13] Peddycord-Liu, Z., Cateté, V., Vandenberg, J., Barnes, T., Lynch, C.F., Rutherford, T.: A field study of teachers using a curriculum-integrated digital game. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. p. 428. ACM (2019)
[14] Peddycord-Liu, Z., Cody, C., Kessler, S., Barnes, T., Lynch, C.F., Rutherford, T.: Using serious game analytics to inform digital curricular sequencing: What math objective should students play next? In: Proceedings of the Annual Symposium on Computer-Human Interaction in Play. pp. 195–204. ACM (2017)
[15] Peddycord-Liu, Z., Harred, R., Karamarkovich, S., Barnes, T., Lynch, C., Rutherford, T.: Learning curve analysis in a large-scale, drill-and-practice serious math game: Where is learning support needed? In: International Conference on Artificial Intelligence in Education. pp. 436–449. Springer (2018)
[16] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)
[17] Prentice, D.A., Miller, D.T.: When small effects are impressive. Psychological Bulletin 112(1), 160 (1992)
[18] Romero, C., Ventura, S.: Educational data mining: a review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40(6), 601–618 (2010)
[19] Rutherford, T., Farkas, G., Duncan, G., Burchinal, M., Kibrick, M., Graham, J., Richland, L., Tran, N., Schneider, S., Duran, L., et al.: A randomized trial of an elementary school mathematics software intervention: Spatial-temporal math. Journal of Research on Educational Effectiveness 7(4), 358–383 (2014)
[20] Ryan, A.M., Gheen, M.H., Midgley, C.: Why do some students avoid asking for help? An examination of the interplay among students’ academic efficacy, teachers’ social–emotional role, and the classroom goal structure. Journal of Educational Psychology 90(3), 528 (1998)
[21] Sabourin, J.L., Shores, L.R., Mott, B.W., Lester, J.C.: Understanding and predicting student self-regulated learning strategies in game-based learning environments. International Journal of Artificial Intelligence in Education 23(1-4), 94–114 (2013)
[22] Skinner, E.A., Belmont, M.J.: Motivation in the classroom: Reciprocal effects of teacher behavior and student engagement across the school year. Journal of Educational Psychology 85(4), 571 (1993)
[23] Wolff, A., Zdrahal, Z., Herrmannova, D., Kuzilek, J., Hlosta, M.: Developing predictive models for early detection of at-risk students on distance learning modules. In: Machine Learning and Learning Analytics Workshop at The 4th International Conference on Learning Analytics and Knowledge (LAK14). pp. 24–28 (2014)
[24] Wolff, A., Zdrahal, Z., Nikolov, A., Pantucek, M.: Improving retention: predicting at-risk students by analysing clicking behaviour in a virtual learning environment. In: Proceedings of the third international conference on learning analytics and knowledge. pp. 145–149. ACM (2013)