Comparison of Off the Shelf Data Mining Methodologies in Educational Game Analytics

David J. Gagnon, University of Wisconsin-Madison, Madison, WI, david.gagnon@wisc.edu
Erik Harpstead, Carnegie Mellon University, Pittsburgh, PA, eharpste@cs.cmu.edu
Stefan Slater, University of Pennsylvania, Philadelphia, PA, slater.research@gmail.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
In this paper we compare the accuracy of nine common machine learning algorithms in predicting quitting and performance on knowledge assessment tests in the context of two middle school science learning games. The games being studied, Crystal Cave and Wave Combinator, are both short-duration (played for an average of 25 and 28 minutes respectively), web-based games designed for use in classroom contexts. We used samples of 1,254 and 5,308 anonymous internet players respectively, collected during Fall of 2018. We recorded raw clickstream data and used feature engineering methods to calculate simple descriptive features such as average timings between events and the number and types of player moves. We then used these features to model players quitting the game at each level, as well as content knowledge measured by subsequent assessment. We found that logistic regression produced the best models overall and that model quality was influenced by specific game levels and assessment items. We conclude by discussing future work to improve the prediction of player quitting and player knowledge assessment.

Keywords
Feature engineering, digital games, videogames, modeling, prediction, quitting, assessment

1. INTRODUCTION
Digital games are increasingly being used to support learning in educational contexts across a wide variety of subjects, including social studies [4], mathematics [3], physics [9], and history [10]. Beyond content knowledge, games have also been used to support the development of cognitive and noncognitive skills, such as persistence and spatial reasoning [8]. As video games see increasing use in classroom contexts, the need to analyze the rich interaction data that they produce for meaningful behavioral and learning indicators becomes greater as well.

Educational data mining (EDM) is well-suited to the problem of analyzing digital games which feature rich interaction data, and methods common to EDM have been frequently deployed to better understand data produced by digital games. For instance, EDM techniques have been used to model quitting behavior among students playing an educational physics simulation game [2], problem-solving in a game-based programming task [5], and computational thinking skills in Zoombinis [7, 14].

In this paper, we use EDM techniques to predict quitting behavior and content knowledge within two middle school science games, Crystal Cave and Wave Combinator [1, 11]. We sought to model these outcomes because of their relevance for the use of these games in educational contexts. The identification of quitting behavior affords game designers and educators the opportunity to intervene with scaffolds or feedback that can help keep students on-task and working productively [2]. Prediction and modeling of students' content understanding allows game designers and educators to offer a student additional practice on a given skill, or to correct specific misconceptions that might exist about that content. While such techniques have been employed in intelligent tutoring systems via knowledge tracing and knowledge inference methods [14], the open-ended structure of many educational games makes these methods difficult to employ successfully.

1.1 Games Being Studied
The two games used in this study, Crystal Cave and Wave Combinator, are available online for free public use and are short-duration experiences, played for an average of 25 and 28 minutes respectively. They are primarily used in classroom contexts.

In Wave Combinator, players must manipulate the amplitude, frequency and offset of a wave in order to match the shape of a target wave (Figure 1). Once the player’s wave is within a certain range of the target wave, they are allowed to continue to the next level. At key points of the game, a multiple-choice question appears on screen that assesses the vocabulary used in the game (Figure 2). While these assessment items are presented as being asked by in-game characters, they are not situated within a broader narrative context, but were retrofitted into the game for the sake of this research. This study examines play data from the first 7 levels of the game and the 2 multiple-choice items that follow.

Figure 1. Initial levels of Wave Combinator.

Figure 2. A multiple choice question embedded in Wave Combinator.

In Crystal Cave, players assemble differently shaped molecules to form crystals of varying stabilities (Figure 3). Stability is determined by the density of the resultant molecular pattern as well as the proper alignment of the positive and negative charges on portions of the molecules. For each level, different thresholds of stability for the player's molecular design result in completing the level with 1 to 3 stars. Each level unlocks when the player has achieved a certain number of stars, leading to a semi-structured progression through the game that allows students to repeat challenges to find optimal molecular arrangements or to progress to new challenges. As with Wave Combinator, multiple-choice quiz items are presented by game characters, but without meaningful integration into the game context. These questions appear after completing specific levels (Figure 4). This study examines play data from the entire set of 7 levels of the game and the 3 multiple-choice items that are intermixed.

Figure 3. Assembling simple molecules to create stable crystals in Crystal Cave.

Figure 4. A multiple choice question embedded in Crystal Cave.

These two games were chosen because they represent different archetypes of educational games. Wave Combinator provides players with controls that manipulate the outputs of a simple simulation in real time. Players have to construct meanings about the purpose of each control in order to find a solution. Crystal Cave is a more constructive task that delays feedback for several moves and requires players to apply simple chemistry rules to develop reasonable strategies. Our goal in looking at two different games was to explore how well similar minimal feature engineering approaches would perform across differing game structures and design attributes.

2. METHODS

2.1 Process of Data Collection & Instrumentation
Data were collected from the games using both Google Analytics (GA) and a researcher-developed event logging system. GA was used to quickly record and visualize overall game metrics such as number and location of player sessions, session length, and high-level progression through each game (e.g., completed level 5, then quit). GA was primarily used during development and to understand audiences; it is not included in the current analyses beyond characterizing the audience and usage patterns.

Multiple-choice knowledge assessment measures were designed by the researchers for both games. Each item was aligned with the documented learning goals of the game. The instruments were designed to use a similar visual style as the rest of the gameplay (see Figures 2 and 4). Players completed the assessment measures after finishing gameplay; the assessment items were not embedded into the game itself. Our analyses focused on two labels: quitting behavior, when a player quits a level before it is completed and leaves the game, and performance on the post-test assessment measures.

Population and Sampling Process
According to GA, 93% of the games’ usage came from the United States, based on IP addresses. Gameplay sessions were primarily recorded during school hours and on weekdays, leading the researchers to believe that the games are used primarily in classroom contexts. Gameplay sessions were recorded anonymously, making it impossible to tell if a session represented a new or returning player. While we acknowledge this limitation, our analysis assumes that an individual session represents a unique player. During the data collection period of September 1 through December 31, 2018, Crystal Cave was played 20,963 times with an average of 24.68 minutes/session. Wave Combinator was played 23,353 times with an average of 28.78 minutes/session. Of these sessions, 1,254 Crystal Cave sessions and 5,308 Wave Combinator sessions are included in this study, based on the availability of the logging system and the data exclusion rules described below.

2.2 Data Logging
Within each game, a JavaScript-based logging client captures and transmits clickstream events to a server for storage. Events are recorded for all discrete player actions such as starting a level, making a move, and completing a challenge. Each event records the client browser’s native time, an automatically generated session identifier, and details about the event that took place. The events are encoded as JSON and sent via an HTTP POST request. These requests are scheduled for delivery to the backend logging server using a first-in-first-out queue and are only dismissed after delivery is confirmed.

The backend server is a researcher-built, open source, PHP-based web service. Client requests are parsed, appended with the server’s system time, and inserted as individual records into a MySQL database. As each clickstream event is sent as a separate network request and recorded as an individual row, the system is easily parallelized for large numbers of clients. For this study, a single quad-core Apache / PHP server virtual machine and a single quad-core MySQL virtual machine were provisioned in a University data center.

2.3 Feature Engineering and Distribution
We designed features that describe the actions players are able to take in each game. We intentionally explored basic features that could conceivably be extended to other educational games. We developed features that describe the counts of each (game-specific) move type, the average number of each move type in each level of play, timings and attributes of each move, the scoring the game provided to the player in each level, and attributes of restarting and replaying challenges by players.

Features were calculated using data collected chronologically before the outcome being modeled. For example, features to model quitting in level 5 were calculated using play data derived from levels 1 through 4, and any available data from level 5 play, but not level 6 or greater. This was done to preserve the predictive nature of the research and to create a model that could conceivably be used to make predictions in real time within a gameplay session. Play sessions with fewer than 10 total moves were excluded from the final dataset. Table 1 describes the features used for each game and attempts to align them across games when appropriate.

Table 1. Gameplay features for Wave Combinator and Crystal Cave.

Category                   Wave Combinator                     Crystal Cave
Move Counts                Total Slider Moves                  Total Molecule Moves
                           Total Offset Moves                  Total Molecule Rotates
                           Total Amplitude Moves               Total Stamp Rotates
                           Total Wavelength Moves
                           Total Move Type Changes
Averages / Level           Av. Slider Moves                    Av. Molecule Moves
                           Av. Offset Moves                    Av. Molecule Rotates
                           Av. Amplitude Moves                 Av. Stamp Rotates
                           Av. Wavelength Moves
                           Av. Move Type Changes
                           Av. % Offset Moves
                           Av. % Amplitude Moves
                           Av. % Wavelength Moves
Timing / Move Attributes   Slider St. Dev.                     Av. Molecule Move Time
                           Slider Min / Max                    Total Time
                           Av. Slider St. Dev. / Level         Av. Time / Level
                           Av. Slider Max / Min                Total Time in “Museum”
                           Total Time
                           Av. Time / Level
Scoring                                                        Av. Score / Level
Resets / Replays           Av. Valid / Invalid Transitions     Av. Resets / Level
                           Total Valid / Invalid Transitions   Av. Completes / Level

We defined “quitting” on a given level as a session that ends (log events halt abruptly) before the current level is completed. Using this definition, each session is labeled with either a “quit” or a “complete” for each level. The distribution for these events leans strongly toward completing, with an average of 70.3% and 81.4% of sessions completing each level in Crystal Cave and Wave Combinator respectively. We use each level’s quitting distribution as our baseline model.

We defined “incorrect” answers as a session selecting any of the 3 options that were not correct for each assessment item. As with quitting, the distribution for the assessments leans toward correct answers being provided. An average of 65.6% of sessions selected correct answers for the three Crystal Cave questions, while an average of 65.8% of sessions selected correct answers for the two Wave Combinator questions. As with quitting, we use each question’s distribution as the baseline model.

2.4 Modeling process
We modeled the data using several algorithms provided by RapidMiner, a multiplatform data science tool [6]. This tool was chosen for its ease of use and because it is free to use in educational contexts across the vast majority of computational platforms (OSX, Windows, and Linux). Individual models were generated for quitting at levels 1, 3, 5 and 7 for each game. Individual models were also generated for each knowledge assessment, where the results of the assessment were represented as a binomial indicating a correct vs. incorrect response. The models used were RapidMiner’s implementations of Naive Bayes, Generalized Linear Model, Logistic Regression, Fast Large Margin, Deep Learning (see footnote 1), Decision Tree, Random Forest, Gradient Boosted Trees and Support Vector Machine. RapidMiner’s default hyperparameters were used for all models, including a preprocessing step to standardize all values to have zero mean and unit variance, as well as the option to use a single thread to ensure reproducibility. Model-specific hyperparameters are listed in Appendix A. A single 60/40 split process randomly divided the source data into a 60% training set and a 40% validation set. The accuracy of each model was determined, along with baseline accuracies for quitting and knowledge assessment. The same initial feature space was used to predict both quitting behavior and post-test performance.

Footnote 1: RapidMiner’s Deep Learner component is based on a multi-layer feed-forward artificial neural network. For more details see: https://docs.rapidminer.com/latest/studio/operators/modeling/predictive/neural_nets/deep_learning.html

3. RESULTS
Baseline accuracies were quite high for predicting quitting at each of the different levels across both games. For example, 91.1% of players who start level 7 in Wave Combinator also complete it. By predicting that all players will complete level 7, a model will have a 0.911 accuracy, leaving very little room for improvement. Across all levels, a baseline model that always predicts completing the level will have an average accuracy of 0.703 for Crystal Cave and 0.814 for Wave Combinator.

For predictions of quitting at each level in Crystal Cave, Deep Learning models performed best on average. The most accurate prediction was 0.908 for level 1, followed by 0.786 for level 5, 0.737 for level 3 and 0.707 for level 7. The largest improvement over the baseline was for level 7, with the model performing 22.7% more accurately. This was followed by level 5 with a 15.8% improvement over baseline, and level 1 with a 15.1% improvement over baseline. The model performed slightly worse than baseline when predicting quitting for level 3 (0.737 versus 0.769, a ratio of 0.958). On average, the Deep Learning models predicted quitting with an accuracy of 0.784. Averaged across levels, all models performed better than the baseline model except for Fast Large Margin.

Table 3. Performance for predicting instances of quitting at each level within Crystal Cave (accuracy).

Model                      LV1     LV3     LV5     LV7     Av.
Baseline                   0.789   0.769   0.679   0.576   0.703
Naive Bayes                0.819   0.749   0.685   0.687   0.735
Generalized Linear Model   0.883   0.772   0.785   0.687   0.782
Logistic Regression        0.897   0.763   0.786   0.667   0.778
Fast Large Margin          0.863   0.771   0.666   0.460   0.690
Deep Learning              0.908   0.737   0.786   0.707   0.784
Decision Tree              0.863   0.767   0.666   0.649   0.736
Random Forest              0.861   0.762   0.676   0.660   0.740
Gradient Boosted Trees     0.850   0.762   0.777   0.686   0.769
Support Vector Machine     0.762 1.091

For predictions of quitting at each level in Wave Combinator, Logistic Regression was the most accurate, but offered little improvement over baseline for most levels. The most accurate prediction was 0.999 for quitting in level 1, followed by 0.914 for level 7, 0.819 for level 3 and 0.630 for level 5. The largest improvement over baseline was seen in level 1, with the model performing 11.2% more accurately. This advantage quickly dissolves, with only a 0.8% improvement in level 5, a 0.3% improvement for level 7 and a 0.1% improvement for level 3. On average, Logistic Regression predicted quitting with an accuracy of 0.841. Deep Learning and Gradient Boosted Trees failed to perform better than the baseline model for this prediction.

Table 4. Performance for predicting instances of quitting within Wave Combinator (accuracy).

Model                      LV1     LV3     LV5     LV7     Av.
Baseline                   0.899   0.819   0.625   0.911   0.814
Naive Bayes                0.956   0.819   0.629   0.914   0.830
Generalized Linear Model   1.000   0.818   0.624   0.914   0.839
Logistic Regression        0.999   0.819   0.630   0.914   0.841
Fast Large Margin          1.000   0.819   0.619   0.914   0.838
Deep Learning              0.999   0.657   0.425   0.717   0.700
Decision Tree              0.997   0.819   0.629   0.914   0.840
Random Forest              0.974   0.819   0.630   0.914   0.834
Gradient Boosted Trees     n/a     0.699   0.582   0.888   n/a
Support Vector Machine     0.999   0.818   0.621   0.908   0.836

Baseline predictions for the assessment items were lower than for quitting, but still much higher than a fair coin toss. Averaging across the 3 items in Crystal Cave and the 2 items in Wave Combinator, a baseline model that always predicts a correct answer will have an accuracy of 0.656 and 0.658, respectively.

For predicting incorrect answers on the 3 assessment items in Crystal Cave, Logistic Regression was the most accurate. The model best predicts the outcome of question 1 with an accuracy of 0.745, followed by question 0 with an accuracy of 0.603 and question 2 with an accuracy of 0.600. Compared to the baseline, the greatest improvement was 4.4% on question 2. The model demonstrated a 2.8% improvement on question 1 and a 2.5% improvement on question 0. On average, the model predicted incorrect answers with an accuracy of 0.649.

Table 5. Performance for predicting incorrect answers for each assessment question in Crystal Cave (accuracy).

Model                      Q0      Q1      Q2      Av.
Baseline                   0.588   0.724   0.574   0.656
Naive Bayes                0.594   0.732   0.575   0.634
Generalized Linear Model   0.594   0.732   0.575   0.634
Logistic Regression        0.603   0.745   0.600   0.649
Fast Large Margin          0.585   0.732   0.625   0.647
Deep Learning              0.500   0.591   0.450   0.514
Decision Tree              0.581   0.732   0.600   0.638
Random Forest              0.585   0.732   0.500   0.606
Gradient Boosted Trees     0.543   0.706   0.525   0.591
Support Vector Machine     0.568   0.691   0.650   0.637

For predicting incorrect answers on the 2 assessment items in Wave Combinator, Logistic Regression was the most accurate. The model has an accuracy of 0.776 for question 1, followed by an accuracy of 0.590 for question 0. This translates to an improvement of 9.2% for question 0 and identical accuracy to baseline for question 1. On average, the model predicted incorrect answers with an accuracy of 0.683.

Table 6. Performance for predicting incorrect answers for each assessment question in Wave Combinator (accuracy).

Model                      Q0      Q1      Av.
Baseline                   0.540   0.776   0.658
Naive Bayes                0.446   0.718   0.582
Generalized Linear Model   0.588   0.771   0.680
Logistic Regression        0.590   0.776   0.683
Fast Large Margin          0.453   0.774   0.614
Deep Learning              0.489   0.585   0.537
Decision Tree              0.549   0.774   0.661
Random Forest              0.546   0.774   0.660
Gradient Boosted Trees     0.578   0.728   0.653
Support Vector Machine     0.550   0.776   0.663

4. DISCUSSION
This paper compares the accuracy of 9 common modeling algorithms for predicting quitting and knowledge assessment in two different learning games using the simplest possible feature engineering. We found that, on the whole, these models were able to successfully predict quitting behavior and correct answers in our two games and their associated post-tests. This is a promising finding for continuing to deploy educational data mining methods in order to capture and identify learning and behaviors of interest within digital games.

Accurate prediction of quitting behaviors and post-test performance has a number of practical applications within educational settings. For instance, players who are identified as being at risk of quitting a level may be given targeted behavioral or affective scaffolds to keep them on-task and working productively. Players who have a low predicted score for a post-test assessment can be given additional practice opportunities on demand, based on the specific misconceptions or difficulties they are having.

That said, there is room for improvement in the performance of these models. More complex move-sequence features may lead to more meaningful descriptors of the player’s thinking. While the features that were used in this paper were certainly grounded in the interactions afforded to the player, they were only computed in terms of simple counts and averages. One possible next step would be to use sequential pattern mining to first identify common sequences of moves that correlate with outcomes of interest [12]. The presence of these patterns could then be used as an engineered feature to train the models.

The extreme accuracy (0.999) of level 1 quitting predictions for Wave Combinator invites speculation about the usefulness of building models based only on very recent events. The approach used here was to use all player actions leading up to the quitting or assessment event. This may have the unintended consequence of diluting player moves that immediately lead to success in a specific level with moves from much earlier in the gameplay that are now irrelevant to the challenge at hand. A next step would be to modify the feature-generating scripts to experiment with different time windows for modeling.

Another limitation of this work is that accuracy may not be the best measure of the effectiveness of the predictions. In future work, the performance of the models should be reported by providing precision, recall and F1 scores. This issue is compounded by the fact that baseline predictions, based only on the percentages of players that complete a level or correctly answer a quiz item, are quite high, leaving very little room for improvement. The authors are unable to conclude that the models are deriving their accuracy from the strength of the features and not simply from the unbalanced distribution of the phenomena.

Finally, the validity of the answers provided for the multiple-choice assessment items could be studied. These items are not standardized measures, but reasonable assessments designed by the researchers. Further evaluating their validity and reliability may yield insights into why they are harder to predict. Additionally, modifying the system to record the time spent answering each assessment would help identify obvious issues such as spending less than 1 second before answering, which is not nearly enough time to read and decide on a correct answer.

5. SUMMARY
In summary, logistic regression performed better than all competing algorithms for quitting in Wave Combinator and for content knowledge tests in both games. Deep Learning models performed best in predicting quitting in Crystal Cave. Level quits can be predicted with an average accuracy of 0.784 for Crystal Cave and 0.841 for Wave Combinator, an improvement of 12.4% and 3.1% over baseline, respectively. Correct answers across the embedded knowledge assessment items can be predicted with an average accuracy of 0.649 for Crystal Cave and 0.683 for Wave Combinator. The models provided a 3.3% and 4.6% improvement over baseline for these games, respectively.

These results show that educational data mining techniques can provide some predictive value to different kinds of educational games even with relatively minimal feature engineering. We hope that other researchers can be encouraged to apply similar methods to their own games given our results.

6. ACKNOWLEDGMENTS
The authors gratefully acknowledge partial support of this research by NSF through the University of Wisconsin Materials Research Science and Engineering Center (DMR-1720415) and the Wisconsin Department of Public Instruction.

7. REFERENCES
[1] Crystal Cave [Computer Software]. (2017). Madison: Field Day.
[2] Karumbaiah, S., Baker, R.S., & Shute, V. (2018). Predicting Quitting in Students Playing a Learning Game. Proceedings of the 11th International Conference on Educational Data Mining, 21-31.
[3] Kiili, K., Devlin, K., Perttula, A., Tuomi, P., & Lindstedt, A. (2015). Using video games to combine learning and assessment in mathematics education. International Journal of Serious Games, 2(4), 37-55.
[4] Maguth, B., List, S., & Wunderle, M. (2015). Teaching Social Studies with Video Games. The Social Studies, 106(1), 32-36.
[5] Malkiewich, L., Baker, R.S., Shute, V., Kai, S., & Paquette, L. (2016). Classifying behavior to elucidate elegant problem solving in an educational game. Proceedings of the 9th International Conference on Educational Data Mining, 448-453.
[6] Mierswa, I., & Klinkenberg, R. (2019). RapidMiner Studio (9.2) [Data science, machine learning, predictive analytics]. Retrieved from https://rapidminer.com/
[7] Rowe, E., Asbell-Clarke, J., Baker, R., Gasca, S., Bardar, E., & Scruggs, R. (2018). Labeling Implicit Computational Thinking in Pizza Pass Gameplay. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems.
[8] Shute, V. J., Ventura, M., & Ke, F. (2015). The power of play: The effects of Portal 2 and Lumosity on cognitive and noncognitive skills. Computers & Education, 80, 58-67.
[9] Shute, V. J., Ventura, M., & Kim, Y. J. (2013). Assessment and learning of qualitative physics in Newton's Playground. The Journal of Educational Research, 106(6), 423-430.
[10] Watson, W. R., Mong, C. J., & Harris, C. A. (2011). A case study of the in-class use of a video game for teaching high school history. Computers & Education, 56(2), 466-474.
[11] Wave Combinator [Computer Software]. (2017). Madison: Field Day.
[12] Wallner, G. (2015). Sequential Analysis of Player Behavior. In CHI PLAY ’15 Proceedings of the 2015 Annual Symposium on Computer-Human Interaction in Play, 349-358. https://doi.org/10.1145/2793107.2793112
[13] Pavlik Jr., P.I., Cen, H., & Koedinger, K.R. (2009). Performance Factors Analysis – A New Alternative to Knowledge Tracing. In V. Dimitrova & R. Mizoguchi (Eds.), Proceedings of the 14th International Conference on Artificial Intelligence in Education. Brighton, England.
[14] Zoombinis [Computer Software]. (2015). TERC.

8. APPENDIX A: Hyperparameters used for each model

Model                      Hyperparameters
Naive Bayes                n/a
Generalized Linear Model   Family = binomial
                           Solver = L_BFGS
Logistic Regression        Solver = L_BFGS
Fast Large Margin          Strategy = 1 against all
Deep Learning              Activation = rectifier
                           Hidden layer sizes = 50,50
                           Epochs = 10.0
Decision Tree              Criterion = gain_ratio
                           Maximal depth = 2
                           Apply Pruning
                           Confidence = 0.1
                           Minimal Gain = 0.05
                           Minimal Leaf Size = 2
Random Forest              Trees = 20
                           Criterion = gain_ratio
                           Max Depth = 7
                           Apply Pruning
                           Confidence = 0.25
                           Minimal gain = 0.05
                           Minimal Leaf Size = 2
                           Guess subset ratio
                           Voting Strategy = confidence vote
Gradient Boosted Trees     Trees = 60
                           Max Depth = 2
                           Min Rows = 10
                           Min Split Improvement = 0
                           Bins = 20
                           Learning Rate = 0.1
                           Sample Rate = 1.0
Support Vector Machine     Type = C-SVC
                           Kernel = rbf
                           Gamma = 1.0E-4
                           C = 100.0
                           Epsilon = 0.001
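
As an illustration of the feature engineering and labeling rules in Section 2.3, the following Python sketch labels a level as "quit" or "complete" and computes a few count and timing features using only events from levels up to the one being modeled. It is a minimal sketch under stated assumptions, not the authors' scripts: the `Event` record, its field names, and the three feature names are hypothetical, events are assumed to be in chronological order, and a level that was started but never completed is treated as a quit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

MIN_MOVES = 10  # sessions with fewer total moves are excluded (Section 2.3)


@dataclass
class Event:
    """Hypothetical clickstream record; the field names are assumptions."""
    time: float  # seconds since the session started
    level: int   # level the player was in when the event fired
    kind: str    # "start", "move", or "complete"


def keep_session(events: List[Event]) -> bool:
    """Apply the exclusion rule: keep sessions with at least MIN_MOVES moves."""
    return sum(e.kind == "move" for e in events) >= MIN_MOVES


def label_level(events: List[Event], level: int) -> Optional[str]:
    """Label a level "complete" or "quit"; None if the level was never started."""
    kinds = {e.kind for e in events if e.level == level}
    if "start" not in kinds:
        return None
    return "complete" if "complete" in kinds else "quit"


def features_for_level(events: List[Event], level: int) -> Dict[str, float]:
    """Compute simple features from moves made in levels up to `level` only."""
    moves = [e for e in events if e.kind == "move" and e.level <= level]
    levels_seen = {e.level for e in events if e.level <= level}
    gaps = [b.time - a.time for a, b in zip(moves, moves[1:])]
    return {
        "total_moves": float(len(moves)),
        "avg_moves_per_level": len(moves) / len(levels_seen) if levels_seen else 0.0,
        "avg_time_between_moves": sum(gaps) / len(gaps) if gaps else 0.0,
    }


# Tiny synthetic session: level 1 is completed, level 2 is started and abandoned.
session = [
    Event(0.0, 1, "start"), Event(2.0, 1, "move"), Event(5.0, 1, "move"),
    Event(6.0, 1, "complete"), Event(7.0, 2, "start"), Event(9.0, 2, "move"),
]
print(label_level(session, 2), features_for_level(session, 2))
```

Filtering on `e.level <= level` is what keeps later play out of the feature vector, which is the property the paper relies on for real-time prediction. Only move events feed the features, so the "complete" event that defines the label does not leak into them.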
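
The preprocessing described in Section 2.4 (a single random 60/40 split and standardization to zero mean and unit variance) can be sketched with the Python standard library as follows. This is an illustration rather than RapidMiner's implementation. The function names and the seed are hypothetical, and fitting the standardizer on the training set only is an assumption, since the paper does not say where the means and variances were estimated.

```python
import random
import statistics
from typing import List, Sequence, Tuple

Row = List[float]


def split_60_40(rows: Sequence[Row], seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Shuffle once and cut into a 60% training set and a 40% validation set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(0.6 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_standardizer(train: Sequence[Row]) -> Tuple[List[float], List[float]]:
    """Estimate per-column mean and population standard deviation."""
    columns = list(zip(*train))
    means = [statistics.fmean(col) for col in columns]
    # Assumption: constant columns get a divisor of 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows: Sequence[Row], means: List[float], stds: List[float]) -> List[Row]:
    """Rescale each column to zero mean and unit variance using fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Synthetic two-feature dataset, for illustration only.
rows = [[float(i), float(i * i)] for i in range(10)]
train, valid = split_60_40(rows, seed=1)
means, stds = fit_standardizer(train)
train_z = standardize(train, means, stds)
valid_z = standardize(valid, means, stds)
print(len(train_z), len(valid_z))
```

Reusing the training statistics on the validation rows keeps information about the held-out sessions out of the model inputs. A pipeline that standardizes before splitting would produce slightly different values.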
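
The improvement figures in Section 3 appear to be relative gains over the baseline, that is (model accuracy - baseline accuracy) / baseline accuracy. The paper does not state the formula, so this reading is an assumption; it does reproduce the Crystal Cave quitting figures from Table 3, and averaging the per-level gains gives the 12.4% reported in the Summary. The function names below are hypothetical.

```python
from typing import Dict, Sequence


def baseline_accuracy(labels: Sequence[str]) -> float:
    """Accuracy of a model that always predicts the most common label."""
    counts: Dict[str, int] = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / len(labels)


def relative_improvement(model_acc: float, baseline_acc: float) -> float:
    """Gain over baseline as a fraction of the baseline accuracy."""
    return (model_acc - baseline_acc) / baseline_acc


# Crystal Cave quitting: Baseline and Deep Learning rows of Table 3.
baseline = {1: 0.789, 3: 0.769, 5: 0.679, 7: 0.576}
deep_learning = {1: 0.908, 3: 0.737, 5: 0.786, 7: 0.707}

gains = {lv: relative_improvement(deep_learning[lv], baseline[lv]) for lv in baseline}
for lv, gain in gains.items():
    print(f"level {lv}: {gain * 100:+.1f}%")

mean_gain = sum(gains.values()) / len(gains)
print(f"mean of per-level gains: {mean_gain * 100:.1f}%")
```

Under this reading, level 3 comes out at about -4.2%, which matches the text's observation that the model was slightly worse than baseline there. The same formula gives values within rounding of the Wave Combinator figures.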
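
The Discussion proposes reporting precision, recall and F1 because accuracy is dominated by the majority class. The sketch below shows why, using a synthetic sample of 1,000 sessions built to match the 91.1% completion rate reported for level 7 of Wave Combinator. The data and both predictors are constructed for illustration only, not taken from the study, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of sessions whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "quit"
) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the `positive` class (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Synthetic labels: 911 completes followed by 89 quits.
y_true = ["complete"] * 911 + ["quit"] * 89

# Baseline: always predict "complete".
always_complete = ["complete"] * 1000
print(accuracy(y_true, always_complete), precision_recall_f1(y_true, always_complete))

# Hypothetical model: flags 10 completes and 50 of the 89 quits as "quit".
some_model = ["quit"] * 10 + ["complete"] * 901 + ["quit"] * 50 + ["complete"] * 39
print(accuracy(y_true, some_model), precision_recall_f1(y_true, some_model))
```

The always-complete baseline reaches 0.911 accuracy while identifying no quitting sessions at all, so its recall for the "quit" class is zero. The hypothetical model is only a few points more accurate, yet its recall and F1 for quitting are far higher, which is the distinction accuracy alone hides.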