<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Solving Mathematical Exercises: Prediction of Students' Success</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Wankerl</string-name>
          <email>sebastian.wankerl@uni-wuerzburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerhard Götz</string-name>
          <email>gerhard.goetz@mosbach.dhbw.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Hotho</string-name>
          <email>andreas.hotho@informatik.uni-wuerzburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair for Computer Science X</institution>
          ,
          <addr-line>Am Hubland, 97074 Würzburg</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DHBW Mosbach</institution>
          ,
          <addr-line>Lohrtalweg 10, 74821 Mosbach</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In educational settings, recommender systems can help to choose the right exercises a student should be given for training. To make good decisions, the system should be able to estimate how successfully a student would answer a recommended exercise. In this work, we study the performance of convolutional neural networks and collaborative filtering for estimating students' success. We show that we can distinguish between correctly and incorrectly processed exercises with a precision of up to 64% while training on a small corpus of 712 user interactions.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender systems</kwd>
        <kwd>Technology enhanced learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recommender systems have been in widespread use for a long time. While
the most prominent areas of application are still e-commerce and the
entertainment industry [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], several attempts have been made to apply recommendation algorithms
in education [
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ], owing to the growing amount of online learning
material and massive open online courses (MOOCs).
      </p>
      <p>
        We want to build a recommender system that helps freshmen at university
to overcome their weaknesses in mathematics. Since no public data is available
for our aim, we built a rule-based system which exploits didactical ontologies[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
to nd suitable exercises for a student. Furthermore, this preliminary system
makes it possible to collect data we can later use for building a system that
better adapts to the individual student.
      </p>
      <p>In this work, we present a first step in this direction, namely a
machine-learning based approach to estimate whether a student will correctly solve an
exercise given by our rule-based system. In particular, we want to
detect, for all users, those exercises that they were given but unable to master.
Failing such exercises allows students to learn more and to broaden their skills instead
of simply repeating known content.</p>
    </sec>
    <sec id="sec-2">
      <title>Educational System and Dataset</title>
      <p>Our dataset contains 128 mathematical exercises. With our rule-based system,
we collected data from 39 students who contributed a total of 787 processed
exercises. However, for the following analysis we only used data of participants
who contributed at least 10 exercises, in order to have enough data for each user. This
leaves us with 24 users and 712 processed exercises.</p>
      <p>The data points, also called interactions, are represented as 4-tuples (u, i, r<sub>ui</sub>, t)
consisting of a user u, an exercise i, and a binary rating label r<sub>ui</sub> which indicates
whether user u solved exercise i correctly or not. In addition, we keep a
consecutive timestamp t which preserves the chronological order of the interactions.
The number of interactions contributed by user u is denoted as T<sub>u</sub>.</p>
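      <p>Such an interaction tuple can be modeled, for instance, as follows; the type and helper names are illustrative, not taken from the authors' code.</p>

```python
from collections import namedtuple

# One interaction (u, i, r_ui, t): user, exercise, binary correctness
# label, and a consecutive timestamp.
Interaction = namedtuple("Interaction", ["user", "exercise", "correct", "t"])

def interactions_of(data, u):
    """All interactions of user u in chronological order; len() gives T_u."""
    return sorted((a for a in data if a.user == u), key=lambda a: a.t)
```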
      <sec id="sec-2-1">
        <title>Training and Test Dataset</title>
        <p>In this section we describe how we build the test and training datasets for each
student, which we use for our experiments.</p>
        <p>To maximize the amount of data usable for training, we keep all data of the
user except the last 5 interactions, on which we want to predict the user's
performance. Hence, these last five interactions form our test set, denoted as
te<sub>u</sub>.</p>
        <p>We denote the basis training set for user u as tr<sub>u</sub>. For example, if we want to
predict the performance of user u<sub>1</sub>, we keep all interactions from users u<sub>2</sub>, …, u<sub>24</sub>
together with the first T<sub>1</sub> − 5 interactions by user u<sub>1</sub> as basis training set.</p>
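        <p>For one user, the resulting split can be sketched as follows, with interactions written as plain (user, exercise, correct, timestamp) tuples; the function and parameter names are our own.</p>

```python
def split_user(data, u, n_test=5):
    """Return the basis training set tr_u and the test set te_u for user u.

    te_u: the user's last n_test interactions (by timestamp).
    tr_u: all interactions of the other users plus the user's first
          T_u - n_test interactions.
    """
    own = sorted((a for a in data if a[0] == u), key=lambda a: a[3])
    tr_u = [a for a in data if a[0] != u] + own[:-n_test]
    te_u = own[-n_test:]
    return tr_u, te_u
```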
        <p>Since we expect the CNN to require more training data than the basis
training set provides, we decided to create a second training set t̃r<sub>u</sub> by augmenting
the basis training set. More precisely, it consists of replicas of the users,
obtained by creating approximately 20,000 windows w<sub>j</sub> of size 5 consisting
of random interactions a<sub>j1</sub>, …, a<sub>j5</sub> ∈ tr<sub>u</sub> such that the following conditions are
fulfilled: (1) all interactions a ∈ w<sub>j</sub> come from the same user u, and (2) the
interactions a ∈ w<sub>j</sub> are ordered ascending according to their timestamp t.</p>
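        <p>The windowing can be sketched as follows (interactions again as (user, exercise, correct, timestamp) tuples; whether the original scheme samples with or without replacement is not stated — here we sample without replacement):</p>

```python
import random

def augment(tr_u, n_windows=20000, size=5):
    """Build the augmented set t~r_u as windows w_j of `size` random
    interactions such that (1) all interactions of a window come from
    the same user and (2) they are ordered ascending by timestamp."""
    by_user = {}
    for a in tr_u:
        by_user.setdefault(a[0], []).append(a)
    # only users with at least `size` interactions can fill a window
    users = [u for u, rows in by_user.items() if len(rows) >= size]
    windows = []
    for _ in range(n_windows):
        u = random.choice(users)
        windows.append(sorted(random.sample(by_user[u], size),
                              key=lambda a: a[3]))
    return windows
```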
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setting</title>
      <p>To reach our goal of predicting the students' performance on given exercises, we
experimented with a Convolutional Neural Network (CNN) architecture as well
as with collaborative filtering (CF).</p>
      <sec id="sec-3-0">
        <title>CNN</title>
        <p>As stated above, we want to predict the performance of each user u on the 5
exercises contained in te<sub>u</sub>. Hence, our CNN architecture maps sequences of 5
exercise ids (i<sub>1</sub>, …, i<sub>5</sub>) to their binary performance labels (r<sub>i1</sub>, …, r<sub>i5</sub>). We train
the network using the augmented training sets t̃r<sub>u</sub>.</p>
        <p>Since the inputs to the network are discrete, i.e. high-dimensional and sparse
ids, they are first mapped to lower-dimensional dense vectors by an embedding
layer. The dimension of the embeddings is set to 20. We use one convolutional
layer, with the number of filters and the kernel size set to 10 and 3, respectively.
This configuration worked best among the explored configurations: we varied the
number of filters between 5 and 30, the number of convolutional layers between
1 and 2, and the kernel size between 2 and 5.</p>
        <p>We flatten the output and feed it into a feed-forward layer that applies the
sigmoid activation to each output. The overall architecture is depicted in figure 1
alongside a generic input and output sample.</p>
        <p>For training, we set a batch size of 32 and apply the ADAM optimizer.
Moreover, we use binary cross-entropy as the loss function since it transforms
each element of the sigmoid output layer into an independent probability. In our
setting, these are interpreted as the probabilities of the exercises being solvable.</p>
      </sec>
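      <p>The exact implementation is not given in the text; as an illustration, the layer shapes described above can be traced with a plain NumPy forward pass. All weights below are random and untrained, the helper names are our own, and the ReLU activation of the convolutional layer is an assumption the paper does not state.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXERCISES, EMB_DIM, SEQ = 128, 20, 5   # corpus size, embedding dim, window
N_FILTERS, KERNEL = 10, 3                # best configuration from the text

# Random, untrained weights; a real model would learn these.
emb = rng.normal(size=(N_EXERCISES, EMB_DIM))
conv_w = rng.normal(size=(N_FILTERS, KERNEL, EMB_DIM)) * 0.1
dense_w = rng.normal(size=((SEQ - KERNEL + 1) * N_FILTERS, SEQ)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(exercise_ids):
    """Map 5 exercise ids to 5 independent success probabilities."""
    x = emb[exercise_ids]                              # (5, 20) embeddings
    # 1D convolution with 'valid' padding: 3 positions x 10 filters.
    conv = np.array([[np.sum(x[p:p + KERNEL] * conv_w[f])
                      for f in range(N_FILTERS)]
                     for p in range(SEQ - KERNEL + 1)])
    conv = np.maximum(conv, 0.0)                       # ReLU (assumed)
    return sigmoid(conv.reshape(-1) @ dense_w)         # flatten, dense, sigmoid
```

      <p>A trained model would learn the embedding, convolution, and dense weights from the augmented training sets with the ADAM and cross-entropy setup described above; the sketch only shows how 5 exercise ids become 5 independent probabilities.</p>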
      <sec id="sec-3-1">
        <title>Collaborative Filtering</title>
        <p>We also apply collaborative filtering (CF) using the kNN algorithm with Pearson
correlation as the similarity measure, as this is a long-established approach in the
field of recommender systems. We chose to consider only the k = 2 nearest neighbors
as this yields the best results on our dataset. To predict the performance of
user u, we fit the algorithm on the original user interactions contained in tr<sub>u</sub>. For
testing, we let the algorithm predict te<sub>u</sub> of each user u, analogous to the evaluation
of our CNN approach.</p>
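        <p>A minimal user-based variant of such a predictor can be sketched as follows (ratings as a {user: {exercise: 0/1}} dictionary; the helper names and the mean-of-neighbors aggregation are our own simplifications):</p>

```python
import math

def pearson(a, b):
    """Pearson correlation over the exercises both users have processed."""
    common = set(a) & set(b)
    if len(common) < 2:
        return None
    xs = [a[i] for i in common]
    ys = [b[i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs)
                    * sum((y - my) ** 2 for y in ys))
    return num / den if den else None

def predict(ratings, u, i, k=2):
    """Mean label of exercise i among the k neighbors most similar to u;
    None when no fitting neighbor has processed i."""
    sims = [(s, v) for v in ratings
            if v != u and i in ratings[v]
            and (s := pearson(ratings[u], ratings[v])) is not None]
    if not sims:
        return None
    top = sorted(sims, reverse=True)[:k]
    return sum(ratings[v][i] for _, v in top) / len(top)
```

        <p>Returning None when no fitting neighbor exists corresponds to the cases in which the CF cannot make a prediction.</p>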
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>As described in section 3, we evaluated a CNN architecture as well as
collaborative filtering in our recommendation setting. We use a majority vote per
exercise as a baseline: it assumes that user u will solve exercise i incorrectly
if the majority of other users did so.</p>
      <p>The classifiers' predictions are rounded half away from zero to obtain
dichotomous variables of predicted success. The metrics used for evaluation are
precision p, recall r, and f1-score. In addition, we evaluate the RMSE
between the unrounded predictions and r<sub>ui</sub>. The results, averaged over all
users, are shown in table 1.</p>
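      <p>These evaluation steps can be sketched in a few lines (our own helper; a library such as scikit-learn would serve equally well):</p>

```python
import math

def evaluate(y_true, y_score):
    """Round scores half away from zero, then compute precision, recall
    and f1 on the rounded labels, and RMSE on the raw scores."""
    y_pred = [1 if s >= 0.5 else 0 for s in y_score]
    tp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(y_pred, y_true) if p == 0 and t == 1)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    rmse = math.sqrt(sum((s - t) ** 2
                         for s, t in zip(y_score, y_true)) / len(y_true))
    return prec, rec, f1, rmse
```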
      <p>Both classifiers outperform the baseline with regard to all metrics, as can
be seen in table 1, leading to the conclusion that both classifiers are able to
recognize contextual effects in the students' handling of the exercises. Regardless
of the parameter k, the CF could not make predictions for 13 of the 120 exercises we
want to predict since no fitting neighbors were available.</p>
      <p>With regard to precision, both approaches perform comparably. However,
collaborative filtering yields a considerably higher recall than the CNN;
hence, the CF approach detects more of the positive items. Consequently, the f1-score
of the CF is higher than that of the CNN.</p>
      <p>With regard to the RMSE, the baseline again performs worse than both
machine-learning based approaches. The RMSE also gives further evidence that
the predictions of the CF are slightly closer to the true values than those
made by the CNN.</p>
    </sec>
    <sec id="sec-5">
      <title>Summary</title>
      <p>In this work we trained a CNN and CF to predict a student's success in solving
mathematical exercises presented to the student by our tutoring system. The
classifiers were given all interactions of the student except the last 5, along with
the interactions of all other students. We showed that both classifiers are able
to predict the students' success more accurately than a majority baseline.</p>
      <p>The results suggest that a classifier can help in selecting appropriate exercises
for a student. As a further step, we can incorporate it into our rule-based system
to help decide which exercise to present to a student.</p>
      <p>Nevertheless, one has to keep in mind that the results presented here are
drawn from a very limited set of students. This could also be a possible
explanation for why the neural network approach does not yield promising results. It
is therefore worthwhile to repeat the described experiments as soon as
more students have participated in the training.</p>
      <p>Moreover, it could be investigated whether the quality of the predictions can be
boosted if the system exploited more information than just the correctness
of an exercise, such as the student's average time spent solving the exercises or
their overall success.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Drachsler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verbert</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>O.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manouselis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Panorama of recommender systems to support learning</article-title>
          .
          <source>In: Recommender systems handbook</source>
          , pp.
          <fpage>421</fpage>
          –
          <lpage>451</lpage>
          . Springer (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Henning</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forstner</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heberle</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swertz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmolz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barberi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Regueras</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdu</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          , Pablo de Castro, J., et al.:
          <article-title>Learning pathway recommendation based on a pedagogical ontology and its implementation in moodle</article-title>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Pinkernell</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dusi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vogel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Aspects of proficiency in elementary algebra</article-title>
          .
          <source>In: 10th Congress of European Research in Mathematics Education</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tay</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Deep learning based recommender system: A survey and new perspectives</article-title>
          .
          <source>ACM Computing Surveys (CSUR) 52(1)</source>
          ,
          <volume>5</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>