Solving Mathematical Exercises: Prediction of Students' Success

Sebastian Wankerl1,2, Gerhard Götz2, and Andreas Hotho1

1 DHBW Mosbach, Lohrtalweg 10, 74821 Mosbach
2 Chair for Computer Science X, Am Hubland, 97074 Würzburg
sebastian.wankerl@uni-wuerzburg.de
gerhard.goetz@mosbach.dhbw.de
andreas.hotho@informatik.uni-wuerzburg.de

Abstract. In educational settings, recommender systems can help to choose the right exercises a student should be given for training. To make good decisions, the system should be able to estimate how successfully a student would answer a recommended exercise. In this work, we study the performance of convolutional neural networks and collaborative filtering for estimating students' success. We show that we can distinguish between correctly and incorrectly solved exercises with a precision of up to 64% while training on a small corpus of 712 user interactions.

Keywords: Recommender systems, Technology enhanced learning

1 Introduction

Recommender systems have been in widespread use for a long time. While the most prominent areas of application are still e-commerce and the entertainment industry [4], several attempts have been made to use recommendation algorithms in education [1,2], owing to the growing amount of online learning material and massive open online courses (MOOCs). We want to build a recommender system that helps freshmen at university to overcome their weaknesses in mathematics. Since no public data is available for our aim, we built a rule-based system which exploits didactical ontologies [3] to find suitable exercises for a student. Furthermore, this preliminary system makes it possible to collect data we can later use for building a system that better adapts to the individual student. In this work, we present a first step in this direction, namely a machine-learning based approach to estimate whether a student will correctly solve an exercise given by our rule-based system.
In particular, we want to detect, for every user, those exercises that they were given but were unable to master. Recommending such exercises allows students to learn more and to broaden their skills instead of simply repeating content they already know.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Educational System and Dataset

Our dataset contains 128 mathematical exercises. With our rule-based system, we collected data from 39 students who contributed a total of 787 processed exercises. However, for the following analysis we only used the data of participants who contributed at least 10 exercises, in order to have enough data for each user. This leaves us with 24 users and 712 processed exercises.

The data points, also called interactions, are represented by 4-tuples (u, i, r_ui, t) consisting of a user u, an exercise i, and a binary rating label r_ui which indicates whether user u solved exercise i correctly or not. In addition, we keep a consecutive timestamp t which preserves the chronological order of the interactions. The number of interactions contributed by user u is denoted T_u.

2.1 Training and Test Dataset

In this section we describe how we build the test and training dataset for each student, which we use in our experiments. To maximize the amount of data usable for training, we keep all data of a user except the last 5 interactions, on which we want to predict the user's performance. Hence, these last five interactions are our test set, denoted te_u. We denote the basis training set for user u as tr_u. For example, if we want to predict the performance of user u_1, we keep all interactions from users u_2, ..., u_24 together with the first T_1 − 5 interactions of user u_1 as the basis training set. Since we expect the CNN to require more training data than provided by the basis training set, we decided to create a second training set t̃r_u by augmenting the basis training set.
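As an illustration, the per-user hold-out split described above can be sketched as follows. The helper name and the tuple layout are our own assumptions (the paper publishes no code); interactions are assumed to be stored as (user, exercise, rating, timestamp) tuples:

```python
from collections import defaultdict

def split_last_k(interactions, k=5):
    """Hold out the last k interactions (by timestamp) of every user.

    Returns two dicts keyed by user: the kept training interactions
    (the first T_u - k of each user) and the held-out test set te_u.
    Tuple layout (user, exercise, rating, timestamp) is an assumption.
    """
    by_user = defaultdict(list)
    for u, i, r, t in interactions:
        by_user[u].append((u, i, r, t))

    train, test = {}, {}
    for u, rows in by_user.items():
        rows.sort(key=lambda row: row[3])  # chronological order by t
        train[u] = rows[:-k]               # first T_u - k interactions
        test[u] = rows[-k:]                # last k interactions -> te_u
    return train, test
```

For user u, the basis training set tr_u then consists of train[u] together with all interactions of the other users.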
More precisely, it consists of replicas of the users' interaction histories. They are obtained by creating approximately 20,000 windows w_j of size 5 consisting of random interactions a_j1, ..., a_j5 ∈ tr_u such that the following conditions are fulfilled:

1. all interactions a ∈ w_j come from the same user u
2. the interactions a ∈ w_j are ordered by ascending timestamp t

3 Experimental Setting

To reach our goal of predicting the students' performance on given exercises, we experimented with a Convolutional Neural Network (CNN) architecture as well as collaborative filtering (CF).

3.1 CNN

As stated above, we want to predict the performance of each user u on the 5 exercises contained in te_u. Hence, our CNN architecture maps sequences of 5 exercise ids (i_1 ... i_5) to their binary performance labels (r_i1 ... r_i5). We train the network using the augmented training sets t̃r_u.

Since the inputs to the network are discrete, i.e. high-dimensional and sparse ids, they are first mapped to lower-dimensional dense vectors using an embedding layer. The dimension of the embeddings is set to 20. We use one convolutional layer, with the number of filters and the kernel size set to 10 and 3, respectively. This configuration worked best among the explored configurations: we tested between 5 and 30 filters, 1 and 2 convolutional layers, and kernel sizes between 2 and 5. We flatten the output and feed it into a feed-forward layer that applies the sigmoid activation to each output. The overall architecture is depicted in figure 1 alongside a generic input and output sample.

[Fig. 1. Visualization of the neural network architecture with example data]

For training, we use a batch size of 32 and apply the ADAM optimizer. Moreover, we use binary cross-entropy as the loss function since it transforms each element of the sigmoid output layer into an independent probability.
In our setting, these probabilities are interpreted as the probabilities of the exercises being solved correctly.

3.2 Collaborative Filtering

We also apply collaborative filtering (CF) using the kNN algorithm with the Pearson correlation as similarity measure, as this is a long-established approach in the field of recommender systems. We chose to consider only the k = 2 nearest neighbors, as this yields the best results on our dataset. To predict the performance of user u, we fit the algorithm on the original user interactions contained in tr_u. For testing, we let the algorithm predict te_u for each user u, analogous to the evaluation of our CNN approach.

4 Results

As described in section 3, we evaluated a CNN architecture as well as collaborative filtering on our recommendation setting. As a baseline, we use the majority vote for each exercise: it assumes that user u will solve exercise i incorrectly if the majority of other users did so.

Table 1. Results of the Classifiers

Classifier  Precision  Recall  F1    RMSE
CNN         0.64       0.52    0.54  0.48
CF          0.64       0.61    0.59  0.47
Majority    0.56       0.46    0.47  0.53

The classifiers' predictions are rounded half away from zero to obtain dichotomous variables of predicted success. The metrics used for evaluation are precision p, recall r, and the f1-score. In addition, we evaluated the RMSE between the unrounded predictions and r_ui. The results, averaged over all users, are shown in table 1.

As table 1 shows, both classifiers outperform the baseline with regard to all metrics, leading to the conclusion that both classifiers are able to recognize contextual effects in the students' handling of the exercises. Regardless of the parameter k, the CF could not make predictions for 13 of the 120 exercises we want to predict, since no fitting neighbors were available. With regard to precision, both approaches perform comparably. However, collaborative filtering yields a considerably higher recall score than the CNN; hence, the CF approach detects more of the positive items.
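A minimal sketch of the kNN prediction with Pearson similarity described in section 3.2, including the "no fitting neighbors" case just mentioned. The function names and the NaN-masked rating-matrix layout are our own assumptions, not the authors' implementation:

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation over co-rated exercises; NaN marks missing."""
    mask = ~np.isnan(a) & ~np.isnan(b)
    if mask.sum() < 2:
        return None                       # no usable overlap
    x = a[mask] - a[mask].mean()
    y = b[mask] - b[mask].mean()
    denom = np.sqrt((x * x).sum() * (y * y).sum())
    if denom == 0:
        return None                       # constant ratings, undefined
    return float((x * y).sum() / denom)

def predict(R, u, i, k=2):
    """Predict the rating of user u for exercise i from the k most
    similar users who rated i; None if no neighbor is available."""
    sims = []
    for v in range(R.shape[0]):
        if v == u or np.isnan(R[v, i]):
            continue
        s = pearson(R[u], R[v])
        if s is not None:
            sims.append((s, R[v, i]))
    if not sims:
        return None                       # the "no fitting neighbors" case
    sims.sort(key=lambda pair: -pair[0])
    top = sims[:k]
    w = sum(abs(s) for s, _ in top)
    if w == 0:
        return sum(r for _, r in top) / len(top)
    return sum(s * r for s, r in top) / w
```

Returning None when no neighbor has rated the target exercise mirrors the 13 of 120 cases in which the CF could not make a prediction.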
Consequently, the f1-score of the CF is also higher than that of the CNN. With regard to the RMSE, the baseline again performs worse than both machine-learning based approaches. Moreover, the RMSE gives further evidence that the predictions of the CF are slightly closer to the true values than those made by the CNN.

5 Summary

In this work we trained a CNN and a CF model to predict a student's success in solving mathematical exercises presented to the student by our tutoring system. The classifiers were given all interactions of the student except the last 5, along with the interactions of all other students. We showed that both classifiers are able to predict the students' success more accurately than a majority baseline.

The results suggest that a classifier can help in selecting appropriate exercises for a student. As a further step, we can incorporate it into our rule-based system to help decide which exercise to present to a student. Nevertheless, one has to keep in mind that the results presented here are drawn from a very limited set of students. This could also be a possible explanation why the neural network approach does not yield more promising results. It is therefore worthwhile to repeat the described experiments as soon as more students have participated in the training. Moreover, it can be investigated whether the quality of the predictions can be boosted if the system could exploit more information than just the correctness of an exercise, such as the student's average time spent solving the exercises or their overall success.

References

1. Drachsler, H., Verbert, K., Santos, O.C., Manouselis, N.: Panorama of recommender systems to support learning. In: Recommender Systems Handbook, pp. 421–451. Springer (2015)
2.
Henning, P.A., Forstner, A., Heberle, F., Swertz, C., Schmölz, A., Barberi, A., Verdu, E., Regueras, L.M., Verdu, M.J., Pablo de Castro, J., et al.: Learning pathway recommendation based on a pedagogical ontology and its implementation in Moodle (2014)
3. Pinkernell, G., Düsi, C., Vogel, M.: Aspects of proficiency in elementary algebra. In: 10th Congress of European Research in Mathematics Education (2017)
4. Zhang, S., Yao, L., Sun, A., Tay, Y.: Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR) 52(1), 5 (2019)