Developing predictive models for early detection of at-risk students on distance learning modules

Annika Wolff, Zdenek Zdrahal, Drahomira Herrmannova, Jakub Kuzilek, Martin Hlosta
Knowledge Media Institute, The Open University
Walton Hall, Milton Keynes, MK7 6AA
+44(0)1908 659462, 654512, 652477, 659109, 653800
{a.l.wolff; z.zdrahal; drahomira.herrmannova; jakub.kuzilek; martin.hlosta}@open.ac.uk

ABSTRACT
Not all students who fail or drop out would have done so if they had been offered help at the right time. This is particularly true on distance learning modules, where there is no direct tutor/student contact but where it has been shown that making contact at the right time can improve a student's chances. This paper explores the latest work conducted at the Open University, one of Europe's largest distance learning institutions, to identify the optimum time to make student interventions and to develop models that identify the at-risk students in this time frame. This work in progress is taking real-time data and feeding it back to module teams as the module is running. Module teams will be indicating which of the predicted at-risk students have received an intervention, and the nature of the intervention.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining; D.4.8 [Performance]: Modelling and prediction

General Terms
Algorithms, Design, Experimentation, Human Factors, Theory.

Keywords
predictive models, machine learning, student data, Bayesian models, distance learning

1. INTRODUCTION
Predictive modelling techniques can be applied to student data to identify students who are at risk of failing or withdrawing from a module. Tutors or module teams can use this information to aid their decision-making about whom they should contact to offer help, leading to better strategic use of resources and improved retention. For example, the Course Signals system has been successfully in place at Purdue University for some time, providing feedback to students based on predictions from multiple data sources (Arnold and Pistilli, 2012). The Open University (OU) is one of the largest distance learning institutions in Europe. OU modules are making increasing use of the Virtual Learning Environment (VLE), Moodle, to supply learning materials, instead of the paper materials previously supplied in the post.

This paper explores the latest work at the Open University using data from the VLE, combined with demographic data, to predict student failure or dropout. This ongoing work is already providing real-time information to module teams and will be fully evaluated later in the year. The methods investigate the role of VLE activity compared with demographic data and attempt to make predictions about a student before they submit their first assessment. This first assessment has been found to be a very good predictor of a student's final outcome on a module.

This work is the culmination of a number of previous projects investigating the potential for different methods to produce accurate predictions. We will first briefly describe some of the previous work at the OU before examining the current methods, preliminary feedback on these, and plans for future evaluation.

2. Previous work with OU data
Decision trees have proved a fairly popular method for exploring the potential for building predictive models from student data (see Baradwaj and Pal, 2011; Pandey and Sharma, 2013; Kabra and Bichkar, 2011). Initial work with OU data focused on using decision trees to predict student outcome from VLE data combined with assessment scores. Each OU module evaluates students periodically with a Tutor Marked Assessment (TMA). The exact number may vary from module to module, but seven TMAs is quite typical. Three modules, each with fairly typical VLE usage and a large student cohort (between 1200 and 4400 students registered), were chosen for building and testing the models.

The main findings were that decision trees were fairly good both at predicting a drop in performance in a subsequent assessment and at predicting the final outcome of the module. Prediction was overall better when combining VLE and TMA data. This preliminary work also suggested that the absolute amount of clicking within the VLE was not directly correlated with outcome: students could click a lot but still fail, or not click at all and still pass. However, a reduction in clicking was a warning sign. The models were developed and tested on single presentations of the three modules, then tested on a future presentation of the same module. Finally, they were tested on each other (in other words, developed on one module and applied to another). As expected, accuracy was reduced in the latter two cases, but interestingly not by as much as might have been expected. A brief investigation into including demographic data revealed that prediction could be improved with this data source. This work is described in detail in Wolff et al. (2013a).

The next phase of work investigated more fully the potential for using demographic data and focused on Bayesian models for prediction, which were compared with simpler methods of linear and logistic regression and a weighted score. The key findings were that a) including VLE data improved the accuracy of predictions compared to using demographic data alone, and b) there was little real difference between the different methods evaluated; accuracy increased as the module progressed. However, the majority of dropout occurs early in the module (Wolff et al., 2013b).

Figure 2. TMA1 is a strong predictor of the final result
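The earlier finding that a reduction in clicking, rather than its absolute level, is the warning sign could be operationalised roughly as follows. This is an illustrative sketch only: the window and drop threshold are hypothetical choices of ours, not parameters taken from the OU models.

```python
def clicking_warning(weekly_clicks, drop_ratio=0.5, window=2):
    """Flag a student whose recent VLE clicking has fallen sharply.

    weekly_clicks: list of click counts, one per week (oldest first).
    drop_ratio: fraction of the earlier average below which the recent
        average counts as a warning (hypothetical value).
    window: number of weeks in each comparison period.
    """
    if len(weekly_clicks) < 2 * window:
        return False  # not enough history to compare yet
    earlier = weekly_clicks[-2 * window:-window]
    recent = weekly_clicks[-window:]
    earlier_avg = sum(earlier) / window
    recent_avg = sum(recent) / window
    # A student who never clicked is a separate case: there is no drop
    # to detect, consistent with "not clicking at all but still passing".
    if earlier_avg == 0:
        return False
    return recent_avg < drop_ratio * earlier_avg

print(clicking_warning([40, 38, 12, 8]))  # sharp decline in activity
print(clicking_warning([5, 6, 5, 6]))     # steady low-level activity
```

A steady low clicker is deliberately not flagged, matching the observation that absolute click volume did not correlate directly with outcome.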
Some focused investigation into the role of the first TMA in predicting the final outcome found that failing the first assessment had a significant negative impact. Therefore, the key to improving retention is identifying those students who are at risk of either submitting but failing, or not submitting, this first assessment. This is described in more detail in the next section, where the overall problem is specified.

3. Problem Specification
For identifying students at risk we can use knowledge about students' behaviour and performance in the current presentation, their demographic data, and data about the module and the performance of other students in previous presentations. In this task we do not consider students' overall learning objectives, nor their previous or current performance in other modules. This is shown diagrammatically in Figure 1. A1-An indicate the times at which module assessments are due. Vle1-Vlen are the VLE clicks in the periods between either the start of the module and the first assessment, or between consecutive assessments.

Figure 1. Prediction problem

Given the demographic data, the results of TMAs so far and the VLE activities, the goal is to identify as early as possible the students who are at risk and for whom an intervention is meaningful. By a meaningful intervention, we mean that the student can be helped to pass the module and the overall cost of interventions is affordable. Reasoning about the future behaviour of a student is based on experience with students with similar characteristics in previous presentations of the same module.

Our analysis indicates that VLE data is more important than demographic characteristics. Moreover, performance at the early stages of the module presentation is a very good predictor of final success or failure. In the analysed modules, students who fail or do not submit the first TMA have a high probability of overall failure. For this reason it is crucial to concentrate the effort to identify at-risk students before the TMA1 deadline. This is indicated in Figure 2.

The VLE opens two to three weeks prior to the start of the module presentation so that students can smoothly engage early in a number of module-related activities. In order to achieve early predictions for TMA1, we start analysing records from the very opening of the VLE, i.e. well before the presentation starts. VLE activities can be classified into a number of actions and activity types depending on what the student is trying to do. Of these, we have identified four activity types that provide useful information for prediction. They are denoted as follows:
• Resource contains books and other educational materials for the students
• Forum is a web site where students communicate with their tutors and with each other
• Subpage is the means of navigation in the VLE structure
• OU Content refers to the specification of TMAs and the guidelines for their elaboration

Our predictive modelling algorithms use, for each student, weekly aggregates of all four activity types and all their combinations. Therefore, for each student and each week we get a 16-dimensional vector (N, R, F, S, O, RF, RS, ..., RFSO), where N means "no VLE activity". Some algorithms use numeric values describing the number of accesses of a particular activity type; others use mutually exclusive Boolean values representing the fact that the student engaged in the particular combination of activity types.

4. Methods for early detection of failure
Predictions of at-risk students are calculated and updated every week, starting from the opening of the VLE. The prediction combines the results of four machine learning algorithms:
1. k Nearest Neighbours (k-NN) makes use of weekly aggregates represented as 16-dimensional numerical activity-type vectors compared with legacy data from previous presentations.
2. k Nearest Neighbours (k-NN) is based on a similar approach but uses only demographic data. Since demographic data typically has nominal values, an important part of the algorithm was how to define the distance between two demographic sets.
3. Classification and Regression Tree (CART) is calculated from the VLE data and TMA1 of previous presentations and then used for the classification of current students.
4. Bayes network combines both demographic and VLE data. Chi-square tests showed that a statistically independent subset of demographic data exists. For the smaller number of demographic variables a full Bayes network has been constructed; for the complete set, we implemented naive Bayes.

The results of these methods are combined by majority voting.

Figure 4. 3-nearest neighbours based on demographic and VLE distances

The icon representing the evaluated student is in the centre. The upper right quadrant shows the three nearest neighbours in the current presentation. The nearest neighbours of three previous presentations are organised anticlockwise. In the quadrants representing the previous presentations, the red icons indicate that the student failed, the green ones indicate a pass. In the current presentation the icons show the predicted outcome. The amber icons show the borderline cases. The default split is calculated by the algorithms, but the tutors can express their experience by moving the slider.

Figure 3a. Mockup module dashboard

The list of students identified as at risk is passed to the module team for possible interventions. Currently, the data is passed in a spreadsheet, whilst the dashboard mockups are being evaluated with module teams; they will be completed and integrated with the models and data when the design is finalized. The spreadsheet rank-orders the students by their weighted risk score.
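The weekly 16-dimensional activity-type encoding and the majority vote over the four classifiers can be sketched as follows. The function and variable names are our own; this is a minimal illustration of the encoding described above, not the project's implementation.

```python
from itertools import combinations

# The four activity types; N (no VLE activity) is the empty combination.
TYPES = ["R", "F", "S", "O"]  # Resource, Forum, Subpage, OU Content

# All 16 combinations: N, R, F, S, O, RF, RS, ..., RFSO.
COMBOS = ["N"] + ["".join(c) for k in range(1, 5)
                  for c in combinations(TYPES, k)]

def weekly_boolean_vector(clicks):
    """Encode one student-week as a mutually exclusive 16-dim Boolean vector.

    clicks: dict mapping activity type ("R", "F", "S", "O") to click count.
    Exactly one component is 1: the combination of types the student used
    that week (or N if there was no activity at all).
    """
    used = "".join(t for t in TYPES if clicks.get(t, 0) > 0)
    key = used if used else "N"
    return [1 if combo == key else 0 for combo in COMBOS]

def majority_vote(predictions):
    """Combine at-risk flags from the individual models by simple majority."""
    return sum(predictions) > len(predictions) / 2

vec = weekly_boolean_vector({"R": 3, "F": 0, "S": 1, "O": 0})
print(len(vec), COMBOS[vec.index(1)])        # 16 RS
print(majority_vote([True, True, False, True]))  # 3 of 4 models agree: True
```

With four models, a 2-2 split in this sketch defaults to "not at risk"; the paper does not specify how ties are broken, so that choice is ours.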
An explanation for the prediction of each of the models is given. The first two explanations point to the nearest neighbours from the previous presentations (first the closest by their VLE activity and secondly the closest by their demographic data). Next, the prediction according to the decision tree is explained in terms of the applied rule, which may combine the student's level of VLE activity with some demographic attributes (these are the normal demographic attributes that are collected about students, e.g. age and previous academic background). Finally, the prediction of the Bayes classifier is presented, with an explanation similar to that of the decision tree, combining VLE with demographic information. In some cases, the predictions from the four models do not match.

The mockups of the dashboards for presenting the results are shown in Figures 3a and 3b. Figure 3a demonstrates a view across the students of a module. The upper graph presents an overview of VLE activities. The lower table organizes students according to their predicted outcome at the current point in the module, including an explanation for the prediction. Figure 3b shows the view of an individual student.

Figure 3b. Mockup dashboard describing an individual student

The detail of the interface that allows the tutor to change the balance of predictions based on demographic and VLE data is shown in Figure 4.

5. Evaluation
Evaluation of the latest methods will occur when the module has completed. Regardless of the predictive methods being used, there is a general expectation among module teams that retention will improve in this presentation due to other factors, such as improved module design and changes to student funding and the financial commitments that students are now making. This will clearly affect the ability to draw any firm conclusions about what to attribute improved retention to, should that turn out to be the case.
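Once TMA1 results are in, one quantitative check consistent with this evaluation plan is to score the early at-risk predictions against the actual outcomes. A minimal sketch with hypothetical student identifiers (not the project's evaluation code):

```python
def precision_recall(predicted_at_risk, actually_failed):
    """Precision and recall of at-risk predictions against real outcomes.

    Both arguments are sets of student identifiers: those flagged as
    at risk before TMA1, and those who failed or did not submit it.
    """
    true_pos = len(predicted_at_risk & actually_failed)
    precision = true_pos / len(predicted_at_risk) if predicted_at_risk else 0.0
    recall = true_pos / len(actually_failed) if actually_failed else 0.0
    return precision, recall

# Hypothetical data: predictions made before TMA1, outcomes known after.
predicted = {"s01", "s02", "s03", "s04"}
failed = {"s02", "s03", "s05"}
p, r = precision_recall(predicted, failed)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Note that where interventions work, some apparent false positives are really successes (the student was correctly flagged and then helped to pass), so falling precision after interventions is not necessarily a failure of the models.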
However, it is still worth looking at the overall retention compared to previous presentations. It is also possible to use qualitative methods, such as looking at student feedback, or speaking with the module tutors and module teams. In addition, it is possible to make a hypothesis about the accuracy of the methods where interventions have been made. If interventions are having an effect, then this should reduce the accuracy of the predictions. Specifically, it should be the case that predictions made for a student prior to an intervention will give a false positive result for failure. The precision and recall of the methods on this module at this point in time can be compared with the methods applied to other modules at the same point in time, to test for significant differences.

The first set of predicted outcomes for TMA1 has been provided to one of the pilot module teams and action will be taken in the very near future. While it is not possible to know yet what the final evaluation will show, the module team, as well as wider support networks for OU students, have been looking at the initial outputs and feel very positive about the potential for the technology to integrate into wider OU practice and provide an important source of information, both for strategically targeting support to students when they need it and for improving the advice given to students as they begin their studies. The feedback from the first set of output data has been very positive. A full evaluation will occur later in the year, when the module is complete.

6. Conclusion
Where previous work has demonstrated that it is possible to accurately identify at-risk students throughout a module presentation, this latest work focuses specifically on increasing accuracy for early detection. Most students who fail get into difficulties very early on, so this is the critical point at which to make an intervention. Predictions are made with reference to a student's nearest neighbours, based firstly on demographic data and secondly on VLE data, allowing the two data sources to be balanced against each other and, over time, allowing a better understanding of the role of each. In addition, CART and Bayes models are applied to the combined VLE and demographic data. Predictions from the four models are weighed against each other to produce a list of students ranked in order of risk. Currently, this is provided in a spreadsheet to module teams, along with explanations from each of the models. Dashboards are being constructed to visualize this information.

7. REFERENCES
[1] Arnold, K.E. and Pistilli, M.D. 2012. Course Signals at Purdue: Using Learning Analytics to increase student success. In: Learning Analytics and Knowledge, 29 April - 2 May 2012, Vancouver, Canada.
[2] Baradwaj, B. and Pal, S. 2011. Mining Educational Data to Analyze Students' Performance. International Journal of Advanced Computer Science and Applications 2(6): 63-69.
[3] Pandey, M. and Sharma, V.K. 2013. A Decision Tree Algorithm Pertaining to the Student Performance Analysis and Prediction. International Journal of Computer Applications 61(13): 1-5, New York, USA.
[4] Kabra, R.R. and Bichkar, R.S. 2011. Performance Prediction of Engineering Students using Decision Trees. International Journal of Computer Applications 36(11): 8-12, New York, USA.
[5] Wolff, A., Zdrahal, Z., Nikolov, A. and Pantucek, M. 2013a. Improving retention: predicting at-risk students by analysing clicking behaviour in a virtual learning environment. In: Third Conference on Learning Analytics and Knowledge (LAK 2013), 8-12 April 2013, Leuven, Belgium.
[6] Wolff, A., Zdrahal, Z., Herrmannová, D. and Knoth, P. 2013b. Predicting Student Performance from Combined Data Sources. In: Peña-Ayala, A. (ed.), Educational Data Mining: Applications and Trends, Springer, 524.