     A predictive model for identifying students with dropout
                    profiles in online courses

Marcelo A. Santana, Institute of Computing, Federal University of Alagoas, marcelo.almeida@nti.ufal.br
Evandro B. Costa, Institute of Computing, Federal University of Alagoas, evandro@ic.ufal.br
Baldoino F. S. Neto, Institute of Computing, Federal University of Alagoas, baldoino@ic.ufal.br
Italo C. L. Silva, Institute of Computing, Federal University of Alagoas, italocarlo@nti.ufal.br
Joilson B. A. Rego, Institute of Computing, Federal University of Alagoas, jotarego@gmail.com

ABSTRACT
Online education often faces high student dropout rates across many subject areas. A large amount of historical data about students in online courses is available. Hence, a relevant problem in this context is to examine those data in order to find effective mechanisms for understanding student profiles and for identifying, at an early stage of the course, students whose characteristics suggest they will drop out. In this paper, we address this problem by proposing predictive models that support educational managers in identifying students who are on the verge of dropping out. Four classification algorithms based on different classification methods were used during the evaluation, in order to find the model with the highest accuracy in predicting the profile of dropout students. Data for model generation were obtained from two data sources available at the University. The results showed that the model generated with the SVM algorithm was the most accurate among those selected, with 92.03% accuracy.

Keywords
Dropout, Distance Learning, Educational Data Mining, Learning Management Systems

1.    INTRODUCTION
Enrollment in the E-learning modality has increased considerably every year; in 2013, 15,733 courses were offered in the E-learning or blended modality. Furthermore, institutions are very optimistic: 82% of the surveyed institutions believe that enrollment will expand considerably in 2015 [1], which shows the evolution of E-learning and its importance as a tool for educating citizens. Learning Management Systems (LMS) [15] can be considered one of the factors that have played an important role in popularizing this learning modality [1].

Despite the rapid growth of online courses, there has also been rising concern over a number of problems. One issue in particular that is difficult to ignore is that these online courses also have high dropout rates. Specifically, in Brazil in 2013, according to the latest census (Censo) published by the Brazilian E-learning Association (ABED), the average dropout rate was about 19.06% [1].

Beyond the difficulty of identifying students who are at risk of dropping out, dropout also causes substantial damage to financial and social resources. Society also loses when these resources are poorly managed, since a student occupies a place in the course but abandons it before the end.

Online education often faces high student dropout rates across many subject areas. A large amount of historical data about students in online courses is available. Hence, a relevant problem in this context is to examine those data in order to find effective mechanisms for understanding student profiles and for identifying, at an early stage of the course, students whose characteristics suggest they will drop out.

In this paper, we address this problem by proposing predictive models that support educational managers in identifying students who are on the verge of dropping out. The predictive model takes into consideration academic elements related to students' performance in the initial disciplines of the course. Data from the Information Systems course at the Federal University of Alagoas (UFAL), which uses the well-known Moodle LMS, were used to build this model.

A tool supporting the pre-processing phase was used to prepare the data for the application of data mining algorithms. The Pentaho Data Integration [2] tool covers the extraction, transformation and loading (ETL) of data, making it easier to generate files in a format compatible with the adopted data mining software, WEKA [5].

Therefore, the issues presented above justify investment in developing efficient methods for the prediction, assessment and follow-up of students at risk of dropping out,
allowing future planning and the adoption of proactive measures aimed at reducing this problem.

The rest of the paper is organized as follows. Section 2 presents some related work. Section 3 describes the environment for construction of the predictive model. Afterwards, we present the experiment settings in Section 4, and in Section 5 we discuss the results of the experiment. Section 6 presents some concluding remarks and directions for future work.

2.   RELATED WORK
Several studies have been conducted in order to find out the reasons for high dropout rates in online courses. Among them, Xenos [18] reviews Open University students enrolled in a computing course. In that study, five acceptable reasons that might have caused the dropout were identified: Professional (62.1%), Academic (46%), Family (17.8%), Health Issues (9.5%) and Personal Issues (8.9%). According to Barroso and Falcão (2004) [6], the conditions that motivate dropout are classified into three groups: i) Economic - impossibility of remaining in the course because of socio-economic issues; ii) Vocational - the student does not identify with the chosen course; iii) Institutional - failure in initial disciplines, shortcomings in earlier content, inadequacy of the learning methods.

Manhães et al. [14] present a novel architecture that uses EDM techniques to predict and identify those who are at risk of dropping out. The paper shows initial experimental results using real-world data about three undergraduate engineering courses of one of the largest Brazilian public universities. According to the experiments, the Naive Bayes classifier presented the highest true positive rate for all datasets used in the experiments.

A model for predicting students' performance levels is proposed by Erkan Er [9]. Three machine learning algorithms were employed: an instance-based learning classifier, Decision Tree and Naive Bayes. The overall goal of the study is to propose a method for accurate prediction of at-risk students in an online course. Specifically, data logs of an LMS called METU-Online were used to identify at-risk students and successful students at various stages during the course. The experiment was carried out in two phases: testing and training. These phases were conducted in three steps, which correspond to different stages in a semester. At each step, the number of attributes in the dataset was increased, and all attributes were included at the final stage. The important characteristic of the dataset was that it contained only time-varying attributes rather than time-invariant attributes such as gender or age. According to the author, the time-invariant data did not have a significant impact on the overall results.

Dekker [8] presents a data mining case study demonstrating the effectiveness of several classification techniques and the cost-sensitive learning approach on the dataset from the Electrical Engineering department of Eindhoven University of Technology. Two decision tree algorithms, a Bayesian classifier, a logistic model, a rule-based learner and the Random Forest were compared. The OneR classifier was also considered as a baseline and as an indicator of the predictive power of particular attributes. The experimental results show that rather simple classifiers give a useful result, with accuracies between 75 and 80%, that is hard to beat with other more sophisticated models. The study demonstrates that cost-sensitive learning does help to bias classification errors towards preferring false positives to false negatives. We believe that the authors could get better results by making some adjustments to the parameters of the algorithms.

Jaroslav [7] aims to develop a method to classify students at risk of dropout throughout the course. Using personal data of students enriched with data related to social behaviours, Jaroslav uses dimensionality reduction techniques and various algorithms in order to find which gives the best results, achieving accuracy rates of up to 93.51%; however, the best rates are obtained at the end of the course. Since the goal is to identify dropout early on, the study would be more relevant if the best results were obtained at the beginning of the course.

In summary, several studies have investigated the application of EDM techniques to predict and identify students who are at risk of dropout. Those works share similarities: (i) they identify and compare algorithm performance in order to find the most relevant EDM techniques to solve the problem, or (ii) they identify the relevant attributes associated with the problem. Some works use past time-invariant student records (demographic and pre-university student data). The contribution of this study with respect to those presented in this section is that it joins two different systems, gathering a larger number of attributes, both time-varying and time-invariant. Besides being concerned with the identification and comparison of algorithms, it identifies the attributes of great relevance and addresses the problem of predicting earlier which students are likely to drop out.

3.   ENVIRONMENT FOR CONSTRUCTION OF PREDICTIVE MODEL
This section presents an environment for the construction of a predictive model that supports educators in the task of identifying prospective students with dropout profiles in online courses. The environment is depicted in Figure 1.

Figure 1: Environment for Construction of predictive model

The environment proposed in this work is composed of three layers: Data source, Model development and Model. The data sources are located in the first layer. Data about all
students enrolled at the University are stored in two data sources. The first one contains students' personal data, for example age, gender, income, marital status and grades, from the academic control system used by the University. Information related to frequency of access, participation, use of the available tools, and students' grades in the activities proposed within the environment is kept in the second data source.

In the second layer, the pre-processing [11] of the data is initiated. Sequential steps are executed in this layer in order to prepare the data for the data mining process. In the original data, some information may not be represented in the format expected by the data mining algorithm, and there may be redundant data or data with some kind of noise. These problems can produce misleading results or make the execution of the algorithm computationally more expensive.

This layer is divided into the following stages: data extraction, data cleaning, data transformation, data selection and the choice of the algorithm that best fits the model. Each step of this layer is briefly described below.

Data extraction: The extraction phase establishes the connection with the data source and performs the extraction of the data.

Data cleaning: This routine tries to fill in missing values, smooth out noise while identifying outliers, and correct data inconsistencies.

Data transformation: In this step, data are transformed and consolidated into appropriate forms for mining by performing summary or aggregation operations. Sometimes, data transformation and consolidation are performed before the data selection process, particularly in the case of data warehousing. Data reduction may also be performed to obtain a smaller representation of the original data without sacrificing its integrity.

Data selection: In this step, data relevant to the analysis task are retrieved from the database.

Choice of algorithm: The algorithm that identifies students with a dropout profile most accurately was considered the one that best fits the model.

Finally, the last layer is the presentation of the model. This layer post-processes the result obtained in the lower layer and presents it to the end user in a more understandable way.

4.   EXPERIMENT SETTINGS
The main objective of this research is to build a predictive model that supports educators in the hard task of identifying prospective students with dropout profiles in online courses, using Educational Data Mining (EDM) techniques [16]. This section is organized as follows: Section 4.1 describes the issue which drives our assessment. Section 4.2 shows which data were selected for the data set used in the experiment and which algorithms were chosen for the data mining. Section 4.3 indicates the tools employed during the execution of the experiment. Finally, Section 4.4 shows every step of the experiment execution, including data consolidation, data preprocessing and algorithm execution.

4.1     Planning
The research question that we would like to answer is:

RQ. Is our predictive model able to identify students with dropout risk early?

In order to answer this question, EDM techniques with four different classification methods were used, aiming to obtain a predictive model that indicates accurately which students have a dropout profile, taking into consideration only data about the initial disciplines of a given course.

4.2     Subject Selection
4.2.1    Data Selection
The Federal University of Alagoas offers undergraduate courses, postgraduate courses and E-learning courses. The online courses have more than 1800 registered students [4].

An E-learning course is usually partitioned into semesters, with different disciplines taught in each semester. Each semester usually has five disciplines, and each discipline lasts between five and seven weeks. Anonymous data from the Information Systems E-learning course, relative to the first semester of 2013, were selected from this environment. Data of one discipline (Algorithm and Data Structure I), chosen based on its relevance, were analysed. This discipline has about 162 students enrolled.

4.2.2    Machine Learning Algorithms Selection
In this work, four machine learning algorithms based on different classification methods were used to predict student dropout. The methods used were: a simple probabilistic classifier based on the application of Bayes' theorem, a decision tree, a support vector machine and a multilayer neural network.

These techniques have been successfully applied to solve various classification problems and operate in two phases: (i) training and (ii) testing. During the training phase, each technique is presented with a set of example data pairs (X, Y), where X represents the input and Y the respective output of each pair [13]. In this study, Y can take one of two values, "approved" or "reproved" (i.e., failed), which correspond to the student's situation in the discipline.

4.3     Instrumentation
The Pentaho Data Integration [2] tool was chosen to carry out all preprocessing steps on the selected data. Pentaho is open-source software, developed in Java, which covers the extraction, transformation and loading of data [2], making it easier to create a model able to perform: (i) extraction of information from the data sources, (ii) attribute selection, (iii) data discretization and (iv) file generation in a format compatible with the data mining software.

For the execution of the selected classification algorithms (see Section 4.2.2), the data mining tool Weka was selected. These algorithms are implemented in Weka as NaiveBayes (NB), J48 (AD), SMO (SVM) and MultilayerPerceptron (RN)
[17], respectively. Weka is open-source software that contains a collection of machine learning algorithms for data mining tasks [5].

Some features were taken into consideration for the adoption of Weka [10], such as: ease of acquisition, since it can be downloaded directly from the developer's page at no cost; the availability of several versions of data mining algorithms; and the availability of statistical resources to compare results among algorithms.

4.4     Operation
The evaluation of the experiment was executed on an HP Probook with a 2.6 GHz Core i5 and 8 GB of memory, running Windows 8.1.

4.4.1    Data Preprocessing
Real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can improve data quality, thereby helping to improve the accuracy and efficiency of the subsequent mining process [11].

Currently, the data are spread over two main data sources. The first is the Moodle LMS, used by the University to support E-learning teaching, which includes data showing access frequency, students' participation through the available tools, and students' level of success in the proposed activities. Meanwhile, students' personal records, such as age, sex, marital status, salary and discipline grades, are kept in the Academic Control System (ACS), which is software designed to maintain the academic control of the whole University [4].

Aiming to gather a larger data set and to work only with data relevant to the research question that we want to answer, we decided to consolidate these two data sources into a single larger data source, keeping their integrity and ensuring that only relevant information is used during the execution of the data mining algorithms.

Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the data mining process [11].

To maintain integrity and reliability between the data, a mandatory attribute with a unique value, present in both data sources, was needed. Thus, the CPF attribute (the Brazilian individual taxpayer number) was chosen to unify the two selected data sources, since it uniquely identifies each of the selected students.

In order to facilitate the execution of the algorithms and the comprehension of the results when predicting dropout at an early stage of the study, and to achieve a high accuracy rate with a minimum of false negatives (i.e., students who were not recognized as being in danger of dropping out), some attributes were transformed, as can be seen below:

   • The attributes related to discipline grades were discretized into five groups (A, B, C, D and E), depending on the grades achieved in the discipline. Students with a grade greater than or equal to 9 were allocated to group "A". Those with grades between 7 and 8.99 were allocated to group "B". Group "C" contains students with a grade between 5 and 6.99, those with grades below 5 were placed in group "D", and finally those without an associated grade were allocated to group "E".

   • Every student was labelled as approved or reproved based on the situation reported in the academic records. The final score of each discipline is composed of two tests; if the student does not obtain the minimum average, he is sent to the final reassessment and final test.

   • In the "City" attribute, some inconsistencies were found, where different values for the same city were registered in the database. For instance, the values Ouro Branco and Ouro Branco/AL refer to the same city. This problem was completely solved by applying techniques for grouping attribute values.

   • The attribute "age" had to be calculated. For this, the student's birth date, registered in the database, was taken into consideration.

When all the attributes were used, the accuracy was low. That is why we used feature selection methods to reduce the dimensionality of the student data extracted from the dataset, thereby improving the pre-processing of the data.

In order to preserve the reliability of the attributes for classification after the reduction, we use the InfoGainAttributeEval algorithm, which builds a ranking of the best attributes according to their information gain, based on the concept of entropy.

After this procedure, we reduced the set of attributes from 17 to the 13 most relevant. The refined set of attributes, in order of relevance, can be found in Table 1.

Table 1: Selected Attributes
Attributes      Description
AB1             First evaluation grade
Blog            Post count and blog views
Forum           Post count and forum views
Access          Access count in LMS
Assign          Count of files sent and viewed
City            City
Message         Count of sent messages
Wiki            Post count and wiki views
Glossary        Post count and glossary views
Civil status    Civil status
Gender          Gender
Salary          Salary
Status          Status in discipline

Considering that the main objective is to predict the student's final situation as early as possible within the given discipline, in this study we use only data up to the moment of the first test.
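The grade discretization and the entropy-based attribute ranking described in this subsection can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the Pentaho or Weka pipeline used in the study: the function names and the toy records are hypothetical, group "D" is assumed to cover grades below 5, and the information gain is computed for nominal attributes only, whereas Weka's InfoGainAttributeEval also handles numeric attributes.

```python
import math
from collections import Counter, defaultdict


def discretize_grade(grade):
    """Map a 0-10 grade to the groups A-E described in the text.

    Assumptions: group D covers grades below 5, and a missing
    grade (None) maps to group E.
    """
    if grade is None:
        return "E"
    if grade >= 9:
        return "A"
    if grade >= 7:
        return "B"
    if grade >= 5:
        return "C"
    return "D"


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(values, labels):
    """Entropy of the labels minus the entropy left after splitting on values."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(rows, attributes, target):
    """Return (attribute, gain) pairs sorted from highest to lowest gain."""
    labels = [row[target] for row in rows]
    gains = [
        (name, information_gain([row[name] for row in rows], labels))
        for name in attributes
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy records, not data from the study.
students = [
    {"AB1": 9.5, "Gender": "F", "Status": "approved"},
    {"AB1": 7.2, "Gender": "M", "Status": "approved"},
    {"AB1": 4.0, "Gender": "F", "Status": "reproved"},
    {"AB1": None, "Gender": "M", "Status": "reproved"},
]
for student in students:
    student["AB1"] = discretize_grade(student["AB1"])

ranking = rank_attributes(students, ["AB1", "Gender"], "Status")
print(ranking)
```

In this toy example the discretized grade separates the two classes completely while gender carries no information, so the grade attribute is ranked first. With so few records, an attribute with many distinct values trivially obtains a high gain, so the ranking is only meaningful on a realistically sized dataset.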
Figure 2 presents all the stages executed during the preprocessing phase in order to generate a file compatible with the mining software.

4.4.2   Algorithms Execution
The k-fold method was applied to assess the generalization capacity of the model, with k=10 (10-fold cross validation). The cross validation method consists of splitting the data set into k mutually exclusive subgroups of the same size; one subgroup is selected for testing and the remaining k-1 are used for training. The average error rate over the k test subgroups can be used as an estimate of the classifier's error rate. When Weka performs cross validation, it trains the classifier k times to calculate the average error rate and finally builds the classifier again using the whole data set as the training group. Thus, the average error rate provides a more reliable estimate of the classifier's error [12].

In order to get the best results from the algorithms without losing generalization, some parameters of the SVM algorithm were adjusted.

The first parameter adjusted was "C". This is the parameter of the soft margin cost function, which controls the influence of each individual support vector; this process involves trading error penalty for stability [3].

The default kernel used by the Weka tool is the polynomial kernel; we changed it to the Gaussian kernel and set the parameter Gamma. Gamma is the free parameter of the Gaussian radial basis function [3].

After several adjustments to the two parameters mentioned above, the values that showed the best results in terms of accuracy and lower false positive rate were C = 9.0 and Gamma = 0.06.

To compare the results of the selected algorithms, we used the Weka Experiment Environment (WEE). The WEE allows the selection of one or more algorithms available in the tool, as well as the analysis of the results, in order to identify whether one classifier is statistically better than another. In this experiment, the cross validation method, with the parameter "k=10" [5], is used to calculate the difference between the results of each algorithm and those of a chosen standard algorithm (baseline).

5.   RESULTS AND DISCUSSIONS
In this section, the results of the experiment described in Section 4 are analyzed.

The WEE tool calculated the average accuracy of each classifier. Table 2 shows the result of each algorithm's execution. The accuracy represents the percentage of test group instances that are correctly classified by the model built during the training phase. If the built model has a high accuracy, the classifier is treated as efficient and can be put into production [11].

Table 2: Accuracy and rates
Classifiers        NB      AD      SVM     RN
Accuracy (%)       85.50   86.46   92.03   90.86
True Positives     0.76    0.77    0.88    0.85
False Negatives    0.24    0.23    0.12    0.15
True Negatives     0.89    0.91    0.94    0.93
False Positives    0.11    0.09    0.06    0.07

Comparing the results of the four algorithms, we can verify that the accuracy ranges from 85.5% to 92.03%. Furthermore, a classifier with a high false positive rate is not suitable for our solution. For this reason, we considered the algorithm with the lowest false positive rate.

As we can see in Table 2, the SVM algorithm presented a low false positive rate and the best accuracy. Therefore, only this best algorithm was considered for our solution. The Naive Bayes classifier had the worst result in terms of accuracy and a high false positive rate. The other ones had an average error of 8%, so we end up with 8% of the students with dropout risk not being correctly classified.

5.1    Research Question
As can be seen in Table 2, in our experiment the SVM algorithm obtained 92% accuracy. According to Han et al. [11], if the accuracy of the classifier is considered acceptable, the classifier can be used to classify future data tuples for which the class label is not known. Thus, the results point to the viability of a model able to identify a possible student dropout early, based on failures in the initial disciplines.

5.2    Statistical Significance Comparison
We often need to compare different learning schemes on the same problem to see which is the better one to use. This is a job for a statistical device known as the t-test, or Student's t-test. A more sensitive version of the t-test, known as a paired t-test, was used [17]. Using the value of this test and the desired significance level (5%), one can say, with a certain degree of confidence (100 - significance level), whether or not the classifiers are significantly different. By applying the paired t-test to the four algorithms, performed via the Weka analysis tool, we observed that the SVM algorithm differs significantly from the others.

5.3    Threats to validity
The experiment considered data from only the Information Systems course and the Algorithm and Data Structure I discipline. However, this discipline was chosen based on its importance in the context of the Information Systems course.

6.    CONCLUSION AND FUTURE WORK
Understanding the reasons behind dropout in E-learning education and identifying which aspects can be improved is a challenge for E-learning. One factor that has been pointed out as influencing student dropout is the academic element related to students' performance in the initial disciplines of the course.

This research has addressed the dropout problem by proposing predictive models that support educational managers in identifying students who are on the verge of dropping out.
                                          Figure 2: Steps Data Preprocessing
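
As an illustration of the comparison procedure of Section 5.2, the following is a minimal sketch of a paired t-test over per-fold accuracies from 10-fold cross-validation. It is written in Python and was not produced with the Weka tool. The names `paired_t_test` and `T_CRIT_DF9` and both accuracy lists are assumptions made for the example, not values from the experiment. The statistic is compared against the two-sided critical value of Student's t for 9 degrees of freedom at the 5% significance level.

```python
import math
from statistics import mean, stdev

# Two-sided critical value of Student's t for 9 degrees of freedom
# (10 folds - 1) at the 5% significance level.
T_CRIT_DF9 = 2.262


def paired_t_test(scores_a, scores_b):
    """Return the paired t statistic for two equally long score lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both classifiers need one score per fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n))


# Hypothetical per-fold accuracies (%), invented for illustration only.
svm_acc = [92.0, 91.5, 93.0, 92.5, 91.0, 92.0, 93.5, 91.5, 92.5, 92.0]
baseline_acc = [86.0, 85.0, 87.5, 86.5, 84.5, 86.0, 87.0, 85.5, 86.0, 85.5]

t_stat = paired_t_test(svm_acc, baseline_acc)
significant = abs(t_stat) > T_CRIT_DF9
print(f"t = {t_stat:.2f}, significant at 5%: {significant}")
```

This is the textbook paired t-test. The Weka tool may apply a variance correction for cross-validated results, which this sketch does not reproduce.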

The adopted approach allowed us to perform predictions at the stage of
the initial disciplines. The preliminary results have shown that a
prediction model to identify students with dropout profiles is
feasible. These predictions can be very useful to educators, supporting
them in developing special activities for these students during the
teaching-learning process.

As immediate future work, some outstanding points should still be
addressed to improve the study: applying the same model to databases
from different institutions, with different teaching methods and
courses; including new factors related to dropout, such as
professional, vocational and family data; and tuning the algorithms’
parameters in order to achieve the best results. Furthermore, software
integrated with the LMS will be developed using the built model, to
provide this feedback to educators.

     International Journal of Machine Learning and Computing, pages
     476–481, Singapore, 2012. IACSIT Press.
[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and
     I. H. Witten. The WEKA data mining software: An update. SIGKDD
     Explor. Newsl., 11(1):10–18, Nov. 2009.
[11] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and
     Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA,
     USA, 3rd edition, 2011.
[12] S. B. Kotsiantis, C. Pierrakeas, and P. E. Pintelas. Preventing
     student dropout in distance learning using machine learning
     techniques. In V. Palade, R. J. Howlett, and L. C. Jain, editors,
     KES, volume 2774 of Lecture Notes in Computer Science, pages
     267–274. Springer, 2003.
                                                                [13] I. Lykourentzou, I. Giannoukos, V. Nikolopoulos,
