A predictive model for identifying students with dropout profiles in online courses

Marcelo A. Santana, Institute of Computing, Federal University of Alagoas, marcelo.almeida@nti.ufal.br
Evandro B. Costa, Institute of Computing, Federal University of Alagoas, evandro@ic.ufal.br
Baldoino F. S. Neto, Institute of Computing, Federal University of Alagoas, baldoino@ic.ufal.br
Italo C. L. Silva, Institute of Computing, Federal University of Alagoas, italocarlo@nti.ufal.br
Joilson B. A. Rego, Institute of Computing, Federal University of Alagoas, jotarego@gmail.com

ABSTRACT

Online education often deals with the problem of high student dropout rates during a course, in many areas. There is a huge amount of historical data about students in online courses. Hence, a relevant problem in this context is to examine those data, aiming at finding effective mechanisms to understand student profiles and to identify, at an early stage of the course, those students with characteristics that indicate dropout. In this paper, we address this problem by proposing predictive models that support educational managers in the duty of identifying students who are on the verge of dropping out. Four classification algorithms with different classification methods were used during the evaluation, in order to find the model with the highest accuracy in predicting the profile of dropout students. Data for model generation were obtained from two data sources available at the University. The results showed that the model generated with the SVM algorithm was the most accurate among those selected, with 92.03% accuracy.

Keywords

Dropout, Distance Learning, Educational Data Mining, Learning Management Systems

1. INTRODUCTION

Every year, enrollment in the E-learning modality has increased considerably. In 2013, 15,733 courses were offered in the E-learning or blended modality. Furthermore, the institutions are very optimistic: 82% of the surveyed institutions believe that enrollment will expand considerably in 2015 [1], showing the evolution of E-learning and its importance as a tool for citizens' education. Learning Management Systems (LMS) [15] can be considered one of the factors that have played an important role in the popularization of this learning modality [1].

Despite the rapid growth of online courses, there has also been rising concern over a number of problems. One issue in particular that is difficult to ignore is that these online courses also have high dropout rates. Specifically, in Brazil, in 2013, according to the latest census (Censo) published by the E-learning Brazilian Association (ABED), the average dropout rate was about 19.06% [1].

Beyond the hard task of identifying the students who are at risk of dropping out, dropout also causes considerable damage to financial and social resources. Society also loses when these resources are poorly managed, since the student fills a vacancy but gives up the course before the end.

Online education often deals with the problem of high student dropout rates during a course, in many areas. There is a huge amount of historical data about students in online courses. Hence, a relevant problem in this context is to examine those data, aiming at finding effective mechanisms to understand student profiles and to identify, at an early stage of the course, those students with characteristics that indicate dropout.

In this paper, we address this problem by proposing predictive models that support educational managers in the duty of identifying students who are on the verge of dropping out. This predictive model took into consideration academic elements related to student performance in the initial disciplines of the course. Data from the Information Systems course at the Federal University of Alagoas (UFAL), which uses a well-known LMS called Moodle, were used to build this model.

A tool to support the pre-processing phase was used in order to prepare the data for the application of Data Mining algorithms. The Pentaho Data Integration [2] tool covers extraction, transformation and load (ETL) of data, making it easier to generate files in a format compatible with the adopted data mining software, called WEKA [5].

Therefore, what was exposed above justifies the need for investment in developing efficient methods for prediction, assessment and follow-up of students with dropout risk, allowing future planning and the adoption of proactive measures aimed at reducing the stated condition.

The rest of the paper is organized as follows. Section 2 presents some related work. Section 3 presents the environment for construction of the predictive model. Afterwards, we present the experiment settings in Section 4, and in Section 5 we discuss the results of the experiment. Section 6 presents some concluding remarks and directions of future work.

2. RELATED WORK

Several studies have been conducted in order to find out the reasons for high dropout rates in online courses. Among them, Xenos et al. [18] review Open University students enrolled in a computing course. In this study, five acceptable reasons that might have caused the dropout were identified: Professional (62.1%), Academic (46%), Family (17.8%), Health Issues (9.5%) and Personal Issues (8.9%). According to Barroso and Falcão (2004) [6], the conditions that motivate dropout are classified in three groups: i) Economic: impossibility of remaining in the course because of socio-economic issues; ii) Vocational: the student does not identify with the chosen course; iii) Institutional: failure in initial disciplines, shortcomings in earlier contents, inadequacy with the learning methods.

Manhães et al. [14] present a novel architecture that uses EDM techniques to predict and identify those who are at dropout risk. The paper shows initial experimental results using real-world data about three undergraduate engineering courses of one of the largest Brazilian public universities. According to the experiments, the Naive Bayes classifier presented the highest true positive rate for all datasets used in the experiments.

A model for predicting students' performance levels is proposed by Erkan Er [9]. Three machine learning algorithms were employed: an instance-based learning classifier, Decision Tree and Naive Bayes. The overall goal of the study is to propose a method for accurate prediction of at-risk students in an online course. Specifically, data logs of an LMS called METU-Online were used to identify at-risk students and successful students at various stages during the course. The experiment was carried out in two phases: testing and training. These phases were conducted in three steps, which correspond to different stages in a semester. At each step, the number of attributes in the dataset was increased, and all attributes were included at the final stage. The important characteristic of the dataset was that it only contained time-varying attributes rather than time-invariant attributes such as gender or age. According to the author, these time-invariant data did not have a significant impact on overall results.

Dekker et al. [8] present a data mining case study demonstrating the effectiveness of several classification techniques and the cost-sensitive learning approach on a dataset from the Electrical Engineering department of Eindhoven University of Technology. Two decision tree algorithms, a Bayesian classifier, a logistic model, a rule-based learner and the Random Forest were compared. The OneR classifier was also considered as a baseline and as an indicator of the predictive power of particular attributes. The experimental results show that rather simple classifiers give a useful result, with accuracies between 75 and 80%, that is hard to beat with other more sophisticated models. The authors demonstrated that cost-sensitive learning does help to bias classification errors towards preferring false positives to false negatives. We believe that the authors could get better results by making some adjustments to the parameters of the algorithms.

Bayer et al. [7] aim to develop a method to classify students at risk of dropout throughout the course. Using personal data of students enriched with data related to social behaviours, they apply dimensionality reduction techniques and various algorithms in order to find which gives the best results, reaching accuracy rates of up to 93.51%. However, the best rates are obtained at the end of the course. Since the goal is to identify dropout early, the study would be more relevant if the best results were obtained at the beginning of the course.

In summary, several studies investigate the application of EDM techniques to predict and identify students who are at risk of dropout. However, those works share similarities: (i) they identify and compare algorithm performance in order to find the most relevant EDM techniques to solve the problem, or (ii) they identify the relevant attributes associated with the problem. Some works use past time-invariant student records (demographic and pre-university student data). This study, as a contribution beyond the works presented in this section, joins two different systems, gathering a larger number of attributes, both time-varying and time-invariant. Besides being concerned with the identification and comparison of algorithms, it identifies the attributes of greatest relevance and addresses the problem of predicting earlier which students are likely to drop out.

3. ENVIRONMENT FOR CONSTRUCTION OF PREDICTIVE MODEL

This section presents an environment for the construction of a predictive model that supports educators in the task of identifying prospective students with dropout profiles in online courses. The environment is depicted in Figure 1.

Figure 1: Environment for construction of predictive model

The proposed environment is composed of three layers: Data source, Model development and Model. The data sources are located in the first layer. Data about all students enrolled at the University are stored in two data sources. The first one contains students' personal data (for example age, gender, income and marital status) and grades from the academic control system used by the University. Information related to frequency of access, participation, use of the available tools, and grades of students in the activities proposed within the environment is kept in the second data source.

In the second layer, the pre-processing [11] activity over the data is initiated. Sequential steps are executed in this layer in order to prepare the data for the data mining process. In the original data, some information may not be represented in the format expected by the data mining algorithm, and there may be redundant data or data with some kind of noise. These problems can produce misleading results or make the algorithm execution computationally more expensive.

This layer is divided into the following stages: data extraction, data cleaning, data transformation, data selection and the choice of the algorithm that best fits the model. Each step of this layer is briefly described below.

Data extraction: The extraction phase establishes the connection with the data source and performs the extraction of the data.

Data cleaning: This routine tries to fill missing values, smooth out noise while identifying outliers, and correct data inconsistencies.

Data transformation: In this step, data are transformed and consolidated into appropriate forms for mining by performing summary or aggregation operations. Sometimes, data transformation and consolidation are performed before the data selection process, particularly in the case of data warehousing. Data reduction may also be performed to obtain a smaller representation of the original data without sacrificing its integrity.

Data selection: In this step, data relevant to the analysis task are retrieved from the database.

Choice of algorithm: The algorithm that identifies, with the best accuracy, which students have a dropout profile is considered the one that best fits the model.

Finally, the last layer is the presentation of the model. This layer post-processes the result obtained in the lower layer and presents it to the end user in a more understandable way.

4. EXPERIMENT SETTINGS

The main objective of this research is to build a predictive model for supporting educators in the hard task of identifying prospective students with dropout profiles in online courses, using Educational Data Mining (EDM) techniques [16]. This section is organized as follows: Section 4.1 describes the issue which drives our assessment. Section 4.2 shows which data were selected for the data set used in the experiment and which algorithms were chosen for data mining execution. Section 4.3 indicates the tools employed during the execution of the experiment. Finally, Section 4.4 shows every step of the experiment execution, including data consolidation, data preprocessing and algorithm execution.

4.1 Planning

The research question that we would like to answer is:

RQ. Is our predictive model able to identify students with dropout risk early?

In order to answer this question, EDM techniques with four different classification methods were used, aiming to obtain a predictive model that indicates accurately which students have a dropout profile, taking into consideration only data about the initial disciplines of a specified course.

4.2 Subject Selection

4.2.1 Data Selection

The Federal University of Alagoas offers undergraduate courses, postgraduate courses and E-learning courses. In the online courses, there are more than 1800 registered students [4].

An E-learning course is usually partitioned into semesters, with different disciplines taught along these semesters. Each semester usually has five disciplines, and each discipline lasts between five and seven weeks. Anonymous data from the Information Systems E-learning course, relative to the first semester of 2013, were selected from this environment. Data of one discipline (Algorithm and Data Structure I), chosen based on its relevance, were analysed. This discipline has about 162 students enrolled.

4.2.2 Machine Learning Algorithms Selection

In this work, four machine learning algorithms with different classification methods were used to predict student dropout. The methods used were: a simple probabilistic classifier based on the application of Bayes' theorem, a decision tree, a support vector machine and a multilayer neural network.

These techniques have been successfully applied to solve various classification problems and work in two phases: (i) training and (ii) testing. During the training phase, each technique is presented with a set of example data pairs (X, Y), where X represents the input and Y the respective output of each pair [13]. In this study, Y can take one of the values "approved" or "reproved", which corresponds to the student's situation in the discipline.

4.3 Instrumentation

The Pentaho Data Integration [2] tool was chosen to carry out all preprocessing steps on the selected data. Pentaho is open-source software, developed in Java, which covers extraction, transformation and load of data [2], making it easier to create a model able to: (i) extract information from data sources, (ii) select attributes, (iii) discretize data and (iv) generate a file in a format compatible with the data mining software.

For the execution of the selected classification algorithms (see Section 4.2.2), the data mining tool Weka was selected. These algorithms are implemented in Weka as NaiveBayes (NB), J48 (AD), SMO (SVM) and MultilayerPerceptron (RN) [17], respectively. Weka is open-source software that contains a collection of machine learning algorithms for data mining tasks [5].

Some features were taken into consideration for the adoption of Weka [10], such as: ease of acquisition, since it can be downloaded directly from the developer page at no cost; support for a wide set of algorithms used in data mining; and availability of statistical resources to compare results among algorithms.

4.4 Operation

The evaluation of the experiment was executed on an HP ProBook with a 2.6 GHz Core i5 processor and 8 GB of memory, running Windows 8.1.

4.4.1 Data Preprocessing

Real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can improve data quality, thereby helping to improve the accuracy and efficiency of the subsequent mining process [11].

Currently, the data is spread over two main data sources. The first is the Moodle LMS, used by the University to support E-learning teaching, which includes data on access frequency, student participation using the available tools, and the student's success level in the proposed activities. Meanwhile, the student's personal records, such as age, sex, marital status, salary and discipline grades, are kept in the Academic Control System (ACS), which is software designed to keep the academic control of the whole University [4].

Aiming to gather a larger data set and to work only with data relevant to the research question that we want to answer, we decided to consolidate these two data sources into a single major data source, keeping their integrity and ensuring that only relevant information is used during the execution of the data mining algorithms.

Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set. This can help improve the accuracy and speed of the data mining process [11].

To maintain integrity and reliability between the data, a mandatory attribute with a unique value, present in both data sources, was chosen. Thus, the CPF attribute (the Brazilian individual taxpayer registry number) was chosen to unify the two selected data sources, since it uniquely identifies each of the selected students.

In order to facilitate algorithm execution and the comprehension of results, to predict dropout at an early stage of the study, and to achieve high accuracy with a minimum of false negatives (i.e. students that have not been recognized to be in danger of dropout), some attributes were transformed, as can be seen below:

• The attributes related to discipline grades were discretized into five groups (A, B, C, D and E), depending on the grade achieved in the discipline. Students with a grade greater than or equal to 9 were allocated to group "A". Those with grades between 7 and 8.99 were allocated to group "B". The "C" students are those with a grade between 5 and 6.99, those with grades below 5 stayed in group "D", and finally those without an associated grade were allocated to group "E".

• Every student was labelled as approved or reproved based on the situation reported in the academic records. The final score of each discipline is composed of two tests; if the student does not obtain the minimum average, he is directed to the final reassessment and final test.

• In the "City" attribute, some inconsistencies were found, where different values for the same city were registered in the database. For instance, the instances Ouro Branco and Ouro Branco/AL refer to the same city. This problem was completely solved by applying techniques for grouping attributes.

• The attribute "age" had to be calculated. For this, the student's birth date, registered in the database, was taken into consideration.

When all the attributes were used, the accuracy was low. That is why we used feature selection methods to reduce the dimensionality of the student data extracted from the dataset, improving the pre-processing of the data.

In order to preserve the reliability of the attributes for classification after the reduction, we used the InfoGainAttributeEval algorithm, which builds a ranking of the best attributes considering the information gain, based on the concept of entropy.

After this procedure, we reduced the set of attributes from 17 to the 13 most relevant. The refined set of attributes, in order of relevance, can be found in Table 1.

Table 1: Selected Attributes

Attribute      Description
AB1            First evaluation grade
Blog           Post count and blog views
Forum          Post count and forum views
Access         Access count in LMS
Assign         Count of files sent and viewed
City           City
Message        Count of sent messages
Wiki           Post count and wiki views
Glossary       Post count and glossary views
Civil status   Civil status
Gender         Gender
Salary         Salary
Status         Status on discipline

Taking into consideration that the main objective is to predict the student's final situation as early as possible within the given discipline, in this study we only use data up to the moment of the first test.

Figure 2 presents all the stages executed during the preprocessing phase in order to generate a file compatible with the mining software.

Figure 2: Data preprocessing steps

4.4.2 Algorithms Execution

The k-fold method was applied to assess the generalization capacity of the model, with k=10 (10-fold cross validation). The cross validation method consists in splitting the data set into k mutually exclusive subgroups of the same size; one subgroup is selected for testing and the remaining k-1 are used for training. The average error rate over the k runs can be used as an estimate of the classifier's error rate. When Weka performs cross validation, it trains the classifier k times to calculate the average error rate and finally rebuilds the classifier using the whole data set as the training set. Thus, the average error rate provides a more reliable estimate of the classifier's accuracy [12].

In order to get the best results from the algorithms without losing generalization, some parameters of the SVM algorithm were adjusted.

The first parameter adjusted was "C". This parameter belongs to the soft margin cost function, which controls the influence of each individual support vector; this process involves trading error penalty for stability [3].

The default kernel used by the Weka tool is the polynomial kernel; we changed it to the Gaussian kernel and set the parameter Gamma. Gamma is the free parameter of the Gaussian radial basis function [3].

After several adjustments to the values of the two parameters mentioned above, the values that showed the best results in terms of accuracy and lower false positive rate were C = 9.0 and Gamma = 0.06.

For the comparison of results among the selected algorithms, we used the Weka Experiment Environment (WEE). The WEE allows the selection of one or more algorithms available in the tool, as well as the analysis of the results, in order to identify whether a classifier is statistically better than another. In this experiment, the cross validation method with the parameter "k=10" [5] is used in order to calculate the difference between the results of each algorithm and those of a chosen standard algorithm (baseline).

5. RESULTS AND DISCUSSIONS

In this section, the results of the experiment described in Section 4 are analyzed.

The WEE tool calculated the average accuracy of each classifier. Table 2 shows the result of each algorithm's execution. The accuracy represents the percentage of the test set instances that are correctly classified by the model built during the training phase. If the built model has a high accuracy, the classifier is treated as efficient and can be put into production [11].

Table 2: Accuracy and rates

Classifiers       NB      AD      SVM     RN
Accuracy (%)      85.50   86.46   92.03   90.86
True Positives    0.76    0.77    0.88    0.85
False Negatives   0.24    0.23    0.12    0.15
True Negatives    0.89    0.91    0.94    0.93
False Positives   0.11    0.09    0.06    0.07

Comparing the results among the four algorithms, we can verify that the accuracy ranges from 85.50% to 92.03%. Furthermore, a classifier which has a high false positive rate is not suitable for our solution. In this case, we have considered the algorithm which has the lowest false positive rate.

As we can see in Table 2, the SVM algorithm presented a low false positive rate and the best accuracy. Therefore, only the best algorithm was considered for our solution. The Naive Bayes classifier had the worst result in terms of accuracy and a high false positive rate. The other ones had an average error of 8%, and so we end up with 8% of the students with dropout risk not correctly classified.

5.1 Research Question

As can be seen in Table 2, in our experiment the SVM algorithm obtained 92% accuracy. According to Han et al. [11], if the accuracy of the classifier is considered acceptable, the classifier can be used to classify future data tuples for which the class label is not known. Thus, the results point to the viability of a model able to identify a possible student dropout early, based on failures in the initial disciplines.

5.2 Statistical Significance Comparison

We often need to compare different learning schemes on the same problem to see which is the better one to use. This is a job for a statistical device known as the t-test, or Student's t-test. A more sensitive version of the t-test, known as a paired t-test, was used [17]. Using the resulting t value and the desired significance level (5%), one can say, with a certain degree of confidence (100% minus the significance level), whether or not these classifiers are significantly different. By applying the paired t-test to the four algorithms, performed via the Weka analysis tool, we observed that the SVM algorithm is significantly better than the others.

5.3 Threats to validity

The experiment considered data from a single course (Information Systems) and a single discipline (Algorithm and Data Structure I). However, this discipline was chosen based on its importance in the context of the Information Systems course.

6. CONCLUSION AND FUTURE WORK

Understanding the reasons behind dropout in E-learning education and identifying which aspects can be improved is a challenge for E-learning. One factor which has been pointed out as an influence on student dropout is the academic element related to performance in the initial disciplines of the course.

This research has addressed the dropout problem by proposing predictive models that support educational managers in the duty of identifying students who are on the verge of dropping out. The adopted approach allowed us to perform predictions at an initial phase of the discipline. The preliminary results have shown that a prediction model to identify students with dropout profiles is feasible. These predictions can be very useful to educators, supporting them in developing special activities for these students during the teaching-learning process.

As immediate future work, some outstanding points should still be addressed to improve the study: applying the same model to databases of different institutions, with different teaching methods and courses; including new factors related to dropout, such as professional, vocational and family data; and adjusting the parameters of the algorithms in order to achieve the best results. Furthermore, software integrated with the LMS, to provide this feedback to educators, will be developed using the built model.

7. REFERENCES

[1] Abed - E-learning Brazilian Association. http://www.abed.org.br/. Accessed December 2014.
[2] Pentaho - Pentaho Data Integration. http://www.pentaho.com/. Accessed January 2015.
[3] SVM - Support Vector Machines (SVMs). http://www.svms.org/parameters/. Accessed December 2014.
[4] UFAL - Federal University of Alagoas. http://www.ufal.edu.br/. Accessed January 2015.
[5] Weka - the University of Waikato. http://www.cs.waikato.ac.nz/ml/weka/. Accessed January 2015.
[6] M. F. Barroso and E. B. Falcao. University dropout: the case of UFRJ physics institute. IX National Meeting of Research in Physics Teaching, 2004.
[7] J. Bayer, H. Bydzovská, J. Géryk, T. Obsívac, and L. Popelínsky. Predicting drop-out from social behaviour of students. In A. H. M. Y. Kalina Yacef, Osmar Zaiane and J. Stamper, editors, Proceedings of the 5th International Conference on Educational Data Mining - EDM 2012, pages 103-109, Greece, 2012.
[8] G. Dekker, M. Pechenizkiy, and J. Vleeshouwers. Predicting students drop out: A case study. In T. Barnes, M. C. Desmarais, C. Romero, and S. Ventura, editors, EDM, pages 41-50, 2009.
[9] E. Er. Identifying at-risk students using machine learning techniques: A case study with IS 100. In International Journal of Machine Learning and Computing, pages 476-481, Singapore, 2012. IACSIT Press.
[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11(1):10-18, Nov. 2009.
[11] J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.
[12] S. B. Kotsiantis, C. Pierrakeas, and P. E. Pintelas. Preventing student dropout in distance learning using machine learning techniques. In V. Palade, R. J. Howlett, and L. C. Jain, editors, KES, volume 2774 of Lecture Notes in Computer Science, pages 267-274. Springer, 2003.
[13] I. Lykourentzou, I. Giannoukos, V. Nikolopoulos, G. Mpardis, and V. Loumos. Dropout prediction in e-learning courses through the combination of machine learning techniques. Comput. Educ., 53(3):950-965, Nov. 2009.
[14] L. M. B. Manhães, S. M. S. da Cruz, and G. Zimbrão. Wave: An architecture for predicting dropout in undergraduate courses using EDM. In Proceedings of the 29th Annual ACM Symposium on Applied Computing, SAC '14, pages 243-247, New York, NY, USA, 2014. ACM.
[15] M. Pretorius and J. van Biljon. Learning management systems: ICT skills, usability and learnability. Interactive Technology and Smart Education, 7(1):30-43, 2010.
[16] C. Romero and S. Ventura. Educational data mining: A review of the state of the art. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 40(6):601-618, Nov 2010.
[17] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3rd edition, 2011.
[18] M. Xenos, C. Pierrakeas, and P. Pintelas. A survey on student dropout rates and dropout causes concerning the students in the course of informatics of the Hellenic Open University. Computers & Education, 39(4):361-377, 2002.
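
The grade discretization described in Section 4.4.1 can be illustrated with a short sketch. It is not the Pentaho transformation used in the experiment; the function name discretize_grade is hypothetical, and the sketch assumes that group "D" covers every grade below 5 and that a missing grade is represented by None.

```python
from typing import Optional


def discretize_grade(grade: Optional[float]) -> str:
    """Map a numeric grade to one of the five groups A-E.

    Assumed thresholds (taken from the description in Section 4.4.1):
    A: grade >= 9, B: 7 <= grade < 9, C: 5 <= grade < 7,
    D: grade < 5, E: no grade recorded.
    """
    if grade is None:
        return "E"
    if grade >= 9:
        return "A"
    if grade >= 7:
        return "B"
    if grade >= 5:
        return "C"
    return "D"


# Illustrative grades only; these are not values from the study's data set.
for sample in (9.5, 7.0, 6.99, 4.0, None):
    print(sample, "->", discretize_grade(sample))
```

Using half-open intervals avoids gaps between groups, for example for a grade such as 6.995 that falls between the two-decimal bounds quoted in the text.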
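
The attribute ranking by information gain mentioned in Section 4.4.1 can be sketched as follows. This is a minimal illustration of the entropy-based measure for nominal attributes, not Weka's InfoGainAttributeEval implementation; the function names and the four-row toy table are hypothetical and serve only to show the calculation.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels):
    """H(class) - H(class | attribute) for one nominal attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_attributes(table, labels):
    """Return (attribute, gain) pairs sorted from most to least informative."""
    gains = {name: information_gain(col, labels) for name, col in table.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration; not taken from the study.
toy_labels = ["approved", "approved", "reproved", "reproved"]
toy_table = {
    "AB1": ["A", "A", "D", "D"],     # separates the classes perfectly
    "Gender": ["F", "M", "F", "M"],  # carries no information about the class
}
ranking = rank_attributes(toy_table, toy_labels)
print(ranking)
```

An attribute that splits the classes perfectly receives a gain equal to the class entropy, while an attribute independent of the class receives a gain of zero, which is why low-gain attributes are the natural candidates for removal.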
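
The 10-fold cross validation described in Section 4.4.2 can be sketched as a partition of instance indices. This is a simplified, unstratified illustration and does not reproduce Weka's own procedure; the function name k_fold_splits and the fixed seed are assumptions made for the example, and the instance count of 162 is the approximate enrollment reported in Section 4.2.1.

```python
import random


def k_fold_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    The indices are shuffled once and dealt into k mutually exclusive folds
    whose sizes differ by at most one. Each fold is used exactly once as the
    test set while the remaining k-1 folds form the training set.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(162, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} training, {len(test)} test instances")
```

Because 162 is not divisible by 10, the folds cannot all have exactly the same size; the sketch keeps them within one instance of each other.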