An early warning dropout model in higher
     education degree programs: A case study in
                      Ecuador

  Vanessa Heredia-Jimenez (1), Alberto Jimenez (1), Margarita Ortiz-Rojas (1),
  Jon Imaz Marín (2), Pedro Manuel Moreno-Marcos (2), Pedro J. Muñoz-Merino (2),
  and Carlos Delgado Kloos (2)

  (1) Escuela Superior Politecnica del Litoral, ESPOL, Information Technology Center,
      Guayaquil, Ecuador
      {estefania.heredia,alberto.jimenez,margarita.ortiz}@cti.espol.edu.ec
  (2) Universidad Carlos III de Madrid, Madrid, España
      jimaz@pa.uc3m.es,{pemoreno,pedmume,cdk}@it.uc3m.es


        Abstract. Worldwide, a significant concern of universities is reducing
        the academic dropout rate. Several initiatives have been launched to
        address this problem; however, it is essential to recognize at-risk
        students as soon as possible. In this paper, we propose a new predictive
        model that can identify, at the earliest possible moment, students at
        risk of dropping out in any semester of any undergraduate program.
        Unlike most available models, our solution is based on academic
        information alone, and our evidence suggests that by ignoring
        socio-demographic and pre-college entry information, we obtain more
        reliable predictions, even when a student has finished only one
        academic semester. Therefore, our prediction can be used as part of an
        academic counseling tool that provides the performance factors that
        could influence a student to leave the institution. With this tool,
        counselors can identify those students, make better decisions to guide
        them and, ultimately, minimize dropout in the institution. As a case
        study, we used the data of students in all undergraduate programs from
        2000 until 2019 at a public higher education university in Ecuador.

        Keywords: Data mining · Dropout prediction · Early detection · Algo-
        rithm · Learning analytics · Higher education


1     Introduction
Worldwide, one main concern in all Higher Education Institutions (HEI) is stu-
dent dropout [12, 19]. For instance, according to the Organization for Economic
Cooperation and Development (OECD), the dropout rate is about 24% [18].
    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).

In the specific case of Latin America, the World Bank reports a rate similar to
that of the OECD, with a 22% dropout rate [14]. Proposals to address this
situation include, among others, offering students an introductory orientation
phase [25], offering financial aid or scholarships [11], and using ICT in
university tutorial systems [8]. Nonetheless, the overarching strategies and
funding programs for academic success do not sufficiently reduce academic
dropout. It is vital to recognize at-risk students before they drop out.
    The available data about students’ academic performance creates an
opportunity to identify which factors affect student dropout and to take
preventive measures. For this reason, several studies have applied prediction
algorithms to this problem [6,21]. In the case of Latin America, there are not
many studies in this area, and those that exist are confined to: a) identifying
attrition through descriptive statistics [17], b) studying only the first years
of study [5], c) focusing on a specific undergraduate program [15] or course
[13], and d) mixing different types of variables, such as academic and
socio-demographic data, which results in over-fitting or biased results due to
the large number of variables [23]. Moreover, these studies do not specify the
input characteristics, so the models cannot be replicated. This study addresses
the problems above by proposing a predictive model that considers the academic
history of students throughout their undergraduate programs, with a strong
emphasis on analyzing which input variables have the most substantial influence
on predicting dropout. A case study is presented in a Latin American institution
of higher education where the model was tested and scaled to all undergraduate
programs. This model will serve as input for the Early Warning Systems (EWS)
dashboards to be developed in the Learning Analytics area.
    The rest of the paper is organized as follows. Section 2 provides a brief
overview of the relevant literature. Section 3 describes the data used in our
predictive model and summarises our work. Section 4 shows the experimental
results from the study. The discussion about our work is presented in Section
5. Finally, Section 6 draws conclusions from this analysis, and suggests possible
directions for future work in this area.


2    Related Work

When it comes to input variables to predict dropout, there are different models
and input features found in the literature, some focused on analyzing a single
type of data, while others mix different groups of data.
    Regarding a single type of data, for example, in the study [22], the authors
used psychosocial variables with a statistical model on 690 students, and the
model explained between 11.8% and 22% of the variance. The study in [24] focused
on analyzing academic data, such as the first partial grade and the accumulated
percentage of non-attendance, of 171 students. Unfortunately, the prediction
rate is not mentioned. The study in [16] combined intelligence, personality, and
motivational predictors in 137 students, and the results explained 33% of the
variance in GPA and 30% of the variance in time to graduation.
    In terms of mixing different groups of data, in [4], the authors used
demographic data, pre-college entry information, and complete transcript records
of 69,116 students, generating a total of more than 700 features, and obtained
an accuracy of 66%. In [3], the authors used the Cox proportional hazards model
with different groups of variables, such as demography, family history,
financial information, high school, college enrollment and semester credits of
11,121 students. The precision of their model ranged between 71% and 82%. In
[15], the attributes described the behavior of 498 students in the first part of
their undergraduate program, ignoring classical attributes. The precision of
their models ranged between 74% and 78%. Another predictive model, as well as an
EWS, is presented in [5]; it uses only academic performance and data gathered
from the LMS, and its precision ranges from 79% to 92%.
    While all of the above models are accurate in their contexts, they are not
scalable because they use small samples, use only the first academic years, or
focus on specific fields or undergraduate programs. Moreover, a mix of different
variables, such as psychological, demographic and academic ones, could cause
biased results or less accurate predictions. Besides, certain models cannot be
replicated because the variables are not specified but only mentioned in a
general way. Therefore, a model that is scalable and replicable is needed to
understand the student’s path from beginning to end and to consider variables
that have an impact on learning performance.


3     Case Study

The study took place at Escuela Superior Politécnica del Litoral, ESPOL, a
public engineering-oriented university located in Ecuador. The model was
implemented in Spyder (footnote 3), an open-source platform, using Python 3.7 as
the programming language and the Scikit-learn library, an open-source library
that implements many machine learning algorithms. The proposed model and each of
its phases are described below.


3.1    Data set

The information was provided by the Information Technology and Systems
Management (footnote 4). For the model, we considered the bachelor’s degree in
Industry Engineering, which had 753 students enrolled from 2000 until the first
semester of 2019. This is one of the undergraduate programs with the largest
number of students enrolled during this period. The years selected were used as
boundaries to allow five years of full studies from the time the students began
their first academic semester until obtaining a professional degree.
3 https://www.spyder-ide.org/
4 https://www.serviciosti.espol.edu.ec/

3.2   Pre-processing

This phase focused on the information acquired, with the purpose of generating
consistent data that could be used for the algorithm. The data was categorized
as follows: socio-demographic data of the student (gender, marital status, em-
ployment status, city of residence, among others), data about the undergraduate
program (number of total credits, code of the course, among others) and data on
the student’s performance (courses taken, state of the course, number of times
the course has been taken, GPA, credits of the course, among others). During
the pre-processing phase, cleaning, transformation and integration procedures
were performed due to the origin of the information.
    A recognition process was carried out to identify the academic terms
properly, because the length of the study period varied depending on the term.
The periods were the first and second terms, also called ordinary periods, and
the third term, also called the extraordinary period. For the predictive model,
only ordinary periods were selected, as they had the same duration. Subsequently,
we enumerated the semesters in which each student had been studying, together
with the respective academic record, in order to run the analysis per semester.
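
The term filtering and semester enumeration described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors’ implementation: the record layout (student_id, year, term) and the term codes 1, 2 and 3 are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical term codes: 1 and 2 are ordinary periods, 3 is the extraordinary one.
ORDINARY_TERMS = {1, 2}

def enumerate_semesters(records):
    """Map each (student_id, year, term) of an ordinary period to a semester number.

    Semesters are numbered 1, 2, 3, ... per student in chronological order,
    counting only the ordinary terms in which the student has records.
    """
    terms_by_student = defaultdict(set)
    for student_id, year, term in records:
        if term in ORDINARY_TERMS:  # extraordinary periods are left out
            terms_by_student[student_id].add((year, term))

    numbering = {}
    for student_id, terms in terms_by_student.items():
        for number, (year, term) in enumerate(sorted(terms), start=1):
            numbering[(student_id, year, term)] = number
    return numbering

# Toy enrollment records (invented data).
records = [
    ("s1", 2000, 1), ("s1", 2000, 2), ("s1", 2000, 3), ("s1", 2001, 1),
    ("s2", 2001, 2), ("s2", 2003, 1),
]
semesters = enumerate_semesters(records)
```

In this sketch, a student who skips terms (such as "s2") still gets consecutive semester numbers, which is one possible reading of "the semesters in which students had been studying".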
    After the pre-processing, the most representative academic performance
variables were calculated and used as inputs for the model; they are detailed in
Table 1. Section 4.1 explains why the model uses academic information alone and
discards socio-demographic information. Other indicators, such as undergraduate
program performance indicators per semester, were calculated in order to obtain
additional information against which to compare the prediction results.


                   Table 1. Learning indicators for the model

            ID    Description
            V1    Average number of subjects taken per semester during
                  the years of study
            V2    Number of times of second enrollment in a subject
                  after failing the first time
            V3    Number of times of third enrollment in a subject
                  after failing the second time
            V4    The period in years that the student takes to return
                  to study
            V5    GPA average of all subjects
            V6    GPA average of subjects taken in the current
                  undergraduate program
            V7    GPA average of subjects with number of credits greater
                  than zero, penalized for the number of times
                  the subjects are taken by the student
            V8    Ratio of the approved subjects taken by the student
            V9    Ratio of the failed subjects taken by the student
            V10   Ratio of the canceled subjects taken by the student
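
To make the indicators in Table 1 more concrete, the following sketch computes a subset of them (V1, V2, V3, V5, V8, V9 and V10) for a single student. It is an illustration under stated assumptions, not the authors’ code: the `CourseRecord` fields, the status labels and the grade values are hypothetical, and canceled subjects are assumed to carry no grade. V4, V6 and V7 are omitted because the paper does not give enough detail (for example, the penalization formula of V7) to reproduce them.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CourseRecord:
    semester: int            # semester number within the student's history
    course: str              # course code
    status: str              # "approved", "failed" or "canceled" (assumed labels)
    grade: Optional[float]   # final grade; None when the subject was canceled
    attempt: int             # 1 = first enrollment, 2 = second, 3 = third

def indicators(records):
    """Compute a subset of the Table 1 indicators for one student.

    Assumes at least one record and at least one graded record.
    """
    total = len(records)
    n_semesters = len({r.semester for r in records})
    graded = [r.grade for r in records if r.grade is not None]

    def ratio(status):
        return sum(r.status == status for r in records) / total

    return {
        "V1": total / n_semesters,                  # average subjects per semester
        "V2": sum(r.attempt == 2 for r in records), # second enrollments
        "V3": sum(r.attempt == 3 for r in records), # third enrollments
        "V5": sum(graded) / len(graded),            # grade average of all subjects
        "V8": ratio("approved"),
        "V9": ratio("failed"),
        "V10": ratio("canceled"),
    }

# Invented history of one student.
history = [
    CourseRecord(1, "MATH1", "failed", 4.0, 1),
    CourseRecord(1, "PHYS1", "approved", 8.0, 1),
    CourseRecord(2, "MATH1", "approved", 6.0, 2),
    CourseRecord(2, "CHEM1", "canceled", None, 1),
]
v = indicators(history)
```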

3.3    Desertion alert algorithm solution

For this preliminary work, we define academic dropout as the situation in which
a student has not been enrolled in any course for a given period since his/her
last registration. Based on ESPOL’s internal guidelines and regulations
(footnote 5), this period is five years. Students in this situation are
assigned a dropout value equal to one. On the other hand, students who
graduated are assigned a dropout value equal to zero. Students who have
completed 90% of their undergraduate degree are also assigned a dropout value
equal to zero. This percentage is based on other studies [2], which indicate
that people at this advanced level abandon their undergraduate degree for
reasons outside their academic performance. Thus, dropout is treated as a
single, binary outcome feature. Since this is a classification problem, we used
the Random Forest Classifier (RFC) algorithm [7], a well-known tree-based
learning algorithm noted for its low tendency to overfit and its high accuracy.
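
The labeling rule above can be written as a small function. This is a sketch under assumptions rather than the authors’ implementation: the function name and arguments are hypothetical, inactivity is measured in whole years, and the boundaries (at least five years, at least 90%) are treated as inclusive, which the text does not specify.

```python
from typing import Optional

INACTIVITY_YEARS = 5          # period stated in the text
COMPLETION_THRESHOLD = 0.90   # share of the degree completed, as stated in the text

def dropout_label(graduated: bool, completion: float,
                  last_enrollment_year: int, current_year: int) -> Optional[int]:
    """Return 1 (dropout), 0 (non-dropout) or None (outcome not yet defined)."""
    # Graduates and students with 90% of the degree completed are non-dropouts.
    if graduated or completion >= COMPLETION_THRESHOLD:
        return 0
    # Assumption: "five years" is counted as a difference of calendar years.
    if current_year - last_enrollment_year >= INACTIVITY_YEARS:
        return 1
    # Still enrolled, or inactive for less than five years: no label yet.
    return None

labels = [
    dropout_label(True, 1.00, 2010, 2019),   # graduated
    dropout_label(False, 0.40, 2012, 2019),  # inactive for seven years
    dropout_label(False, 0.95, 2012, 2019),  # inactive, but 95% completed
    dropout_label(False, 0.40, 2018, 2019),  # recently enrolled
]
```

Students for whom the function returns `None` are the ones whose outcome is still open, so they are the natural candidates for prediction.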
     After pre-processing, the process started by assigning each student a
dropout value of one or zero, according to the definition of academic dropout
given above. Subsequently, the semester selection was made cumulatively: first,
the academic history of all students in their first semester was selected and
entered into the model; then, the history of their first and second semesters;
and so on, until the fifth semester, which represents two and a half years of
studies. After the fifth semester, all students who had more than five semesters
in their undergraduate programs were evaluated together. In Ecuador, according
to [9], student dropout occurs on average between the second and third year of
studies.
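
One possible reading of the cumulative semester selection is sketched below. The text leaves some details open, so the interpretation is an assumption: window k holds every record from semesters 1 to k, and the final window holds the full history of students with more than five semesters. The record layout (dicts with "student_id" and "semester") and the function name are hypothetical.

```python
def cumulative_windows(records, max_semester=5):
    """Yield (label, subset) pairs for the cumulative semester selection.

    For k = 1..max_semester the subset holds the records of semesters 1..k.
    The last window holds the full history of the students who have more
    than `max_semester` semesters.
    """
    records = list(records)
    for k in range(1, max_semester + 1):
        yield str(k), [r for r in records if r["semester"] <= k]

    # Find the highest semester number reached by each student.
    last_semester = {}
    for r in records:
        sid = r["student_id"]
        last_semester[sid] = max(last_semester.get(sid, 0), r["semester"])
    long_running = {sid for sid, s in last_semester.items() if s > max_semester}
    yield f">{max_semester}", [r for r in records if r["student_id"] in long_running]

# Invented data: student "a" has seven semesters, student "b" has two.
records = (
    [{"student_id": "a", "semester": s} for s in range(1, 8)]
    + [{"student_id": "b", "semester": s} for s in range(1, 3)]
)
windows = dict(cumulative_windows(records))
```

Each window would then be turned into one row of indicators per student before training, so the same student can contribute to several windows with progressively longer histories.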
     Every group of semesters was used to calculate the academic performance
indicators and, consequently, to define the data sets used for training and
testing. The training data consisted of the students whose dropout value was
already defined, and the algorithm was trained on them; the testing data
consisted of the students whose dropout value was not yet defined.
     The model outputs a probability for each student: high values mean that
the student is likely to drop out of the undergraduate program, while low
values indicate a high probability of finishing the degree. Fig. 1 depicts the
steps that were carried out to produce meaningful information.
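
The training and scoring step can be sketched with Scikit-learn as follows. This is a hedged illustration, not the authors’ pipeline: the data are synthetic, the toy labeling rule is invented, and the fixed random seed is an assumption. The hyperparameter values are the ones reported in Section 4.1; mapping them to the `RandomForestClassifier` arguments shown here is our reading of that description.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the indicator matrix (10 columns, like V1..V10).
X_labeled = rng.random((200, 10))
# Toy rule: one column plays the role of a "ratio of failed subjects".
y_labeled = (X_labeled[:, 8] > 0.5).astype(int)
# Students whose dropout value is not yet defined.
X_unlabeled = rng.random((20, 10))

model = RandomForestClassifier(
    n_estimators=300,      # number of trees
    max_depth=6,           # depth of trees
    min_samples_split=6,   # minimum number of samples required to split
    min_samples_leaf=7,    # minimum number of samples at a leaf node
    random_state=0,        # assumption: fixed seed for reproducibility
)
model.fit(X_labeled, y_labeled)

# The column order of predict_proba follows model.classes_,
# so the dropout class (1) is looked up explicitly.
dropout_column = list(model.classes_).index(1)
dropout_probability = model.predict_proba(X_unlabeled)[:, dropout_column]
```

Values of `dropout_probability` close to one correspond to students the model considers likely to drop out; a counseling tool would apply its own threshold to decide whom to flag.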


4     Results

4.1    One degree

                             Fig. 1. Predictive model

5 http://reglamentos.espol.edu.ec/

For the data from the bachelor’s degree in Industry Engineering, we computed a
correlation matrix to analyze how the different variables behaved within the
proposed model. As a consequence, the socio-demographic variables, such as
gender, marital status, school type, and work status, were discarded: these
variables had a correlation coefficient lower than 0.1 and were therefore not
included in the model. Additionally, we used the RFC with the following
parameters: number of trees equal to 300, depth of trees equal to 6, minimum
number of samples required to split equal to 6, and minimum number of samples
required to be at a leaf node equal to 7. These parameters were chosen based on
the accuracy obtained when the model was applied to our testing set, which was
more than 95%.
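
The correlation-based filtering described above can be illustrated with a small example. It is a sketch under assumptions: the paper does not state which correlation coefficient was used or against which variable, so the example assumes the Pearson coefficient between each candidate variable and the dropout label, and the column names and values are invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def select_features(columns, target, threshold=0.1):
    """Keep the columns whose absolute correlation with the target reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Invented data for eight students.
dropout = [1, 1, 0, 0, 1, 0, 1, 0]
columns = {
    "failed_ratio": [0.8, 0.6, 0.1, 0.2, 0.7, 0.0, 0.9, 0.3],  # follows the label
    "gender":       [1, 0, 1, 0, 1, 0, 0, 1],                  # unrelated to the label
}
kept = select_features(columns, dropout)
```

With these toy values, the academic indicator passes the 0.1 threshold while the socio-demographic column does not, which mirrors the selection reported in the text.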
    In order to evaluate the effectiveness of the proposed algorithm, accuracy,
area under the curve (AUC), F1-score, recall and precision were used as
evaluation criteria. Accuracy is the proportion of all predictions that are
correct. Precision is the proportion of learners predicted as dropouts by the
classifier who are real dropouts. Recall is the proportion of real dropout
learners who are correctly predicted by the classifier. F1-score is the harmonic
mean of precision and recall [10]. Table 2 presents the results of the
evaluation criteria per semester of the predictive model based on the variables
described in Section 3.2, for the undergraduate program selected.
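
The evaluation criteria defined above can be computed from the four cells of a confusion matrix. The sketch below uses plain Python and invented labels; AUC is left out because it requires predicted scores rather than hard predictions. The helper names are our own.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def evaluate(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive (dropout = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # dropouts correctly flagged
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-dropouts flagged by mistake
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # dropouts that were missed
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-dropouts correctly passed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }

# Invented labels for eight students.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = evaluate(y_true, y_pred)

# Consistency check with a reported row: precision 0.984 and recall 0.915.
f1_semester_2 = round(f1_from(0.984, 0.915), 3)
```

The last line reproduces the F1 value reported for the second semester in Table 2 from its precision and recall, which confirms that the table uses the harmonic-mean definition.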

                     Table 2. Evaluation criteria per semester

               Semester AUC Accuracy F1 Recall Precision
                  1     0.997 99%   0.928 0.928  0.928
                  2     0.995 96%   0.948 0.915  0.984
                  3     0.994 96%   0.925 0.911  0.939
                  4     0.972 91%   0.904 0.863   0.95
                  5     0.996 97%   0.937 0.937  0.937
                 >5     0.996 98%   0.966 0.96   0.972


4.2   Scalability
We decided to scale the model to all undergraduate programs taught at ESPOL.
The data selected, from 2000 until the first semester of 2019, comprise a total
of 29,983 students. Because of the long time span of the selected data, the
denomination codes of some undergraduate programs changed. Nevertheless, the
essential structure of the undergraduate programs was preserved. For the
purpose of scalability, the undergraduate program codes were unified to avoid
loss of information.
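
The unification of program codes can be pictured as a lookup from retired codes to a single canonical code, applied before the records are grouped by program. The sketch below is only an illustration: every code in it is a hypothetical placeholder, since the paper does not list the actual code changes.

```python
from collections import defaultdict

# Hypothetical mapping from retired program codes to their canonical code.
CODE_ALIASES = {
    "IND-2000": "IND",
    "IND-2012": "IND",
    "MEC-2005": "MEC",
}

def unify_program_code(code: str) -> str:
    """Return the canonical program code; unknown codes are only normalized."""
    normalized = code.strip().upper()
    return CODE_ALIASES.get(normalized, normalized)

def group_by_program(records):
    """Group student records by unified program code."""
    groups = defaultdict(list)
    for record in records:
        groups[unify_program_code(record["program"])].append(record)
    return dict(groups)

# Invented records: the first two students belong to the same program
# under different historical codes.
records = [
    {"student_id": "s1", "program": "IND-2000"},
    {"student_id": "s2", "program": "ind-2012 "},
    {"student_id": "s3", "program": "MEC"},
]
groups = group_by_program(records)
```

Grouping on the unified code keeps the history of a program in one data set, so students enrolled under an older code are not lost when the model is trained degree by degree.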
    A total of 65 undergraduate programs were processed by the model. The
model predicted dropout degree by degree and, as explained before, every
group of semesters was used to calculate the academic performance indicators
and, consequently, to define the data sets used for training and testing. In
this way, in parallel to testing our proposal, we showed that socio-demographic
data are unnecessary for dropout prediction, and we eliminated typical biases
from the learning process. A summary of the five evaluation criteria of the
model for the 10 undergraduate programs with the largest number of students
enrolled is shown in Table 3.

Table 3. Evaluation criteria for the undergraduate programs with the largest
number of students enrolled between 2000 and 2019

               Degrees AUC Accuracy F1 Recall Precision
               BIBIMBM 0.997 99%   0.929 0.929  0.929
               CI004   0.996 97%   0.938 0.938  0.938
               CI005   0.997 98%   0.966 0.960  0.973
               CI008   0.994 92%   0.929 0.929  0.929
               CI009   0.998 98%   0.975 0.975  0.975
               CI013   0.993 97%   0.959 0.935  0.983
               CI020   0.996 99%   0.989 0.979  1.000
               ECCBA   0.973 99%   0.940 0.986  0.899
               INACP   0.993 97%   0.934 0.949  0.918
               INALL   0.980 98%   0.902 0.974  0.841


5   Discussion
Our proposed model was based on academic data alone and on the analysis of
students’ academic semesters throughout their undergraduate program. The initial
results indicated a strong signal in the use of purely academic data for
predicting student dropout, with an accuracy of more than 95% in most cases
(Tables 2 and 3). In addition, the experimental results showed the
effectiveness and scalability of our algorithm across programs with different
numbers of enrolled students.
   Unfortunately, these results cannot be compared directly with those of
similar studies that have also used only academic data, for the following
reasons. First, the prediction model in [1] focuses on the first semesters only,
while ours follows the student throughout his/her undergraduate program. Second,
the variables are not the same. For instance, in [15], the authors used three
academic variables, but that study considered the fees deadline as an academic
indicator, while our model used purely academic data. Third, we tested our model
on different undergraduate programs, whereas the authors of [4] used a single
undergraduate program.
    We also acknowledge the limitations that prevent us from generating a
better model. These limitations arise from the data gathered. Although we
focus only on academic data for the model, the students’ financial status is
important, because this factor is one of the motivations behind students’
decisions to stop their studies [20]. Our limitation here is that ESPOL
calculated the students’ financial status using different value ranges for
students from 2000 until 2014 and for students from 2015 onwards. Another
limitation arises when students withdraw for a period of time and the code
assigned to their undergraduate program changes in the meantime: when they
return to their studies, they are registered in the database under the new
undergraduate program code, but with all their previous academic records as a
single semester. The portion of this student population with more than 10
subjects taken was 6%. This limitation was addressed by capping the average
number of subjects taken per semester at 7.


6   Conclusion and future work

Dropout prediction is an essential prerequisite for intervening with at-risk
students. To address this issue, we proposed a predictive model that uses
academic data alone and analyzes the information semester by semester to obtain
more reliable predictions. In addition, it is able to identify the earliest
moment at which a student is at risk of dropping out. To the best of our
knowledge, this is the first proposal of a predictive model able to produce
predictions even for students who have just finished their first semester.
    Our model was tested with real data from ESPOL and was also applied to
other large data sets from the different undergraduate programs. Experimental
results showed that our model obtains better results than existing proposals in
terms of input features and accuracy. Therefore, we have shown that our model
predicts with sufficiently high accuracy and, at the same time, predicts the
intention of dropping out early enough for corrective actions to be attempted.
    Additionally, the fact that the variables that worked are explicitly listed
can help other universities to replicate this model or adapt it to their needs.
    Our future work aims to make technical improvements to our prediction
model and to compare it with other machine learning algorithms, as other
studies do. Furthermore, we plan to use this model in postgraduate programs
and on academic data sets from other types of universities where attrition rates
tend to be much higher. Additionally, we are currently working on a
visualization prototype to integrate the results of the predictive model into
the Academic Counseling System. Using visualization techniques with explanatory
models gives counselors the opportunity to identify students at risk of an early
dropout and to adapt a recommendation strategy that minimizes a student’s risk
of failure. Lastly, we plan to analyze and visualize which input variables have
the most substantial influence on predicting dropout.


Acknowledgements

Work partially funded by the LALA project (grant no.586120-EPP-1-2017-1-ES-
EPPKA2-CBHE-JP). This project has been funded with support from the Eu-
ropean Commission. This work has also been partially funded by the Madrid Re-
gional Government through the e-Madrid-CM Project under Grant S2018/TCS-
4307, a project which is co-funded by the European Structural Funds (FSE and
FEDER). This work has also been partially funded by the FEDER/Ministerio de
Ciencia, Innovación y Universidades–Agencia Estatal de Investigación, through
the Smartlet Project under Grant TIN2017-85179-C3-1-R. This publication re-
flects only the views of the authors, and the funders cannot be held responsible
for any use which may be made of the information contained therein.


References

 1. Abu-Oda, G.S., El-Halees, A.M.: Data mining in higher education: university
    student dropout case study. International Journal of Data Mining & Knowledge
    Management Process 5(1) (2015)
 2. Alvarracín, P., Daniel, J.: Identificación del perfil de egreso correspondiente a la
    licenciatura de la carrera de laboratorio clínico e histotecnológico de la universidad
    central del ecuador periodo 2017-2022 (2016)
 3. Ameri, S., Fard, M.J., Chinnam, R.B., Reddy, C.K.: Survival analysis based frame-
    work for early prediction of student dropouts. In: Proceedings of the 25th ACM
    International on Conference on Information and Knowledge Management. pp. 903–
    912. ACM (2016)
 4. Aulck, L., Velagapudi, N., Blumenstock, J., West, J.: Predicting student dropout
    in higher education. arXiv preprint arXiv:1606.06364 (2016)
 5. Baneres, D., Rodriguez-Gonzalez, M.E., Serra, M.: An early feedback prediction
    system for learners at-risk within a first-year higher education course. IEEE Trans-
    actions on Learning Technologies (2019)
 6. Barbu, M., Vilanova, R., Lopez Vicario, J., Pereira, M.J., Alves, P., Podpora, M.,
    Ángel Prada, M., Morán, A., Torreburno, A., Marin, S., et al.: Data mining tool
    for academic data exploitation: literature review and first architecture proposal.
    Projecto SPEET-Student Profile for Enhancing Engineering Tutoring (2017)
 7. Breiman, L.: Random forests. Machine learning 45(1), 5–32 (2001)
 8. Catalano, V., Murcia Pérez, D.E., Escudero, F., Gerlo, P.D., Pantoja, A.:
    Influence of a tutorial system based on the use of ict in the decrease of dropout
    and academic failure of first-year students of the veterinary career of the juan
    agustín maza university. In: I Jornadas de Inclusión de Tecnologías Digitales en la
    Educación Veterinaria (La Plata, 2018) (2018)
 9. Cevallos, T.: Cuadernos del contrato social por la educación. Cuaderno 10, 34–46
    (2014)
10. Chen, J., Feng, J., Sun, X., Wu, N., Yang, Z., Chen, S.: Mooc dropout prediction
    using a hybrid algorithm based on decision tree and extreme learning machine.
    Mathematical Problems in Engineering 2019 (2019)
11. Chen, R.: Financial aid and student dropout in higher education: A heterogeneous
    research approach. In: Higher education, pp. 209–239. Springer (2008)
12. Chen, R.: Institutional characteristics and college student dropout risks: A multi-
    level event history analysis. Research in Higher education 53(5), 487–505 (2012)
13. Dalipi, F., Imran, A.S., Kastrati, Z.: Mooc dropout prediction using machine learn-
    ing techniques: Review and research challenges. In: 2018 IEEE Global Engineering
    Education Conference (EDUCON). pp. 1007–1014. IEEE (2018)
14. Ferreyra, M.M., Avitabile, C., Botero Álvarez, J., Haimovich Paz, F., Urzúa, S.:
    At a crossroads : Higher education in latin america and the caribbean (2017)
15. Jimenez, F., Paoletti, A., Sanchez, G., Sciavicco, G.: Predicting the risk of aca-
    demic dropout with temporal multi-objective optimization. IEEE Transactions on
    Learning Technologies (2019)
16. Kappe, R., van der Flier, H.: Predicting academic success in higher education:
    what’s more important than being smart? European Journal of Psychology of
    Education 27(4), 605–619 (2012)
17. Murphy, J.P., Murphy, S.A.: Get ready, get in, get through: Factors that influence
    latino college student success. Journal of Latinos and Education 17(1), 3–17 (2018)
18. OECD: Education at a Glance 2019 (2019).
    https://doi.org/10.1787/f8d7880d-en,
    https://www.oecd-ilibrary.org/content/publication/f8d7880d-en
19. Paura, L.: Cause analysis of students’ dropout rate in higher education study
    program. Procedia - Social and Behavioral Sciences 109, 1282–1286 (2014).
    https://doi.org/10.1016/j.sbspro.2013.12.625
20. Sinchi Nacipucha, E.R., Gómez Ceballos, G.P.: Access and desertion in universities.
    financing alternatives. ALTERIDAD. Revista de Educación 13(2), 274–287 (2018)
21. Suganya, S., Narayani, V.: Analysis of students dropout forecasting using data
    mining. In: 3rd International Conference on Latest Trends in Engineering, Science,
    Humanities and Management (2017)
22. Ting, S.M.R., Man, R.: Predicting academic success of first-year engineering stu-
    dents from standardized test scores and psychosocial variables. International Jour-
    nal of Engineering Education 17(1), 75–80 (2001)
23. Viloria, A., Lezama, O.B.P.: Mixture structural equation models for classifying
    university student dropout in latin america. Procedia Computer Science 160, 629–
    634 (2019)
24. Viloria Silva, A.J., Parody, A.: Methodology for obtaining a predictive model aca-
    demic performance of students from first partial note and percentage of absence
    (2016)
25. Vossensteyn, J.J., Kottmann, A., Jongbloed, B.W., Kaiser, F., Cremonini, L., Sten-
    saker, B., Hovdhaugen, E., Wollscheid, S.: Dropout and completion in higher edu-
    cation in europe: Main report (2015)