=Paper=
{{Paper
|id=Vol-3282/icaiw_waai_1
|storemode=property
|title=Analysis on Early Dropouts in Engineering Careers
|pdfUrl=https://ceur-ws.org/Vol-3282/icaiw_waai_1.pdf
|volume=Vol-3282
|authors=Ignacio Manes,Tomas Lubertino,Jorge Anca,Karen Roberts,Hernán Merlino
|dblpUrl=https://dblp.org/rec/conf/icai2/ManesLARM22
}}
==Analysis on Early Dropouts in Engineering Careers==
<pdf width="1500px">https://ceur-ws.org/Vol-3282/icaiw_waai_1.pdf</pdf>
<pre>
Analysis on Early Dropouts in Engineering Careers
Ignacio Manes*, Tomas Lubertino, Jorge Anca, Karen Roberts and Hernán Merlino
Universidad de Buenos Aires, Buenos Aires, Argentina


                                      Abstract
                                      Low graduation rates, careers that extend longer than normal and university dropouts are a reality in
                                      universities in Argentina. The objective of this work is to analyze, from the perspective of data science,
                                      early dropout at the Faculty of Engineering of the University of Buenos Aires, in the context of the
                                      careers of Computer Engineering and Bachelor of Systems Analysis.

                                      Keywords
                                      University Dropouts, Higher Education Dropouts, Computer Science Careers, Data Science


1. Introduction
The present work takes as its main focus the careers of Computer Engineering and Bachelor
of Systems Analysis that are taught at the Faculty of Engineering of the University of Buenos
Aires, which historically register high levels of student dropouts that cause low graduation
rates and an increase in the average time to degree.


2. State of the art
This section summarizes academic research that tries to explain the phenomenon of dropout
among higher education students through different models.
   Some works explain this phenomenon through two sociological theories. The first is the
"Student integration model" [1, 2], where the integration of the student into the academic
world directly affects the decision of whether or not to drop out. The second is the
"Student attrition model" [3], which gives relevance to factors external to the educational
institution.
   According to [4], depending on the relevance given to the variables that try to explain
dropout and retention, whether family, individual or institutional, the analysis addresses
different dimensions: Psychological, Economic, Sociological, Organizational and Interaction.
   A report of the Argentine Ministry of Education [5] covered a total of 21 unified terminals
of the discipline according to CONFEDI (Federal Council of Deans of Engineering). Focusing on
the Informatics/Systems career of public institutions, it was possible to produce a series of
graphs showing the evolution of the number of students between 2007 and 2016 (Figure 1).


Figure 1: Students, Newly Enrolled, Re-enrolled, and Graduates of Computer Science/Systems
Engineering per year.


ICAIW 2022: Workshops at the 5th International Conference on Applied Informatics 2022, October 27–29, 2022, Arequipa, Peru
* Corresponding author
imanes@fi.uba.ar (I. Manes); tlubertino@fi.uba.ar (T. Lubertino); janca@fi.uba.ar (J. Anca); kroberts@fi.uba.ar
(K. Roberts); hmerlino@fi.uba.ar (H. Merlino)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Ignacio Manes et al. CEUR Workshop Proceedings                                             1–10

   Figure 2 presents the evolution of engineering degree graduates. Both the total number of
students in the program (Figure 1) and the number of graduates (Figure 2) decreased over the
years. In addition, the number of graduates was always very low in relation to the total
(an average of less than 1000 graduates per year).


Figure 2: Computer Science/Systems Engineering graduates over the years.


Figure 3: Dropout of Computer Science/Systems Engineering students year by year.

   Finally, the number of students re-enrolled in this period was compared between
consecutive years (Figure 3) to estimate how many students presumably dropped out of the
degree. For example, in 2007 there were a total of 27,179 re-enrolled students, while in 2008
the total was 26,079, a difference of 1,100 fewer students.
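
   The year-to-year comparison described above can be sketched in a few lines of Python.
This is only an illustration, not the procedure used in the report: the function name
yearly_drop is hypothetical, and the only data are the two re-enrollment totals quoted in
the text.

```python
# Re-enrolled students per year; only the two totals quoted in the text are used.
reenrolled = {2007: 27179, 2008: 26079}


def yearly_drop(counts):
    """Return {year: previous_year_count - year_count} for consecutive years."""
    years = sorted(counts)
    return {cur: counts[prev] - counts[cur] for prev, cur in zip(years, years[1:])}


drops = yearly_drop(reenrolled)
print(drops)  # {2008: 1100}
```

A positive value is read as the presumed number of students who left between two years.
It is only a rough proxy, since it ignores students who graduate or newly enter the
re-enrolled group.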
   The goal of this work is to detect early dropout in the Systems Analysis and Computer
Engineering degree courses. We try to obtain behavior patterns from the database of the
myRPL.ar platform¹ through the use of information exploitation processes.


3. Analysis
An automatic data analysis was carried out on the database of the myRPL.ar platform, using
Python², Jupyter³ notebooks and the Pandas Profiling library⁴. We concluded that the tables
that were not included in the query that builds the dataset did not contain information
relevant to the early dropout analysis. The following tables were therefore chosen:

    • activities: Provides information on the activities carried out by the students.
    • activitiy_submissions: Relates each of the activities with the submissions made by the
      students.
    • course_users: Provides information about all the courses in which the student is or was
      enrolled.

1 https://myrpl.ar/
2 https://www.python.org/
3 https://jupyter.org/
4 https://github.com/ydataai/pandas-profiling

   Among the discarded tables is rpl_files_report, which holds the code of each of the
student submissions. We ran a linter to obtain a score for each submission, but this option
was discarded since the activitiy_submissions table already contains a status for each
submission (failure, build_error, success, runtime_error, time_out), assigned according to
the teacher’s criteria.
   From this information, a dataset was built that contains data about the student, the
semester of the subject, the code of the subject, and all the submissions for the different
tasks with their respective status. The first EDA (Exploratory Data Analysis) showed that
the data obtained belonged to 10 different careers.
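
   As a rough illustration of how submission statuses can be turned into per-student
features, the sketch below counts each status per student. It is not the query used in this
work: the row layout (student_id, activity_id, status), the sample rows and the function
name status_counts are assumptions made for the example. Only the five status values come
from the text.

```python
from collections import Counter, defaultdict

# Status values listed in the text for the activitiy_submissions table.
STATUSES = ("failure", "build_error", "success", "runtime_error", "time_out")

# Hypothetical rows: (student_id, activity_id, status).
submissions = [
    (1, 10, "failure"),
    (1, 10, "success"),
    (1, 11, "build_error"),
    (2, 10, "failure"),
    (2, 10, "failure"),
    (2, 11, "time_out"),
]


def status_counts(rows):
    """Return {student_id: {STATUS: count}} with one entry per known status."""
    counts = defaultdict(Counter)
    for student_id, _activity_id, status in rows:
        counts[student_id][status] += 1
    return {
        student: {s.upper(): c[s] for s in STATUSES}
        for student, c in counts.items()
    }


features = status_counts(submissions)
print(features[2]["FAILURE"])  # 2
```

Each student ends up with one count per status, so a column such as FAILURE can be read
as the number of failed submissions of that student.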
   To complement the analysis made with Pandas Profiling, the D-tale library⁵ was used to
analyze the predictive power of each of the variables and to see which models the library
recommended. In addition, several clustering algorithms were applied to look for a common
pattern in the data that could mark a tendency to leave one or several subjects. Using the
PyCaret⁶ Automated Machine Learning library, the following models were run:

        • kmeans
        • meanshift
        • sc
        • hclust
        • birch
5 https://pypi.org/project/dtale/
6 https://pycaret.org/


Figure 4: 2D Spectral Clustering PCA.




Figure 5: Distortion Score Elbow for the Spectral Clustering Model.


  The model with the best performance was the Spectral Clustering model, with a much
higher silhouette score than the other models, resulting in the clustering shown in Figure 4.
  As can be seen in the elbow plot (Figure 5), the ideal number of clusters is 3. When
analyzing the groups formed, we did not notice any pattern that indicates desertion by the
students.
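
  For reference, the silhouette score used to compare the clustering models can be computed
as in the sketch below. It is a plain-Python illustration of the standard definition, with
hypothetical sample points. It is not the PyCaret implementation used in this work.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the point's own cluster
        # (the point itself adds a distance of 0 to the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical 2D points forming two well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across it
print(good > bad)  # True
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0
indicate overlapping or badly assigned groups.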
  After this stage, since the dropout variable could not be obtained, an anomaly detection
model was applied to emulate it, using the anomaly detection module of PyCaret. Three
different types of models were tested:

    • knn
    • iforest
    • svm

   The svm model was the only one that concentrated a group of students at almost a single
point in space, with respect to the dimensionality-reduction variables used by PyCaret.
   As can be seen in the 3D t-SNE plot (Figure 6), the anomalous points are clustered in a
single region.
   The UMAP plot (Figure 7) shows the same result in two dimensions.
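
   The idea of using anomaly flags as a stand-in for the missing dropout variable can be
illustrated with a small distance-based detector. The sketch below scores each point by the
distance to its k-th nearest neighbour, in the spirit of the knn model listed above. It is
not the svm model retained in this work, and the function name knn_anomaly_flags, its
parameters and the sample points are assumptions made for the example.

```python
from math import dist


def knn_anomaly_flags(points, k=2, fraction=0.25):
    """Flag (1) the points whose distance to their k-th nearest neighbour is largest.

    About `fraction` of the points are flagged; ties at the threshold
    can flag a few more.
    """
    scores = []
    for i, p in enumerate(points):
        neighbours = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(neighbours[k - 1])
    n_flagged = max(1, round(fraction * len(points)))
    threshold = sorted(scores, reverse=True)[n_flagged - 1]
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical feature vectors: four similar students and one isolated one.
students = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9)]
flags = knn_anomaly_flags(students, k=2, fraction=0.2)
print(flags)  # [0, 0, 0, 0, 1]
```

The resulting 0/1 flag plays the role of the synthetic variable that a classifier can then
be trained to predict.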
   Two classification models were applied to the resulting dataset, each from a different
library. The first was the PyCaret classifier, where random forest was the best-performing
model (Figure 8).




Figure 6: 3D TSNE for Outliers (svm).


Figure 7: uMAP for Outliers (svm).


  Similarly, the confusion matrix (Figure 9) shows the results for this unbalanced dataset,
with only one false positive.
  On the other hand, it can be seen (Figure 10) that the FAILURE variable is extremely
important for the prediction.
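
  How a confusion matrix is read on an unbalanced dataset can be sketched as follows. The
labels below are hypothetical and chosen only so that exactly one false positive appears.
They are not the results of Figure 9.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


# Hypothetical labels: 1 = flagged as dropout, 0 = not flagged.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1]

c = confusion_counts(y_true, y_pred)
precision = c["tp"] / (c["tp"] + c["fp"])
recall = c["tp"] / (c["tp"] + c["fn"])
accuracy = (c["tp"] + c["tn"]) / len(y_true)
print(c)  # {'tp': 2, 'fp': 1, 'tn': 5, 'fn': 0}
```

With few positive cases, a single false positive lowers precision noticeably while accuracy
stays high, which is why precision and recall are worth reporting next to accuracy for
unbalanced data.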




Figure 8: Analysis of models from the PyCaret library applied to the synthetic variable.


   The other model was generated with the TPOT library⁷. It produced a pipeline that
includes a logistic regression and, like PyCaret, a random forest. The metrics (Figure 11)
show excellent results, just like the first model.
   All the notebooks used in this paper can be found on GitHub [6].
   Finally, a dashboard⁸ was developed using Next.js⁹ for the frontend and Vercel¹⁰ for cloud
deployment, in order to give visibility to the data and reflect the problems mentioned above. It
shows the data table used for the present work together with some graphs that help to see the
distribution of students and the types of events generated by the course.


4. Conclusions
As a conclusion of the process carried out, it can be seen how data science helps to answer
the research question of whether it is possible to detect student dropout, which implies low
graduation rates. Hints have been found about the existence of indicators such as the FAILURE
characteristic, which represents the number of times that the student fails an assignment and
appears to be a good predictor in this first instance of the ongoing investigation, allowing
teachers to take action at early stages to prevent student dropout. The ongoing research allows
us to start detecting alternative characteristics that are good predictors of university dropout,
and this line of research will continue in order to find new predictive characteristics.

7 http://epistasislab.github.io/tpot/
8 https://dashboard-tp-profesional.vercel.app/
9 https://nextjs.org/
10 https://vercel.com/


Figure 9: Confusion matrix


Figure 10: Variable importance for the PyCaret classification model




Figure 11: TPOT Ranking Model Metrics


5. Future work
As next steps to obtain a model that reflects the reality of the students, the following datasets
should be included:

        • Data from the SIU Guaraní¹¹ to obtain the dropout variable to be predicted.
        • Data of all the careers of the University, to extend the analysis carried out in computer
          science subjects to the subjects of all the careers.
        • Student satisfaction statistics.

   Once the new data is acquired, the model will need to be iterated on to confirm that it
fits the newly added variables. Models of early dropout from the University could then be
implemented, as well as models of performance and dropout by subject, making it easier to
reinforce the monitoring of students who are at risk of any of the situations mentioned
previously.


6. Acknowledgments
The authors of this work acknowledge the Faculty of Engineering, University of Buenos Aires.
This work is framed within the PIDAE "Intelligent systems in university management".


References
[1] W. G. Spady, Dropouts from higher education: An interdisciplinary review and synthesis,
    Interchange 1 (1970) 64–85.
[2] V. Tinto, Dropout from higher education: A theoretical synthesis of recent research, Review
    of educational research 45 (1975) 89–125.
[3] J. P. Bean, Student attrition, intentions, and confidence: Interaction effects in a path model,
    Research in higher education 17 (1982) 291–320.
[4] J. Braxton, Reworking the student departure puzzle, Vanderbilt University Press, 2000.
[5] Ministerio de Educacion de Argentina, Informe especial: Estudiantes, nuevos inscriptos,
    reinscriptos y egresados de ingeniería, 2007-2016.
[6] I. Manes, Desgranamiento temprano en las carreras de ingeniería, 2022.

11 https://guaraniautogestion.fi.uba.ar/g3w/acceso

</pre>