Modelling Computer Engineering Student
Trajectories with Process Mining
Pablo Martinez, Oscar Montañes, Juan Manuel Serralta and Libertad Tansini
Computer Science Institute (Inco), Faculty of Engineering, Universidad de la República, Montevideo, Uruguay


                                         Abstract
                                         This work presents an analysis of student learning trajectories in Computer Engineering studies.
                                         The analysis focuses on modelling characteristics that have an impact on dropout rates. Hence, the
                                         trajectories of dropout students and graduated students are analyzed and compared using Process Mining
                                         tools. Specific courses that are considered “difficult” and prevent academic progress are identified, as
                                         well as the last courses students pass before they drop out or graduate. Some of the courses identified
                                         as “bottlenecks” are Sistemas Operativos, Redes de Computadoras and Arquitectura de Computadores, and
                                         some of the courses that are left for last are Física 1 and Métodos Numéricos. The results show that
                                         dropout students mainly leave after the first year and that they choose courses related to Programming.
                                         The models adequately describe student trajectories, with fitness, a usual Process Mining metric, above
                                         97% on the trajectories of the log, and they do not overfit the training data set, showing high
                                         generalization.

                                         Keywords
                                         Student Learning Trajectories, Computer Engineering Degree, Process Mining, Modelling


1. Introduction
The motivation for this project is the need of the Education Unit of the Engineering Faculty
(Unidad de Enseñanza de Facultad de Ingeniería, de la Universidad de la República, Uruguay
(UEFI)) to explain the causes of dropout among students enrolled in the different degrees, based
on the information available in its Information System. This work is part of a larger project
[1] in which the reports required by UEFI were automated through the implementation of a
first version of a Data Warehouse; an in-depth descriptive analysis of the variables that may
have the greatest incidence on student dropout was carried out, exploring new data sources (such
as the Continuous Household Survey (ECH), primary school records from ANEP (Administración
Nacional de Educación Pública, https://www.anep.edu.uy/) and geo-referencing of the students'
addresses); and finally the trajectories and behaviour of students along their Computer
Engineering studies were modelled, using machine learning and process mining techniques to
analyze possible social and curricular reasons that explain dropout.
   This paper explains in more detail the last aspect of the project, that is, the analysis
of trajectories of Computer Engineering students using the process mining tool ProM Tools
[2] to model and analyze the data in the “Administrative Information System” (AIS) of the
University, also called “Bedelía”, “University administrator office”, “Admissions office”, “Office
of the Registrar”, etc.

   IV LATIN AMERICAN CONFERENCE ON LEARNING ANALYTICS
Email: p.martinezben@gmail.com (P. Martinez); omontanes@gmail.com (O. Montañes); juanserralta1@gmail.com
(J. M. Serralta); libertad@fing.edu.uy (L. Tansini)
ORCID: 0000-0002-1212-3871 (P. Martinez); 0000-0002-7893-823X (O. Montañes); 0000-0001-5564-7352 (J. M. Serralta);
0000-0001-6017-0114 (L. Tansini)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

   62% of the students drop out of their Computer Engineering studies. The goal of this work
is to provide analytical tools for decision makers to understand the reasons why students leave.
To achieve this goal, the trajectories of dropout and graduated students were modelled and
analyzed in order to identify courses that hinder students from advancing or graduating, and
the last courses students pass before they drop out or graduate.
   The results show general behavioural patterns of the students during their studies,
identifying courses that are hard to pass, or “bottleneck” courses, such as Sistemas Operativos,
Redes de Computadoras and Arquitectura de Computadores. Courses that are left for last, such as
Física 1 and Métodos Numéricos, are also found. The results show that dropout students mainly
leave without finishing the first year and that those who advance further manage to pass courses
related to Programming. The models adequately describe the trajectories, with a fitness above
97% on the log traces; they do not overfit the provided training logs and have high generalization.
   The following sections describe related work, methodology, results, and finally conclusions
and future work.


2. Works related to Learning Analytics and Process Mining
In Uruguay, tens of thousands of children join the education system every year at three or four
years of age, starting in this way their “educational life”. Many of them will finish primary
school, fewer will finish high school and only a few will achieve university degrees. Information
is naturally produced and registered by students and teachers (grades, daily attendance,
evaluations, etc.) in administrative or pedagogical information systems, recording students'
educational “traces” [3].
   For some decades now, there have been several initiatives advocating the application of
technologies such as Data Mining, Artificial Intelligence and Big Data, among others, to the
analysis of a variety of social problems, including education [4], in what has been called
“Learning Analytics” (LA) [5]. Privacy and regard for personal information is only one of
the many challenges of LA [6]. Several authors have used process mining, or “Educational process
mining” (EPM), to model different behavioural aspects of students' trajectories, for example
in the analysis of LMS (Learning Management Systems) logs [7]. A recent survey of different
application areas and tools can be found in [8].


3. Methodology
ProM Tools is used as the process mining tool to model students' trajectories from the data in
the “Administrative Information System” (AIS), with the aim of identifying “bottleneck” courses
and, given the flexibility students have to choose courses, finding courses that are left for
last and specific ordering patterns among courses for the different groups of students.

Table 1
Necessary information for Process Mining.
               Column        Type           Description
               id_number     INT            Id number of the student
               course_id     VARCHAR        Name of the course
               date_start    DATE           Date of first enrollment
               date_end      DATE           Approval date
               status        VARCHAR        State of the student (DROPOUT, ATTENDING, GRADUATED)


3.1. The AIS Data Base
In several meetings with the Education Unit of the Engineering Faculty (UEFI), and after signing
a confidentiality agreement, access was granted to the information in the AIS database. It is an
SQLite database with various tables that describe the students' activities; the most important
ones are described in this section.
   Table Activ2: contains all activities of the students in the different degrees they are enrolled
in. The activities are ordered by date and include course enrollments and exams, with their
corresponding result and grade.
   Table Estudiante: contains the Student entities with all their personal information, such
as birth date, gender, where they pursued high school studies and contact information.
   Table Estudiante-Carrera: registers the students' dates of enrollment and graduation in the
different degrees. There are over ten degrees.
   Table Asignaturas: contains the codes of the courses in the Engineering Faculty and their
relationship with the different degrees. There may be different evaluation methods and credits
per course in the different degrees.
   With this set of tables it is possible to determine the academic record of each student in
order to load the necessary information into the process mining tool, that is, to determine each
of the courses in which the students have enrolled, passed or failed, the number of each of these
events and the dates.
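
   The derivation of an academic record from an activity table can be sketched as follows. This
is only an illustration: the paper does not give the real column names of the AIS tables, so the
schema, the column names and the sample rows below are all assumptions.

```python
import sqlite3

# Hypothetical, simplified schema: the real AIS column names are not given
# in the paper, so every identifier below is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activ2 (
    student_id INTEGER,
    course_id  TEXT,
    activity   TEXT,    -- 'ENROLL' or 'EXAM'
    date       TEXT,    -- ISO date, so text order equals date order
    passed     INTEGER  -- 1 if this activity resulted in approval
);
INSERT INTO activ2 VALUES
    (1, 'Calculo 1',      'ENROLL', '2010-03-01', 0),
    (1, 'Calculo 1',      'EXAM',   '2010-07-15', 0),
    (1, 'Calculo 1',      'EXAM',   '2010-12-10', 1),
    (2, 'Programacion 1', 'ENROLL', '2011-03-01', 0);
""")

# One row per (student, course): first recorded activity, approval date
# (NULL if the course was never passed) and number of recorded activities.
query = """
SELECT student_id,
       course_id,
       MIN(date)                               AS date_start,
       MIN(CASE WHEN passed = 1 THEN date END) AS date_end,
       COUNT(*)                                AS n_events
FROM activ2
GROUP BY student_id, course_id
ORDER BY student_id, course_id
"""
records = conn.execute(query).fetchall()
for row in records:
    print(row)
```

   The result has the shape of Table 1 without the status column, which in this sketch would
have to come from a separate enrollment table.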

3.2. Extraction, Transformation and Loading (ETL)
It was necessary to transform, clean and unify the data in order to load it into ProM Tools, since
the structure of the aforementioned tables does not allow the students' academic records to be
obtained in a simple manner. Table 1 shows the required format describing the courses passed by
the students, i.e. their academic records.
   The following transformation, cleaning and unification operations were performed:

    • Elimination of trajectories with transferred courses: some course credits have been
      transferred from other schools or degrees; these courses are relatively few and distort
      the final models.
    • Elimination of optional courses: they represent a small portion of the total number of
      courses taken by the students and are too heterogeneous to model adequately at this stage.
    • Mandatory courses: the set of courses suggested by AIS [9] is used as the set of possible
      courses for the analysis.
    • Unification of courses: many courses undergo changes over the years, having different
      versions with different names, codes and content. For example, Cálculo 1 was named
      Calculo Dif. e Integral en una Variable from 2017 onwards, with a slight change in
      the content. For this analysis those differences are not relevant; hence, all
      related courses were grouped together.
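
   The cleaning operations listed above can be sketched in a few lines of Python. The field
names, the sample rows, the course sets and the `transferred` flag are hypothetical, and the
sketch reads the first operation as dropping the whole trajectory of any student with a
transferred course, which is an interpretation rather than a documented detail.

```python
# Hypothetical sample rows; field names and values are illustrative only.
rows = [
    {"id_number": 1, "course_id": "Cálculo 1", "transferred": False},
    {"id_number": 1, "course_id": "Electiva X", "transferred": False},
    {"id_number": 2, "course_id": "Calculo Dif. e Integral en una Variable", "transferred": False},
    {"id_number": 3, "course_id": "Programación 1", "transferred": True},
    {"id_number": 3, "course_id": "Cálculo 1", "transferred": False},
]

# Assumed subset of the mandatory courses in the suggested trajectory [9].
MANDATORY = {"Cálculo 1", "Programación 1"}

# Assumed mapping from renamed course versions to one canonical name.
CANONICAL = {"Calculo Dif. e Integral en una Variable": "Cálculo 1"}


def clean(rows):
    # 1. Drop whole trajectories that contain any transferred course.
    with_transfers = {r["id_number"] for r in rows if r["transferred"]}
    kept = [r for r in rows if r["id_number"] not in with_transfers]
    # 2. Unify the different versions of a course under one name.
    unified = [
        dict(r, course_id=CANONICAL.get(r["course_id"], r["course_id"]))
        for r in kept
    ]
    # 3. Keep only mandatory courses, which also removes optional ones.
    return [r for r in unified if r["course_id"] in MANDATORY]


cleaned = clean(rows)
for r in cleaned:
    print(r["id_number"], r["course_id"])
```

   Unifying names before filtering matters in this sketch: a renamed course would otherwise be
discarded as if it were optional.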


Table 2
Conversion from CSV to XES format.
                  Column        Description
                  Case_id       maps to id_number of the student
                  Event_id      maps to course_id
                  Date_finish   maps to approval date of the course or date_end

  The data is extracted in CSV format [10], transformed and loaded into ProM, where it is
processed. To apply the different process mining algorithms provided by the tool, it is
necessary to transform the raw data (CSV) to XES [11], the format required by ProM, which
represents events in an XML-based format. The conversion is made using a plug-in provided by
ProM, which requires the information shown in Table 2; this information is directly associated
with the data provided in the log.
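
   The mapping in Table 2 can be illustrated with a minimal sketch that writes a XES-like
document with the Python standard library. In this work the conversion was done with a ProM
plug-in, not with code like this; the sample CSV rows and the fixed time zone suffix are
assumptions, while `concept:name` and `time:timestamp` are the standard XES attribute keys.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical CSV content using the columns of Table 2.
CSV_TEXT = """Case_id,Event_id,Date_finish
1,Cálculo 1,2010-12-10
1,Programación 1,2010-07-20
2,Cálculo 1,2011-12-05
"""


def csv_to_xes(csv_text):
    # Group the events by case, i.e. by student.
    traces = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        traces[row["Case_id"]].append((row["Date_finish"], row["Event_id"]))

    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id, events in traces.items():
        trace = ET.SubElement(log, "trace")
        ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
        # Within a trace, events are ordered by approval date.
        for day, course in sorted(events):
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string", {"key": "concept:name", "value": course})
            # Assumption: dates carry no time of day, so midnight UTC is used.
            ET.SubElement(event, "date",
                          {"key": "time:timestamp", "value": day + "T00:00:00.000+00:00"})
    return log


xes_log = csv_to_xes(CSV_TEXT)
print(ET.tostring(xes_log, encoding="unicode"))
```

   Each student becomes one trace and each passed course one event, so the order of events in a
trace is the order in which the student passed the courses.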
  After several meetings with UEFI and the Academic Program Director of the Computer
Engineering Degree, it was decided to divide the students into three groups in order to model
them adequately: dropout, advanced and graduated. This strategy makes it possible to find
specific models for each group instead of pursuing more generalised models for all students
together. It is especially relevant to provide analytical tools for the dropout group, which
represents around 62% of the students. To this end, models are built and analyzed for the
dropout and graduated groups of students.

3.3. Methodology for the Modelling and Analysis
First, the dependency model for the provided logs is obtained with the plug-in
Interactive Data-Aware Heuristic Miner [12], which allows the application of different
algorithms for modelling and offers the final model in different process representations
(Dependency nets, Petri nets, Causal nets, etc.). The dependency graph generated
with the Flexible Heuristic Miner [13] as discovery algorithm gives a first approximation to the
characteristics of the underlying model in the log.
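
   As an illustration of what the values on a dependency graph express, the sketch below
computes the dependency measure commonly used by the Heuristics Miner family for two different
activities a and b, (|a>b| - |b>a|) / (|a>b| + |b>a| + 1), where |a>b| counts how often b directly
follows a. The traces and the short course labels are hypothetical, and the exact measure and
thresholds applied by the ProM plug-in in this work are not stated here.

```python
from collections import Counter

# Hypothetical traces: each one is the ordered list of courses a student passed.
traces = [
    ["P1", "P2", "P3"],
    ["P1", "P2", "P3"],
    ["P1", "P3", "P2"],
]


def directly_follows(traces):
    """Count how often course b is passed immediately after course a."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def dependency(counts, a, b):
    """Dependency measure between two different activities, in (-1, 1)."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    return (ab - ba) / (ab + ba + 1)


df = directly_follows(traces)
print(dependency(df, "P1", "P2"))
print(dependency(df, "P2", "P3"))
```

   Values close to 1 indicate that one course is consistently passed before the other, while
values near 0 indicate that both orders occur about equally often.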
   Then, with the objective of modelling the trajectories of the students, several algorithms
that explore and discover underlying models in the provided logs were analyzed. In this
study an inductive algorithm was used [14], through the plug-in “Mine Petri net
with Inductive Miner”, which produces a Petri net as a result. The choice was based on previous
work on the same topic, such as “Discovering learning processes using Inductive Miner: A case
study with Learning Management Systems (LMSs)” [14]. The default configuration was used
unless stated otherwise.
   Finally, conformance and performance tests on the models were made with the plug-ins “Replay
a log on Petri Net for Conformance/Performance analysis” and “Measure precision/generalization”.

3.4. Metrics
The following metrics were used to evaluate the conformance and performance of the models:

    • Fitness: percentage of traces that are recognized by the model.
    • Deviations: set of activities that present deviations with respect to the model.
    • Average Throughput: average duration of the traces.
    • Bottleneck Detection: courses that require more time to pass.
    • Precision: how precisely the model represents the observed process.
    • Generalization: how well the model reproduces future behaviour, that is, the confidence
      in the precision.
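
   Two of these metrics can be approximated directly from records in the shape of Table 1. The
sketch below is an assumption-laden illustration, not the replay-based computation performed by
ProM: it takes the throughput of a trace as the time between a student's first date_start and
last date_end, and the approval time of a course as date_end minus date_start. The sample records
are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (id_number, course_id, date_start, date_end).
records = [
    (1, "Cálculo 1", date(2010, 3, 1), date(2011, 3, 1)),
    (1, "Programación 1", date(2010, 3, 1), date(2010, 9, 1)),
    (2, "Cálculo 1", date(2011, 3, 1), date(2011, 9, 1)),
]

DAYS_PER_YEAR = 365.25


def average_throughput_years(records):
    """Mean time between the first enrollment and the last approval per student."""
    spans = {}
    for student, _course, start, end in records:
        first, last = spans.get(student, (start, end))
        spans[student] = (min(first, start), max(last, end))
    durations = [(last - first).days / DAYS_PER_YEAR for first, last in spans.values()]
    return sum(durations) / len(durations)


def mean_approval_years(records):
    """Mean time from first enrollment to approval, per course."""
    per_course = defaultdict(list)
    for _student, course, start, end in records:
        per_course[course].append((end - start).days / DAYS_PER_YEAR)
    return {course: sum(v) / len(v) for course, v in per_course.items()}


throughput = average_throughput_years(records)
approval = mean_approval_years(records)
print(round(throughput, 2))
for course, years in sorted(approval.items(), key=lambda kv: -kv[1]):
    print(course, round(years, 2))
```

   Sorting the courses by mean approval time gives a ranking of bottleneck candidates in the
same spirit as the tables reported in the Results section.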


Figure 1: Dependency graph of the courses of the graduated group of students.


4. Results
Information is available for 3477 students between the years 1997 and 2019. In total they
produce a log with 33396 events. In the following sections the models for the graduated and the
dropout groups of students are presented and analyzed.

4.1. Graduated students
This group is made up of 684 students (approximately 20%), with a total of 16416 events
(approximately 49%), since they are the ones with the most registered activity. An initial
inspection shows that 45% of the traces start with the basic math course Geometría y Álgebra
Lineal 1 as the first course passed. Most of the students, around 67% of them, finish with
Proyecto de Grado, which is the final thesis. It is also interesting to mention that Redes de
Computadoras and Métodos Numéricos appear second and third as the last courses passed to
finish the degree.
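
   This kind of initial inspection amounts to counting which course opens and which course
closes each trace. A minimal sketch, using hypothetical toy traces rather than the real log, is:

```python
from collections import Counter

# Hypothetical traces: ordered courses passed by each student (toy data).
traces = {
    101: ["Geometría y Álgebra Lineal 1", "Cálculo 1", "Proyecto de Grado"],
    102: ["Cálculo 1", "Programación 1", "Proyecto de Grado"],
    103: ["Geometría y Álgebra Lineal 1", "Programación 1", "Métodos Numéricos"],
    104: ["Programación 1", "Cálculo 1", "Redes de Computadoras"],
}


def position_shares(traces, index):
    """Share of traces whose course at `index` (0 = first, -1 = last) is each course."""
    counts = Counter(trace[index] for trace in traces.values() if trace)
    total = sum(counts.values())
    return {course: n / total for course, n in counts.items()}


first_shares = position_shares(traces, 0)
last_shares = position_shares(traces, -1)
print(first_shares)
print(last_shares)
```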
   The first model is the dependency graph, which presents the dependency relations between
courses. As can be seen in Figure 1, in the graph obtained from the log the rectangles represent
the courses and the values on the arrows represent the confidence in the dependency
relations (the higher the value, the greater the confidence). With this model it is possible to
identify common trajectories for students who graduate.


Figure 2: Petri net model for the graduated group of students.


  The following analysis is based on Petri nets [15], with the aim of identifying relevant
information within the process. The analysis focuses on the metrics Fitness, Deviations, Average
Throughput, Bottleneck Detection, Precision and Generalization. Figure 2 shows the Petri net
obtained with the plug-in “Mine Petri net with Inductive Miner” on the log.

Table 3
Average approval times for the graduated group of students.
             Course                                     Average approval time (years)
             Métodos Numéricos                                       2.5
             Introducción a la Ingeniería de Software                2.3
             Proyecto de Grado                                       2.1
             Fundamentos de Bases de Datos                           1.9
             Sistemas Operativos                                     1.9
             Proyecto de Ingeniería de Software                      1.3
             Programación 2                                          1.2
             Probabilidad y Estadística                              1.1
             Arquitectura de Computadores                            1.0
             Taller de Programación                                  1.0
             Programación 4                                          0.8
             Programación 3                                          0.6

  The conformance of the model with the provided log shows that the model reproduces 97% of
the traces, where 375 of the 648 analyzed traces align perfectly with the model. Also, 9 of the
24 courses are aligned in both the model and the log, and the rest present some deviations, since
some transitions were detected only in the model and not in the log. In these cases the deviation
is less than 5% of the traces. The performance analysis shows that the execution time of the
traces (throughput) is on average 8 years for all students, with a minimum of 4.5 years and a
maximum of 20 years. The waiting time analysis shows there are 12 courses with a high or very
high approval time, as can be seen in Table 3.
   The waiting time analysis for the courses shows relevant information regarding bottlenecks
in the model. It reveals that the courses Sistemas Operativos, Fundamentos de Bases de Datos,
Proyecto de Grado, Introducción a la Ingeniería de Software and Métodos Numéricos exhibit
approval times about four times the expected time, considering they should be passed in one
semester.
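
   The comparison with the expected duration can be reproduced from Table 3 with simple
arithmetic. The sketch below assumes that one semester corresponds to 0.5 years; the approval
times are the ones reported in Table 3 for the five courses mentioned.

```python
# Average approval times in years for the graduated group, copied from Table 3.
approval_years = {
    "Métodos Numéricos": 2.5,
    "Introducción a la Ingeniería de Software": 2.3,
    "Proyecto de Grado": 2.1,
    "Fundamentos de Bases de Datos": 1.9,
    "Sistemas Operativos": 1.9,
}

# Assumption: one semester is taken as 0.5 years.
SEMESTER_YEARS = 0.5

# Ratio between the observed approval time and the one-semester expectation.
ratios = {
    course: round(years / SEMESTER_YEARS, 1)
    for course, years in approval_years.items()
}
for course, ratio in ratios.items():
    print(f"{course}: {ratio} times the expected duration")
```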
   Finally, the conformance metrics of the model give a Precision of 0.30126 and a Generalization
of 0.8949, indicating that the models do not overfit the training data and that they are capable
of reproducing behaviour not present in the original log, being flexible enough to model new
traces.

4.2. Dropout students
For this analysis there were 2158 students (approximately 62% of the total), with a total of 8266
events (approximately 25%).
   A first inspection of the log of the dropout group shows that, for 69% of them, the last
course is one of the first-year courses: Cálculo 1, Geometría y Álgebra Lineal 1, Física 1,
Programación 1 or Matemática Discreta 1. This fact alone indicates that most of them do not
advance further than the first year.


Figure 3: Dependency graph of the courses of the dropout group of students.


  The dependency model, which gives a first approximation to the characteristics of the
underlying model in the log, was made with “Mine Petri net with Inductive Miner” on the log.
With the default parameters it displayed only first-year courses, because most students drop out
in the first year.
  Then the minimum frequency for a course to be considered by the algorithm was lowered, in
order to make more courses visible in the dependency model in Figure 3. The model shows
that the students who advance the most manage to pass programming courses, specifically
Programación 1 to Programación 3.
  Figure 4 shows the Petri net generated with the inductive miner on the log for the dropout
group of students, used to perform the conformance and performance analysis.
  The resulting model represents 97% of the traces in the log. The courses Cálculo 1,
Geometría y Álgebra Lineal 1, Programación 1 and Matemática Discreta 1 are the courses with
the highest frequency. Among these courses, only Geometría y Álgebra Lineal 1 and Matemática
Discreta 1 show deviations from the model, of less than 33% and 0.1% respectively.

Figure 4: Petri net model for the dropout group of students.


   The performance analysis of the model shows that, on average, students drop out after 20.16
months, or approximately one and a half years. The courses with high approval times are
Probabilidad y Estadística, Introducción a la Investigación de Operaciones, Programación 4, Métodos
Numéricos, Fundamentos de Bases de Datos and Taller de Programación; see Table 4. It is worth
mentioning that very few of the dropout students reach the more advanced courses in the list,
hence more information is needed to completely understand the results.

Table 4
Average approval times for the dropout group of students.
         Course                                           Average approval time (years)
         Probabilidad y Estadística                                     1.5
         Taller de Programación                                         1.6
         Métodos Numéricos                                               2
         Fundamentos de Bases de Datos                                  2.4
         Programación 4                                                 2.9
         Introducción a la Investigación de Operaciones                  3

   The models show a Precision of 0.41671 and a Generalization of 0.99775, both higher than
the values for the graduated students. The models do not overfit the log and have high
generalization capabilities.


5. Conclusions and future work
ProM turned out to be versatile and flexible, allowing general analysis of the data, the
generation of different models and a variety of metrics over them. Nevertheless, for the results
to be useful it is necessary to have a deep understanding of the students' information, to
perform adequate preprocessing and cleaning, and finally to adapt the data to the format
required as input by ProM.
   The models give insight into the learning trajectories or behaviour of the students in
their Computer Engineering studies, enabling the identification of “bottleneck” courses, or
courses that are hard to pass, as well as courses that do not hinder students from advancing,
like Física 1 and Métodos Numéricos. Some of the “bottleneck” courses are Sistemas Operativos,
Redes de Computadoras and Arquitectura de Computadores.
   Considering the dropout group of students, it was possible to verify, both with statistical
methods and with process mining models, that most of them drop out in the first year and that
those who advance the most choose courses related to programming.
   The models adequately describe the student information, with fitness over 97% on the log
traces; they do not overfit the log and can recognize traces other than those in the
training log, with high precision and generalization.
   For future work it is desirable to include updated information in the models, since due to
organizational restrictions it was only possible to work with data until 2019. It is also
possible to model other aspects of the learning trajectories by including all of the courses,
such as the optional and transferred ones that were excluded in this work.
   ProM has a series of plug-ins that have not been tested and could be studied for the purpose
of modeling student behaviour.
   It is of great interest to consider other information than that in the AIS Data Base to be
added to the models, such as gender, age, work information, income, primary and high school
information.
   Finally, we aim to explore the utility of models segmented by semesters or years to
obtain more details.


References
 [1] P. Martínez, O. Montañés, J. Serralta, Modelado de trayectorias académicas de estudiantes
     universitarios mediante técnicas de analítica de aprendizaje, Tesis de grado. Universidad
     de la República (Uruguay). https://hdl.handle.net/20.500.12008/28848 (2021).
 [2] ProM Tools home page, 2020. URL: http://www.promtools.org/doku.php?id=start (Accessed
     on 03/12/2020).
 [3] Del papel a la nube: Cómo guiar la transformación digital de los sistemas de
     información y gestión educativa (SIGED), 2019. Banco Interamericano de Desarrollo. URL:
     https://publications.iadb.org/es/del-papel-la-nube-como-guiar-la-transformacion-digital-de-los-sistemas
     (Accessed on 10/10/2019).
 [4] I. Jara, J. Ochoa, Usos y efectos de la inteligencia artificial en educación, Sector Social,
     División Educación. Documento para discusión número IDB-DP-00-776. BID.
     doi: http://dx.doi.org/10.18235/0002380 (2020).
 [5] G. Siemens, P. Long, Penetrating the fog: Analytics in learning and education,
     EDUCAUSE Review 46 (2011) 30.
 [6] A. Pardo, G. Siemens, Ethical and privacy principles for learning analytics, British Journal
     of Educational Technology 45 (2014) 438–450.
 [7] C. Romero, R. Cerezo, A. Bogarín, M. Sánchez-Santillán, Educational Process Mining:
     Applications in Edu. Research, 2016, pp. 1–28. doi:10.1002/9781118998205.ch1.
 [8] A. Bogarín, R. Cerezo, C. Romero, A survey on educational process mining,
     WIREs Data Mining and Knowledge Discovery 8 (2018) e1230. URL:
     https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/widm.1230. doi:10.1002/widm.1230.
 [9] Trayectoria sugerida para la carrera en ingeniería en computación, plan 97, 2020.
     URL: https://www.fing.edu.uy/carreras/grado/computacion/implementacion/archivos/
     TrayectoriaSugerida.pdf, (Accessed on 15/12/2020).
[10] CSV, comma separated values file, 2020. URL: https://tools.ietf.org/html/rfc4180#section-2
     (Accessed on 15/12/2020).
[11] XES, extensible event stream, 2020. URL: http://xes-standard.org/ (Accessed on 15/12/2020).
[12] F. Mannhardt, M. de Leoni, H. A. Reijers, Heuristic mining revamped: An interactive,
     data-aware, and conformance-aware miner., in: BPM (Demos), 2017.
[13] A. J. M. M. Weijters, J. T. S. Ribeiro, Flexible heuristics miner (FHM), 2011 IEEE Symposium
     on Computational Intelligence and Data Mining (CIDM) (2011) 310–317.
[14] A. Bogarín, R. Cerezo, C. Romero, Discovering learning processes using inductive miner:
     A case study with learning management systems (LMSs) (2018).
[15] T. Murata, Petri nets: Properties, analysis and applications, Proceedings of the IEEE 77
     (1989) 541–580. doi:10.1109/5.24143.