<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Computer Engineering Student Trajectories with Process Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pablo Martinez</string-name>
          <email>p.martinezben@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Montañes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan Manuel Serralta</string-name>
          <email>juanserralta1@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Libertad Tansini</string-name>
          <email>libertad@fing.edu.uy</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Institute (Inco), Faculty of Engineering, Universidad de la República</institution>
          ,
          <addr-line>Montevideo</addr-line>
          ,
          <country country="UY">Uruguay</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work presents an analysis of student learning trajectories in Computer Engineering studies. The analysis focuses on modelling characteristics that have an impact on dropout rates. To this end, the trajectories of dropout students and graduated students are analyzed and compared using Process Mining tools. Specific courses that are considered “difficult” and hinder academic progress are identified, as well as the last courses students pass before dropping out or graduating. Some of the courses identified as “bottlenecks” are Sistemas Operativos, Redes de Computadoras and Arquitectura de Computadores, and some of the courses that are left for last are Física 1 and Métodos Numéricos. The results show that dropout students mainly leave after the first year and that they choose courses related to Programming. The models adequately describe student trajectories, with a fitness (the usual Process Mining metric) above 97% on the trajectories of the log, and they do not overfit the training data set, showing high generalization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The motivation for this project is the need of the Education Unit of the Engineering Faculty
IV LATIN AMERICAN CONFERENCE ON LEARNING ANALYTICS
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to model and analyze the data in the “Administrative Information System” (AIS) of the
University, also called “Bedelía”, “University administrator office”, “Admissions office”, “Office
of the Registrar”, etc.
      </p>
      <p>62% of the students drop out of their Computer Engineering studies. The goal of this work
is to provide analytical tools that help decision makers understand why students leave.
To achieve this goal, the trajectories of dropout and graduated students were modelled and analyzed
in order to identify courses that hinder students from advancing or graduating, and the last courses
students pass before dropping out or graduating.</p>
      <p>The results show general behavioural patterns of the students during their studies,
identifying courses that are hard to pass, or “bottleneck” courses, such as Sistemas Operativos, Redes de
Computadoras and Arquitectura de Computadores. Courses that are left for last are also found,
such as Física 1 and Métodos Numéricos. The results show that dropout students mainly leave
without finishing the first year, and that those who advance further manage to pass courses related
to Programming. The models adequately describe the trajectories, with a fitness above 97%
on the log traces; they do not overfit the provided training logs and have high generalization.</p>
      <p>The following sections describe related works, methodology, results, and finally conclusions
and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Works related to Learning Analytics and Process Mining</title>
      <p>
        In Uruguay, tens of thousands of students join the education system every year at three or four
years of age, thus starting their “educational life”. Many of them will finish primary school, fewer
will finish high school and only a few will obtain university degrees. Information is naturally
produced and recorded by students and teachers (grades, daily attendance, evaluations, etc.) in
administrative or pedagogical information systems, recording the students' educational “traces” [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        For some decades now, there have been several initiatives advocating for the application of
technologies such as Data Mining, Artificial Intelligence, Big Data, among others to the analysis
of a variety of social problems, including education [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], in what has been called “Learning
Analytics” (LA) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Privacy of information and regard for personal information is only one of
the many challenges of LA [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Several authors have used process mining, or “Educational process
mining” (EPM), to model different behavioural aspects of student trajectories, for example
in the analysis of LMS (Learning Management Systems) logs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; a recent survey of different
application areas and tools can be found in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>ProM Tools is used as the process mining tool to model student trajectories from the data in
the “Administrative Information System” (AIS). The aim is to identify “bottleneck” courses
and, given that students have flexibility in choosing courses, to find courses that are left for last and specific ordering
patterns among courses for the different groups of students.</p>
      <sec id="sec-3-1">
        <title>3.1. The AIS Data Base</title>
        <p>After several meetings with the Education Unit of the Engineering Faculty (UEFI), and having previously
signed a confidentiality agreement, access was granted to the information in the AIS
Data Base. It is an SQLite database with various tables that describe the students' activities; the most
important ones are described in this section.</p>
        <p>Table Activ2: contains all activities of the students in the different degrees they are enrolled
in. The activities are ordered by date and include course enrollments and exams, with their
corresponding result and grade.</p>
        <p>Table Estudiante: contains the Student entities with their personal information, such
as birth date, gender, where they attended high school and contact information.</p>
        <p>Table Estudiante-Carrera: records the students' enrollment and
graduation dates in the different degrees. There are over ten degrees.</p>
        <p>Table Asignaturas: contains the codes of the courses in the Engineering Faculty and their
relationship with the different degrees. Evaluation methods and credits
per course may differ between degrees.</p>
        <p>With this set of tables it is possible to determine the academic record of each student and load the
necessary information into the process mining tool. That is, to determine each of the courses in
which the students have enrolled, passed or failed, the number of occurrences of each event and the dates.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Extraction, Transformation and Loading (ETL)</title>
        <p>It was necessary to transform, clean and unify the data in order to load it into ProM Tools, since
the structure of the aforementioned tables does not make it simple to obtain the
academic records of the students. Table 1 shows the required format, describing the courses passed by
the students, i.e. their academic records.</p>
        <p>The following transformation, cleaning and unification operations were performed:
• Elimination of trajectories with transferred courses: these are courses whose credits were transferred from other schools or degrees; they are relatively few and distort the final models.
• Elimination of optional courses: they represent a small portion of the total number of courses taken by the students and are too heterogeneous to model adequately at this stage.
• Mandatory courses: the courses suggested by AIS [9] are used as the set of possible courses for the analysis.
• Unification of courses: many courses change over the years, with different versions that have different names, codes and content. For example, Cálculo 1 has been named Calculo Dif. e Integral en una Variable since 2017, with a slight change in content. Those differences are not relevant for this analysis; hence, all related courses were grouped together.</p>
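        <p>The cleaning operations listed above can be illustrated with a minimal Python sketch. It is not the authors' ETL code: the event layout (keys student, course, date, transferred), the alias table and the function name clean_events are assumptions introduced only for illustration.</p>
        <preformat>
```python
# Hypothetical alias table: every historical course name maps to one canonical name.
COURSE_ALIASES = {
    "Calculo Dif. e Integral en una Variable": "Cálculo 1",
}


def clean_events(events, mandatory, aliases=COURSE_ALIASES):
    """Apply the cleaning steps to a list of event dicts.

    Each event is assumed to look like
    {"student": ..., "course": ..., "date": ..., "transferred": bool}.
    """
    # Unification: rename every course version to its canonical name.
    unified = [dict(e, course=aliases.get(e["course"], e["course"])) for e in events]
    # Transferred courses: drop the whole trajectory of any affected student.
    excluded = {e["student"] for e in unified if e["transferred"]}
    # Optional courses: keep only events whose course is in the mandatory set.
    return [
        e for e in unified
        if e["student"] not in excluded and e["course"] in mandatory
    ]
```
        </preformat>
        <p>In this sketch the set of mandatory courses is passed in explicitly, so the same function could be reused if the suggested trajectory changes.</p>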
        <p>The data is extracted in CSV format [10], transformed and loaded into ProM, where it is
processed. To apply the process mining algorithms provided by the
tool, the raw data (CSV) must be converted to XES [11], the format required by
ProM, which represents events in an XML-based format. The conversion is made using a
plug-in provided by ProM, which requires the information shown in Table 2;
this information maps directly to the data provided in the log.</p>
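        <p>To make the CSV to XES step concrete, the following sketch builds a minimal XES document with the Python standard library. It stands in for, and is not, the ProM conversion plug-in used in this work. The column names student, course and date are assumptions, rows are assumed to be already ordered by date within each student, and timestamps are assumed to be ISO 8601 strings.</p>
        <preformat>
```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xes(csv_text, case_col="student", activity_col="course", time_col="date"):
    """Group CSV rows by case id and return a minimal XES log as a string."""
    log = ET.Element("log", {"xes.version": "1.0"})
    traces = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        case_id = row[case_col]
        if case_id not in traces:
            # One trace per student, named after the case id.
            trace = ET.SubElement(log, "trace")
            ET.SubElement(trace, "string", key="concept:name", value=case_id)
            traces[case_id] = trace
        # One event per row: the activity name and its timestamp.
        event = ET.SubElement(traces[case_id], "event")
        ET.SubElement(event, "string", key="concept:name", value=row[activity_col])
        ET.SubElement(event, "date", key="time:timestamp", value=row[time_col])
    return ET.tostring(log, encoding="unicode")
```
        </preformat>
        <p>The keys concept:name and time:timestamp are the standard XES attributes for activity names and event times; a production log would normally also declare the corresponding extensions.</p>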
        <p>After several meetings with UEFI and the Academic Program Director of the Computer
Engineering Degree, it was decided to divide the students into three groups in order to model
them adequately: dropout, advanced and graduated. This strategy makes it possible to find specific models
for each group instead of pursuing more general models for all students together. It is
especially relevant to provide analytical tools for the dropout group, which represents
around 62% of the students. To this end, models are built and analyzed for the dropout and
graduated groups of students.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Methodology for the Modelling and Analysis</title>
        <p>First, the dependency model for the provided logs is obtained with the
plug-in Interactive Data-Aware Heuristic Miner [12], which allows different
modelling algorithms to be applied and offers the final model in different process representations
(Dependency nets, Petri nets, Causal nets, etc.). The dependency graph generated
with the Flexible Heuristic Miner [13] as the discovery algorithm gives a first approximation of the
characteristics of the model underlying the log.</p>
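        <p>As an illustration of what the values in a dependency graph express, the sketch below computes the basic dependency measure commonly associated with heuristics-based miners, (|a&gt;b| - |b&gt;a|) / (|a&gt;b| + |b&gt;a| + 1), where |a&gt;b| counts how often b directly follows a. This is an assumed simplification: the ProM plug-in applies additional thresholds and options that are not reproduced here, and the function names are hypothetical.</p>
        <preformat>
```python
from collections import Counter


def directly_follows(traces):
    """Count how often course a is immediately followed by course b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def dependency(counts, a, b):
    """Dependency measure between two distinct activities, strictly between -1 and 1."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    return (ab - ba) / (ab + ba + 1)
```
        </preformat>
        <p>Values close to 1 suggest that a is usually followed by b and rarely the other way round; values near 0 suggest no clear ordering between the two courses.</p>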
        <p>Then, with the objective of modelling the trajectories of the students, several algorithms
that explore and discover underlying models in the provided logs were analyzed. In this
study, an inductive algorithm was used [14] through the plug-in “Mine Petri net
with Inductive Miner”, which produces a Petri net as a result. The choice was based on previous
work on the same topic, such as “Discovering learning processes using Inductive Miner: A case
study with Learning Management Systems (LMSs)” [14]. The default configuration was used
unless stated otherwise.</p>
        <p>Finally, performance tests on the models were made with the plug-ins “Replay a log on
Petri Net for Conformance/Performance analysis” and “Measure precision/generalization”.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Metrics</title>
        <p>The following metrics were used to evaluate the conformance and performance of the models:
• Fitness: percentage of traces that are recognized by the model.
• Deviations: set of activities that present deviations with respect to the model.
• Average Throughput: average duration of the traces.
• Bottleneck Detection: courses that require more time to pass.
• Precision: how precisely the model represents the observed process.
• Generalization: how well the model reproduces future behaviour, i.e. the confidence in the precision.</p>
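        <p>Two of the metrics above can be sketched directly from their definitions. The code is a simplified illustration and not what ProM computes internally (ProM relies on log replay and alignments): the trace-level fitness below takes an arbitrary, hypothetical predicate accepts standing in for a model, and the log layout (student id mapped to a list of course and date pairs) is an assumption.</p>
        <preformat>
```python
from datetime import date


def average_throughput_days(log):
    """Mean number of days between the first and last event of each trace.

    `log` maps a student id to a non-empty list of (course, date) events.
    """
    durations = [
        (max(d for _, d in events) - min(d for _, d in events)).days
        for events in log.values()
    ]
    return sum(durations) / len(durations)


def fitness_share(traces, accepts):
    """Share of traces for which the model predicate `accepts` returns True."""
    return sum(1 for t in traces if accepts(t)) / len(traces)
```
        </preformat>
        <p>Both functions assume a non-empty log; an empty input would raise a division error and should be handled by the caller.</p>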
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Data are available for 3477 students between the years 1997 and 2019. In total they
produce a log with 33396 events. The following sections present and analyze the models for the graduated and the
dropout groups of students.</p>
      <sec id="sec-4-1">
        <title>4.1. Graduated students</title>
        <p>This group consists of 684 students (approximately 20%), with a total of 16416 events
(approximately 49%), since they are the ones with the most registered activity. An initial inspection shows
that 45% of the traces start with the basic math course Geometría y Álgebra
Lineal 1 as the first course passed. Most of the students, around 67% of them, finish with
Proyecto de Grado, which is the final thesis. It is also worth mentioning that Redes de
Computadoras and Métodos Numéricos appear second and third as the last courses passed to
finish the degree.</p>
        <p>The first model is the dependency graph, which presents the dependency relations between
courses. As can be seen in Figure 1, in the graph obtained from the log the rectangles represent
the courses and the values on the arrows represent the confidence in the dependency
relations (the higher the value, the greater the confidence). With this model it is possible to identify
common trajectories for students who graduate.</p>
        <p>The following analysis is based on Petri nets [15], with the aim of identifying relevant
information within the process. The analysis focuses on the metrics Fitness, Deviations, Average
Throughput, Bottleneck Detection, Precision and Generalization. Figure 2 shows the Petri net
obtained with the plug-in “Mine Petri net with Inductive Miner” on the log.</p>
        <p>The conformance analysis of the model against the provided log shows that the model reproduces 97% of
the traces, and 375 of the 648 analyzed traces align perfectly with the model. Also, 9 of the
24 courses are aligned in both the model and the log, and the rest present some deviations, since
some transitions were detected only in the model and not in the log. In these cases the deviation
is less than 5% of the traces. The performance analysis shows that the average execution time of the traces
(throughput) over all students is 8 years, the minimum is 4.5 years and the maximum
is 20 years. The waiting time analysis shows there are 12 courses with a high or very high time
to pass, as can be seen in Table 3.</p>
        <p>The waiting time analysis for the courses gives relevant information about bottlenecks
in the model. It reveals that the courses Sistemas Operativos, Fundamentos de Bases de Datos, Proyecto
de Grado, Introducción a la Ingeniería de Software and Métodos Numéricos exhibit times to pass
four times above the expected one, considering that they should be passed in one semester.</p>
        <p>Finally, the conformance metrics of the model give a Precision of 0.30126 and a Generalization
of 0.8949, indicating that the model does not overfit the training data and that it is capable
of reproducing behaviour not present in the original log, being flexible enough to model new
traces.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Dropout students</title>
        <p>This analysis covers 2158 students (approximately 62% of the total), with a total of 8266
events (approximately 25%).</p>
        <p>An initial inspection of the log of the dropout group shows that for 69% of them
the last course is one of the first-year courses: Cálculo 1, Geometría y Álgebra Lineal
1, Física 1, Programación 1 or Matemática Discreta 1. This indicates that most of them do not
advance beyond the first year.</p>
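        <p>The kind of inspection described above can be reproduced with a few lines of Python. This is a sketch under assumptions: each trace is taken to be the date-ordered list of courses passed by one student, and the function name share_ending_in is hypothetical.</p>
        <preformat>
```python
FIRST_YEAR = {
    "Cálculo 1",
    "Geometría y Álgebra Lineal 1",
    "Física 1",
    "Programación 1",
    "Matemática Discreta 1",
}


def share_ending_in(traces, courses=FIRST_YEAR):
    """Fraction of non-empty traces whose last course is in `courses`."""
    nonempty = [t for t in traces if t]
    return sum(1 for t in nonempty if t[-1] in courses) / len(nonempty)
```
        </preformat>
        <p>Applied to the dropout log, a value of about 0.69 would correspond to the 69% reported in this section.</p>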
        <p>The dependency model that gives a first approximation of the characteristics of the model
underlying the log was made with “Mine Petri net with Inductive Miner” on the log. With the
default parameters it displayed only first-year courses, because most students drop out during the first
year.</p>
        <p>The minimum frequency for a course to be considered by the algorithm is then lowered in
order to make more courses visible in the dependency model in Figure 3. The model shows
that the students who advance the most manage to pass programming courses, specifically
Programación 1 to Programación 3.</p>
        <p>Figure 4 shows the Petri net generated with the inductive miner on the log for the dropout
group of students, to perform conformity and performance analysis.</p>
        <p>The resulting model represents 97% of the traces in the log. Cálculo 1,
Geometría y Álgebra Lineal 1, Programación 1 and Matemática Discreta 1 are the courses with the
highest frequency. Among these courses, only Geometría y Álgebra Lineal 1 and Matemática
Discreta 1 show deviations from the model, of less than 33% and 0.1% respectively.</p>
        <p>The performance analysis of the model shows that, on average, students drop out after 20.16
months, or approximately one and a half years. The courses with high times to pass are
Probabilidad y Estadística, Introducción a la Investigación de Operaciones, Programación 4, Métodos
Numéricos, Fundamentos de Bases de Datos and Taller de Programación; see Table 4. It is worth
mentioning that very few of the dropout students reach the more advanced courses in the list,
hence more information is needed to fully understand the results.</p>
        <p>The model shows a Precision of 0.41671 and a Generalization of 0.99775, both higher
than those of the graduated students' model. The model does not overfit the log and has high generalization
capabilities.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and future work</title>
      <p>ProM turned out to be versatile and flexible, allowing general analysis of the data, the
generation of different models and a variety of metrics over them. Nevertheless, for the results to be useful it is necessary
to have a deep understanding of the student information, to perform adequate
preprocessing and cleaning, and finally to adapt the data to the input format required by
ProM.</p>
      <p>The models give insight into the learning trajectories or behaviour of the students in
their Computer Engineering studies, enabling the identification of “bottleneck” courses, i.e.
courses that are hard to pass, as well as courses that do not hinder students from advancing, like Física 1
and Métodos Numéricos. Some of the “bottleneck” courses are Sistemas Operativos, Redes de
Computadoras and Arquitectura de Computadores.</p>
      <p>Considering the dropout group of students, it was possible to verify, both with statistical
methods and process mining models, that most of them drop out during the first year and that those
who advance the most choose courses related to programming.</p>
      <p>The models adequately describe the student information, with fitness over 97% on the log
traces; they do not overfit the log and are able to recognize traces other than those in the
training log, with high precision and generalization.</p>
      <p>For future work it is desirable to include updated information in the models, since due to
organizational restrictions it was only possible to work with data up to 2019. It is also possible to
model other aspects of the learning trajectories by including all of the courses, such as the
optional and the transferred ones that were excluded in this work.</p>
      <p>ProM has a series of plug-ins that have not been tested and could be studied for the purpose
of modeling student behaviour.</p>
      <p>It is of great interest to consider other information than that in the AIS Data Base to be
added to the models, such as gender, age, work information, income, primary and high school
information.</p>
      <p>Finally, we aim at exploring the utility of the models segmented by semesters or years to
obtain more details.</p>
      <p>[8] A. Bogarín, R. Cerezo, C. Romero, A survey on educational process mining, WIREs Data Mining and Knowledge Discovery 8 (2018) e1230. URL: https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/widm.1230. doi:https://doi.org/10.1002/widm.1230.</p>
      <p>[9] Trayectoria sugerida para la carrera en ingeniería en computación, plan 97, 2020. URL: https://www.fing.edu.uy/carreras/grado/computacion/implementacion/archivos/TrayectoriaSugerida.pdf (Accessed on 15/12/2020).</p>
      <p>[10] CSV, comma separated values file, 2020. URL: https://tools.ietf.org/html/rfc4180#section-2 (Accessed on 15/12/2020).</p>
      <p>[11] XES, extensible event stream, 2020. URL: http://xes-standard.org/ (Accessed on 15/12/2020).</p>
      <p>[12] F. Mannhardt, M. de Leoni, H. A. Reijers, Heuristic mining revamped: An interactive, data-aware, and conformance-aware miner, in: BPM (Demos), 2017.</p>
      <p>[13] A. J. M. M. Weijters, J. T. S. Ribeiro, Flexible heuristics miner (FHM), 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) (2011) 310–317.</p>
      <p>[14] A. Bogarín, R. Cerezo, C. Romero, Discovering learning processes using Inductive Miner: A case study with Learning Management Systems (LMSs) (2018).</p>
      <p>[15] T. Murata, Petri nets: Properties, analysis and applications, Proceedings of the IEEE 77 (1989) 541–580. doi:10.1109/5.24143.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Martínez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Montañés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Serralta</surname>
          </string-name>
          , Modelado de trayectorias académicas de estudiantes universitarios mediante técnicas de analítica de aprendizaje, Tesis de grado. Universidad de la República (Uruguay). https://hdl.handle.net/20.500.12008/28848
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[2] Prom tools home page</source>
          ,
          <year>2020</year>
          . URL: http://www.promtools.org/doku.php?id=start, (Accessed on 03/12/
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>[3] Del papel a la nube: Cómo guiar la transformación digital de los sistemas de información y gestión educativa (SIGED)</article-title>
          ,
          <year>2019</year>
          . URL: https://publications.iadb.org/es/del-papel-la-nube-como-guiar-la-transformacion-digital-de-los-sistemas, Banco Interamericano de Desarrollo (Accessed on 10/10/2019).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Jara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ochoa</surname>
          </string-name>
          ,
          <article-title>Usos y efectos de la inteligencia artificial en educación, Sector Social división educación</article-title>
          .
          <source>Documento para discusión número IDB-DP-00-776</source>
          . BID. doi: http://dx.doi.org/10.18235/0002380 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Siemens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <article-title>Penetrating the fog: Analytics in learning and education</article-title>
          ,
          <source>EDUCAUSE Review 46</source>
          (
          <year>2011</year>
          )
          <fpage>30</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pardo</surname>
          </string-name>
          , G. Siemens,
          <article-title>Ethical and privacy principles for learning analytics</article-title>
          ,
          <source>British Journal of Educational Technology</source>
          <volume>45</volume>
          (
          <year>2014</year>
          )
          <fpage>438</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cerezo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bogarín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sánchez-Santillán</surname>
          </string-name>
          ,
          <article-title>EDUCATIONAL PROCESS MINING: Applications in Edu</article-title>
          . Research,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          . doi:10.1002/9781118998205.ch1.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bogarín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cerezo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <article-title>A survey on educational process mining,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>