<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Predicting Academic Performance in a Subject Using Classifier Algorithms</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Edwar</forename><forename type="middle">Abril</forename><surname>Saire Peralta</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Universidad Nacional de San Agustín de Arequipa</orgName>
								<address>
									<addrLine>Santa Catalina 117</addrLine>
									<settlement>Arequipa Perú</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giovanni</forename><forename type="middle">Rolando</forename><surname>Cabrera Málaga</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Universidad Nacional de San Agustín de Arequipa</orgName>
								<address>
									<addrLine>Santa Catalina 117</addrLine>
									<settlement>Arequipa Perú</settlement>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Sonia</forename><forename type="middle">Benilda</forename><surname>Calloapaza Pari</surname></persName>
							<email>scalloapaza@unsa.edu.pe</email>
							<affiliation key="aff0">
								<orgName type="institution">Universidad Nacional de San Agustín de Arequipa</orgName>
								<address>
									<addrLine>Santa Catalina 117</addrLine>
									<settlement>Arequipa Perú</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Predicting Academic Performance in a Subject Using Classifier Algorithms</title>
					</analytic>
					<monogr>
						<title level="m">International Congress on Education and Technology in Sciences</title>
						<meeting>
							<address>
								<settlement>Zacatecas</settlement>
								<country key="MX">Mexico</country>
							</address>
						</meeting>
						<idno type="ISSN">1613-0073</idno>
						<imprint>
							<date type="published" when="2023">December 04-06, 2023</date>
						</imprint>
					</monogr>
					<idno type="MD5">9CEC32B63D6A49C9D6ED73B9AA2F84ED</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:31+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Academic performance</term>
					<term>Data Mining</term>
					<term>Classification Algorithms</term>
					<term>Supervised Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The objective of this research is to predict the academic performance of students entering a Systems Engineering program in the Discrete Structures I course, which is considered difficult to pass. The population comprises 827 students. The research was approached quantitatively, with a non-experimental design at a correlational level. The methodology implemented is CRISP-DM (Cross Industry Standard Process for Data Mining), applying the supervised learning technique with binary classification models based on the random forest, xgboost, and support vector machine algorithms. The results make it possible to predict whether a student will pass or fail the course. The classification models that showed the best results are based on the random forest and xgboost algorithms, with an accuracy of 82.5%.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Academic performance has been a concern in university education for many years. The greatest challenge has always been to provide quality education, which means improving academic performance <ref type="bibr" target="#b0">[1]</ref>. One consequence of low academic performance is failed courses, and one way to address the problem is to use data mining to analyze the most influential academic background data of students entering the university.</p><p>The transition from high school to college can be difficult for entering students. In the Systems Engineering program of the Universidad Nacional de San Agustín de Arequipa, Peru, statistics provided by the program show that, on average, 50% of entrants have failed the Discrete Structures I course in their first enrollment, so it is considered a difficult course to pass. It is therefore necessary to explore and analyze the academic performance tendencies with which students enter the university.</p><p>Data mining makes it possible to explore, analyze, and find patterns in data, yielding useful information for understanding student performance and the environment in which the student operates <ref type="bibr" target="#b1">[2]</ref>. Finding behavioral patterns through data mining supports decision making to solve problems in educational settings <ref type="bibr" target="#b2">[3]</ref>.</p><p>Both <ref type="bibr" target="#b3">[4]</ref> and <ref type="bibr" target="#b4">[5]</ref> point out that academic performance is influenced by a set of internal and external student factors, where the final performance result takes a quantitative value, reflected in course statuses labeled passed or failed. Furthermore, <ref type="bibr" target="#b5">[6]</ref> indicate that the numerical average obtained in a course is the most accurate signal of a student's academic achievement.</p><p>Most research on academic performance has used supervised classification algorithms: models can be built that learn from historically recorded experiences, and the more experiences available, the better the model's predictive learning <ref type="bibr" target="#b6">[7]</ref>. <ref type="bibr" target="#b8">[8]</ref> point out that supervised learning makes it possible to find trends based on behavioral patterns obtained from large amounts of data.</p><p>The present research aims to predict the status of the Discrete Structures I course (pass or fail) by applying the CRISP-DM methodology with supervised classification algorithms. The final result will make it possible to identify, before the semester begins, whether a student will pass or fail the course. Identifying in advance who will fail allows teachers and educational authorities to be alerted so they can take tutoring actions to improve academic performance. <ref type="bibr" target="#b9">[9]</ref> developed an investigation to predict students' academic performance based on academic, demographic, and sociodemographic data. The algorithms used were decision tree, K-Nearest Neighbors, support vector machines, and naive Bayes, implemented in the Python programming language. The approach was quantitative and involved 4738 students of Industrial Engineering and Electronic Engineering. The data were collected from 2008 to 2018, with 324 variables per student; given so many variables, those with the most influence on academic performance were selected. The algorithm that showed the best results was KNN, with an accuracy ranging from 78.5% to 80%. Reducing to the most influential variables is decisive work in the prediction; a strength of that work is the number of attributes and records available.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>[10] developed a model to predict dropout from a public college during the COVID-19 period. Their goal was to determine the most efficient machine learning algorithm for classifying students based on historical data from 2018 to 2021. They applied the algorithms to a population of 652 students with 106 variables, using a descriptive type of research. The K-Nearest Neighbors algorithm obtained the best results, with an accuracy of 91%, using data related to academic and socioeconomic aspects as inputs. The authors concluded that the model is useful for predicting early, in the first semesters, which university students might drop out. The dropout diagnosis provides early warnings so that the university can support these students with tutoring or other academic programs. Reducing the 106 variables to the most significant ones means the model works with the most influential variables, which is a decisive contribution in the context presented.</p><p>[11] investigated the main predictor variables that influence students' academic performance six semesters after entering university. They worked with 622 students and applied twelve classification algorithms, building an ensemble from the algorithms with the best metric values: logistic regression, naive Bayes, and support vector machines. Applying the ensemble with an optimal cut-off point yielded a specificity of 0.695 and a sensitivity of 0.947. The grade obtained in mathematics was a determining factor, while sociodemographic factors had no influence. Notably, many of the sociodemographic variables did not strongly influence the result, and the score obtained in the university entrance exam was not taken into account, which is striking, since in other research it is a decisive variable in academic performance.</p><p>[12] developed a model based on supervised machine learning to predict whether a student passes the leveling course. They used the Gradient Boosting and Logistic Regression algorithms, with predictor variables grouped into demographic, socioeconomic, family, institutional, and academic-performance categories. The population consisted of 7139 students. With the first algorithm, an accuracy of 96% was obtained in cross-validation and 89% when predicting new data. The logistic regression model indicates that the average grade of the first bimester, the average grade with which the student entered the university, and the student's geographical place of origin, among others, affect the probability of passing the course. Meanwhile, the variables associated with failing the course are the grade obtained when entering the university, the province of origin, and the lack of academic support or tutoring. It remains possible to test other machine learning algorithms, check their accuracy, and verify which attributes are most influential in determining whether a student passes or fails.</p><p>[13] developed models with the ability to predict student academic risk, using educational data mining for early detection. They used sociodemographic data and university entrance exam results from 415 computer science students enrolled between 2016 and 2019. The best classification model was based on the LMT algorithm, with an accuracy of 75.42% and an area under the ROC curve of 0.805. The most influential data, such as the college entrance exam score, were identified. The research began with 65 attributes; after determining the most significant ones, 9 variables remained, since this depends on the quality of the data and their predictive power with respect to the target variable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Application of the methodology</head><p>The research followed a quantitative approach and worked with 778 students, applying the CRISP-DM data mining methodology. Figure <ref type="figure" target="#fig_0">1</ref> shows the outline of the model applied in the research, based on the data mining methodology. The data used in the research are shown in Table <ref type="table" target="#tab_0">1</ref>. The proposed predictive model takes as input the admission data, the academic data, and calculated data such as age at high school graduation, time elapsed before entering college, and age at college entrance.</p></div>
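The three calculated attributes described above can be derived directly from the raw admission data. A minimal sketch with pandas; the column names and values are illustrative, not the university's actual schema:

```python
import pandas as pd

# Illustrative records: year of birth, year of school leaving, year of entrance
df = pd.DataFrame({
    "birth_year":        [2000, 1999, 2001],
    "school_leave_year": [2016, 2016, 2018],
    "entrance_year":     [2018, 2017, 2019],
})

# Age at high school graduation
df["school_leaving_age"] = df["school_leave_year"] - df["birth_year"]
# Time elapsed between leaving school and entering college
df["elapsed_time"] = df["entrance_year"] - df["school_leave_year"]
# Age at college entrance
df["entrance_age"] = df["entrance_year"] - df["birth_year"]

print(df[["school_leaving_age", "elapsed_time", "entrance_age"]])
```

With full dates of birth rather than years, the same differences would be computed from `datetime` columns instead.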
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1: Data input and output</head><p>Input data:</p><formula xml:id="formula_0">• Admission Data • Academic Data • Calculated data</formula><p>Output data: • Classification of students: Pass/Fail. Figure <ref type="figure" target="#fig_2">2</ref> shows the scheme proposed in the research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Understanding the business</head><p>Universities have three objectives: teaching, research, and social responsibility. Both licensing and accreditation contribute to achieving quality education. Entering students travel a difficult path from school to university, which must be gradual so that they can adapt. Supporting this adaptation should be a priority for the university, since it must know what students face in their first semesters.</p><p>The academic performance of incoming students is uncertain in advance. In this research, academic performance is evaluated by labeling each student as pass or fail. The problem addressed is the lack of knowledge about students' likely academic performance in the Discrete Structures I course of the Systems Engineering program of the Universidad Nacional de San Agustín de Arequipa. According to data provided by the School of Systems, from 2011 to 2020 an average of approximately 50% of students failed the course in their first enrollment. Table <ref type="table" target="#tab_0">1</ref> shows the percentages of passed and failed students from 2011 to 2020. In addition, 137 students dropped out because they were never able to pass Discrete Structures I, even after taking the course up to three times. The population for the present research is made up of the cohorts entering from 2011 to 2020, totaling 778 students.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Understanding the data</head><p>The data requested for the project come from two sources, the first source is related to the admission data of the students entering the systems career and the second source is related to the academic data of the students of the School of Systems who have enrolled. The data provided are shown in Table <ref type="table" target="#tab_2">2</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Data preparation</head><p>From all the attributes provided by the university, those to be used in the prediction models were selected. Data that make no contribution, such as the student's entrance code and last and first names, among others, were excluded. New attributes were also generated. Table <ref type="table" target="#tab_3">3</ref> shows the final data used to train the models. Figure <ref type="figure" target="#fig_3">3</ref> shows a view of the data used in the classification algorithms. The status column is to be predicted (pass or fail the course) and the other columns are the predictor attributes. For the modeling stage, the status column takes only two values: DESA (fail) and APRO (pass).</p></div>
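The preparation step above (dropping non-contributing columns, encoding the DESA/APRO status as a binary target, and one-hot encoding the categorical predictors) can be sketched as follows; the column names and values are illustrative stand-ins for the real dataset:

```python
import pandas as pd

# Toy sample with a few of the paper's predictors (values are illustrative)
data = pd.DataFrame({
    "entrance_code": ["A1", "A2", "A3"],          # identifier: excluded
    "sex":           ["M", "F", "M"],
    "type_school":   ["public", "private", "public"],
    "score":         [72.5, 55.0, 61.3],
    "status":        ["APRO", "DESA", "APRO"],    # APRO = pass, DESA = fail
})

# Drop columns with no predictive contribution (codes, names, ...)
data = data.drop(columns=["entrance_code"])

# One-hot encode the categorical predictors into dummy variables
X = pd.get_dummies(data[["sex", "type_school"]], drop_first=True)
X["score"] = data["score"]

# Binary target: 1 = APRO (pass), 0 = DESA (fail)
y = (data["status"] == "APRO").astype(int)

print(list(X.columns), y.tolist())
```

In the paper, the same encoding is applied to sex, place of birth, place of school, type of school, and mode of entry.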
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Modeling</head><p>The data flow used to build the predictive classifier model is shown in Figure <ref type="figure" target="#fig_4">4</ref>; the model predicts whether a student passes or fails the course, with nine inputs and one output. • Separate, from the full set of columns, the predictor variables from the target variable. X represents the predictor variables, while Y represents the target variable, as shown in Figure <ref type="figure">5</ref>.</p></div>
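The X/Y separation, the 80/20 hold-out split, and the scaling described in this section can be sketched with scikit-learn; the data here are synthetic (nine random predictors and a binary target), standing in for the student dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=20, size=(100, 9))  # nine predictors (synthetic)
y = rng.integers(0, 2, size=100)                 # binary target: pass/fail

# Hold out 20% of the records as unseen test data, as in Section 3.5
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale using statistics learned from the training data only, then apply the
# same transformation to the test data the model has never seen
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)
```

Fitting the scaler on the training split alone avoids leaking information from the held-out test records into the model.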
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 5:</head><p>Predictor and objective variables • In the initial tests, no balancing techniques were applied, because the data were found to be balanced at roughly 50% in each group (students passing and failing the course in their first enrollment). • Converting categorical variables to dummy variables. The categorical variables converted are sex, place of birth, place of school, type of school, and mode of entry, as shown in Figure <ref type="figure" target="#fig_5">6</ref>. • Scaling the data transforms the feature values so that they fall within a common range. Scaling was necessary because some machine learning algorithms perform poorly in the presence of outliers or biased values. Figure <ref type="figure" target="#fig_7">8</ref> shows the results of the scaling process. • The same conversion procedure was applied to the test data, represented by X_test: data the model has never seen, used to validate the model. The following summarizes the tests performed with the classification algorithms. After several experiments, the best result was achieved using the first 8 variables ranked by the mutual information technique. The models were first trained with the four most significant variables (college entrance score, college entrance age, time elapsed between leaving school and entering college, and age at leaving school); since this did not improve the prediction, experiments continued with three, four, five, six, seven, eight, nine, and more variables until the point of improvement was found. After several attempts, the first eight variables produced the best predictions, as reflected in the metric values.</p><p>We searched repeatedly for the most suitable hyperparameter values using Bayesian Optimization, which is very similar to GridSearch. With the eight predictors and the hyperparameters found, we retrained the random forest, xgboost, and support vector machine models. The source code of the tested algorithms is shown below. Figure <ref type="figure" target="#fig_8">9</ref> shows the random forest algorithm.</p></div>
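The feature-ranking and retraining loop described above can be sketched as follows. This is a minimal stand-in on synthetic data: the paper used Bayesian Optimization, while this sketch uses GridSearchCV (which the paper itself notes is similar in spirit), and the hyperparameter grid is illustrative, not the values the authors found:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the student dataset
X, y = make_classification(n_samples=300, n_features=12, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Rank features by mutual information with the target and keep the top 8,
# mirroring the paper's selection of its first eight variables
mi = mutual_info_classif(X_train, y_train, random_state=0)
top8 = np.argsort(mi)[::-1][:8]

# Hyperparameter search; the grid values are illustrative
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 200], "max_depth": [None, 10]},
    cv=5)
grid.fit(X_train[:, top8], y_train)

acc = grid.score(X_test[:, top8], y_test)
print(f"test accuracy: {acc:.3f}")
```

The same retraining loop would be repeated for the xgboost and support vector machine models with their own grids.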
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Validation of the approach</head><p>All the proposed supervised classification models were evaluated; the values of the models' performance metrics were considered and the best classifier model was selected. The evaluation verifies the models' efficiency on the test data, which represent 20% of the total, that is, data that were set aside, unknown to the obtained model, and not used in training. Two aspects were considered to validate the models: cross-validation (on the training data) and testing with the test set. Metrics represent values for measuring the efficiency of a classifier model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.1.">Internal validation</head><p>Internal validation refers to cross-validation, i.e., the training the models underwent. The accuracy metric takes values between 0 and 1; at the beginning the values ranged from 0.4 to 0.6 depending on the model, but after many tests and experiments we were able to reach 0.82. Table <ref type="table" target="#tab_4">4</ref> shows the results in the training stage. In this stage, random forest was the algorithm with the best results in predicting whether a student passes or fails the Discrete Structures I course in the first enrollment.</p></div>
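The internal validation step can be sketched with scikit-learn's k-fold cross-validation on synthetic stand-in data (the paper does not state its fold count; 5 folds are assumed here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: nine predictors, binary target
X, y = make_classification(n_samples=200, n_features=9, random_state=1)

# 5-fold cross-validation on the training data (internal validation)
scores = cross_val_score(RandomForestClassifier(random_state=1),
                         X, y, cv=5, scoring="accuracy")
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Each fold trains on four fifths of the data and scores on the remaining fifth, so the mean reflects performance on data each sub-model did not see.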
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.2.">External validation</head><p>External validation consisted of testing whether the model works on new data; this was done by feeding in the data that were initially set aside, corresponding to 20% of the records. The tests indicated that the models with the highest prediction accuracy were those implemented with random forest and with xgboost, each with an accuracy of 82.5%, as shown in Table <ref type="table" target="#tab_5">5</ref>. It is important to note that the metric values vary, because the training and test data are selected randomly.</p></div>
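The external validation step, computing the accuracy, recall, and F1 metrics reported in Tables 4 and 5 on the held-out 20%, can be sketched as follows on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: nine predictors, binary pass/fail target
X, y = make_classification(n_samples=250, n_features=9, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2)

# Train on 80% of the records, then score on the 20% the model never saw
model = RandomForestClassifier(random_state=2).fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "recall":   recall_score(y_test, y_pred),
    "f1":       f1_score(y_test, y_pred),
}
print(metrics)
```

As the paper notes, these values shift from run to run because the train/test split is random; fixing the random seed makes a single run reproducible.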
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and Discussion</head><p>Based on the research developed and the results obtained, the subfield of machine learning related to supervised learning has shown great advances when applied to education, not only in predicting students' academic performance but also in predicting student desertion, dropout, and learning patterns, among others, as shown in the literature consulted. Dimensionality reduction through the mutual information technique and the permutation importance of the random forest algorithm made it possible to improve the results, showing that the most decisive variables in this context, among the admission data, are college entrance score, age of graduation from high school, elapsed time, and age of university entrance. The research of <ref type="bibr" target="#b15">[15]</ref> addressed dimensionality and determined that the most influential variable was the age at which students began their studies, a result shared with this research. However, other investigations, such as that of <ref type="bibr" target="#b16">[16]</ref>, which collected historical data from a public institution and applied the decision tree algorithm, showed that the admission score was not significant in predicting academic performance, since other variables were available, such as credits approved relative to the theoretical credits that should have been approved. These differences arise because additional data were available, whereas in the present investigation the score was the most decisive variable in the prediction.</p><p>Another comparable study is that of <ref type="bibr" target="#b17">[17]</ref>, which also worked with historical data from a public institution and determined that the number of failed courses and the father's level of education were decisive. These comparisons matter because working with historical data that an institution recorded without research in mind is different from working with data planned for research purposes, which enriches the results.</p><p>The literature shows that predictions are sought by classifying students as pass/fail, dropout/non-dropout, or low/high performance, among others, using algorithms such as neural networks, random forest, decision trees, support vector machines, and logistic regression, seeking the best prediction accuracy, as seen in the research of <ref type="bibr" target="#b14">[14]</ref>, <ref type="bibr" target="#b18">[18]</ref>, <ref type="bibr" target="#b20">[19]</ref>, and <ref type="bibr" target="#b21">[20]</ref>, which obtained predictions with an average accuracy of 80%, even with more data than the research presented here. The reviewed literature in this line of research shows binary classification predictions of an object or event; the contribution of the present work lies in predicting whether a student passes or fails the Discrete Structures I course with few attributes that were not collected for research purposes, which made it necessary to filter and select the most decisive attributes to achieve better results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>The model that achieved the highest accuracy was implemented with the random forest algorithm, reaching 82.5% on the test data and 85% on the training data. Data quality is decisive for achieving higher prediction accuracy. The present research worked with data that were not collected for this purpose; nevertheless, positive results were obtained by trial and error, searching for the most influential attributes and the best hyperparameter values for the algorithms. Another important aspect is the small number of records, which is decisive for optimal training and could result in low model accuracy. It is also important to have balanced data for training, since this avoids developing biased models and yields reliable predictions. Finally, the most important conclusion is that the quality of the attributes related to the target to be predicted is decisive for the success of the classification models.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Stages of the CRISP-DM Methodology Note: Source: Schematic generated based on [14].</figDesc><graphic coords="3,225.51,416.40,81.89,81.94" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Modeling (Random forest, Xgboost and Logistic regression)Students database</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Research approach</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Data provided by the university</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Data flow of the predictive classifier model</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Dummies Variables</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Split of the training and test data</figDesc><graphic coords="7,34.53,136.15,792.91,445.98" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Scaler data</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 9 :</head><label>9</label><figDesc>Figure 9: Random Forest</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 10 :</head><label>10</label><figDesc>Figure 10: Xgboost</figDesc><graphic coords="9,80.02,-50.21,924.45,520.07" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 11 :</head><label>11</label><figDesc>Figure 11: Support Vector Machines</figDesc><graphic coords="9,75.48,84.28,982.94,552.92" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 Pass and fail percentages by year</head><label>1</label><figDesc>According to data provided by the school of systems in the discrete structures I course, there were 137 students who dropped out because they were never able to pass the discrete structures</figDesc><table><row><cell>Year</cell><cell>Passed</cell><cell>Failed</cell></row><row><cell>2011</cell><cell>47%</cell><cell>53%</cell></row><row><cell>2012</cell><cell>41%</cell><cell>59%</cell></row><row><cell>2013</cell><cell>36%</cell><cell>64%</cell></row><row><cell>2014</cell><cell>40%</cell><cell>60%</cell></row><row><cell>2015</cell><cell>36%</cell><cell>64%</cell></row><row><cell>2016</cell><cell>53%</cell><cell>47%</cell></row><row><cell>2017</cell><cell>53%</cell><cell>47%</cell></row><row><cell>2018</cell><cell>65%</cell><cell>35%</cell></row><row><cell>2019</cell><cell>62%</cell><cell>38%</cell></row><row><cell>2020</cell><cell>69%</cell><cell>31%</cell></row><row><cell></cell><cell>50.2%</cell><cell>49.8%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Task: Select the most accurate Predictive Model</head><label></label><figDesc></figDesc><table><row><cell cols="3">Task: Determine Predictors of Academic Performance</cell></row><row><cell>Admission data</cell><cell>Academic Data</cell><cell>Calculated data</cell></row><row><cell></cell><cell cols="2">Task: Determine the most influential Predictors</cell></row><row><cell></cell><cell>Most decisive factors in prediction</cell><cell></cell></row><row><cell></cell><cell>Task: Build Machine Learning models</cell><cell></cell></row><row><cell>Random Forest</cell><cell>XGBoost</cell><cell>Support Vector Machines</cell></row><row><cell></cell><cell>Student classifier (pass/fail)</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc></figDesc><table><row><cell cols="2">Admission and academic data</cell><cell></cell></row><row><cell cols="2">ADMISSION DATA</cell><cell></cell></row><row><cell>N°</cell><cell>Attribute</cell><cell>Description</cell></row><row><cell>1</cell><cell>Last Name and First Name</cell><cell>Last Name and First Name of student</cell></row><row><cell>2</cell><cell>Gender</cell><cell>Student's gender</cell></row><row><cell>3</cell><cell>Date of birth</cell><cell>Date of birth of student</cell></row><row><cell>4</cell><cell>Place of Birth (Department)</cell><cell>Department where the student was born</cell></row><row><cell>5</cell><cell>Place of birth (Province)</cell><cell>Province where the student was born</cell></row><row><cell>6</cell><cell>Place of birth (District)</cell><cell>District where the student was born</cell></row><row><cell>7</cell><cell>Place of birth (code)</cell><cell>Code of the place of birth</cell></row><row><cell>8</cell><cell>School</cell><cell>Origin of high school</cell></row><row><cell>9</cell><cell>School code</cell><cell>Code of school</cell></row><row><cell>10</cell><cell>Location of school (Department)</cell><cell>Department where school is located</cell></row><row><cell>11</cell><cell>Location of school (Province)</cell><cell>Province where the school is located</cell></row><row><cell>12</cell><cell>Location of school (District)</cell><cell>District where school is located</cell></row><row><cell>13</cell><cell>Type of school</cell><cell>Type of school</cell></row><row><cell>14</cell><cell>Year of school leaving</cell><cell>Year of graduation from school</cell></row><row><cell>15</cell><cell>Admission mode</cell><cell>University entrance mode</cell></row><row><cell>16</cell><cell>Score</cell><cell>University entrance score</cell></row><row><cell>17</cell><cell>Extraordinary admission</cell><cell>Extraordinary mode of 
admission</cell></row><row><cell cols="2">ACADEMIC DATA</cell><cell></cell></row><row><cell>N°</cell><cell>Attribute</cell><cell>Description</cell></row><row><cell>1</cell><cell>CUI</cell><cell>Student code</cell></row><row><cell>2</cell><cell>Entrance code</cell><cell>University entrance code</cell></row><row><cell>3</cell><cell>Last name and first name</cell><cell>Last name and first name of student</cell></row><row><cell>4</cell><cell>Course</cell><cell>Course in which the student is enrolled</cell></row><row><cell>5</cell><cell>Grade</cell><cell>Grade achieved in the course</cell></row><row><cell>6</cell><cell>Condition</cell><cell>Student's condition</cell></row><row><cell>7</cell><cell>#Enrollment</cell><cell>Number of enrollment</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 Final Data</head><label>3</label><figDesc></figDesc><table><row><cell>N°</cell><cell>Data</cell><cell>Description</cell></row><row><cell>1</cell><cell>Gender</cell><cell>Gender of student</cell></row><row><cell>2</cell><cell>Placebirth</cell><cell>Place of birth</cell></row><row><cell>3</cell><cell>PlaceSchool</cell><cell>Place of school</cell></row><row><cell>4</cell><cell>Typeschool</cell><cell>Type of school</cell></row><row><cell>5</cell><cell>School leaving age</cell><cell>Age of leaving school</cell></row><row><cell>6</cell><cell>ElapsedTime</cell><cell>Elapsed time from school to university</cell></row><row><cell>7</cell><cell>AgeEntrance</cell><cell>Age of university entrance</cell></row><row><cell>8</cell><cell>Modality</cell><cell>University entrance modality</cell></row><row><cell>9</cell><cell>score</cell><cell>University entrance test score</cell></row><row><cell>10</cell><cell>Condition</cell><cell>Pass/Fail Condition</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4 Metric values</head><label>4</label><figDesc></figDesc><table><row><cell>Algorithm/Model</cell><cell>Accuracy</cell><cell>Recall</cell><cell>F1</cell></row><row><cell>Random Forest</cell><cell>0.8511</cell><cell>0.7877</cell><cell>0.8260</cell></row><row><cell>XGBoost</cell><cell>0.8477</cell><cell>0.7611</cell><cell>0.8112</cell></row><row><cell>Support Vector Machines</cell><cell>0.7030</cell><cell>0.6047</cell><cell>0.6533</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 5 Metric values</head><label>5</label><figDesc></figDesc><table><row><cell>Algorithm/Model</cell><cell>Accuracy</cell><cell>Recall</cell><cell>F1</cell></row><row><cell>Random Forest</cell><cell>0.8258</cell><cell>0.7779</cell><cell>0.8163</cell></row><row><cell>XGBoost</cell><cell>0.8258</cell><cell>0.7402</cell><cell>0.8085</cell></row><row><cell>Support Vector Machines</cell><cell>0.6838</cell><cell>0.5844</cell><cell>0.6474</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">An Empirical Study of the Applications of Data Mining Techniques in Higher Education</title>
		<author>
			<persName><forename type="first">V</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chadha</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Advanced Computer Science and Applications</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="80" to="84" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A Study on Feature Selection Techniques in Educational Data Mining</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ramaswami</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Working Group On Educational Data Mining</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Heiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Baker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yacef</surname></persName>
		</author>
		<title level="m">Proceedings of Educational Data Mining workshop. 8th International Conference on Intelligent Tutoring Systems</title>
				<meeting>Educational Data Mining workshop. 8th International Conference on Intelligent Tutoring Systems</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Análisis exploratorio de las variables que condicionan el rendimiento académico</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pérez-Luño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jerónimo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sánchez Vázquez</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2000">2000</date>
			<publisher>Universidad Pablo de Olavide</publisher>
			<pubPlace>Sevilla, España</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Factors associated with academic performance in medical students</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vélez Van Meerbeke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">N</forename><surname>Roa González</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PSIC. Educación Médica</title>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Rendimiento académico en la transición secundaria-universidad</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Fita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Torrado</surname></persName>
		</author>
		<ptr target="http://hdl.handle.net/11162/67356" />
	</analytic>
	<monogr>
		<title level="j">Revista de Educación</title>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Machine learning: hands-on for developers and technical professionals</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bell</surname></persName>
		</author>
		<imprint>
			<publisher>John Wiley &amp; Sons</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>


<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kelleher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mac Namee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>D'Arcy</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<publisher>MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Predicción del rendimiento académico universitario mediante mecanismos de aprendizaje automático y métodos supervisados</title>
		<author>
			<persName><forename type="first">L</forename><surname>Contreras Bravo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nieves-Pimiento</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gonzalez-Guerrero</surname></persName>
		</author>
		<idno type="DOI">10.14483/23448393.19514</idno>
		<ptr target="https://doi.org/10.14483/23448393.19514" />
	</analytic>
	<monogr>
		<title level="j">Ingeniería</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1" to="25" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Deserción universitaria: Evaluación de diferentes algoritmos de Machine Learning para su predicción</title>
		<author>
			<persName><forename type="first">J</forename><surname>Valero Cajahuanca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Navarro Raymundo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Franco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Julca Flores</surname></persName>
		</author>
		<idno type="DOI">10.31876/rcs.v28i3.38480</idno>
		<ptr target="https://doi.org/10.31876/rcs.v28i3.38480" />
	</analytic>
	<monogr>
		<title level="j">Revista de Ciencias Sociales</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="362" to="375" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Predicción de la situación académica en alumnos de pregrado usando algoritmos de Machine Learning</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gamboa Unsihuay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Salinas Flores</surname></persName>
		</author>
		<idno type="DOI">10.47187/perf.v1i27.142</idno>
		<ptr target="https://doi.org/10.47187/perf.v1i27.142" />
	</analytic>
	<monogr>
		<title level="j">Perfiles</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">27</biblScope>
			<biblScope unit="page" from="4" to="10" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Modelo de predicción del rendimiento académico para el curso de nivelación de la Escuela Politécnica Nacional a partir de un modelo de aprendizaje supervisado</title>
		<author>
			<persName><forename type="first">K</forename><surname>Calva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Flores</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Porras</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.5770905</idno>
		<ptr target="https://doi.org/10.5281/zenodo.5770905" />
	</analytic>
	<monogr>
		<title level="j">Latin American Journal of Computing</title>
		<imprint>
			<biblScope unit="volume">VIII</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Modelos predictivos de riesgo académico en carreras de computación con minería de datos educativos</title>
		<author>
			<persName><forename type="first">E</forename><surname>Ayala Franco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">E</forename><surname>López Martínez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">H</forename><surname>Menéndez Domínguez</surname></persName>
		</author>
		<idno type="DOI">10.6018/red.463561</idno>
		<ptr target="https://doi.org/10.6018/red.463561" />
	</analytic>
	<monogr>
		<title level="j">Revista de Educación a Distancia (RED)</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">66</biblScope>
			<biblScope unit="page" from="1" to="36" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">CRISP-DM 1.0: Step-by-step data mining guide</title>
		<author>
			<persName><forename type="first">P</forename><surname>Chapman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kerber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Khabaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Reinartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Shearer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wirth</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Desarrollo de un modelo para predecir el rendimiento académico de estudiantes de la EPN en base a su nivel de acceso a TICS y factores socioeconómicos</title>
		<author>
			<persName><forename type="first">A</forename><surname>Reinoso Quijo</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
			<pubPlace>Quito</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Escuela Politecnica Nacional</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">tesis maestria</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Analysis of the academic performance of systems engineering students, desertion possibilities and proposals for retention</title>
		<author>
			<persName><forename type="first">N</forename><surname>Bedregal-Alpaca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tupacyupanqui-Jaén</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Cornejo-Aparicio</surname></persName>
		</author>
		<idno type="DOI">10.4067/S0718-33052020000400668</idno>
		<ptr target="https://doi.org/10.4067/S0718-33052020000400668" />
	</analytic>
	<monogr>
		<title level="j">Ingeniare</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="668" to="683" />
			<date type="published" when="2020">2020b</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Rendimiento académico empleando minería de datos</title>
		<author>
			<persName><forename type="first">L</forename><surname>Quiñones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">L</forename><surname>Carrasco</surname></persName>
		</author>
		<idno type="DOI">10.48082/espacios-a20v41n44p17</idno>
		<ptr target="https://doi.org/10.48082/espacios-a20v41n44p17" />
	</analytic>
	<monogr>
		<title level="j">Espacios</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">44</biblScope>
			<biblScope unit="page" from="277" to="285" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Modelo matemático para predecir el grado de deserción de los estudiantes en el Instituto Superior Tecnológico Bolívar</title>
		<author>
			<persName><forename type="first">P</forename><surname>Mejía Zamora</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
			<pubPlace>Ecuador</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Universidad Técnica de Ambato</orgName>
		</respStmt>
		<ptr target="https://repositorio.uta.edu.ec/bitstream/123456789/37204/1/t2153mma.pdf" />
	</monogr>
	<note>tesis de maestria</note>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Modelo Para La Predicción De La Deserción De Estudiantes De Pregrado Basado En Técnicas De Minería De Datos</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Camargo</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
		<respStmt>
			<orgName>Universidad de la Costa - Pregrado</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Modelo predictivo basado en machine learning como soporte para el seguimiento académico del estudiante universitario</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">E</forename><surname>Gismondi</surname></persName>
		</author>
		<ptr target="https://hdl.handle.net/20.500.14278/3804" />
		<imprint>
			<date type="published" when="2021">2021a</date>
			<pubPlace>Perú</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Universidad Nacional del Santa</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">tesis doctor</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
