<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predictive modeling of academic outcomes based on socioeconomic variables</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rostyslav Zatserkovnyi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roksoliana Zatserkovna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>12 Stepan Bandera Str., Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv University of Trade and Economics</institution>
          ,
          <addr-line>10 Tuhan-Baranovskyi Str., Lviv, 79008</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Students' academic performance, both in schools and universities, is influenced by a wide variety of socioeconomic, demographic, and behavioral factors. Identifying which of these factors most strongly correlate with poor academic outcomes can help educational institutions to more effectively allocate resources, as well as support students at risk of dropping out. In this study, we apply predictive machine learning models to a public dataset from Portuguese secondary schools in order to forecast student success. This forecasting is based on features such as parental education, employment status, access to educational support, and family relationships. Our results show that predictive modeling can effectively predict potential low academic performance, as well as highlight the socioeconomic indicators most critical in shaping a student's final grade. Information like this can be used as a basis for early intervention to help troubled students in the education system.</p>
      </abstract>
      <kwd-group>
        <kwd>machine learning</kwd>
        <kwd>predictive modeling</kwd>
        <kwd>classification</kwd>
        <kwd>student performance</kwd>
        <kwd>socioeconomic indicators</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Beyond purely academic ability, a student’s performance is shaped by a range of non-academic
factors, including:</p>
      <p>• household income;
• the education level of parents and guardians;
• the presence of a stable and supportive home environment;
• quality of instruction in schools;
• availability of academic assistance;
• relationships with peers and classmates.</p>
      <p>These non-academic factors can heavily influence a student’s ability to learn, especially in
countries with significant social inequality.</p>
      <p>Machine learning, with its ability to model complex, nonlinear relationships, is well-suited to
analyzing such factors. Aside from achieving high predictive accuracy, certain
algorithms are "explainable", i.e., come with the ability to reveal the importance and contribution
of individual features towards a decision. This makes it possible to determine how specific
socioeconomic variables contribute to student success or failure.</p>
      <p>In this study, we focus on a publicly available dataset of Portuguese secondary school students.
This dataset contains detailed information on academic performance (grades in two core subjects),
and a variety of socioeconomic or lifestyle variables. We apply machine learning techniques to
identify which indicators are most predictive of low academic performance. The core machine
learning problem is framed as a classification task – determining if a student is a high or low
performer based on known indicators. We therefore aim to create models that best predict
this value, potentially helping educators target students who match these indicators for
personalized intervention.</p>
    </sec>
    <sec id="sec-2">
      <title>Literature review</title>
      <p>
        Machine learning techniques have been used in a wide variety of domains and fields, and
education is no exception. Several studies have examined the use of predictive models to identify
students who are at risk of dropping out of school, with a particular emphasis on socioeconomic
and behavioral variables [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. These approaches often frame the problem as a classification task,
using algorithms such as Random Forests, Gradient Boosting, and Support Vector Machines to
predict whether a student will pass or fail.
      </p>
      <p>
        One well-cited dataset in this area is the UCI Student Performance Dataset, originally introduced
by Cortez and Silva [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It has been used in numerous studies to explore the influence of parental
education, school support, and daily routines on student outcomes. Kabakchieva [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], for example,
used decision tree classifiers to predict final grades and found that family relationship quality,
absences, and study time were among the most important predictors. Similarly, Kotsiantis et al. [4]
used ensemble methods and logistic regression to analyze dropout risk. They reported that early
performance and socioeconomic background were significant contributors to this risk.
      </p>
      <p>Recent work has also focused on the interpretability of models in educational contexts. For
example, Lundberg and Lee [5] introduced SHAP (SHapley Additive exPlanations), which has
become a foundational tool for interpreting complex models. These approaches allow us to identify
key socioeconomic indicators in a transparent way, which is critical for models used by educators
or social workers. Similarly, Romero and Ventura [6] emphasized that in education, predictive
accuracy is not enough: it is essential for educators to understand why a model assigns a high risk
value to a particular student.</p>
      <p>Another relevant factor in existing research is fairness and bias in such prediction systems. A
study by Chen et al. [7] has shown that machine learning models trained on biased data can
reinforce existing social inequalities. This is especially prominent when socioeconomic status or
parental background is strongly correlated with educational achievement.</p>
      <p>Our study contributes to this research area by using a dataset collected at the secondary school
level, while applying a modern classification pipeline as well as an explainable classifier. The main
goal of this project is to develop modern models that can help school administrators, teachers, and
policy designers intervene early to help a student in need.</p>
    </sec>
    <sec id="sec-3">
      <title>Dataset Overview</title>
      <p>
        The primary data source for this study is the UCI Student Performance Dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which contains
anonymized data on students from two Portuguese secondary schools.
      </p>
      <p>The dataset includes detailed information on a student’s academic performance in two subjects:
mathematics and Portuguese language. For the purposes of this research, we use the combined
dataset of both subjects, which contains a total of 1044 rows across 33 features.</p>
      <p>Each entry corresponds to an individual student, with some crossover between mathematics and
Portuguese language datasets. Its variables and features can be grouped into the following
categories:
• Academic performance: First-period (G1), second-period (G2), and final (G3) grades, on a
scale from 0 to 20.
• Family and parental background: Includes parental education (Medu, Fedu), parental job
types (Mjob, Fjob), family relationship quality (famrel), and whether the student lives in a
two-parent home (Pstatus).
• Support and study habits: Includes whether the student receives educational support
(schoolsup), family support (famsup), time spent studying per week (studytime), past class
failures (failures), and access to extracurricular activities.
• Demographics and daily life: Includes student age, gender, travel time to school (traveltime),
internet access (internet), free time after school (freetime), going out frequency (goout), and
weekday/weekend alcohol consumption (Dalc, Walc).</p>
      <p>In this study, the target variable for classification is the final grade (G3), which we binarize into
two classes: pass (G3 ≥ 10) and fail (G3 &lt; 10). This is consistent with the passing standard in the
Portuguese school system. A transformation like this allows us to formulate the task as a binary
classification problem, which is aimed at predicting academic underperformance based on
socioeconomic features.</p>
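      <p>A minimal sketch of this binarization, using a hypothetical handful of final grades:</p>
```python
import pandas as pd

# Hypothetical mini-sample of the dataset's final-grade column (G3, 0-20 scale).
df = pd.DataFrame({"G3": [4, 9, 10, 13, 18]})

# Binarize the target: pass if G3 is at least 10, fail otherwise.
df["outcome"] = (df["G3"] >= 10).astype(int)  # 1 = pass, 0 = fail

print(df["outcome"].tolist())  # [0, 0, 1, 1, 1]
```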
      <p>Initial data exploration reveals mild class imbalance: approximately 65% of students pass,
while 35% fail. To address this, we preserve the class distribution in all train/test splits
using stratified sampling. Several features also require preprocessing. Categorical features, such as
parental job types and school names, are encoded numerically using one-hot encoding or ordinal
mappings. Continuous features, such as study time and absences, are scaled using standard
normalization techniques.</p>
      <p>The dataset does not contain missing values, but there is a potential correlation issue between
academic period grades (G1, G2 and G3). G1 and G2 are intermediate grades, which makes them
good predictors of the final grade G3; but this obscures the socioeconomic factors we actually aim
to analyze. Therefore, the intermediate grades are excluded from the input feature set: this decision
ensures that the model primarily focuses on background and lifestyle indicators found in the
dataset.</p>
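      <p>The preprocessing described above can be sketched as follows. The rows are hypothetical, the column names follow the UCI schema, and scikit-learn stands in for whatever tooling was actually used:</p>
```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows; column names follow the UCI Student Performance schema.
df = pd.DataFrame({
    "Mjob": ["teacher", "services", "at_home", "health"],
    "studytime": [2, 1, 3, 4],
    "absences": [4, 10, 2, 0],
    "G1": [11, 8, 13, 15],
    "G2": [12, 7, 14, 15],
    "G3": [12, 7, 14, 16],
})

# Exclude the intermediate grades G1/G2 (and the target G3) from the inputs,
# so the model sees only background and lifestyle indicators.
X = df.drop(columns=["G1", "G2", "G3"])
y = (df["G3"] >= 10).astype(int)

# One-hot encode categorical features, standardize continuous ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Mjob"]),
    ("num", StandardScaler(), ["studytime", "absences"]),
])
Xt = preprocess.fit_transform(X)
print(Xt.shape)  # (4, 6): four one-hot columns plus two scaled columns
```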
      <p>To better understand the data structure and feature relevance, we conduct exploratory visual
analysis. Figure 1 illustrates the distribution of final grades. A significant concentration is present
around the 10-14 range.</p>
      <p>Figures 2 and 3 show correlations between parental education and student outcomes, as well as
the impact of family relationship quality on passing rates. These initial plots confirm the hypothesis
that socioeconomic and household factors can significantly contribute to academic success.</p>
    </sec>
    <sec id="sec-4">
      <title>Implementation &amp; Results</title>
      <p>To model student academic outcomes, we frame the task as a binary classification problem:
predicting whether a student will pass or fail based on non-academic variables. As noted in the
previous section, we exclude early grade variables (G1 and G2). This ensures that the model
only focuses on socioeconomic, behavioral, and demographic inputs, instead of simply predicting
future academic performance based on past performance.</p>
      <p>We train and evaluate four supervised machine learning models, selected for their performance
and compatibility with structured tabular data:
1. XGBoost: A scalable gradient boosting algorithm optimized for accuracy and training speed
[8];
2. LightGBM: A leaf-wise boosting model that performs particularly well with categorical and
imbalanced data [9];
3. CatBoost: An algorithm with built-in handling of categorical features, which reduces
preprocessing complexity and makes it well-suited for our heavily categorical dataset;
4. Explainable Boosting Machine (EBM): An inherently interpretable model that balances
accuracy with the ability to be transparent about its output [10].
The gradient-boosted models above predict a label as a sum over an ensemble of K regression trees:
ŷ_i = ∑_{k=1..K} f_k(x_i), f_k ∈ F,
where F is the space of regression trees, f_k represents an individual decision tree, and K is the
total number of trees. The objective function to be minimized includes both the empirical
loss and a regularization term:</p>
      <p>L = ∑_{i=1..n} l(y_i, ŷ_i) + ∑_{k=1..K} Ω(f_k)</p>
      <p>where l is a differentiable loss function (e.g., logistic or squared loss), and Ω(f) is a
complexity penalty term, defined as:
Ω(f) = γT + ½ λ ∑_{j=1..T} w_j²
Here, T is the number of leaves in the tree, w_j are the leaf weights, and γ, λ are the regularization
parameters controlling model complexity. During training, each new tree f_t is fit to the negative
gradients (residuals) of the loss with respect to the current predictions:
g_i = ∂l(y_i, ŷ_i)/∂ŷ_i,  h_i = ∂²l(y_i, ŷ_i)/∂ŷ_i²
To find the best split for a node, the model evaluates the gain in the objective:
Gain = ½ [ G_L²/(H_L + λ) + G_R²/(H_R + λ) − (G_L + G_R)²/(H_L + H_R + λ) ] − γ
where G_L, G_R and H_L, H_R are the sums of first and second derivatives in the left and right child
nodes respectively.</p>
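      <p>As a numeric illustration of this split-gain rule (a sketch, not library internals; lam and gamma stand for the regularization parameters λ and γ):</p>
```python
# Sketch of the split-gain computation for a boosted-tree node, where G and H
# are sums of first and second loss derivatives over the candidate children,
# lam (λ) is the L2 penalty on leaf weights, and gamma (γ) penalizes new leaves.

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    left = G_L ** 2 / (H_L + lam)
    right = G_R ** 2 / (H_R + lam)
    parent = (G_L + G_R) ** 2 / (H_L + H_R + lam)
    return 0.5 * (left + right - parent) - gamma

# A split that separates negative from positive gradients earns a large gain:
print(split_gain(G_L=-4.0, H_L=3.0, G_R=4.0, H_R=3.0))  # 4.0
```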
      <p>Furthermore, XGBoost and LightGBM implement optimizations such as histogram-based
splitting and leaf-wise growth. CatBoost incorporates ordered boosting and native categorical
feature handling to reduce overfitting and improve generalization, and EBM focuses on
interpretability of its results. This is done by making each term describe either a single feature, or a
combination of or interaction between features.</p>
      <p>The dataset is split into 80% training and 20% test data using stratified sampling to preserve the
pass/fail class distribution. For each model, hyperparameters are optimized using 5-fold
cross-validation. Evaluation metrics include:
• Accuracy: Overall proportion of correctly classified students;
• Precision: Proportion of predicted failures that were correct;
• Recall: Proportion of actual failures correctly identified;
• F1 Score: Harmonic mean of precision and recall;
• ROC-AUC: Area under the Receiver Operating Characteristic curve – this captures overall
classification effectiveness.</p>
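      <p>The evaluation protocol can be sketched as below. Synthetic data and scikit-learn’s GradientBoostingClassifier stand in for the real dataset and the four models, so the numbers produced are illustrative only, not the results reported here:</p>
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the student data, with the paper's ~65/35 class balance.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.35, 0.65], random_state=0)

# 80/20 stratified split preserves the pass/fail ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0)

# 5-fold cross-validation on the training portion, as used for tuning.
cv_acc = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="accuracy").mean()

clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
proba = clf.predict_proba(X_te)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
    "roc_auc": roc_auc_score(y_te, proba),
}
print(round(cv_acc, 3), {k: round(v, 3) for k, v in metrics.items()})
```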
      <p>The performance of all four models on the held-out test set is summarized in the table below:</p>
      <sec id="sec-4-1">
        <title>Results</title>
        <table-wrap id="tab1">
          <table>
            <thead>
              <tr><th>Model</th><th>Accuracy</th><th>Precision</th><th>Recall</th><th>F1 Score</th><th>ROC-AUC</th></tr>
            </thead>
            <tbody>
              <tr><td>EBM</td><td>0.7895</td><td>0.8318</td><td>0.9387</td><td>0.8718</td><td>0.6850</td></tr>
              <tr><td>CatBoost</td><td>0.7847</td><td>0.8182</td><td>0.9387</td><td>0.8743</td><td>0.6761</td></tr>
              <tr><td>LightGBM</td><td>0.7847</td><td>0.8242</td><td>0.9202</td><td>0.8696</td><td>0.6692</td></tr>
              <tr><td>XGBoost</td><td>0.7751</td><td>0.8187</td><td>0.9141</td><td>0.8638</td><td>0.6634</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Intuitively, support-related features such as access to educational help (schoolsup) and family
support (famsup) also rank highly, suggesting actionable insights for educators.</p>
        <p>Aside from summary rankings, EBM also allows us to visualize the contribution of individual
variables in isolation. That is, we can plot the effect of a single variable - such as failures - on the
predicted outcome across its entire value range. This provides interpretable insights into how
specific socioeconomic indicators affect classification decisions globally. Figure 7 suggests that as
the number of past class failures increases, the student’s grade is far more likely to decrease.</p>
        <p>These findings suggest that both strong predictive performance and transparency are achievable
using modern machine learning approaches, as the EBM model offers fully interpretable
diagnostics for each of its choices.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>This study explored the use of machine learning models to predict student academic outcomes
based solely on socioeconomic and behavioral factors. Using a dataset of Portuguese secondary
school students, we developed and evaluated several classification models that identified students
at risk of academic failure without relying on prior academic grades.</p>
      <p>Our results show that modern ensemble models - including XGBoost, LightGBM, and CatBoost -
are generally effective at predicting academic risk using variables such as parental education, study
habits, family support, and daily life indicators. The Explainable Boosting Machine (EBM) model
achieved the best performance overall: it performed competitively while offering significant
advantages in model interpretability. Its ability to produce global and local explanations makes it
suitable for educational contexts where transparency and explainability are critical.</p>
      <p>Among the most predictive features, we observed that students with large numbers of past class
failures, low parental education levels, little time to study, and frequent social outings were
substantially more likely to fail. Importantly, our approach avoids the use of prior grades (G1, G2)
to ensure that interventions can be proposed even at the beginning of the academic year, when
prior performance data from the current semesters may not yet be available.</p>
      <p>That said, our study has certain limitations. The dataset is limited in size (1044 rows) and scope,
restricted to two schools in a single country. The binary classification of student success into
pass/fail categories also oversimplifies the nuance of the grading system; and cultural factors
specific to Portugal may not generalize well to other education systems. Despite these limitations,
our findings demonstrate the feasibility of predictive modelling to assess educational risk. Future
extensions to our work may include:
• Incorporating multiple datasets from different countries, where different data columns with
the same substance are matched together;
• Researching predictive modelling with a focus on Ukrainian schools and universities;
• Developing user interfaces for the model, such as a dashboard that can be used by school
administrators and teachers.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <sec id="sec-7-1">
        <title>References</title>
        <p>[4] S. Kotsiantis, C. Pierrakeas, P. Pintelas, Predicting students’ performance in distance learning using machine learning techniques, Applied Artificial Intelligence 18 (2004) 411–426. doi:10.1080/08839510490442058.</p>
        <p>[5] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 4765–4774.</p>
        <p>[6] C. Romero, S. Ventura, Data mining in education, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 3 (2013) 12–27. doi:10.1002/widm.1075.</p>
        <p>[7] I. Chen, F. Johansson, D. Sontag, Why is my classifier discriminatory?, in: Advances in Neural Information Processing Systems (NeurIPS), volume 31, 2018, pp. 3539–3550.</p>
        <p>[8] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794. doi:10.1145/2939672.2939785.</p>
        <p>[9] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, in: Advances in Neural Information Processing Systems, volume 30, 2017, pp. 3149–3157.</p>
        <p>[10] H. Nori, S. Jenkins, P. Koch, R. Caruana, InterpretML: A unified framework for machine learning interpretability, arXiv preprint arXiv:1909.09223 (2019).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kabakchieva</surname>
          </string-name>
          ,
          <article-title>Predicting student performance by using data mining methods for classification</article-title>
          ,
          <source>Cybernetics and Information Technologies</source>
          <volume>13</volume>
          (
          <year>2013</year>
          )
          <fpage>61</fpage>
          -
          <lpage>72</lpage>
          . doi:10.2478/cait-2013-0006.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cortez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <article-title>Using data mining to predict secondary school student performance</article-title>
          ,
          <source>in: Proceedings of 5th Future Business Technology Conference (FUBUTEC)</source>
          , Porto, Portugal,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cortez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <article-title>Student performance data set</article-title>
          , https://archive.ics.uci.edu/ml/datasets/Student+Performance,
          <year>2008</year>
          . UCI Machine Learning Repository.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>