
L. James); roberta.calegari@unibo.it (R. Calegari)

AI-fairness and equality of opportunity: a case study on educational achievement

Ángel S. Marrero

Gustavo A. Marrero

Carlos Bethencourt

Liam James

Roberta Calegari

Department of Computer Science and Engineering, Alma Mater Studiorum, Università di Bologna, Italy

Department of Economics and Research Center of Social Inequality and Governance, University of La Laguna, Spain

2024


This study focuses on predicting students' academic performance, examining how AI predictive models often reflect socioeconomic inequalities driven by factors such as parental socioeconomic status and home environment, which affect the fairness of predictions. We compare three AI models in an ablation study to understand how these sensitive features (referred to as circumstances) influence predictions. Our findings reveal biases in predictions that favor advantaged groups, depending on whether the goal is to identify excellence or underperformance. Additionally, a two-stage estimation procedure is proposed in the third model to mitigate the impact of sensitive features on predictions, thereby offering a model that can be considered fair with respect to inequality of opportunity.

Keywords: AI-fairness, socioeconomic, equality of opportunity, AI-ethics

1. Introduction

ablation study to understand how circumstances can influence predictions in predictive models and how this type of analysis leads to strategies for detecting unfairness in predictions. A two-stage estimation procedure is proposed. In the first stage, the predictors are cleaned of the influence of circumstances. In the second stage, the target variable is predicted (or adjusted, depending on the objective) using a model that includes only predictors that do not depend on the sensitive variables. We also observe that a model incorporating sensitive variables that explain the existing inequality of opportunity generates biases in the predictions, favoring one class or another depending on the part of the distribution being predicted. Thus, if the aim is to predict the upper tail of the distribution (to detect excellence), the worst predictions occur for the classes most disadvantaged by the most relevant circumstances (i.e., students whose parents have lower levels of education or who live in homes with a poorer cultural environment). The opposite occurs when the objective is to predict the lower tail of the distribution (i.e., students with worse educational performance).

Accordingly, the paper is organized as follows. Section 2 describes the dataset and the method. Section 3 reports the results and discusses them. Finally, Section 4 concludes this preliminary work.

2. Data and Method

2.1. Database

In this paper, we use the data on primary and secondary education provided by the Canary Agency for University Quality and Educational Evaluation (ACCUEE) from academic years 2015/16 to 2018/19. ACCUEE is a public organization founded with the purpose of continuously improving education at both the university level and other educational stages within the Canary Islands. Among its activities, ACCUEE is responsible for the evaluation and accreditation of programs implemented in various educational centers. It collects data on students’ academic performance and relevant variables that determine their environment, considering census characteristics and demographic context, offering a more reliable approach than other national or even international statistics.

One of the main advantages of this database is that the surveys conducted by ACCUEE on the student population are longitudinal in nature, allowing for the evaluation of the progress and development of the individuals under study. This longitudinal data collection helps to control for factors intrinsic to temporal circumstances while enhancing data quality and enabling estimates and comparisons of the individual over time.

The database provided by ACCUEE comprises 83,857 observations. Each row refers to a single student at a given grade and academic year. Primary education data for the academic years 2015/16 and 2018/19 are gathered through a comprehensive census of the entire population. For other grades and academic years, the data are collected through sampling. Longitudinal data are also included: students in 3rd grade (primary school) during the 2015/16 academic year are sampled again in their 6th grade, and this is the information that we use in the application of this paper. The database contains information on 561 variables (columns) for each student, representing data collected from various contextual questionnaires and performance tests across different subjects. Specifically, these columns are categorized into seven thematic blocks:
• Block 1: ID Variables. This block consists of variables that identify the individual through different approaches (organizational, educational center, academic year studied, survey ID, etc.).
• Block 2: Informative Variables. These include codes that identify whether the surveyed individuals responded to the different blocks of questions.
• Block 3: Grades Obtained. This block comprises the grades obtained by students in the reference subjects, using a continuous or categorical classification.
• Block 4: Student Questionnaire. This block includes questions aimed at understanding the level of agreement (categorical) or the situation (coding/continuous) of the surveyed student.
• Block 5: Principal Questionnaire. This block includes questions aimed at understanding the level of agreement (categorical) or the situation (coding/continuous) of the principal of the student’s school. Coherence is assumed for the same educational centers.
• Block 6: Family Questionnaire. This block includes questions aimed at understanding the level of agreement (categorical) or the situation (coding/continuous) of the family of the surveyed student. Coherence is assumed for the same family units.
• Block 7: Teacher Questionnaire. This block includes questions aimed at understanding the level of agreement (categorical) or the situation (coding/continuous) of the tutor of the surveyed student. Coherence is assumed for the same tutors.

The dataset consists of 47,043,777 data points (the number of rows multiplied by the number of columns). However, 21,225,226 of these are classified as missing values, constituting 45.12% of the database. The fact that this database is restricted to the Canary Islands reduces statistical issues such as sample representativeness or selection bias. However, there are still potential biases in the database. The most relevant ones are:
• Family Questionnaire Bias. In each grade and academic year, there is a high percentage of students (around 20-40%) without information on their family situation (Block 6). There may be unexamined correlations between not responding to the family questionnaire and the target variable (academic performance).
• Missing Values Bias. As mentioned, there is a large amount of missing data, which could indicate correlations between the student’s (unobservable) situation and the target variable.
• Sample Bias. Since data collection takes place in the Canary Islands, transferring the obtained results to other regions might not account for characteristics unique to this specific territory.

We have a target variable $y_{i,t}$, which is the performance of student $i$ in the 6th grade of primary education. $y_{i,t-1}$, the achievement observed for students (in the same school) in previous years (in 3rd grade in our application), is the most widely used predictor in the related literature. We have a set of sensitive factors (or circumstances), denoted by $C_1, \ldots, C_J$ (see Table 1). These factors are all beyond a student’s control, hence any inequality resulting from them must be considered unfair and therefore corrected or compensated.

Our protected and unprotected groups are defined according to the circumstances considered. Each circumstance is split into two categories, which form the groups. We consider the left category as the unprotected group, while the right category forms the protected group. These groups can be seen in Table 2.

2.2. Inequality of Opportunity

To estimate the inequality of opportunity in educational achievement (IOEA) in 3rd grade, we follow the ex-ante approach proposed by [5] and recently used by [6, 7], and estimate the following reduced-form equation:

$$y_{i,t-1} = \alpha + \sum_{j=1}^{J} \gamma_j C_{j,i,t-1} + \varepsilon_{i,t-1} \tag{1}$$

From a practical point of view, we want to measure the IOEA at the time policy makers would take policy actions; that is, at the end of the 3rd grade, in order to correct potential unfairness in the 6th grade.

We estimate equation (1) by ordinary least squares (OLS) and recover the fitted part. Our measure of IOEA is the ratio of the variance of $y_{i,t-1}$ explained by the set of circumstances (i.e., the variance of the fitted part) to the total variance. In a linear model, this ratio is exactly the $R^2$ of the regression. [5] explains that, when using standardized achievement measures such as those used in this paper, standard inequality indices such as the Gini or the MLD must be disregarded, and the variance is a convenient alternative.

Next, we want to estimate the relative importance of each circumstance for educational achievement. We use a multivariate regression-based decomposition approach [8, 9], which adapts the decomposition of [10], to determine the contribution of each factor (or group of factors) to explaining educational achievement. The relative factor IOEA weight for any circumstance $j$ is given by:

$$s_j = \frac{\mathrm{cov}[\hat{\gamma}_j C_{j,i,t-1}, \hat{y}_{i,t-1}]}{\hat{\sigma}^2_{\hat{y}}} = \frac{\hat{\gamma}_j}{\hat{\sigma}^2_{\hat{y}}} \, \mathrm{cov}[C_{j,i,t-1}, \hat{y}_{i,t-1}] \tag{2}$$

where $\hat{\gamma}_j$ is the estimated OLS coefficient from (1) associated with circumstance $j$, and $\hat{\sigma}^2_{\hat{y}}$ is the variance of the fitted achievement in the 3rd grade (i.e., the fitted target variable in (1)).

2.3. Predictive Models

In this section, we define all the models used in the use case, all of which are regression models. We start with a simple linear framework to better understand the relevance of the different variables (features) included in the model. This analysis can be easily extended to more sophisticated predictive models, such as conditional inference trees, random forests, or neural network approaches. Our baseline (linear) predictive model is the following (Model 1):

$$y_{i,t} = \alpha + \beta_1 y_{i,t-1} + \varepsilon_{i,t} \tag{3}$$

where $y_{i,t}$ is the academic performance in mathematics at time $t$ (6th grade of primary education), $y_{i,t-1}$ is the academic performance at time $t-1$ (3rd grade of primary education), $\beta_1$ is a parameter to be estimated and $\varepsilon_{i,t}$ is an error term.

In order to understand the origin of potential AI unfairness, we also use the following variants. Model 2 extends Model 1 by including, in addition to the student’s academic performance at $t-1$, the set of circumstances (protected features) measured at $t-1$:

$$y_{i,t} = \alpha + \beta_1 y_{i,t-1} + \sum_{j=1}^{J} \beta_{2j} C_{j,i,t-1} + \varepsilon_{i,t} \tag{4}$$

Notice that the interpretation of $\beta_1$ in (4) is different from its interpretation in Model 1, since $y_{i,t-1}$ correlates with the circumstances $C_{j,i,t-1}$. Now, $\beta_1$ captures the impact of educational achievement in 3rd grade on achievement in 6th grade, while taking into account the potential (unfair) differences generated by our set of circumstances in 3rd grade, which extend to 6th grade and probably into the future. The estimated $\beta_{2j}$ represent the impact of circumstances in 3rd grade on achievement in 6th grade that is not channeled through achievement in 3rd grade. If the entire impact of circumstances is channeled through their effect on achievement in 3rd grade, the estimated $\beta_{2j}$ coefficients should be close to zero.

Model 3 decomposes the predictor $y_{i,t-1}$ into the part explained by circumstances and the part not explained by circumstances (a residual term). In the IO literature (see [11]), this residual term is associated with the part of achievement not related to (observed) circumstances and instead related to effort and to non-observed circumstances. The two parts might correlate differently with the target variable. For instance, [11] and [12] show that the circumstance part is negatively related to subsequent economic growth, while the effort component is positively correlated.

To estimate Model 3, we start from the estimates of equation (1). Then, we decompose $y_{i,t-1}$ into its fitted part (the part explained by circumstances and associated with the IO estimated in (1)), $\hat{y}_{i,t-1}$, and the residual part, $\hat{\varepsilon}_{i,t-1}$, which captures other factors not included in the model and uncorrelated with the considered circumstances. To simplify notation, we call this residual term “Effort”. We then estimate Model 3 as follows:

$$y_{i,t} = \alpha + \beta_4 \hat{y}_{i,t-1} + \beta_5 \hat{\varepsilon}_{i,t-1} + \varepsilon_{i,t} \tag{5}$$

By doing that, we want to distinguish predictions of $y_{i,t}$ due exclusively to circumstances ($\alpha + \beta_4 \hat{y}_{i,t-1}$) from those due exclusively to effort ($\alpha + \beta_4 \bar{\hat{y}}_{t-1} + \beta_5 \hat{\varepsilon}_{i,t-1}$), where $\bar{\hat{y}}_{t-1}$ represents the average value of the predictions $\hat{y}_{i,t-1}$, so that both predictions have the same average level.
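The two-stage procedure behind Model 3 can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `ols` and `two_stage_predictions`, the array names `C`, `y_prev` and `y_now`, and the use of NumPy least squares are all hypothetical choices, and the circumstances are assumed to be already encoded as numeric columns with no missing values.

```python
import numpy as np

def ols(X, y):
    """Fit y on X with an intercept; return (coefficients, fitted values)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef, X1 @ coef

def two_stage_predictions(C, y_prev, y_now):
    """C: (n, J) circumstances; y_prev: 3rd-grade score; y_now: 6th-grade score."""
    # Stage 1 (equation 1): split 3rd-grade achievement into the part
    # explained by circumstances and the residual ("Effort").
    _, y_prev_hat = ols(C, y_prev)
    effort = y_prev - y_prev_hat

    # Stage 2 (equation 5): regress 6th-grade achievement on both parts.
    coef, _ = ols(np.column_stack([y_prev_hat, effort]), y_now)
    alpha, beta4, beta5 = coef

    # Prediction driven only by circumstances.
    pred_circumstances = alpha + beta4 * y_prev_hat
    # Prediction driven only by effort; the circumstance part is held at its
    # sample mean so both predictions share the same average level.
    pred_effort = alpha + beta4 * y_prev_hat.mean() + beta5 * effort
    return pred_circumstances, pred_effort
```

Because the first-stage residual is orthogonal to every regressor of equation (1), the effort-based prediction is linearly uncorrelated with each observed circumstance in the estimation sample, which is the property the two-stage design relies on.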

2.4. AI fairness metrics

We evaluate the models in terms of fairness using the equalized odds metric. This metric is satisfied when the model’s predictions ensure that students from both protected and unprotected groups (e.g., females and males) have equal recall. Recall is defined as the ratio of True Positives (TP) to the sum of True Positives (TP) and False Negatives (FN).

For each model, we estimate the student’s academic performance prediction ($\hat{y}_{i,t}$). Since this prediction is continuous, we discretize it into quartiles to create our predicted classes. Likewise, we categorize the actual academic performance variable into quartiles to establish our true classes.

Next, to evaluate the models’ predictive fairness for each group, we construct confusion matrices for different quartiles of academic performance. For instance, the confusion matrix for the first quartile of academic performance (below the 25th percentile) is shown in Table 3.

We specifically calculate the equalized odds metric for the low tail of the academic performance distribution (Q1, below the 25th percentile), the center of the distribution (between 25th and 75th percentiles), and the high tail of the distribution (above 75th percentile).

For each sensitive feature, we calculate the equalized odds metric, whose values can be associated with fair or unfair model predictions:
• Fair: the odds are close to 1, so the model predicts equally well for all groups within each circumstance.
• Unfair: the odds are far from 1. Odds below 1 mean that the predictions benefit the protected groups (lower categories); odds above 1 mean that they benefit the unprotected groups (upper categories).
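The metric described above can be sketched as a ratio of group recalls within one band of the distribution. The sketch below is hypothetical and rests on several assumptions: the ratio is taken as unprotected recall over protected recall (inferred from the statement that odds above 1 benefit unprotected groups), the quartile cut-offs are computed on the pooled sample separately for actual and predicted values, and the names `quartile_band`, `recall` and `recall_ratio` are illustrative. Note also that comparing recall alone follows the text; the stricter textbook definition of equalized odds additionally compares false positive rates.

```python
import numpy as np

def quartile_band(values):
    """Label each value as 'low' (< 25th pct), 'high' (> 75th pct) or 'mid'."""
    q25, q75 = np.quantile(values, [0.25, 0.75])
    return np.where(values < q25, "low", np.where(values > q75, "high", "mid"))

def recall(true_band, pred_band, band):
    """TP / (TP + FN) for one band, treating that band as the positive class."""
    positives = true_band == band
    tp = np.sum(positives & (pred_band == band))
    fn = np.sum(positives & (pred_band != band))
    return tp / (tp + fn) if (tp + fn) > 0 else float("nan")

def recall_ratio(y_true, y_pred, protected, band):
    """Recall of the unprotected group divided by recall of the protected group.

    protected: boolean array, True for members of the protected group.
    """
    true_band = quartile_band(y_true)
    pred_band = quartile_band(y_pred)
    r_unprotected = recall(true_band[~protected], pred_band[~protected], band)
    r_protected = recall(true_band[protected], pred_band[protected], band)
    return r_unprotected / r_protected
```

A call such as `recall_ratio(y_true, y_pred, protected, "high")` would then give the value for the upper tail; values near 1 correspond to the "fair" case above.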

We will show that the degree of AI unfairness depends on the model used (the type of variables included in the model) and the part of the distributions we look at (upper, middle, or lower).

3. Results

3.1. Inequality of opportunity estimates

We estimate equation (1) through OLS. Table 4 shows the relative factor shares: father’s education and the number of books at home (as a proxy for the cultural environment), followed by mother’s education, the general socioeconomic status of the household and the school starting age, are the most relevant circumstances explaining achievement variability in 3rd grade.
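For readers who want to reproduce this kind of estimate on their own data, the IOEA measure and the relative factor shares can be computed as sketched below. This is a minimal illustration under assumptions, not the authors' code: the function name `ioea_and_shares` and the inputs `C` (numeric circumstance matrix) and `y_prev` (3rd-grade achievement) are hypothetical, and the shares follow the regression-based decomposition described in Section 2.2.

```python
import numpy as np

def ioea_and_shares(C, y_prev):
    """Return (IOEA, factor shares) from an OLS regression of y_prev on C.

    C: (n, J) array of circumstances; y_prev: (n,) array of achievement.
    """
    n, n_circ = C.shape
    X = np.column_stack([np.ones(n), C])
    coef, *_ = np.linalg.lstsq(X, y_prev, rcond=None)
    fitted = X @ coef

    # IOEA: variance explained by circumstances over total variance (the R^2).
    ioea = fitted.var() / y_prev.var()

    # Share of circumstance j: coefficient times cov(C_j, fitted), divided by
    # the variance of the fitted part (population normalisation throughout).
    slopes = coef[1:]
    shares = np.array([
        slopes[j] * np.cov(C[:, j], fitted, bias=True)[0, 1]
        for j in range(n_circ)
    ]) / fitted.var()
    return ioea, shares
```

By construction the shares add up to one, since the covariances of the individual terms with the fitted value sum to the variance of the fitted value; an individual share can be negative when a circumstance is negatively correlated with the fitted part.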

In AI terminology, a circumstance must always be considered a sensitive feature since, in a fair society, it should not be correlated with a student’s achievement. How does the relevance of each circumstance correlate with the AI-(un)fairness measure associated with each sensitive factor?

3.2. Discussion

We show below the equalized-odds measure for each circumstance (sensitive feature) obtained for each of the three models used (and the four predictions generated): Model 1, which includes only achievement in 3rd grade; Model 2, which extends Model 1 with circumstances; and Model 3, for which we show the predictions that use only circumstances (the inequality of opportunity model) and the predictions that use the effort component (the effort model). We do this for each circumstance and for the prediction of three parts of the distribution:

• Performance below the 25th percentile, which could be relevant if the policy is aimed at giving reinforcement in 3rd grade to the most disadvantaged students to reduce educational failure in 6th grade.
• Performance in the upper tail (above the 75th percentile), which could be relevant if the objective is to reward excellence with a scholarship program, for example.
• Intermediate performance, to detect a group of students representative of a class or school in order to select case and control groups for an education policy experiment.

The results are shown in Figures 1-3 for the different predicted percentiles. The detailed results for each sensitive variable (circumstance) are presented in Tables 5-7.

All detailed results can be found in the appendix. We summarize the main results below, which are also recapped in Table 5.

The model using the predictions of our effort proxy is the only one that achieves AI fairness regardless of which part of the distribution we want to predict. For all sensitive variables, the odds ratio is very close to one, indicating that the model (whether good or bad) predicts almost equally well for one group as for another.

On the other hand, when we consider the remaining predictions, which come from models that include circumstances, the quality of the predictions differs depending on the group we consider and the part of the performance distribution we are predicting.

Thus, for example, when we seek to predict the upper part of the distribution (i.e., to predict whether the student will be excellent in 6th grade), the baseline model and the rest of the models that include information from the sensitive variables predict worse for the most disadvantaged category (i.e., lower socioeconomic status, with lower educational level of the parents, with fewer books in the home, etc.) than for the favored categories. Using these predictions for decision making (e.g., to award prizes for excellence) would generate a clear injustice in favor of those who have a more advantageous starting point, with the consequent increase (most likely) of unequal opportunities in the future. The model generates more unfair predictions the greater the weight of circumstances in the model (in our case the model that only uses circumstances as a predictor).

Moreover, the greatest differences in the predictions are seen among the circumstances that turn out to be most relevant in explaining the inequality of opportunity in the initial year. Thus, the greatest injustice arises when we compare the predictions between students with highly and less educated parents, or when we compare students in homes with high and low cultural environments.

On the contrary, if the aim is to predict the lower tail of the distribution (for example, to offer classes that reinforce learning), the prediction model that includes circumstances also generates injustices (in the sense that it predicts better for one class than for another) but, in this case, the favored group (the one with better predictions) is the most disadvantaged. That is, it predicts school failure better for children from lower classes or worse circumstances than for children from better circumstances. Applying the predictions of this model would probably help to reduce inequality of opportunity in the future, but by means of a bad policy: it reduces inequality of opportunity by improving the situation of the most disadvantaged while worsening that of the most advantaged.

Ideally, across the entire distribution, predictive models should be fair in the sense that they do not discriminate against one group in favor of another with different circumstances, which should eventually equalize the outcomes of individuals.

4. Conclusion

In this study, we have explored the complexities of predicting students’ academic performance using AI models, with a particular focus on addressing socioeconomic inequalities that influence predictive outcomes. Our analysis revealed significant biases in predictions favoring advantaged groups, particularly when sensitive variables related to socioeconomic status and home environment are not appropriately managed in predictive modeling.

Through a comparative evaluation of three AI models and an ablation study, we demonstrated how these sensitive features, also known as circumstances, can distort predictions, leading to unfair outcomes in educational assessments. Importantly, we proposed a two-stage estimation procedure in our third model to mitigate these biases.

The findings underscore the critical importance of integrating fairness considerations into predictive modeling practices, particularly in educational settings where equitable outcomes are essential. Future research should further refine and validate these methodologies across diverse datasets and educational contexts to foster more inclusive and equitable predictive models.

Acknowledgments

This paper was partially supported by the “AEQUITAS” project funded by the European Union’s Horizon Europe research and innovation programme under grant number 101070363.

A. Complete Results

In the tables shown in the appendix, numbers or letters are used in brackets to distinguish results achieved by the different models presented in this work:
• 1: refers to Model 1 (Equation 3)
• 2: refers to Model 2 (Equation 4)
• C: refers to the circumstance-based model
• E: refers to the effort-based model (for this and the previous bullet point, see the discussion regarding Model 3 in Section 2.3)

Males Females Tertiary education

Category 0.48 0.63 1.00

[1] Diener, M. L. Diener, Diener, Factors predicting the subjective well-being of nations, Journal of Personality and Social Psychology 69 (5) (1995) 851-64. URL: https://api.semanticscholar.org/CorpusID:20833520.

[2] H. F. Ladd, Holding schools accountable: Performance-based reform in education, 1996. URL: https://api.semanticscholar.org/CorpusID:21305932.

[3] Yu, Li, Fischer, Doroudi, Xu, Towards accurate and fair prediction of college success: Evaluating different sources of student data, in: Educational Data Mining, 2020. URL: https://api.semanticscholar.org/CorpusID:220486717.

[4] W. H. Sewell, Inequality of opportunity for higher education, American Sociological Review 36 (1971) 793. URL: https://api.semanticscholar.org/CorpusID:34820586.

[5] F. H. G. Ferreira, Gignoux, The measurement of educational inequality: Achievement and opportunity, 2014. URL: https://api.semanticscholar.org/CorpusID:260636749.

[6] G. A. Marrero, J. G. Rodríguez, Unfair inequality and growth, The Scandinavian Journal of Economics (2023). URL: https://api.semanticscholar.org/CorpusID:258023458.

[7] G. A. Marrero, J. C. Palomino, G. Sicilia, Inequality of opportunity in educational achievement in western europe: contributors and channels, The Journal of Economic Inequality 22 (2023) 383-410. URL: https://api.semanticscholar.org/CorpusID:265183210.

[8] Brewer, Wren-Lewis, Accounting for changes in income inequality: Decomposition analyses for the uk, 1978-2008, ERN: Other Econometrics: Mathematical Methods & Programming (Topic) (2016). URL: https://api.semanticscholar.org/CorpusID:19612310.

[9] G. S. Fields, Accounting for income inequality and its change: A new method, with application to the distribution of earnings in the united states, 2012. URL: https://api.semanticscholar.org/CorpusID:202257605.

[10] A. F. Shorrocks, Inequality decompositions by factor components, Econometrica 50 (1982) 193-211. URL: https://api.semanticscholar.org/CorpusID:7478703.

[11] G. A. Marrero, Inequality of opportunity and growth, 2013. URL: https://api.semanticscholar.org/CorpusID:67827959.

[12] G. A. Marrero, J. G. Rodríguez, Inequality of opportunity in europe, Microeconomics: Welfare Economics & Collective Decision-Making eJournal (2012). URL: https://api.semanticscholar.org/CorpusID:154345034.