
Do Foundation Models Learn Fair Representations? A Critical Evaluation of TabPFN on Algorithmic Fairness Benchmarks

Sam Schifman

Columbia Prep, New York, NY 10025, USA

2025

Foundation models have revolutionized machine learning across domains, with tabular foundation models like TabPFN promising similar advances for structured data. TabPFN's extensive pretraining on millions of synthetic datasets with diverse class distributions theoretically suggests inherent robustness to data imbalances that often precipitate algorithmic bias. We empirically test this conjecture through systematic evaluation on three canonical fairness benchmarks: Adult Income, COMPAS Recidivism, and German Credit. Our findings decisively challenge the assumption that architectural sophistication alone can address fairness challenges. Conventional algorithms with simple post-hoc calibration substantially outperform TabPFN in its native state: logistic regression with group-specific thresholding achieves an Equal Opportunity gap of 0.029 on Adult Income, while uncalibrated TabPFN exhibits a gap of 0.099, a 3.4-fold degradation. We provide theoretical grounding for these empirical observations by connecting them to fundamental impossibility results in fair machine learning. Our work demonstrates that for fairness-critical tabular applications, the benefits of large-scale pretraining do not extend to bias mitigation; targeted algorithmic interventions remain indispensable regardless of model sophistication.

Keywords: foundation models, algorithmic fairness, tabular data, equal opportunity, post-hoc calibration, fairness impossibility theorems

1. Introduction

The emergence of foundation models has transformed machine learning, with large-scale pretraining on diverse data enabling strong generalization across tasks [1]. This paradigm, long successful in language and vision, has recently reached tabular data via models such as TabPFN (Tabular Prior-Fitted Networks) [2], which are pretrained on millions of synthetic datasets to deliver competitive accuracy without per-dataset training or hyperparameter tuning.

A natural question is whether such pretraining confers incidental benefits for algorithmic fairness. Because formal criteria for fairness are mutually incompatible in general [3], we cannot expect a model to satisfy all notions of fairness at once. Instead, we focus on a single axis, such as worst-group accuracy (WGA): performance on the hardest subgroup under spurious correlations and group shifts [4]. There is accumulating evidence in vision that stronger, more robust representations can improve such metrics. For example, on the Waterbirds dataset, constructed to test robustness to a background–label spurious correlation [4], larger pre-trained models often yield higher WGA [5], and CLIP-based methods with stronger backbones (e.g., ViT-L/14) achieve substantially better WGA than smaller backbones when appropriately adapted [6].

This motivates the intuition that if pretraining yields representations less reliant on spurious signals, a tabular foundation model might also exhibit higher robustness/fairness out of the box. Yet there are reasons to expect the opposite for tabular FMs such as TabPFN [2]. First, TabPFN’s pretraining distribution is synthetic (drawn from structural causal models and related priors) rather than scraped from the natural data that induces real demographic shifts; as a result, the learned invariances may not align with the socio-demographic spurious correlations that drive group disparities in practice [2]. Second, group shifts in tabular problems are often driven by sampling/measurement processes and explicit protected attributes, where representation scale may help less than data balancing or group-aware training. Third, TabPFN is trained to approximate Bayesian prediction under its prior rather than to optimize any group-robust objective, so improvements in average accuracy need not translate into gains for the worst-off group.

Widely deployed systems such as COMPAS have exhibited substantially higher false-positive rates for one group than for another, leading to harm [7] and underscoring why worst-group performance matters. Motivated by this, our central hypothesis is that, despite extensive pretraining on diverse synthetic tables, TabPFN does not have an inherent advantage in worst-group accuracy over strong conventional baselines with simple fairness interventions (e.g., group thresholding [8]). We test this hypothesis by systematically comparing TabPFN to standard tabular learners across established fairness benchmarks.

2. Background and Motivation

To understand why foundation models might not solve fairness challenges, we must first examine the theoretical landscape of fair machine learning. Algorithmic fairness encompasses multiple, often incompatible definitions [8]:

Demographic Parity requires equal positive prediction rates across groups: $P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b)$ for protected attribute $A$. While intuitive, this criterion ignores differences in base rates and can significantly reduce accuracy when enforced [9].

Equalized Odds demands equal true positive and false positive rates: $P(\hat{Y} = 1 \mid Y = y, A = a) = P(\hat{Y} = 1 \mid Y = y, A = b)$ for $y \in \{0, 1\}$. This balances error rates across groups but imposes strong constraints [8].

Equality of Opportunity, our focus, requires only equal true positive rates: $P(\hat{Y} = 1 \mid Y = 1, A = a) = P(\hat{Y} = 1 \mid Y = 1, A = b)$. This ensures qualified individuals have equal chances of positive outcomes while minimizing accuracy loss [8].

Kleinberg et al. [3] proved these criteria are mutually incompatible except in degenerate cases. Specifically, they showed that when base rates differ across groups (i.e., $P(Y = 1 \mid A = a) \neq P(Y = 1 \mid A = b)$), no classifier can simultaneously satisfy calibration, balance for the positive class, and balance for the negative class. This impossibility result establishes fundamental trade-offs that constrain all classifiers, regardless of their architecture or training procedure. No amount of pretraining data or architectural sophistication can circumvent these limits. A model can at best navigate the existing fairness-accuracy frontier more efficiently, but cannot transcend it.
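As a toy illustration of why differing base rates create tension between criteria, the sketch below evaluates a perfect classifier on two small groups. It is not the Kleinberg et al. proof, which concerns calibration and class balance; it only shows the simpler, related conflict between Demographic Parity and Equalized Odds. The labels and the helper names `rate` and `group_rates` are hypothetical and chosen for illustration only.

```python
def rate(values):
    """Fraction of 1s in a list of 0/1 values (0.0 for an empty list)."""
    return sum(values) / len(values) if values else 0.0


def group_rates(y_true, y_pred):
    """Positive prediction rate, TPR, and FPR for a single group."""
    preds_on_pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    preds_on_neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    return {
        "positive_rate": rate(y_pred),
        "tpr": rate(preds_on_pos),
        "fpr": rate(preds_on_neg),
    }


# Hypothetical toy labels: base rate 0.5 in group a, 0.25 in group b.
y_a = [1, 1, 0, 0]
y_b = [1, 0, 0, 0]

# A perfect classifier predicts the true label for every individual.
rates_a = group_rates(y_a, y_a)
rates_b = group_rates(y_b, y_b)

# Equalized Odds holds: TPR and FPR are identical across groups.
# Demographic Parity fails: positive rates equal the base rates, which differ.
print(rates_a)
print(rates_b)
```

Even with zero errors, the positive prediction rates inherit the gap in base rates, so satisfying one criterion exactly does not imply satisfying the other.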

2.1. Foundation Models and the Promise of Diverse Pretraining

Foundation models’ success stems from learning generalizable representations through exposure to diverse training data. In language models, this enables few-shot adaptation to new tasks [10]. For tabular data, TabPFN applies similar principles through several key innovations:

Prior-fitted training: The model approximates Bayesian inference by training on samples from a prior over causal models. This approach theoretically enables the model to learn a distribution over possible data-generating processes.

In-context learning: TabPFN conditions on the entire training set, performing inference without gradient updates. This design allows the model to adapt to new datasets without fine-tuning.

Synthetic diversity: Training data includes approximately 100 million synthetic datasets generated from Structural Causal Models with varying properties, including different feature relationships, noise levels, and class distributions.

This design theoretically exposes TabPFN to diverse data distributions, potentially learning representations that generalize beyond specific biases. The synthetic data generation process creates datasets with controlled properties, which could enable the model to disentangle spurious correlations from genuine predictive signals. Yet recent work on pretrained models in other domains reveals persistent fairness challenges that suggest limits to this approach. Language models exhibit social biases inherited from training corpora despite their scale [11], while vision-language models show demographic disparities in performance [12]. These findings indicate that exposure to diverse data alone may not solve fairness problems, particularly when the diversity is synthetic rather than capturing real-world social dynamics.

2.2. Contributions and Significance

We present the first comprehensive fairness evaluation of tabular foundation models, making several key contributions:

Empirical Assessment with Theoretical Grounding: We systematically compare TabPFN against traditional models (logistic regression, random forests, CatBoost) across three canonical fairness benchmarks, evaluating both raw performance and response to standard fairness interventions.

Counterintuitive Negative Result: Despite theoretical advantages from diverse pretraining, TabPFN shows no inherent robustness/fairness benefits. Simple models with post-hoc calibration often outperform the foundation model, with logistic regression achieving 3.4× lower Equal Opportunity gaps on Adult Income.

Theoretical Implications: We connect our empirical findings to fundamental limits in fair representation learning. Our results support recent theoretical work suggesting that learning “fair representations” may be fundamentally limited [13], and that fairness requires explicit intervention rather than emerging from architectural choices.

Practical Guidance: We demonstrate that all models, from basic linear classifiers to sophisticated transformers, converge to similar fairness-accuracy trade-offs when properly calibrated. This suggests practitioners should prioritize proven fairness interventions over architectural complexity for bias mitigation.

3. Methods and Experimental Design

We ground our study in three core research questions that follow from the considerations outlined above. First, we ask whether foundation models trained on diverse synthetic data exhibit lower fairness gaps than traditional models on real-world biased datasets. This probes whether the synthetic diversity that underpins TabPFN’s pretraining meaningfully translates into fairness advantages in practice. Second, we investigate how foundation models respond to standard fairness interventions compared to traditional approaches, allowing us to test whether architectural differences, particularly those arising from pretraining and transformer-based design, affect the effectiveness of post-hoc calibration. Finally, we consider whether simple models, once calibrated, can achieve fairness levels comparable to or even surpassing those of sophisticated pretrained models. Together, these questions allow us to assess the true value of architectural complexity for addressing fairness challenges.

To answer them, we draw on three canonical fairness benchmarks, each representing a distinct domain and bias pattern. The Adult Income dataset [14] contains 48,842 instances from 1994 U.S. Census data predicting income above $50,000, with gender as the protected attribute. The COMPAS Recidivism dataset [7] comprises 7,214 criminal justice records predicting two-year recidivism, with race as the protected attribute. Finally, the German Credit dataset [15] includes 1,000 loan applications predicting credit risk, with sex as the protected attribute. These datasets are widely used in the fairness literature, making them a robust foundation for systematic comparison.

In evaluating model performance, we balance predictive accuracy with fairness considerations. Alongside standard metrics such as accuracy and balanced accuracy, we employ two complementary fairness measures. The first is the Equal Opportunity Gap, defined as $\Delta_{\mathrm{EO}} = |\mathrm{TPR}_{A=1} - \mathrm{TPR}_{A=0}|$, which captures disparities in true positive rates between protected groups. The second is the Worst-Group Balanced Accuracy, given by $\min_{a \in \{0,1\}} \mathrm{BalAcc}_a$, which reflects the performance of the worst-off group. Together, these metrics highlight different aspects of fairness while remaining interpretable and actionable for practitioners.
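The two fairness measures can be sketched in a few lines of plain Python. This is a minimal illustration that assumes binary labels, binary predictions, and a binary protected attribute encoded as 0/1. The function names and the toy arrays are hypothetical and are not taken from the study's code.

```python
def select_group(values, groups, a):
    """Keep only the entries whose group label equals a."""
    return [v for v, g in zip(values, groups) if g == a]


def true_positive_rate(y_true, y_pred):
    """TP / (TP + FN); 0.0 when the sample contains no positives."""
    preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds) / len(preds) if preds else 0.0


def true_negative_rate(y_true, y_pred):
    """TN / (TN + FP); 0.0 when the sample contains no negatives."""
    preds = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(1 - p for p in preds) / len(preds) if preds else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    return 0.5 * (true_positive_rate(y_true, y_pred)
                  + true_negative_rate(y_true, y_pred))


def equal_opportunity_gap(y_true, y_pred, groups):
    """Absolute TPR difference between group 1 and group 0."""
    tpr = {
        a: true_positive_rate(select_group(y_true, groups, a),
                              select_group(y_pred, groups, a))
        for a in (0, 1)
    }
    return abs(tpr[1] - tpr[0])


def worst_group_balanced_accuracy(y_true, y_pred, groups):
    """Minimum per-group balanced accuracy over groups 0 and 1."""
    return min(
        balanced_accuracy(select_group(y_true, groups, a),
                          select_group(y_pred, groups, a))
        for a in (0, 1)
    )


# Hypothetical toy data: group 1 is classified perfectly, group 0 is not.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
groups = [1, 1, 1, 1, 0, 0, 0, 0]

eo_gap = equal_opportunity_gap(y_true, y_pred, groups)
wg_bal_acc = worst_group_balanced_accuracy(y_true, y_pred, groups)
print(eo_gap, wg_bal_acc)
```

In this toy case, group 1 has a true positive rate of 1.0 and group 0 of 0.5, so the gap is 0.5, and the worst-group balanced accuracy is that of group 0.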

We compare four models spanning a spectrum of complexity. At the simplest end, Logistic Regression (LR) provides a convex linear baseline using scikit-learn with ℓ2 regularization. Random Forest (RF) extends this baseline through an ensemble of 100 CART trees with bootstrap sampling and random feature selection, while CatBoost (CB) introduces gradient boosting with native categorical handling. At the most sophisticated end, we evaluate TabPFN (TPFN), a transformer-based tabular foundation model pretrained on 100 million synthetic datasets and fine-tuned using the library’s automatic routine. This range of models enables a direct assessment of whether complexity and pretraining confer fairness benefits.

To incorporate fairness interventions, we implement group-specific thresholding following Hardt et al. [8]. For each group $a$, we find the threshold $\tau_a^*$ that equalizes true positive rates: $\tau_a^* = \arg\min_{\tau} |\mathrm{TPR}_a(\tau) - \mathrm{TPR}_{\mathrm{overall}}|$. (1) This method is theoretically grounded: Corbett-Davies et al. [16] proved that optimal fair classifiers require group-specific thresholds, making this approach both principled and practical.
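A minimal sketch of the thresholding rule in Eq. (1) follows. Several details are assumptions rather than the study's stated procedure: the overall target rate is taken to be the pooled true positive rate at a default threshold of 0.5, candidate thresholds come from a grid of 0.01 steps, and ties are broken in favour of the lowest threshold. The function names and toy scores are hypothetical.

```python
def tpr_at(scores, y_true, threshold):
    """True positive rate when predicting 1 for scores >= threshold."""
    pos_scores = [s for s, t in zip(scores, y_true) if t == 1]
    if not pos_scores:
        return 0.0
    return sum(s >= threshold for s in pos_scores) / len(pos_scores)


def fit_group_thresholds(scores, y_true, groups, base_threshold=0.5, grid=None):
    """For each group, pick the grid threshold whose TPR is closest to the pooled TPR."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # assumed search grid
    target = tpr_at(scores, y_true, base_threshold)  # assumed overall target
    thresholds = {}
    for a in sorted(set(groups)):
        s_a = [s for s, g in zip(scores, groups) if g == a]
        y_a = [t for t, g in zip(y_true, groups) if g == a]
        # min() returns the first (lowest) grid value attaining the minimum.
        thresholds[a] = min(grid, key=lambda th: abs(tpr_at(s_a, y_a, th) - target))
    return thresholds


def predict_with_group_thresholds(scores, groups, thresholds):
    """Apply each individual's group-specific threshold."""
    return [int(s >= thresholds[g]) for s, g in zip(scores, groups)]


# Hypothetical toy scores: group 1 receives systematically higher scores.
scores = [0.9, 0.6, 0.45, 0.3, 0.35, 0.1, 0.95, 0.85, 0.7, 0.55, 0.6, 0.2]
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
groups = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

thresholds = fit_group_thresholds(scores, y_true, groups)
y_pred = predict_with_group_thresholds(scores, groups, thresholds)


def group_tpr(a, preds):
    """TPR of the given predictions restricted to group a."""
    hits = [p for p, t, g in zip(preds, y_true, groups) if g == a and t == 1]
    return sum(hits) / len(hits)


gap_before = abs(group_tpr(1, [int(s >= 0.5) for s in scores])
                 - group_tpr(0, [int(s >= 0.5) for s in scores]))
gap_after = abs(group_tpr(1, y_pred) - group_tpr(0, y_pred))
print(thresholds, gap_before, gap_after)
```

In practice the thresholds would be fitted on held-out data rather than on the data used to report the gap; the toy example fits and evaluates on the same points only to keep the sketch short.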

All experiments employ 5-fold stratified cross-validation with identical splits across models to ensure paired statistical comparisons. Feature pre-processing follows standard practices: categorical variables are one-hot encoded and numeric features standardized. Unless otherwise specified, we retain default hyperparameters to avoid overfitting or optimization that could confound fairness comparisons.

4. Results and Analysis

Group-specific thresholding reduces Equal Opportunity gaps by 71–87% across all models while sacrificing only 1–3 percentage points of accuracy. Critically, this benefit is model-agnostic: sophisticated and simple models respond similarly to calibration. Post-calibration, logistic regression achieves an EO gap of 0.029 on Adult Income, 3.4× smaller than that of uncalibrated TabPFN (0.099). This dramatic reversal challenges assumptions about the relationship between model complexity and fairness. We performed paired t-tests across cross-validation folds to validate our observations.

TabPFN’s accuracy and balanced accuracy are statistically indistinguishable from traditional models (p > 0.05 for all pairwise comparisons). Raw EO gap differences between models do not reach significance (p > 0.1), confirming no model is inherently fairer. Group-specific thresholding significantly reduces EO gaps for all models (p < 0.001) while accuracy drops remain non-significant (p > 0.1). These results strongly support our hypothesis that architectural sophistication alone does not confer fairness advantages.
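The paired comparison across folds can be sketched as follows. The per-fold values below are invented purely to illustrate the computation and are not results from this study. The sketch computes only the t statistic and compares it with the standard two-sided 5% critical value for 4 degrees of freedom (2.776), since an exact p-value would require a t-distribution routine such as the one in SciPy.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Hypothetical per-fold EO gaps for one model, before and after thresholding.
# These numbers are made up for illustration only.
gaps_before = [0.10, 0.12, 0.09, 0.11, 0.08]
gaps_after = [0.03, 0.04, 0.02, 0.03, 0.03]

t_stat, dof = paired_t_statistic(gaps_before, gaps_after)

T_CRIT_DF4 = 2.776  # two-sided 5% critical value of Student's t with 4 dof
significant = abs(t_stat) > T_CRIT_DF4
print(round(t_stat, 2), dof, significant)
```

Pairing by fold is what makes identical cross-validation splits across models important: each difference is computed on the same held-out data, which removes fold-to-fold variation from the comparison.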

5. Discussion

Our findings provide strong empirical support for theoretical predictions about the limits of fair representation learning. TabPFN’s pretraining regime is built on synthetic distributions that, while diverse in statistical properties, remain mathematical abstractions. Such data cannot reproduce the complex historical and structural biases embedded in real-world systems. As a result, TabPFN inherits the strengths of pretraining for accuracy, but it does not acquire mechanisms to address group disparities. This limitation is reinforced by its training objective, which prioritizes predictive accuracy without any explicit fairness constraints. In both pretraining and fine-tuning, gradient updates systematically reward accuracy even when this amplifies inequities across groups.

Our results demonstrate the real-world manifestation of fairness impossibility theorems. Kleinberg et al. [3] proved that when base rates differ across groups, no model can simultaneously achieve calibration, balance for the positive class, and balance for the negative class. We observe precisely this kind of trade-off in our experiments. For instance, on the Adult Income dataset, TabPFN attains accuracy comparable to logistic regression but suffers from a substantially higher Equal Opportunity gap in the uncalibrated setting. Post-hoc calibration narrows this gap, yet the improvement comes at the cost of reduced balanced accuracy. These outcomes illustrate how even highly sophisticated foundation models are bound by the same mathematical constraints as traditional classifiers.

Beyond illustrating theoretical limits, our study sheds light on why TabPFN in particular does not deliver fairness advantages. The synthetic datasets used in pretraining vary along axes such as feature relationships and class distributions, but they lack the embedded discriminatory patterns that characterize real social data. Without exposure to these complex sources of bias, TabPFN has no opportunity to learn fairness-aware representations. Moreover, its loss function contains no fairness component, leaving accuracy as the sole optimization goal. Together, these design choices explain why large-scale pretraining shifts the model along the fairness–accuracy frontier but cannot redraw the frontier itself.

The implications for practice are significant. Architectural sophistication and vast pretraining do not provide a shortcut to fairness, and practitioners cannot rely on foundation models to automatically mitigate bias. Our experiments show that post-hoc calibration remains essential regardless of whether one uses a simple linear classifier or a transformer-based foundation model. In fact, fairness interventions developed for traditional models remain just as relevant (and in many cases more effective) when applied to TabPFN. This reinforces the view that fairness arises from explicit intervention rather than passive reliance on model complexity.

These results also resonate with broader theoretical work. Zhao and Gordon [13] prove that no representation can simultaneously preserve utility and satisfy multiple fairness criteria, and our empirical findings provide clear validation of this claim. The effectiveness of group-specific thresholding further supports the “fairness through awareness” paradigm of Dwork et al. [9], which emphasizes the importance of explicitly considering protected attributes in decision-making. Our results suggest that fairness-aware design principles, rather than scale alone, should guide future advances in trustworthy machine learning.

TabPFN’s scale and sophistication do not overcome the mathematical limits of fairness. The model performs competitively on accuracy but remains subject to the same trade-offs that govern all classifiers when base rates differ across groups. By situating these empirical outcomes within the framework of impossibility results, we demonstrate that foundation models neither escape nor transcend established fairness constraints. Instead, progress in fairness will come from combining architectural innovation with deliberate fairness interventions grounded in theory.

5.1. Limitations and Future Work

Although our evaluation is systematic, several limitations must be acknowledged. First, our analysis is restricted to three benchmark datasets, each with a binary protected attribute. Outcomes may differ in settings involving continuous sensitive attributes or intersectional analyses that account for multiple overlapping demographic characteristics. Second, we evaluate fairness primarily through Equal Opportunity and Worst-Group Balanced Accuracy. While these measures are widely used and theoretically motivated, alternative fairness definitions could reveal different patterns. Nevertheless, impossibility results suggest that similar trade-offs would persist regardless of the chosen metric. Another limitation is that our study does not examine fairness stability under hard distribution shift, an issue of critical importance for real-world deployment [17]. Demographic changes or shifts in base rates may exacerbate disparities in ways that static evaluations cannot anticipate. Understanding how foundation models behave under such conditions remains an open question.

Looking forward, several promising directions emerge. One is to embed fairness constraints directly into the pretraining objective, potentially reshaping rather than merely shifting the fairness–accuracy trade-off frontier. Another is to investigate how the diversity of synthetic pretraining interacts with the complex, historically embedded patterns of bias present in real data. Such work could inform improved design principles for foundation models, ensuring that their scale and flexibility translate into genuine gains for fairness-critical applications.

6. Conclusion

We present the first comprehensive evaluation of algorithmic fairness in tabular foundation models, testing whether TabPFN’s extensive pretraining on diverse synthetic data confers advantages in handling biased real-world datasets. Our results, grounded in fairness impossibility theorems, deliver a clear verdict: despite theoretical promise, foundation models offer no inherent fairness benefits over traditional approaches.

This work contributes to a growing understanding of foundation model limitations. Although these models excel at many tasks, they do not provide a universal solution to machine learning challenges. Our findings emphasize that achieving trustworthy AI requires more than architectural innovation; it requires careful attention to fairness interventions regardless of model sophistication.

As foundation models proliferate across domains, maintaining this measured perspective becomes increasingly critical. Our results show that fairness-impossibility theorems remain binding even in the context of foundation models. This suggests that the path to fairness lies not in architectural scale or model sophistication alone but in explicitly integrating theoretical insights with deliberate interventions. Achieving trustworthy AI therefore requires embracing fairness-aware design choices.

Acknowledgments

We thank the developers of TabPFN for open-sourcing their model and providing comprehensive documentation. We are grateful to the creators and maintainers of the fairness benchmark datasets that made this evaluation possible. Special thanks to my parents for their unwavering support throughout this research project, and to Mr. Hoek for introducing me to LaTeX. We also thank the TRUST-AI reviewers for their constructive feedback that helped strengthen the theoretical grounding of this work.

Declaration on Generative AI

The author(s) have not employed any Generative AI tools.

References

[1] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. v. Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy, K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Goel, N. Goodman, S. Grossman, N. Guha, T. Hashimoto, P. Henderson, J. Hewitt, D. E. Ho, J. Hong, K. Hsu, J. Huang, T. Icard, S. Jain, D. Jurafsky, P. Kalluri, S. Karamcheti, G. Keeling, F. Khani, O. Khattab, P. W. Koh, M. Krass, R. Krishna, R. Kuditipudi, A. Kumar, F. Ladhak, M. Lee, T. Lee, J. Leskovec, I. Levent, X. L. Li, X. Li, T. Ma, A. Malik, C. D. Manning, S. Mirchandani, E. Mitchell, Z. Munyikwa, S. Nair, A. Narayan, D. Narayanan, B. Newman, A. Nie, J. C. Niebles, H. Nilforoshan, J. Nyarko, G. Ogut, L. Orr, I. Papadimitriou, J. S. Park, C. Piech, E. Portelance, C. Potts, A. Raghunathan, R. Reich, H. Ren, F. Rong, Y. Roohani, C. Ruiz, J. Ryan, C. Ré, D. Sadigh, S. Sagawa, K. Santhanam, A. Shih, K. Srinivasan, A. Tamkin, R. Taori, A. W. Thomas, F. Tramèr, R. E. Wang, W. Wang, B. Wu, J. Wu, Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang, L. Zheng, K. Zhou, P. Liang, On the Opportunities and Risks of Foundation Models, 2022. URL: http://arxiv.org/abs/2108.07258. doi:10.48550/arXiv.2108.07258, arXiv:2108.07258.

[2] N. Hollmann, S. Müller, K. Eggensperger, F. Hutter, TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second, 2023. URL: http://arxiv.org/abs/2207.01848. doi:10.48550/arXiv.2207.01848, arXiv:2207.01848.

[3] J. Kleinberg, S. Mullainathan, M. Raghavan, Inherent Trade-Offs in the Fair Determination of Risk Scores, 2016. URL: http://arxiv.org/abs/1609.05807. doi:10.48550/arXiv.1609.05807, arXiv:1609.05807.

[4] S. Sagawa, P. W. Koh, T. B. Hashimoto, P. Liang, Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization, CoRR abs/1911.08731 (2019). URL: http://arxiv.org/abs/1911.08731. arXiv:1911.08731.

[5] A. Pham, The Effect of Model Size on Worst-Group Generalization, Master’s thesis, EECS Department, University of California, Berkeley, 2022. URL: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-138.html.

[6] B. An, S. Zhu, M.-A. Panaitescu-Liess, C. K. Mummadi, F. Huang, PerceptionCLIP: Visual classification by inferring and conditioning on contexts, 2024. URL: https://arxiv.org/abs/2308.01313. arXiv:2308.01313.

[7] J. Angwin, J. Larson, S. Mattu, L. Kirchner, Machine Bias, 2016. URL: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

[8] M. Hardt, E. Price, N. Srebro, Equality of Opportunity in Supervised Learning, 2016. URL: http://arxiv.org/abs/1610.02413. doi:10.48550/arXiv.1610.02413, arXiv:1610.02413 [cs].

[9] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, R. Zemel, Fairness Through Awareness, 2011. URL: http://arxiv.org/abs/1104.3913. doi:10.48550/arXiv.1104.3913, arXiv:1104.3913.

[10] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language Models are Few-Shot Learners, 2020. URL: http://arxiv.org/abs/2005.14165. doi:10.48550/arXiv.2005.14165, arXiv:2005.14165.

[11] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, ACM, Virtual Event Canada, 2021, pp. 610–623. URL: https://dl.acm.org/doi/10.1145/3442188.3445922. doi:10.1145/3442188.3445922.

[12] J. Buolamwini, T. Gebru, Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, in: Proceedings of the 1st Conference on Fairness, Accountability and Transparency, PMLR, 2018, pp. 77–91. URL: https://proceedings.mlr.press/v81/buolamwini18a.html, ISSN: 2640-3498.

[13] H. Zhao, G. J. Gordon, Inherent Tradeoffs in Learning Fair Representations, 2022. URL: http://arxiv.org/abs/1906.08386. doi:10.48550/arXiv.1906.08386, arXiv:1906.08386.

[14] B. Becker, R. Kohavi, Adult, 1996. URL: https://archive.ics.uci.edu/dataset/2. doi:10.24432/C5XW20.

[15] H. Hofmann, Statlog (German Credit Data), 1994. URL: https://archive.ics.uci.edu/dataset/144. doi:10.24432/C5NC77.

[16] S. Corbett-Davies, J. D. Gaebler, H. Nilforoshan, R. Shroff, S. Goel, The Measure and Mismeasure of Fairness, 2023. URL: http://arxiv.org/abs/1808.00023. doi:10.48550/arXiv.1808.00023, arXiv:1808.00023 [cs].

[17] A. D’Amour, K. Heller, D. Moldovan, B. Adlam, B. Alipanahi, A. Beutel, C. Chen, J. Deaton, J. Eisenstein, M. D. Hoffman, F. Hormozdiari, N. Houlsby, S. Hou, G. Jerfel, A. Karthikesalingam, M. Lucic, Y. Ma, C. McLean, D. Mincu, A. Mitani, A. Montanari, Z. Nado, V. Natarajan, C. Nielson, T. F. Osborne, R. Raman, K. Ramasamy, R. Sayres, J. Schrouff, M. Seneviratne, S. Sequeira, H. Suresh, V. Veitch, M. Vladymyrov, X. Wang, K. Webster, S. Yadlowsky, T. Yun, X. Zhai, D. Sculley, Underspecification Presents Challenges for Credibility in Modern Machine Learning, 2020. URL: http://arxiv.org/abs/2011.03395. doi:10.48550/arXiv.2011.03395, arXiv:2011.03395.