<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Do Foundation Models Learn Fair Representations? A Critical Evaluation of TabPFN on Algorithmic Fairness Benchmarks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sam Schifman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Columbia Prep</institution>
          ,
          <addr-line>New York, NY 10025</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Foundation models have revolutionized machine learning across domains, with tabular foundation models like TabPFN promising similar advances for structured data. TabPFN's extensive pretraining on millions of synthetic datasets with diverse class distributions theoretically suggests inherent robustness to data imbalances that often precipitate algorithmic bias. We empirically test this conjecture through systematic evaluation on three canonical fairness benchmarks: Adult Income, COMPAS Recidivism, and German Credit. Our findings decisively challenge the assumption that architectural sophistication alone can address fairness challenges. Conventional algorithms with simple post-hoc calibration substantially outperform TabPFN in its native state: logistic regression with group-specific thresholding achieves an Equal Opportunity gap of 0.029 on Adult Income, while uncalibrated TabPFN exhibits a gap of 0.099, a 3.4-fold degradation. We provide theoretical grounding for these empirical observations by connecting them to fundamental impossibility results in fair machine learning. Our work demonstrates that for fairness-critical tabular applications, the benefits of large-scale pretraining do not extend to bias mitigation: targeted algorithmic interventions remain indispensable regardless of model sophistication.</p>
      </abstract>
      <kwd-group>
        <kwd>foundation models</kwd>
        <kwd>algorithmic fairness</kwd>
        <kwd>tabular data</kwd>
        <kwd>equal opportunity</kwd>
        <kwd>post-hoc calibration</kwd>
        <kwd>fairness impossibility theorems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The emergence of foundation models has transformed machine learning, with large-scale pretraining
on diverse data enabling strong generalization across tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This paradigm, long successful in
language and vision, has recently reached tabular data via models such as TabPFN (Tabular Prior-Fitted
Networks) [2], which are pretrained on millions of synthetic datasets to deliver competitive accuracy
without per-dataset training or hyperparameter tuning.
      </p>
      <p>A natural question is whether such pretraining confers incidental benefits for algorithmic fairness.
Because formal criteria for fairness are mutually incompatible in general [3], we cannot expect a model
to satisfy all notions of fairness at once. Instead, we focus on a single, given axis such as worst-group
accuracy (WGA), performance on the hardest subgroup under spurious correlations and group shifts [4].
There is accumulating evidence in vision that stronger, more robust representations can improve such
metrics. For example, on the Waterbirds dataset, constructed to test robustness to a background–label
spurious correlation [4], larger pre-trained models often yield higher WGA [5], and CLIP-based methods
with stronger backbones (e.g., ViT-L/14) achieve substantially better WGA than smaller backbones
when appropriately adapted [6].</p>
      <p>This motivates the intuition that if pretraining yields representations less reliant on spurious signals,
a tabular foundation model might also exhibit higher robustness/fairness out of the box. Yet there are
reasons to expect the opposite for tabular FMs such as TabPFN [2]. First, TabPFN’s pretraining
distribution is synthetic (drawn from structural causal models and related priors) rather than scraped from the
natural data that induces real demographic shifts; as a result, the learned invariances may not align
with the socio-demographic spurious correlations that drive group disparities in practice [2]. Second,
group shifts in tabular problems are often driven by sampling/measurement processes and explicit
protected attributes, where representation scale may help less than data balancing or group-aware
training. Third, TabPFN is trained to approximate Bayesian prediction under its prior rather than to
optimize any group-robust objective, so improvements in average accuracy need not translate into
wider gains.</p>
      <p>Widely deployed systems such as COMPAS have exhibited substantially higher false-positive rates
for one group than for another, with harmful consequences [7], which underscores why worst-group performance matters.
Motivated by this, our central hypothesis is that, despite extensive pretraining on diverse synthetic
tables, TabPFN does not have an inherent advantage in worst-group accuracy over strong conventional
baselines with simple fairness interventions (e.g., group thresholding [8]). We test this hypothesis by
systematically comparing TabPFN to standard tabular learners across established fairness benchmarks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Motivation</title>
      <p>To understand why foundation models might not solve fairness challenges, we must first examine
the theoretical landscape of fair machine learning. Algorithmic fairness encompasses multiple, often
incompatible definitions [8]:</p>
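      <p>Demographic Parity requires equal positive prediction rates across groups: <inline-formula><tex-math>P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b)</tex-math></inline-formula>
for protected attribute <inline-formula><tex-math>A</tex-math></inline-formula>. While intuitive, this criterion ignores differences in base
rates and can significantly reduce accuracy when enforced [9].</p>
      <p>Equalized Odds demands equal true positive and false positive rates: <inline-formula><tex-math>P(\hat{Y} = 1 \mid A = a, Y = y) = P(\hat{Y} = 1 \mid A = b, Y = y)</tex-math></inline-formula>
for <inline-formula><tex-math>y \in \{0, 1\}</tex-math></inline-formula>. This balances error rates across groups but imposes strong
constraints [8].</p>
      <p>Equality of Opportunity, our focus, requires only equal true positive rates: <inline-formula><tex-math>P(\hat{Y} = 1 \mid Y = 1, A = a) = P(\hat{Y} = 1 \mid Y = 1, A = b)</tex-math></inline-formula>.
This ensures qualified individuals have equal chances of positive
outcomes while minimizing accuracy loss [8].</p>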
      <p>Kleinberg et al. [3] proved these criteria are mutually incompatible except in degenerate cases.
Specifically, they showed that when base rates differ across groups (i.e., <inline-formula><tex-math>P(Y = 1 \mid A = a) \neq P(Y = 1 \mid A = b)</tex-math></inline-formula>),
no classifier can simultaneously satisfy calibration, balance for the positive class, and balance
for the negative class. This impossibility result establishes fundamental trade-offs that constrain all
classifiers, regardless of their architecture or training procedure. No amount of pretraining data or
architectural sophistication can circumvent these limits. A model can at best navigate the existing
fairness-accuracy frontier more efficiently, but cannot transcend it.</p>
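      <p>As a concrete illustration of why differing base rates force a trade-off, the short Python sketch below applies Bayes' rule to two hypothetical groups that share the same true positive and false positive rates but differ in base rate. The numbers and the function name ppv are illustrative assumptions, not values or code from this study; the identity used is the standard relation between positive predictive value, error rates, and prevalence.</p>
      <preformat>
```python
def ppv(tpr, fpr, base_rate):
    """Positive predictive value P(Y=1 | Yhat=1) obtained from Bayes' rule."""
    true_pos = tpr * base_rate            # P(Yhat=1, Y=1)
    false_pos = fpr * (1.0 - base_rate)   # P(Yhat=1, Y=0)
    return true_pos / (true_pos + false_pos)


# Hypothetical groups with identical error rates but different base rates.
tpr, fpr = 0.8, 0.2
ppv_a = ppv(tpr, fpr, base_rate=0.5)
ppv_b = ppv(tpr, fpr, base_rate=0.2)
print(f"PPV group a: {ppv_a:.2f}, PPV group b: {ppv_b:.2f}")
```
      </preformat>
      <p>With equal error rates, the positive predictive value is 0.80 for the group with base rate 0.5 and 0.50 for the group with base rate 0.2, so a positive prediction cannot carry the same meaning in both groups. This is the kind of tension between error-rate balance and calibration-style criteria that the impossibility result formalizes.</p>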
      <sec id="sec-2-1">
        <title>2.1. Foundation Models and the Promise of Diverse Pretraining</title>
        <p>Foundation models’ success stems from learning generalizable representations through exposure to
diverse training data. In language models, this enables few-shot adaptation to new tasks [10]. For
tabular data, TabPFN applies similar principles through several key innovations:</p>
        <p>Prior-fitted training: The model approximates Bayesian inference by training on samples from a
prior over causal models. This approach theoretically enables the model to learn a distribution over
possible data-generating processes.</p>
        <p>In-context learning: TabPFN conditions on the entire training set, performing inference without
gradient updates. This design allows the model to adapt to new datasets without fine-tuning.</p>
        <p>Synthetic diversity: Training data includes approximately 100 million synthetic datasets generated
from Structural Causal Models with varying properties, including different feature relationships, noise
levels, and class distributions.</p>
        <p>This design theoretically exposes TabPFN to diverse data distributions, potentially learning
representations that generalize beyond specific biases. The synthetic data generation process creates datasets
with controlled properties, which could enable the model to disentangle spurious correlations from
genuine predictive signals. Yet recent work on pretrained models in other domains reveals persistent
fairness challenges that suggest limits to this approach. Language models exhibit social biases inherited
from training corpora despite their scale [11], while vision-language models show demographic
disparities in performance [12]. These findings indicate that exposure to diverse data alone may not solve
fairness problems, particularly when the diversity is synthetic rather than capturing real-world social
dynamics.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Contributions and Significance</title>
        <p>We present the first comprehensive fairness evaluation of tabular foundation models, making several
key contributions:</p>
        <p>Empirical Assessment with Theoretical Grounding: We systematically compare TabPFN against
traditional models (logistic regression, random forests, CatBoost) across three canonical fairness
benchmarks, evaluating both raw performance and response to standard fairness interventions.</p>
        <p>Counterintuitive Negative Result: Despite theoretical advantages from diverse pretraining,
TabPFN shows no inherent robustness/fairness benefits. Simple models with post-hoc calibration
often outperform the foundation model, with logistic regression achieving 3.4× lower Equal
Opportunity gaps on Adult Income.</p>
        <p>Theoretical Implications: We connect our empirical findings to fundamental limits in fair
representation learning. Our results support recent theoretical work suggesting that learning “fair
representations” may be fundamentally limited [13], and that fairness requires explicit intervention rather than
emerging from architectural choices.</p>
        <p>Practical Guidance: We demonstrate that all models, from basic linear classifiers to sophisticated
transformers, converge to similar fairness-accuracy trade-offs when properly calibrated. This suggests
practitioners should prioritize proven fairness interventions over architectural complexity for bias
mitigation.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and Experimental Design</title>
      <p>We ground our study in three core research questions that follow from the considerations outlined
above. First, we ask whether foundation models trained on diverse synthetic data exhibit lower fairness
gaps than traditional models on real-world biased datasets. This probes whether the synthetic diversity
that underpins TabPFN’s pretraining meaningfully translates into fairness advantages in practice.
Second, we investigate how foundation models respond to standard fairness interventions compared to
traditional approaches, allowing us to test whether architectural differences, particularly those arising
from pretraining and transformer-based design, affect the effectiveness of post-hoc calibration. Finally,
we consider whether simple models, once calibrated, can achieve fairness levels comparable to or even
surpassing those of sophisticated pretrained models. Together, these questions allow us to assess the
true value of architectural complexity for addressing fairness challenges.</p>
      <p>To answer them, we draw on three canonical fairness benchmarks, each representing a distinct
domain and bias pattern. The Adult Income dataset [14] contains 48,842 instances from 1994 U.S. Census
data predicting income above $50,000, with gender as the protected attribute. The COMPAS Recidivism
dataset [7] comprises 7,214 criminal justice records predicting two-year recidivism, with race as the
protected attribute. Finally, the German Credit dataset [15] includes 1,000 loan applications predicting
credit risk, with sex as the protected attribute. These datasets are widely used in the fairness literature,
making them a robust foundation for systematic comparison.</p>
      <p>In evaluating model performance, we balance predictive accuracy with fairness considerations.
Alongside standard metrics such as accuracy and balanced accuracy, we employ two complementary
fairness measures. The first is the Equal Opportunity Gap, defined as</p>
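      <p>
        <disp-formula><tex-math>\Delta_{\mathrm{EO}} = \left| \mathrm{TPR}_{A=1} - \mathrm{TPR}_{A=0} \right|,</tex-math></disp-formula>
which captures disparities in true positive rates between protected groups. The second is the
Worst-Group Balanced Accuracy, given by
        <disp-formula><tex-math>\min_{a \in \{0,1\}} \mathrm{BalAcc}_{a},</tex-math></disp-formula>
which reflects the performance of the worst-off group. Together, these metrics highlight different
aspects of fairness while remaining interpretable and actionable for practitioners.</p>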
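      <p>The two fairness metrics described above can be computed directly from labels, predictions, and group membership. The following NumPy sketch is one possible implementation; the function names and the toy arrays are illustrative assumptions rather than the code or data used in this study, and it assumes binary labels, a binary protected attribute, and at least one positive and one negative example in each group.</p>
      <preformat>
```python
import numpy as np


def group_rates(y_true, y_pred, group, g):
    """Return (TPR, TNR) restricted to the rows where group == g."""
    yt = y_true[group == g]
    yp = y_pred[group == g]
    tpr = float(np.mean(yp[yt == 1] == 1))
    tnr = float(np.mean(yp[yt == 0] == 0))
    return tpr, tnr


def equal_opportunity_gap(y_true, y_pred, group):
    """Absolute difference in true positive rate between groups 1 and 0."""
    tpr1, _ = group_rates(y_true, y_pred, group, 1)
    tpr0, _ = group_rates(y_true, y_pred, group, 0)
    return abs(tpr1 - tpr0)


def worst_group_balanced_accuracy(y_true, y_pred, group):
    """Minimum over groups of the mean of TPR and TNR."""
    scores = []
    for g in (0, 1):
        tpr, tnr = group_rates(y_true, y_pred, group, g)
        scores.append(0.5 * (tpr + tnr))
    return min(scores)


# Toy example (made-up values): four individuals per group.
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

eo_gap = equal_opportunity_gap(y_true, y_pred, group)
wg_bal_acc = worst_group_balanced_accuracy(y_true, y_pred, group)
print(eo_gap, wg_bal_acc)
```
      </preformat>
      <p>In this toy example group 0 has a true positive rate of 1.0 and group 1 has 0.5, giving an Equal Opportunity gap of 0.5, while both groups have a balanced accuracy of 0.75. The two metrics can therefore disagree about how unequal a classifier is.</p>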
      <p>We compare four models spanning a spectrum of complexity. At the simplest end, Logistic Regression
(LR) provides a convex linear baseline using scikit-learn with ℓ2 regularization. Random Forest
(RF) extends this baseline through an ensemble of 100 CART trees with bootstrap sampling and random
feature selection, while CatBoost (CB) introduces gradient boosting with native categorical handling.
At the most sophisticated end, we evaluate TabPFN (TPFN), a transformer-based tabular foundation
model pretrained on 100 million synthetic datasets and fine-tuned using the library’s automatic routine.
This range of models enables a direct assessment of whether complexity and pretraining confer fairness
benefits.</p>
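      <p>To incorporate fairness interventions, we implement group-specific thresholding following Hardt et
al. [8]. For each group <inline-formula><tex-math>a</tex-math></inline-formula>, we find the threshold <inline-formula><tex-math>\tau_a^{*}</tex-math></inline-formula> that equalizes true positive rates:
        <disp-formula><label>(1)</label><tex-math>\tau_a^{*} = \arg\min_{\tau} \left| \mathrm{TPR}_a(\tau) - \mathrm{TPR}_{\mathrm{overall}} \right|.</tex-math></disp-formula>
This method is theoretically grounded: Corbett-Davies et al. [16] proved that optimal fair classifiers
require group-specific thresholds, making this approach both principled and practical.</p>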
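      <p>A minimal sketch of such a group-specific threshold search is given below. It rests on several assumptions that the text does not specify: the overall target is taken to be the pooled true positive rate at a default cutoff of 0.5, the search is a simple grid over candidate thresholds, and all function names and data are hypothetical.</p>
      <preformat>
```python
import numpy as np


def tpr_at(scores, y_true, threshold):
    """True positive rate when predicting 1 for scores >= threshold."""
    positives = scores[y_true == 1]
    return float(np.mean(positives >= threshold))


def fit_group_thresholds(scores, y_true, group, base_threshold=0.5, grid=None):
    """Per group, pick the grid threshold whose TPR is closest to the pooled TPR."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    target = tpr_at(scores, y_true, base_threshold)  # assumed definition of the overall TPR
    thresholds = {}
    for g in np.unique(group):
        mask = group == g
        gaps = [abs(tpr_at(scores[mask], y_true[mask], t) - target) for t in grid]
        thresholds[int(g)] = float(grid[int(np.argmin(gaps))])
    return thresholds


def predict_with_thresholds(scores, group, thresholds):
    """Apply each individual's group-specific cutoff."""
    cutoffs = np.array([thresholds[int(g)] for g in group])
    return (scores >= cutoffs).astype(int)


# Made-up scores: at a shared cutoff of 0.5, group 0 has TPR 1.0 and group 1 has TPR 0.5.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.2, 0.55, 0.9, 0.6, 0.4, 0.3, 0.1, 0.45])
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0])
group = np.array([0] * 6 + [1] * 6)

thresholds = fit_group_thresholds(scores, y_true, group)
y_hat = predict_with_thresholds(scores, group, thresholds)
tpr_by_group = {
    g: float(np.mean(y_hat[group == g][y_true[group == g] == 1]))
    for g in (0, 1)
}
print(thresholds, tpr_by_group)
```
      </preformat>
      <p>On this toy data the search raises the cutoff for group 0 and lowers it for group 1, so that both groups end up with a true positive rate of 0.75. In practice the thresholds would be fitted on held-out data rather than on the data used for evaluation.</p>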
      <p>All experiments employ 5-fold stratified cross-validation with identical splits across models to ensure
paired statistical comparisons. Feature pre-processing follows standard practices: categorical variables
are one-hot encoded and numeric features standardized. Unless otherwise specified, we retain default
hyperparameters to avoid overfitting or optimization that could confound fairness comparisons.</p>
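      <p>The evaluation protocol can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the study's actual pipeline: the data are synthetic stand-ins, only the two scikit-learn baselines are shown (CatBoost and TabPFN are omitted to keep the example dependency-light), and the variable names and random seeds are our own choices. The key point is that the folds are materialized once and reused, so per-fold scores are paired across models.</p>
      <preformat>
```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in data: column 0 is numeric, column 1 is a categorical code.
X = np.column_stack([rng.normal(size=n), rng.integers(0, 3, size=n)])
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)


def make_model(estimator):
    """Standardize numeric columns and one-hot encode categorical ones."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), [0]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), [1]),
    ])
    return make_pipeline(preprocess, estimator)


models = {
    "LR": lambda: make_model(LogisticRegression()),  # L2 penalty is the default
    "RF": lambda: make_model(RandomForestClassifier(n_estimators=100, random_state=0)),
}

# Materialize the folds once so that every model sees identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(cv.split(X, y))

fold_accuracy = {name: [] for name in models}
for train_idx, test_idx in folds:
    for name, build in models.items():
        model = build().fit(X[train_idx], y[train_idx])
        fold_accuracy[name].append(model.score(X[test_idx], y[test_idx]))

for name, scores in fold_accuracy.items():
    print(name, np.round(scores, 3))
```
      </preformat>
      <p>Fitting the preprocessing inside the pipeline on each training fold avoids leaking test-fold statistics into the scaler or the encoder.</p>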
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>Group-specific thresholding reduces Equal Opportunity gaps by 71-87% across all models while
sacrificing only 1-3 percentage points of accuracy. Critically, this benefit is model-agnostic: sophisticated
and simple models respond similarly to calibration. Post-calibration, logistic regression achieves an EO
gap of 0.029 on Adult Income—3.4× better than uncalibrated TabPFN (0.099). This dramatic reversal
challenges assumptions about the relationship between model complexity and fairness. We performed
paired t-tests across cross-validation folds to validate our observations.</p>
      <p>TabPFN’s accuracy and balanced accuracy are statistically indistinguishable from traditional models
(p &gt; 0.05 for all pairwise comparisons). Raw EO gap differences between models do not reach significance
(p &gt; 0.1), confirming no model is inherently fairer. Group-specific thresholding significantly reduces
EO gaps for all models (p &lt; 0.001) while accuracy drops remain non-significant (p &gt; 0.1). These
results strongly support our hypothesis that architectural sophistication alone does not confer fairness
advantages.</p>
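      <p>For readers who wish to reproduce this style of comparison, a paired t-test across folds can be run with SciPy as sketched below. The per-fold values are made-up placeholders, not results from this study, and the variable names are hypothetical; the manual computation is included only to show what the test statistic measures.</p>
      <preformat>
```python
import numpy as np
from scipy import stats

# Hypothetical per-fold EO gaps for two models evaluated on the same five folds.
gaps_model_a = np.array([0.10, 0.12, 0.09, 0.11, 0.08])
gaps_model_b = np.array([0.09, 0.13, 0.10, 0.10, 0.09])

# Paired test: each fold contributes one difference between the two models.
result = stats.ttest_rel(gaps_model_a, gaps_model_b)

# Same statistic by hand: mean fold-wise difference divided by its standard error.
diff = gaps_model_a - gaps_model_b
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```
      </preformat>
      <p>Because the folds are shared, the test operates on within-fold differences, which removes fold-to-fold variability from the comparison. With only five folds the test has limited power, so a non-significant p-value should be read as absence of evidence for a difference rather than proof of equivalence.</p>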
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Our findings provide strong empirical support for theoretical predictions about the limits of fair
representation learning. TabPFN’s pretraining regime is built on synthetic distributions that, while
diverse in statistical properties, remain mathematical abstractions. Such data cannot reproduce the
complex historical and structural biases embedded in real-world systems. As a result, TabPFN inherits
the strengths of pretraining for accuracy, but it does not acquire mechanisms to address group disparities.
This limitation is reinforced by its training objective, which prioritizes predictive accuracy without
any explicit fairness constraints. In both pretraining and fine-tuning, gradient updates systematically
reward accuracy even when this amplifies inequities across groups.</p>
      <p>Our results demonstrate the real-world manifestation of fairness impossibility theorems. Kleinberg et
al. [3] proved that when base rates differ across groups, no model can simultaneously achieve calibration,
balance for the positive class, and balance for the negative class. We observe precisely this trade-off in our experiments. For
instance, on the Adult Income dataset, TabPFN attains accuracy comparable to logistic regression
but suffers from a substantially higher Equal Opportunity gap in the uncalibrated setting. Post-hoc
calibration narrows this gap, yet the improvement comes at the cost of reduced balanced accuracy.
These outcomes illustrate how even highly sophisticated foundation models are bound by the same
mathematical constraints as traditional classifiers.</p>
      <p>Beyond illustrating theoretical limits, our study sheds light on why TabPFN in particular does
not deliver fairness advantages. The synthetic datasets used in pretraining vary along axes such as
feature relationships and class distributions, but they lack the embedded discriminatory patterns that
characterize real social data. Without exposure to these complex sources of bias, TabPFN has no
opportunity to learn fairness-aware representations. Moreover, its loss function contains no fairness
component, leaving accuracy as the sole optimization goal. Together, these design choices explain why
large-scale pretraining shifts the model along the fairness–accuracy frontier but cannot redraw the
frontier itself.</p>
      <p>The implications for practice are significant. Architectural sophistication and vast pretraining do
not provide a shortcut to fairness, and practitioners cannot rely on foundation models to automatically
mitigate bias. Our experiments show that post-hoc calibration remains essential regardless of whether
one uses a simple linear classifier or a transformer-based foundation model. In fact, fairness interventions
developed for traditional models remain just as relevant, and in many cases more effective, when
applied to TabPFN. This reinforces the view that fairness arises from explicit intervention rather than
passive reliance on model complexity.</p>
      <p>These results also resonate with broader theoretical work. Zhao and Gordon [13] prove that no
representation can simultaneously preserve utility and satisfy multiple fairness criteria, and our
empirical findings provide clear validation of this claim. The effectiveness of group-specific thresholding
further supports the fairness through awareness paradigm of Dwork et al. [9], which emphasizes the
importance of explicitly considering protected attributes in decision-making. Our results suggest that
fairness-aware design principles, rather than scale alone, should guide future advances in trustworthy
machine learning.</p>
      <p>TabPFN’s scale and sophistication do not overcome the mathematical limits of fairness. The model
performs competitively on accuracy but remains subject to the same trade-offs that govern all classifiers
when base rates differ across groups. By situating these empirical outcomes within the framework of
impossibility results, we demonstrate that foundation models neither escape nor transcend established
fairness constraints. Instead, progress in fairness will come from combining architectural innovation
with deliberate fairness interventions grounded in theory.</p>
      <sec id="sec-5-1">
        <title>5.1. Limitations and Future Work</title>
        <p>Although our evaluation is systematic, several limitations must be acknowledged. First, our analysis
is restricted to three benchmark datasets, each with a binary protected attribute. Outcomes may
differ in settings involving continuous sensitive attributes or intersectional analyses that account for
multiple overlapping demographic characteristics. Second, we evaluate fairness primarily through
Equal Opportunity and Worst-Group Balanced Accuracy. While these measures are widely used and
theoretically motivated, alternative fairness definitions could reveal different patterns. Nevertheless,
impossibility results suggest that similar trade-offs would persist regardless of the chosen metric.
Another limitation is that our study does not examine fairness stability under hard distribution shift,
an issue of critical importance for real-world deployment [17]. Demographic changes or shifts in base
rates may exacerbate disparities in ways that static evaluations cannot anticipate. Understanding how
foundation models behave under such conditions remains an open question.</p>
        <p>Looking forward, several promising directions emerge. One is to embed fairness constraints directly
into the pretraining objective, potentially reshaping rather than merely shifting the fairness–accuracy
trade-off frontier. Another is to investigate how the diversity of synthetic pretraining interacts with the
complex, historically embedded patterns of bias present in real data. Such work could inform improved
design principles for foundation models, ensuring that their scale and flexibility translate into genuine
gains for fairness-critical applications.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We present the first comprehensive evaluation of algorithmic fairness in tabular foundation models,
testing whether TabPFN’s extensive pretraining on diverse synthetic data confers advantages in
handling biased real-world datasets. Our results, grounded in fairness impossibility theorems, deliver a
clear verdict: Despite theoretical promise, foundation models offer no inherent fairness benefits over
traditional approaches.</p>
      <p>This work contributes to a growing understanding of foundation model limitations. Although these
models excel at many tasks, they do not provide a universal solution to machine learning challenges.
Our findings emphasize that achieving trustworthy AI requires more than architectural innovation; it
requires careful attention to fairness interventions regardless of model sophistication.</p>
      <p>As foundation models proliferate across domains, maintaining this measured perspective becomes
increasingly critical. Our results show that fairness-impossibility theorems remain binding even in the
context of foundation models. This suggests that the path to fairness lies not in architectural scale or
model sophistication alone but in explicitly integrating theoretical insights with deliberate interventions.
Achieving trustworthy AI therefore requires embracing fairness-aware design choices.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We thank the developers of TabPFN for open-sourcing their model and providing comprehensive
documentation. We are grateful to the creators and maintainers of the fairness benchmark datasets that
made this evaluation possible. Special thanks to my parents for their unwavering support throughout
this research project, and to Mr. Hoek for introducing me to LaTeX. We also thank the TRUST-AI
reviewers for their constructive feedback that helped strengthen the theoretical grounding of this
work.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
      <p>[1] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. v. Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, E. Brynjolfsson, S. Buch, D. Card, R. Castellon, N. Chatterji, A. Chen, K. Creel, J. Q. Davis, D. Demszky, C. Donahue, M. Doumbouya, E. Durmus, S. Ermon, J. Etchemendy,
K. Ethayarajh, L. Fei-Fei, C. Finn, T. Gale, L. Gillespie, K. Goel, N. Goodman, S. Grossman, N. Guha,
T. Hashimoto, P. Henderson, J. Hewitt, D. E. Ho, J. Hong, K. Hsu, J. Huang, T. Icard, S. Jain,
D. Jurafsky, P. Kalluri, S. Karamcheti, G. Keeling, F. Khani, O. Khattab, P. W. Koh, M. Krass,
R. Krishna, R. Kuditipudi, A. Kumar, F. Ladhak, M. Lee, T. Lee, J. Leskovec, I. Levent, X. L.
Li, X. Li, T. Ma, A. Malik, C. D. Manning, S. Mirchandani, E. Mitchell, Z. Munyikwa, S. Nair,
A. Narayan, D. Narayanan, B. Newman, A. Nie, J. C. Niebles, H. Nilforoshan, J. Nyarko, G. Ogut,
L. Orr, I. Papadimitriou, J. S. Park, C. Piech, E. Portelance, C. Potts, A. Raghunathan, R. Reich,
H. Ren, F. Rong, Y. Roohani, C. Ruiz, J. Ryan, C. Ré, D. Sadigh, S. Sagawa, K. Santhanam, A. Shih,
K. Srinivasan, A. Tamkin, R. Taori, A. W. Thomas, F. Tramèr, R. E. Wang, W. Wang, B. Wu, J. Wu,
Y. Wu, S. M. Xie, M. Yasunaga, J. You, M. Zaharia, M. Zhang, T. Zhang, X. Zhang, Y. Zhang,
L. Zheng, K. Zhou, P. Liang, On the Opportunities and Risks of Foundation Models, 2022. URL:
http://arxiv.org/abs/2108.07258. doi:10.48550/arXiv.2108.07258, arXiv:2108.07258.</p>
      <p>[2] N. Hollmann, S. Müller, K. Eggensperger, F. Hutter, TabPFN: A Transformer That Solves Small
Tabular Classification Problems in a Second, 2023. URL: http://arxiv.org/abs/2207.01848. doi:10.48550/arXiv.2207.01848, arXiv:2207.01848.</p>
      <p>[3] J. Kleinberg, S. Mullainathan, M. Raghavan, Inherent Trade-Offs in the Fair Determination of
Risk Scores, 2016. URL: http://arxiv.org/abs/1609.05807. doi:10.48550/arXiv.1609.05807,
arXiv:1609.05807.</p>
      <p>[4] S. Sagawa, P. W. Koh, T. B. Hashimoto, P. Liang, Distributionally robust neural networks for group
shifts: On the importance of regularization for worst-case generalization, CoRR abs/1911.08731
(2019). URL: http://arxiv.org/abs/1911.08731. arXiv:1911.08731.</p>
      <p>[5] A. Pham, The Effect of Model Size on Worst-Group Generalization, Master’s thesis, EECS
Department, University of California, Berkeley, 2022. URL: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-138.html.</p>
      <p>[6] B. An, S. Zhu, M.-A. Panaitescu-Liess, C. K. Mummadi, F. Huang, PerceptionCLIP: Visual
classification by inferring and conditioning on contexts, 2024. URL: https://arxiv.org/abs/2308.01313.
arXiv:2308.01313.</p>
      <p>[7] J. Angwin, J. Larson, S. Mattu, L. Kirchner, Machine Bias, 2016. URL: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.</p>
      <p>[8] M. Hardt, E. Price, N. Srebro, Equality of Opportunity in Supervised Learning, 2016. URL: http://arxiv.org/abs/1610.02413. doi:10.48550/arXiv.1610.02413, arXiv:1610.02413 [cs].</p>
      <p>[9] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, R. Zemel, Fairness Through Awareness, 2011. URL:
http://arxiv.org/abs/1104.3913. doi:10.48550/arXiv.1104.3913, arXiv:1104.3913.</p>
      <p>[10] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M.
Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner,
S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language Models are Few-Shot Learners, 2020.
URL: http://arxiv.org/abs/2005.14165. doi:10.48550/arXiv.2005.14165, arXiv:2005.14165.</p>
      <p>[11] E. M. Bender, T. Gebru, A. McMillan-Major, S. Shmitchell, On the Dangers of Stochastic Parrots:
Can Language Models Be Too Big?, in: Proceedings of the 2021 ACM Conference on Fairness,
Accountability, and Transparency, ACM, Virtual Event Canada, 2021, pp. 610–623. URL: https://dl.acm.org/doi/10.1145/3442188.3445922. doi:10.1145/3442188.3445922.</p>
      <p>[12] J. Buolamwini, T. Gebru, Gender Shades: Intersectional Accuracy Disparities in Commercial
Gender Classification, in: Proceedings of the 1st Conference on Fairness, Accountability and
Transparency, PMLR, 2018, pp. 77–91. URL: https://proceedings.mlr.press/v81/buolamwini18a.html,
ISSN: 2640-3498.</p>
      <p>[13] H. Zhao, G. J. Gordon, Inherent Tradeoffs in Learning Fair Representations, 2022. URL: http://arxiv.org/abs/1906.08386. doi:10.48550/arXiv.1906.08386, arXiv:1906.08386.</p>
      <p>[14] B. Becker, R. Kohavi, Adult, 1996. URL: https://archive.ics.uci.edu/dataset/2. doi:10.24432/C5XW20.</p>
      <p>[15] H. Hofmann, Statlog (German Credit Data), 1994. URL: https://archive.ics.uci.edu/dataset/144.
doi:10.24432/C5NC77.</p>
      <p>[16] S. Corbett-Davies, J. D. Gaebler, H. Nilforoshan, R. Shroff, S. Goel, The Measure and
Mismeasure of Fairness, 2023. URL: http://arxiv.org/abs/1808.00023. doi:10.48550/arXiv.1808.00023,
arXiv:1808.00023 [cs].</p>
      <p>[17] A. D’Amour, K. Heller, D. Moldovan, B. Adlam, B. Alipanahi, A. Beutel, C. Chen, J. Deaton, J.
Eisenstein, M. D. Hoffman, F. Hormozdiari, N. Houlsby, S. Hou, G. Jerfel, A. Karthikesalingam, M. Lucic,
Y. Ma, C. McLean, D. Mincu, A. Mitani, A. Montanari, Z. Nado, V. Natarajan, C. Nielson, T. F.
Osborne, R. Raman, K. Ramasamy, R. Sayres, J. Schrouff, M. Seneviratne, S. Sequeira, H. Suresh,
V. Veitch, M. Vladymyrov, X. Wang, K. Webster, S. Yadlowsky, T. Yun, X. Zhai, D. Sculley,
Underspecification Presents Challenges for Credibility in Modern Machine Learning, 2020. URL:
http://arxiv.org/abs/2011.03395. doi:10.48550/arXiv.2011.03395, arXiv:2011.03395.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          , E. Adeli,
          <string-name>
            <given-names>R.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. v.</given-names>
            <surname>Arx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bohg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosselut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brunskill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Brynjolfsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Buch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Card</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Castellon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chatterji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Creel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Q.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demszky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Doumbouya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Durmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          , J. Etchemendy,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>