“20% Increase in fairness for Black applicants”: A Critical Examination of Fairness Measurements Offered by Startups

Corinna Hertweck1,2, Maya Guido1
1 University of Zurich, Zurich, Switzerland
2 Zurich University of Applied Sciences, Zurich, Switzerland

Abstract
Companies using machine learning are increasingly obligated to integrate fairness considerations, often driven by regulatory imperatives and public discourse. This has given rise to a startup ecosystem of companies that focus on, or at least integrate, fairness measurement in their ML observability platforms. However, fairness is a complex concept, and many questions about it remain open in research. We therefore investigate how startups deal with this complexity and present preliminary results of our ongoing analysis of the fairness startup landscape. In our analysis, we review publicly available material (such as websites) from these companies. We find two notable gaps: (1) the gap between fairness measurement in the algorithmic fairness literature and what startups actually implement, and (2) the gap between the claims made by these startups and their actual practices. Based on our findings, we make recommendations for academia, policymakers, and industry stakeholders to advance the cause of fairness in machine learning collaboratively.

Keywords
fairness, observability, startups, fairness metrics, fairness criteria, demographic parity, statistical parity

1. Introduction
With the increasing use of machine learning comes an increasing awareness of potential discrimination through automated decision-making systems. This has led to more regulation in this space (e.g., the EU AI Act [1]) and thereby to more pressure on companies that use machine learning. Consequently, ML observability platforms are starting to incorporate fairness metrics into their offerings. Some of these platforms even prioritize fairness as their primary concern.
However, it is unclear whether these platforms’ claims match what they can actually offer – especially since the field of algorithmic fairness still has many open questions to answer on the research side. Inspired by [2], we want to evaluate these platforms’ “claims and practices”. Our focus is specifically on startups that integrate some form of off-the-shelf fairness measurement into their platforms. We do not consider consulting companies that do not offer stand-alone platforms and instead provide services such as consultation or manual audits. For an overview of the AI audit ecosystem, we refer readers to [3]. We also do not consider open source platforms, which [4] has reviewed. Our goal is to provide an overview of the fairness measurement startup ecosystem and to discuss how these startups implement fairness measurement in practice. We aim to highlight the gaps between current implementations and existing research and to suggest potential improvements in both research and implementation to guide algorithmic fairness in practice.

EWAF’24: European Workshop on Algorithmic Fairness, July 01–03, 2024, Mainz, Germany
corinna.hertweck@zhaw.ch (C. Hertweck); maya.guido@uzh.ch (M. Guido)
https://hcorinna.github.io/ (C. Hertweck)
0000-0002-7639-2771 (C. Hertweck); 0000-0002-2770-1216 (M. Guido)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

2. Methods
We collected relevant startups specializing in fairness evaluations from “The ethical AI database”, a Google search, and Crunchbase, using a set of predefined keywords related to algorithmic fairness. We then filtered this list for startups that claim to offer fairness metrics. This resulted in a list of 21 startups, which we are currently investigating.
Since their platforms are proprietary products, we were not able to easily access them to check what types of fairness measurements are implemented. We therefore rely on the startups’ publicly available material, such as their websites, documentation, white papers, and video material. We review this material to document how these startups implement fairness measurement and also take note of the claims that they make about their products. The startups that we have analyzed so far are Arize [5], Etiq AI [6], FairPlay [7], Fiddler AI [8], Mona [9] and SolasAI [10].

3. Preliminary Results

3.1. Fairness Measurement
For Fiddler AI, Arize and Etiq AI, we were able to find a clear list of the implemented fairness criteria (see [11, 12, 13]). FairPlay uses one metric in all their reports, which we therefore assume is the only one that their platform measures, although they mention two more metrics in their website’s FAQ section [14]. For Mona and SolasAI, we could not find documentation listing the implemented fairness metrics, so access to the platforms would be required to evaluate this further. Note that these platforms also implement other metrics (e.g., label distribution) for evaluating different aspects. However, we focus specifically on fairness metrics and how users are guided to choose between them.

Focus on standard group fairness criteria. Of the platforms with information on which concrete fairness criteria are implemented, all but one of the implemented criteria belong to the group fairness category. Only Etiq AI mentions individual fairness [13]. However, there is no explanation of how this is implemented or how the issue of defining similarity between individuals is addressed. All other implemented fairness metrics are group fairness metrics. This clear majority resembles what we see in the open source landscape [4].
We assume that the reason for this is that group fairness is very easy to implement and requires no further input from users, whereas individual fairness or causal definitions of fairness require domain-specific input from the user.

Implemented fairness criteria. Let us now summarize which fairness criteria we know to be implemented.1

• Statistical parity / demographic parity: selection rate (probability of receiving a positive decision) equal across socio-demographic groups; implemented by all four startups
• Equal opportunity: true positive rate equal across socio-demographic groups; implemented by three startups (Fiddler AI, Arize, Etiq AI)
• False positive rate parity: false positive rate equal across socio-demographic groups; implemented by one startup (Arize)
• Equalized odds: both equal opportunity and false positive rate parity2 fulfilled; implemented by one startup (Etiq AI)
• Group benefit parity: ratio of positive decisions to positive labels equal across socio-demographic groups; implemented by one startup (Fiddler AI)
• Denial odds parity: ratio of negative decisions to positive decisions equal across socio-demographic groups. The ratio of two groups’ denial odds is described as a fairness metric in FairPlay’s FAQ section [14], but it is doubtful whether it is actually implemented.

The first four of these criteria are well-known group fairness criteria that are commonly found in the literature. However, they have also received criticism: one common theme is that these fairness criteria only look at statistics relating to the decision, not at the consequences of the decision [15, 16, 17, 18, 19] – yet what is relevant for fairness is how a decision affects decision subjects. This mismatch means that enforcing some fairness metrics can hurt marginalized groups, as shown in [20, 19]. There has thus been a call for welfare-based fairness criteria, which the analyzed tools have not implemented yet.
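For concreteness, the first five of these criteria can all be read off a per-group confusion matrix. The following is a minimal sketch in Python – our own illustration of the standard definitions, not code taken from any of the platforms; the function names are hypothetical:

```python
import numpy as np

def group_rates(y_true, y_pred):
    """Per-group rates underlying the listed group fairness criteria."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return {
        "selection_rate": y_pred.mean(),               # statistical/demographic parity
        "tpr": y_pred[y_true == 1].mean(),             # equal opportunity
        "fpr": y_pred[y_true == 0].mean(),             # false positive rate parity
        "group_benefit": y_pred.sum() / y_true.sum(),  # positive decisions / positive labels
    }

def parity_gaps(y_true, y_pred, group):
    """Absolute between-group difference for each rate (0 = exact parity).
    Equalized odds corresponds to the 'tpr' and 'fpr' gaps both being 0."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    a = group_rates(y_true[group == 0], y_pred[group == 0])
    b = group_rates(y_true[group == 1], y_pred[group == 1])
    return {k: abs(a[k] - b[k]) for k in a}
```

In practice, one would compare these gaps (or the corresponding ratios) against a tolerance rather than requiring exact equality.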
Lack of guidance. Choosing an appropriate fairness metric represents multiple value judgments about the situation at hand. This moral choice is difficult to make, and it is particularly hard for those unfamiliar with fairness and justice discussions – which we would expect to be the case for practitioners using these platforms. We therefore sought documentation from all platforms that guides users in choosing fairness metrics. Along with the specification of the fairness metrics that are implemented, Fiddler AI, Arize, Etiq AI and FairPlay all provided more information on these metrics. However, in three cases (Fiddler AI, Etiq AI and FairPlay) this information is purely formal and descriptive: it simply describes the statistical metric in words instead of using a formula. What is provided is not actual guidance, but something that merely appears to be guidance at first. See, for example, Fiddler AI’s “guidance” on two fairness criteria (the others are described similarly) in [11]:

• Group benefit: “If the two groups are treated equally, the group benefit should be the same.”
• Equal opportunity: “If the two groups are treated equally, the TPR should be the same.”

Wanting groups to be treated equally seems like a good goal, which according to Fiddler AI would mean having to fulfill both the group benefit and the equal opportunity criterion – which Fiddler AI (incorrectly) claims to be “impossible”.3 The given information is not only confusing to users but also not backed up by research. In a blog post [23], Arize provides a decision tree through which users are supposed to find appropriate fairness criteria.

1 Note that because we only have access to the documentation and white papers, but not the platforms themselves, there could be discrepancies that we cannot account for.
2 Etiq AI actually uses equal opportunity and true negative rate parity, but by fulfilling true negative rate parity, one also fulfills false positive rate parity.
This tree strongly resembles the one proposed by Aequitas [24].4 With questions such as “Does your business problem require fairness to address disparate representation or disparate errors in your ML model?”, the tree would (similar to Aequitas’ tree, cf. [4]) still be difficult to use for an uninitiated user of a fairness toolkit, as such questions assume that a user already knows what fairness requires in their context. With access limited to the platforms’ websites and documentation, it is unclear whether more guidance is available on the actual platforms. Given the unclear documentation, we do not expect this to be the case.

3.2. Critical View on Claims
In our analysis, we came across various claims about the fairness measurement and bias mitigation capabilities of startups. Some startups give the impression that fairness is fully quantifiable, with a definite metric to measure bias, even though a single fairness metric cannot capture the complexity of fairness [25]. For bias mitigation, it is common to insinuate that mitigation techniques are a solution or fix for discrimination – a techno-solutionist message [26, 27]. One example that combines both is the following claim found on FairPlay’s website, advertising why customers should use FairPlay’s platform: “20% Increase in fairness for Black applicants” [28]. Such claims carry the risk that third parties using these platforms build on the startups’ claims to ethics-wash their own products.

4. Discussion
As we have seen, most implemented fairness metrics are standard group fairness metrics. While group fairness metrics have the advantage of being easy to implement, this also bears the danger that they are used without much reflection. This issue is worsened by the platform providers not offering any sort of moral guidance for choosing fairness metrics. Moreover, many startups make misleading claims about their fairness capabilities that promote a techno-solutionist view, reducing fairness to a single number.
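That a single reported number cannot settle fairness can be seen already with the standard group fairness criteria: the same predictions can satisfy one parity criterion exactly while violating another. A self-contained sketch (our own construction, not taken from any platform):

```python
import numpy as np

# Two groups with the same selection rate but different base rates of
# positive labels: statistical parity holds, equal opportunity does not.
y_true_a = np.array([1, 1, 0, 0]); y_pred_a = np.array([1, 0, 1, 0])
y_true_b = np.array([1, 0, 1, 0]); y_pred_b = np.array([1, 0, 1, 0])

sel_a, sel_b = y_pred_a.mean(), y_pred_b.mean()  # selection rates: 0.5 and 0.5
tpr_a = y_pred_a[y_true_a == 1].mean()           # true positive rate: 0.5
tpr_b = y_pred_b[y_true_b == 1].mean()           # true positive rate: 1.0

assert sel_a == sel_b   # statistical parity satisfied
assert tpr_a != tpr_b   # equal opportunity violated
```

Whether the gap in true positive rates matters more than the matched selection rates is precisely the kind of value judgment that a single score hides.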
Although some startups have shown admirable intentions in practical fairness solutions, they are inherently driven by customer demand – which is, in this case, often a reaction to prevalent regulations. Therefore, achieving substantive fairness must be a collective responsibility that extends beyond these platforms and encompasses policymakers, researchers, the industry, and society at large.

3 Fiddler AI writes: “An important point to make is that it’s impossible to optimize all the metrics at the same time. This is something to keep in mind when analyzing fairness metrics.” With this, Fiddler AI hints at the impossibility theorems [21, 22], which mathematically show that specific criteria cannot be fulfilled at the same time under certain conditions. However, these theorems only show this impossibility for certain metrics and, for example, do not include group benefit.
4 Although we note that the work of Aequitas is not cited by Arize.

References
[1] European Commission, Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts (COM(2021) 206 final), 2021. URL: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex:52021PC0206, accessed on 2024-01-03.
[2] M. Raghavan, S. Barocas, J. Kleinberg, K. Levy, Mitigating bias in algorithmic hiring: Evaluating claims and practices, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 469–481.
[3] S. Costanza-Chock, I. D. Raji, J. Buolamwini, Who Audits the Auditors? Recommendations from a field scan of the algorithmic auditing ecosystem, in: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022, pp. 1571–1583.
[4] M. S. A. Lee, J. Singh, The landscape and gaps in open source fairness toolkits, in: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–13.
[5] Arize, The AI Observability & LLM Evaluation Platform, 2024. URL: https://arize.com/, accessed on 2024-03-28.
[6] Etiq AI, ML Testing For Everyone, 2024. URL: https://etiq.ai/, accessed on 2024-03-28.
[7] FairPlay, Fairness for People, Profits, and Progress, 2024. URL: https://fairplay.ai/, accessed on 2024-03-28.
[8] Fiddler AI, AI Observability, 2024. URL: https://www.fiddler.ai/, accessed on 2024-03-28.
[9] Mona, The Most Intelligent AI Monitoring Platform, 2023. URL: https://www.monalabs.io/, accessed on 2024-03-28.
[10] SolasAI, Reduce your algorithmic discrimination regulatory, legal and reputational risk, 2022. URL: https://www.solas.ai/, accessed on 2024-03-28.
[11] Fiddler AI, Fairness, 2023. URL: https://docs.fiddler.ai/docs/fairness, accessed on 2023-11-13.
[12] Arize, Bias Tracing (Fairness), 2023. URL: https://docs.arize.com/arize/tracing-and-troubleshooting/11.-bias-tracing-fairness, accessed on 2023-11-25.
[13] Etiq AI, Bias, 2023. URL: https://docs.etiq.ai/scan-types/bias, accessed on 2023-12-01.
[14] FairPlay, Frequently Asked Questions, 2024. URL: https://fairplay.ai/faq/, accessed on 2024-02-26.
[15] J. Finocchiaro, R. Maio, F. Monachou, G. K. Patro, M. Raghavan, A.-A. Stoica, S. Tsirtsis, Bridging machine learning and mechanism design towards algorithmic fairness, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 489–503.
[16] R. Binns, Fairness in machine learning: Lessons from political philosophy, in: S. A. Friedler, C. Wilson (Eds.), Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, PMLR, New York, NY, USA, 2018, pp. 149–159. URL: http://proceedings.mlr.press/v81/binns18a.html.
[17] C. Hertweck, C. Heitz, M.
Loi, On the moral justification of statistical parity, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 747–757. URL: https://doi.org/10.1145/3442188.3445936. doi:10.1145/3442188.3445936.
[18] H. Weerts, L. Royakkers, M. Pechenizkiy, Does the End Justify the Means? On the Moral Justification of Fairness-Aware Machine Learning, arXiv preprint arXiv:2202.08536 (2022).
[19] M. Jorgensen, H. Richert, E. Black, N. Criado, J. Such, Not so fair: The impact of presumably fair machine learning models, in: Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, 2023, pp. 297–311.
[20] L. Hu, Y. Chen, Fair classification and social welfare, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 535–545.
[21] J. Kleinberg, S. Mullainathan, M. Raghavan, Inherent trade-offs in the fair determination of risk scores, arXiv preprint arXiv:1609.05807 (2016).
[22] A. Chouldechova, Fair prediction with disparate impact: A study of bias in recidivism prediction instruments, Big Data 5 (2017) 153–163.
[23] S.-A. DeLucia, Evaluating Model Fairness, 2023. URL: https://arize.com/blog/evaluating-model-fairness/.
[24] Center for Data Science and Public Policy, University of Chicago, Aequitas, 2018. URL: http://www.datasciencepublicpolicy.org/our-work/tools-guides/aequitas/.
[25] A. Z. Jacobs, H. Wallach, Measurement and fairness, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 375–385.
[26] E. Morozov, To save everything, click here: The folly of technological solutionism, PublicAffairs, 2013.
[27] R. Abebe, S. Barocas, J. Kleinberg, K. Levy, M. Raghavan, D. G. Robinson, Roles for computing in social change, in: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 2020, pp. 252–260.
[28] FairPlay, Increase Fairness, Boost Profits, 2024. URL: https://fairplay.ai/for-banks/, accessed on 2024-03-28.