<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>European Workshop on Algorithmic Fairness, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>“20% Increase in fairness for Black applicants”: A Critical Examination of Fairness Measurements Offered by Startups</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Corinna Hertweck</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maya Guido</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Zurich</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Zurich University of Applied Sciences</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>0</volume>
      <fpage>1</fpage>
      <lpage>03</lpage>
      <abstract>
        <p>Companies using machine learning are increasingly obligated to integrate fairness considerations, often driven by regulatory imperatives and public discourse. This has given rise to a startup ecosystem of companies that focus on, or at least integrate, fairness measurement in their ML observability platforms. However, fairness is a complex concept, and many questions around it remain open in research. We therefore investigate how startups deal with this complexity and present preliminary results of our ongoing analysis of the fairness startup landscape. In our analysis, we review publicly available material (such as websites) from these companies. We find two notable gaps: (1) the gap between fairness measurement in the algorithmic fairness literature and what startups actually implement, and (2) the gap between the claims made by these startups and their actual practices. Based on our findings, we make recommendations for academia, policymakers, and industry stakeholders to advance the cause of fairness in machine learning collaboratively.</p>
      </abstract>
      <kwd-group>
        <kwd>fairness</kwd>
        <kwd>observability</kwd>
        <kwd>startups</kwd>
        <kwd>fairness metrics</kwd>
        <kwd>fairness criteria</kwd>
        <kwd>demographic parity</kwd>
        <kwd>statistical parity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the increasing use of machine learning comes an increasing awareness of
potential discrimination through automated decision-making systems. This has led to more
regulation in this space (e.g., the EU AI Act [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) and thereby to more pressure on companies
that use machine learning. Consequently, ML observability platforms are starting to
incorporate fairness metrics into their offerings. Some of these platforms even prioritize fairness
as their primary concern. However, it is unclear whether these platforms’ claims match what they
can actually offer – especially since the field of algorithmic fairness still has
many open research questions. Inspired by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we want to evaluate
these platforms’ “claims and practices”. Our focus is specifically on startups that integrate
some form of off-the-shelf fairness measurement into their platforms. We do not consider
consulting companies that do not offer stand-alone platforms and instead provide services such
as consultation or manual audits. For an overview of the AI audit ecosystem, we refer readers
to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We also do not consider open source platforms, which [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] has reviewed. Our goal is to
provide an overview of the fairness measurement startup ecosystem and to discuss how these
startups implement fairness measurement in practice. We aim to highlight the gaps between
current implementations and existing research and suggest potential improvements in both
research and implementation to guide algorithmic fairness in practice.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>
        We collected relevant startups specializing in fairness evaluations from “The ethical AI database”,
Google search and Crunchbase, using a set of predefined keywords related to algorithmic fairness.
We then filtered this list for startups that claim to offer fairness metrics. This resulted in a
list of 21 startups, which we are currently investigating. Since their platforms are proprietary
products, we were not able to easily access them to check what types of fairness measurements
are implemented. We therefore rely on startups’ publicly available material, such as their
website, documentation, white papers and video material. We review this material to document
how these startups implement fairness measurement and also take note of the claims that they
are making about their products. The startups that we have analyzed so far are Arize [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Etiq
AI [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], FairPlay [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Fiddler AI [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Mona [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and SolasAI [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminary Results</title>
      <sec id="sec-3-1">
        <title>3.1. Fairness Measurement</title>
        <p>
          For Fiddler AI, Arize and Etiq AI, we were able to find a clear list of the implemented fairness
criteria (see [
          <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
          ]). FairPlay uses one metric in all their reports, which we therefore assume
is the only one that their platform measures although they mention two more metrics on their
website’s FAQ section [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. For Mona and SolasAI, we could not find documentation that listed
the implemented fairness metrics, so access to the platform would be required to evaluate this
further. Note that these platforms also implement other metrics (e.g., label distribution) for
evaluating different aspects. However, we focus specifically on fairness metrics and how users
are guided to choose between them.
        </p>
        <p>
          Focus on standard group fairness criteria Of the platforms with information on which
concrete fairness criteria are implemented, all but one of the implemented criteria belong to
the group fairness category. Only Etiq AI mentions individual fairness [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. However, there is
no explanation of how this is implemented or how the issue of defining similarity between
individuals is addressed. All other implemented fairness metrics are group fairness metrics. This
is a clear majority that resembles what we see in the open source landscape [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. We assume that
the reason for this is that group fairness is very easy to implement and requires no further input
from users, whereas individual fairness or causal definitions of fairness require domain-specific
input from the user.
        </p>
        <p>Implemented fairness criteria Let us now summarize which fairness criteria we know to
be implemented.1</p>
        <p>
          • Statistical parity / demographic parity: selection rate (probability of receiving a
positive decision) equal across socio-demographic groups; implemented by all four startups
• Equal opportunity: true positive rate equal across socio-demographic groups;
implemented by three startups (Fiddler AI, Arize, Etiq AI)
• False positive rate parity: false positive rate equal across socio-demographic groups;
implemented by one startup (Arize)
• Equalized odds: both equal opportunity and false positive rate parity2 fulfilled;
implemented by one startup (Etiq AI)
• Group benefit parity: ratio of positive decisions to positive labels equal across
socio-demographic groups; implemented by one startup (Fiddler AI)
• Denial odds parity: ratio of negative decisions to positive decisions equal across
socio-demographic groups. The ratio of two groups’ denial odds is described as a fairness metric
in FairPlay’s FAQ section [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], but it is doubtful whether it is actually implemented.
        </p>
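The group fairness criteria above all reduce to comparing simple per-group rates. As a rough illustration of this (our own sketch, not any startup's actual implementation; the function and its names are ours), the underlying rates can be computed from binary decisions, labels, and group labels as follows:

```python
from collections import defaultdict

def group_rates(decisions, labels, groups):
    """Per-group rates behind the criteria above: selection rate
    (statistical parity), TPR (equal opportunity), FPR (false positive
    rate parity) and group benefit (positive decisions / positive labels).
    Inputs are parallel lists of 0/1 values and group labels."""
    counts = defaultdict(lambda: dict(n=0, pos_dec=0, pos_lab=0, neg_lab=0, tp=0, fp=0))
    for d, y, g in zip(decisions, labels, groups):
        c = counts[g]
        c["n"] += 1
        c["pos_dec"] += d
        c["pos_lab"] += y
        c["neg_lab"] += 1 - y
        c["tp"] += d * y          # decided positive, label positive
        c["fp"] += d * (1 - y)    # decided positive, label negative
    return {
        g: {
            "selection_rate": c["pos_dec"] / c["n"],
            "tpr": c["tp"] / c["pos_lab"] if c["pos_lab"] else None,
            "fpr": c["fp"] / c["neg_lab"] if c["neg_lab"] else None,
            "group_benefit": c["pos_dec"] / c["pos_lab"] if c["pos_lab"] else None,
        }
        for g, c in counts.items()
    }
```

A criterion such as statistical parity is then fulfilled (approximately) when `selection_rate` agrees across groups; equalized odds requires both `tpr` and `fpr` to agree.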
        <p>
          The first four of these criteria are well-known group fairness criteria that are commonly
found in the literature. However, they have also received criticism: one common theme is that
these fairness criteria only look at statistics relating to the decision but not at the consequences
of the decision [
          <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17, 18, 19</xref>
          ]. However, what is relevant for fairness is how a decision
affects decision subjects. This mismatch can mean that enforcing some fairness metrics could
hurt marginalized groups, as shown in [20, 19]. There has thus been a call for welfare-based
fairness criteria, which the analyzed tools have not implemented yet.
        </p>
        <p>
          Lack of guidance Choosing an appropriate fairness metric involves multiple value
judgments about the situation at hand. This moral choice is difficult to make, and particularly hard if
one is not familiar with debates on fairness and justice – which we would expect to be the case
for practitioners using these platforms. We therefore sought documentation from all platforms
that guides users in choosing fairness metrics. Along with the specification of the fairness metrics
that are implemented, Fiddler AI, Arize, Etiq AI and FairPlay all provided more information
on these metrics. However, in three cases (Fiddler AI, Etiq AI and FairPlay) this information is
purely formal and descriptive. They simply describe the statistical metric in words instead of
using a formula. What is provided is not actual guidance, but something that merely appears
to be guidance at first. See, for example, Fiddler AI’s “guidance” on two fairness criteria (the
others are described similarly) in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]:
• Group benefit: “If the two groups are treated equally, the group benefit should be the
same.”
• Equal opportunity: “If the two groups are treated equally, the TPR should be the same.”
1Note that because we only have access to the documentation and white papers, but not the platforms themselves,
there could be discrepancies that we cannot account for.
2Etiq AI actually uses equal opportunity and true negative rate parity, but by fulfilling true negative rate parity, one
also fulfills false positive rate parity.
        </p>
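The equivalence used in footnote 2 is immediate from the definitions of the two rates:

```latex
\mathrm{FPR} \;=\; \frac{FP}{FP + TN} \;=\; 1 - \frac{TN}{TN + FP} \;=\; 1 - \mathrm{TNR},
```

so groups with equal true negative rates necessarily have equal false positive rates, and vice versa.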
        <p>Wanting groups to be treated equally seems like a good goal, which according to Fiddler AI
would mean having to fulfill both the group benefit and equal opportunity criterion – which
Fiddler AI (incorrectly) claims to be “impossible”.3 The given information is not only confusing
to users but also not backed up by research.</p>
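Indeed, a simple example shows that group benefit and equal opportunity can hold simultaneously: a classifier whose decisions coincide with the labels gives every group a TPR and a group benefit of exactly 1. The sketch below (our own toy illustration, not Fiddler AI's code) checks this for a split into two groups:

```python
def tpr_and_group_benefit(decisions, labels, groups, g):
    """TPR and group benefit (positive decisions / positive labels)
    restricted to the members of group g; inputs are parallel lists."""
    idx = [i for i, gi in enumerate(groups) if gi == g]
    pos = [i for i in idx if labels[i] == 1]
    tpr = sum(decisions[i] for i in pos) / len(pos)
    benefit = sum(decisions[i] for i in idx) / len(pos)
    return tpr, benefit

labels    = [1, 0, 1, 1, 0, 1]
decisions = list(labels)               # a perfect classifier
groups    = ["a", "a", "a", "b", "b", "b"]

# Both criteria are fulfilled: TPR and group benefit equal 1 in each group.
assert tpr_and_group_benefit(decisions, labels, groups, "a") == (1.0, 1.0)
assert tpr_and_group_benefit(decisions, labels, groups, "b") == (1.0, 1.0)
```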
        <p>
          In a blog post [23], Arize provides a decision tree through which users are supposed to
find appropriate fairness criteria. This tree strongly resembles the one proposed by Aequitas
[24].4 With questions such as “Does your business problem require fairness to address disparate
representation or disparate errors in your ML model?”, the tree would (similar to Aequitas’ tree,
cf. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]) still be difficult to use for an uninitiated user of a fairness toolkit, as it assumes that
the user already knows what fairness requires in their context.
        </p>
        <p>With access limited to the platforms’ websites and documentation, it is unclear whether more
guidance is available on the platforms themselves. Given the unclear documentation, we do not
expect this to be the case.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Critical View on Claims</title>
        <p>In our analysis, we came across various claims about fairness measurement and bias mitigation
capabilities of startups. Some startups give the impression that fairness is fully quantifiable
with a definite metric to measure bias, even though a single fairness metric cannot capture
the complexity of fairness [25]. For bias mitigation, it is common to insinuate that mitigation
techniques are a solution or fix for discrimination – a techno-solutionist message [26, 27]. One
example that combines both is the following claim found on FairPlay’s website, advertising
why customers should use FairPlay’s platform: “20% Increase in fairness for Black applicants”
[28]. These kinds of claims carry the risk that third parties using these platforms build on the
startups’ claims to ethics-wash their own products.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>As we have seen, most implemented fairness metrics are standard group fairness metrics. While
group fairness metrics have the advantage of being easy to implement, this also carries the danger
that they are used without much reflection. This issue is worsened by the platform providers
not offering any sort of moral guidance for choosing fairness metrics. Moreover, many startups
make misleading claims about their fairness capabilities that promote a techno-solutionist view,
reducing fairness to a single number. Although some startups have shown admirable intentions
in their practical fairness solutions, they are ultimately driven by customer demand – which in this
case is often a reaction to existing regulation. Therefore, achieving substantive fairness must be
a collective responsibility that extends beyond these platforms and encompasses policymakers,
researchers, the industry and society at large.
3Fiddler AI writes “An important point to make is that it’s impossible to optimize all the metrics at the same time.
This is something to keep in mind when analyzing fairness metrics.” With this, Fiddler AI hints at the impossibility
theorems [21, 22], which mathematically show the impossibility of fulfilling specific criteria at the same time under
certain conditions. However, they only showed this impossibility for certain metrics and, for example, did not
include group benefit.
4Although we note that the work of Aequitas is not cited by Arize.
[18] H. Weerts, L. Royakkers, M. Pechenizkiy, Does the End Justify the Means? On the Moral</p>
      <p>Justification of Fairness-Aware Machine Learning, arXiv preprint arXiv:2202.08536 (2022).
[19] M. Jorgensen, H. Richert, E. Black, N. Criado, J. Such, Not so fair: The impact of presumably
fair machine learning models, in: Proceedings of the 2023 AAAI/ACM Conference on AI,
Ethics, and Society, 2023, pp. 297–311.
[20] L. Hu, Y. Chen, Fair classification and social welfare, in: Proceedings of the 2020 Conference
on Fairness, Accountability, and Transparency, 2020, pp. 535–545.
[21] J. Kleinberg, S. Mullainathan, M. Raghavan, Inherent trade-offs in the fair determination
of risk scores, arXiv preprint arXiv:1609.05807 (2016).
[22] A. Chouldechova, Fair prediction with disparate impact: A study of bias in recidivism
prediction instruments, Big data 5 (2017) 153–163.
[23] S.-A. DeLucia, Evaluating Model Fairness, 2023. URL: https://arize.com/blog/
evaluating-model-fairness/.
[24] Center for Data Science and Public Policy, University of Chicago, Aequitas, 2018. URL:
http://www.datasciencepublicpolicy.org/our-work/tools-guides/aequitas/.
[25] A. Z. Jacobs, H. Wallach, Measurement and fairness, in: Proceedings of the 2021 ACM
conference on fairness, accountability, and transparency, 2021, pp. 375–385.
[26] E. Morozov, To save everything, click here: The folly of technological solutionism,
PublicAffairs, 2013.
[27] R. Abebe, S. Barocas, J. Kleinberg, K. Levy, M. Raghavan, D. G. Robinson, Roles for
computing in social change, in: Proceedings of the 2020 conference on fairness, accountability,
and transparency, 2020, pp. 252–260.
[28] FairPlay, Increase Fairness, Boost Profits, 2024. URL: https://fairplay.ai/for-banks/, accessed
on 2024-03-28.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>European</given-names>
            <surname>Commission</surname>
          </string-name>
          ,
          <article-title>Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts (COM(</article-title>
          <year>2021</year>
          )
          <article-title>206 final</article-title>
          ),
          <year>2021</year>
          . URL: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex:52021PC0206, accessed on 2024-01-03.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barocas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. Levy</surname>
          </string-name>
          ,
          <article-title>Mitigating bias in algorithmic hiring: Evaluating claims and practices</article-title>
          ,
          <source>in: Proceedings of the 2020 conference on fairness, accountability, and transparency</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>469</fpage>
          -
          <lpage>481</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Costanza-Chock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. D.</given-names>
            <surname>Raji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Buolamwini</surname>
          </string-name>
          ,
          <article-title>Who Audits the Auditors? Recommendations from a field scan of the algorithmic auditing ecosystem</article-title>
          ,
          <source>in: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1571</fpage>
          -
          <lpage>1583</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. S. A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>The landscape and gaps in open source fairness toolkits</article-title>
          ,
          <source>in: Proceedings of the 2021 CHI conference on human factors in computing systems</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Arize</surname>
          </string-name>
          ,
          <source>The AI Observability &amp; LLM Evaluation Platform</source>
          ,
          <year>2024</year>
          . URL: https://arize.com/, accessed on 2024-03-28.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Etiq</surname>
            <given-names>AI</given-names>
          </string-name>
          , ML Testing For Everyone,
          <year>2024</year>
          . URL: https://etiq.ai/, accessed on 2024-03-28.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] FairPlay, Fairness for People, Profits, and Progress,
          <year>2024</year>
          . URL: https://fairplay.ai/, accessed on 2024-03-28.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Fiddler</surname>
            <given-names>AI</given-names>
          </string-name>
          ,
          <source>AI Observability</source>
          ,
          <year>2024</year>
          . URL: https://www.fiddler.ai/, accessed on 2024-03-28.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Mona</surname>
          </string-name>
          ,
          <source>The Most Intelligent AI Monitoring Platform</source>
          ,
          <year>2023</year>
          . URL: https://www.monalabs.io/, accessed on 2024-03-28.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>SolasAI</surname>
          </string-name>
          ,
          <article-title>Reduce your algorithmic discrimination regulatory, legal and reputational risk</article-title>
          ,
          <year>2022</year>
          . URL: https://www.solas.ai/, accessed on 2024-03-28.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Fiddler</surname>
            <given-names>AI</given-names>
          </string-name>
          , Fairness,
          <year>2023</year>
          . URL: https://docs.fiddler.ai/docs/fairness, accessed on 2023-11-13.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Arize</surname>
          </string-name>
          , Bias Tracing (Fairness),
          <year>2023</year>
          . URL: https://docs.arize.com/arize/tracing-and-troubleshooting/11.-bias-tracing-fairness, accessed on 2023-11-25.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Etiq</surname>
            <given-names>AI</given-names>
          </string-name>
          , Bias,
          <year>2023</year>
          . URL: https://docs.etiq.ai/scan-types/bias, accessed on 2023-12-01.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>FairPlay</surname>
          </string-name>
          , Frequently Asked Questions,
          <year>2024</year>
          . URL: https://fairplay.ai/faq/, accessed on 2024-02-26.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Finocchiaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Maio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Monachou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Patro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-A.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Tsirtsis,</surname>
          </string-name>
          <article-title>Bridging machine learning and mechanism design towards algorithmic fairness</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>489</fpage>
          -
          <lpage>503</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Binns</surname>
          </string-name>
          ,
          <article-title>Fairness in machine learning: Lessons from political philosophy</article-title>
          , in: S. A.
          <string-name>
            <surname>Friedler</surname>
          </string-name>
          , C. Wilson (Eds.),
          <source>Proceedings of the 1st Conference on Fairness, Accountability and Transparency</source>
          , volume
          <volume>81</volume>
          <source>of Proceedings of Machine Learning Research</source>
          , PMLR, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>159</lpage>
          . URL: http://proceedings.mlr.press/v81/binns18a.html.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hertweck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Heitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Loi</surname>
          </string-name>
          ,
          <article-title>On the moral justification of statistical parity</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</source>
          , FAccT '21,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , p.
          <fpage>747</fpage>
          -
          <lpage>757</lpage>
          . URL: https://doi.org/10.1145/3442188.3445936. doi:10.1145/3442188.3445936.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>