<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Watching the Watchers: A Comparative Fairness Audit of Cloud-based Content Moderation Services</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Hartmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amin Oueslati</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitri Staufer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Electrical Engineering and Computer Science</institution>
          ,
          <addr-line>TU Berlin</addr-line>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>Online platforms face the challenge of moderating an ever-increasing volume of content, including harmful hate speech. In the absence of clear legal definitions and a lack of transparency regarding the role of algorithms in shaping decisions on content moderation, there is a critical need for external accountability. Our study contributes to filling this gap by systematically evaluating four leading cloud-based content moderation services through a third-party audit, highlighting issues such as biases against minorities and vulnerable groups that may arise through over-reliance on these services. Using a black-box audit approach and four benchmark data sets, we measure performance in explicit and implicit hate speech detection as well as counterfactual fairness through perturbation sensitivity analysis and present disparities in performance for certain target identity groups and data sets. Our analysis reveals that all services had difficulties detecting implicit hate speech, which relies on more subtle and codified messages. Moreover, our results point to the need to remove group-specific bias. It seems that biases towards some groups, such as Women, have been mostly rectified, while biases towards other groups, such as LGBTQ+ and PoC, remain.</p>
      </abstract>
      <kwd-group>
        <kwd>Content moderation as a service</kwd>
        <kwd>hate speech detection</kwd>
        <kwd>third-party audit</kwd>
        <kwd>NLP fairness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Hate speech has real-world effects, including the suppression of voices, exclusion, discrimination,
and violence against minorities [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. It is all the more concerning that, with the rise of online
content in the digital age, more pernicious and unwanted content, such as hate speech and
discriminatory content, is proliferating [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Online platforms have responded to the proliferation of online
hate speech by adopting extensive content moderation regimes [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], in which human moderators, assisted by algorithms, assess
potentially hateful content against so-called community guidelines [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Absent a translation of hate speech operationalizations into
practice, private companies are given substantial autonomy in their moderation practices,
effectively making them the judges of public speech [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. The largest technology firms, such as
Google, Microsoft, Amazon, and OpenAI, additionally offer content moderation as a service via
cloud-based API access. While most organisations do not report the extent to which algorithms
shape content moderation, the sheer amount of online speech makes reliance on algorithmic
moderation inevitable [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        The risks associated with hate speech are not limited to its lack of regulation or moderation.
Over-moderation and under-moderation of specific groups and the non-functionality of
automated hate speech classification can lead to serious harm. If content moderation algorithms
malfunction, some users are wrongfully censored, while others are insufficiently protected
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Open-source content moderation algorithms have continuously displayed biases against
minorities and target groups [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15">10, 11, 12, 13, 14, 15</xref>
        ].
      </p>
      <p>
        Nonetheless, no systematic evaluation of cloud-based content moderation services exists,
resulting in an alarming absence of public scrutiny. This paper’s contribution is twofold. Firstly, it
offers the first comprehensive fairness assessment of four major cloud-based content moderation
algorithms, which are likely in use through the SaaS model. Secondly, our
auditing strategy may inform future bias audits of (cloud-based) content moderation algorithms.
Importantly, our proposed approach solely assumes limited black-box access [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and offers
guidance on reinforced sampling strategies to achieve maximal scrutiny with limited resources,
noting the realities of unsolicited audits from civil society organisations and academia [
        <xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Data and Method</title>
      <p>
        We gained researcher access to the Google Moderate Text API, Amazon Comprehend, Microsoft
Azure Content Moderation, and the OpenAI Content Moderation API. These services generate a
hate speech score per text sequence, often split across several sub-categories, as well as a binary
flag. Our study uses the MegaSpeech, Jigsaw, HateXplain, and ToxiGen datasets [
        <xref ref-type="bibr" rid="ref14 ref20 ref21 ref22">20, 21, 22, 14</xref>
        ].
The selected datasets capture various forms of hate speech, with ToxiGen containing implicit
and adversarial hate speech constructed around indirect messages [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], while MegaSpeech and
ToxiGen use generative AI to diversify speech corpora [
        <xref ref-type="bibr" rid="ref14 ref20">20, 14</xref>
        ]. Jigsaw and HateXplain contain
human-written examples labeled by annotators, with MegaSpeech containing more hate speech
corpora but no target group labels. MegaSpeech, HateXplain, and ToxiGen provide shorter text
sequences, with on average 17.7, 23.3, and 18.1 words respectively, while Jigsaw is made up of
longer sequences, with 48.3 words on average.
      </p>
      <p>
        We evaluate all cloud-based moderation algorithms across all datasets on a set of
threshold-variant and threshold-invariant performance metrics [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ] at an aggregate level and also
specifically for vulnerable groups. We ensure consistency across datasets by mapping these
onto seven vulnerable groups (Women, LGBTQ+, PoC, Muslim, Asian, Jewish, Latinx). Since
MegaSpeech comes without labels, we train a Bi-LSTM model on the data set collected by
Yoder et al. [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] (preliminary evaluation accuracy 78 %) for target identity classification. At
the group-level, we compute the pinned ROC AUC, a metric proposed by Dixon et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
designed to provide a more robust measure for scale-invariant performance comparison across
sub-groups. While this approach comes with its pitfalls, as the authors themselves note in a
subsequent paper, it is the best scale-invariant metric to date when presented with group-level
variation in biases [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
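      <p>As a rough illustration of the pinned ROC AUC described above, the following sketch follows the general recipe of Dixon et al.: the examples of one group are combined with an equally sized random sample of the full data set, and the ROC AUC is computed on this combined set. The function names, the toy inputs, and the sampling details (sampling without replacement with a fixed seed) are illustrative assumptions rather than the implementation used in this study.</p>
      <preformat>
```python
import random


def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positives and negatives; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pinned_auc(labels, scores, in_group, seed=0):
    """AUC on the group's examples plus an equally sized sample of the full data."""
    group_idx = [i for i, g in enumerate(in_group) if g]
    rng = random.Random(seed)
    # Assumption: the background sample is drawn without replacement
    # from all examples, including those of the group itself.
    background_idx = rng.sample(range(len(labels)), len(group_idx))
    idx = group_idx + background_idx
    return roc_auc([labels[i] for i in idx], [scores[i] for i in idx])


# Toy data (hypothetical): 1 = hate speech, scores in [0, 1].
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
in_group = [True, True, False, False, False, False]
print(pinned_auc(labels, scores, in_group))
```
      </preformat>
      <p>Because the background sample is random, repeated runs with different seeds would give a sense of the sampling variance of the metric for small groups.</p>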
      <p>
        Perturbation Sensitivity Analysis (PSA) offers an additional, arguably more robust evaluation
of group-level biases by using counterfactual fairness evaluation [
        <xref ref-type="bibr" rid="ref27">27</xref>
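        ]. We follow prior research
in defining an anchor group against which other groups are compared [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Using the dominant
majority group as baseline, Counterfactual Token Fairness (CFT) scores are computed as the
difference in toxicity between the baseline and the corresponding minority group.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th>Dataset</th>
              <th>Moderation Service</th>
              <th>ROC AUC</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>ToxiGen</td>
              <td>Amazon</td>
              <td>70.4%</td>
            </tr>
            <tr>
              <td>ToxiGen</td>
              <td>OpenAI</td>
              <td>70.3%</td>
            </tr>
            <tr>
              <td>ToxiGen</td>
              <td>Google</td>
              <td>62.7%</td>
            </tr>
            <tr>
              <td>ToxiGen</td>
              <td>Microsoft</td>
              <td>59.8%</td>
            </tr>
            <tr>
              <td>Jigsaw</td>
              <td>Amazon</td>
              <td>92.2%</td>
            </tr>
            <tr>
              <td>Jigsaw</td>
              <td>OpenAI</td>
              <td>78.6%</td>
            </tr>
            <tr>
              <td>Jigsaw</td>
              <td>Google</td>
              <td>69.9%</td>
            </tr>
            <tr>
              <td>Jigsaw</td>
              <td>Microsoft</td>
              <td>75.8%</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>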
      <p>PSA makes two assumptions: First, counterfactual pairs should convey the same or neutral
meaning, avoiding any implicit biases or derogatory connotations. While constructing toxic
counterfactuals is theoretically possible, it is methodologically demanding and exceeds the scope
of this project. Instead, we construct 34 neutral counterfactual pairs. Importantly, each minority
group is represented by multiple tokens, reflecting its different semantic representations. For
instance, the minority group female also manifests as woman and women. Second, there should be
no unique interactions between a particular minority token and the context of the sentence that
would skew the analysis. This is challenging in real-world applications, as certain combinations
might evoke stereotypes or specific cultural connotations. Thus, the project uses data consisting
largely of short and explicit statements.</p>
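      <p>To illustrate the first assumption, a minimal sketch of how neutral counterfactual pairs can be built by token substitution is given below. The template, the placeholder syntax, and the token pairs (including the choice of man and men as baseline tokens) are hypothetical examples and not the 34 pairs used in this study.</p>
      <preformat>
```python
# Hypothetical (baseline token, minority token) pairs for one identity dimension.
TOKEN_PAIRS = [("man", "woman"), ("men", "women")]


def build_pairs(template, token_pairs=TOKEN_PAIRS):
    """Fill one sentence template with each baseline/minority token pair.

    Returns a list of (baseline_sentence, minority_sentence) tuples that
    differ only in the identity token.
    """
    return [
        (template.format(identity=base), template.format(identity=minority))
        for base, minority in token_pairs
    ]


pairs = build_pairs("The {identity} went to work.")
for baseline_sentence, minority_sentence in pairs:
    print(baseline_sentence, "|", minority_sentence)
```
      </preformat>
      <p>Keeping templates short and free of stereotyped context, as in this toy template, is what makes the second assumption more plausible.</p>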
      <p>
        Furthermore, CFT scores are calculated separately for toxic and non-toxic statements, with
the latter generally supporting the assumption of counterfactual symmetry more consistently.
PSA experiments are conducted using two distinct data sets. First, the synthetic Identity Phrase
Templates from Dixon et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] are used. The set contains 77,000 synthetic examples of which
50% are toxic. These avoid stereotypes and complex sentence structures by design, which
ensures that the symmetric counterfactual assumption is met. Mapping the dataset, which
contains a broader set of identities, to the 34 minority tokens relevant to this study, results in
25,738 sentence pairs. Second, by applying the same logic, 9,190 sentence pairs are derived from
the MegaSpeech dataset.
      </p>
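      <p>The CFT computation can be sketched as follows. The sign convention (minority score minus baseline score, so that positive values indicate higher toxicity for the minority token), the score callable, and the toy scorer are assumptions made for illustration; in an audit, the score callable would wrap a request to one of the moderation services.</p>
      <preformat>
```python
from statistics import mean


def cft_scores(pairs, score):
    """Per-pair toxicity difference: score(minority) - score(baseline)."""
    return [score(minority) - score(baseline) for baseline, minority in pairs]


def mean_cft_by_label(pairs, is_toxic, score):
    """Mean CFT score, computed separately for toxic and non-toxic pairs."""
    diffs = cft_scores(pairs, score)
    toxic = [d for d, t in zip(diffs, is_toxic) if t]
    non_toxic = [d for d, t in zip(diffs, is_toxic) if not t]
    return {
        "toxic": mean(toxic) if toxic else None,
        "non_toxic": mean(non_toxic) if non_toxic else None,
    }


def toy_score(text):
    """Stand-in scorer (hypothetical); NOT a real moderation API."""
    base = 0.9 if "hate" in text else 0.1
    return base + (0.05 if "women" in text else 0.0)


pairs = [("I like men.", "I like women."), ("I hate men.", "I hate women.")]
result = mean_cft_by_label(pairs, [False, True], toy_score)
print(result)
```
      </preformat>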
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>Table 1 shows aggregated performance results for the chosen benchmark data sets. Our results
indicate notable disparities between moderation APIs. OpenAI’s content moderation
algorithm performs best on MegaSpeech, and Amazon Comprehend on Jigsaw and ToxiGen,
generalising well across data sets. However, Amazon Comprehend’s near-optimal
performance on Jigsaw (92.2 % ROC AUC) suggests that the Jigsaw data may have been
included in Amazon Comprehend API’s training process. Overall, Google’s API shows the worst
performance across data sets. Its poor performance seems driven by a comparably high FPR,
which suggests that the algorithm tends to overmoderate. In contrast, Microsoft Azure Content
Moderation is associated with a high FNR, suggesting it often misses hate speech.</p>
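      <p>For reference, the two threshold-variant error rates discussed above can be computed from gold labels and binary flags as sketched below. The function name and the toy labels are illustrative assumptions.</p>
      <preformat>
```python
def error_rates(y_true, y_flag):
    """Return (FPR, FNR) from gold labels and binary flags (1 = hate speech)."""
    fp = sum(1 for y, f in zip(y_true, y_flag) if y == 0 and f == 1)
    fn = sum(1 for y, f in zip(y_true, y_flag) if y == 1 and f == 0)
    negatives = sum(1 for y in y_true if y == 0)
    positives = sum(1 for y in y_true if y == 1)
    fpr = fp / negatives if negatives else float("nan")
    fnr = fn / positives if positives else float("nan")
    return fpr, fnr


# An overmoderating service flags benign text (high FPR) ...
over = error_rates([0, 0, 0, 1], [1, 1, 0, 1])
# ... while an undermoderating one misses hate speech (high FNR).
under = error_rates([0, 1, 1, 1], [0, 0, 0, 1])
print(over, under)
```
      </preformat>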
      <p>
        Furthermore, all services struggle to detect implicit hate speech, reflected in their high False
Positive Rates on ToxiGen. In this respect, commercial moderation services do not fare much better
than their open-source counterparts [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. One likely cause is the limited availability of implicit
hate speech datasets for training purposes.
      </p>
      <p>
        The comparative fairness evaluation across identity groups is presented via group-level pinned
ROC AUC scores in Figure 1. Due to space constraints, we only present one metric (ROC AUC);
a comprehensive analysis is left to future work. We find that all services tend to overmoderate
speech concerning groups PoC and LGBTQ+. This is somewhat surprising, as extensive prior
research has already uncovered biases in open-source content moderation algorithms in relation to these
groups [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. Commonly, such overmoderation occurs because toxic speech concerning these groups is
overrepresented in the training data, and this association is subsequently learned by the model. Most services fail
to reliably detect hate speech aimed at groups Disability, Asian, and Latinx. Lastly, the tendency
of Google Text Moderation to overmoderate is puzzling but also alarming. While we cannot
entirely rule out an error on our end, this observation is robust to different configurations of API
sub-categories. Figure 1 (right) displays the PSA results. We find that (1) differences in toxicity scores
are by and large more pronounced on non-toxic than on toxic data. Intuitively this makes sense, as
scores are generated non-linearly with a definite upper bound. Thus, when other elements in a
sentence induce a high toxicity score, the marginal effect from identity tokens is comparably
lower. We further find (2) greater variation in the mean CFT scores in non-synthetic than
in synthetic data. This was to be expected, as the sentences from MegaSpeech contain more
contextual information that interacts with the tokens. Overall, the results suggest that most
minorities are associated with higher levels of toxicity than dominant majorities, although these
effects appear relatively small and vary across groups and services. Group LGBTQ+ seems
associated with the strongest negative bias, occurring for all samples and services. We observe
limited negative bias against groups Latinx and Asian.
      </p>
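      <p>The saturation argument behind finding (1) can be made concrete with a small numerical sketch. It assumes, purely for illustration, that a score is a logistic function of an additive logit and that an identity token shifts this logit by a fixed amount; the actual scoring functions of the audited services are not public, and the logit values below are hypothetical.</p>
      <preformat>
```python
import math


def sigmoid(z):
    """Logistic function, bounded above by 1."""
    return 1.0 / (1.0 + math.exp(-z))


def marginal_effect(context_logit, token_shift=0.5):
    """Change in score when a fixed token shift is added to the context logit."""
    return sigmoid(context_logit + token_shift) - sigmoid(context_logit)


non_toxic_effect = marginal_effect(-1.0)  # moderately low-scoring context
toxic_effect = marginal_effect(3.0)       # context already scoring close to 1
print(round(non_toxic_effect, 3), round(toxic_effect, 3))
```
      </preformat>
      <p>Under these assumptions, the same token shift moves the score by roughly 0.11 in the non-toxic context but by less than 0.02 in the toxic one, which mirrors the smaller differences observed on toxic data.</p>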
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In summary, we uncovered both aggregate-level performance issues and group-level biases in
major commercial cloud-based content moderation services. Importantly, while some
shortcomings extend to all services, such as difficulty in detecting implicit hate speech or biases against
group LGBTQ+, others are confined to a particular service.</p>
      <p>Over the years, a substantial body of research has documented the biases and limitations of
automated hate speech detection classifiers. Nevertheless, these limitations persist in current
content moderation APIs. We demonstrated that all four tested content moderation APIs show
disparities in performance for specific target groups and for implicit hate speech, overmoderate
target groups which are strongly associated with hate speech online, and penalize counter
speech as well as reappropriation.</p>
      <p>Challenges we encountered, such as the inherent subjectivity of hate speech moderation
and data limitations, should not deter but encourage future work. Without public scrutiny, the
subjectivity does not vanish; instead, these subjective choices remain entirely at the discretion
of private companies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Matsuda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Lawrence III</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Delgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Crenshaw</surname>
          </string-name>
          ,
          <article-title>Words That Wound: Critical Race Theory, Assaultive Speech, and The First Amendment</article-title>
          , Faculty Books,
          <year>1993</year>
          . URL: https://scholarship.law.columbia.edu/books/287.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Marques</surname>
          </string-name>
          ,
          <article-title>The expression of hate in hate speech</article-title>
          ,
          <source>Journal of Applied Philosophy</source>
          <volume>40</volume>
          (
          <year>2023</year>
          )
          <fpage>769</fpage>
          -
          <lpage>787</lpage>
          . URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/japp.12608. doi:10.1111/japp.12608.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bakalis</surname>
          </string-name>
          ,
          <article-title>Regulating hate crime in the digital age</article-title>
          , Oxford University Press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Gregorio</surname>
          </string-name>
          ,
          <article-title>Democratising online content moderation: A constitutional framework</article-title>
          ,
          <source>Computer Law &amp; Security Review</source>
          <volume>36</volume>
          (
          <year>2020</year>
          )
          <fpage>105376</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gorwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Binns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Katzenbach</surname>
          </string-name>
          ,
          <article-title>Algorithmic content moderation: Technical and political challenges in the automation of platform governance</article-title>
          ,
          <source>Big Data &amp; Society</source>
          <volume>7</volume>
          (
          <year>2020</year>
          )
          <article-title>205395171989794</article-title>
          . URL: http://journals.sagepub.com/doi/10.1177/2053951719897945.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Seering</surname>
          </string-name>
          ,
          <article-title>Reconsidering self-moderation: the role of research in supporting community-based models for online content moderation</article-title>
          ,
          <source>Proceedings of the ACM on Human-Computer Interaction</source>
          <volume>4</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Einwiller</surname>
          </string-name>
          , S. Kim,
          <article-title>How online content providers moderate user-generated content to prevent harmful online communication: An analysis of policies and their implementation</article-title>
          ,
          <source>Policy &amp; Internet</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <fpage>184</fpage>
          -
          <lpage>206</lpage>
          . URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/poi3.239.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schluger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Danescu-Niculescu-Mizil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E. C.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <article-title>Proactive moderation of online discussions: Existing practices and the potential for algorithmic support</article-title>
          ,
          <source>Proceedings of the ACM on Human-Computer Interaction</source>
          <volume>6</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:253460203.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dixon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sorensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Thain</surname>
          </string-name>
          , L. Vasserman,
          <article-title>Measuring and Mitigating Unintended Bias in Text Classification</article-title>
          ,
          in:
          <source>Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society</source>
          , AIES '18, Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Masud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Suresh</surname>
          </string-name>
          , T. Chakraborty,
          <article-title>Handling Bias in Toxic Speech Detection: A Survey</article-title>
          ,
          <source>CoRR</source>
          abs/2202.00126 (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2202.00126. arXiv:2202.00126.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          , S. Gabriel, L. Qin,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          , Social Bias Frames:
          <article-title>Reasoning about Social and Power Implications of Language</article-title>
          , in: D.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
          </string-name>
          , J. Tetreault (Eds.),
          <article-title>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>5477</fpage>
          -
          <lpage>5490</lpage>
          . URL: https://aclanthology.org/2020.acl-main.486.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Soler</surname>
          </string-name>
          , L. Wanner, Toxic, Hateful, Offensive or Abusive?
          <article-title>What Are We Really Classifying? An Empirical Analysis of Hate Speech Datasets</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Béchet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Blache</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cieri</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goggi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Isahara</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Mazo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odijk</surname>
          </string-name>
          , S. Piperidis (Eds.),
          <source>Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>6786</fpage>
          -
          <lpage>6794</lpage>
          . URL: https://aclanthology.org/2020.lrec-1.838.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          , T. Kleinbauer,
          <article-title>Detection of Abusive Language: the Problem of Biased Datasets, in: North American Chapter of the Association for Computational Linguistics</article-title>
          ,
          <year>2019</year>
          . URL: https://api.semanticscholar.org/CorpusID:174799974.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hartvigsen</surname>
          </string-name>
          , S. Gabriel, H. Palangi,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ray</surname>
          </string-name>
          , E. Kamar,
          <article-title>ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection, in: Annual Meeting of the Association for Computational Linguistics</article-title>
          ,
          <year>2022</year>
          . URL: https://api.semanticscholar.org/CorpusID:247519233.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>The Woman Worked as a Babysitter: On Biases in Language Generation</article-title>
          , in: K. Inui,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          X. Wan (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3407</fpage>
          -
          <lpage>3412</lpage>
          . URL: https://aclanthology.org/D19-1339. doi:10.18653/v1/D19-1339.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Casper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ezell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Siegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Curtis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bucknall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haupt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Scheurer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hobbhahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sharkey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gerovitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tegmark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          ,
          <article-title>Black-box access is insufficient for rigorous AI audits</article-title>
          ,
          <year>2024</year>
          . arXiv:2401.14446.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Birhane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Steed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ojewale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vecchione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. D.</given-names>
            <surname>Raji</surname>
          </string-name>
          ,
          <article-title>AI auditing: The broken bus on the road to AI accountability</article-title>
          ,
          <source>ArXiv abs/2401.14462</source>
          (
          <year>2024</year>
          ). URL: https://api.semanticscholar.org/CorpusID:267301287.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>West</surname>
          </string-name>
          ,
          <article-title>Algorithmic Accountability: Moving Beyond Audits</article-title>
          ,
          <source>AI Now Institute</source>
          (
          <year>2023</year>
          ). URL: https://ainowinstitute.org/publication/algorithmic-accountability.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>I. D.</given-names>
            <surname>Raji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Honigsberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <article-title>Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance</article-title>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2206.04737, arXiv:2206.04737 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pendzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wullach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Minkov</surname>
          </string-name>
          ,
          <article-title>Generative AI for Hate Speech Detection: Evaluation and Findings</article-title>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2311.09993, arXiv:2311.09993 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Jigsaw</surname>
          </string-name>
          ,
          <article-title>Jigsaw toxic comment classification challenge</article-title>
          ,
          <year>2019</year>
          . URL: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mathew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Yimam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <article-title>HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>35</volume>
          (
          <year>2021</year>
          )
          <fpage>14867</fpage>
          -
          <lpage>14875</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/17745, number:
          <issue>17</issue>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>ElSherief</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ziems</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Muchlinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Anupindi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Seybolt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Latent Hatred: A Benchmark for Understanding Implicit Hate Speech</article-title>
          ,
          <source>CoRR abs/2109.05322</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2109.05322, arXiv:2109.05322.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>F.</given-names>
            <surname>Elsafoury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Katsigiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ramzan</surname>
          </string-name>
          ,
          <article-title>On Bias and Fairness in NLP: How to have a fairer text classification?</article-title>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2305.12829, arXiv:2305.12829 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Borkan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dixon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sorensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Thain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vasserman</surname>
          </string-name>
          ,
          <article-title>Nuanced metrics for measuring unintended bias with real data for text classification</article-title>
          , in:
          <source>Companion Proceedings of The 2019 World Wide Web Conference</source>
          , WWW '19,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2019</year>
          , pp.
          <fpage>491</fpage>
          -
          <lpage>500</lpage>
          . URL: https://doi.org/10.1145/3308560.3317593.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Yoder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H. X.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Carley</surname>
          </string-name>
          ,
          <article-title>How hate speech varies by target identity: A computational analysis</article-title>
          ,
          <source>arXiv preprint arXiv:2210.10839</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>V.</given-names>
            <surname>Prabhakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hutchinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <article-title>Perturbation sensitivity analysis to detect unintended model biases</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Inui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          ,
          <publisher-loc>Hong Kong, China</publisher-loc>
          ,
          <year>2019</year>
          , pp.
          <fpage>5740</fpage>
          -
          <lpage>5745</lpage>
          . URL: https://aclanthology.org/D19-1578.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Perot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Limtiaco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Taly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beutel</surname>
          </string-name>
          ,
          <article-title>Counterfactual Fairness in Text Classification through Robustness</article-title>
          , in:
          <source>Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society</source>
          ,
          <publisher-name>ACM</publisher-name>
          ,
          <publisher-loc>Honolulu, HI, USA</publisher-loc>
          ,
          <year>2019</year>
          , pp.
          <fpage>219</fpage>
          -
          <lpage>226</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3306618.3317950.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>