Watching the Watchers: A Comparative Fairness Audit of Cloud-based Content Moderation Services

David Hartmann (1, 2), Amin Oueslati (3) and Dimitri Staufer (1)
(1) Faculty of Electrical Engineering and Computer Science, TU Berlin
(2) Weizenbaum Institute for the Networked Society
(3) Hertie School Berlin

EWAF'24: European Workshop on Algorithmic Fairness, July 01–03, 2024, Mainz, Germany
Contact: d.hartmann@tu-berlin.de (D. Hartmann); amin.m.oueslati@gmail.com (A. Oueslati); staufer@tu-berlin.de (D. Staufer)

Abstract

Online platforms face the challenge of moderating an ever-increasing volume of content, including harmful hate speech. In the absence of clear legal definitions and a lack of transparency regarding the role of algorithms in shaping content moderation decisions, there is a critical need for external accountability. Our study contributes to filling this gap by systematically evaluating four leading cloud-based content moderation services through a third-party audit, highlighting issues such as biases against minorities and vulnerable groups that may arise through over-reliance on these services. Using a black-box audit approach and four benchmark data sets, we measure performance in explicit and implicit hate speech detection as well as counterfactual fairness through perturbation sensitivity analysis, and we present disparities in performance for certain target identity groups and data sets. Our analysis reveals that all services had difficulties detecting implicit hate speech, which relies on more subtle and codified messages. Moreover, our results point to the need to remove group-specific biases. Biases towards some groups, such as Women, appear to have been largely rectified, while biases towards others, such as LGBTQ+ and PoC, remain.

Keywords: Content moderation as a service, hate speech detection, third-party audit, NLP fairness

1. Introduction

Hate speech has real-world effects, including the suppression of voices, exclusion, discrimination, and violence against minorities [1, 2]. It is therefore all the more concerning that, with the rise of online content in the digital age, pernicious and unwanted content such as hate speech and discriminatory material is proliferating [3]. Online platforms have responded to this proliferation by adopting extensive content moderation regimes [4], in which human moderators, assisted by algorithms, assess potentially hateful content against so-called community guidelines [5]. Absent legal definitions of hate speech that translate into practice, private companies enjoy substantial autonomy in their moderation practices, effectively making them the judges of public speech [6, 7]. The largest technology firms, such as Google, Microsoft, Amazon, and OpenAI, additionally offer content moderation as a service via cloud-based API access. While most organisations do not report the extent to which algorithms shape content moderation, the sheer amount of online speech makes reliance on algorithmic moderation inevitable [8]. The risks associated with hate speech are not limited to its lack of regulation or moderation.
Over-moderation and under-moderation of content concerning specific groups, as well as the non-functionality of automated hate speech classification, can lead to serious harm. If content moderation algorithms malfunction, some users are wrongfully censored, while others are insufficiently protected [9]. Open-source content moderation algorithms have repeatedly displayed biases against minorities and target groups [10, 11, 12, 13, 14, 15]. Nonetheless, no systematic evaluation of cloud-based content moderation services exists, which amounts to an alarming absence of public scrutiny.

This paper's contribution is twofold. Firstly, it offers the first comprehensive fairness assessment of four major cloud-based content moderation algorithms, which are likely in widespread use through the SaaS model yet have so far escaped systematic evaluation. Secondly, our auditing strategy may inform future bias audits of (cloud-based) content moderation algorithms. Importantly, our proposed approach solely assumes limited black-box access [16] and offers guidance on reinforced sampling strategies to achieve maximal scrutiny with limited resources, reflecting the realities of unsolicited audits by civil society organisations and academia [17, 18, 19].

2. Data and Method

We gained researcher access to the Google Moderate Text API, Amazon Comprehend, Microsoft Azure Content Moderation, and the OpenAI Content Moderation API. These services generate a hate speech score per text sequence, often split across several sub-categories, as well as a binary flag.

Our study uses the MegaSpeech, Jigsaw, HateXplain, and ToxiGen datasets [20, 21, 22, 14]. The selected datasets capture various forms of hate speech: ToxiGen contains implicit and adversarial hate speech constructed around indirect messages [23], and both MegaSpeech and ToxiGen use generative AI to diversify their speech corpora [20, 14]. Jigsaw and HateXplain contain human-written examples labeled by annotators, while MegaSpeech contains more hate speech examples but no target group labels. MegaSpeech, HateXplain, and ToxiGen provide shorter text sequences, averaging 17.7, 23.3, and 18.1 words respectively, while Jigsaw consists of longer sequences, averaging 48.3 words.

We evaluate all cloud-based moderation algorithms across all datasets on a set of threshold-variant and threshold-invariant performance metrics [24, 25], both at an aggregate level and specifically for vulnerable groups. We ensure consistency across datasets by mapping their target identity labels onto seven vulnerable groups (Women, LGBTQ+, PoC, Muslim, Asian, Jewish, Latinx). Since MegaSpeech comes without such labels, we train a Bi-LSTM model for target identity classification on the dataset collected by Yoder et al. [26] (preliminary evaluation accuracy: 78%). At the group level, we compute the pinned ROC AUC, a metric proposed by Dixon et al. [9] that is designed to provide a more robust, scale-invariant performance comparison across subgroups. While this approach has its pitfalls, as the authors themselves note in a subsequent paper [25], it remains the best scale-invariant metric to date when group-level variation in biases is present.
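To make the metric concrete, the following is a minimal sketch of a pinned ROC AUC computation under our reading of Dixon et al. [9]: the subgroup's examples are combined with an equally sized random sample of the full evaluation data, and a standard ROC AUC is computed on this pinned set. The helper below and the use of scikit-learn are illustrative assumptions, not a description of our exact implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def pinned_auc(labels, scores, in_subgroup, rng=None):
    """Pinned ROC AUC in the spirit of Dixon et al. [9]: the subgroup's examples
    are combined with an equally sized random sample of the full data set, so the
    subgroup makes up roughly half of the 'pinned' evaluation set, and a standard
    ROC AUC is computed on that set."""
    if rng is None:
        rng = np.random.default_rng(0)
    labels, scores = np.asarray(labels), np.asarray(scores)
    subgroup_idx = np.flatnonzero(in_subgroup)  # rows mentioning the group
    background_idx = rng.choice(len(labels), size=len(subgroup_idx), replace=False)
    pinned_idx = np.concatenate([subgroup_idx, background_idx])
    return roc_auc_score(labels[pinned_idx], scores[pinned_idx])


# Illustrative usage: one value per service, data set, and identity group, e.g.
# pinned_auc(y_true, api_hate_scores, target_group == "LGBTQ+")
```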
Perturbation Sensitivity Analysis (PSA) offers an additional, arguably more robust evaluation of group-level biases through counterfactual fairness evaluation [27]. Following prior research, we define an anchor group against which other groups are compared [27]. Using the dominant majority group as the baseline, Counterfactual Token Fairness (CFT) scores are computed as the difference in toxicity between the baseline and the corresponding minority group.

PSA makes two assumptions. First, counterfactual pairs should convey the same or a neutral meaning, avoiding any implicit biases or derogatory connotations. While constructing toxic counterfactuals is theoretically possible, it is methodologically demanding and exceeds the scope of this project. Instead, we construct 34 neutral counterfactual pairs. Importantly, each minority group is represented by multiple tokens, reflecting its different semantic representations; for instance, the minority group "female" also manifests as "woman" and "women". Second, there should be no unique interactions between a particular minority token and the context of the sentence that would skew the analysis. This is challenging in real-world applications, as certain combinations might evoke stereotypes or specific cultural connotations. Thus, the project uses data consisting largely of short and explicit statements. Furthermore, CFT scores are calculated separately for toxic and non-toxic statements, with the latter generally supporting the assumption of counterfactual symmetry more consistently.

PSA experiments are conducted using two distinct data sets. First, we use the synthetic Identity Phrase Templates from Dixon et al. [9]. The set contains 77,000 synthetic examples, of which 50% are toxic. These avoid stereotypes and complex sentence structures by design, which ensures that the symmetric counterfactual assumption is met. Mapping the dataset, which covers a broader set of identities, to the 34 minority tokens relevant to this study results in 25,738 sentence pairs. Second, by applying the same logic, 9,190 sentence pairs are derived from the MegaSpeech dataset.
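To illustrate the computation, the sketch below shows how a per-group mean CFT score of this kind could be derived from counterfactual sentence pairs. The `get_toxicity_score` wrapper and the `<identity>` placeholder convention are illustrative assumptions standing in for a call to one of the audited APIs, not a description of the exact pipeline.

```python
from statistics import mean


def get_toxicity_score(text: str) -> float:
    """Placeholder for a call to one of the audited moderation APIs,
    expected to return a toxicity/hate score in [0, 1]."""
    raise NotImplementedError


def mean_cft_score(templates, baseline_token, group_tokens):
    """Mean Counterfactual Token Fairness (CFT) score for one minority group:
    the average difference in toxicity between a sentence mentioning the group
    and the same sentence mentioning the baseline (majority) group.
    Each template contains an '<identity>' placeholder."""
    diffs = []
    for template in templates:
        baseline = get_toxicity_score(template.replace("<identity>", baseline_token))
        for token in group_tokens:  # e.g. ["woman", "women", "female"]
            counterfactual = get_toxicity_score(template.replace("<identity>", token))
            diffs.append(counterfactual - baseline)
    # Positive values indicate the minority group is scored as more toxic than
    # the baseline; in the analysis, scores are aggregated separately for toxic
    # and non-toxic templates.
    return mean(diffs)
```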
3. Results

Table 1 shows aggregated performance results for the chosen benchmark data sets. Our results indicate notable disparities between moderation APIs. OpenAI's content moderation algorithm performs best on MegaSpeech, while Amazon Comprehend performs best on Jigsaw and ToxiGen, with both generalising well across data sets. However, Amazon Comprehend's near-optimal performance on Jigsaw (92.2% ROC AUC) suggests that the Jigsaw data was likely included in the API's training process. Overall, Google's API shows the worst performance across data sets. Its poor performance seems driven by a comparably high FPR, which suggests that the algorithm tends to overmoderate. In contrast, Microsoft Azure Content Moderation is associated with a high FNR, suggesting it often misses hate speech. Furthermore, all services struggle to detect implicit hate speech, reflected in their high false negative rates on ToxiGen. In this regard, commercial moderation services do not fare much better than their open-source counterparts [14]. One likely cause is the limited availability of implicit hate speech datasets for training purposes.

Dataset      Moderation Service   ROC AUC   F1       FPR      FNR
ToxiGen      Amazon               70.4%     68.9%    7.2%     52.0%
ToxiGen      Google               62.7%     62.7%    39.1%    35.5%
ToxiGen      OpenAI               70.3%     68.1%    33.2%    56.0%
ToxiGen      Microsoft            59.8%     57.4%    16.4%    64.0%
MegaSpeech   Amazon               72.8%     72.0%    10.4%    43.9%
MegaSpeech   Google               73.3%     72.3%    41.3%    12.0%
MegaSpeech   OpenAI               77.1%     76.7%    8.4%     37.3%
MegaSpeech   Microsoft            70.6%     70.1%    16.9%    41.9%
Jigsaw       Amazon               92.2%     92.2%    7.5%     8.1%
Jigsaw       Google               69.9%     67.2%    58.4%    1.8%
Jigsaw       OpenAI               78.6%     78.6%    17.1%    25.6%
Jigsaw       Microsoft            75.8%     75.7%    20.4%    28.1%
HateXplain   Amazon               66.8%     66.25%   46.3%    20.0%
HateXplain   Google               52.2%     58.9%    78.2%    4.0%
HateXplain   OpenAI               72.9%     76.7%    45.4%    8.86%
HateXplain   Microsoft            63.1%     60.2%    63.6%    10.3%
Table 1: Performance metrics by moderation service and dataset. ToxiGen includes 7,800 observations and HateXplain 14,000, while Jigsaw and MegaSpeech each contain 50,000. All datasets are balanced on toxic and non-toxic phrases.

Figure 1: Pinned ROC AUC by moderation service, dataset, and minority group. ToxiGen includes 4,268 observations, HateXplain includes 1,748, Jigsaw consists of 19,228 observations, and MegaSpeech comprises 33,886.

The comparative fairness evaluation across identity groups is presented via group-level pinned ROC AUC scores in Figure 1. Due to space constraints, we only present one metric (ROC AUC); a comprehensive analysis is left for future work. We find that all services tend to overmoderate speech concerning the groups PoC and LGBTQ+. This is somewhat surprising, as extensive prior research has already uncovered biases against these groups in open-source content moderation algorithms [28], which providers might have been expected to address. Commonly, such overmoderation occurs because toxic speech concerning these groups is overrepresented in the training data and is subsequently learned by the model. Most services fail to reliably detect hate speech aimed at the groups Disability, Asian, and Latinx. Lastly, the tendency of Google Text Moderation to overmoderate is puzzling but also alarming. While we cannot entirely rule out an error on our end, this observation is robust to different configurations of API sub-categories.

Figure 2: CFT scores computed through PSA on synthetic data from the Identity Phrase Templates in Dixon et al. [9] and non-synthetic data from MegaSpeech, averaged per group and service and reported separately for non-toxic and toxic examples. Besides point estimates, the figure includes 95% confidence intervals assuming a Student's t-distribution.

Figure 2 displays the PSA results. We find that (1) differences in toxicity scores are, by and large, more pronounced on non-toxic than on toxic data. Intuitively this makes sense, as scores are generated non-linearly with a definite upper bound: when other elements in a sentence already induce a high toxicity score, the marginal effect of identity tokens is comparably lower. We further find (2) greater variation in the mean CFT scores in non-synthetic than in synthetic data. This was to be expected, as the sentences from MegaSpeech contain more contextual information that interacts with the tokens. Overall, the results suggest that most minorities are associated with higher levels of toxicity than dominant majorities, although these effects appear relatively small and vary across groups and services. The group LGBTQ+ appears to be associated with the strongest negative bias, occurring for all samples and services. We observe limited negative bias against the groups Latinx and Asian.

4. Conclusion

In summary, we uncovered both aggregate-level performance issues and group-level biases in major commercial cloud-based content moderation services.
Importantly, while some shortcomings extend to all services, such as difficulty in detecting implicit hate speech or biases against the group LGBTQ+, others are confined to a particular service. Over the years, a substantial body of research has documented the biases and limitations of automated hate speech detection classifiers. Nevertheless, these limitations persist in current content moderation APIs. We demonstrated that all four tested content moderation APIs show disparities in performance for specific target groups and for implicit hate speech, overmoderate target groups that are strongly associated with hate speech online, and penalize counter speech as well as reappropriation. Challenges we encountered, such as the inherent subjectivity of hate speech moderation and data limitations, should not deter but encourage future work. Without public scrutiny, the subjectivity does not vanish; it simply remains entirely at the discretion of private companies to make these subjective choices.

References

[1] M. J. Matsuda, C. R. L. III, R. Delgado, K. W. Crenshaw, Words That Wound: Critical Race Theory, Assaultive Speech, and The First Amendment, Faculty Books, 1993. URL: https://scholarship.law.columbia.edu/books/287.
[2] T. Marques, The expression of hate in hate speech, Journal of Applied Philosophy 40 (2023) 769–787. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/japp.12608. doi:10.1111/japp.12608.
[3] C. Bakalis, Regulating hate crime in the digital age, Oxford University Press, 2016.
[4] G. De Gregorio, Democratising online content moderation: A constitutional framework, Computer Law & Security Review 36 (2020) 105376.
[5] R. Gorwa, R. Binns, C. Katzenbach, Algorithmic content moderation: Technical and political challenges in the automation of platform governance, Big Data & Society 7 (2020). URL: http://journals.sagepub.com/doi/10.1177/2053951719897945.
[6] J. Seering, Reconsidering self-moderation: the role of research in supporting community-based models for online content moderation, Proceedings of the ACM on Human-Computer Interaction 4 (2020) 1–23.
[7] S. A. Einwiller, S. Kim, How online content providers moderate user-generated content to prevent harmful online communication: An analysis of policies and their implementation, Policy & Internet 12 (2020) 184–206. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/poi3.239.
[8] C. Schluger, J. P. Chang, C. Danescu-Niculescu-Mizil, K. E. C. Levy, Proactive moderation of online discussions: Existing practices and the potential for algorithmic support, Proceedings of the ACM on Human-Computer Interaction 6 (2022) 1–27. URL: https://api.semanticscholar.org/CorpusID:253460203.
[9] L. Dixon, J. Li, J. Sorensen, N. Thain, L. Vasserman, Measuring and Mitigating Unintended Bias in Text Classification, in: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 67–73.
[10] T. Garg, S. Masud, T. Suresh, T. Chakraborty, Handling Bias in Toxic Speech Detection: A Survey, CoRR abs/2202.00126 (2022). URL: https://arxiv.org/abs/2202.00126.
[11] M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, Y. Choi, Social Bias Frames: Reasoning about Social and Power Implications of Language, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 5477–5490. URL: https://aclanthology.org/2020.acl-main.486.
[12] P. Fortuna, J. Soler, L. Wanner, Toxic, Hateful, Offensive or Abusive? What Are We Really Classifying? An Empirical Analysis of Hate Speech Datasets, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 6786–6794. URL: https://aclanthology.org/2020.lrec-1.838.
[13] M. Wiegand, J. Ruppenhofer, T. Kleinbauer, Detection of Abusive Language: the Problem of Biased Datasets, in: North American Chapter of the Association for Computational Linguistics, 2019. URL: https://api.semanticscholar.org/CorpusID:174799974.
[14] T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap, D. Ray, E. Kamar, ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection, in: Annual Meeting of the Association for Computational Linguistics, 2022. URL: https://api.semanticscholar.org/CorpusID:247519233.
[15] E. Sheng, K.-W. Chang, P. Natarajan, N. Peng, The Woman Worked as a Babysitter: On Biases in Language Generation, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3407–3412. URL: https://aclanthology.org/D19-1339. doi:10.18653/v1/D19-1339.
[16] S. Casper, C. Ezell, C. Siegmann, N. Kolt, T. L. Curtis, B. Bucknall, A. Haupt, K. Wei, J. Scheurer, M. Hobbhahn, L. Sharkey, S. Krishna, M. V. Hagen, S. Alberti, A. Chan, Q. Sun, M. Gerovitch, D. Bau, M. Tegmark, D. Krueger, D. Hadfield-Menell, Black-box access is insufficient for rigorous AI audits, 2024. arXiv:2401.14446.
[17] A. Birhane, R. Steed, V. Ojewale, B. Vecchione, I. D. Raji, AI auditing: The broken bus on the road to AI accountability, ArXiv abs/2401.14462 (2024). URL: https://api.semanticscholar.org/CorpusID:267301287.
[18] A. Kak, S. M. West, Algorithmic Accountability: Moving Beyond Audits, AI Now Institute (2023). URL: https://ainowinstitute.org/publication/algorithmic-accountability.
[19] I. D. Raji, P. Xu, C. Honigsberg, D. E. Ho, Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance, 2022. URL: http://arxiv.org/abs/2206.04737. arXiv:2206.04737 [cs].
[20] S. Pendzel, T. Wullach, A. Adler, E. Minkov, Generative AI for Hate Speech Detection: Evaluation and Findings, 2023. URL: http://arxiv.org/abs/2311.09993. arXiv:2311.09993 [cs].
[21] Jigsaw, Jigsaw Toxic Comment Classification Challenge, 2019. URL: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge.
[22] B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, A. Mukherjee, HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 14867–14875. URL: https://ojs.aaai.org/index.php/AAAI/article/view/17745.
[23] M. ElSherief, C. Ziems, D. Muchlinski, V. Anupindi, J. Seybolt, M. D. Choudhury, D. Yang, Latent Hatred: A Benchmark for Understanding Implicit Hate Speech, CoRR abs/2109.05322 (2021). URL: https://arxiv.org/abs/2109.05322.
[24] F. Elsafoury, S. Katsigiannis, N. Ramzan, On Bias and Fairness in NLP: How to have a fairer text classification?, 2023. URL: http://arxiv.org/abs/2305.12829. arXiv:2305.12829 [cs].
[25] D. Borkan, L. Dixon, J. Sorensen, N. Thain, L. Vasserman, Nuanced metrics for measuring unintended bias with real data for text classification, in: Companion Proceedings of The 2019 World Wide Web Conference, WWW '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 491–500. URL: https://doi.org/10.1145/3308560.3317593.
[26] M. M. Yoder, L. H. X. Ng, D. W. Brown, K. M. Carley, How hate speech varies by target identity: A computational analysis, arXiv preprint arXiv:2210.10839 (2022).
[27] V. Prabhakaran, B. Hutchinson, M. Mitchell, Perturbation sensitivity analysis to detect unintended model biases, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 5740–5745. URL: https://aclanthology.org/D19-1578.
[28] S. Garg, V. Perot, N. Limtiaco, A. Taly, E. H. Chi, A. Beutel, Counterfactual Fairness in Text Classification through Robustness, in: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, ACM, Honolulu, HI, USA, 2019, pp. 219–226. URL: https://dl.acm.org/doi/10.1145/3306618.3317950.