Creating a Framework for a Benchmark Religion Dataset

Deepa Muralidhar¹, Ashwin Ashok²
¹ Georgia State University
² Georgia State University
deepa.muralidhar@gmail.com (D. Muralidhar); aashok@gsu.edu (A. Ashok)

Human in the Loop Data Curation Workshop '22: CIKM '22, October 17–21, 2022, Atlanta, GA
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

Abstract
Development of Language Models (LMs) such as OpenAI's GPT series, which generate natural language text, is growing at a rapid pace. An LM takes a text prompt as input and generates as output the most probable sequence of words matching the prompt's context and pattern. Our preliminary investigations revealed bias but little evidence of its cause. Our goal is therefore to build a benchmark dataset on various religions for evaluating this bias. We envision that our conceptual method of creating a dataset and developing a bias rating mechanism can serve as a fundamental tool for establishing a process to measure bias. Comparing the Bias Indicator Value (BIV) for one religion against another should give us enough information to provide a holistic bias rating for the generated text.

Keywords
Large language models, religious bias, algorithmic bias, bias, metrics, text mining, socioeconomic factors, mitigating

1. Introduction and Motivation

Our research on religious bias targets two key challenges for AI text generators: (1) the need for a religion-based benchmark dataset to evaluate an AI text generator for bias. The problem with existing datasets is that there is no clear documentation and no other dataset-management practices in place [1]. A well-designed benchmark dataset helps verify that the output data is unbiased across a diverse distribution of real-world contexts [2]. (2) The need for indicators of bias: a quantitative value that can represent implicit bias numerically. Our key observation is that it is challenging to create a bias metric for LMs, as the bias changes depending on the context of the text.

2. Target Design Goals

We conduct experiments to measure the sentiment of the AI-generated text and test for religious bias. For every prompt, we program GPT-3 to generate 200 tokens, approximately 178 words. For six sets of 20 prompts (five religions and one religion-neutral set that acts as a control value), the text has about 11,100 words (12,000 tokens). Using VADER [3], a rule-based tool that measures the sentiment within text, we measure positive and negative values (between -1 and +1) and compare the sentiment value of each sentence. We use this to create a preliminary quantifiable metric, a Bias Indicator Value, to identify stereotypical bias with respect to a religion and to interpret why LMs generate text differently for different religions. This metric, which acts as an indicator of religious bias in GPT-3, is computed for the generated data. We share the results of these experiments with the community through a preliminary religion dataset that includes the textual prompts, the test data, the associated AI-generated output text, and the analysis of that text. The computed metrics and graphs are part of this benchmark dataset.

3. Open Challenges

An open research question for future work is that one metric may be insufficient to measure bias; instead, we look to develop a bias-reporting toolkit. This could include transparency measurement, an examination of the how and why of the decision-making process in an AI system, which is useful in detecting systemic biases [4].

References

[1] K. Peng, A. Mathur, A. Narayanan, Mitigating dataset harms requires stewardship: Lessons from 1000 papers, 2021. URL: https://arxiv.org/abs/2108.02922.
[2] P. P. Liang, C. Wu, L. Morency, R. Salakhutdinov, Towards understanding and mitigating social biases in language models, 2021. URL: https://arxiv.org/abs/2106.13219.
[3] C. J. Hutto, E. Gilbert, VADER: A parsimonious rule-based model for sentiment analysis of social media text, 2015.
[4] J. Stoyanovich, TransFAT: Translating fairness, accountability and transparency into data science practice, 2018. URL: http://ceur-ws.org/Vol-2417/paper1.pdf.
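To make the measurement pipeline of Section 2 concrete, the sketch below scores each sentence, averages per prompt group, and differences a religion group against the religion-neutral control. This is an illustration, not the paper's implementation: the paper uses VADER [3] (here a toy lexicon stands in so the sketch is self-contained), and the exact BIV formula is not specified in the paper, so the aggregation rule (mean compound score minus the control group's mean) is our assumption.

```python
# Illustrative sketch of a sentence-level Bias Indicator Value (BIV).
# TOY_LEXICON is a stand-in for VADER's sentiment lexicon; the
# mean-minus-control aggregation is an assumed, hypothetical BIV formula.

TOY_LEXICON = {"peaceful": 0.6, "generous": 0.5, "violent": -0.7, "dishonest": -0.6}

def sentence_score(sentence: str) -> float:
    """Crude stand-in for VADER's compound score, clipped to [-1, +1]."""
    words = sentence.lower().strip(".!?").split()
    raw = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    return max(-1.0, min(1.0, raw))

def group_score(sentences) -> float:
    """Mean sentence-level sentiment for one prompt group."""
    return sum(sentence_score(s) for s in sentences) / len(sentences)

def bias_indicator_value(group_sentences, control_sentences) -> float:
    """How far a religion group's mean sentiment sits from the
    religion-neutral control group's mean sentiment."""
    return group_score(group_sentences) - group_score(control_sentences)

# Example: generated text for one religion group vs. the neutral control set.
religion_a = ["They are peaceful and generous.", "They are violent."]
control = ["They are generous.", "They are peaceful."]
print(round(bias_indicator_value(religion_a, control), 3))  # prints -0.4
```

A negative BIV here flags that the group's generated text is, on average, more negative in sentiment than the control; in practice the toy lexicon would be replaced by VADER's `SentimentIntensityAnalyzer` compound scores.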