<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Stories: Hairdressers Are Female, but so Are Doctors</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Spillner</string-name>
          <email>laura.spillner@uni-bremen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Generative AI</institution>
          ,
          <addr-line>Large Language Models, ChatGPT, Story Generation, Gender Bias</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia, M. Litvak (eds.): Proceedings of the Text2Story'24 Workshop</institution>
          ,
          <addr-line>Glasgow</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Bremen, Digital Media Lab</institution>
          ,
          <addr-line>Bibliothekstraße 1, 28359 Bremen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>We investigated gender bias in short stories generated by ChatGPT by generating stories about characters with specified occupations and analyzing the gender assigned to these characters. On the one hand, stereotypes about professions typically associated with women are strongly reinforced, with almost all of the characters in these stories being female, well beyond what would be expected based on human biases. On the other hand, among occupations that humans typically associate with men, the generated stories reinforce these stereotypes in some cases (particularly blue-collar occupations), while reversing them to be strongly stereotypically female in other cases (notably highly regarded professions such as</p>
      </abstract>
      <kwd-group>
        <kwd>doctors</kwd>
        <kwd>scientists</kwd>
        <kwd>attorneys</kwd>
        <kwd>or astronauts)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>With this study, we aimed to investigate whether generative AI models such as ChatGPT, when
used to generate short stories, introduce bias or amplify stereotypes in their output.</p>
      <p>There are two primary ways of bias introduction to consider when analyzing the narratives
of generated texts. The first involves biases within the plot of the story, including the choice
of gender, skin color, sexuality, and other attributes for diferent characters, as well as the
association of certain actions with these attributes. For example, this may manifest as passive
female characters and active male characters. The second form of bias pertains to the language
of the text itself, including how characters of diferent gender, skin color, sexuality, etc., are
described, as well as the word choices used in relation to these attributes. We focus here on a
straightforward starting point: identifying the occupations that are typically given to male or
female characters in stories generated by ChatGPT (GPT-3.5).</p>
      <p>
        The examination of gender bias in generated stories, e.g. for professions, is important for
several reasons. Firstly, there already exists a substantial body of research on earlier language
models and word embeddings that have explored the same [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Additionally, researchers have
extensively studied the unequal distribution of women and men in numerous professions, as
well as the stereotypes that lead humans to often associate certain professions with one gender
or the other [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        To analyze gender bias in language models, one standard task is that of reference resolution,
where there are two possible nouns (typically occupations or other roles) that a pronoun could
refer to, one where the pronoun aligns with the gender stereotype and one where it does not
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Another task involves sentence completion, where the model is given the beginning of
a sentence (specifying that the subject is e.g. either a man or a woman) and has to complete
for example that person’s occupation (“The man/woman worked as a ...”) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], after which the
sentiment of the resulting sentences is compared between the groups. For machine translation,
tests have been done by translating sentences from a gender neutral language that leaves the
gender of the subject ambiguous to a gendered language [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. All of those have revealed the
perpetuation or even amplification of societal biases.
      </p>
      <p>However, all of these common tasks difer from that of story generation. To understand if
generated stories perpetuate stereotypes, one first has to be able to identify diferent elements
of the plot from the written text generated by the model (such as the gender associated with
diferent characters and their role in the story). Moreover, it is not clear that the biases found
with tasks as the ones mentioned above would necessarily be the same as biases in generated
stories, since the model might very well draw from diferent parts of the training data. With
the popularity of ChatGPT and the current generation of generative AI, it seems likely that
AI-generated stories, short stories or books (including if not especially for children) will be
becoming more common in the next years. Understanding whether or not these amplify
stereotypes beyond what is common in society thus becomes only more important.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        Language models tend to mirror human biases, such as gender stereotypes, that are present
in their training corpora [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ]. This can lead to problematic biases in downstream tasks,
e.g. Stanovski et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] showed gender bias in machine translation by examining sentences
containing professions, which were translated from a gender-neutral language to a gendered
language. While some solutions have been proposed (e.g. amplification can be mitigated through
constrains on training corpora [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and contextualized word embeddings are less biased than
static ones [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]), the problem has not been solved. LLMs are just as biased as word embeddings,
and can also amplify bias - Bender et al. (2021) argued that they are “stochastic parrots” [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        There exists a wide variety of benchmarks and tests to measure bias in language models. Zhao
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduced the WinoBias benchmark, which focuses on gender bias in co-reference
resolution and has become instrumental in evaluating LLMs and highlighting the reinforcement
of bias. Another benchmark presented by Nadeem et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] aims to measure both bias and
language generation ability at the same time and has revealed the presence of stereotypes in
popular text generation models like GPT-2 and GPT-3. Sheng et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] conducted a study using
prefix templates such as “the woman worked as…” and employed GPT-2 and other language
models to complete these sentences. The resulting sentences were then analyzed for sentiment.
The findings showed that, for example, sentences prompted with women were more negative
(e.g. using more negatively connotated occupations) compared to those prompted with men,
and similarly for aspects such as skin color or sexuality.
      </p>
      <p>While overall the presence of bias in language models has been shown consistently, recent
surveys have revealed that many of these metrics are not compatible with each other and
produce heterogeneous results, in particular when it comes to embedding-based metrics [14, 15].
Delobelle et al. [14] surveyed many binary gender bias tasks. They point out that while models
certainly learn intrinsic bias from training data and show extrinsic bias in downstream tasks,
results learned from intrinsic bias metrics cannot easily be generalized to fairness results in
downstream tasks, and in particular the templating used in many of these benchmark tasks
influences their results considerably [ 14].</p>
      <p>
        This work primarily focuses on gender bias. Gender bias, both in language models and in
linguistic research, often mirrors the stereotypes prevalent in society. For instance, a study by
Kotek et al. in 2021 explored gender bias in linguistic example sentences, such as depicting
doctors as male and nurses as female [16]. Humans struggle to process sentences when there is
an incongruity between the assumed stereotypical gender (based on a specified profession) and
the later revealed actual gender of a person [
        <xref ref-type="bibr" rid="ref2">2, 17, 18, 19, 20, 21</xref>
        ]. Not only are such stereotypes
problematic in and of themselves, they become more concerning when perpetuated by language
models that people interact with through chatbots, ChatGPT, and similar systems, which are
becoming increasingly common. These stereotypes can influence children’s beliefs about the
accessibility of certain occupations [22]. Research has been conducted to examine the societal
stereotypes associated with diferent professions or roles according to gender [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Kennison
et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] created an extensive corpus of roles (occupations as well as other nouns) and their
stereotypical gender bias. They asked both male and female participants to rate their perceived
gender association with these roles on a scale of 1 (female) to 7 (male).
      </p>
      <p>
        Recent innovations in language model technology, including models such as ChatGPT and
GPT-4, have attracted significant attention in both public and academic spheres due to their
enhanced performance capabilities. Liu et al. provide an in-depth overview of research conducted
on chatgpt, underlining that these models have improved on numerous natural language tasks
[23]. Still, limitations such as the hallucinations observed in ChatGPT [
        <xref ref-type="bibr" rid="ref14">24</xref>
        ] still exist. Models
such as ChatGPT have used Reinforcement Learning from Human Feedback (RLHF) [
        <xref ref-type="bibr" rid="ref15">25</xref>
        ] to
address issues of bias and stereotyping in the generated outputs, but apparently this is not
suficient to solve the problem: In a blog post, Kapoor and Narayanan [
        <xref ref-type="bibr" rid="ref16">26</xref>
        ] tested GPT-3.5 and
GPT-4 on the Winobias benchmark and determined a strong presence of bias in both. Kotek et
al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] developed a new gender bias benchmark based on the dataset by Kennison et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and
were able to show that ChatGPT not only amplifies existing gender bias compared to the human
stereotype ratings, but also provides incorrect answers and explanations due to these biases.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Methods</title>
      <sec id="sec-4-1">
        <title>3.1. Dataset</title>
        <p>
          We utilized the dataset of Kotek et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], as well as the one by Kennison et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] that the
former is based on. Kotek et al. designed a test that is inspired by WinoBias [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], prompting
LLMs with sentences such as “In the sentence, ‘the doctor phoned the nurse because she was
late’, who was late?” - in contrast to the original WinoBias task, these sentences are ambiguous,
but models might use context, syntax or gender stereotypes to answer the question. They
constructed a set of 15 sentences with 30 occupation-denoting nouns based on the dataset
from [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and related literature. Kennison et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] collected human gender stereotype ratings
for a large corpus of occupations and other roles, which we use as a comparison point of the
stereotypical gender humans associate with certain professions. These ratings are on a scale of
1 to 7. The rating scale was explained to participants as such:
        </p>
        <p>
          “A rating of ‘1’ would indicate that a particular noun is very likely to represent a person
who is female. A rating of ‘7’ would indicate that a particular noun is very likely to represent a
person who is male. A rating of ‘4’ would indicate that a particular noun is equally likely to
represent a person who is male or female. A rating of ‘2’ or ‘3’ and ‘5’ and ‘6’ would indicate
diferent degrees of likelihood that a particular noun represents a person who is female or male.”
([
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], p. 359)
        </p>
        <p>Kennison et al. also used a subset of these (a total of 32 sentences and 64 nouns) in their
further reading experiments. Most of the nouns used by Kotek et al. are from this set, with
some additional ones included. We combined all of these into one list, and the original dataset
by Kennison et al. provides human ratings of the gender stereotypes associated with these
professions. We removed duplicate entries when they were very similar as well as nouns that
were not professions, but did include professions like “exotic dancer.” This consolidation resulted
in 66 professions.
3.2. Task
We directed ChatGPT to produce stories depicting “a day in the life” of individuals working in
specific professions, without specifying the gender of the person. Thus, we aimed to generate
stories where there is a straightforward association between gender and occupation. We
accessed GPT-3.5 via the API, utilizing the chat completion access point. Our prompt consisted
of a system message “You are a writer of short stories”, followed by a user message instructing
it to “Write a story about a day in the life of a [profession]”, without any instruction concerning
target audience or writing style (see appendix for the exact prompt as well as an example of a
generated story).</p>
        <p>We conducted 30 rounds of prompts, in each round presenting the professions in a randomized
order. This resulted in 30 replies by the model per profession. For some of the professions, we
had generated some test stories beforehand - as the prompt remained the same for both the test
rounds and the complete set of professions, we included these when analyzing each profession
individually. In four instances, ChatGPT refused to fulfill the request, answering (with slight
variations) “I’m sorry, I can’t fulfill that request”. Three of those times, the requested profession
was “exotic dancer,” while once, it refused to provide a story about a paralegal. In total, our
dataset consisted of a total of 2,135 stories, with the smallest number of stories per profession
being 27. To calculate the overall statistics we therefore used 27 stories for each profession
(randomly selected for those where more stories were generated), so that there was no bias
because of the slightly unequal distribution of the professions in the overall dataset.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.3. Gender Classification</title>
        <p>In theory, identifying the gender of the protagonist of these stories might involve first finding
characters in the story, then among those deciding on the main character, and then understanding
the gender assigned to them. However, in the case of the ChatGPT-generated stories, no
sophisticated story understanding was necessary. Upon examination, we discovered that the
stories produced by ChatGPT followed a consistent pattern. Consider the example presented in
the appendix. The vast majority of all stories follow the same pattern: They feature a single
named character who is introduced right at the beginning, representing the main character
associated with the requested profession. In the example, the pronoun “she” appears 15 times,
while “he” or “they” appears zero times each (excluding variations like “her”).</p>
        <p>Therefore, we counted the frequency of the substrings “ she ”, “ he ”, and “ they ” (including
spaces) in the stories and assigned the most frequently occurring pronoun as the character’s
gender. There were 31 cases where “they” appeared more frequently than the other pronouns;
we manually reviewed all of them. Among these, 29 were stories were the main character
was shown to be working together with another important character, resulting in a higher
occurrence of “they”. For these stories we manually identified the gender of the main character.
In the remaining two stories, the gender of the main character either is unspecified or was
intended to be non-binary (both instances involved the profession of computer programmer).
To validate the efectiveness of our method, we conducted a comparison by manually identifying
the gender of the main character in a randomly selected subset of 90 stories. The results matched
in all 90 stories.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Results</title>
      <p>In total, we generated 2,102 stories encompassing 66 diferent professions. These stories
consisted of an average of 393 words ± 55 words. Among the professions examined, the smallest
number of stories generated was only 27 for “exotic dancer”. To ensure equal representation of
each profession for the overall statistics, we randomly selected 27 stories for analysis for each
profession, and calculated the overall gender distribution based on these.</p>
      <sec id="sec-5-1">
        <title>4.1. Overall Gender Distribution</title>
        <p>Our list of 66 professions consisted of an equal number of stereotypically male and stereotypically
female professions based on Kennison’s data. Thus, we anticipated that the generated stories
would exhibit similar proportions of male and female main characters, both if the stories mirror
or even amplify human-held stereotypes as well as if they were to show more equal gender
distribution. However, this was not the case: Female characters appeared 1,171 times, while
male characters appeared only 609 times. Two stories featured characters with unspecified or
nonbinary genders. Therefore, there is a notable overrepresentation of female characters.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Gender Ratio per Profession</title>
        <p>In the next step, we analyzed the gender ratio for each of the selected 66 professions. We
calculated this as the frequency of main characters being male from all stories where the
main character was either male or female (thus 0 means all female and 1 means all male). We
found that 43 out of the 66 professions had a story gender bias score of less than 0.5 (majority
female characters), while 23 professions registered a score greater than 0.5 (majority male
characters). The mean rating was 0.34 (  = 0.39 ), also showing a bias towards more female
main characters.</p>
        <p>
          Subsequently, we visualized the recorded gender bias in stories associated with each
profession. Figure 1 presents a relative comparison between the stereotype ratings collected by
Kennison et al [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and the bias we identified within the narratives generated by ChatGPT.
To make the graphic easily interpretable, we selected a sample of professions rather than a
comprehensive list. These professions were selected so as to have a relatively even distribution
of human stereotype ratings, from strongly female to strongly male roles: The rating scale
developed by Kennison et al. (ranging from 1 to 7) was divided into bins of 0.2 (0.9 to 7.1). We
then randomly selected a profession for each bin among those with the smallest distance to
the middle of the respective bin. In our sample, none of the professions had human stereotype
ratings falling below 1.5 or above 6.3, or between 3.9 and 4.3. Therefore, our resulting data
consisted of 22 professions: 12 female-centric and 10 male-centric, as can be seen in Figure 1.
        </p>
        <p>The right half of Figure 1 illustrates the biases we identified within the narratives that
ChatGPT created. As the figure shows, the generated stories amplify the bias held by humans:
while this sample of professions is evenly distributed across the scale of human ratings, for
most of these professions the generated stories are either strongly female-biased or strongly
male-biased, with few having a more even gender distribution. The graphic uses red and blue
dots to denote professions considered stereotypically female and male respectively. Interestingly,
some stereotypically male roles, such as research scientist and doctor, veered towards a strong
female bias in the AI-generated stories compared to the human stereotypes.</p>
        <p>Figure 2 presents a scatter plot that directly correlates the human stereotype rating with the
bias exhibited in the generated narratives, for all 66 professions. If ChatGPT’s generated stories
favored an even gender distribution instead of reinforcing stereotypes, the majority of the
data points would align around y=0.5. On the other hand, if ChatGPT reflects the stereotypes
perceived by humans, then most data points would reasonably align with the trend line drawn
in black (in this instance, y=(x-1)/6, as x ranges from 1 to 7 and y varies from 0 to 1). However,
neither of those are the case.</p>
        <p>For professions typically associated with women, this stereotype is strongly amplified in
ChatGPT’s stories. Every red data point sits below the trend line, showing that ChatGPT
associates these jobs with women more strongly than most humans do. Remarkably, all but five
points are at y=0, which are professions where the stories exclusively feature female characters.
Two professions (cashier and teacher) are marginally above y=0, and three others (bookkeeper,
clerk, high school teacher) reside below the trend line, albeit nearer to it.</p>
        <p>Concerning professions often associated with males, we can identify three predominant
clusters: The first subset reveals amplified male stereotypes (above the trend line), including
professions such as bartender, basketball player, bellhop, butcher, carpenter, coach, computer
programmer, and more. The second subset includes jobs commonly perceived as male-dominated,
yet the bias in ChatGPT’s stories is predominantly female (≤20% of the stories feature male
characters) - essentially, ChatGPT reverses the stereotype. This group includes the professions
astronaut, attorney, dentist, doctor, lawyer, research scientist, and tattooist. The third subset
includes jobs typically perceived as male’ by humans, but which nearly attain gender parity
in the generated stories. These professions, including banker, chef, executive, high school
principal, history professor, movie director, pilot, and professor, are between 40%-60%, except
for high school principal, which is slightly higher at 60.6%.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Discussion</title>
      <p>We chose the task of story generation to investigate gender bias in regards to professions in
ChatGPT (GPT-3.5). While many typical binary gender bias metrics and benchmarks utilize
template sentences or reference resolution tasks, these methods have been criticized in part
because of inconsistency in results due to this templating [14]. Analyzing gender stereotypes
in generated stories might be one way to counter these problems. This approach uses a test
case comparable to common downstream tasks, thus testing explicitly the bias that the LLM
might introduce in practice, and allows for more diverse domains to be studies based on the
story prompt.</p>
      <p>
        The results of our investigation highlight two significant findings. Firstly, it is evident
that ChatGPT-generated short stories greatly intensify gender stereotypes associated with
occupations, even more so than human perceptions of gender roles in the same professions.
This reflects the findings by Kotek at al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Despite eforts to remove bias from the models
during training and using data that include conventional bias tasks such as the Winobias dataset,
the bias amplification is not decreasing compared to earlier models.
      </p>
      <p>Secondly, there is a stark diference between male and female stereotypes observed in this
study. While gender bias is maintained or even strengthened in the case of typically female
jobs, the bias towards stereotypically male occupations is sometimes inverted, casting them as
typically female roles in the generated stories. Analyzing the occupations where male gender
bias remained and those where it was either reversed or is almost equal, a pattern emerges.
Many jobs retaining their male bias are blue-collar or might be perceived as “lower-status”.
Conversely, those reversing or neutralizing bias are predominantly roles seen as high-status.
Furthermore, we noted that many of the roles which saw bias reversed tended to be more
stereotypical or emblematic jobs that are frequently portrayed in the media and literature, such
as doctors, pilots, astronauts, professors, and lawyers (consider e.g. that doctor vs. nurse is a
typical example for gender bias in language models).</p>
      <p>This second efect we observe may be an outcome of the reinforcement learning from human
feedback applied by OpenAI. It stands to reason that workers could have rectified gender biases
associated with professions such as doctors or lawyers, as they might be more aware of existing
biases and of eforts to encourage women to take up these professions. However, this correction
was not universally applied to all stereotypically male occupations. In the cases were it was
applied, this adjustment may have inadvertently lead to an inversion of roles, evidenced by the
fact a staggering 97% of narratives featuring doctors portray women characters.</p>
      <p>There are more narratives predominantly featuring female characters and assigning
highstatus roles to female leads, whilst stereotypical male or blue-collar employment remained male.
However, more commonly female-dominated roles are chiefly represented by women, strongly
amplifying stereotypes about occupations already linked with women such as hairdressers,
manicurists, or florists, jobs that are traditionally deemed “women’s work” such as nannies,
nurses, or housekeepers, and even occupations with minor societal bias, for instance, bank
tellers or paralegals. While the model does extend some of the professions stereotypically
associated with men to female characters, the same is not the case the other way around - which
mirrors a form of gender bias common in society as well, in which particularly occupations
deemed very feminine are perceived as demeaning or emasculating for men. Worrying is the
extent to which the efect in the generated stories goes well beyond the stereotypes that humans
associate with the same professions.</p>
      <p>It is unclear to us where this efect stems from. We hypothesize that it might be an efect of
the RLHF training used for ChatGPT, introducing what is essentially an overcorrection based on
previous criticism on gender bias in language models. At the same time, however, it might very
well be based on statistics based on subsets of the training data as well. While these diferences
do not mirror human biases, do they maybe mirror trends in short stories published online that
were used for the training? With closed-source models such as ChatGPT, this is dificult to
establish.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>The main finding of our experiment is the stark diference in how stereotypes are perpetuated
when it comes to female vs. male characters. The stories unexpectedly feature a majority of
female characters (66%), and for high-status occupations common in stories, such as doctors,
often features overwhelmingly female characters (for this profession 97%). At the same time,
many less highly regarded occupations that are also considered stereotypically male by humans
are strongly reserved for male characters, also amplifying these stereotypes well beyond human
biases. Moreover, almost all of the occupations considered more stereotypically female are
practically exclusively given female characters, amplifying existing biases for professions such
as nurses or secretaries well beyond the stereotypes associated with them by humans.</p>
      <p>Interestingly, this diference has to our knowledge not been seen in other studies focused
on tasks such as reference resolution. As far as we know, this efect is unique to generated
stories, and certainly concerning. In this study we only considered one simple example - there
are many other ways in which stories can perpetuate stereotypes that would be more dificult
to analyze, such as through character traits or roles of diferent characters in the plot and many
other attributes assigned to diferent characters.</p>
      <p>In the future, it would certainly be interesting to experiment more broadly with bias detection
in stories generated from LLMs. We only tested ChatGPT, for which it has been established that
guardrails and RLHF have been used to try to mitigate bias, which might have been one reason
for the over-correction we found in regards to female story characters. Investigating other
LLMs and in particular open-access language models might shed more light on the sources
of this bias. One other direction we did not test deeply is the efect of the prompt itself. The
description we used was intended to be relatively neutral, and it stands to reason that variations
in the prompt could influence the outcome in terms of gender bias. We did conduct some tests
with a variation where we asked for “children’s stories”, but the observed bias was quite similar
to the standard variation, and we did not analyze this variant further at this point.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Generative AI (GPT-4) was used to aid in the writing of this manuscript. The model was given
bullet points or rough paragraphs (sometimes written partly in bullet point style, sometimes
only containing spelling mistakes such as missing capitalization or punctuation). This was
done with about 2-5 paragraphs at a time. The resulting text was then edited by the author to
ensure that the content and claims and overall tone remained our own, and to correct phrasings
that changed the meaning of sentences. Afterwards it was edited again to shorten the text by
about a third. We took care that all of the claims made herein are our own, and no information
(concerning related work, general statements, the results or their interpretations) was added by
the language model.</p>
      <p>This work was funded by the by the FET-Open Project #951846 “MUHAI – Meaning and
Understanding for Human-centric AI” by the EU Pathfinder and Horizon 2020 Program.
for Computational Linguistics and the 11th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), Association for Computational
Linguistics, Online, 2021, pp. 5356–5371. URL: https://aclanthology.org/2021.acl-long.416.
doi:10.18653/v1/2021.acl-long.416.
[14] P. Delobelle, E. Tokpo, T. Calders, B. Berendt, Measuring Fairness with Biased Rulers: A
Comparative Study on Bias Metrics for Pre-trained Language Models, in: Proceedings of
the 2022 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Association for Computational Linguistics,
Seattle, United States, 2022, pp. 1693–1706. URL: https://aclanthology.org/2022.naacl-main.
122. doi:10.18653/v1/2022.naacl-main.122.
[15] S. Husse, A. Spitz, Mind Your Bias: A Critical Review of Bias Detection Methods for
Contextual Language Models, in: Findings of the Association for Computational
Linguistics: EMNLP 2022, Association for Computational Linguistics, Abu Dhabi, United Arab
Emirates, 2022, pp. 4212–4234. URL: https://aclanthology.org/2022.findings-emnlp.311.
doi:10.18653/v1/2022.findings-emnlp.311.
[16] H. Kotek, R. Dockum, S. Babinski, C. Geissler, Gender bias and stereotypes in linguistic
example sentences, Language 97 (2021) 653–677. URL: https://muse.jhu.edu/article/840952.
doi:10.1353/lan.2021.0060.
[17] M. Carreiras, A. Garnham, J. Oakhill, K. Cain, The Use of Stereotypical Gender Information
in Constructing a Mental Model: Evidence from English and Spanish, The Quarterly Journal
of Experimental Psychology Section A 49 (1996) 639–663. URL: http://journals.sagepub.
com/doi/10.1080/713755647. doi:10.1080/713755647.
[18] J. Arnold, The rapid use of gender information: evidence of the time course of pronoun
resolution from eyetracking, Cognition 76 (2000) B13–B26. URL: https://linkinghub.elsevier.
com/retrieve/pii/S0010027700000731. doi:10.1016/S0010-0277(00)00073-1.
[19] D. Reynolds, A. Garnham, J. Oakhill, Evidence of immediate activation of gender
information from a social role name, Quarterly Journal of Experimental Psychology
59 (2006) 886–903. URL: http://journals.sagepub.com/doi/10.1080/02724980543000088.
doi:10.1080/02724980543000088.
[20] Y. Esaulova, C. Reali, L. Von Stockhausen, Influences of grammatical and stereotypical
gender during reading: eye movements in pronominal and noun phrase anaphor resolution,
Language, Cognition and Neuroscience 29 (2014) 781–803. URL: http://www.tandfonline.
com/doi/abs/10.1080/01690965.2013.794295. doi:10.1080/01690965.2013.794295.
[21] S. Sczesny, M. Formanowicz, F. Moser, Can Gender-Fair Language Reduce
Gender Stereotyping and Discrimination?, Frontiers in Psychology 7 (2016). URL: http:
//journal.frontiersin.org/Article/10.3389/fpsyg.2016.00025/abstract. doi:10.3389/fpsyg.
2016.00025.
[22] L. Bian, S.-J. Leslie, A. Cimpian, Gender stereotypes about intellectual ability emerge early
and influence children’s interests, Science 355 (2017) 389–391. URL: https://www.science.
org/doi/10.1126/science.aah6524. doi:10.1126/science.aah6524.
[23] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, Z. Wu,
L. Zhao, D. Zhu, X. Li, N. Qiang, D. Shen, T. Liu, B. Ge, Summary of ChatGPT-Related
research and perspective towards the future of large language models, Meta-Radiology
1 (2023) 100017. URL: https://linkinghub.elsevier.com/retrieve/pii/S2950162823000176.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Appendix</title>
      <sec id="sec-9-1">
        <title>A.1. Prompt</title>
        <p>GPT-3.5 was prompted as follows:
u s e r _ m e s s a g e = ’ W r i t e a s t o r y a b o u t a day i n t h e l i f e o f a ’ +
p r o f + ’ . ’
s y s t e m _ m e s s a g e = ’ You a r e a w r i t e r o f s h o r t s t o r i e s . ’
r e s p o n s e = c l i e n t . c h a t . c o m p l e t i o n s . c r e a t e (
model = ’ gpt − 3 . 5 − t u r b o − 1 1 0 6 ’ ,
m e s s a g e s = [
{ ’ r o l e ’ : ’ system ’ , ’ c o n t e n t ’ : s y s t e m _ m e s s a g e } ,
{ ’ r o l e ’ : ’ u s e r ’ , ’ c o n t e n t ’ : u s e r _ m e s s a g e }
)</p>
      </sec>
      <sec id="sec-9-2">
        <title>A.2. Example Story</title>
        <p>This prompt for the profession “dental hygienist” resulted in this story, among others:
“Lena woke to the sound of her alarm, groggily reaching over to turn it of. She knew she
had a busy day ahead as a dental hygienist at the local clinic. After a quick breakfast and a
steaming cup of cofee, she headed out the door, ready to face the day.</p>
        <p>As she arrived at the clinic, she was greeted by the familiar sights and sounds of the dental
ofice. The waiting room was already filled with patients, some nervously tapping their feet
while others flipped through outdated magazines.</p>
        <p>Lena began her day by reviewing her patient list, preparing for the variety of cases she would
see throughout the day. She knew some appointments would be routine cleanings, while others
might involve more extensive procedures. Regardless, she was ready to provide the best care
possible for each person who walked through the door.</p>
        <p>The first patient of the day was a young girl, eager to show of her new braces. Lena took
the time to gently clean around the brackets, providing tips for proper brushing and flossing to
ensure the girl maintained good oral hygiene during her orthodontic treatment.</p>
        <p>The next patient was a middle-aged man who hadn’t been to the dentist in years. Lena
reassured him as she carefully removed plaque and tartar from his teeth, all the while explaining
the importance of regular dental check-ups and cleanings.</p>
        <p>As the day wore on, Lena’s steady hands and warm demeanor were a comfort to each patient
she saw. She took the time to listen to their concerns, answer their questions, and provide
education on maintaining good oral health.</p>
        <p>In between appointments, Lena took the opportunity to sterilize her instruments and tidy
up her workspace, ensuring everything was in order for the next patient. She chatted with
her colleagues in the break room, sharing a few laughs and swapping stories from their time
working in the dental field.</p>
        <p>The last patient of the day was an elderly woman who had been coming to the clinic for
years. As Lena cleaned her teeth, the woman reminisced about her many visits to the ofice and
how the staf had become like family to her. It was moments like these that made Lena’s job
so rewarding – the opportunity to make a positive impact on people’s lives and build lasting
relationships with her patients.</p>
        <p>As the day came to a close, Lena felt a sense of satisfaction knowing she had helped improve
the dental health of those she had seen. She gathered her things, said goodbye to her colleagues,
and headed home, looking forward to a well-deserved evening of relaxation.</p>
        <p>It had been a long and fulfilling day in the life of a dental hygienist, and Lena couldn’t wait
to do it all over again the next day.”</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kotek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dockum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Gender bias and stereotypes in Large Language Models</article-title>
          ,
          <source>in: Proceedings of The ACM Collective Intelligence Conference</source>
          , ACM,
          <string-name>
            <surname>Delft</surname>
            <given-names>Netherlands</given-names>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>24</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3582269.3615599. doi:
          <volume>10</volume>
          .1145/3582269. 3615599.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Kennison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Trofe</surname>
          </string-name>
          , Comprehending Pronouns:
          <article-title>A Role for Word-Specific Gender Stereotype Information</article-title>
          ,
          <source>Journal of Psycholinguistic Research</source>
          <volume>32</volume>
          (
          <year>2003</year>
          )
          <fpage>355</fpage>
          -
          <lpage>378</lpage>
          . URL: http://link.springer.com/10.1023/A:1023599719948. doi:
          <volume>10</volume>
          .1023/A:
          <fpage>1023599719948</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>U.</given-names>
            <surname>Gabriel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gygax</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sarrasin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garnham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Oakhill</surname>
          </string-name>
          ,
          <article-title>Au pairs are rarely male: Norms on the gender perception of role names across English, French, and</article-title>
          <string-name>
            <surname>German</surname>
          </string-name>
          ,
          <source>Behavior Research Methods</source>
          <volume>40</volume>
          (
          <year>2008</year>
          )
          <fpage>206</fpage>
          -
          <lpage>212</lpage>
          . URL: http://link.springer.
          <source>com/10.3758/BRM.40.1.206. doi:10.3758/BRM.40.1</source>
          .206.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yatskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          , Volume
          <volume>2</volume>
          (
          <string-name>
            <surname>Short</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>20</lpage>
          . URL: http://aclweb.org/anthology/N18-2003. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>N18</fpage>
          -2003.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>The Woman Worked as a Babysitter: On Biases in Language Generation</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3405</fpage>
          -
          <lpage>3410</lpage>
          . URL: https://www.aclweb.org/anthology/D19-1339. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D19</fpage>
          -1339.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Stanovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>Evaluating Gender Bias in Machine Translation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>1679</fpage>
          -
          <lpage>1684</lpage>
          . URL: https://www.aclweb.org/anthology/P19-1164. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>P19</fpage>
          -1164.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Caliskan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Bryson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Semantics derived automatically from language corpora contain human-like biases</article-title>
          ,
          <source>Science</source>
          <volume>356</volume>
          (
          <year>2017</year>
          )
          <fpage>183</fpage>
          -
          <lpage>186</lpage>
          . URL: https://www.science. org/doi/10.1126/science.aal4230. doi:
          <volume>10</volume>
          .1126/science.aal4230.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          , S. Gabriel, L. Qin,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          , Social Bias Frames:
          <article-title>Reasoning about Social and Power Implications of Language, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>5477</fpage>
          -
          <lpage>5490</lpage>
          . URL: https://www.aclweb.org/anthology/2020. acl-main.
          <volume>486</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .acl-main.
          <volume>486</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kurita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pareek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Black</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsvetkov</surname>
          </string-name>
          , Measuring Bias in Contextualized Word Representations,
          <source>in: Proceedings of the First Workshop on Gender Bias in Natural Language Processing</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>166</fpage>
          -
          <lpage>172</lpage>
          . URL: https://www.aclweb.org/anthology/W19-3823. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>W19</fpage>
          -3823.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yatskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          , Men Also Like Shopping:
          <article-title>Reducing Gender Bias Amplification using Corpus-level Constraints</article-title>
          ,
          <source>in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Copenhagen, Denmark,
          <year>2017</year>
          , pp.
          <fpage>2979</fpage>
          -
          <lpage>2989</lpage>
          . URL: http://aclweb.org/anthology/D17-1323. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D17</fpage>
          -1323.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Basta</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. R.</surname>
          </string-name>
          <article-title>Costa-jussà, N. Casas, Evaluating the Underlying Gender Bias in Contextualized Word Embeddings</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Gender Bias in Natural Language Processing</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>39</lpage>
          . URL: https://www.aclweb.org/anthology/W19-3805. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>W19</fpage>
          -3805.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McMillan-Major</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shmitchell</surname>
          </string-name>
          ,
          <article-title>On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , Virtual Event Canada,
          <year>2021</year>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>623</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3442188.3445922. doi:
          <volume>10</volume>
          .1145/3442188.3445922.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nadeem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bethke</surname>
          </string-name>
          , S. Reddy,
          <article-title>StereoSet: Measuring stereotypical bias in pretrained language models</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association doi:10</source>
          .1016/j.metrad.
          <year>2023</year>
          .
          <volume>100017</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cahyawijaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wilie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lovenia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity</article-title>
          , in: J. C. Park,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Arase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wijaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Purwarianti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Krisnadhi</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd</source>
          <article-title>Conference of the AsiaPacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics</article-title>
          , Nusa Dua, Bali,
          <year>2023</year>
          , pp.
          <fpage>675</fpage>
          -
          <lpage>718</lpage>
          . URL: https: //aclanthology.org/
          <year>2023</year>
          .ijcnlp-main.
          <volume>45</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          , T. Brown, M. Martic,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Deep Reinforcement Learning from Human Preferences</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2017</year>
          . URL: https://proceedings.neurips.cc/ paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kapoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <article-title>Quantifying ChatGPT's gender bias</article-title>
          , ???? URL: https://www. aisnakeoil.com/p/quantifying-chatgpts
          <string-name>
            <surname>-</surname>
          </string-name>
          gender-bias,
          <source>accessed on 2024-01-23.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>