<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>European Workshop on Algorithmic Fairness, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Fair balancing? Evaluating LLM-based Privacy Policy Ethics Assessments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vincent Freiberger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Buchmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI)</institution>
          ,
          <addr-line>Dresden/Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leipzig University</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>0</volume>
      <fpage>1</fpage>
      <lpage>03</lpage>
      <abstract>
        <p>At this moment, we observe a strong increase in use cases and organizations using machine learning and artificial intelligence (AI), particularly large language models (LLMs) like OpenAI's GPT series. This has many advantages for organizations. However, it becomes increasingly important to evaluate ethical and fairness-related aspects of using personal information as data for model training or as input for automated processes. Therefore, privacy policies are an important resource. Privacy policies aim to make data collection, sharing, and usage transparent to users. However, privacy policies are also known to be long and complex. This has raised issues like failing to understand such policies or even consent fatigue, i.e., users just accepting all tracking of their data. Potentially abusive or unfair privacy practices may remain unnoticed. An independent, automated assessment of ethics in privacy policies could help in fairly balancing out existing information asymmetries. We explore using an LLM for an ethics assessment of data management practices documented in privacy policies. In particular, we develop and systematically evaluate a prompting template for ethics evaluation. By means of qualitative experiments with privacy policies from the Top-100 German web shops, we quantitatively investigate the robustness and quality of the LLM-based ethics assessment, and how varying roles, user interests, ethical frameworks, etc. in the LLM prompts affect the assessment. To the best of our knowledge, we are the first to investigate how LLMs can be used for ethics assessments of data management practices. Our results show that LLM-based ethics assessment, while still limited in its specificity and consistency, shows promise for the future. The identified criteria are consistent with those from related work. We find that varying the role assigned to the LLM has the largest effect on the LLM's ethics assessment.
An ethics assessment could allow end users to make more informed decisions. The moral judgment of LLMs regarding online privacy is not just relevant to policy assessment, but could also be used to investigate specifications, regulations, norms, or legal documents.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Privacy Policies</kwd>
        <kwd>Ethics</kwd>
        <kwd>Morality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        We observe a rapid proliferation in the use of machine learning and artificial intelligence
(AI) in our everyday lives. A large share of the data from today’s domestic devices, public
transportation, electric vehicles, industrial control systems, and office equipment is used for
training or analyzed by AI. Amongst the technology industry’s biggest companies are some
that collect excessive amounts of sometimes even personally identifying data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. User data is
valuable to companies as it allows them, for instance, to track clients, improve services, target
marketing, or sell data to third parties. The increasing use of large language models (LLMs)
such as GPT-3.5 [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], Gemini [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], or Opus [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] in complex IT ecosystems is likely to increase
the impact on the user’s privacy by an order of magnitude.
      </p>
      <p>
        This makes it challenging for users to decide if a digital service meets their privacy
preferences and moral standards. Organizations are required by the General Data Protection Regulation
(GDPR) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in the EU to publish a privacy policy. It must contain data management practices of
the organization, as well as users’ rights regarding their data. Existing approaches from data
protection legislation assume an aware user. This user actively reads the privacy policy and
boilerplate information of each service, to make an informed decision about using a service
and consenting to data collection. Given hundreds of services every day and non-transparent
multi-page privacy statements in fine print [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ], it is unlikely that this approach ensures a
fair balance between the interests of the service provider and the service user. We investigate
this issue by using an LLM for an ethics assessment of privacy policies. In particular, we want to
find out if an LLM can be used to automatically identify data management practices that conflict
with existing social and ethical standards regarding fairness, accountability, etc. For example, a
policy would be considered informationally unfair if it extensively uses complex legal phrases
that discriminate against people with dyslexia and non-native speakers. An example of an
unethical policy would be one that grants extensive rights to utilize the user’s personal data.
      </p>
      <p>
        An automated assessment of such aspects is challenging. Privacy policies need to justify,
particularly for AI-based services, complex data collection and usage practices. That results in a
large spectrum of ethical problems that may be encountered, depending on the context of the
privacy policy and the assessment. Thus, we strive for insight into the shortcomings of LLMs in
moral judgment regarding online privacy. Our research questions are as follows:
RQ1. How capable are LLMs of assessing the ethicality of privacy policies?
RQ2. Do variations in the context of the LLM prompt influence its ethics assessment?
To approach our research questions, we perform experiments on recent German privacy
policies with GPT-3.5 turbo [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], an LLM widely used at the time of writing. Our experiments
systematically vary the role assigned to the LLM for assessment, the interests of the user affected
by the policy, the ethical framework assigned for assessment, and the scope and temporal span
of the assessment in the prompt. Also, we assess the robustness of moral judgment to a different
seed for generation, and to a paraphrased wording of the prompt. We evaluate the LLM outputs’
quality on a broad set of metrics. In particular, we make the following contributions:
• We conduct an ethics assessment on 55 privacy policies of the Top-100 German web shops
and evaluate their quality.
• We identify how changes in the prompt influence the ethics assessment of an LLM. In
total, we run 9240 experiments with GPT-3.5 turbo.
• We reveal ethical shortcomings of recent privacy policies, based on 1116 distinct criteria
returned by our experiments.
      </p>
      <p>To the best of our knowledge, we are the first to systematically evaluate an LLM’s ethics
assessment of privacy policies. Such an assessment offers guidance to users and provides
insight into the moral judgment of LLMs regarding online privacy. When looking at the details,
we find that an LLM-based assessment of privacy policies is still limited in its depth, specificity,
and consistency. The identified criteria are consistent with related work on privacy policies,
and particularly a change in role changes the perspective of an assessment. Thus, it is a step
towards fairly balancing provider and user interests by empowering the user.</p>
      <p>Paper structure: Section 2 reviews related work. Section 3 outlines our research method.
Sections 4 and 5 evaluate and discuss our results. Finally, Section 6 concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>This section provides a brief overview of ethics, privacy policies, large language models, and
tools from computational linguistics needed to evaluate an LLM output.</p>
      <sec id="sec-2-1">
        <title>2.1. Ethics</title>
        <p>
          Privacy ethics addresses the degree of access others have to one’s information as well as the control
one has over it and the actions one can take concerning one’s privacy [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10, 11, 12, 13</xref>
          ]. It discusses
often complex privacy trade-offs and the balance of power between the data holder and data
subject [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ]. Seeking privacy serves two major interests: security interests (staying unharmed)
and privacy per se [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Privacy per se is about influencing the way we present ourselves
to others and, more broadly, our autonomy [
          <xref ref-type="bibr" rid="ref16 ref17">17, 16</xref>
          ]. The philosophical debate on privacy
ethics distinguishes between privacy and the right to privacy. Depending on the context in which
another party acquires information about oneself and on the underlying intention, privacy may
be violated without the right to privacy being violated [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Hence, focusing on the right to
privacy leans towards a deontological perspective, whereas viewing privacy by itself leads to a
consequentialist perspective.
        </p>
        <p>
          In the realm of online privacy, ethical concerns regarding surveillance [
          <xref ref-type="bibr" rid="ref16 ref18">16, 18</xref>
          ], impact of
choice [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], manipulation [
          <xref ref-type="bibr" rid="ref15 ref20">20, 15</xref>
          ], and power imbalance [
          <xref ref-type="bibr" rid="ref14 ref15">15, 14</xref>
          ] have been raised.
        </p>
        <p>
          Codes of ethics, legal statutes, or international declarations could provide helpful input for
an ethics assessment of online privacy [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. They represent values ingrained in our society. The
Universal Declaration of Human Rights (UDHR) [22] and the European Convention on Human
Rights (ECHR) [23] provide a relevant minimum standard to consider in an ethics assessment.
Aspects like freedom, equality, and dignity (Art. 1 UDHR), non-discrimination (Art. 2 UDHR,
Art. 21 ECHR), freedom of expression (Art. 19 UDHR), protection of children, elderly and
the disabled (Art. 24, 25, 26 ECHR), protection of intellectual property (Art. 27(2) UDHR), or
consumer protection (Art. 30 ECHR) are stated. Additionally, the Digital Services Act [24]
could be seen as a foundation of ethical principles. Amongst the principles it enforces are the
legality of content, accountability, non-manipulative practices, transparency, and the protection
of minors. Underlying virtues like inclusiveness, protection of the vulnerable or accountability
could motivate an approach considering virtue ethics.
        </p>
        <p>
          This leads us to define the following quality criteria for an assessment: It should consider
all relevant perspectives and stakeholders with enough depth in its normative grounding [25].
Ethical aspects may come into effect unintended, sometimes as second-order consequences [25]
which need to be considered. The ongoing debate about privacy ethics and its definition [
          <xref ref-type="bibr" rid="ref11 ref12 ref16 ref17">12, 11,
16, 17</xref>
          ] shows that an assessment must be thorough, consistent, structured, and comprehensible
in its reasoning. Consistency of moral advice given by the model is important as it influences
user judgment [26]. A good assessment should be concise and understandable for everyone
as we want to provide informational fairness [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Addressing privacy trade-offs and power
imbalances is essential for a meaningful assessment [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ].
        </p>
        <p>LLMs can be investigated based on moral psychology [27]. This involves looking into the
extent to which moral reasoning and moral judgments are represented in model outputs and
into biases the model might have in that regard. For instance, ChatGPT has been found to be
inconsistent in its moral advice when facing a moral dilemma [26]. The closest to our work is
the ETHICS data set [28] used to investigate general ethical judgment of language models.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Privacy policies</title>
        <p>
          Privacy policies aim to make data collection, sharing, and usage transparent to users [29]. A
website owner is required by the General Data Protection Regulation (GDPR) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to publish a
privacy policy. There is an information asymmetry between website owners’ full knowledge
of their privacy practices as well as their potential shortcomings and the average users’
unawareness of them. A privacy policy addresses this information asymmetry between service
provider and user and helps in creating trust. Specifically, privacy policies should inform users
about their data protection rights and explain data management practices. This includes clearly
stating retention periods, which data is transferred to which third parties, and the respective
purposes. The GDPR establishes the principles of (1) lawfulness, fairness and
transparency, (2) purpose limitation, (3) data minimization, (4) accuracy, (5) storage limitation,
(6) integrity and confidentiality, and (7) accountability.
        </p>
        <p>The advance of generative AI-based applications, and services integrating them, introduces
more complexity to data protection and the respective privacy policies. The interests of the service
provider and the user conflict. Users may want their data to stay as private as possible.
The service provider, in contrast, wants to collect user data and can leverage it, for instance, to
improve services or target marketing. The GDPR handles this conflict of interest by enforcing
a notion of fair balancing [30]. This means that the privacy risks faced by users should be
balanced with the business interests of the service provider. We would like to note that this
two-sided view is a simplification of reality that involves many more stakeholders with differing
interests, e.g., politicians, consumer protection initiatives, independent ethicists, etc.</p>
        <p>
          One common issue is transparency in privacy policies [
          <xref ref-type="bibr" rid="ref7 ref8">31, 7, 8, 32</xref>
          ]. Policies tend to be long,
written in inaccessible language, and hard to understand. This informational unfairness [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
leads to issues like consent fatigue [33] and users potentially being exposed to unethical privacy
practices without their awareness of the consequences. Privacy policies can hide and mitigate
unethical data handling practices and deceive users into trusting the service using persuasive
appeals [31, 34]. These findings, however, were not obtained in the context of the GDPR. Unfair representations are another
common problem [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Privacy policies contain, for instance, gender bias. A third type of issue
is a lack of fair balancing, as seen in privacy policies that have claimed all rights over users’
data [35]. Over time, consistent ethical shortcomings can cause privacy fatigue, which can have
a stronger influence on online privacy behavior than privacy concerns [33].
        </p>
        <p>
          The given issues, particularly regarding complexity and length of privacy policies, have
motivated privacy assistants [
          <xref ref-type="bibr" rid="ref22">36, 37, 38</xref>
          ]. Capabilities that emerge from scaling up LLMs [
          <xref ref-type="bibr" rid="ref23">39</xref>
          ] have
given them a wide range of applicability [
          <xref ref-type="bibr" rid="ref24">40</xref>
          ]. This makes them interesting as a tool for assessing
privacy policies. They have been utilized by prompting them with a privacy policy and automatic
queries on concepts around privacy and compliance [
          <xref ref-type="bibr" rid="ref25">41</xref>
          ], for topic classification of sections of
policies [
          <xref ref-type="bibr" rid="ref26">42</xref>
          ], or as an interactive question-answering assistant [
          <xref ref-type="bibr" rid="ref27">43</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Large language models and prompting</title>
        <p>
          LLMs generate text for a given prompt by iteratively predicting the next token based on
probability. GPT-3.5 is an LLM based on GPT-3 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] improved by reinforcement learning with human
feedback [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. It is, at the time of writing, the underlying model of the free version of ChatGPT
with millions of users [
          <xref ref-type="bibr" rid="ref28">44</xref>
          ]. The model version GPT-3.5 turbo supports up to 16385 tokens in
context length and is accessible via OpenAI’s API. GPT-3.5 is capable of deductive, abductive,
and commonsense reasoning but struggles with inductive reasoning [
          <xref ref-type="bibr" rid="ref29 ref30">45, 46</xref>
          ]. Deductive
reasoning goes from rather general concepts in the reasoning process down to specifics. Inductive
reasoning generalizes from specifics. Abductive reasoning takes a set of observations and draws
the most likely conclusion. Commonsense reasoning is “understanding and reasoning about
everyday concepts and knowledge that most people are familiar with, to make judgments and
predictions about new situations” [
          <xref ref-type="bibr" rid="ref30">46</xref>
          ].
        </p>
        <p>
          LLMs have recently been shown to have limited capabilities in many different domains [
          <xref ref-type="bibr" rid="ref24 ref31">47, 40</xref>
          ].
Smart and complex prompting strategies [
          <xref ref-type="bibr" rid="ref3 ref32 ref33 ref34 ref35 ref36">48, 49, 50, 3, 51, 52</xref>
          ] have been used to address
limitations. For a general review on prompting LLMs, we refer to the literature [
          <xref ref-type="bibr" rid="ref37">53</xref>
          ].
        </p>
        <p>
          Particularly reasoning-related tasks benefit from breaking down the problem and solving
it step-by-step with an LLM [
          <xref ref-type="bibr" rid="ref32 ref33">48, 49</xref>
          ]. Such prompting strategies are referred to as
Chain-of-Thought prompting [
          <xref ref-type="bibr" rid="ref34">50</xref>
          ]. Asking a model to reflect on and improve its
output has been shown to improve quality [54]. To guide a model in a specific direction for
model outputs, giving examples for potential outputs is effective [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This is also referred to as
few-shot prompting. Repeating particularly relevant aspects in the input prompts helps with
the model considering them in the generated output [55]. Controlled slight variations of the
prompt can have a big impact on the model output [27]. If reproducible outputs are important,
the seed needs to be fixed and other parameters need to be unchanged over the experiments [56].
Robustness testing of LLM outputs can be implemented by paraphrasing the prompt [
          <xref ref-type="bibr" rid="ref25">41, 57, 58</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Tools from computational linguistics</title>
        <p>We borrow from linguistics to evaluate the output of LLMs. The readability of text can be
measured with the Flesch Reading Ease (FRE) [59]. We refer to word count and FRE as
surface-level metrics. Sentiment captures the positive or negative associations conveyed by text.
Sentiment analysis delivers the most accurate results when performed sentence-wise. For
German-language text, german-sentiment [60] can be used as a tool. Embeddings are numeric vectors
representing the semantic meaning of the words or text for which they are calculated [61].
OpenAI’s small embedding model version 3 [62] calculates 1536-dimensional text embeddings,
which are sufficient to capture a high degree of semantic nuance. For aggregating text embeddings,
max-pooling can be effectively utilized [63]. Max-pooling takes the maximum for every dimension
of the embedding vectors over all embeddings that are pooled. Calculating cosine similarities is
commonly done to compare semantic similarity between embeddings.</p>
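        <p>The last two steps can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation used in the experiments: the helper names max_pool and cosine_similarity are hypothetical, embeddings are plain lists of floats, and the three-dimensional toy vectors stand in for 1536-dimensional embeddings.</p>
        <preformat>
```python
import math


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [max(column) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy sentence embeddings for two texts (hypothetical values).
embeddings_a = [[0.1, 0.9, 0.0], [0.4, 0.2, 0.3]]
embeddings_b = [[0.9, 0.1, 0.0], [0.2, 0.3, 0.1]]

# Pool the sentence embeddings of each text into one vector, then compare.
doc_a = max_pool(embeddings_a)
doc_b = max_pool(embeddings_b)
similarity = cosine_similarity(doc_a, doc_b)
```
        </preformat>
        <p>Pooling first and comparing afterwards keeps one vector per text, so texts with different numbers of sentences can still be compared directly.</p>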
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Our research method</title>
      <p>This section explains how we create our prompting template, extract relevant information from
the LLM outputs, and evaluate the LLM assessment.</p>
      <sec id="sec-3-1">
        <title>3.1. LLM experiments</title>
        <p>Our prompting template follows the LLM prompting techniques introduced in Section 2.3. It
can be found in the project’s GitHub repository<sup>1</sup>. Since our privacy policies are written in
German, we prompt the LLM in the same language. The system prompt describes the task for
the LLM. The user prompt gives the privacy policy as input for this task. For example, Figure 1
shows an English translation of a specific system prompt and a user prompt template.</p>
        <p>System prompt: You are a critically scrutinizing, experienced expert on an independent ethics board. You have a nuanced
and in-depth perspective on ethical issues. You already have a lot of experience in the precise ethical assessment of
privacy statements. You will be given a privacy policy. Your task is to uncover aspects in privacy policies that are ethically
questionable from your perspective. To do this, proceed step by step:
1. Identify relevant ethical test criteria for this privacy policy from your perspective as criteria for a later assessment.
When naming the test criteria, stick to terms and concepts that are as standardized/common in the field of ethics
as possible.
2. Based on this, check for ethical problems or ethically questionable circumstances in the privacy policy. Describe
your analysis.
3. Only after you have completed step 2: Based on your analysis, rate the privacy policy against each of your criteria
on a 5-point Likert-scale. Explain what this rating means. Explain what the ideal case with 5 points and the worst
case with one point would look like. The output in this step should be formatted in bold as follows:
[Insert rating criterion here]: [insert rating here]/5 [insert line break]
[insert justification here]
4. Important: State precisely in keywords what assumptions you have made regarding your [role], the [user
interests], the [scope of ethical implications] (local, global, etc.), the [long or short term nature of ethically relevant
implications] and the [ethical frameworks used] (e.g. utilitarianism, virtue ethics, deontology,...). You must put
these 6 assumptions in square brackets [].
5. Reflect on your assessment and check whether it is complete. Show how your result is anchored in ethical
frameworks. If something is missing, add it! Important: Check for errors in your analysis and correct them if
necessary.</p>
        <p>You must present your approach clearly and follow the steps mentioned.</p>
        <p>User prompt: The privacy policy: &lt;Privacy policy text is inserted here&gt;</p>
        <p>The system prompt asks the LLM to proceed in five steps. (1) Step 1 identifies the assessment
criteria for the policy being assessed, i.e., each policy might be evaluated with different criteria.
(2) Step 2 performs a written ethics assessment based on the criteria from step 1. (3) Step 3
assigns a suitable Likert-score based on the assessment in step 2 for each criterion. (4) Step 4
outputs assumptions made by the model on the role it takes when doing the assessment, the
user interests, the temporal span considered in the assessment, the scope of the assessment,
and the ethical framework utilized for the assessment. (5) Step 5 asks for reflection on the result
and for a justification of the ethical framework that has been utilized. To evaluate the ethical
capabilities of the LLM, we experiment with different contexts.</p>
        <p><sup>1</sup> https://github.com/Freiberger97/LLM-based-privacy-policy-ethics-assessment</p>
        <p>In particular, we have the objective of:
• giving a different perspective to an ethics assessment. More nuanced results should be
possible by combining results from individual variations.
• understanding assumptions an LLM defaults to if given incomplete information.
• understanding how sensitive the LLM is to specifications of important variables for an
ethics assessment.</p>
        <p>We vary the context in five dimensions, as shown in Table 1. Detailed explanations can be found
in our GitHub repository. In our experiments, we vary one dimension after the other. As the
second column in Table 1 shows, this is a total of 27 variations, plus a baseline without such
information. In addition, we test the robustness of the LLM’s outputs by using three different
seeds held constant for all variations and policies. Finally, we use [64] to translate each prompt
to English and back to German, to obtain a paraphrased version of the prompt. We test each
paraphrased prompt with the same three seeds to find out whether variations in the wording have an
impact on the results. In total, we test 168 different prompts against each of our privacy policies.</p>
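        <p>The prompt count above follows from the experimental grid and can be reproduced with a short sketch. The seed values and context labels below are placeholders of our own, since only the counts matter here; the 55 policies are those in our data set.</p>
        <preformat>
```python
from itertools import product

N_VARIATIONS = 27                       # context variations over the five dimensions
SEEDS = (0, 1, 2)                       # placeholder values for the three seeds
WORDINGS = ("original", "paraphrased")  # each prompt is also tested in a paraphrased form
N_POLICIES = 55                         # privacy policies in the data set

# One baseline prompt without context information plus the 27 variations.
contexts = ["baseline"] + [f"variation_{i}" for i in range(1, N_VARIATIONS + 1)]

# Every context is run in both wordings with every seed.
prompt_grid = list(product(contexts, WORDINGS, SEEDS))

prompts_per_policy = len(prompt_grid)                  # 28 * 2 * 3
total_experiments = prompts_per_policy * N_POLICIES
print(prompts_per_policy, total_experiments)           # 168 9240
```
        </preformat>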
        <p>
          Our data set is based on the Top-100 most popular German web shops from 2020 to 2023. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
explains how we have obtained, cleansed and preprocessed this data set. Further, we dropped
poorly scraped documents (e.g. missing content) and enforced a suitable length for GPT-3.5
turbo’s context length (16385 tokens). In the end, we have the most recent versions of 55 distinct
privacy policies in our data set. With our 168 different prompts, this results in 9240 individual
experiments. We decided to use GPT-3.5 turbo [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ] via the OpenAI API, because it is similar to
the well-known, free version of ChatGPT. Except for the seed and the system prompt, we used
the default settings. Figure 2 shows an output for the prompt in Figure 1. All outputs can be
found in our GitHub repository.
        </p>
        <p>1. Identified ethical review criteria: - Transparency: How transparent are the company’s data practices towards
users? - Consent: Has user consent been obtained for data use in a legally and ethically correct manner? - Data
minimization: Is only the data necessary for the respective purpose collected and processed? - Data security: Are
appropriate measures implemented to protect data from security risks? - Data disclosure: Is it openly communicated
to whom personal data is disclosed and for what purpose?
2. Analysis of the privacy policy: The privacy policy contains some ethically questionable aspects: - The collection
of IP addresses every time a website is accessed without the explicit consent of the user and without a clear
indication of how the data is anonymized. - The transfer of personal data to external third parties without an
explicit explanation of who these third parties are and for what purpose the data is being transferred. - The use of
cookies and tracking technologies without clear information on the purpose, storage duration and the possibility
of deactivation.
3. Assessment of the privacy policy: - Transparency: 2/5 The privacy policy is unclear regarding data transfer
and use. - Consent: 3/5 Consent is obtained for the use of cookies and data processing methods, but could be
more detailed. - Data minimization: 2/5 IP addresses and personal data are collected, the necessity of which
is not sufficiently explained. - Data security: 3/5 Security measures are mentioned, but no details on specific
implementation. - Data transfer: 2/5 The transfer of data to external third parties raises questions, as it is not clear
who these third parties are and for what purpose data is transferred.
4. Assumptions: - [Role]: Independent ethics expert on the Ethics Council - [User interests]: Protection of personal
data, transparency, compliance with data protection regulations - [Scope of ethical implications]: Local (data
protection on the website) - [Long- or short-term nature of ethically relevant implications]: Long-term impact on
user privacy and trust - [Ethical frameworks used]: Data protection laws, ethics guidelines, principles of data ethics
5. Reflection and supplementation: The analysis of the privacy policy focused mainly on transparency, consent,
data minimization, data security and data sharing. It is important that the privacy policy provides clear and concise
information so that users can make informed decisions about their data. A more detailed examination of data
retention periods, user rights and the implementation of data protection measures could add further depth to the
analysis. It would be important to consider implications for vulnerable groups and possible privacy implications.</p>
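        <p>Ratings in the format requested in step 3 can be pulled out of such an output with a regular expression. The following is a minimal sketch under our own assumptions: the pattern and the function name extract_ratings are hypothetical, the example text is shortened from the output above, and real outputs may deviate from the requested format and need more robust handling.</p>
        <preformat>
```python
import re

# Matches "criterion: rating/5". Criterion names may contain letters
# (including German umlauts), spaces, and hyphens.
RATING_PATTERN = re.compile(
    r"([A-Za-zÄÖÜäöüß][A-Za-zÄÖÜäöüß \-]*?):\s*([1-5])/5"
)


def extract_ratings(step3_text):
    """Return a dict mapping each criterion name to its 1-5 Likert rating."""
    return {
        name.strip(): int(score)
        for name, score in RATING_PATTERN.findall(step3_text)
    }


example = (
    "3. Assessment of the privacy policy: - Transparency: 2/5 The privacy "
    "policy is unclear regarding data transfer and use. - Consent: 3/5 "
    "Consent is obtained for the use of cookies. - Data minimization: 2/5 "
    "IP addresses and personal data are collected."
)
ratings = extract_ratings(example)
```
        </preformat>
        <p>The pattern relies on each criterion being preceded by punctuation or a list marker. A criterion name that directly follows running text without a separator would be captured together with that text.</p>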
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluating model outputs</title>
        <p>We evaluate the quality of ethics assessments of the LLM (RQ1) by evaluating overall metrics,
comparing between policies, and assessing robustness to changed seeds or paraphrasing. We
evaluate the impact of variations in the prompt context (RQ2) by comparing metrics between
variations. Our metrics are based on the quality criteria established in Subsection 2.1, and we
use the tools outlined in Subsection 2.4. Table 2 gives an overview of our evaluation. Column 2
indicates if a metric is computed over a single criterion. Column 3 considers one of the five
steps the model is instructed to produce in its output (e.g. assumptions as step 4), and Column 4
points out if a metric considers the LLM output as a whole.</p>
        <p>We calculate and compare the number of assessment criteria produced by the LLM. This
indicates how specific, nuanced and in-depth an assessment may be.</p>
        <p>To find out whether deviations in the context of the prompt lead to a different assessment,
we compute the criteria occurrence ratio, i.e., the number of the assessment criteria that
appear in multiple experiments with the same privacy policy. This helps judge how specific an
assessment is in addressing a policy.</p>
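        <p>The occurrence ratio can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: we assume each LLM output has already been parsed into a list of criterion names, we interpret the ratio as the share of outputs for one policy in which a criterion appears, and we match names after simple normalization. The function name <monospace>occurrence_ratio</monospace> and the sample data are hypothetical.</p>
        <preformat>
```python
from collections import Counter


def occurrence_ratio(runs):
    """Return, per criterion, the fraction of runs in which it appears.

    `runs` holds one list of criterion names per LLM output for the same
    policy. Names are lowercased and stripped before matching (assumption).
    """
    if not runs:
        return {}
    counts = Counter()
    for criteria in runs:
        # Using a set counts each criterion at most once per run.
        counts.update({name.strip().lower() for name in criteria})
    return {name: n / len(runs) for name, n in counts.items()}


# Hypothetical criteria extracted from three runs on one policy.
runs = [
    ["Transparency", "Consent", "Data security"],
    ["Transparency", "Data sharing", "Consent"],
    ["transparency", "Data minimization"],
]
ratios = occurrence_ratio(runs)
for name, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{name}: {ratio:.2f}")
```
        </preformat>
        <p>Exact string matching treats synonyms such as "Security" and "Data security" as separate criteria, so a ratio computed this way may understate how often a concept recurs.</p>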
        <p>We further measure distinctiveness of assessment criteria in model outputs, i.e., how
similar the assessment criteria are. To this end, we measure the cosine similarities between the
criteria from different outputs. We also measure cosine similarities of criteria definitions for a
criterion overall and for criteria within an output. Distinctive assessment criteria that do not
deviate from one run to another for the same policy indicate a consistent assessment.</p>
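        <p>A minimal sketch of the similarity computation is given below. It assumes that each criterion (or criterion definition) has already been mapped to an embedding vector; the three-dimensional toy vectors and the helper name <monospace>mean_pairwise_similarity</monospace> are hypothetical stand-ins, and averaging over all pairs is one plausible way to aggregate, not necessarily the one used here.</p>
        <preformat>
```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if len(pairs) == 0:
        raise ValueError("need at least two vectors")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings standing in for three criteria (hypothetical values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
average = mean_pairwise_similarity(embeddings)
print(round(average, 4))
```
        </preformat>
        <p>A low average suggests that criteria are distinct from one another, while values close to 1 point to redundant criteria.</p>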
        <p>We also calculate descriptive statistics of assessment scores, like the mean or standard
deviation for the criteria. All of these descriptive statistics can be compared for the different
prompt variations or between policies. Descriptive statistics indicate the impacts prompt
variations have and can be assessed regarding specificity and consistency.</p>
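        <p>The per-criterion statistics could be computed along the following lines. The sketch assumes that scores have been extracted as (criterion, score) pairs and uses the population standard deviation; whether the reported values use the population or the sample formula is not stated, so this choice is an assumption. The name <monospace>score_statistics</monospace> and the records are hypothetical.</p>
        <preformat>
```python
from collections import defaultdict
from statistics import mean, pstdev


def score_statistics(scored_criteria):
    """Group Likert scores by criterion and summarize each group."""
    by_criterion = defaultdict(list)
    for name, score in scored_criteria:
        by_criterion[name].append(score)
    return {
        name: {"n": len(scores), "mean": mean(scores), "std": pstdev(scores)}
        for name, scores in by_criterion.items()
    }


# Hypothetical scores collected over several runs for one policy.
records = [
    ("Transparency", 3),
    ("Transparency", 4),
    ("Data sharing", 2),
    ("Data sharing", 2),
    ("Transparency", 2),
]
stats = score_statistics(records)
for name, summary in stats.items():
    print(name, summary)
```
        </preformat>
        <p>Grouping by prompt variation or by policy instead of by criterion only requires changing the grouping key.</p>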
        <p>We use the model output as a whole to capture surface-level metrics on how concise and
readable an assessment is, like word count or the FRE. Our prompt tells the LLM to generate
for each criterion both a textual assessment and a Likert score. To check the LLM output for
consistency, we compare the sentiment of the textual assessment in step 2 of the LLM output
with the assigned scores. As a further consistency check, we evaluate the sentiment of the entire output.</p>
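        <p>For orientation, the following sketch computes a readability score from pre-counted words, sentences, and syllables using the constants of the original English Flesch Reading Ease formula. Adaptations for other languages use different constants, and which variant applies to the assessments here is not stated in this section, so the constants should be read as an assumption. Counting syllables is left out on purpose because it depends on the language.</p>
        <preformat>
```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease with the original English constants.

    Higher scores indicate text that is easier to read.
    """
    if n_words == 0 or n_sentences == 0:
        raise ValueError("text must contain at least one word and sentence")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts for one assessment text.
score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
print(round(score, 3))
```
        </preformat>
        <p>Longer sentences and longer words both lower the score, which is why dense legal vocabulary tends to produce low readability values.</p>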
        <p>We compute cosine similarities of the document embeddings of the whole outputs. By
using max-pooling, we compare both variations and policies. Whether different policies receive
different assessments indicates whether the LLM adjusts its assessment to the differences between
policies. We also determine whether prompt variations have a large impact on the ethics assessment.</p>
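        <p>One way to read the max-pooling step is sketched below: the embeddings of all outputs in a group (one prompt variation or one policy) are reduced to a single vector by taking the element-wise maximum, and groups are then compared by cosine similarity. This reading, the group names, and the toy vectors are assumptions made for illustration only.</p>
        <preformat>
```python
import math


def max_pool(vectors):
    """Element-wise maximum over a group of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_u * norm_v)


# Toy document embeddings grouped by prompt variation (hypothetical).
groups = {
    "baseline": [[0.2, 0.9, 0.1], [0.4, 0.7, 0.3]],
    "role_variation": [[0.8, 0.1, 0.5], [0.6, 0.3, 0.2]],
}
pooled = {name: max_pool(vectors) for name, vectors in groups.items()}
similarity = cosine_similarity(pooled["baseline"], pooled["role_variation"])
print(round(similarity, 3))
```
        </preformat>
        <p>Pooling keeps the strongest signal per embedding dimension within a group, so a single unusual output can influence the pooled vector more than it would under averaging.</p>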
        <p>Assessing assumptions evaluates the variations (role, user interest, ethical framework, scope
of assessment, temporal context). Again, we compute the cosine similarities of assumptions
made (step 4 of the output) to compare between prompt variations using max-pooling. This
allows us to evaluate their inter-dependencies towards the moral judgment of the LLM.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this section, we evaluate the LLM’s ethics assessments (RQ1) regarding the ethics criteria
generated by the LLM, and we investigate its robustness w.r.t. seeds/paraphrasing or the
differences between policies. We also address how variations in the context of the prompt
influence the ethics assessment (RQ2). Our evaluation follows the metrics in Table 2. Our
metrics are weakly to moderately correlated.</p>
      <p>Number of assessment criteria: In all outputs of the LLM, we find 44987 criteria, of
which 1116 are distinct. Figure 3 shows the number of distinct criteria per policy over all
respective outputs of the LLM. We find a large difference in the number of distinct criteria being
considered between policies. Comparing between policies, the number of criteria considered in
a single output ranges from 4.46 to 5.45 on average. This suggests that the LLM adjusts to the
requirements of assessing a specific policy. An unexpected finding was that variations of the
prompt do not lead to notable changes in the number of criteria being considered for assessment.
The rather small number of criteria per output raises concerns about the nuance of assessments.</p>
      <p>Criteria occurrence ratio: The criteria ’Data minimization’, ’Data security’, ’Data sharing’,
’Data sharing with third parties’, ’Consent’, ’Rights of data subjects’, ’Security’, ’Transparency’,
and ’Purpose limitation’ appear for each policy in assessments (all criteria were translated to
English). These criteria are closely related to requirements specified by the GDPR. They also
reflect common issues regarding the fairness of privacy policies (see Section 2.2). Comparing the
different prompt variations reveals that the three most occurring criteria (transparency, consent,
and data security) are the same across all 28 prompt variations. The occurrence ratio over all
outputs of the LLM can be found in Figure 4 for the 15 most frequently mentioned criteria,
which are translated to English. Comparing robustness between normal and paraphrased
prompt, we find that some criteria, like transparency, appear consistently more often with
the paraphrased prompt, whereas others, like purpose binding, appear consistently less often.
Comparing between policies, we also find considerable differences in the criteria occurrence
ratio. This is desirable as it shows that the assessment adapts to the specifics of a policy. Among
less frequent criteria, we find many synonyms of more frequent criteria or criteria that combine
two more frequent criteria into one criterion. This shows limitations in consistency.</p>
      <p>Distinctiveness of assessment criteria: The average cosine similarity between all criteria
is low (mean: 0.38). Hence, most criteria are distinct. A closer inspection of just the 50 most
frequent criteria reveals that a few criteria are highly similar and hence possibly redundant. We
go further by assessing the similarity of criteria definitions. We find that definitions for the same
criterion assigned by the LLM are, on average (for the 30 most frequent criteria), not highly
consistent (cosine similarity between 0.54 and 0.68). The similarity of definitions for distinct
criteria in an individual output is rather high (mean cosine similarity: 0.59). The range for this
between variations is slim (between 0.58 and 0.60). Hence, the consistency in the definition of
criteria and their distinctiveness are not ideal.</p>
      <p>Descriptive statistics of assessment scores: The differences in mean scoring averaged
over all criteria per output are shown in Figure 5. We find differences between policies; however,
average scores over all criteria of an output have large standard deviations (std: 0.59). The
differences in mean scoring of the three most frequent criteria between prompt variations are
shown in Figure 6. The mean score of the most frequent criteria is impacted by variations.
Particularly, the variations regarding the role impact the scoring of criteria. Even though we have
different scores for different policies and for different variations, the high standard deviations
are problematic for getting consistent results with assessments. The scoring, particularly
if combined with its reasoning, allows differentiating how problematic a policy is. Among
the 30 most frequently occurring criteria, ’Data sharing with third parties’ is rated with the
overall lowest mean of 2.34, and ’Data security’ is rated highest with an overall mean of 3.30.
Shortcomings regarding data sharing are consistent with related work (cf. Sec. 2.2).</p>
      <p>Surface-level metrics: LLM assessments are relatively short (mean: 366.64 words;
std: 64.74), and their length is, on average, barely affected by a changed seed, a paraphrased prompt,
or variations. Hence, assessments are consistent in their length. The mean
FRE score we find is 36.25. Scores vary notably (min: 15; max: 60; std: 5.59). Hence, both
readability and its consistency could be improved.</p>
      <p>Sentiment: Sentiment classifies how negative or positive associations transferred by text are
on a scale of -1 (negative) to 1 (positive). Our assessments regarding sentiment show differences
in mean sentiment between policies, as can be seen in Figure 7. We find little difference between
variations and seeds/paraphrasing. This all applies to the sentiment assessed on the overall
LLM output and sentiment for just the assessment step (step 2) of the LLM output. The overall
mean of the sentiment of whole LLM outputs is slightly negative (mean: -0.11; std: 0.14), similar
to sentiment of step 2 of the output (mean: -0.17; std: 0.20). These average values are consistent
with the aggregated mean score for all criteria, which is 2.87 and the high aggregated standard
deviation in scores, which is 0.86. This suggests that sentiment aligns with the scoring, gives
foundation to the scoring, and shows consistency within the assessment process.</p>
      <p>Document embeddings: The evaluation of document embeddings reveals high cosine
similarity between policies as well as between variations. This means that assessments all
go in a similar semantic direction. This may be a shortcoming regarding the specificity of an
assessment. Regarding variations, the specified roles have the least similarity compared with
the other variations, which are relatively homogeneous in their similarity. This shows that
assigning roles as a variation can have an impact on assessments. Normal seed to normal seed
or paraphrased to paraphrased are more similar than normal seed to paraphrased seed. Hence,
there are limitations regarding robustness when paraphrasing the prompt.</p>
      <p>Assessing assumptions: Cosine similarities of the assumptions’ embeddings between
variations can be grouped into the five assumptions that were requested:
• Role: Assigned roles in prompt variations are partially different (similarity approx. 0.25)
compared with assumed roles when they are not given. The roles "average user" and
"consumer protection officer" are relatively similar (∼ 0.6) to other assumed roles. An
exception to this is the baseline prompt and the variation considering the interests of an
average user, which are less similar (∼ 0.45).
• User Interests: Changed user interests are all assumed to be similar (∼ 0.8) to all other
variations, apart from the role of the "average user" (∼ 0.6). The latter leads to different
assumptions being made about user interests.
• Scope: Scope is consistently not highly similar across all variations (∼ 0.55).
• Temporal Span: All variations have similarities between 0.65 and 0.75 apart from the
variation setting a short-term focus (∼ 0.5).
• Ethical Framework: Assigning a specific framework in a variation leads to considerably
lower similarity with other variations (∼ 0.4) compared to similarities amongst other
variations (∼ 0.7). Noteworthy is also the role of the average user and the baseline, which
are less similar (between 0.5 and 0.6) to all other variations not specifying an ethical
framework.</p>
      <p>
        We learn that the LLM defaults to a mid- to long-term assessment and is rather indifferent
regarding user interests. It is inconsistent regarding scope. It typically employs what it calls
data ethics or the legal frameworks as ethical frameworks if not specifically instructed. If not
instructed otherwise, it assumes a role similar to that of a data protection officer, which is in line
with it abiding by principles from the GDPR [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Variations between seeds occur irrespective of
whether paraphrasing the prompt or not, and lead to cosine similarities of assumptions between
0.5 and 0.6. This means that assumptions generally vary considerably.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Addressing RQ1: An indicator of the quality of our LLM-based assessment is that our findings
are in line with related work (cf. Sec. 2.2). The criteria utilized in privacy assistants for evaluation
are mostly contained in those identified by our approach to ethics assessment. Our low overall
scores regarding data sharing reflect issues identified in prior research. In our results, the
high occurrence rate of transparency as an assessment criterion underpins the relevance of
informational fairness in privacy policies, as suggested by related work. The principles of
the GDPR can also be seen as the basis for most of the frequently occurring criteria in LLM
assessments. This in turn is also a limitation, as ethical reasoning should not, for the most
part, only be grounded in legal compliance, but go beyond. This issue may be addressed by
instructing the LLM not to focus on the legality of a policy. The scoring and the sentiment of
the explanations in the assessment, even though they originally seemed to be consistent, are
not strongly correlated. This means that the sentiment of the assessment cannot solely be taken
as a foundation explaining the scoring. We find LLM outputs to be on average relatively robust
to a changed seed, and to a lesser degree to a paraphrased prompt. However, we find that the
consistency of model outputs still needs improvement. Between individual runs, we see too
much variation in the scoring for it to be reliable. Even though we find assessment criteria and
their scoring to change between policies, rather high similarities in their document embeddings
lead to the conclusion that specificity to a policy could be improved upon. Overall, we find that
an LLM-based assessment still needs some more refinement as well as improvements in the
LLM itself to be viable. Our assessment of the quality is also just based on quantitative metrics,
which need to be complemented by qualitative assessments to identify all shortcomings reliably.</p>
      <p>Addressing RQ2: We find that variations in our prompting, which should have a strong
impact on an assessment, had less impact on LLM outputs than expected. We found that
only assigning roles is effective in changing LLM outputs. The effect of assigning roles can
be particularly seen in the scoring of criteria. We find high cosine similarities in document
embeddings across variations, which slightly drop when assigning a specific role. Apart from
assumptions, the other metrics that we assessed are not considerably impacted by variations.
Assumptions made by the model are, for the most part, highly similar and the model defaults to
a generic assignment when not specifically instructed otherwise. This means that we mostly
cannot achieve the effective shift in perspective that we set as an objective. The LLM tends to default
to a legality-centric perspective. This is valuable knowledge about an LLM’s representations
regarding online privacy. As we would want more diverse perspectives for future assessments, a
prompt may explicitly state that legality is assumed and this perspective should not be pursued.</p>
      <p>
        Limitations: We deliberately chose to prompt GPT-3.5, which belongs to the model family most
frequently used at the time of writing. An assessment could improve in quality by utilizing more capable
models like GPT-4o [65], Claude Opus [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], or Google Gemini Ultra [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Also, investigating
capable open-source models like Mixtral-8x22B [66], Llama 3 70B [67], or even models where
more about the training data is known, is promising. Our investigated privacy policies are
from web shops. Expanding the data set to a broader set of web services, particularly AI-based
services, could be valuable for further testing of ethical depth and generalizing results. The
utilized prompt leverages modern prompt engineering approaches. Highly complex prompting
mechanisms and modifications to the prompt based on our findings (e.g., instructing that no legal
aspects should appear in outputs) may improve results. LLM chaining with iterative prompt feedback
might be beneficial as well. Libraries like guidance [68] could enforce a consistent structure of
outputs. Future prompts may specify the use of plain language for more readable outputs.
      </p>
      <p>The variations that can be made to the prompt are not limited to those we utilized. We chose
a set of variables promising to strongly impact ethical assessment outcomes, as well as diverse
and interesting variations for these variables. Assigning a role had the strongest impact on
outcomes. Future research may investigate other variations.</p>
      <p>The quality evaluation and comparison between the LLM’s ethics assessments were handled
automatically, utilizing quantitative metrics. Our metrics are overall moderately correlated and
not redundant. An in-depth qualitative assessment of the LLM assessments by ethicists and
data protection experts can give more detailed insights, but is beyond the scope of this paper.
In future research, we aim to perform a thorough qualitative evaluation of our LLM ethics
assessments. Moreover, investigating the potential for providers to hack such an LLM-based
assessment with targeted variations of their policies without improving their privacy practices
is an interesting prospect we want to pursue.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Investigating the quality of moral judgment of an LLM regarding online privacy is a relevant
yet unexplored issue. We systematically utilize an LLM to assess ethics in privacy policies. We
assess how giving the LLM a different context in the LLM prompt affects outputs. Furthermore,
we evaluate the quality of LLM outputs based on a broad set of criteria.</p>
      <p>Our results show that an LLM-based assessment of privacy policies is still limited in its
consistency and specificity. However, the identified criteria are consistent with related work
on fairness and ethics in privacy policies. We also find that only a change in role effectively
changes the perspective of an assessment. Our other variations show little effect on outputs.</p>
      <p>As a next step, we will conduct in-depth qualitative evaluations with ethicists, jurists, and
data protection experts on LLM assessments to identify shortcomings and improve our ethics
assessment approach. Our findings help guide the way toward automated privacy policy ethics
assessment and, by doing so, toward fairly balancing provider and user interests by empowering
the user. The moral judgment of LLMs regarding online privacy gains relevance as generative
AI may be increasingly used in the creation of privacy policies.</p>
      <p>technologies: a meta-methodology, Journal of Information, Communication and Ethics in Society 9 (2011) 49–64.</p>
      <p>[22] UN, Universal declaration of human rights, General Assembly resolution 217 A (1948). URL: https://www.un.org/sites/un2.un.org/files/2021/03/udhr.pdf.</p>
      <p>[23] Council of Europe, European convention for the protection of human rights and fundamental freedoms (1950). URL: https://www.refworld.org/docid/3ae6b3b04.html.</p>
      <p>[24] European Union, Regulation (EU) 2022/2065 of the European Parliament and of the Council of 19 October 2022 on a Single Market For Digital Services and amending Directive 2000/31/EC (Digital Services Act), Official Journal of the European Union L277/1 (2022).</p>
      <p>[25] J. Gogoll, N. Zuber, S. Kacianka, T. Greger, A. Pretschner, J. Nida-Rümelin, Ethics in the software development process: From codes of conduct to ethical deliberation, Philosophy &amp; Technology 34 (2021) 1085–1108.</p>
      <p>[26] S. Krügel, A. Ostermaier, M. Uhl, ChatGPT’s inconsistent moral advice influences users’ judgment, Scientific Reports 13 (2023) 4569.</p>
      <p>[27] T. Hagendorff, Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods, arXiv preprint arXiv:2303.13988 (2023).</p>
      <p>[28] D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, J. Steinhardt, Aligning AI with shared human values, arXiv preprint arXiv:2008.02275 (2020).</p>
      <p>[29] R. N. Zaeem, K. S. Barber, The effect of the GDPR on privacy policies: Recent progress and future promise, ACM Trans. Manage. Inf. Syst. 12 (2020).</p>
      <p>[30] G. Malgieri, The concept of fairness in the GDPR: A linguistic and contextual interpretation, in: Proceedings of the 2020 Conference on fairness, accountability, and transparency, 2020, pp. 154–166.</p>
      <p>[31] V. Belcheva, T. Ermakova, B. Fabian, Understanding website privacy policies—a longitudinal analysis using natural language processing, Information 14 (2023) 622.</p>
      <p>[32] J. R. Reidenberg, T. Breaux, L. F. Cranor, B. French, A. Grannis, J. T. Graves, F. Liu, A. McDonald, T. B. Norton, R. Ramanath, et al., Disagreeable privacy policies: Mismatches between meaning and users’ understanding, Berkeley Tech. LJ 30 (2015) 39.</p>
      <p>[33] H. Choi, J. Park, Y. Jung, The role of privacy fatigue in online privacy behavior, Computers in Human Behavior 81 (2018) 42–51.</p>
      <p>[34] I. Pollach, A typology of communicative strategies in online privacy policies: Ethics, power and informed consent, Journal of Business Ethics 62 (2005) 221–235.</p>
      <p>[35] J. Koetsier, Viral app FaceApp now owns access to more than 150 million people’s faces and names, 2019. URL: https://www.forbes.com/sites/johnkoetsier/2019/07/17/viral-app-faceapp-now-owns-access-to-more-than-150-million-peoples-faces-and-names/, accessed on 31 January 2023.</p>
      <p>[36] W. B. Tesfay, P. Hofmann, T. Nakamura, S. Kiyomoto, J. Serna, Privacyguide: Towards an implementation of the EU GDPR on internet privacy policy evaluation, in: Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics, 2018, pp. 15–21.</p>
      <p>[37] R. Nokhbeh Zaeem, S. Anya, A. Issa, J. Nimergood, I. Rogers, V. Shah, A. Srivastava, K. S. Barber, Privacycheck v2: A tool that recaps privacy policies for you, in: Proceedings of the 29th ACM international conference on information &amp; knowledge management, 2020,</p>
      <p>systematic survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023) 1–35.</p>
      <p>[54] G. Kim, P. Baldi, S. McAleer, Language models can solve computer tasks, Advances in Neural Information Processing Systems 36 (2024).</p>
      <p>[55] J. Zamfirescu-Pereira, R. Y. Wong, B. Hartmann, Q. Yang, Why johnny can’t prompt: How non-AI experts try (and fail) to design LLM prompts, in: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023, pp. 1–21.</p>
      <p>[56] E. Lee, Control openai model behavior with seed: Step-by-step with code, https://drlee.io/control-openai-model-behavior-with-seed-step-by-step-with-code-9bba4e137a63, 2024. Accessed Feb 2024.</p>
      <p>[57] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How can we know what language models know?, Transactions of the Association for Computational Linguistics 8 (2020) 423–438.</p>
      <p>[58] W. Yuan, G. Neubig, P. Liu, Bartscore: Evaluating generated text as text generation, Advances in Neural Information Processing Systems 34 (2021) 27263–27277.</p>
      <p>[59] R. Flesch, A new readability yardstick, Journal of Applied Psychology 32 (1948) 221.</p>
      <p>[60] O. Guhr, A.-K. Schumann, F. Bahrmann, H. J. Böhme, Training a broad-coverage German sentiment classification model for dialog systems, in: Proceedings of The 12th Language Resources and Evaluation Conference, 2020, pp. 1620–1625.</p>
      <p>[61] D. D. Otter, J. Medina, J. Kalita, A survey of the usages of deep learning for natural language processing, IEEE Transactions on Neural Networks and Learning Systems 32 (2021) 604–624.</p>
      <p>[62] OpenAI, New embedding models and API updates, https://openai.com/blog/new-embedding-models-and-api-updates, 2024. Accessed on Mar 2024.</p>
      <p>[63] D. Shen, G. Wang, W. Wang, M. R. Min, Q. Su, Y. Zhang, C. Li, R. Henao, L. Carin, Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 440–450.</p>
      <p>[64] N. Baccouri, deep-translator: A python library for language translation, https://deep-translator.readthedocs.io/en/latest/README.html, 2020. Accessed Feb 2024.</p>
      <p>[65] OpenAI, GPT-4 turbo, 2024. URL: https://openai.com/index/hello-gpt-4o/, accessed May 2024.</p>
      <p>[66] Mistral AI, Mixtral 8x22b, 2024. URL: https://mistral.ai/news/mixtral-8x22b/, accessed May 2024.</p>
      <p>[67] Meta, Llama 3, 2024. URL: https://llama.meta.com/llama3/, accessed May 2024.</p>
      <p>[68] S. Lundberg, M. T. Ribeiro, H. Nori, Guidance: A guidance language for controlling large language models, Online, 2023. GitHub repository: https://github.com/guidance-ai/guidance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vigderman</surname>
          </string-name>
          , G. Turner,
          <article-title>The data big tech companies have on you</article-title>
          , security.org (
          <year>2024</year>
          ). URL: https://www.security.org/resources/data-tech-companies-have/, accessed Mar 2024.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Agarwal,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          , et al.,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>27730</fpage>
          -
          <lpage>27744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Gemini</given-names>
            <surname>Team</surname>
          </string-name>
          <string-name>
            <surname>Google</surname>
          </string-name>
          ,
          <article-title>Gemini: a family of highly capable multimodal models</article-title>
          ,
          <source>arXiv preprint arXiv:2312.11805</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Anthropic</surname>
          </string-name>
          ,
          <source>The Claude 3 model family: Opus, Sonnet, Haiku</source>
          , https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf,
          <year>2024</year>
          . Accessed on Mar
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>European Union</surname>
          </string-name>
          ,
          <article-title>Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)</article-title>
          ,
          <source>Official Journal of the European Union L119/1</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Bartelt</surname>
          </string-name>
          , E. Buchmann,
          <article-title>Transparency in privacy policies</article-title>
          ,
          <source>in: 12th International Conference on Building and Exploring Web Based Environments</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. I.</given-names>
            <surname>Becher</surname>
          </string-name>
          , U. Benoliel,
          <article-title>Law in books and law in action: The readability of privacy policies and the GDPR</article-title>
          , in:
          <source>Consumer law and economics</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>179</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Freiberger</surname>
          </string-name>
          , E. Buchmann,
          <article-title>Legally binding but unfair? Towards assessing fairness of privacy policies</article-title>
          ,
          <source>arXiv preprint arXiv:2403.08115</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>DeCew</surname>
          </string-name>
          ,
          <article-title>The scope of privacy in law and ethics</article-title>
          ,
          <source>Law and Philosophy</source>
          (
          <year>1986</year>
          )
          <fpage>145</fpage>
          -
          <lpage>173</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lundgren</surname>
          </string-name>
          ,
          <article-title>A dilemma for privacy as control</article-title>
          ,
          <source>The Journal of Ethics</source>
          <volume>24</volume>
          (
          <year>2020</year>
          )
          <fpage>165</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mainz</surname>
          </string-name>
          ,
          <article-title>An indirect argument for the access theory of privacy</article-title>
          ,
          <source>Res Publica</source>
          <volume>27</volume>
          (
          <year>2021</year>
          )
          <fpage>309</fpage>
          -
          <lpage>328</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Blaauw</surname>
          </string-name>
          ,
          <article-title>The epistemic account of privacy</article-title>
          ,
          <source>Episteme</source>
          <volume>10</volume>
          (
          <year>2013</year>
          )
          <fpage>167</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Acquisti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Brandimarte</surname>
          </string-name>
          , G. Loewenstein,
          <article-title>Privacy and human behavior in the age of information</article-title>
          ,
          <source>Science</source>
          <volume>347</volume>
          (
          <year>2015</year>
          )
          <fpage>509</fpage>
          -
          <lpage>514</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wilsdon</surname>
          </string-name>
          ,
          <article-title>Carissa Véliz, Privacy Is Power: Why and How You Should Take Back Control of Your Data</article-title>
          ,
          <source>International Data Privacy Law</source>
          <volume>12</volume>
          (
          <year>2022</year>
          )
          <fpage>255</fpage>
          -
          <lpage>257</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Elliott</surname>
          </string-name>
          , E. Soifer,
          <article-title>AI technologies, privacy, and security</article-title>
          ,
          <source>Frontiers in Artificial Intelligence</source>
          <volume>5</volume>
          (
          <year>2022</year>
          )
          <fpage>826737</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marmor</surname>
          </string-name>
          ,
          <article-title>What is the right to privacy?</article-title>
          ,
          <source>Philosophy and Public Affairs</source>
          <volume>43</volume>
          (
          <year>2015</year>
          )
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Benn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lazar</surname>
          </string-name>
          ,
          <article-title>What's wrong with automated influence</article-title>
          ,
          <source>Canadian Journal of Philosophy</source>
          <volume>52</volume>
          (
          <year>2022</year>
          )
          <fpage>125</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-S.</given-names>
            <surname>Jeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Privacy and personal data collection with information externalities</article-title>
          ,
          <source>Journal of Public Economics</source>
          <volume>173</volume>
          (
          <year>2019</year>
          )
          <fpage>113</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <article-title>Manipulation, privacy, and choice</article-title>
          ,
          <source>North Carolina Journal of Law &amp; Technology</source>
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <fpage>452</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>I.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Jennings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pullinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rogerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Duquenoy</surname>
          </string-name>
          , Ethical assessment of new pp.
          <fpage>3441</fpage>
          -
          <lpage>3444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>O.</given-names>
            <surname>Amaral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abualhaija</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Torre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabetzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Briand</surname>
          </string-name>
          ,
          <article-title>AI-enabled automation for completeness checking of privacy policies</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>48</volume>
          (
          <year>2021</year>
          )
          <fpage>4647</fpage>
          -
          <lpage>4674</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          , et al.,
          <article-title>Emergent abilities of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2206.07682</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection</article-title>
          ,
          <source>arXiv preprint arXiv:2301.07597</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Samidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Finin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pappachan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yus</surname>
          </string-name>
          ,
          <article-title>GenAIPABench: A benchmark for generative AI-based privacy assistants</article-title>
          ,
          <source>arXiv preprint arXiv:2309.05138</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>PolicyGPT: Automated analysis of privacy policies with large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2309.10238</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pałka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lippi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lagioia</surname>
          </string-name>
          , R. Liepiņa, G. Sartor,
          <article-title>No more trade-offs. GPT and fully informative privacy policies</article-title>
          ,
          <source>arXiv preprint arXiv:2402.00013</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>ChatGPT sets record fastest growing user base, analyst note</article-title>
          ,
          <source>Reuters</source>
          (
          <year>2023</year>
          ). URL: https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/, accessed Feb
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Espejel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Ettifouri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S. Y.</given-names>
            <surname>Alassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Chouham</surname>
          </string-name>
          , W. Dahhane,
          <article-title>GPT-3.5, GPT-4, or BARD? Evaluating LLMs reasoning ability in zero-shot setting and performance boosting through prompts</article-title>
          ,
          <source>Natural Language Processing Journal</source>
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>100032</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cahyawijaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wilie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lovenia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chung</surname>
          </string-name>
          , et al.,
          <article-title>A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity</article-title>
          ,
          <source>arXiv preprint arXiv:2302.04023</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Horvitz</surname>
          </string-name>
          , E. Kamar,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          , et al.,
          <article-title>Sparks of artificial general intelligence: Early experiments with GPT-4</article-title>
          ,
          <source>arXiv preprint arXiv:2303.12712</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Rae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Millican</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aslanides</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Young</surname>
          </string-name>
          , et al.,
          <article-title>Scaling language models: Methods, analysis &amp; insights from training Gopher</article-title>
          ,
          <source>arXiv preprint arXiv:2112.11446</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot reasoners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>22199</fpage>
          -
          <lpage>22213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Shafran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <article-title>Tree of thoughts: Deliberate problem solving with large language models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Muresanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Paster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Large language models are human-level prompt engineers</article-title>
          ,
          <source>arXiv preprint arXiv:2211.01910</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hayashi</surname>
          </string-name>
          , G. Neubig, Pre-train, prompt, and predict: A
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>