<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Model Leeching: An Extraction Attack Targeting LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lewis Birch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William Hackett</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Trawicki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Neeraj Suri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Garraghan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lancaster University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Model Leeching is a novel extraction attack targeting Large Language Models (LLMs), capable of distilling task-specific knowledge from a target LLM into a reduced parameter model. We demonstrate the effectiveness of our attack by extracting task capability from ChatGPT-3.5-Turbo, achieving 73% Exact Match (EM) similarity, and SQuAD EM and F1 accuracy scores of 75% and 87%, respectively, for only $50 in API cost. We further demonstrate the feasibility of adversarial attack transferability from a model extracted via Model Leeching to perform ML attack staging against a target LLM, resulting in an 11% increase in attack success rate when applied to ChatGPT-3.5-Turbo.</p>
      </abstract>
      <kwd-group>
        <kwd>Cybersecurity</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Adversarial Machine Learning</kwd>
        <kwd>Security</kwd>
        <kwd>Generative AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        • We propose the Model Leeching attack method, and demonstrate its effectiveness against
LLMs via experimentation using an extraction attack framework [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Targeting the
ChatGPT-3.5-Turbo model, we distil characteristics upon a question &amp; answering (QA)
dataset (SQuAD) into a Roberta-Large base model. Our findings demonstrate that a large
QA dataset can be successfully labelled and leveraged to create an extracted model with
73% EM similarity to ChatGPT-3.5-Turbo, and achieve SQuAD EM and F1 accuracy scores
of 75% and 87%, respectively at $50 cost.
• We study the capability to exploit an extracted model derived from Model Leeching to
perform further ML attack staging upon a production LLM. Our results show that a
language attack [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] optimized for an extracted model can be successfully transferred into
ChatGPT-3.5-Turbo with an 11% attack success increase. Our results highlight evidence
of adversarial attack transferability between user-created models and production LLMs.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Attack Description &amp; Threat Model</title>
      <sec id="sec-2-1">
        <title>2.1. Extraction Attacks</title>
        <p>
          Model extraction is the process of extracting the fundamental characteristics of a DL model [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
An extracted model is created via extracting specific characteristics (architecture, parameters,
and hyper-parameters [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]) from a target model of interest, which are then used to perform
model recreation [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Once the attacker has established an extracted model, further adversarial
attacks can be staged encompassing model inversion, membership inference, leaking privacy
data, and model intellectual property theft [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Threat Model</title>
        <p>
          State-of-the-art LLMs leveraging the transformer architecture [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] typically comprise hundreds
of billions of parameters [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Using the established taxonomy of adversaries against DL models
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], our proposed attacks assume a weak adversary capable of providing model input via
an LLM API endpoint, and a model output requiring generated text from a target LLM. The
adversary has no knowledge of the target architecture or training data used to construct the
underlying LLM parameters. Note that the threat model assumptions pertaining to potential
rate limiting, or limited access to the target API, can be relaxed due to the ability to distribute data
generation across multiple API keys.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Model Leeching Attack Design</title>
      <p>Model Leeching is a black-box adversarial attack which seeks to create an extracted copy of the
target LLM within a specific task. The attack comprises a four-phase approach as shown in
Figure 1: (1) prompt design for crafting prompts to attain task-specific LLM responses; (2) data
generation to extract model characteristics; (3) extracted model training for model
recreation; and (4) ML attack staging against a target LLM.</p>
      <sec id="sec-3-1">
        <title>3.1. Prompt Design</title>
        <p>
          Performing Model Leeching successfully requires correct prompt design. Adversaries must
design well-structured prompts that accurately define the relevancy and depth of the necessary
generated responses in order to identify task-specific knowledge of interest. Depending on
the use case, prompt design is achieved manually or through automated methods [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Model
Leeching leverages the following three-stage prompt design process:
        </p>
        <p>
          1. Knowledge Discovery. An adversary first defines the type of task knowledge to extract.
Once defined, an adversary assesses specific target LLM prompt responses to ascertain
its affinity to generate task knowledge. This assessment encompasses domain (NLP,
image, audio, etc.), response patterns, comprehension limitations, and instruction
adherence for particular knowledge domains [
          <xref ref-type="bibr" rid="ref18 ref19">18, 19, 20</xref>
          ]. Following successful completion of
this assessment, the adversary is able to devise an effective strategy to extract desired
characteristics.
        </p>
        <p>2. Construction. Subsequently, the adversary crafts a prompt template that integrates
an instruction set reflecting the strategy formulated during the knowledge discovery
stage. Template design encompasses the distinctive response structure of the target LLM,
its recognized limitations, and the task-specific knowledge identified for extraction. This
template facilitates dynamic prompt generation within the Model Leeching process.</p>
        <p>3. Validation. The adversary validates the created prompt and the response generated by the
target LLM. Validation entails ensuring the LLM responds reliably to prompts, represented
as a consistent response structure and ability to carry out given instructions. It also entails
ensuring that the target LLM is capable enough to carry out the required task, i.e. that it can
process and act upon its given instructions. This validation activity enables the Model Leeching
method to generate responses that can be used to effectively train local models with
extracted task-specific knowledge.</p>
        <p>The prompt design process follows an iterative approach, typically requiring multiple
variations and refinements to devise the most effective instructions and styles for obtaining desired
results from a specific LLM for a given task [20].</p>
        <p>[Figure 1: Model Leeching overview. Prompt Design (send, assess, modify); Data Generation
(dataset, responses); Stolen Model Training (untrained model, stolen model); Attack Staging
(adversarial examples, stolen model, LLM).]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Generation</title>
        <p>Once a suitable prompt has been designed, the adversary targets the given LLM. This
refined prompt is specified to capture the desired LLM purpose and task (e.g. Summarization, Chat,
Question &amp; Answers, etc.) to be instilled within the extracted model [21]. Given a ground
truth dataset, all examples are processed into prompts recognized as valid target LLM
inputs. Once all queries have been processed by the target LLM, we generate an adversarial
dataset combining inputs with received LLM replies, and apply automated validation
(removing API request errors and failed or erroneous prompts). This process can be distributed and
parallelised to minimize collection time as well as mitigate the impact of rate-limiting and/or
detection by filtering systems when interacting with the web-based LLM API [22].</p>
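        <p>A minimal Python sketch of this collection loop is given below. It is illustrative only: the callables <monospace>build_prompt</monospace>, <monospace>query_llm</monospace> and <monospace>is_valid</monospace>, the record layout, and the <monospace>max_attempts</monospace> parameter are assumptions introduced for the example and do not describe a specific API client or the framework used in our experiments.</p>
        <preformat><![CDATA[
```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def generate_adversarial_dataset(
    examples: Iterable[Dict[str, Any]],
    build_prompt: Callable[[Dict[str, Any]], str],
    query_llm: Callable[[str], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 1,
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Label ground-truth examples with a target LLM.

    Returns (usable, failed): usable records pair each input with the
    reply received; failed holds examples that produced no valid reply.
    All callables are hypothetical and supplied by the caller.
    """
    usable: List[Dict[str, Any]] = []
    failed: List[Dict[str, Any]] = []
    for example in examples:
        prompt = build_prompt(example)
        reply = None
        for _ in range(max_attempts):
            try:
                candidate = query_llm(prompt)
            except Exception:
                # Treat any request error as a failed attempt.
                continue
            if is_valid(candidate):
                reply = candidate
                break
        if reply is None:
            failed.append(example)
        else:
            usable.append({"input": example, "prompt": prompt, "reply": reply})
    return usable, failed
```
]]></preformat>
        <p>Passing the query function in as a parameter keeps the sketch independent of any provider; the same loop could be run over disjoint shards of the dataset to parallelise collection.</p>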
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Extracted Model Training</title>
        <p>The adversarial dataset is split into train and evaluation sets used for extracted
model training and attack success evaluation. A pre-trained or empty base model is
selected for distilling knowledge from the target LLM. This base model is then trained upon
the train set with selected hyper-parameters, producing an extracted model. Using the
evaluation set, similarity and accuracy in a given task can be evaluated and compared
using answers generated by the extracted model and the target LLM.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. ML Attack Staging</title>
        <p>
          Access to an extracted model (local to an adversary) created from a target LLM facilitates
the execution of augmented adversarial attacks. This extracted model allows an adversary to
perform unrestricted model querying to test, modify or tailor adversarial attack(s) to discover
exploits and vulnerabilities against a target LLM [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Furthermore, access to an extracted
model enables an adversary to operate in a sandbox environment to conduct adversarial attacks
prior to executing the same attack(s) against the target LLM in production (and of particular
concern, whilst minimizing the likelihood of detection by the provider).
        </p>
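        <p>The sandbox workflow described above can be sketched as a simple filter: candidate adversarial inputs are first tried against the local extracted model, and only those that succeed locally are kept for use against the target. This is a hedged illustration in Python; the <monospace>local_model</monospace> callable, the field names, and the definition of success (the local answer differing from the ground-truth answer after light normalisation) are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
from typing import Callable, Dict, Iterable, List


def _normalise(text: str) -> str:
    """Lower-case and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def stage_attack(
    candidates: Iterable[Dict[str, str]],
    local_model: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Return the candidates that cause the local model to answer wrongly.

    Each candidate is a dict with hypothetical keys "context", "question"
    and "answer" (the ground-truth answer). local_model(context, question)
    is assumed to return the extracted model's answer as a string.
    """
    staged = []
    for candidate in candidates:
        prediction = local_model(candidate["context"], candidate["question"])
        if _normalise(prediction) != _normalise(candidate["answer"]):
            # The attack succeeded locally, so keep it for the target LLM.
            staged.append(candidate)
    return staged
```
]]></preformat>
        <p>Because the filter only queries the local model, it can be run as often as needed without generating traffic to the provider.</p>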
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>To demonstrate the effectiveness of Model Leeching, we created a set of extracted models using
ChatGPT-3.5-Turbo as the target model, with Question &amp; Answers as the target task.
Task-specific prompts were designed and generated using the Stanford Question Answering 1.1
Dataset (SQuAD) containing 100k examples (85k to 15k evaluation split), each representing a context
and a set of questions with associated answers [23].</p>
      <sec id="sec-4-1">
        <title>4.1. Prompt Construction</title>
        <p>A comprehensive array of prompts, encompassing the entirety of the SQuAD dataset, was
produced. These prompts adhere to a template containing the specific SQuAD question and
context, enabling ChatGPT-3.5-Turbo to efficiently process and respond to the given task. As
seen in Figure 2, each rule instructs the target LLM to produce an output desired by the adversary,
ensuring effective capture of task-specific knowledge. The template comprises:</p>
        <p>1. Target LLM is specifically directed to provide only the precise answer to the assigned
SQuAD question, drawn solely from the provided SQuAD context. This stipulation is
crucial due to the inherent tendency of general chat-style LLMs (such as
ChatGPT-3.5-Turbo) to produce more verbose responses than necessary. In the scope of SQuAD score
assessment, only the exact answer is pertinent, negating the need for any additional
content.</p>
        <p>2. By including the sentence where the answer occurred, the LLM is required to demonstrate
a degree of contextual comprehension beyond simple fact extraction, for valid data
generation that contains the correct task knowledge. This requirement ensures that the
model is not limited to identifying keywords, but understands the broader semantic
structure of the text. In the case of assessing model performance on ChatGPT-3.5-Turbo, the index
at which an answer is found within the context is required.</p>
        <p>3. Use of a standardized JSON format for responses facilitates efficient and uniform data
handling. The keys answer and sentence provide a clear and concise structure, making
the model output easier to process and compare algorithmically and manually.</p>
        <p>4. Ability to respond with 'UNSURE' provides a safeguard for quality control of model
response. By acknowledging its own uncertainty, the LLM avoids disseminating potentially
incorrect or misleading information, and assists in parsing prompts that it was unable to
complete.</p>
        <p>[Figure 2: prompt template]</p>
        <p>Given this context: "{{SQuAD Context}}"
Can you answer this question briefly: "{{SQuAD Question}}".</p>
        <p>Rules:
1). Only include the exact answer which exists within the
context, with no additional explanation or text.
2). Additionally include the sentence where the answer
occurred.
3). Format your response as a JSON object using these
two keys "answer", "sentence".
4). If you are unsure or cannot answer the question then
reply with UNSURE as the answer.</p>
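        <p>The template of Figure 2 and the handling implied by its four rules can be sketched in Python as follows. The template wording is taken from Figure 2; the function names, the returned record layout, and the choice to discard replies whose answer is not an exact span of the context are assumptions for illustration rather than a description of our implementation.</p>
        <preformat><![CDATA[
```python
import json
from typing import Dict, Optional, Union

PROMPT_TEMPLATE = (
    'Given this context: "{context}"\n'
    'Can you answer this question briefly: "{question}".\n'
    "Rules:\n"
    "1). Only include the exact answer which exists within the context, "
    "with no additional explanation or text.\n"
    "2). Additionally include the sentence where the answer occurred.\n"
    "3). Format your response as a JSON object using these two keys "
    '"answer", "sentence".\n'
    "4). If you are unsure or cannot answer the question then reply "
    "with UNSURE as the answer."
)


def build_prompt(context: str, question: str) -> str:
    """Fill the template with one SQuAD context/question pair."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


def parse_reply(raw: str, context: str) -> Optional[Dict[str, Union[str, int]]]:
    """Return a cleaned record, or None if the reply is unusable."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return None  # rule 3 violated: reply is not JSON
    if not isinstance(reply, dict):
        return None
    answer = reply.get("answer")
    sentence = reply.get("sentence")
    if not isinstance(answer, str) or not isinstance(sentence, str):
        return None
    if answer.strip().upper() == "UNSURE":
        return None  # rule 4: the model declined to answer
    start = context.find(answer)
    if start == -1:
        return None  # rule 1 violated: answer is not a span of the context
    return {"answer": answer, "sentence": sentence, "answer_start": start}
```
]]></preformat>
        <p>Recording the character offset of the answer mirrors the index requirement noted under rule 2.</p>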
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Base Architectures</title>
        <p>To evaluate the effectiveness of Model Leeching, we selected three different base model
architectures and several variants (with model parameter sizes ranging from 14 to 123
million) to create an extracted model of our target LLM. These six models, spanning the
Bert [24], Albert [25], and Roberta [26] architectures, were selected due to their parameter size and respective
performance upon our selected task [26]. The intention of selecting these architectures as
candidate extracted models is to evaluate whether: 1) more sophisticated models (parameters,
architecture) are more effective at learning target LLM characteristics; and 2) low parameter
models (i.e. 100x smaller vs. ChatGPT-3.5-Turbo) can learn sufficient characteristics from a
target LLM, while achieving comparable performance on a specific task. Using these candidate model
architectures, we train two sets of models for the purposes of evaluation: 1) extracted models,
trained upon the generated adversarial dataset; and 2) baseline models, trained directly upon
the ground-truth SQuAD dataset for performance comparison.</p>
        <p>Article: Amazon Rainforest
Context: “In 2005, parts of the Amazon basin experienced the worst
drought in one hundred years, and there were indications that 2006
could have been a second successive year of drought. A July 23, 2006
article in the UK newspaper The Independent reported Woods Hole
Research Center results showing that the forest in its present form
could survive only three years of drought. Scientists at the Brazilian
National Institute of Amazonian Research argue in the article that this
drought response, coupled with the effects of deforestation on regional
climate, are pushing the rainforest towards a "tipping point" where it
would irreversibly start to die. It concludes that the forest is on the
brink of being turned into savanna or desert, with catastrophic
consequences for the world's climate. The organization of Stark
Industries predicted that the Bezos forest could survive only
three years of drought."
Question: “What organization predicted that the Amazon forest could
survive only three years of drought?”
Actual Answer: Woods Hole Research Center
ChatGPT Answer: Stark Industries</p>
        <p>Extracted Model Answer: Stark Industries</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. ML Attack Staging</title>
        <p>
          We created and deployed an adversarial attack derived from AddSent [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] that generates an
adversarial context by adding a non-factual yet semantically and syntactically correct sentence
to the original context from a SQuAD entry (Figure 3). The goal of this attack is to cause a
QA model to incorrectly answer a question when given an adversarial context. We further
modified this attack to generate a larger variety of adversarial contexts, selectively chosen based
on their success upon our extracted model, which are then sent to the target LLM for improved
misclassification likelihood.
        </p>
      </sec>
      <sec>
        <title>4.4. Model Leeching Scenario</title>
        <p>
          We demonstrate the effectiveness of Model Leeching by targeting ChatGPT-3.5-Turbo with
a pre-trained Roberta-Large base architecture [26]. Using SQuAD as described in 4.1, we
generate a new labelled adversarial dataset through automated prompt generation querying
ChatGPT-3.5-Turbo, which is used to train the base architecture to create an extracted model.
We evaluate attack performance by comparing the extracted model performance to a baseline
model directly trained on SQuAD with ground truth answers. To explore the feasibility of
transferring adversarial vulnerabilities across models, we apply the AddSent attack [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] upon the extracted
model, generating adversarial perturbations that can be further staged upon the target LLM. We
leverage three metrics for evaluation: Exact Match (EM) and F1 Score, used to measure the
performance/similarity of our extracted model and ChatGPT-3.5-Turbo [23], and attack success
rate for further attack staging, representing successful adversarial prompts.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Data Generation</title>
        <p>From 100k examples of contexts, questions and answers within SQuAD, 83,335 total usable
examples were collected, with 16,665 failing either from API request errors or erroneous replies,
corresponding to a 16.66% error rate when labelling through ChatGPT-3.5-Turbo. From these
83,335 examples, 76,130 can be used for further extracted model training, and 7,205
for evaluation. Query time was 48 hours and cost $50 to execute API requests.</p>
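        <p>The reported counts can be cross-checked with simple arithmetic. The short Python snippet below only restates figures given in this section; the evaluation share and the cost per usable label are derived from them here and are not figures reported elsewhere in the paper.</p>
        <preformat><![CDATA[
```python
total_examples = 100_000   # SQuAD examples submitted for labelling
usable = 83_335            # examples with a usable ChatGPT-3.5-Turbo reply
failed = 16_665            # API request errors or erroneous replies
train_size = 76_130        # used for extracted model training
eval_size = 7_205          # held out for evaluation
api_cost_usd = 50.0        # total API cost reported above

# The partitions reported in the text should add up exactly.
assert usable + failed == total_examples
assert train_size + eval_size == usable

error_rate = failed / total_examples            # reported as a 16.66% error rate
eval_fraction = eval_size / usable              # derived: share held out
cost_per_usable_label = api_cost_usd / usable   # derived: USD per usable label

print(f"error rate: {error_rate:.3%}")
print(f"evaluation share: {eval_fraction:.2%}")
print(f"cost per usable label: ${cost_per_usable_label:.4f}")
```
]]></preformat>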
        <p>[Figure: bar charts of EM Score and F1 Score for Baseline and Extracted models
(BertBase, BertLarge, AlbertBase, AlbertLarge, RobertaBase, RobertaLarge), with ChatGPT
shown for reference; y-axis: SQuAD accuracy, 0.0 to 1.0.]</p>
        <p>5.2. Extraction Similarity</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.3. Task Performance</title>
        <p>Extracted model task performance was evaluated by comparing the SQuAD EM and F1 scores to
baseline models and ChatGPT-3.5-Turbo. Figure 5 shows that extracted models exhibit similar
SQuAD performance to their respective baselines in both EM and F1 scores.
Evaluating our extracted models against ChatGPT-3.5-Turbo, we observed that Roberta Large
achieved the performance closest to ChatGPT-3.5-Turbo, with an EM/F1 score of 0.75/0.87
compared to 0.74/0.87 for ChatGPT-3.5-Turbo. Extracted model
performance from ChatGPT-3.5-Turbo is sufficiently comparable to
state-of-the-art literature on QA tasks, where Roberta Large, with the hyperparameters used, is more
performant than the other architectures [26].</p>
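        <p>For reference, the EM and F1 scores discussed here follow the SQuAD conventions [23]. The Python sketch below assumes the answer normalisation used by the public SQuAD v1.1 evaluation script (lower-casing, stripping punctuation and the articles a/an/the, collapsing whitespace); the function names are illustrative and the exact scoring code used in our experiments is not reproduced here.</p>
        <preformat><![CDATA[
```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lower-case, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalised strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a single reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```
]]></preformat>
        <p>Scoring the extracted model against ChatGPT-3.5-Turbo answers instead of ground-truth answers gives a similarity measure of the kind reported in this paper, while scoring against ground truth gives task accuracy.</p>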
      </sec>
      <sec id="sec-5-3">
        <title>5.4. ML Attack Staging</title>
        <p>Roberta Large was used to evaluate the attack success of AddSent upon the extracted model
and ChatGPT-3.5-Turbo given its high SQuAD accuracy and similarity. AddSent exhibited an
attack success of 0.28 and 0.26 upon the extracted model and ChatGPT-3.5-Turbo, respectively.</p>
        <p>[Figure 6: AddSent attack success for the Baseline, Stolen Model, and ChatGPT, with an
annotated increase of +11.01%.]</p>
        <p>Leveraging access to our extracted model, we selected and sent the best performing 7,205
adversarial examples to ChatGPT-3.5-Turbo. Our results indicate that adversarial examples
augmented by AddSent increased attack success by 26% for the extracted model, and 11% for
ChatGPT-3.5-Turbo (Figure 6). Attack effectiveness is reduced across models due to
ChatGPT-3.5-Turbo being 100x larger in parameter size than local models, and leveraging advanced
training methods such as reinforcement learning from human feedback, not used on our
local models. ChatGPT-3.5-Turbo is therefore more task capable and less likely to be evaded
by adversarial prompts compared to a local model. However, despite this increased adversarial
robustness, our results highlight that attack transferability exists between an extracted model and
its target, demonstrating the feasibility of leveraging distilled knowledge to further stage and
subsequently launch improved adversarial attacks upon a production LLM.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <sec id="sec-6-1">
        <title>6.1. Dataset Labelling</title>
        <p>Using the SQuAD dataset containing 100k examples, we successfully labelled 83,335 using
ChatGPT-3.5-Turbo (see Section 5.1). In total, this process cost $50 and required 48 hours to
complete. Using labelling services such as Amazon SageMaker Data Labeling
[28], the estimated cost of labelling would be $0.036 per example of data, totalling $3,600;
labelling datasets with generative LLMs therefore represents a significant reduction in cost. We
additionally note that the success of labelling datasets can be increased by 1) further prompt
engineering and optimization to package multiple SQuAD examples into one efficient query,
enabling a reduction in query cost and time; and 2) re-sending of failed SQuAD examples to
achieve a higher number of successfully labelled examples.</p>
      </sec>
      <sec>
        <title>6.2. Extraction Similarity</title>
        <p>Extracted models derived from Model Leeching demonstrate the ability to effectively learn
the characteristics of the target model. As highlighted in Section 5.2, the noticeable deviations
between our extracted models and their baseline equivalents in EM/F1 similarity to
the target demonstrate that extracted models contain knowledge more similar to that of the target
than baseline models do. The extracted model responses closely align with those of
ChatGPT-3.5-Turbo, exhibiting similar success and error rates in how they semantically and
syntactically answer questions. This finding underscores the capacity of our model to replicate
the behaviour of the target, especially in the given task.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.3. Distilled Knowledge Capability</title>
        <p>Our findings showcase the possibility of not only extracting knowledge from an LLM, but also
transferring this knowledge effectively to a model with significantly fewer parameters.
ChatGPT-3.5-Turbo comprises 175 billion parameters, whilst our local models are 100x smaller (see Section
5.3). These smaller local models, when trained with the extracted dataset, demonstrated the
ability to perform the given task effectively. Comparing our extracted model performance upon
SQuAD to ChatGPT-3.5-Turbo, we observed at worst a 13.2%/12.04% EM/F1 score difference,
with our best-performing extracted model, Roberta Large, achieving near-identical SQuAD scores to
ChatGPT-3.5-Turbo.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.4. ML Attack Staging</title>
        <p>As demonstrated in Section 5.4, it is feasible to utilize an extracted model within an adversary's
local environment to conduct further adversarial attack staging. Unfettered query
access to this extracted model facilitates the enhancement of attack success. The potency of
the AddSent attack on the model extracted by Model Leeching was increased by 26%, which
consequently led to an 11% increase when launched against ChatGPT-3.5-Turbo. This highlights
the vulnerability of a target LLM to subsequent machine learning attacks once adversaries
acquire an extracted model. By having access to this 'sandbox' model, adversaries can refine
or innovate their attack strategies. Consequently, LLMs deployed and served over publicly
accessible APIs are at significant risk of further attack staging.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Further Work</title>
      <sec>
        <title>7.1. Empirical Analysis of Additional Production LLMs</title>
        <p>Further work includes conducting Model Leeching against a larger array of LLMs such as BARD,
LLaMA and available variations of GPT models from OpenAI, exploring
how these models respond to Model Leeching and their vulnerability to follow-up attacks. Such a study
would demonstrate the possibility of generating ensemble models that inherit characteristics from
multiple target LLMs. Optimizing a local model using task-specific performance
from the best-performing target would aim to maximise the local model capability.</p>
      </sec>
      <sec>
        <title>7.2. Extraction By Proxy / Degrees of Separation</title>
        <p>
          Multiple open-source versions of popular LLMs have been produced by the ML community.
This includes examples such as GPT4All [29] and Llama [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] that can be deployed on
consumer-grade devices. These models typically leverage training sets, architectures and prompts used
to develop the LLM they are aiming to extract and replicate. If these models share significant
characteristics with the original LLM, it may be feasible for an adversary to conduct Model
Leeching and then deploy an improved attack against a target LLM it did not interact with before
attack deployment.
        </p>
      </sec>
      <sec id="sec-7-1">
        <title>7.3. LLM Defenses</title>
        <p>There has been limited work to defend against attacks on LLMs. Previous research has explored
defenses against model extraction attacks for smaller NLP models, utilizing techniques
such as Membership Classification [30] and Model Watermarking [31]. However, given the
rapid development of new state-of-the-art adversarial attacks against LLMs, it is important that
the effectiveness of defense techniques currently proposed within the literature is evaluated with
newer LLMs. Further work would explore whether the characteristics of applied defense techniques are captured
within knowledge extracted from the target model, and whether they are further detectable within a distilled
extracted model.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>In this paper we have proposed a new state-of-the-art extraction attack, Model Leeching, as a
cost-effective means to generate an extracted model with shared characteristics to a target
LLM. Furthermore, we demonstrated that it is feasible to conduct adversarial attack staging
against a production LLM via interrogating an extracted model derived from a target LLM
within a sandbox environment. Our findings suggest that extracted models can be derived with
a high similarity and task accuracy with low query costs, and constitute the basis of attack
transferability to execute further successful adversarial attacks utilizing data leaked from the
target LLM.</p>
      <p>https://aclanthology.org/2022.findings-acl.50. doi:10.18653/v1/2022.findings-acl.50.</p>
      <p>[20] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith,
D. C. Schmidt, A prompt pattern catalog to enhance prompt engineering with chatgpt,
2023. arXiv:2302.11382.</p>
      <p>[21] X. Wang, J. Li, X. Kuang, Y. an Tan, J. Li, The security of machine learning in an adversarial
setting: A survey, Journal of Parallel and Distributed Computing 130 (2019) 12–23. URL:
https://www.sciencedirect.com/science/article/pii/S0743731518309183. doi:10.1016/j.jpdc.2019.03.003.</p>
      <p>[22] E. Crothers, N. Japkowicz, H. Viktor, Machine generated text: A comprehensive survey of
threat models and detection methods, 2023. arXiv:2210.07321.</p>
      <p>[23] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions for machine
comprehension of text, 2016. URL: https://arxiv.org/abs/1606.05250. arXiv:1606.05250.</p>
      <p>[24] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
transformers for language understanding, 2019. arXiv:1810.04805.</p>
      <p>[25] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: A lite bert for
self-supervised learning of language representations, 2020. arXiv:1909.11942.</p>
      <p>[26] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, 2019. URL:
https://arxiv.org/abs/1907.11692. arXiv:1907.11692.</p>
      <p>[27] D. Oliynyk, R. Mayer, A. Rauber, I know what you trained last summer: A survey on
stealing machine learning models and defences, ACM Comput. Surv. 55 (2023). URL:
https://doi.org/10.1145/3595292. doi:10.1145/3595292.</p>
      <p>[28] AWS, Sagemaker data labeling pricing, https://aws.amazon.com/sagemaker/data-labeling/pricing/,
2023. Accessed: 2023-06-30.</p>
      <p>[29] OpenAI, gpt4all.io, 2023. URL: https://gpt4all.io/index.html, accessed: 8th February 2023.</p>
      <p>[30] R. Shokri, M. Stronati, C. Song, V. Shmatikov, Membership inference attacks against
machine learning models, 2017. arXiv:1610.05820.</p>
      <p>[31] S. Szyller, B. G. Atli, S. Marchal, N. Asokan, Dawn: Dynamic adversarial watermarking of
neural networks, 2021. arXiv:1906.00830.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>H.</given-names> <surname>Touvron</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Lavril</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Izacard</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Martinet</surname></string-name>,
          <string-name><given-names>M.-A.</given-names> <surname>Lachaux</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Lacroix</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Rozière</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Goyal</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Hambro</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Azhar</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rodriguez</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Joulin</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Grave</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Lample</surname></string-name>,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <year>2023</year>
          . arXiv:2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] OpenAI, ChatGPT, OpenAI Blog,
          <year>2023</year>
          . URL: https://openai.com/blog/chatgpt, accessed: 2023-02-08.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] G. AI, About Bard, Google AI: Publications,
          <year>2023</year>
          . URL: https://ai.google/static/documents/google-about-bard.pdf, accessed: 8th February 2023.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Floridi</surname>
          </string-name>
          ,
          <article-title>AI as agency without intelligence: on ChatGPT, large language models, and other generative models</article-title>
          ,
          <source>Philosophy &amp; Technology</source>
          <volume>36</volume>
          (
          <year>2023</year>
          ) 15. URL: https://doi.org/10.1007/s13347-023-00621-y. doi:10.1007/s13347-023-00621-y.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tramer</surname>
          </string-name>
          , E. Wallace,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jagielski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          , D. Song,
          <string-name>
            <given-names>U.</given-names>
            <surname>Erlingsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oprea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>Extracting training data from large language models</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2012.07805. arXiv:2012.07805.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Kolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fredrikson</surname>
          </string-name>
          ,
          <article-title>Universal and transferable adversarial attacks on aligned language models</article-title>
          ,
          <year>2023</year>
          . arXiv:2307.15043.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Papernot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <article-title>Thieves on Sesame Street! Model extraction of BERT-based APIs</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1910.12366. arXiv:1910.12366.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kordi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>Self-instruct: Aligning language model with self generated instructions</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2212.10560. arXiv:2212.10560.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hackett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Trawicki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Suri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garraghan</surname>
          </string-name>
          ,
          <article-title>Pinch: An adversarial extraction attack framework for deep learning models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2209.06300. arXiv:2209.06300.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Adversarial examples for evaluating reading comprehension systems</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1707.07328. arXiv:1707.07328.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Tramèr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Juels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ristenpart</surname>
          </string-name>
          ,
          <article-title>Stealing machine learning models via prediction APIs</article-title>
          ,
          <source>in: 25th USENIX Security Symposium (USENIX Security 16)</source>
          , USENIX Association, Austin, TX,
          <year>2016</year>
          , pp.
          <fpage>601</fpage>
          -
          <lpage>618</lpage>
          . URL: https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          , C. Liu,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sherwood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>DeepSniffer: A DNN model extraction framework based on learning architectural hints</article-title>
          ,
          <source>in: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems</source>
          , ASPLOS '20, Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>399</lpage>
          . URL: https://doi.org/10.1145/3373376.3378460. doi:10.1145/3373376.3378460.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>MITRE</surname>
          </string-name>
          ,
          <source>MITRE ATLAS Adversarial Attack Knowledge Base</source>
          ,
          <year>2023</year>
          . URL: https://atlas.mitre.org/, [Online; accessed 02-May-2023].
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chattopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mukhopadhyay</surname>
          </string-name>
          ,
          <article-title>Adversarial attacks and defences: A survey</article-title>
          ,
          <year>2018</year>
          . arXiv:1810.00069.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1706.03762. arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          , P. Liu,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>A survey of large language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2303.18223. arXiv:2303.18223.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Papernot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McDaniel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fredrikson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. B.</given-names>
            <surname>Celik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Swami</surname>
          </string-name>
          ,
          <article-title>The limitations of deep learning in adversarial settings</article-title>
          ,
          <year>2016</year>
          , pp.
          <fpage>372</fpage>
          -
          <lpage>387</lpage>
          . doi:10.1109/EuroSP.2016.36.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Efrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <article-title>The turking test: Can language models understand instructions?</article-title>
          ,
          <year>2020</year>
          . arXiv:2010.11982.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>Reframing instructional prompts to GPTk's language</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL 2022</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>589</fpage>
          -
          <lpage>612</lpage>
          . URL:
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>