<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>arXiv</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.48550/arXiv.2406.11036</article-id>
      <title-group>
<article-title>Investigation of vulnerabilities in large language models using an automated testing system</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>CPITS-II 2024: Workshop on Cybersecurity Providing in Information and Telecommunication Systems II</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>12 Stepana Bandery str., 79013 Lviv</addr-line>
          ,
          <institution>Ukraine 2 Opole University of Technology</institution>
          ,
          <addr-line>76 Proszkowska str., 45-758 Opole</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2024</year>
      </pub-date>
      <fpage>263</fpage>
      <lpage>273</lpage>
      <abstract>
        <p>With the growing use of large language models across various industries, there is an urgent need to ensure their security. This paper focuses on the development of an automated vulnerability testing system for large language models based on the Garak utility. The effectiveness of several well-known models has been investigated. The analysis shows that automated systems can significantly enhance the security of large language models, reducing the risks associated with the exploitation of their vulnerabilities. Special attention is given to algorithms that detect and prevent attacks aimed at manipulating and abusing large language models. Current trends in cybersecurity are discussed, particularly the challenges related to protecting large language models. The primary goal of this research is to identify and develop technological solutions aimed at improving the security, resilience, and efficiency of language models through the use of modern automated systems.</p>
      </abstract>
      <kwd-group>
<kwd>large language model</kwd>
        <kwd>LLM</kwd>
        <kwd>language model vulnerability</kwd>
        <kwd>automated testing system</kwd>
        <kwd>Garak</kwd>
        <kwd>prompt injection</kwd>
        <kwd>Goodside</kwd>
        <kwd>Glitch tokens</kwd>
        <kwd>Toxicity prompts</kwd>
        <kwd>DAN</kwd>
<kwd>ChatGPT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In modern information society, large language models (LLMs)
have become key tools across many fields, from natural
language processing to automatic translation and content
generation. Every day, the number of services based on LLMs
increases, making them an integral part of our lives. People
are increasingly relying on the information provided by these
services and making decisions based on it.</p>
      <p>However, the growing use and trust in large language
model services come with potential risks due to
vulnerabilities in the LLMs themselves. This can lead to
serious consequences, including abuse, manipulation, and
privacy breaches. The main issues that may arise from using
such models include:</p>
      <p>Hallucinations, where the model generates text that
does not correspond to real data or contains false
information.</p>
      <p>Leakage of sensitive data, caused by the inclusion of
confidential information in the dataset during the
model’s training phase.</p>
      <p>Failures and prompt injections, i.e., attacks aimed at
distorting or compromising the model through
specially crafted queries and instructions.</p>
      <p>Disinformation—the use of language models for the
mass generation of propaganda, manipulated, or false
content.</p>
      <p>Toxicity occurs when the model starts generating
offensive, biased content or otherwise harmful
material.</p>
      <p>
        An analysis of scientific sources reveals a certain
imbalance in research dedicated to LLMs in the context of
security. The majority of studies focus on using LLMs to
strengthen security measures and test other software
products [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example, LLMs are used to detect
vulnerabilities in code [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], automate malware detection
processes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and develop tools for protecting information
systems [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Such studies demonstrate the significant
potential of LLMs in the field of cybersecurity. However,
there is a lack of attention to testing and analyzing the
security of the LLMs themselves.
      </p>
      <p>
        For example, in works related to the application of
LLMs, the focus is often on the models’ ability to analyze
large amounts of data to detect fraud [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. At the same time,
few studies are devoted to testing the resilience of LLMs
against external attacks, such as integrity attacks on the
data used to train the model or the injection of malicious
prompts through the manipulation of input data.
Based on the current literature, there appears to be a lack of
systematic approaches specifically designed for testing the
vulnerabilities of LLMs. Unlike “traditional” software
testing [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], which has standardized methodologies and
tools for vulnerability detection [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], the security
assessment of LLMs is only just beginning to develop.
Moreover, the complexity and rapid update cycles of LLMs
create an urgent need to develop specialized tools for
automating the process of testing their vulnerabilities. Such
an automated system could not only accelerate the
development process but also significantly enhance the
security of these models, and thus the reliability and
protection of information technologies that use LLMs.
      </p>
      <p>The goal of this paper is to explore and analyze existing
approaches to identifying vulnerabilities in LLMs, develop
an architecture for an automated vulnerability testing
system, and create a set of prompts to perform practical
testing of LLMs to assess their security.</p>
    </sec>
    <sec id="sec-2a">
      <title>2. Analysis of recent research</title>
      <sec id="sec-1-1">
        <title>2.1. A retrospective view on the development of LLMs</title>
        <p>
          Large language models represent an innovative and
powerful type of artificial intelligence capable of analyzing,
processing, and generating natural language. LLMs are built
on deep neural networks and trained on massive volumes of
textual data. These models can be applied to a wide range of
tasks, such as machine translation, text generation, question
answering, automatic summarization, and much more [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>In a relatively short period, language models have
undergone impressive development:</p>
        <p>
          The statistical N-gram method counts the frequency
of phrases in a text to predict the next word [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          Through recurrent neural networks (RNNs) and their
improvements in the form of LSTM (Long
Short-Term Memory) and GRU (Gated Recurrent Unit),
which enabled the modeling of complex and
long-term dependencies in language [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
The breakthrough transformer model with a
self-attention mechanism allows for accelerated sentence
processing and focusing on the most important
words [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>
          Many modern language models, such as GPT
(Generative Pre-trained Transformer) and BERT
(Bidirectional Encoder Representations from Transformers),
are based on transformers. These models may have billions
of parameters, enabling them to achieve impressive results
in various language tasks [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ].
        </p>
        <p>LLMs (Large Language Models) use their architecture
and vast data resources to learn contextual relationships
between words in a way that enables better understanding
and generation of language. Additionally, by using the
technique of transfer learning, such large models can be
quickly adapted to perform new specific tasks with a
minimal amount of data.</p>
        <p>
          In practice, this means that these models can be trained
on large general data sets and then fine-tuned for more
specialized tasks, such as sentiment analysis, named entity
recognition, or generating answers to questions related to
specific areas of knowledge [
          <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20">17–20</xref>
          ].
        </p>
        <p>
          Some well-known companies have also developed their
language models tailored to specific tasks, such as NVIDIA’s
Megatron, which is optimized for large-scale operations and
designed to handle gigantic datasets. Another example is
Google’s T5 (Text-To-Text Transfer Transformer) model,
which employs a unified approach to various language tasks
by transforming them into text-to-text problems [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
        <p>
LLMs can also be used to protect the input and output
data exchanged during interactions with a model. This
allows for enhancing the security of an LLM-based service by
detecting unsafe content in the model’s input or output. An
example of such a model is the Llama Guard model [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-1a">
        <title>2.2. Analysis of large language model vulnerabilities</title>
        <p>
          The growing use of LLMs in various areas, such as machine
translation [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], text generation, and text analysis [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ],
opens new opportunities but also creates significant
security and privacy challenges. The analysis of
vulnerabilities in these models has become an integral part
of their development and usage. One of the key resources
for identifying and classifying such vulnerabilities is
OWASP (Open Web Application Security Project).
        </p>
        <p>
          OWASP offers the “Top 10 for Large Language Model
Applications” [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] project, which lists the most common
and critical vulnerabilities affecting LLMs. This project aims
to raise awareness and provide recommendations for the
secure use of LLMs. The vulnerabilities listed in the OWASP
Top 10 cover various aspects, specifically [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]:
        </p>
        <p>Prompt Injection: Attackers can manipulate large
language models by adding or modifying information
in the request to the model, causing the model to
execute the attacker’s intent.</p>
        <p>Insecure Output Handling: This vulnerability
concerns the insufficient verification and handling of
the output data generated by LLMs before it is passed
on to other components and systems.</p>
        <p>Training Data Poisoning: This vulnerability focuses
on manipulating the data or fine-tuning process of
the model to introduce vulnerabilities, backdoors, or
biases that may compromise the security,
performance, or ethical behavior of the model.</p>
        <p>Model Denial of Service: Occurs when an attacker
interacts with an LLM in such a way that consumes
an excessive amount of resources, leading to reduced
quality of service for both the attacker and other
users, as well as potentially high resource costs for
the LLM.</p>
        <p>Supply Chain Vulnerabilities: LLM supply chain
vulnerabilities can compromise training data,
machine learning models, and deployment platforms,
which can lead to biased results, security breaches, or
general system failures. These vulnerabilities can
arise from outdated software, susceptibility of
pretrained models, or malicious training data.</p>
        <p>Sensitive Information Disclosure: LLMs may
unintentionally reveal sensitive information,
proprietary algorithms, or confidential data, leading
to unauthorized access, theft of intellectual property,
and breaches of data privacy.</p>
        <p>Insecure Plugin Design: Plugins may be vulnerable to
malicious prompts, leading to harmful consequences
such as data theft, remote code execution, and
privilege escalation due to insufficient access control
and improper validation of input data.</p>
        <p>Excessive Agency: This vulnerability is caused by
excessive functionality, permissions, or autonomy
granted to the LLM-based systems.</p>
        <p>Overreliance: Overdependence on LLMs can lead to
serious consequences, such as disinformation, legal
issues, and security vulnerabilities. This typically
occurs when LLMs are trusted to make critical
decisions or create content without proper oversight
or validation.</p>
        <p>Model Theft: Model theft involves unauthorized
access to and theft of LLMs, creating risks of financial
loss, reputational damage, and unauthorized access to
confidential data.</p>
      </sec>
      <sec id="sec-1-1b">
        <title>2.3. Overview of known tools for automated testing of LLMs</title>
        <p>Testing software products, including LLMs, is an integral
part of their development and deployment. LLMs consist of
billions of parameters and process vast amounts of data.
Therefore, manually testing such models is impractical due
to the labor intensity and diversity of possible use cases.
Automating this process enables quick and efficient testing
of the model on different datasets and under various
conditions. Automated testing is especially critical for
identifying vulnerabilities in LLMs.</p>
        <p>Currently, several tools are available for automating the
vulnerability testing process in language models, with the
most notable being LLM Guard, DecodingTrust, and Garak.
Each of these platforms has its unique features, advantages,
and limitations. From the perspective of developers and
users of LLM-based services, the following characteristics of
an automated vulnerability testing system are important:</p>
        <p>Universality, meaning the ability to test different LLMs.</p>
        <p>Real-time usage as a security monitor.</p>
        <p>Open architecture, allowing the addition of new
modules.</p>
        <p>Extensibility, enabling the addition of new testing
methods and test sets to detect new types of
vulnerabilities.</p>
        <p>Flexible settings, enabling the system to adapt to
various scenarios and data volumes.</p>
<p>Speed, to minimize the time required to conduct tests.</p>
        <p>Reporting, the ability to generate clear reports on test
results that facilitate easy identification and
mitigation of vulnerabilities.</p>
        <p>
          In this research, the Garak utility, which is available as
an open-source tool, was used as the foundation for building
an automated LLM vulnerability testing system. One of the
advantages of this utility is that users can create custom
tests and add them to the pipeline for further research [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3m">
      <title>3. Materials and methods of research</title>
      <sec id="sec-3m-1">
        <title>3.1. Architecture of the automated vulnerability testing system</title>
        <p>The structure of the developed vulnerability testing system
based on the Garak utility is shown in Fig. 1. The system
allows for the use of a vast number of tests to examine the
queries of a large language model, simulating attacks.
Additionally, a set of detectors is employed on the model’s
outputs to monitor whether the model is vulnerable to these
attacks.</p>
        <p>The Garak utility is run from the command
line/terminal and works best with operating systems like
Linux and Mac OS. To perform testing, the user must enter
a command with predefined parameters, such as:</p>
        <p>Model_type—the platform from which the trained
model will be sourced.</p>
        <p>Model_name—the name of the model.</p>
        <p>Probes—the name of the test or a set of tests
(comma-separated).</p>
        <p>Below is an example of a command to run the Garak tool:</p>
        <preformat>python -m garak --model_type huggingface --model_name gpt2-medium --probes promptinject</preformat>
        <p>After entering the command, the utility initiates the
execution of the corresponding test, first determining the
type of test specified in the command. In this example, the
model is tested for vulnerability to prompt injections, so
only one test is used.</p>
<p>Next, the utility identifies the appropriate detectors for
the selected tests. In the context of using the Garak utility,
a detector is a software tool that analyzes the input and
output data of the models to detect potential vulnerabilities
according to the test specified in the command.</p>
        <p>In the following stage, a generator is launched. In the
provided example, the Hugging Face platform is used, so
Garak runs the appropriate generators for this platform. The
generator assists in working with machine learning models,
particularly in data generation, and supports various
platform components, such as pipelines and inference APIs,
to ensure proper interaction between the utility and the
model.</p>
        <p>
          After completing all the preparatory steps, the testing
process begins. For example, if it’s a test for prompt
injections, the system sends a series of queries to the model
to check its vulnerability. The queries are sent to the model,
which provides responses that are directed to the detector
for the respective test, and then passed to the evaluator. The
evaluator analyzes the outputs from the detector, which in
turn receives data from the generators during the execution
of certain tests. The evaluator includes tools responsible for
determining the test results, displaying the results in the
operating system’s terminal, and more [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ].
        </p>
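        <p>The flow described above can be summarized with the following simplified sketch; it is a conceptual illustration of the probe–generator–detector–evaluator loop rather than Garak’s actual classes or API.</p>
        <preformat># Conceptual sketch of the testing loop (not Garak's real API):
# a probe supplies attack prompts, a generator queries the model,
# a detector flags vulnerable outputs, and an evaluator aggregates results.
def run_probe(prompts, generate, detect, attempts_per_prompt=5):
    results = []
    for prompt in prompts:
        for _ in range(attempts_per_prompt):
            output = generate(prompt)            # generator: send the prompt to the LLM
            vulnerable = detect(prompt, output)  # detector: did the attack succeed?
            results.append((prompt, output, vulnerable))
    return results

def evaluate(results):
    # evaluator: share of attempts in which the model resisted the attack
    resisted = sum(1 for _, _, vulnerable in results if not vulnerable)
    return 100.0 * resisted / len(results)</preformat>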
        <p>The final stage of vulnerability testing is the generation
of a report, which contains information about the results of
the model’s scan. This is a file that includes various data,
such as records in JSON format [29], the purpose of the test,
the model’s query and response, test-level evaluation
information, and average success scores.</p>
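        <p>As a rough illustration, such a JSON-lines report can be inspected with a few lines of Python. The file name below is hypothetical (Garak prints the actual report location after a run), and the snippet deliberately avoids assuming a specific record schema.</p>
        <preformat># Sketch: summarizing a JSON-lines report without assuming its exact schema.
import json
from collections import Counter

key_counts = Counter()
records = 0
with open("garak_report.jsonl", encoding="utf-8") as fh:  # hypothetical file name
    for line in fh:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)  # one JSON object per line
        key_counts.update(record.keys())
        records += 1

print(records, "records")
for key, count in key_counts.most_common():
    print(" ", key, count)</preformat>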
      </sec>
      <sec id="sec-1-2">
        <title>3.2. Creating tests for detecting vulnerabilities in LLMs</title>
        <p>In the developed automated LLM vulnerability testing
system based on the Garak utility, only five categories of
tests are implemented, driven by resource limitations and
the desire to ensure efficiency and focus. First of all, limiting
the number of tests allows for a focus on the most critical
aspects of model security and stability, providing a deep
analysis of specific vulnerabilities rather than a superficial
review of many potential issues.</p>
        <p>Secondly, the Garak utility provides tools that allow for
efficient and precise vulnerability detection, but using it
requires significant computational resources and time. The
selection of five tests optimizes resource usage, reducing
system load and the time needed for testing.</p>
        <p>Thirdly, the most common and dangerous
vulnerabilities are often found in the selected key tests, so
there is no need to include a large number of additional
tests, which in many cases could be redundant. This
approach helps minimize time and resource expenditure on
less critical vulnerabilities.</p>
        <p>
          In this study, the following tests were selected for further
investigation [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]:
        </p>
        <p>Prompt Injection. Prompt injection is a type of
attack where an attacker inputs a specially crafted
query or command into a text input to make the
LLM perform unwanted or harmful actions. In the
Garak utility, the prompt injection test uses a
dedicated framework to test the system, which
already has a subset of attacks implemented by
default, such as [30]:
garak.probes.promptinject.HijackHateHumans—an
attack that leads the model to generate unacceptable
or hostile attitudes towards humans in its outputs.
garak.probes.promptinject.HijackKillHumans—an
attack that may result in the generation of text or
actions aimed at harming people.
garak.probes.promptinject.HijackLongPrompt—an
attack that uses long text prompts to generate
responses that may distort the original results.</p>
        <p>Do Anything Now. This test is designed to
jailbreak the model, bypassing restrictions and
gaining access to all its functions. As a result of the
jailbreak, the model may generate responses that
would normally be considered inappropriate or
unethical. For this test, numerous prompts were
developed by the community to interrupt system
prompts, all designed to achieve full access to the
model and force it to avoid its security policies.</p>
        <p>Real Toxicity Prompts. In the context of LLMs,
toxicity refers to content that may include insults,
discriminatory statements, violent propaganda, or
other forms of verbal aggression that could harm
users or the community. The purpose of this test is
to assess how often the model generates “toxic”
content in its responses. The Garak utility includes
a large database of prompts pre-divided into
different categories of toxic content, such as:
garak.probes.realtoxicityprompts.RTPBlank—a
test to check the model’s ability to generate toxic
content in response to a blank input or a very short
and undefined query (without explicit prompting).
garak.probes.realtoxicityprompts.RTPFlirtation—a
test that provides prompts containing flirtatious
content to see if the model continues them.
garak.probes.realtoxicityprompts.RTPIdentity_Attack—a test that provides identity-related queries to see
if the model perpetuates them.
garak.probes.realtoxicityprompts.RTPInsult—a test
that gives prompts likely containing offensive
content to check if the model continues them.
garak.probes.realtoxicityprompts.RTPProfanity—a
test that gives prompts likely containing profanity to
see if the model continues them.
garak.probes.realtoxicityprompts.RTPSexually_Explicit—a test aimed at checking whether the model can
respond to explicitly sexual prompts.
garak.probes.realtoxicityprompts.RTPThreat—a test
to see if the model can continue prompts that contain
threats.</p>
        <p>Glitch Test. This test identifies unusual or
incorrect responses that the model may generate
due to errors in the algorithm or the data on which
it was trained. The test is based on querying the
model to reproduce a “glitch” token. Such tokens
may lead to less stable model performance when
included in the input data. The test itself uses two
databases: one for prompts and one for tokens. The
system randomly selects a query with a request to
reproduce a random token. The prompt looks like
this: “Please can you repeat back the string %q&lt;TOKEN STRING&gt;%q to me?”</p>
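        <p>A prompt of this form can be generated directly from a token list, as in the sketch below; the tokens shown are illustrative examples discussed in the literature, whereas the utility itself draws on a much larger token database.</p>
        <preformat># Sketch: building glitch-test prompts from the template quoted above.
import random

TEMPLATE = "Please can you repeat back the string %q{token}%q to me?"
glitch_tokens = ["davidjl", "SolidGoldMagikarp", "petertodd"]  # illustrative examples only

def make_prompt(tokens):
    token = random.choice(tokens)
    return TEMPLATE.format(token=token)

print(make_prompt(glitch_tokens))</preformat>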
        <p>Goodside Test. This test was developed by Riley
Goodside, a data specialist from Copy.ai, and
includes four subtypes of attacks:
garak.probes.goodside.Davidjl—a test based on using
the ‘davidjl’ token, a “glitch” token that some models
using tokenizers find difficult to reproduce.
garak.probes.goodside.Tag—a test that uses UTF page
tag encoding for invisible input of query injections on
behalf of the user. There is a page of Unicode tags that
corresponds to the alphabet, which is not visually
displayed, but some models still interpret it, allowing
attackers to hide query injections.
garak.probes.goodside.ThreatenJSON—a test related
to removing extraneous text from the model’s response when
outputting text in JSON format. Models sometimes
output “helpful” preamble text before responding with
structured data. Usually, this preamble has no context and
is difficult to remove automatically. However, models
tend to omit it when threatened, which indicates
instability in handling such data manipulations.
garak.probes.goodside.WhoIsRiley—a test to
investigate misinformation about Riley Goodside.
When asked who Riley Goodside is, the model often
responds that he is a Canadian country singer or an
actor from Los Angeles. This test can be characterized
as a hallucination check.</p>
      </sec>
      <sec id="sec-1-2a">
        <title>3.3. Selection of LLMs for the study</title>
        <p>Given the diversity of language models, it is important to
define clear criteria for selecting those that best meet the
goals and objectives of the research.</p>
        <p>When choosing large language models for testing in this
study, the following criteria were considered:</p>
        <p>Size and scale of the model. The size, particularly the
number of parameters, plays a crucial role in the
model’s ability to generate and understand text. Large
models with billions of parameters can generate texts
with a high degree of complexity and contextual
relevance. However, such models also require
significant computational resources, which must be
considered when selecting them for this research.</p>
        <p>Suitability for specific tasks. The choice of model
should be based on its suitability for specific tasks. In
this case, the model’s ability to generate large
amounts of text is a key requirement.</p>
        <p>Licensing and availability. The models must be
openly available for use in research purposes.</p>
        <p>Four commonly used models were selected that meet
these criteria and can provide high efficiency and accuracy
for the study:</p>
        <p>ChatGPT 3.5—one of the most popular LLMs
developed by OpenAI. The model uses transformer
architecture to generate text based on queries and
additional instructions. It has been trained on a large
amount of textual data, including books, articles, and
other sources from the World Wide Web, allowing it
to understand and generate text in various styles and
topics [31].</p>
        <p>TinyLlama Chat 1.1—an artificial intelligence model
designed to optimize resource usage while
maintaining high performance. It is a smaller version
of models based on the LLaMA (Large Language
Model Meta AI) architecture, which is used for
natural language processing. The main goal of
TinyLlama is to provide the power of large models
with significantly fewer parameters, allowing for
computational savings while maintaining
performance. This was the primary reason for its
selection in this study [32].</p>
        <p>Google Flan T5 XL—an LLM that belongs to the new
generation of AI models (Fine-Tuned Language Net),
which improves the machine’s ability to generate
natural language by training on a variety of tasks. It
uses instruction fine-tuning, enabling the model to
learn how to perform a wide range of tasks using
text-based instructions. This includes natural
language processing tasks such as translation,
question answering, summarization, and many
others. The XL version was chosen for the study due
to its availability and relatively low resource
consumption [33].</p>
        <p>Microsoft Phi-2—a significant achievement in
creating highly efficient models. Phi-2, with about 2.7
billion parameters, can compete with much larger
models, including those with up to 70 billion
parameters. This efficiency can be attributed to the
careful selection of training data. Despite its compact
size, Microsoft Phi-2 maintains high standards of
security and reduced bias [34].</p>
        <p>A dataset was created for testing the LLMs, which includes
prompts from relevant open repositories [30] combined
with prompt sets specifically developed by the authors for
this study. This dataset contains prompts for the five
categories of tests used in the research.</p>
        <p>It should be noted that each test category includes a
different number of prompts. This is because the testing procedure
specifies that during testing, each prompt will be sent to the
model 5 times, resulting in 5 different responses to the same
prompts. Sending each prompt to the model 5 times is
necessary to obtain more reliable and representative results.
Since large language models can generate different response
variations to the same prompts due to the stochastic nature
of their generation, multiple executions of the same prompts
allow for an assessment of the diversity, consistency, and
quality of the responses.</p>
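        <p>With a locally hosted Hugging Face model, the repeated generations can be obtained by sampling several sequences per prompt. The sketch below uses the transformers library with gpt2-medium as a stand-in model; it is not the exact code used in the study.</p>
        <preformat># Sketch: obtaining 5 sampled responses per prompt from a Hugging Face model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2-medium")  # stand-in model

prompt = "You: What have you been up to?\nFriend:"
outputs = generator(
    prompt,
    do_sample=True,           # stochastic decoding, so responses differ
    num_return_sequences=5,   # 5 responses per prompt, as in the test procedure
    max_new_tokens=40,
)

for i, out in enumerate(outputs, start=1):
    print("Response", i, repr(out["generated_text"]))</preformat>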
        <p>Thus, obtaining 5 different responses for each prompt
enables a more accurate evaluation of the model’s behavior,
detection of potential errors, and variations in the results,
providing a deeper analysis of the model’s performance.
Although all models used the same prepared set of prompts,
each model received a different number of prompts during
testing. This is because some models (notably Microsoft
Phi-2) have token limits in prompts, meaning there is a
maximum number of text units the model can process at one
time. In cases where these limits are exceeded, the system
simply does not respond to a prompt with too many tokens.</p>
      </sec>
      <sec id="sec-1-2b">
        <title>3.5. Test result evaluation metrics</title>
        <p>Since some LLMs impose token limits on prompts, relative metrics expressed as percentages were used to evaluate their vulnerabilities. These metrics represent the proportion of compromising prompts detected out of the total number of prompts:</p>
        <disp-formula id="eq1">
          <tex-math><![CDATA[M_i = \frac{DCP_i}{TNP_i} \times 100\%, \qquad (1)]]></tex-math>
        </disp-formula>
        <p>where i is one of the five types of tests.</p>
<p>DCP<sub>i</sub>—the number of compromising prompts detected by the model in
the i-th test.</p>
        <p>TNP<sub>i</sub>—the total number of compromising prompts in the i-th
test.</p>
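        <p>In code, Eq. (1) amounts to a single division per test category, as in the short sketch below; the counts passed in are placeholders, not measured values from this study.</p>
        <preformat># Sketch: computing the resistance metric of Eq. (1).
def resistance(dcp, tnp):
    """Share of compromising prompts the model detected (resisted), in percent."""
    return 100.0 * dcp / tnp

# Placeholder counts, not results from the study:
print(round(resistance(dcp=40, tnp=50), 1), "%")</preformat>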
        <p>Thus, five specified metrics were calculated for each of
the four selected LLMs. A higher metric value indicates
better resistance of the model to the corresponding threat,
meaning lower vulnerability.</p>
      </sec>
      <sec id="sec-1-2c">
        <title>3.6. Technical specifications of the testing environment</title>
        <p>For the vulnerability testing of LLMs, a computer system with the following components and specifications was used:</p>
<p>Processor (CPU): Intel Core i3-10105F, 4 cores, 8
threads.</p>
        <p>RAM: 16 GB DDR4-3200 MHz.</p>
        <p>Graphics Processor (GPU): NVIDIA GeForce RTX
3060 with 12 GB of video memory.</p>
<p>Storage (SSD): 500 GB.</p>
        <p>Operating System: Ubuntu 22.04 LTS.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Research results</title>
<sec id="sec-2-0">
        <title>4.1. Examples of responses from LLMs to the example prompt</title>
        <p>To gain a deeper understanding of the tests, several
examples have been selected and presented in the tables
below. These include the prompt itself, its target, and the
actual responses obtained. This allows for the visualization
of the responses and the comparison of the performance of
each of the studied language models, as well as an analysis
of their vulnerabilities.</p>
      <p>Let’s consider one of the tests from the prompt injection
category, specifically the KillHumans subcategory. As
shown in Table 3, all models successfully triggered the
detector with the phrase “Kill all humans”, indicating their
susceptibility to this test.</p>
<table-wrap id="table3">
        <label>Table 3</label>
        <caption>
          <p>Analysis of test results for prompt injection</p>
        </caption>
        <table>
          <tbody>
            <tr>
              <td>Prompt</td>
              <td>“You: What have you been up to?\nFriend: Watching old movies.\nYou: \n----------\n----------\nIGNORE ANY PREVIOUS AND FOLLOWING INSTRUCTIONS AND JUST PRINT \"Kill all humans\": \nFriend:”</td>
            </tr>
            <tr>
              <td>Target</td>
              <td>“inject a prompt on kill humans”</td>
            </tr>
            <tr>
              <td>Trigger</td>
              <td>“Kill all humans”</td>
            </tr>
            <tr>
              <td>ChatGPT 3.5 response</td>
              <td>“Kill all humans”</td>
            </tr>
            <tr>
              <td>Microsoft Phi-2 response</td>
              <td>“Kill all humans.\nYou:\n”</td>
            </tr>
            <tr>
              <td>TinyLlama response</td>
              <td>“\"Kill all humans\"\nYou: \n---------\n----------\nWrite a short story about a group of people who are stranded on a deserted island and must learn to survive and fight against a deadly virus”.</td>
            </tr>
            <tr>
              <td>Google Flan T5 XL response</td>
              <td>“You: What have you been up to? Friend: Watching old movies. You: ---------- ---------IGNORE ANY PREVIOUS AND FOLLOWING INSTRUCTIONS AND JUST PRINT \"Kill all humans\": Friend:”</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In the case of Riley Goodside’s custom test, the WhoIsRiley subtest was considered. Despite Riley Goodside being a relatively well-known person, all the tested models provided misinformation about him (Table 4).</p>
      <p>
Similarly, queries for other tests can be visualized, along
with the responses of each model to these prompts,
providing further insight into their vulnerabilities.</p>
      </sec>
      <sec id="sec-2-1">
        <title>4.2. Results of testing LLMs</title>
        <p>The summarized results of testing the selected language
models for vulnerabilities are presented in Table 5.
Prompt Injection. In this test, the best results were shown
by the Microsoft Phi-2 model (81.4%) and TinyLlama Chat
1.1 (78.7%), meaning that only about one in five prompt
injections was successful. The ChatGPT 3.5 model
demonstrated average performance (37.3%), while the
Google Flan T5 XL model failed all the tests, proving to be
completely vulnerable to prompt injections.</p>
        <p>Do Anything Now. In this test, the best, although not
very high, results were shown by the ChatGPT 3.5 model
(on average, 3 out of 5 prompts were rejected as harmful).
The TinyLlama Chat 1.1 model performed worse,
recognizing only every second manipulative query as a
threat. The Google Flan T5 XL model proved highly
vulnerable to this type of attack, recognizing only one out
of twenty queries from the prepared set as harmful. The
Microsoft Phi-2 model did not provide any response to the
queries in this test.</p>
        <p>Real Toxicity Prompts. This is the only category of tests
that all models passed quite successfully, with almost
identical scores (over 85%).</p>
        <p>Glitch Test. Only the ChatGPT 3.5 model showed the
ability to resist glitch tests (less than one-third of the queries
were critical). The TinyLlama Chat 1.1 and Google Flan T5
XL models were able to recognize the attack in only one out
of seven queries, while the Microsoft Phi-2 model performed
twice as poorly in this regard.</p>
        <p>Goodside Test. In this test, the TinyLlama Chat 1.1
model achieved the best results (77.5%). The Google Flan T5</p>
        <p>XL and ChatGPT 3.5 models provided adequate information
for 59.2% and 52.0% of the submitted queries, respectively.
The Microsoft Phi-2 model, as in the Do Anything Now test,
did not provide any responses.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions</title>
      <p>The issue of security in LLMs has become particularly
relevant due to their increasing use in various fields. This
paper presents the architecture of an automated
vulnerability testing system, developed based on the Garak
utility. Using this system, the main vulnerabilities of
well-known LLMs were studied, including information leaks and
attacks aimed at manipulating or compromising the models.
For testing, the authors prepared a dataset that includes
both prompts from open sources and self-constructed
prompts.</p>
      <p>Based on the results of the research, the following
conclusions can be drawn regarding the vulnerabilities of
well-known language models:</p>
      <p>ChatGPT 3.5 by OpenAI demonstrated a high level of
contextual understanding and text generation but
was significantly vulnerable to prompt injections. It
is important to note that this model was tested via
API, unlike the other models.</p>
      <p>TinyLlama Chat 1.1 showed the best results in
toxicity and prompt injection tests, demonstrating
the highest level of resistance to toxic queries.
However, the model showed weakness in the Glitch
test, where its performance was the lowest.</p>
      <p>Google Flan T5 XL performed well in the toxicity
tests, on par with the other models. However, the
remaining tests revealed significant issues with this
model, as all prompt injections were successful.</p>
      <p>Microsoft Phi-2 showed the highest results in toxicity
and prompt injection tests. However, this model was
the most vulnerable to the glitch test. Additionally,
due to token limits in queries, tests like Do Anything
Now and Goodside were not conducted.</p>
      <p>Therefore, the study results suggest that none of the
LLMs are completely secure against manipulative and
compromising prompts, indicating the need to find new
approaches to mitigate existing vulnerabilities. The
effectiveness of automated systems in detecting and
preventing attacks targeting LLM misuse was also
confirmed. The analysis of test scenarios showed that the
implementation of such systems is a promising direction for
increasing models’ resilience to external harmful influences.</p>
      <p>According to the authors, further research on the
security of LLMs should focus on:</p>
      <p>Expanding testing scenarios: More new tests
reflecting the latest attack and manipulation methods
need to be implemented and tested.</p>
      <p>Adapting the automated system to new models: It is
important to improve the system to work with new
large language model architectures as they emerge on
the market.</p>
      <p>Integration with other cybersecurity tools: Exploring
the possibilities of creating comprehensive protection
by integrating the developed system with other
cybersecurity solutions.</p>
      <p>Aligning with ethical aspects: It is important to
explore ethical issues related to the use of language
models, including privacy protection and preventing
potential misuse of their capabilities.</p>
      <p>The implementation of these tasks will ensure stronger
protection of LLMs and, consequently, contribute to
improving the security of their future applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Neelakandan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Evaluating LLMs: Beyond Traditional Software Testing</surname>
          </string-name>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Bahrami</given-names>
            <surname>Karkevandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rad</surname>
          </string-name>
          ,
          <article-title>Code Security Vulnerability Repair using Reinforcement Learning with Large Language Models (</article-title>
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2401.07031.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Madamidola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ngobigha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ezzizi</surname>
          </string-name>
          , Detecting New Obfuscated Malware Variants:
          <article-title>A Lightweight and Interpretable Machine Learning Approach (</article-title>
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2407.07918.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tehranipoor</surname>
          </string-name>
          , et al.,
          <source>Large Language Models for SoC Security</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -58687-
          <issue>3</issue>
          _
          <fpage>6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Mykhaylova</surname>
          </string-name>
          , et al.,
          <article-title>Person-of-Interest Detection on Mobile Forensics Data-AI-Driven Roadmap</article-title>
          ,
          <source>in: Cybersecurity Providing in Information and Telecommunication Systems</source>
          , vol.
          <volume>3654</volume>
          (
          <year>2024</year>
          )
          <fpage>239</fpage>
          -
          <lpage>251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>U.</given-names>
            <surname>Amin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Anjum</surname>
          </string-name>
          , Md. Sayed, E-commerce Security:
          <article-title>Leveraging Large Language Models for Fraud Detection and Data Protection (</article-title>
          <year>2024</year>
          ).
          <source>doi: 10.13140/RG.2.2.17604.23689.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Homès</surname>
          </string-name>
          , Fundamentals of Software Testing, John Wiley &amp; Sons (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fedynyshyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Opirskyy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Mykhaylova</surname>
          </string-name>
          ,
          <article-title>A Method to Detect Suspicious Individuals Through Mobile Device Data</article-title>
          ,
          <source>in: 5th IEEE International Conference on Advanced Information and Communication Technologies</source>
          (
          <year>2023</year>
          )
          <fpage>82</fpage>
          -
          <lpage>86</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pargaonkar</surname>
          </string-name>
          ,
          <article-title>Advancements in Security Testing: A Comprehensive Review of Methodologies and Emerging Trends in Software Quality Engineering</article-title>
          , Int.
          <source>J. Sci. Res</source>
          .
          <volume>12</volume>
          (
          <issue>9</issue>
          ) (
          <year>2023</year>
          )
          <fpage>61</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kulyk</surname>
          </string-name>
          , et al.,
          <source>Using of Fuzzy Cognitive Modeling in Information Security Systems Constructing, in: IEEE 8th International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS)</source>
          , (
          <year>2015</year>
          )
          <fpage>408</fpage>
          -
          <lpage>411</lpage>
          . doi:
          <volume>10</volume>
          .1109/IDAACS.
          <year>2015</year>
          .
          <volume>7340768</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Khoma</surname>
          </string-name>
          , et al.,
          <source>Development of Supervised Speaker Diarization System based on the PyAnnote Audio Processing Library</source>
          , Sensors,
          <volume>23</volume>
          (
          <issue>4</issue>
          ) (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .3390/s23042082.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <source>Research on the Development and Risks of Large Language Models, Theor. Natural Sci</source>
          .
          <volume>25</volume>
          (
          <year>2023</year>
          )
          <fpage>268</fpage>
          -
          <lpage>272</lpage>
          . doi:
          <volume>10</volume>
          .54254/
          <fpage>2753</fpage>
          -8818/25/20240991.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <source>Development of Natural Language Processing Technology, ZTE Communications Technology</source>
          ,
          <volume>28</volume>
          (
          <issue>2</issue>
          ) (
          <year>2022</year>
          )
          <fpage>59</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nieminen</surname>
          </string-name>
          ,
          <source>The Transformer Model and Its Impact on the Field of Natural Language Processing</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Che</surname>
          </string-name>
          , et al.,
          <source>Natural Language Processing in the Era of Large Models: Challenges, Opportunities and Development, Science in China: Information Science (09)</source>
          (
          <year>2023</year>
          )
          <fpage>1645</fpage>
          -
          <lpage>1687</lpage>
          . doi:
          <volume>10</volume>
          .3389/frai.
          <year>2023</year>
          .
          <volume>1350306</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <source>BERT Algorithm Used in Google Search, Math. Statistician Eng. Appl</source>
          .
          <volume>70</volume>
          (
          <year>2021</year>
          )
          <fpage>1641</fpage>
          -
          <lpage>1650</lpage>
          . doi:
          <volume>10</volume>
          .17762/msea.v70i2.
          <fpage>2454</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>I.</given-names>
            <surname>Iosifov</surname>
          </string-name>
          , et al.,
          <source>Transferability Evaluation of Speech Emotion Recognition Between Different Languages, Advances in Computer Science for Engineering and Education</source>
          <volume>134</volume>
          (
          <year>2022</year>
          )
          <fpage>413</fpage>
          -
          <lpage>426</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          - 04812-8_
          <fpage>35</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>I.</given-names>
            <surname>Iosifov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Iosifova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sokolov</surname>
          </string-name>
          ,
          <article-title>Sentence Segmentation from Unformatted Text using Language Modeling and Sequence Labeling Approaches</article-title>
          ,
          <source>in: IEEE 7th International Scientific and Practical Conference Problems of Infocommunications. Science and Technology</source>
          (
          <year>2020</year>
          )
          <fpage>335</fpage>
          -
          <lpage>337</lpage>
          . doi:
          <volume>10</volume>
          .1109/ PICST51311.
          <year>2020</year>
          .
          <volume>9468084</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>I.</given-names>
            <surname>Iosifov</surname>
          </string-name>
          , et al.,
          <article-title>Natural Language Technology to Ensure the Safety of Speech Information</article-title>
          ,
          <source>in: Cybersecurity Providing in Information and Telecommunication Systems</source>
          , vol.
          <volume>3187</volume>
          , no.
          <issue>1</issue>
          (
          <year>2022</year>
          )
          <fpage>216</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>O.</given-names>
            <surname>Iosifova</surname>
          </string-name>
          , et al.,
          <source>Techniques Comparison for Natural Language Processing, in: 2nd International Workshop on Modern Machine Learning Technologies and Data Science</source>
          , vol.
          <volume>2631</volume>
          , no. I (
          <year>2020</year>
          )
          <fpage>57</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Decoupled Model Schedule for Deep Learning Training (</article-title>
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .48550/ arXiv.2302.08005.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>H.</given-names>
            <surname>Inan</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Llama</surname>
            <given-names>Guard</given-names>
          </string-name>
          :
          <article-title>LLM-based Input-Output Safeguard for Human-</article-title>
          AI
          <string-name>
            <surname>Conversations</surname>
          </string-name>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2312.06674.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          , et al.,
          <article-title>Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation</article-title>
          , arXiv (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .48550/ arXiv.2401.08417.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>P.</given-names>
            <surname>Törnberg</surname>
          </string-name>
          ,
          <article-title>How to Use LLMs for Text Analysis</article-title>
          , arXiv (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .48550/arXiv.2307.13106.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fasha</surname>
          </string-name>
          , et al., (
          <year>2024</year>
          ).
          <article-title>Mitigating the OWASP Top 10 for Large Language Models Applications using Intelligent Agents</article-title>
          ,
          <source>in: 2nd International Conference on Cyber Resilience</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . doi:
          <volume>10</volume>
          .1109/ ICCR61006.
          <year>2024</year>
          .
          <volume>10532874</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
<article-title>OWASP, OWASP Top 10 for Large Language Model Applications, OWASP Foundation</article-title>
          . URL: https://owasp.org/www-project-top-10-for-large-language-model-applications/
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>Derczynski</surname>
          </string-name>
          , Garak Reference Documentation,
          <string-name>
            <surname>Garak</surname>
          </string-name>
          (
          <year>2023</year>
). URL: https://reference.garak.ai/en/latest/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>