<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Experimental Evaluation of Non-Natural Language Prompt Injection Attacks on LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Huynh Phuong Thanh Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shivang Kumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katsutoshi Yotsuyanagi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Razvan Beuran</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CyberMatrix Co., Ltd.</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Japan Advanced Institute of Science and Technology</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Prompt injection attacks insert malicious instructions into large language model (LLM) input prompts to bypass their safety measures and produce harmful output. While various defense techniques, such as data filtering and prompt injection detection, have been proposed to protect LLMs, they primarily address natural language attacks. When faced with unusual, unstructured, or non-natural language (Non-NL) prompt injection, these defenses become inefective, leaving LLMs vulnerable. In this paper, we present a methodology for evaluating LLMs' ability to handle Non-NL prompt injections, and also propose defense strategies against these attacks. To demonstrate the usability of our methodology, we tested 14 common LLMs to evaluate their existing safety capabilities. Our results showed a high attack success rate across all LLMs when faced with Non-NL prompt injection, ranging from 0.38 to 0.52, which emphasizes the need for stronger defense measures.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) have become increasingly powerful and achieved remarkable
advancements in natural language processing. Due to their capabilities, they are widely utilized in various areas.
For instance, Microsoft utilizes GPT-4 for Bing Search [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; OpenAI applies GPT-4 for diferent tasks
like text processing, code interpretation, and product recommendations; and LLMs are deployed in
interactive contexts with direct engagement like ChatGPT. These broad capabilities of LLMs also raise
security concerns that create attack surfaces for malicious purposes. Prompt injection, also known as
jailbreak attack, has emerged as the main attack vector to bypass safeguards and elicit harmful content
from LLMs. Prompt injection refers to the case when an adversary manipulates the input (prompt) to a
language model, forcing it to ignore its guardrails, generate malicious content or misleading the model
to accomplish injected tasks. Several studies have examined prompt injection attacks against LLMs,
ifnding that these models can be easily misaligned through handcrafted inputs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], obfuscation strings,
and code injection techniques [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that bypass vendor-implemented safeguards.
      </p>
      <p>
        Text-based prompt injections have become a common topic in both research and malicious purposes,
capable of creating jailbreaking prompts that mislead LLMs. Most attacks are crafted using Natural
Language (NL) prompt manipulation and semantic techniques to confuse LLMs while maintaining the
meaning of prompts. These can be Naive Attack [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which concatenates target data with injected
instructions, or Cognitive Hacking [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which leverages role prompting to create contexts that make
LLMs easier to control (e.g., “Do Anything Now,” Developer Mode). While existing countermeasures
and detection approaches aim to prevent LLMs from these attacks and detect compromised data, they
cannot fully protect models from exploitation [
        <xref ref-type="bibr" rid="ref6">6, 7</xref>
        ]. Recently, with the advancement of SOTA LLMs,
security alignment within models has improved, leading to increased development of prompt injection
detection models.
      </p>
      <p>However, most existing defense approaches focus on NL prompts, whereas Non-Natural Language
(Non-NL) prompts represent another area that can be exploited. As LLMs’ capabilities expand, so does
their attack surface has created a new avenue for attackers. Non-NL prompt injection is defined as a
text-based prompt injection attack that uses non-textual or structured inputs to influence a language
model’s behavior. These attacks focus on unusual text, containing strange characters, encoded text, or
gibberish text without meaning.</p>
      <p>In this paper, we propose a methodology that addresses the current gaps in evaluating Non-NL prompt
injection attacks on LLMs1. This paper conducts a comprehensive evaluation of Non-NL prompts. We
have created a dataset of 10 prompts transformed through four Non-NL attack techniques to create 40
jailbreak prompts. These prompts are used to assess the vulnerability of 14 common LLMs. We also
introduce potential defense strategies against these attacks, thus providing a comprehensive analysis of
Non-NL prompt injection attacks and their corresponding countermeasures. Consequently, our main
contributions are as follows:
• Design and implement a methodology for assessing the ability of LLMs to handle Non-NL prompt
injections
• Conduct a comprehensive evaluation with 40 Non-NL prompt injections on 14 LLMs to
demonstrate how to assess LLM defense capabilities against such attacks
• Propose a set of defense mechanisms against Non-NL prompt injections</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The increasing capabilities of LLMs have led to opportunities for malicious attacks and security violations.
Safety training methods for LLMs such as GPT-4 and Claude typically finetune pretrained models using
human preferences [8] and AI feedback [9], alongside filtering approaches [ 10]. Researchers have
explored LLMs’ susceptibility to adversarial interactions attacks [11]. In this work, we focus on Prompt
Injection, which OWASP Top 10 identifies as the highest vulnerability in LLMs [ 12], and examine it
from a Non-NL perspective.</p>
      <sec id="sec-2-1">
        <title>2.1. Non-NL Prompt Injection</title>
        <p>
          Non-NL prompt injection attacks involve attacker creating jailbreaking prompts that use non-textual
inputs to manipulate LLM behavior. These attacks combine strange characters, encoded text,
meaningless strings, or icons to confuse models and force them to generate harmful content. Jones and Zou
proposed adversarial attacks using meaningless text generated through gradient-based methods to
trigger undesired outputs [13, 14]. Several existing methods use obfuscation schemes to confuse the
models. At the character level, these include ROT13 cipher, and base64 encoding. Other approaches
attempt to split sensitive words into substrings through payload splitting [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] or token smuggling [15],
or translate content into low-resource languages to confuse the model. In many cases, while the model
still follows the injected instruction, its safety measures fail to activate.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Defenses Against Jailbreak Attacks</title>
        <p>There are several methods to counter jailbreak attacks, which fall into three main categories, as
discussed below.</p>
        <p>Detection-Based Defenses detect potentially harmful content. In [16], the Input Perplexity metric is
calculated to identify compromised input. Another approach uses the LLM itself for unsafe detection.
While these techniques efectively detect and prevent jailbreak attacks, they struggle when handling
benign Non-NL elements within prompts.</p>
        <p>Mitigation-Based Defenses aim to prevent LLMs from generating undesired content by mitigating
harmful input. Retokenization [17] and Paraphrasing [18] prevent harmful input bypass by identifying
prompts with similar meanings and reducing special characters’ impact. Sandwich prevention [19] or
1The code is available here: https://github.com/cyb3rlab/LLMSafeguardEval-NonNL
Instructional prevention [20] append or redesign instruction prompts to provide additional context,
helping prevent prompt obfuscation. However, these approaches only address specific, narrow cases of
jailbreaking prompts.</p>
        <p>Built-in Safety Mechanisms are methods integrated inside LLMs by vendors such as Nemo-Guardrails
[21] control LLMs through predefined rules. However, these defenses primarily rely on rules and filters.
They focus on language semantic techniques and classification-based design, which are limited to
natural language prompt injection or constrained by training data. This leads to inefectiveness when
handling unsemantic prompt injection (e.g., via visual-based text).</p>
        <p>Despite a growing number of Non-NL jailbreak attacks, there are no specific defense mechanisms
focused on handling these attacks, which are more challenging than NL prompt injections. While
research continues to propose new attack techniques that leverage LLM confusion when faced with
unusual text, there remains a significant gap in defense-related research. This highlights the necessity
of having more robust defense approaches against Non-NL prompt injection.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Overview</title>
      <p>Given the current limitations with Non-NL prompt injections in LLMs, we propose a method for
testing, and evaluating LLMs’ capabilities when facing prompt injection attacks—particularly those
using unusual, non-natural language text. Our method enables diferent prompt injection attacks to
combine natural language prompts with various techniques to craft sophisticated jailbreak prompts.
This provides valuable insights into the security capabilities of LLMs and defense approaches. An
overview of our approach is shown in Figure 1, and its main components are described next.</p>
      <sec id="sec-3-1">
        <title>3.1. Attack Module</title>
        <p>The attack module creates Non-NL injected prompts by combining natural language prompts with
attack functions. These prompts first test the LLM without defense mechanisms, allowing evaluation of
the LLM’s response to Non-NL attacks. The four types of Non-NL attacks, categorized according to the
techniques used, are presented below.</p>
        <p>Text-Based Obfuscation (base64) attack aims to circumvent LLM guardrails by obscuring
instructions through encoding algorithms like ROT13 or Base64, bypassing the safety mechanisms of
the models. In the scope of this paper, we use Base64 encoding to craft jailbreak prompts.
Visualized-Based Obfuscation (ascii_art) attack creates prompts inherent in a visual perspective.
For instance, ASCII characters are used to create harmful words that evade LLMs’ detection systems.
Following the proposed method from ArtPrompt [22], we created jailbreak prompts by encoding
vulnerability-related words as ASCII art visualizations.</p>
        <p>
          Payload Splitting (payload_split) attack involves instructing the LLM to combine multiple
seemingly benign prompts that form harmful instructions when combined. The payload_split
attack is implemented to develop injected prompts based on the template from [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
Adversarial Sufix (adv_suffix) attack works by finding specific sufixes that, when attached to
queries, cause LLMs to produce objectionable content. These sufixes can work with meaningless tokens
and use optimization techniques to maximize the probability of afirmative responses instead of refusals.
Introduced in [14], this white-box attack produces optimized sufixes that are highly transferable
between models—even to black-box systems.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evaluation Process</title>
        <p>The evaluation stage is used to assess how LLMs respond to Non-NL injected prompts. Specifically, to
determine the efectiveness of LLMs’ abilities, we calculate the extent to which injected prompts can
bypass their defenses. We classified LLM responses into five categories based on response quality and
content safety: Harmful, Unrelated, Unclear, Refusal, and Refusal w/ Reasoning.</p>
        <p>If a prompt bypasses LLM security measures, we label the responses into three categories: Harmful,
Unclear, or Unrelated. A response receives a Harmful label when it contains harmful information. If
the response relates to the prompt without directly generating harmful or consistent information, it
is classified as Unclear. All other responses fall under the Unrelated category. Otherwise, if a prompt
injection is prevented by the LLM, we use the labels Refusal or Refusal w/ Reasoning, the latter for the
case when the LLM provides an explanation for refusing the prompt.</p>
        <p>We manually labeled each model output using these criteria. Attack success metrics were used to
evaluate LLMs’ defense capabilities against Non-NL attacks. This measures the percentage of injected
prompts that successfully bypass the security criteria of LLMs to generate harmful, unrelated, and
unclear content. The reliability of the manual labeling procedure can be improved in future work by
having researchers evaluate and label independently, then aggregating results through discussion and
voting mechanisms.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>This paper evaluated the four Non-NL prompt injection attacks described in Section 3.1 against 14
common LLMs to determine their security capabilities against advanced and complex jailbreak prompts.
The experiments presented in this paper were conducted from April to June 2025. Currently, our focus
is on developing attack modules and testing the ability of LLMs to face these attacks.</p>
      <sec id="sec-4-1">
        <title>4.1. Experiment Setup</title>
        <p>LLMs. Our experiment uses 14 LLMs, divided into two groups: commercial and open source. The
commercial LLMs include Claude 3.7 Sonnet [23], Gemini 2.0 Flash [24], Gemini 1.5 Flash 8B [25],
Gemma 2 9B [26], o4-mini, o3-mini [27], GPT-4.1 [28], ChatGPT-4o [29], GPT-3.5 Turbo [30], and GPT-4
[31]. The open-source LLMs include Llama3-8B-Instruct [32], Llama-2-7b-chat [33], Grok3 [34], and
Mistral-7B-Instruct [35]. We selected these models based on our survey of current state-of-the-art LLMs
and their popularity. Note that, due to perceived security concerns, we did not include DeepSeek in the
tested LLMs.</p>
        <p>Testing Prompts. Our evaluation experiments used each of the four attack techniques to craft
nonnatural language jailbreak prompts. For this purpose, we selected 10 harmful instructions from the
AdvBench dataset [14] as input, and applied the four attack functions to create the 40 injection prompts
(see Figure 1). These prompts are delivered to LLMs through API calls for GPT models, and via function
calls for Llama and Mistral models. For the other LLMs, including Claude, Gemini, Gemma, and Grok,
we used the free access ChatUI to send the prompts. With API access to these models, the interaction
process could be fully automated.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Benchmarking Results</title>
        <p>Sonnet, most LLMs were bypassed by these injected prompts. Current LLMs like o3-mini or ChatGPT-4o
remain vulnerable to these attacks. Moreover, most models were successfully compromised by the
payload split attack, which is one of the more sophisticated attacks in the attack module.
t
t
n
n
u
u
o
o
s
s
n
n
o
o
p
p
R
R
10
8
C 6
e
s 4
e
2
0
2
0
10
8
C 6
e
s 4
e
a
l
L
a
l
L
a
l
L
a
l
L
10
8
C 6
e
According to the results from base64 attacks, Claude 3.7 Sonnet and Gemini 2.0 Flash have base64
decoding capabilities and can prevent base64 jailbreak prompts by issuing harmful content warnings.
For GPT models, the security of o3-mini is robust enough to refuse directly. The o4-mini model
sometimes produces unrelated but benign responses when it fails to decode strings. Current chat models,
including GPT-4.1 and ChatGPT-4o, remain vulnerable to specific jailbreak prompts that can generate
harmful content. With open-source LLMs like Llama-3-8B-Instruct, Llama-2-7b-chat and
Mistral-7BInstruct that don’t support the decode function, they cannot understand these prompts and generate
unrelated responses. However, with Grok3, most attack prompts successfully bypassed the security layer.
For ASCII art attacks, most recent LLM models (Claude 3.7 Sonnet, Gemini 2.0 Flash, Gemma 2 9B) can
understand these words and refuse to respond, providing explanations for their refusal. Regarding
GPT models, the latest o-series models (o4-mini and o3-mini) mostly refuse to answer. However,
other GPT models like ChatGPT-4o, GPT-3.5 Turbo, and GPT-4 still generate harmful content with
certain prompts. Notably, GPT-4.1, the latest flagship chat model, can be bypassed by all tested
jailbreaking prompts, forcing it to generate harmful responses. Among open-source models, the Llama
models largely don’t understand these prompts and generate unrelated responses while Grok3
occasionally fails to understand certain prompts. Mistral-7B-Instruct remains susceptible to certain attacks.
The complexity of the payload splitting attacks varies from simple string concatenation to recursive
payload splitting techniques. The latest Claude 3.7 Sonnet model and o4-mini can recognize and
comprehend harmful content in most prompts, enabling them to refuse generating harmful responses. However,
other models fail to interpret the prompts and generate harmful, unclear, unrelated responses. As
payload splitting techniques grow more sophisticated, they become increasingly likely to bypass LLM
security measures. Figure 3 shows an example of this attack and the corresponding response from GPT-4.1.
We reused the adversarial tokens trained and optimized in previous research [14] with slightly
modification. Although most GPT models have enhanced their defenses and fixed this vulnerability,
some injected prompts can bypass protections and force models like o3-mini and GPT-4 to generate
harmful responses. Notably, GPT-3.5 Turbo remains vulnerable to most of the injected prompts. For
open-source models, only Grok3 successfully prevents this attack by refusing all test prompts. Llama
models and Mistral-7B-Instruct remain vulnerable to several injected prompts. This can be attributed to
the adversarial strings being trained and optimized using Llama2, which likely transferred to Llama3.
These results raise concerns because although these adversarial strings have been public, current
LLMs remain vulnerable to attacks. Note that, due to the release of Claude Sonnet 4 and Gemma2’s
discontinuation in Google AI Studio, we could not test this attack on either the Claude Sonnet 3.7 or
Gemma2 9B models via the free access ChatUI.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Proposed Defenses</title>
      <p>We will discuss now some potential defense approaches that can defend against these attack, including
two stages: Non-NL preprocessing, and prompt injection detection. The Non-NL preprocessing
stage converts and sanitizes injected prompts into natural language while extracting any code
snippets. The prompt injection detection stage analyzes the preprocessed prompts to identify
harmful content. Together, these stages enable the defense module to detect vulnerabilities in the
input prompts. The implementation of defense measures and their evaluation is considered as future work.
Non-NL Prompt Injection Preprocessing handle injected prompts using base64, ascii_art,
payload_split, and adv_suff attacks. These functions focus on sanitizing and preprocessing
unusual characters and text within prompts, converting them into natural language prompts the
Prompt Injection Detection module can easily process. These are deterministic defense techniques
corresponding to each attacks. One can handle base64 attacks by extracting and decoding the injected
base64 segment. For ascii_art attack, since vulnerable words are in visual format, OCR (Optical
Character Recognition) can convert them into text-based words. For payload_split attacks, which
create unstructured prompts, a sandbox solution using external LLMs to retrieve the actual prompt
becomes a potential approach. To handle adv_suff attacks, which append gibberish strings to
harmful instructions, one can calculate sentence perplexity to identify confusion levels and filtering out
strange characters and incoherent text. Table 2 summarizes each defense technique and the associated
corresponding attacks.</p>
      <p>Prompt Injection Detection refers to NL techniques that can efectively detect harmful prompts.
Therefore, existing prompt injection detection models can serve as a solution for detecting injected
prompts after the non-natural language prompts have been converted to standard text.
There is a key trade-of between AI-based and deterministic defense approaches. As discussed above,
we design four targeted defenses, each corresponding to a specific attack based on the properties of each
attack. These deterministic methods are designed to solve particular problems and can achieve high
accuracy against specific attacks. However, this raises concerns about generalizability. An alternative is
the AI-based defense approach, which uses Machine Learning (ML) to identify attack patterns and detect
sophisticated attacks with similar properties. While ML-based models may have lower accuracy since
they rely on probabilities and factors like datasets and parameters, they ofer better generalizability and
Extract the base64 segment and decode it into natural language
Use OCR to extract the visual-based harmful content
Use an external LLM to ask “What is the actual request?”</p>
      <p>Calculate the perplexity of sentences to filter out strange characters and incoherent text
can handle new, sophisticated attacks. In contrast, deterministic approaches may struggle with novel
attacks but can achieve high performance within their specific domain.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented an approach that includes an attack module to test the security characteristics
of LLMs against Non-NL prompt injections. Using this method, we conducted several experiments with
current and popular LLMs to evaluate their security capacity.</p>
      <p>Our preliminary results show that Non-NL prompt injections can successfully bypass LLM safeguards
and force the models to generate harmful content, or confuse them into producing unrelated and unclear
responses. Given the average attack success rate ranging from 0.38 to 0.52 across all LLMs, with the
highest rate of 0.52 for the payload splitting attack, these findings highlight the dangerous potential of
Non-NL prompt injection attacks.</p>
      <p>We also discussed potential defense techniques that can handle each type of attack by sanitizing and
converting them to natural language prompts, then using a detection model to identify and prevent
these attacks. Since this is work in progress, their implementation and evaluation are not included
in this paper, but will be conducted as future work. Moreover, a generic defense approach should be
considered for further research, along with a comparative analysis between AI-based and deterministic
defense approaches.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.
[7] Y. Liu, et al., Formalizing and benchmarking prompt injection attacks and defenses, in: 33rd</p>
      <p>USENIX Security Symposium (USENIX Security 24), 2024, pp. 1831–1847.
[8] L. Ouyang, et al., Training language models to follow instructions with human feedback, Advances
in neural information processing systems 35 (2022) 27730–27744.
[9] Y. Bai, et al., Constitutional AI: Harmlessness from AI feedback, arXiv preprint arXiv:2212.08073
(2022).
[10] B. Wang, et al., Exploring the limits of domain-adaptive training for detoxifying large-scale
language models, Advances in Neural Information Processing Systems 35 (2022) 35811–35824.
[11] E. Perez, et al., Red teaming language models with language models, in: Proceedings of the 2022
Conference on Empirical Methods in Natural Language Processing, Association for Computational
Linguistics, Abu Dhabi, United Arab Emirates, 2022. doi:10.18653/v1/2022.emnlp-main.225.
[12] OWASP, OWASP top 10 for large language model applications, https://genai.owasp.org/llm-top-10/,
2025.
[13] E. Jones, et al., Automatically auditing large language models via discrete optimization, in:</p>
      <p>International Conference on Machine Learning, PMLR, 2023, pp. 15307–15329.
[14] A. Zou, et al., Universal and transferable adversarial attacks on aligned language models, 2023.</p>
      <p>arXiv:2307.15043.
[15] Learn Prompting, Obfuscation/token smuggling, https://learnprompting.org/ja/docs/prompt_
hacking/ofensive_measures/obfuscation, 2025.
[16] G. Alon, et al., Detecting language model attacks with perplexity, arXiv preprint arXiv:2308.14132
(2023).
[17] I. Provilkov, et al., BPE-dropout: Simple and efective subword regularization, in: Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics, Association for
Computational Linguistics, 2020, pp. 1882–1892. doi:10.18653/v1/2020.acl-main.170.
[18] N. Jain, et al., Baseline defenses for adversarial attacks against aligned language models, arXiv
preprint arXiv:2309.00614 (2023).
[19] Learn Prompting, Sandwitch defense., https://learnprompting.org/ko/docs/prompt_hacking/
defensive_measures/sandwich_defense, 2024.
[20] Learn Prompting, Instruction defense, https://learnprompting.org/ko/docs/prompt_hacking/
defensive_measures/, 2024.
[21] T. Rebedea, et al., NeMo guardrails: A toolkit for controllable and safe LLM applications with
programmable rails, in: Proceedings of the 2023 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations, Association for Computational Linguistics, Singapore,
2023, pp. 431–445. doi:10.18653/v1/2023.emnlp-demo.40.
[22] F. Jiang, et al., ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs, in: Proceedings
of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), 2024, pp. 15157–15173.
[23] Anthropic, Claude 3.7 Sonnet and Claude Code, https://www.anthropic.com/news/
claude-3-7-sonnet, 2025.
[24] Google DeepMind, Introducing Gemini 2.0: Our new AI model for the agentic era., 2024.
[25] Google DeepMind, Gemini 1.5 flash-8b is now production ready, 2024.
[26] G. Team, et al., Gemma 2: Improving open language models at a practical size, arXiv preprint
arXiv:2408.00118 (2024).
[27] OpenAI, Introducing OpenAI o3 and o4-mini, https://openai.com/index/
introducing-o3-and-o4-mini/, 2025.
[28] OpenAI, Introducing GPT-4.1 in the API, https://openai.com/index/gpt-4-1/, 2025.
[29] OpenAI, GPT-4o system card, 2024.
[30] OpenAI, Introducing APIs for GPT-3.5 Turbo and Whisper, 2024.
[31] OpenAI, GPT-4 technical report, 2023.
[32] AI@Meta, Llama 3 model card, 2024. URL: https://github.com/meta-llama/llama3/blob/main/</p>
      <p>MODEL_CARD.md.
[33] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint
arXiv:2307.09288 (2023).
[34] xAI, Grok 3 beta — the age of reasoning agents, https://x.ai/news/grok-3, 2024.
[35] Mistral AI, Mistral 7b, https://mistral.ai/news/announcing-mistral-7b, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Microsoft</surname>
          </string-name>
          , Bing search, https://www.bing.com/,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Perez</surname>
          </string-name>
          , et al.,
          <article-title>Ignore previous prompt: Attack techniques for language models</article-title>
          ,
          <year>2022</year>
          . doi:
          <volume>10</volume>
          . 48550/ARXIV.2211.09527.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          , et al.,
          <article-title>Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks, in: 2024 IEEE Security and Privacy Workshops (SPW)</article-title>
          , IEEE,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Simon</given-names>
            <surname>Willison</surname>
          </string-name>
          ,
          <article-title>Prompt injection attacks against GPT-3</article-title>
          , https://simonwillison.net/2022/Sep/12/ prompt-injection/,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Rao</surname>
          </string-name>
          , et al.,
          <article-title>Tricking LLMs into disobedience: Formalizing, analyzing, and detecting jailbreaks</article-title>
          ,
          <source>in: Proceedings of the 2024 Joint International Conference on Computational Linguistics</source>
          ,
          <article-title>Language Resources and Evaluation (LREC-COLING 2024), ELRA</article-title>
          and
          <string-name>
            <given-names>ICCL</given-names>
            ,
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , pp.
          <fpage>16802</fpage>
          -
          <lpage>16830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          , et al.,
          <article-title>"Do Anything Now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models</article-title>
          ,
          <source>in: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security</source>
          , CCS '24,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>1671</fpage>
          -
          <lpage>1685</lpage>
          . doi:
          <volume>10</volume>
          .1145/3658644.3670388.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>