1. Introduction

Experimental Evaluation of Non-Natural Language Prompt Injection Attacks on LLMs

Huynh Phuong Thanh Nguyen

Shivang Kumar

Katsutoshi Yotsuyanagi

Razvan Beuran

1 0 CyberMatrix Co., Ltd. , Japan 1 Japan Advanced Institute of Science and Technology , Japan

2026

Prompt injection attacks insert malicious instructions into large language model (LLM) input prompts to bypass their safety measures and produce harmful output. While various defense techniques, such as data filtering and prompt injection detection, have been proposed to protect LLMs, they primarily address natural language attacks. When faced with unusual, unstructured, or non-natural language (Non-NL) prompt injection, these defenses become inefective, leaving LLMs vulnerable. In this paper, we present a methodology for evaluating LLMs' ability to handle Non-NL prompt injections, and also propose defense strategies against these attacks. To demonstrate the usability of our methodology, we tested 14 common LLMs to evaluate their existing safety capabilities. Our results showed a high attack success rate across all LLMs when faced with Non-NL prompt injection, ranging from 0.38 to 0.52, which emphasizes the need for stronger defense measures.

1. Introduction

Large Language Models (LLMs) have become increasingly powerful and achieved remarkable advancements in natural language processing. Due to their capabilities, they are widely utilized in various areas. For instance, Microsoft utilizes GPT-4 for Bing Search [ 1 ]; OpenAI applies GPT-4 for diferent tasks like text processing, code interpretation, and product recommendations; and LLMs are deployed in interactive contexts with direct engagement like ChatGPT. These broad capabilities of LLMs also raise security concerns that create attack surfaces for malicious purposes. Prompt injection, also known as jailbreak attack, has emerged as the main attack vector to bypass safeguards and elicit harmful content from LLMs. Prompt injection refers to the case when an adversary manipulates the input (prompt) to a language model, forcing it to ignore its guardrails, generate malicious content or misleading the model to accomplish injected tasks. Several studies have examined prompt injection attacks against LLMs, ifnding that these models can be easily misaligned through handcrafted inputs [ 2 ], obfuscation strings, and code injection techniques [ 3 ] that bypass vendor-implemented safeguards.

Text-based prompt injections have become a common topic in both research and malicious purposes, capable of creating jailbreaking prompts that mislead LLMs. Most attacks are crafted using Natural Language (NL) prompt manipulation and semantic techniques to confuse LLMs while maintaining the meaning of prompts. These can be Naive Attack [ 4 ], which concatenates target data with injected instructions, or Cognitive Hacking [ 5 ], which leverages role prompting to create contexts that make LLMs easier to control (e.g., “Do Anything Now,” Developer Mode). While existing countermeasures and detection approaches aim to prevent LLMs from these attacks and detect compromised data, they cannot fully protect models from exploitation [ 6, 7 ]. Recently, with the advancement of SOTA LLMs, security alignment within models has improved, leading to increased development of prompt injection detection models.

However, most existing defense approaches focus on NL prompts, whereas Non-Natural Language (Non-NL) prompts represent another area that can be exploited. As LLMs’ capabilities expand, so does their attack surface has created a new avenue for attackers. Non-NL prompt injection is defined as a text-based prompt injection attack that uses non-textual or structured inputs to influence a language model’s behavior. These attacks focus on unusual text, containing strange characters, encoded text, or gibberish text without meaning.

In this paper, we propose a methodology that addresses the current gaps in evaluating Non-NL prompt injection attacks on LLMs1. This paper conducts a comprehensive evaluation of Non-NL prompts. We have created a dataset of 10 prompts transformed through four Non-NL attack techniques to create 40 jailbreak prompts. These prompts are used to assess the vulnerability of 14 common LLMs. We also introduce potential defense strategies against these attacks, thus providing a comprehensive analysis of Non-NL prompt injection attacks and their corresponding countermeasures. Consequently, our main contributions are as follows: • Design and implement a methodology for assessing the ability of LLMs to handle Non-NL prompt injections • Conduct a comprehensive evaluation with 40 Non-NL prompt injections on 14 LLMs to demonstrate how to assess LLM defense capabilities against such attacks • Propose a set of defense mechanisms against Non-NL prompt injections

2. Related Work

The increasing capabilities of LLMs have led to opportunities for malicious attacks and security violations. Safety training methods for LLMs such as GPT-4 and Claude typically finetune pretrained models using human preferences [8] and AI feedback [9], alongside filtering approaches [ 10]. Researchers have explored LLMs’ susceptibility to adversarial interactions attacks [11]. In this work, we focus on Prompt Injection, which OWASP Top 10 identifies as the highest vulnerability in LLMs [ 12], and examine it from a Non-NL perspective.

2.1. Non-NL Prompt Injection

Non-NL prompt injection attacks involve attacker creating jailbreaking prompts that use non-textual inputs to manipulate LLM behavior. These attacks combine strange characters, encoded text, meaningless strings, or icons to confuse models and force them to generate harmful content. Jones and Zou proposed adversarial attacks using meaningless text generated through gradient-based methods to trigger undesired outputs [13, 14]. Several existing methods use obfuscation schemes to confuse the models. At the character level, these include ROT13 cipher, and base64 encoding. Other approaches attempt to split sensitive words into substrings through payload splitting [ 3 ] or token smuggling [15], or translate content into low-resource languages to confuse the model. In many cases, while the model still follows the injected instruction, its safety measures fail to activate.

2.2. Defenses Against Jailbreak Attacks

There are several methods to counter jailbreak attacks, which fall into three main categories, as discussed below.

Detection-Based Defenses detect potentially harmful content. In [16], the Input Perplexity metric is calculated to identify compromised input. Another approach uses the LLM itself for unsafe detection. While these techniques efectively detect and prevent jailbreak attacks, they struggle when handling benign Non-NL elements within prompts.

Mitigation-Based Defenses aim to prevent LLMs from generating undesired content by mitigating harmful input. Retokenization [17] and Paraphrasing [18] prevent harmful input bypass by identifying prompts with similar meanings and reducing special characters’ impact. Sandwich prevention [19] or 1The code is available here: https://github.com/cyb3rlab/LLMSafeguardEval-NonNL Instructional prevention [20] append or redesign instruction prompts to provide additional context, helping prevent prompt obfuscation. However, these approaches only address specific, narrow cases of jailbreaking prompts.

Built-in Safety Mechanisms are methods integrated inside LLMs by vendors such as Nemo-Guardrails [21] control LLMs through predefined rules. However, these defenses primarily rely on rules and filters. They focus on language semantic techniques and classification-based design, which are limited to natural language prompt injection or constrained by training data. This leads to inefectiveness when handling unsemantic prompt injection (e.g., via visual-based text).

Despite a growing number of Non-NL jailbreak attacks, there are no specific defense mechanisms focused on handling these attacks, which are more challenging than NL prompt injections. While research continues to propose new attack techniques that leverage LLM confusion when faced with unusual text, there remains a significant gap in defense-related research. This highlights the necessity of having more robust defense approaches against Non-NL prompt injection.

3. Overview

Given the current limitations with Non-NL prompt injections in LLMs, we propose a method for testing, and evaluating LLMs’ capabilities when facing prompt injection attacks—particularly those using unusual, non-natural language text. Our method enables diferent prompt injection attacks to combine natural language prompts with various techniques to craft sophisticated jailbreak prompts. This provides valuable insights into the security capabilities of LLMs and defense approaches. An overview of our approach is shown in Figure 1, and its main components are described next.

3.1. Attack Module

The attack module creates Non-NL injected prompts by combining natural language prompts with attack functions. These prompts first test the LLM without defense mechanisms, allowing evaluation of the LLM’s response to Non-NL attacks. The four types of Non-NL attacks, categorized according to the techniques used, are presented below.

Text-Based Obfuscation (base64) attack aims to circumvent LLM guardrails by obscuring instructions through encoding algorithms like ROT13 or Base64, bypassing the safety mechanisms of the models. In the scope of this paper, we use Base64 encoding to craft jailbreak prompts. Visualized-Based Obfuscation (ascii_art) attack creates prompts inherent in a visual perspective. For instance, ASCII characters are used to create harmful words that evade LLMs’ detection systems. Following the proposed method from ArtPrompt [22], we created jailbreak prompts by encoding vulnerability-related words as ASCII art visualizations.

Payload Splitting (payload_split) attack involves instructing the LLM to combine multiple seemingly benign prompts that form harmful instructions when combined. The payload_split attack is implemented to develop injected prompts based on the template from [ 3 ]. Adversarial Sufix (adv_suffix) attack works by finding specific sufixes that, when attached to queries, cause LLMs to produce objectionable content. These sufixes can work with meaningless tokens and use optimization techniques to maximize the probability of afirmative responses instead of refusals. Introduced in [14], this white-box attack produces optimized sufixes that are highly transferable between models—even to black-box systems.

3.2. Evaluation Process

The evaluation stage is used to assess how LLMs respond to Non-NL injected prompts. Specifically, to determine the efectiveness of LLMs’ abilities, we calculate the extent to which injected prompts can bypass their defenses. We classified LLM responses into five categories based on response quality and content safety: Harmful, Unrelated, Unclear, Refusal, and Refusal w/ Reasoning.

If a prompt bypasses LLM security measures, we label the responses into three categories: Harmful, Unclear, or Unrelated. A response receives a Harmful label when it contains harmful information. If the response relates to the prompt without directly generating harmful or consistent information, it is classified as Unclear. All other responses fall under the Unrelated category. Otherwise, if a prompt injection is prevented by the LLM, we use the labels Refusal or Refusal w/ Reasoning, the latter for the case when the LLM provides an explanation for refusing the prompt.

We manually labeled each model output using these criteria. Attack success metrics were used to evaluate LLMs’ defense capabilities against Non-NL attacks. This measures the percentage of injected prompts that successfully bypass the security criteria of LLMs to generate harmful, unrelated, and unclear content. The reliability of the manual labeling procedure can be improved in future work by having researchers evaluate and label independently, then aggregating results through discussion and voting mechanisms.

4. Experimental Evaluation

This paper evaluated the four Non-NL prompt injection attacks described in Section 3.1 against 14 common LLMs to determine their security capabilities against advanced and complex jailbreak prompts. The experiments presented in this paper were conducted from April to June 2025. Currently, our focus is on developing attack modules and testing the ability of LLMs to face these attacks.

4.1. Experiment Setup

LLMs. Our experiment uses 14 LLMs, divided into two groups: commercial and open source. The commercial LLMs include Claude 3.7 Sonnet [23], Gemini 2.0 Flash [24], Gemini 1.5 Flash 8B [25], Gemma 2 9B [26], o4-mini, o3-mini [27], GPT-4.1 [28], ChatGPT-4o [29], GPT-3.5 Turbo [30], and GPT-4 [31]. The open-source LLMs include Llama3-8B-Instruct [32], Llama-2-7b-chat [33], Grok3 [34], and Mistral-7B-Instruct [35]. We selected these models based on our survey of current state-of-the-art LLMs and their popularity. Note that, due to perceived security concerns, we did not include DeepSeek in the tested LLMs.

Testing Prompts. Our evaluation experiments used each of the four attack techniques to craft nonnatural language jailbreak prompts. For this purpose, we selected 10 harmful instructions from the AdvBench dataset [14] as input, and applied the four attack functions to create the 40 injection prompts (see Figure 1). These prompts are delivered to LLMs through API calls for GPT models, and via function calls for Llama and Mistral models. For the other LLMs, including Claude, Gemini, Gemma, and Grok, we used the free access ChatUI to send the prompts. With API access to these models, the interaction process could be fully automated.

4.2. Benchmarking Results

Sonnet, most LLMs were bypassed by these injected prompts. Current LLMs like o3-mini or ChatGPT-4o remain vulnerable to these attacks. Moreover, most models were successfully compromised by the payload split attack, which is one of the more sophisticated attacks in the attack module. t t n n u u o o s s n n o o p p R R 10 8 C 6 e s 4 e 2 0 2 0 10 8 C 6 e s 4 e a l L a l L a l L a l L 10 8 C 6 e According to the results from base64 attacks, Claude 3.7 Sonnet and Gemini 2.0 Flash have base64 decoding capabilities and can prevent base64 jailbreak prompts by issuing harmful content warnings. For GPT models, the security of o3-mini is robust enough to refuse directly. The o4-mini model sometimes produces unrelated but benign responses when it fails to decode strings. Current chat models, including GPT-4.1 and ChatGPT-4o, remain vulnerable to specific jailbreak prompts that can generate harmful content. With open-source LLMs like Llama-3-8B-Instruct, Llama-2-7b-chat and Mistral-7BInstruct that don’t support the decode function, they cannot understand these prompts and generate unrelated responses. However, with Grok3, most attack prompts successfully bypassed the security layer. For ASCII art attacks, most recent LLM models (Claude 3.7 Sonnet, Gemini 2.0 Flash, Gemma 2 9B) can understand these words and refuse to respond, providing explanations for their refusal. Regarding GPT models, the latest o-series models (o4-mini and o3-mini) mostly refuse to answer. However, other GPT models like ChatGPT-4o, GPT-3.5 Turbo, and GPT-4 still generate harmful content with certain prompts. Notably, GPT-4.1, the latest flagship chat model, can be bypassed by all tested jailbreaking prompts, forcing it to generate harmful responses. Among open-source models, the Llama models largely don’t understand these prompts and generate unrelated responses while Grok3 occasionally fails to understand certain prompts. Mistral-7B-Instruct remains susceptible to certain attacks. The complexity of the payload splitting attacks varies from simple string concatenation to recursive payload splitting techniques. The latest Claude 3.7 Sonnet model and o4-mini can recognize and comprehend harmful content in most prompts, enabling them to refuse generating harmful responses. However, other models fail to interpret the prompts and generate harmful, unclear, unrelated responses. As payload splitting techniques grow more sophisticated, they become increasingly likely to bypass LLM security measures. Figure 3 shows an example of this attack and the corresponding response from GPT-4.1. We reused the adversarial tokens trained and optimized in previous research [14] with slightly modification. Although most GPT models have enhanced their defenses and fixed this vulnerability, some injected prompts can bypass protections and force models like o3-mini and GPT-4 to generate harmful responses. Notably, GPT-3.5 Turbo remains vulnerable to most of the injected prompts. For open-source models, only Grok3 successfully prevents this attack by refusing all test prompts. Llama models and Mistral-7B-Instruct remain vulnerable to several injected prompts. This can be attributed to the adversarial strings being trained and optimized using Llama2, which likely transferred to Llama3. These results raise concerns because although these adversarial strings have been public, current LLMs remain vulnerable to attacks. Note that, due to the release of Claude Sonnet 4 and Gemma2’s discontinuation in Google AI Studio, we could not test this attack on either the Claude Sonnet 3.7 or Gemma2 9B models via the free access ChatUI.

5. Proposed Defenses

We will discuss now some potential defense approaches that can defend against these attack, including two stages: Non-NL preprocessing, and prompt injection detection. The Non-NL preprocessing stage converts and sanitizes injected prompts into natural language while extracting any code snippets. The prompt injection detection stage analyzes the preprocessed prompts to identify harmful content. Together, these stages enable the defense module to detect vulnerabilities in the input prompts. The implementation of defense measures and their evaluation is considered as future work. Non-NL Prompt Injection Preprocessing handle injected prompts using base64, ascii_art, payload_split, and adv_suff attacks. These functions focus on sanitizing and preprocessing unusual characters and text within prompts, converting them into natural language prompts the Prompt Injection Detection module can easily process. These are deterministic defense techniques corresponding to each attacks. One can handle base64 attacks by extracting and decoding the injected base64 segment. For ascii_art attack, since vulnerable words are in visual format, OCR (Optical Character Recognition) can convert them into text-based words. For payload_split attacks, which create unstructured prompts, a sandbox solution using external LLMs to retrieve the actual prompt becomes a potential approach. To handle adv_suff attacks, which append gibberish strings to harmful instructions, one can calculate sentence perplexity to identify confusion levels and filtering out strange characters and incoherent text. Table 2 summarizes each defense technique and the associated corresponding attacks.

Prompt Injection Detection refers to NL techniques that can efectively detect harmful prompts. Therefore, existing prompt injection detection models can serve as a solution for detecting injected prompts after the non-natural language prompts have been converted to standard text. There is a key trade-of between AI-based and deterministic defense approaches. As discussed above, we design four targeted defenses, each corresponding to a specific attack based on the properties of each attack. These deterministic methods are designed to solve particular problems and can achieve high accuracy against specific attacks. However, this raises concerns about generalizability. An alternative is the AI-based defense approach, which uses Machine Learning (ML) to identify attack patterns and detect sophisticated attacks with similar properties. While ML-based models may have lower accuracy since they rely on probabilities and factors like datasets and parameters, they ofer better generalizability and Extract the base64 segment and decode it into natural language Use OCR to extract the visual-based harmful content Use an external LLM to ask “What is the actual request?”

Calculate the perplexity of sentences to filter out strange characters and incoherent text can handle new, sophisticated attacks. In contrast, deterministic approaches may struggle with novel attacks but can achieve high performance within their specific domain.

6. Conclusion

In this paper, we presented an approach that includes an attack module to test the security characteristics of LLMs against Non-NL prompt injections. Using this method, we conducted several experiments with current and popular LLMs to evaluate their security capacity.

Our preliminary results show that Non-NL prompt injections can successfully bypass LLM safeguards and force the models to generate harmful content, or confuse them into producing unrelated and unclear responses. Given the average attack success rate ranging from 0.38 to 0.52 across all LLMs, with the highest rate of 0.52 for the payload splitting attack, these findings highlight the dangerous potential of Non-NL prompt injection attacks.

We also discussed potential defense techniques that can handle each type of attack by sanitizing and converting them to natural language prompts, then using a detection model to identify and prevent these attacks. Since this is work in progress, their implementation and evaluation are not included in this paper, but will be conducted as future work. Moreover, a generic defense approach should be considered for further research, along with a comparative analysis between AI-based and deterministic defense approaches.

Declaration on Generative AI

The author(s) have not employed any Generative AI tools. [7] Y. Liu, et al., Formalizing and benchmarking prompt injection attacks and defenses, in: 33rd

USENIX Security Symposium (USENIX Security 24), 2024, pp. 1831–1847. [8] L. Ouyang, et al., Training language models to follow instructions with human feedback, Advances in neural information processing systems 35 (2022) 27730–27744. [9] Y. Bai, et al., Constitutional AI: Harmlessness from AI feedback, arXiv preprint arXiv:2212.08073 (2022). [10] B. Wang, et al., Exploring the limits of domain-adaptive training for detoxifying large-scale language models, Advances in Neural Information Processing Systems 35 (2022) 35811–35824. [11] E. Perez, et al., Red teaming language models with language models, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022. doi:10.18653/v1/2022.emnlp-main.225. [12] OWASP, OWASP top 10 for large language model applications, https://genai.owasp.org/llm-top-10/, 2025. [13] E. Jones, et al., Automatically auditing large language models via discrete optimization, in:

International Conference on Machine Learning, PMLR, 2023, pp. 15307–15329. [14] A. Zou, et al., Universal and transferable adversarial attacks on aligned language models, 2023.

arXiv:2307.15043. [15] Learn Prompting, Obfuscation/token smuggling, https://learnprompting.org/ja/docs/prompt_ hacking/ofensive_measures/obfuscation, 2025. [16] G. Alon, et al., Detecting language model attacks with perplexity, arXiv preprint arXiv:2308.14132 (2023). [17] I. Provilkov, et al., BPE-dropout: Simple and efective subword regularization, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2020, pp. 1882–1892. doi:10.18653/v1/2020.acl-main.170. [18] N. Jain, et al., Baseline defenses for adversarial attacks against aligned language models, arXiv preprint arXiv:2309.00614 (2023). [19] Learn Prompting, Sandwitch defense., https://learnprompting.org/ko/docs/prompt_hacking/ defensive_measures/sandwich_defense, 2024. [20] Learn Prompting, Instruction defense, https://learnprompting.org/ko/docs/prompt_hacking/ defensive_measures/, 2024. [21] T. Rebedea, et al., NeMo guardrails: A toolkit for controllable and safe LLM applications with programmable rails, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Singapore, 2023, pp. 431–445. doi:10.18653/v1/2023.emnlp-demo.40. [22] F. Jiang, et al., ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 15157–15173. [23] Anthropic, Claude 3.7 Sonnet and Claude Code, https://www.anthropic.com/news/ claude-3-7-sonnet, 2025. [24] Google DeepMind, Introducing Gemini 2.0: Our new AI model for the agentic era., 2024. [25] Google DeepMind, Gemini 1.5 flash-8b is now production ready, 2024. [26] G. Team, et al., Gemma 2: Improving open language models at a practical size, arXiv preprint arXiv:2408.00118 (2024). [27] OpenAI, Introducing OpenAI o3 and o4-mini, https://openai.com/index/ introducing-o3-and-o4-mini/, 2025. [28] OpenAI, Introducing GPT-4.1 in the API, https://openai.com/index/gpt-4-1/, 2025. [29] OpenAI, GPT-4o system card, 2024. [30] OpenAI, Introducing APIs for GPT-3.5 Turbo and Whisper, 2024. [31] OpenAI, GPT-4 technical report, 2023. [32] AI@Meta, Llama 3 model card, 2024. URL: https://github.com/meta-llama/llama3/blob/main/

MODEL_CARD.md. [33] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023). [34] xAI, Grok 3 beta — the age of reasoning agents, https://x.ai/news/grok-3, 2024. [35] Mistral AI, Mistral 7b, https://mistral.ai/news/announcing-mistral-7b, 2023.

[1] Microsoft , Bing search, https://www.bing.com/, 2025 .

[2]

Perez , et al., Ignore previous prompt: Attack techniques for language models , 2022 . doi: 10 . 48550/ARXIV.2211.09527.

[3]

Kang , et al., Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks, in: 2024 IEEE Security and Privacy Workshops (SPW) , IEEE, 2024 .

[4]

Simon

Willison , Prompt injection attacks against GPT-3 , https://simonwillison.net/2022/Sep/12/ prompt-injection/, 2022 .

[5]

A. S.

Rao , et al., Tricking LLMs into disobedience: Formalizing, analyzing, and detecting jailbreaks , in: Proceedings of the 2024 Joint International Conference on Computational Linguistics , Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL , Torino , Italia, 2024 , pp. 16802 - 16830 .

[6]

Shen , et al., "Do Anything Now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models , in: Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security , CCS '24, Association for Computing Machinery, New York, NY, USA, 2024 , p. 1671 - 1685 . doi: 10 .1145/3658644.3670388.