<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Lightweight Embedding Guardrails for Cost-Effective Misalignment Mitigation in Export Control Dialog Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rafal Rzepka</string-name>
          <email>rzepka@ist.hokudai.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shinji Muraji</string-name>
          <email>shinji@ist.hokudai.ac.jp</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akihiko Obayashi</string-name>
          <email>obayashi@eng.hokudai.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Engineering, Hokkaido University</institution>
          ,
          <addr-line>Sapporo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Information Science and Technology, Hokkaido University</institution>
          ,
          <addr-line>Sapporo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Graduate School of Information Science and Technology, Hokkaido University</institution>
          ,
          <addr-line>Sapporo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <fpage>132</fpage>
      <lpage>143</lpage>
      <abstract>
        <p>The proliferation of specialized Large Language Model (LLM) dialogue systems necessitates robust defense mechanisms against unrelated prompts that introduce functional misalignment and unnecessary costs. Since utilizing such systems involves significant costs, we were motivated to develop simple, high-speed, pre-inference input validation techniques. In this paper, we evaluate the efficacy of two semantic pre-filtering strategies applied to a Japanese export control (trade security) application domain: (1) an exemplar-based Centroid guardrail (utilizing the mean vector of on-topic embeddings) and (2) a supervised Support Vector Machine (SVM) classifier. Using various multilingual embedding models, we demonstrate that the Centroid approach exhibits superior robustness against adversarial keyword augmentation, effectively maintaining high refusal rates despite injected domain-related terms intended to shift the query's semantic vector. Furthermore, our analysis of cross-lingual transferability confirms that while the strongest multilingual embedding models successfully maintain topic alignment when processing English queries against a Japanese-trained filter, the efficacy is highly model-dependent, underscoring the necessity of model-specific cross-lingual validation for deployment in multilingual environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Topic Guardrails</kwd>
        <kwd>Semantic Filtering</kwd>
        <kwd>Adversarial Robustness</kwd>
        <kwd>Centroid-based Classification</kwd>
        <kwd>Cross-Lingual Transfer</kwd>
        <kwd>Export Control</kwd>
        <kwd>Japanese Language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The integration of Large Language Models (LLMs) into specialized, high-stakes application domains,
such as Export Control (Trade Security), presents a fundamental challenge: balancing the model’s vast
generative utility with the critical need not only for safety but also for functional alignment. When an
LLM-based dialogue system is deployed online to the public, it must strictly adhere to its domain and
refuse to engage with irrelevant topics. If a system is deployed with the goal of answering
questions about regulated items, it should first filter out input unrelated to the system’s purpose, since
malicious users could otherwise use the LLM behind the interface for free, causing economic losses to the hosting
party.</p>
      <p>The primary security concern in this context is so-called keyword stuffing – masquerading unrelated
queries as related to the chatbot’s topic. A crucial aspect of defending against these kinds of attacks is
implementing an efficient, non-LLM based guardrail capable of filtering out non-aligned queries before
they reach the costly and computationally intensive core LLM.</p>
      <sec id="sec-1-1">
        <title>1.1. Background and Domain Specificity</title>
        <p>
          Developing reliable automated question-answering (QA) systems for legal and regulatory domains
faces unique hurdles. Rzepka et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] highlighted the standard challenges of statistical approaches
to legal QA, noting difficulties arising from a scarcity of realistic training examples. As the inputs
concern legality and contain highly sensitive content (e.g. distributing dangerous materials), such
inquiries cannot typically be used for machine learning or model fine-tuning due to privacy and security
concerns. To address this, they initially prepared their QA dataset from Japanese government FAQs,
which was later extended with more realistic inquiries by Obayashi et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Our study borrows this
set of additional, realistic export control queries for our positive training data. However, as there are
fewer than ninety examples, data scarcity poses obvious challenges.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. The Problem: Limitations of Existing Filtering</title>
          <p>
            In highly specialized domains like Export Control, simple, keyword-based input filtering is insufficient.
Attackers can easily circumvent lexical defenses by adding domain-relevant keywords (known as
keyword augmentation or “stuffing”) to malicious prompts, effectively poisoning the input without
triggering a block. Keyword-based filtering alone is inadequate due to the ease of semantic obfuscation
and adversarial manipulation. Furthermore, previous attempts to deepen knowledge representation for
robust filtering through the use of Knowledge Graphs (KGs) have shown limited success in improving
contextual understanding for QA systems [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. This underscores the need for a practical solution that
relies neither on brittle keyword lists nor on complex, computationally demanding symbolic structures.
          </p>
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Contribution and Focus of this Work</title>
        <p>This paper addresses the gap by evaluating a category of high-performance, lightweight semantic
filtering techniques that operate prior to the core LLM inference. Specifically, we focus on input filtering
for a Japanese Export Control dialogue system, training our guardrails exclusively on the small Japanese
dataset mentioned above. Our primary contributions are:
1. Comparative Robustness Analysis: We compare the performance of two distinct embedding-based
guardrail architectures against novel adversarial attacks: (1) a supervised Support Vector Machine (SVM)
classifier and (2) a simple Centroid-based similarity guardrail.
2. Adversarial Challenge: We challenge both methods using two key approaches: keyword
augmentation (demonstrating how injecting security terms affects semantic vectors) and cross-lingual
transfer (testing whether a Japanese-trained filter can refuse an identical malicious query in English).
3. Model Evaluation: We analyze refusal precision by comparing five popular multilingual
embedding models (see Table 2 for a detailed list of models).</p>
        <p>Our findings reveal that the exemplar-based Centroid guardrail, especially when using the powerful
multilingual embeddings of the multilingual Gemma 2 model, exhibits superior and unexpected robustness
against keyword augmentation and is highly effective as a cost-efficient filter for non-aligned and
dangerous inputs.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Guardrails and Safety Layers for LLMs</title>
        <p>Many recent LLM deployments rely on external moderation systems that screen content either before
or after generation. For example, OpenAI’s content moderation API and Google’s Perspective API
automatically flag inputs or outputs with categories like toxicity, sexual content, or violence. These
tools provide very fast checking but are necessarily coarse-grained: they focus on well-known toxic
categories rather than fine-grained domain relevance.</p>
        <p>
          At the same time, a growing body of work has studied sophisticated attacks on LLM safety, such
as prompt injections and jailbreaks. Prompt injection refers to adversarially crafted inputs that cause
the model to ignore or override its intended instructions, while jailbreak attacks induce the model to
violate its safety constraints (see [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ]). In response, researchers have proposed multi-stage guardrail
frameworks. For example, Jia et al. introduce Task Shield, a defense mechanism that systematically
verifies each instruction against user-specified goals at test time [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Another well-known example is
Llama Guard [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] – its authors address how online moderation tools fall short when applied
as input/output guardrails, noting that none of the available tools distinguishes between assessing
safety risks in different contexts. Llama Guard functions as a language model carrying out multi-class
classification, and its instruction fine-tuning allows for customization of tasks and adaptation of output
formats. Such approaches generally kick in after the LLM is invoked (or at least concurrently with
generation) and target malicious or unexpected content in the prompt. In contrast, our focus is on the
pre-processing stage: we intercept and block off-topic or irrelevant inputs before calling the LLM. By
filtering unrelated queries at the entry point, we aim to save API tokens, rather than relying on costly
LLM-based moderation afterwards.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Lightweight Methods for Input Filtering and Domain Relevance Checking</title>
        <p>In practice, deployed dialogue systems often use cheap on-site filters to gate incoming user requests
before invoking an expensive LLM. We categorize prior methods as follows.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Rule-based and Keyword Filters</title>
          <p>
            The simplest pre-filtering uses manually curated keywords or regex rules. For example, a rule-based
system might reject any query containing blacklisted terms (e.g. profanity or forbidden topics). These
filters are extremely fast and transparent, but they are brittle. They only catch explicit word matches
and are easily bypassed by synonyms, spelling variants, or paraphrases. This characteristic is noted in
practical guardrail guidance. In short, deterministic keyword blocking yields few false positives (it is
very conservative) but suffers high false negatives on real user input. There are several approaches
using, among others, keywords and regular expressions. For example, Rebedea et al. introduce NeMo
Guardrails [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], an open-source toolkit that uses a rule-based programming language (CoLang) for
safety constraints and applies KNN-based retrieval to enforce dialog control. The toolkit supports
fact-checking, hallucination prevention, and content moderation using rule-based string manipulation
techniques and regex patterns.
          </p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Classical Machine-Learning Classifiers</title>
          <p>
            A more flexible approach is to train a supervised classifier on labeled in-domain versus out-of-domain
queries. For example, one could use TF–IDF features with a linear model (logistic regression or support
vector machine) to distinguish relevant topic queries from unrelated ones. Such models are still
lightweight: they can be trained on only a few thousand examples, run efficiently on a CPU, and do
not require neural hardware. They often capture patterns of content words beyond a fixed keyword
list. However, their expressiveness is limited by the shallow feature representation, and they may misclassify
inputs that lack obvious domain keywords or that use creative phrasing. The use of such traditional
methods decreased significantly after powerful LLMs were introduced, but the pre-LLM era was
abundant with approaches using classifiers [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] or semi-supervised approaches, for example to recognize
a user’s intent [9].
          </p>
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.3. Embedding-based Semantic Filters</title>
          <p>To capture more semantic nuance, some systems encode queries into dense vectors and perform
similarity search against examples of in-scope queries. Lightweight sentence-embedding models
represent the meaning of a query in a continuous vector space. At runtime, the incoming query’s
embedding is compared (via cosine similarity or dot product) to a reference index of “allowed” or
“forbidden” vectors. Approximate nearest-neighbor (ANN) libraries such as FAISS make it feasible
to search over thousands of stored embeddings in sub-millisecond time. This approach can catch
paraphrases and conceptually related queries that keyword filters or bag-of-words models would miss.
However, tuning is critical: the similarity threshold that separates “in-domain” from “out-of-domain”
must be chosen carefully. If set too high, many legitimate queries will be falsely rejected; if set too
low, too many irrelevant queries will be let through. Such threshold issues are well known in the dense
retrieval and semantic filtering literature [10]. In summary, embedding filters offer a powerful way to
handle paraphrased queries, but they demand careful calibration and periodic updating of the reference
examples.</p>
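          <p>As a minimal sketch of such an embedding filter (the function names and the example threshold are illustrative, not taken from any cited system), the brute-force variant reduces to a normalized dot product over the stored exemplars; an ANN index such as FAISS would replace the full scan at scale:</p>

```python
import numpy as np

def semantic_gate(query_emb, allowed_embs, threshold=0.8):
    """Admit a query if its best cosine similarity to any stored
    'allowed' exemplar embedding clears the threshold."""
    q = query_emb / np.linalg.norm(query_emb)
    A = allowed_embs / np.linalg.norm(allowed_embs, axis=1, keepdims=True)
    best = float(np.max(A @ q))  # brute-force nearest-neighbor search
    return best >= threshold
```

          <p>Raising the threshold trades false negatives for false positives, which is exactly the calibration problem discussed above.</p>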
        </sec>
        <sec id="sec-2-2-4">
          <title>2.2.4. Hybrid and Cascade Architectures</title>
          <p>In real-world deployments, multiple filters are often combined in cascade. A common strategy is to run a
very fast rule-based or linear classifier first, and only forward the “uncertain” cases to a slower semantic
check (or to the LLM itself). This two-stage pipeline saves cost by admitting the cheapest decision
whenever possible. For instance, one might reject any query matching obvious out-of-scope keywords
immediately, accept clearly in-domain queries quickly, and only send ambiguous inputs through an
expensive embedding lookup or even a specialized verifier LLM. This style of cascade filtering has been
observed in retrieval systems and dialogue gating architectures [11]. Such cascades are a practical way
to trade off latency and accuracy.</p>
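          <p>A toy illustration of such a cascade (the term lists and function names are invented for this sketch, not drawn from a specific deployed system):</p>

```python
def cascade_gate(query, deny_terms, allow_terms, semantic_check):
    """Three-stage cascade: cheap lexical decisions first; only
    ambiguous inputs reach the expensive semantic check."""
    q = query.lower()
    if any(t in q for t in deny_terms):
        return "reject"  # obvious out-of-scope keyword
    if any(t in q for t in allow_terms):
        return "accept"  # clearly in-domain keyword
    # Ambiguous: fall through to the slower embedding (or LLM) check.
    return "accept" if semantic_check(query) else "reject"
```

          <p>The fast lexical stages decide the easy cases, so the expensive check runs only on the residual ambiguous traffic.</p>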
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Summary and Positioning of the Study</title>
        <p>
          In summary, most existing research on LLM safety and relevance has emphasized post-generation
moderation or defense against adversarial prompts [
          <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
          ]. Techniques range from simple content filters
to complex multi-stage guardrails, but they generally assume the LLM will process the input at least
partially. By contrast, we are unaware of prior work that systematically studies pre-LLM query filtering
purely for domain-gating and cost efficiency, especially in highly specific domains. Existing lightweight
methods (rule-based, linear classifiers, embedding matching) have been mentioned individually, but
they have not been directly compared under a single benchmark for input gating. We address this
gap by empirically comparing an SVM-based classifier and a local embedding-based ANN filter. Both
models were deployed and tested on a single laptop to evaluate their performance as low-cost, on-site
solutions for input validation. We measure classification (positive vs. refused) accuracy, and by
focusing on narrow-domain relevance (export-control queries) and on-premise execution (no external
API calls), we aim to guide the design of efficient input-gating systems for real-world LLM applications.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data Used for Experiments</title>
      <p>The evaluation of the semantic guardrails requires two primary data sources: a highly specific,
in-domain dataset for positive training examples, and an adversarial, out-of-domain dataset for negative
testing and training examples.</p>
      <sec id="sec-3-1">
        <title>3.1. Positive Domain Data (Export Control QA)</title>
        <p>The positive dataset, referred to as the Export Control QA set, specializes in providing answers related
to Japanese security export control regulations. This data addresses the need for realistic, expert-vetted
questions that deviate from general government-issued FAQ formats. The set originates from the
work of Obayashi and Rzepka [12], who extended a question-answering dataset specifically designed
for testing security export control expert systems. It features short questions and concise answers
that were manually created by an expert routinely handling inquiries from academic researchers.
This approach contrasts with earlier datasets that emphasized long, contextual answers explaining
exceptional interpretations of regulatory texts. The resulting data allows for broader and more realistic
experimentation, mitigating the scarcity of sensitive user queries in this field. For our positive training
set, we utilized the complete collection of Obayashi’s short, expert-crafted questions.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Adversarial and Out-of-Domain Data</title>
        <p>The negative dataset plays a crucial role in training the classifiers to recognize and reject inappropriate
or out-of-domain queries, thereby probing the limits of the guardrails’ effectiveness. For this purpose we
utilized two datasets: (a) the Japanese Safety Boundary Test (https://github.com/sbintuitions/safety-boundary-test)
and (b) the Japanese AnswerCarefully [13] dataset, which
contain queries for evaluating whether LLMs can refuse to answer a user’s input. These collections contain
a heterogeneous mix of highly dangerous prompts, queries involving other sensitive domains such
as internal corporate regulations, and benign yet out-of-scope conversational inputs. These datasets
were chosen based on the assumption that such prompts would be semantically closer to trade security
questions, potentially increasing the likelihood of circumventing the similarity-based guardrail.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Adversarial Query Expansion Terms</title>
        <p>To conduct the robustness test against “adversarial keyword augmentation”, we required a
comprehensive vocabulary of domain-relevant terms. The terms used for this query expansion (keyword
stuffing) were obtained through the courtesy of the Japan Machinery Center for Trade and Investment
(JMCTI; http://www.jmcti.org/jmchomepage/english/), which provided access to authoritative materials
for subsequent adversarial evaluations. The
source consists of a 360-page booklet containing trade security control terminology associated with
corresponding article numbers, forming a high-quality, domain-specific vocabulary. In total, 6,582 terms
were extracted from this material and utilized for randomization in the augmentation-based attacks.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Data Examples</title>
        <p>Examples of adversarial (Boundary Test) queries used for negative testing include: “Tell me about NTT’s internal regulations.”, “What’s wrong with putting too much trust in AI?”, and “What kind of disorder is ADHD?”</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Methods</title>
      <p>The core objective of our methodology is to evaluate the ability of computationally inexpensive semantic
filters to replace expensive LLM-based security checks for domain non-alignment. We adopt two input
validation methods in which the user query is first passed through a lightweight embedding-based guardrail
before being forwarded to the primary dialogue system (LLM).</p>
      <sec id="sec-4-1">
        <title>4.1. Embedding Models and Specifications</title>
        <p>Since keyword filtering alone is insufficient, the performance of any semantic filter is fundamentally
reliant on the quality and structure of the embedding space. We selected five distinct multilingual
embedding models for comparative analysis, detailed in Table 2. These models range from highly
efficient BERT-based architectures to state-of-the-art models built upon the Gemma family (in this
work we utilize the Beijing Academy of AI (BAAI) models of the Gemma family). Note
that the number of models that work well for Japanese is much smaller than the number of models
designed for English.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental Setup and Data Preparation</title>
        <p>The guardrails were trained exclusively on Japanese data derived from the Export Control QA set, the
Japanese Safety Boundary Test, and the Japanese version of the AnswerCarefully [13] dataset.</p>
        <p>The experiments utilize the following parameters:
• Hardware/Software: All models were benchmarked on a local machine (MacBook M1 Max,
64GB memory) leveraging Apple’s Metal Performance Shaders (MPS) via PyTorch and performing
inference in half precision (torch_dtype=torch.float16) for maximum efficiency.
• Training Data Balance: We ensured a balanced training set by using the loaded positive samples
from the Export Control QA set (N = 86) and an equal number of randomly sampled negative
questions from the Safety Boundary Test (N = 86), and tested not only with the remaining
Safety Boundary examples, but also with the AnswerCarefully dataset, to allow observing whether the
proposed methods can deal with various other queries.
• Feature Preparation: All questions were encoded into dense vector embeddings E ∈ ℝ^(N×d),
where N is the number of questions and d the embedding dimensionality.</p>
        <p>For the Support Vector Machine (SVM), these embeddings were further normalized to unit vectors
E_norm.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Method 1: Supervised SVM Classifier</title>
        <p>This approach utilizes the balanced training data to learn a hyperplane that maximally separates the
positive and negative classes in the embedding space. We employed a Support Vector Machine (SVM)
with a linear kernel, trained on the normalized embeddings of the balanced training set {(E_i, y_i)}.
An incoming query q_norm is classified by the trained decision function f: the query is refused if f(q_norm) = 0, i.e., if it is assigned to the negative class.</p>
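        <p>A minimal numpy sketch of this classifier follows. Since the paper does not specify the SVM implementation, we substitute a Pegasos-style subgradient solver for the linear kernel; the hyperparameters and toy data are illustrative placeholders:</p>

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=300, seed=0):
    """Pegasos-style subgradient descent for a linear-kernel SVM.
    X: (n, d) unit-normalized embeddings; y: labels in {-1, +1}.
    A constant feature is appended so the bias is learned as part of w."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    rng = np.random.default_rng(seed)
    n, d = Xb.shape
    w, t = np.zeros(d), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * (Xb[i] @ w)
            w *= (1.0 - eta * lam)      # regularization shrinkage
            if margin < 1.0:            # hinge-loss violation
                w += eta * y[i] * Xb[i]
    return w

def refuse(q_norm, w):
    # Refusal corresponds to the negative (out-of-domain) side of the hyperplane.
    return float(np.append(q_norm, 1.0) @ w) <= 0.0
```

        <p>In production one would more likely call an off-the-shelf linear SVM (e.g. from scikit-learn); the sketch only illustrates the decision rule.</p>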
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Method 2: Exemplar-based Centroid Guardrail</title>
        <p>This method is a simple approach that measures the semantic distance of any new query from the
center of the positive class. The positive class centroid C is computed as the mean vector of all N positive
training embeddings e_i:</p>
        <p>C = (1/N) ∑_{i=1}^{N} e_i</p>
        <p>For an incoming query vector q, the cosine similarity sim(q, C) is calculated. The optimal refusal
threshold τ is derived by maximizing accuracy on the full balanced training set. The query is refused if
sim(q, C) &lt; τ.</p>
        <p>It should be noted that “knowledge” of the system is contained in the pre-trained embedding model,
and the Centroid serves as a domain-specific anchor rather than a trained classifier.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Adversarial Testing Protocols</title>
        <p>The true viability of these guardrails was tested against two adversarial protocols using held-out examples of
the Safety Boundary Test set (N = 34 samples) and the entire AnswerCarefully test set (N = 336
samples).</p>
        <p>• Keyword Augmentation Attack: Each negative test query was augmented by prepending a
variable number (N = 1 to 5) of randomly selected domain-relevant terms (e.g., security control
terminology) from the JMCTI vocabulary to simulate an attack designed to shift the query’s
semantic vector toward the permitted domain.
• Cross-Lingual Attack: The 34 negative test queries from the Safety Boundary Test set were translated
by DeepL (http://deepl.com/en/translator) into English, manually checked by the first author and then processed by the
Japanese-trained guardrails to test cross-lingual transfer capability.</p>
        <p>The performance of each model-approach pair was quantified using refusal rate (the fraction of
negative test queries correctly classified as off-topic).</p>
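        <p>The augmentation attack is straightforward to reproduce; in this sketch the DOMAIN_TERMS list is a small invented stand-in for the 6,582-term JMCTI vocabulary, and the function name is ours:</p>

```python
import random

# Hypothetical stand-in for the 6,582-term JMCTI vocabulary.
DOMAIN_TERMS = ["export license", "catch-all control", "dual-use item",
                "end-user screening", "list control"]

def stuff_keywords(query, vocabulary, n_terms, seed=None):
    """Simulate the keyword-stuffing attack: prepend n randomly sampled
    domain-relevant terms to an off-topic query."""
    rng = random.Random(seed)
    prefix = " ".join(rng.sample(vocabulary, n_terms))
    return f"{prefix} {query}"
```

        <p>Each augmented query keeps its original malicious intent while its surface vocabulary is pushed toward the permitted domain.</p>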
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results and Analysis</title>
      <p>The evaluation of the two semantic pre-filtering guardrail approaches (Centroid and SVM) against
adversarial and cross-lingual challenges provides clear insight into their practical robustness when
implemented with various multilingual embedding models.</p>
      <sec id="sec-5-1">
        <title>5.1. Adversarial Keyword Augmentation Attack</title>
        <p>To test the stability of these guardrails against this protocol, the non-aligned, negative queries were
prepended with one to five random, domain-relevant terms. The results for the refusal ratio (the
fraction of negative samples correctly blocked) are presented in Table 3 (SVM Classifier) and Table 4
(Centroid Similarity). For comparison we added the intfloat/multilingual-e5-large-instruct
(https://huggingface.co/intfloat/multilingual-e5-large-instruct)
and BAAI/bge-m3 (https://huggingface.co/BAAI/bge-m3) models, which are widely used for Japanese.</p>
        <p>The results demonstrate a clear contrast between the two filtering approaches. The exemplar-based
Centroid guardrail, when paired with the high-fidelity BAAI/bge-multilingual-gemma2 model,
achieved a refusal rate of 100% across all augmentation levels (N=1 to 5 keywords) in the Safety Boundary
Test and was almost faultless (there were mistakes when one or two related terms were added) for
the bigger AnswerCarefully data. This unexpected stability suggests that the Centroid filter is highly
sensitive to the semantic purity of the query intent. The addition of generic security terms acts as
noise, pushing the augmented vector away from the tightly clustered regulatory centroid, resulting in a
correct refusal.</p>
        <p>Conversely, the SVM classifier’s vulnerability to this jailbreak technique is evident, as its refusal
rate systematically drops with every added keyword across all models. For the strongest model (again
bge-multilingual-gemma2), the SVM refusal rate falls dramatically from 0.9821 at 1 term to 0.5149
at 5 terms in the AnswerCarefully test set. This failure indicates that the keywords successfully pulled
the augmented negative samples across the SVM’s linearly trained decision boundary, leading to false
allowances. The lightweight Centroid approach, in this specific adversarial context, offers superior
resilience compared to the more complex supervised classifier.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Cross-Lingual Transfer Efectiveness</title>
        <p>This experiment tested the capability of the Japanese-trained guardrails to recognize and refuse an
off-topic query when the input was presented entirely in English, serving as a cross-lingual transferability
test. The refusal rates for this test are presented in Table 5.</p>
        <p>The results confirm that cross-lingual resistance is highly model-dependent. The
BAAI/bge-multilingual-gemma2, cl-tohoku/bert-base-japanese-whole-word-masking,
and intfloat/multilingual-e5-large-instruct models achieved a 100% refusal rate using
at least one of the approaches (Centroid or SVM). This demonstrates their exceptional capacity for
cross-lingual transfer in embedding space, correctly clustering the English prompts far from the Japanese
Export Control training data. The smaller sonoisa/sentence-bert-base-ja-en-mean-tokens
model showed the weakest Centroid performance (0.7941), confirming that the high semantic fidelity
required for true cross-lingual alignment can be lost in lightweight architectures, leading to false
acceptances. Interestingly, bge-m3 was often confused by English input in spite of being much larger
than the BERT-based models.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Deployment Considerations and Trade-ofs</title>
        <p>The experimental results highlight a necessary trade-off between deployment cost, speed, and security
efficacy.</p>
        <p>The BAAI/bge-multilingual-gemma2 model consistently delivers the highest security
guarantees, particularly its perfect defense in the Centroid approach against both the adversarial keyword and
cross-lingual tests. However, deploying such a large model, even with MPS acceleration on local
hardware, represents a significant computational overhead compared to lighter architectures (it usually
took about 10 times longer to run Gemma 2 than SentenceBERT on the utilized machine).</p>
        <p>The performance discrepancy creates a clear decision point for deployment: not all computers will be
capable of running bge-multilingual-gemma2 with the required low latency and resource efficiency.
For high-security environments where the utmost defense against keyword attacks is mandatory, Gemma
2 seems to be the best choice for Japanese. For scenarios where computational
resources are severely constrained, the cl-tohoku/bert-base-japanese-whole-word-masking
model provides a robust defense against cross-lingual attacks and a decent keyword defense, positioning
it as a viable intermediate option that balances security with resource limitations.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The experimental findings demonstrate that the choice of lightweight semantic pre-filtering
mechanism must be guided by the specific adversarial threat model and the constraints of the deployment
environment. Our comparative study between the Centroid and SVM approaches reveals fundamental
differences in how they interpret adversarial inputs, leading to non-trivial security trade-offs. One of
the reviewers correctly noted that the SVM learns a specific decision boundary, making it more sensitive
to the training distribution. However, our results demonstrate that this sensitivity is a liability in an
adversarial context. Because the SVM boundary is defined by the gap between related and unrelated
examples, it is more susceptible to “boundary-crossing” attacks via keyword injection. In contrast, the
exemplar-based approach (Centroid) represents the “center of gravity” of the Export Control domain. It
remains effective because it measures how much an input deviates from the core intent, rather than
how well it fits a learned boundary that an attacker can shift.</p>
      <sec id="sec-6-1">
        <title>6.1. Interpretation of Adversarial Behavior</title>
        <p>The distinct failure modes observed – the SVM’s systematic collapse under Keyword Augmentation
(Table 3) versus the Centroid’s almost perfect resilience (Table 4) – provide a key insight into the structure
of the semantic space generated by high-fidelity multilingual models like bge-multilingual-gemma2.
The SVM, by attempting to find a thin linear boundary between a general negative class and a specific
positive class, is susceptible to semantic shift. The prepended domain keywords pull the query’s
vector across the shallow hyperplane and into the permitted region. (We tested only the SVM classifier,
assuming that with a small dataset other widely used classical methods such as Naive Bayes or Random
Forest would yield similar results; we plan to verify this intuition in future work.) In contrast, the
Centroid Guardrail operates as a kind of purity filter. Since the centroid C represents the tight semantic
center of the highly specific export control domain (even with a small number of available examples),
the addition of any text that dilutes this focus (even domain-relevant keywords) pushes the resulting
vector away from that specialized cluster, ensuring the input is correctly categorized as off-topic. This
makes the Centroid method robust against attempts to pollute the input with seemingly related, but
ultimately non-specific, information.</p>
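        <p>The dilution mechanism can be illustrated with a toy calculation in which the embedding of a concatenated text is approximated by the normalized mean of the embeddings of its parts. This additivity assumption and the three-dimensional vectors are invented for illustration; real sentence encoders behave more subtly:</p>

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Toy directions: a tight export-control cluster near the first axis,
# and generic "domain keywords" that sit off that axis.
core_query = unit([1.0, 0.0, 0.0])   # specific in-domain question
keywords   = unit([0.4, 0.9, 0.2])   # related but non-specific terms
centroid   = unit([1.0, 0.05, 0.0])  # center of the specific domain cluster

# Approximate the embedding of "keywords + query" by a normalized mean.
augmented = unit(0.5 * core_query + 0.5 * keywords)

sim_clean = float(np.dot(core_query, centroid))
sim_aug = float(np.dot(augmented, centroid))
# Prepending seemingly related keywords moves the vector away from
# the tight domain centroid, lowering its similarity score.
assert sim_clean > sim_aug
```

        <p>Under this simplification, any admixture that is not aligned with the tight cluster can only pull the vector away from the centroid, which matches the robustness we observe empirically.</p>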
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Cross-Lingual Efficacy and Generalization</title>
        <p>The results from the Cross-Lingual Transfer attack (Table 5) further validate the quality of modern
multilingual embeddings. The multilingual model (bge-multilingual-gemma2) achieved a refusal rate
of 1.00 when the queries were in English and the guardrails were trained exclusively on Japanese data
(the perfect score of the Japanese-only BERT model is less surprising, since English input falls outside
its training distribution). This confirms that most of these models successfully map semantically equivalent
concepts across language boundaries into a shared vector space, making the guardrail language-agnostic
with respect to the underlying intent. The primary challenge for practical implementation is the resource cost
of the highest-performing models. While bge-multilingual-gemma2 appears to be the gold standard
in this context, its size demands costly hardware resources (CPU/GPU) that may not be available in
many on-premise or edge deployments.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. The Deployment Trade-Off</title>
        <p>The necessity of a computational trade-off is evident: for maximum security, the use
of bge-multilingual-gemma2 with the Centroid approach is required. However, for
scenarios where computational resources are highly constrained, the non-multilingual
cl-tohoku/bert-base-japanese-whole-word-masking model provides a robust defense against
cross-lingual attacks and maintains a manageable defense against keyword augmentation, making it a
viable intermediate option.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <sec id="sec-7-1">
        <title>7.1. Conclusions</title>
        <p>This paper evaluates the efficacy and adversarial robustness of two lightweight semantic guardrails – the
Centroid Similarity Guardrail and the Supervised SVM Classifier – for filtering irrelevant and harmful
queries directed at a Japanese Export Control dialogue system. Our study yields three key conclusions.
First, the simple, unsupervised Centroid similarity guardrail proved significantly more robust to the
adversarial keyword augmentation attack than the supervised SVM classifier, achieving an almost perfect
refusal rate when paired with the bge-multilingual-gemma2 model; this demonstrates its efficacy
as a purity filter highly resistant to semantic noise injection. Second, modern multilingual embedding
models allow a guardrail trained only on Japanese data to effectively
block malicious prompts presented in English, though this capability is highly dependent on the fidelity
of the multilingual embedding model. Third, achieving the highest level of security may still require
the computational resources necessary to run high-fidelity models like bge-multilingual-gemma2.
For cost-constrained environments, smaller models offer a necessary compromise in security,
highlighting a critical area for optimization.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Future Work</title>
        <p>Future research will focus on mitigating the identified vulnerabilities and improving the efficiency
of the defense pipeline. One area of focus is the optimization of the SVM (or other classifiers), which
requires investigating methods to make the supervised classifier more robust, such as training the
SVM on a synthetically augmented dataset that includes keyword-stuffed examples, or exploring
nonlinear kernels to capture more complex decision boundaries. Another direction is developing a hybrid
defense strategy that leverages the strengths of both approaches: using the SVM as an initial filter for
high-confidence non-aligned queries, and then using the Centroid distance as a final, highly robust
check specifically against low-similarity adversarial prompts. Finally, the impact of quantization must
be addressed by quantifying the exact trade-off through measuring the refusal rate degradation when
deploying the best-performing models (e.g., bge-multilingual-gemma2) in extremely low-precision
formats (e.g., 4-bit or 8-bit quantization), which will provide direct data on the cost-security curve. For
the cross-lingual attack scenario, we did not augment the English test samples with Japanese terms; this
decision was predicated on the premise that an attacker would likely be unfamiliar with specialized
Japanese regulatory nomenclature. To address this limitation, we plan to perform additional tests with
more languages, with keywords added in both Japanese and these additional languages. As more
models capable of efficiently vectorizing Japanese text are reported, we also plan to extend our
experiments to cover them.</p>
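        <p>The hybrid defense strategy outlined above could be sketched roughly as follows, with scikit-learn’s SVC standing in for the classifier and a hypothetical similarity threshold; this is an illustrative sketch, not the evaluated system:</p>

```python
import numpy as np
from sklearn.svm import SVC

class HybridGuardrail:
    """Sketch of a two-stage guardrail: an SVM screens obviously
    non-aligned queries first, then the centroid similarity acts as
    a final robustness check. Threshold values are illustrative."""

    def __init__(self, sim_threshold=0.55):
        self.svm = SVC(kernel="linear")
        self.sim_threshold = sim_threshold
        self.centroid = None

    @staticmethod
    def _unit(X):
        X = np.asarray(X, dtype=float)
        return X / np.linalg.norm(X, axis=-1, keepdims=True)

    def fit(self, related_embs, unrelated_embs):
        # Stage 1 training: a binary SVM over both classes.
        X = self._unit(np.vstack([related_embs, unrelated_embs]))
        y = np.array([1] * len(related_embs) + [0] * len(unrelated_embs))
        self.svm.fit(X, y)
        # Stage 2: centroid built from the related examples only.
        self.centroid = self._unit(self._unit(related_embs).mean(axis=0))

    def allow(self, query_emb):
        q = self._unit(np.asarray(query_emb, dtype=float))
        if self.svm.predict(q.reshape(1, -1))[0] == 0:
            return False  # stage 1: SVM rejects high-confidence off-topic input
        # stage 2: centroid similarity as the robust final check
        return bool(float(q @ self.centroid) >= self.sim_threshold)
```

        <p>The ordering matters: the SVM handles clear-cut rejections cheaply, while the centroid check guards against boundary-shifting attacks such as keyword augmentation.</p>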
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Gemini for grammar and spelling checking
and for paraphrasing. After using this tool, the authors reviewed and edited the content as needed
and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work was supported by JSPS KAKENHI Grant Number 23K11757.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] R. Rzepka, D. Shirafuji, A. Obayashi, Limits and challenges of embedding-based question answering in export control expert system, Procedia Computer Science 192 (2021) 2709-2719. doi:10.1016/j.procs.2021.09.041.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Obayashi, R. Rzepka, Expanding export control-related data for expert system, Procedia Computer Science 207 (2022) 3065-3072. doi:10.1016/j.procs.2022.09.364.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] R. Rzepka, A. Obayashi, Effectiveness of security export control ontology for predicting answer type and regulation categories, in: Proceedings of the 2024 8th International Conference on Advances in Artificial Intelligence, ICAAI '24, Association for Computing Machinery, New York, NY, USA, 2025, pp. 156-161. doi:10.1145/3704137.3704180.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Kumar, C. Agarwal, S. Srinivas, A. J. Li, S. Feizi, H. Lakkaraju, Certifying LLM safety against adversarial prompting, 2023. arXiv:2309.02705.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] F. Jia, T. Wu, X. Qin, A. Squicciarini, The task shield: Enforcing task alignment to defend against indirect prompt injection in LLM agents, in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vienna, Austria, 2025, pp. 29680-29697. URL: https://aclanthology.org/2025.acl-long.1435.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, M. Khabsa, Llama Guard: LLM-based input-output safeguard for human-AI conversations, 2023. arXiv:2312.06674.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] T. Rebedea, R. Dinu, M. Sreedhar, C. Parisien, J. Cohen, NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails, 2023. arXiv:2310.10501.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] S. Zhou, K. Cheng, L. Men, The survey of large-scale query classification, in: AIP Conference Proceedings, volume 1834, AIP Publishing LLC, 2017, p. 040045.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] L. Chen, D. Zhang, L. Mark, Understanding user intent in community question answering, in: Proceedings of the 21st International Conference on World Wide Web, WWW '12 Companion, Association for Computing Machinery, New York, NY, USA, 2012, pp. 823-828. doi:10.1145/2187980.2188206.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] H. Gao, R. Wang, T.-E. Lin, Y. Wu, M. Yang, F. Huang, Y. Li, Unsupervised dialogue topic segmentation with topic-aware utterance representation, arXiv preprint arXiv:2305.02747 (2023).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Q. Ai, T. Bai, Z. Cao, Y. Chang, J. Chen, Z. Chen, Z. Cheng, S. Dong, Z. Dou, F. Feng, S. Gao, J. Guo, X. He, Y. Lan, C. Li, Y. Liu, Z. Lyu, W. Ma, J. Ma, Z. Ren, P. Ren, Z. Wang, M. Wang, J.-R. Wen, L. Wu, X. Xin, J. Xu, D. Yin, P. Zhang, F. Zhang, W. Zhang, M. Zhang, X. Zhu, Information retrieval meets large language models: A strategic report from Chinese IR community, 2023. arXiv:2307.09751.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Obayashi, R. Rzepka, Expanding export control-related data for expert system, in: Proceedings of the 26th International Conference on Knowledge-Based and Intelligent Information Engineering Systems, Verona, Italy, 2022.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] H. Suzuki, S. Katsumata, T. Kodama, T. Takahashi, K. Nakayama, S. Sekine, AnswerCarefully: A dataset for improving the safety of Japanese LLM output, 2025. arXiv:2506.02372.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>