<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scalable, Context-Aware NLP Moderation for Child Safety: A Multi-Agent Ethical and Legal Compliance Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Fillies</string-name>
          <email>jan.fillies@fu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Theodoros Mitsikas</string-name>
          <email>mitsikas@central.ntua.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralph Schäfermeier</string-name>
          <email>ralph.schaefermeier@imise.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Paschke</string-name>
          <email>adrian.paschke@fu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer FOKUS</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Freie Universität Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Applied Informatics</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Leipzig University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>National Technical University of Athens</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Protecting children from harmful online content requires systems that are accurate, adaptive, and legally compliant across jurisdictions. This paper presents a hybrid, rule-based, multi-agent moderation architecture designed to detect and mitigate toxic speech in real time while ensuring adherence to diverse legal and ethical standards. The system employs large language models, including Google Gemini, GPT-4o-nano, and GPT-4o, to classify user messages according to a detailed hate speech taxonomy. In addition to use-case-specific ethical rules, the approach dynamically identifies the applicable legal frameworks (e.g., COPPA, GDPR, DSA) based on the participants' country of origin and uses LLM-driven agents to generate relevant legal obligations as executable rules in the Prolog rule language. This forms the basis for a legal and ethical reasoning agent. Moderation decisions are thus context-sensitive, policy-aligned, and legally grounded. System performance was evaluated on a human-annotated dataset of illegal hate speech, demonstrating its effectiveness in identifying content that violates legal definitions. By integrating unsupervised classification with symbolic rule-based reasoning, the system offers a scalable, reliable solution for protecting children and others in online communication environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Symbolic and Sub-Symbolic AI</kwd>
        <kwd>Multi-Agent System</kwd>
        <kwd>Content Moderation</kwd>
        <kwd>Hate Speech Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Children increasingly inhabit dynamic digital spaces, such as social media, gaming chats, and educational
forums, that require complex, real-time content moderation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Moderation must go beyond detecting
toxic language to address overlapping legal frameworks, cultural sensitivities, and ethical obligations,
all while safeguarding privacy and platform autonomy.
      </p>
      <p>Conventional methods, reliant on static filters or opaque models, lack the transparency, adaptability,
and legal interpretability needed to enforce nuanced laws like COPPA, GDPR, or the Digital Services Act.
Legal definitions, such as that of “illegal hate speech”, vary by jurisdiction, creating a need for systems
that adapt to linguistic and regulatory diversity. Platform owners must be able to select applicable legal
regimes, but translating these into formal, actionable rules remains a significant challenge.</p>
      <p>This paper presents a hybrid AI architecture that integrates large language models (LLMs) with
symbolic reasoning to enable compliant, real-time moderation of children’s online interactions. The
system uses LLMs, including Google Gemini and GPT-4o, paired with a detailed taxonomy of harmful
content to support accurate classification. These classifications are mapped to obligations from relevant
legal frameworks. Through a novel Rule Generation API, LLM agents generate symbolic inputs that
follow Prolog syntax, powering a legal reasoning agent implemented in Prova (a rule language that
combines Prolog with Java) [<xref ref-type="bibr" rid="ref12">12</xref>], which assesses content in context and provides justifiable moderation
outcomes. Evaluation on a human-annotated dataset of illegal hate speech shows the system’s ability
to produce accurate and explainable rule-based decisions.</p>
      <p>The key contributions of this work are:</p>
      <list list-type="order">
        <list-item><p>a multi-agent moderation architecture that integrates LLM-based content evaluation with symbolic legal reasoning;</p></list-item>
        <list-item><p>an automated rule generation mechanism that translates legal obligations into Prolog clauses using LLM agents;</p></list-item>
        <list-item><p>an empirical validation of the system's ability to detect illegal hate speech using real-world annotated data.</p></list-item>
      </list>
      <p>The prototype is available on Discord<sup>1</sup> and the code can be found on GitHub<sup>2</sup>.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Research</title>
      <p>
        Content moderation has emerged as a critical concern in the era of digital communication, particularly
for safeguarding vulnerable populations such as children. Traditional moderation techniques, often
reliant on keyword filters or simple machine learning classifiers, have been criticized for their lack
of contextual understanding and legal interpretability [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Recent advances in natural language
processing (NLP), particularly transformer-based language models like BERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and GPT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], have
significantly improved the detection of toxic and hateful content. However, their black-box nature
poses challenges for transparency, auditability, and regulatory compliance—key requirements in child
safety contexts.
      </p>
      <p>
        Efforts to introduce rule-based reasoning into moderation systems have gained momentum as a way
to address these limitations. Hybrid approaches that combine neural models with symbolic reasoning
have shown promise in increasing both interpretability and accuracy [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. In particular, multi-agent
systems have been used to simulate human-like decision-making in complex environments, making
them well-suited for context-sensitive moderation tasks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        From a legal perspective, compliance with child protection laws such as COPPA in the United States,
GDPR in the EU, and more recently, the Digital Services Act (DSA), imposes stringent obligations on
platform operators. These legal instruments require proactive measures to detect and remove illegal
content while ensuring due process and privacy protection [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Yet, most current moderation systems
do not support customizable legal reasoning, leaving a gap in enforceable compliance mechanisms.
      </p>
      <p>
        Recent research has explored automated rule generation using NLP techniques to encode policy and
legal standards as executable logic [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], but integration with live moderation pipelines remains limited. In
this context, recent work by Fillies et al. integrates legal and ethical reasoning into moderation pipelines
using multi-agent architectures [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Their GDPR-compliant chat application restricts harmful content
when minors are present and generates personalized counter speech. However, it does not allow for
dynamic alignment of moderation actions with platform-specific legal frameworks. The proposed approach
extends this line of work by enabling automated legal rule generation, LLM-based classification, and
real-time rule evaluation within a multi-agent architecture.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. System Design</title>
      <p>The system (see Figure 1) enables dynamic, autonomous, rule-based moderation of chat content using
LLM-based agents that detect, generate, and apply formalized legal and ethical rules. It consists of two
core processes: one for clause generation and another for content moderation. Each process leverages
semantic rule technologies and agent-based reasoning to ensure compliance with platform-specific
policies. A top-level agent dynamically determines whether new facts need to be generated or whether
existing facts sufficiently cover the countries of origin of the users sending messages on the chat platform.</p>
      <p><sup>1</sup>https://discord.gg/9fSZJZSd</p>
      <p><sup>2</sup>https://github.com/fillies/GuardianAgents/tree/main</p>
      <sec id="sec-3-1">
        <title>3.1. Rule Generation Flow</title>
        <p>The Rule Generation Process encodes the platform owner's moderation policies into machine-readable
rules using Prova, a rule-based semantic scripting language designed for agent environments [<xref ref-type="bibr" rid="ref12">12</xref>].
Prova was chosen for its logical inference agility, its event-driven character, and its seamless
integration with Java, which enables direct invocation of Java methods and facilitates interaction
with external systems and libraries; together, these properties make it well suited to the moderation use case.</p>
        <p>Whenever a user from a previously unrepresented region writes in the moderated chat platform,
the normative framework—such as applicable legal, ethical, or community guidelines—is selected by
an agent and translated into executable Prova facts. The resulting rules are semantically enriched
and hierarchically organized, allowing for modular reuse and logical inference. A Legal/Ethical agent
validates and applies these formalized rules, exposing them for downstream use and ensuring traceability
and adaptability in dynamic regulatory contexts.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Moderation Flow</title>
        <p>The Moderation Process applies the generated clauses in real time to evaluate user-generated content.
Incoming messages from a chat platform are first processed by an LLM and agent-based content
evaluation module, which categorizes content using a large and granular predefined hate taxonomy.
This classification is then passed to a reasoning engine that applies the Prova rules to determine whether
the selected content adheres to platform policies. The output is a moderation decision, which is both
explainable and enforceable. This architecture supports compliance, transparency, and scalability in
automated content governance.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Prototype</title>
      <sec id="sec-4-1">
        <title>4.1. Agent Setup</title>
        <sec id="sec-4-1-1">
          <p>The prototype is implemented using the graphical AI agent workflow builder Langflow<sup>3</sup>. Langflow,
an open-source framework, was chosen for its ability to facilitate rapid prototyping and transparent
orchestration of agent interactions. The model (in Langflow terms referred to as a flow) coordinates the
conditional execution of each agent, the tools they use, and the data transfer between the components
involved, including the submission of potential content moderation messages to the chat system.</p>
          <p>Figure 2 shows an extract of the agent pipeline in Langflow.</p>
          <p>The execution of an instance of the flow is triggered via a webhook notification initiated by the
chat system whenever a new message is posted by a user. The notification message includes the chat
message itself, the chat message’s id, and the user’s location.</p>
          <p>{
"message_id": "54934",
"location": "california",
"message": "Hi everyone, my name is Bob"
}</p>
          <p>In order to decide whether Prova clauses for the given location need to be created, a call to Prova’s
API location check endpoint is made, which returns a JSON object indicating either the presence or the
absence of a rule set for the location in question. The flow includes a conditional branch, depending
on the content of Prova’s response. In case the location has not been encountered before and Prova
reports the absence of corresponding facts, the branch leads to a call to the legal agent, which gathers
the relevant legal frameworks for the given location and outputs them in the following format, which
serves as the input format for the next agent.</p>
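          <p>The branching logic of this step can be sketched as follows; the endpoint wrappers (check_location, generate_rules, classify) and the has_rules reply field are hypothetical stand-ins for the actual Prova API and Langflow components.</p>

```python
# Sketch of one flow instance, assuming hypothetical endpoint wrappers;
# the real pipeline calls Prova's location-check HTTP API and LLM agents.

def needs_rule_generation(location_check_response: dict) -> bool:
    """Decide the branch from Prova's JSON reply (field name is assumed)."""
    return not location_check_response.get("has_rules", False)

def handle_webhook(payload: dict, check_location, generate_rules, classify):
    """Orchestrate one flow instance for an incoming chat message."""
    location = payload["location"]
    if needs_rule_generation(check_location(location)):
        # Unknown jurisdiction: run the legal agent and upload new facts first.
        generate_rules(location)
    # Both branches remerge at the hate speech classifying agent.
    return classify(payload["message"], location)
```
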
          <p>[
{
"name": "Full name of the law or regulation",
"citation": "Specific citation, e.g., §130 StGB or Article 8 GDPR",
"jurisdiction": "National / Regional / International",
"year_enacted": "Year of enactment or most recent version",
"description": "Brief description of what the law regulates"
},
...
]</p>
          <p>The Prova agent is sequentially connected to the legal agent and works on the latter's output. Its
prompt instructs it to create Prolog facts in a format accepted by the Prova engine.
"isIllegal(public_safety, dangerous_conspiracy, violence_or_threats, stochastic_terrorism, usa, \"Illegal under 18 U.S.C. §371\").",
"isIllegal(sexual_or_obscene, explicit_minor, _, _, usa, \"Illegal under 18 U.S.C. §2252A\").",
"isIllegal(sexual_or_obscene, nonconsensual, _, _, usa, \"Illegal under 18 U.S.C. §2241\").",
"isIllegal(sexual_or_obscene, sexual_harassment, _, _, usa, \"Illegal under 18 U.S.C. §2244A\").",</p>
          <p>The resulting array of clauses along with their associated location are sent to the Prova rule upload
endpoint.</p>
          <p>The branch then remerges with the main branch and continues with the hate speech classifying
agent, which accepts the incoming message and classifies it according to our taxonomy. Its output
format is as follows.</p>
          <p>{
"message_id": "msg_10475",
"location": "california",
"content": "You're so mature for your age. Don't tell anyone we talk like this :wink:. I saw your new pic—hot. If you don't send more, I'll have to show your messages to your parents. And don't bother blocking me, I'll make sure your school finds out what you've been doing.",
"tags": {
"sexual_or_obscene": ["sexual_harassment"],
"harassment": ["repeated_abuse", "reputational_threats"],
"coercion_or_grooming": [
"manipulation",
"normalization",
"emotional_blackmail"
]
}
}</p>
          <p>The output, including the classification result and the originating location, is sent to the Prova legal
and ethical rule checker. Depending on the legal rules' validity in the given location, the output of
the checker includes all identified violations, accompanied by explanatory references to the
respective laws.</p>
          <p>The result is then sent back to the chat system, where it is used for generating an explanatory content
moderation message.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Legal and Ethical Evaluation with Prova</title>
        <p>
          The Prova-based [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] legal and ethical evaluation agent accepts two input types. The first input type
consists of the generated Prolog facts. These are parsed and asserted into the Prova rulebase,
enriching it over time. The implementation does not include any hardcoded legislation; instead, it expects
legislation to be generated and asserted at runtime. The ethical evaluation, by contrast, is hardcoded at the current
development stage, to be expanded in the future to include custom ethical rules.
        </p>
        <p>It accepts two kinds of facts. The first are simple facts, as in the following example:
isIllegal(sexual_or_obscene,sexual_harassment,_,_,usa,"Illegal under 18 U.S.C. §2244A").
Simple facts capture legislation that is sufficiently represented with a single category-sublabel pair,
as seen in the first two arguments. In the above example, sexual harassment is illegal in the specific
jurisdiction without requiring any other conditions to apply, thus the next pair of arguments are
anonymous variables. Finally, the last two arguments are the jurisdiction and explanatory text for
the legal basis.</p>
        <p>The second kind of accepted facts are facts with “conditional combinations”, as seen below:
isIllegal(hate_speech,extremist_symbols,violence_or_threats,symbolic_threat,california, "Illegal under CA Penal Code § 11411").</p>
        <p>For example, in California, symbolic threats or displaying extremist symbols are not necessarily illegal
on their own. However, this is not the case when a symbolic threat is made using an extremist symbol.</p>
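        <p>The distinction between simple facts and conditional combinations can be illustrated with a minimal Python re-implementation of the matching logic. This is a sketch for exposition only, not the Prova code the system actually runs; the fact tuples mirror the isIllegal/6 predicate above.</p>

```python
# Illustrative sketch (not the Prova implementation) of how simple facts and
# "conditional combination" facts match a tagged message.
SIMPLE = None  # stands in for Prolog's anonymous variable "_"

FACTS = [
    # (category, sublabel, cond_category, cond_sublabel, jurisdiction, legal_basis)
    ("sexual_or_obscene", "sexual_harassment", SIMPLE, SIMPLE, "usa",
     "Illegal under 18 U.S.C. §2244A"),
    ("hate_speech", "extremist_symbols", "violence_or_threats", "symbolic_threat",
     "california", "Illegal under CA Penal Code § 11411"),
]

def violations(tags: dict, region: str):
    """Return the legal bases violated by a tagged message in a region."""
    hits = []
    for cat, sub, ccat, csub, juris, law in FACTS:
        if juris != region or sub not in tags.get(cat, []):
            continue
        if ccat is SIMPLE:                 # simple fact: no extra condition
            hits.append(law)
        elif csub in tags.get(ccat, []):   # conditional combination satisfied
            hits.append(law)
    return hits
```

        <p>Under this reading, extremist symbols alone produce no violation in California, while the co-occurring symbolic threat does, matching the example above.</p>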
        <p>The second input type pertains to message evaluation. In particular, each message and its
characteristics are evaluated against the asserted facts by reactive rules.
legalChecker() :-
rcvMult(X,P,F,complianceRequest,[Cat,ListSubCat,Region]),
illegalOrCombination(X,Cat,ListSubCat,Region,Law),
spawn(X,$Service,resume,[]).</p>
        <p>
          While the input, after preprocessing, is relayed to Prova using its message passing capabilities, the
result is communicated utilizing Prova’s capability to call external methods (via spawn/4) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ],
constructing the JSON response in this way. This is performed through the auxiliary predicate
illegalOrCombination/5, which is implemented as follows:
illegalOrCombination(X,Cat1,ListSubCat1,Region,Law) :-
element(SubCat1,ListSubCat1),
isIllegal(Cat1,SubCat1,Cat2,SubCat2,Region,Law),
bound(Cat2), % conditional combinations
spawn(X,$Service,triggerHasComplexRules,[]).
illegalOrCombination(X,Cat1,ListSubCat1,Region,Law) :-
element(SubCat1,ListSubCat1),
isIllegal(Cat1,SubCat1,Cat2,SubCat2,Region,Law),
free(Cat2), % simple fact
spawn(X,$Service,updateResponse,[X,"legal_violation",Cat1,SubCat1,Law,Region]).
        </p>
        <p>The above implementation handles simple facts and is also able to determine whether conditional
combinations are needed, depending on whether the third argument of isIllegal is a free variable.</p>
        <p>As an optimization, this is performed only if rules that require combinations are applicable to the
user's jurisdiction, i.e., if the triggerHasComplexRules() Java method has been invoked. This
is because, in that case, each pairwise combination of tags must be evaluated by Prova, which is
computationally intensive. The rule responsible for handling conditional combinations considers only
rules without free variables:
legalChecker() :-
rcvMult(X,P,F,complianceRequest,[Cat1,ListSubCat1,Cat2,ListSubCat2,Region]),
element(SubCat1,ListSubCat1),
isIllegal(Cat1,SubCat1,Cat2Ground,SubCatGround,Region,Law),
bound(Cat2Ground),
equal(Cat2Ground,Cat2),
...
spawn(X,$Service,updateResponse,[X,"legal_violation",Cat1,SubCat1,Law,Region]),
spawn(X,$Service,updateResponse,[X,"legal_violation",Cat2,SubCat2,Law,Region]),
spawn(X,$Service,resume,[]).</p>
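        <p>The gating of this computationally intensive pairwise pass can be sketched in Python as follows; evaluate_pairs and check_pair are hypothetical stand-ins for the Langflow preprocessing and the complianceRequest round-trip to Prova, shown only to illustrate the optimization.</p>

```python
from itertools import combinations

def evaluate_pairs(tags: dict, region: str, has_complex_rules: bool, check_pair):
    """Run the quadratic pairwise pass only when the region's rule set
    contains conditional combinations (i.e. triggerHasComplexRules fired)."""
    if not has_complex_rules:
        return []  # only simple facts apply; skip the expensive pass
    pairs = [(cat, sub) for cat, subs in tags.items() for sub in subs]
    hits = []
    for (c1, s1), (c2, s2) in combinations(pairs, 2):
        # check_pair models one complianceRequest message to Prova.
        law = check_pair(c1, s1, c2, s2, region)
        if law:
            hits.append(law)
    return hits
```

        <p>For n tags this issues n(n-1)/2 checks, which is why regions with only simple facts bypass the pass entirely.</p>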
        <p>Regarding ethical evaluation, at the current development stage it is performed through a single
reactive rule that considers hardcoded ethical facts:
ethicalChecker() :-
rcvMult(X,P,F,complianceRequest,[Cat1,ListSubCat1,Region]),
element(SubCat1,ListSubCat1),
isUnethical(Cat1,SubCat1),
spawn(X,$Service,updateResponse,[X,"ethical_violation",Cat1,SubCat1,Cat1,Region]),
spawn(X,$Service,resume,[]).</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. LLM based Clause Generation</title>
        <p>The system uses a two-stage pipeline powered by large language models to generate executable legal
rules that govern moderation decisions. This process is implemented using two Langflow agents.</p>
        <p>The first agent, operating on GPT-4o-nano, is responsible for identifying all relevant legal frameworks
based on a user's administrative region, such as a country or state. GPT-4o-nano was chosen for the
first stage because its efficiency and low computational cost make it well suited for structured retrieval
at scale. When provided with a region, the agent returns a structured JSON array containing the full
name of each applicable law, its legal citation (such as Article 8 of the GDPR or §130 of the German
Criminal Code), the jurisdictional scope (national, regional, or international), the year of enactment
or last revision, and a brief description of its regulatory scope. This ensures that the legal basis for
moderation is both specific and regionally appropriate.</p>
        <p>The second Langflow agent uses GPT-4o to convert the retrieved legal frameworks into executable
Prolog-style clauses, based on a predefined taxonomy of problematic content. This taxonomy includes
categories such as hate_speech, violence_or_threats, public_safety, and others, each with
associated sublabels (e.g., misgendering, true_threat, school_targeting).</p>
        <p>For each label, and for combinations where legally significant, the agent generates facts in the following
format:
"isIllegal(meta_label, label, conditional_meta, conditional_label, jurisdiction, \"legal_basis\")"</p>
        <p>The agent adheres to strict constraints: only labels from the taxonomy are used; rules must be based
on known legal frameworks; and conditional combinations (e.g., incitement + extremist symbols) are
explicitly modeled.</p>
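        <p>A minimal sketch of enforcing these constraints is shown below; the helper make_fact and the taxonomy excerpt are illustrative assumptions, not the agent's actual prompt or code.</p>

```python
# Hypothetical helper mirroring the second agent's output constraint: a fact
# string is emitted only when every label comes from the predefined taxonomy.
TAXONOMY = {  # small excerpt, for illustration only
    "hate_speech": {"misgendering", "extremist_symbols"},
    "violence_or_threats": {"true_threat", "symbolic_threat", "school_targeting"},
}

def make_fact(meta, label, cond_meta, cond_label, jurisdiction, basis):
    """Build one isIllegal/6 fact string, rejecting out-of-taxonomy labels."""
    for m, l in ((meta, label), (cond_meta, cond_label)):
        if m == "_":  # anonymous pair, i.e. a simple fact
            continue
        if l not in TAXONOMY.get(m, set()):
            raise ValueError(f"label outside taxonomy: {m}/{l}")
    return (f'isIllegal({meta}, {label}, {cond_meta}, {cond_label}, '
            f'{jurisdiction}, \\"{basis}\\").')
```

        <p>Rejecting out-of-taxonomy labels at generation time keeps the rulebase aligned with the classifier's label space.</p>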
        <p>The output is returned as a raw JSON array of rule strings and is directly ingested by the moderation
engine’s legal reasoning agent. This enables automated, traceable, and jurisdiction-aware enforcement
of moderation policies in real time.</p>
        <p>The full prompts and agents can be seen on GitHub<sup>4</sup>.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. LLM based Problematic Speech Detection</title>
        <p>To classify user-generated messages according to potential policy or legal violations, the system employs
a lightweight Gemini 2.0 Flash-Lite model embedded in a Langflow agent. The Gemini 2.0 Flash-Lite
model was selected for this stage because its lightweight architecture provides rapid, cost-efficient
classification while maintaining sufficient accuracy for content moderation tasks. This agent is tasked
with identifying harmful content based on a predefined taxonomy (See Appendix A) that covers a wide
spectrum of online abuse, including hate speech, threats, harassment, coercion, and policy violations.
The classification process plays a foundational role in enabling both ethical and legal reasoning, as
it determines whether content matches any regulated or prohibited categories prior to rule-based
evaluation.</p>
        <p>The agent operates by receiving a message object, including a unique identifier and the text
content, and returning a structured JSON output that annotates the content with relevant labels. These
labels are drawn exclusively from a predefined multi-level taxonomy, which includes top-level
categories such as hate_speech, violence_or_threats, sexual_or_obscene, and
coercion_or_grooming, each with a detailed list of subcategories. For example, if a message exhibits
manipulative behavior toward a minor, the agent may apply tags such as emotional_blackmail under
coercion_or_grooming, or reputational_threat under harassment, depending on the
context of the language used.</p>
        <p>The classification agent is designed to operate deterministically, without prompting follow-up
questions or generating open-ended explanations. It returns only the required JSON structure, ensuring
consistency and machine-readability across the moderation pipeline. The result is a precise, multi-label
content annotation that informs downstream agents in the system—most notably, the legal reasoning
engine responsible for determining rule violations under applicable jurisdictional frameworks.</p>
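        <p>A downstream consumer can enforce this contract with a small validation step, sketched here in Python; parse_classification and the taxonomy excerpt are hypothetical, but the check reflects the requirement that labels are drawn exclusively from the taxonomy.</p>

```python
import json

TAXONOMY = {  # excerpt of the multi-level taxonomy, for illustration only
    "sexual_or_obscene": {"sexual_harassment"},
    "harassment": {"repeated_abuse", "reputational_threats"},
    "coercion_or_grooming": {"manipulation", "normalization", "emotional_blackmail"},
}

def parse_classification(raw: str) -> dict:
    """Parse the agent's JSON output and reject labels outside the taxonomy,
    keeping the input to the legal reasoning engine machine-checkable."""
    out = json.loads(raw)
    for cat, subs in out["tags"].items():
        unknown = set(subs) - TAXONOMY.get(cat, set())
        if unknown or cat not in TAXONOMY:
            raise ValueError(f"unexpected labels: {cat} -> {sorted(unknown)}")
    return out
```
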
        <p>By tightly coupling the output to the taxonomy and formatting it as structured data, this approach
enables real-time classification that is both scalable and transparent. The use of Gemini 2.0 Flash-Lite balances
model efficiency with semantic accuracy, making it suitable for live deployment in chat environments
where latency and explainability are critical.</p>
        <p><sup>4</sup>https://github.com/fillies/GuardianAgents/</p>
        <p>The full prompt and agent can be seen on GitHub<sup>5</sup>.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Prototype Chat Room on Discord</title>
        <p>To demonstrate the practical application of the moderation framework, a live prototype was developed
and integrated into a Discord server. This setup allows for real-time content evaluation and moderation
decisions based on the legal and ethical rule engine described in previous sections.</p>
        <p>The chat room is designed with onboarding and moderation workflows that reflect both platform
policies and regulatory compliance. When a new user joins the server, they are assigned a temporary
“Quarantine” role and prompted to introduce themselves in a specific format, stating their name, country
of origin, and age. This initial message is parsed by the system to associate the user with a region,
which is then used to select the applicable legal framework for content moderation.</p>
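        <p>The onboarding parse can be sketched as follows; the exact introduction format and the parse_intro helper are assumptions for illustration, not the prototype's actual parser.</p>

```python
import re

# Hypothetical introduction format: "Name: ..., Country: ..., Age: ..."
INTRO = re.compile(
    r"name:\s*(?P<name>.+?),\s*country:\s*(?P<country>.+?),\s*age:\s*(?P<age>\d+)",
    re.IGNORECASE)

def parse_intro(message: str):
    """Extract (name, region, age) used to select the legal framework."""
    m = INTRO.search(message)
    if not m:
        return None  # user stays in the Quarantine role
    return m.group("name"), m.group("country").lower(), int(m.group("age"))
```
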
        <p>Messages sent in the main chat channel are intercepted and evaluated through a webhook that
forwards the content and associated region to the Langflow pipeline. The pipeline classifies the message
based on the speech taxonomy and assesses legal violations through LLM-based classification and
rule-based reasoning. If violations are detected, the message is programmatically deleted using the
Discord API. In such cases, a summary message is posted in the channel that lists the legal violations and
the specific laws that were breached, providing users with a transparent explanation of the moderation
action. An example of such a moderated message is shown in Figure 1.</p>
        <p>The system also includes administrative tools: when new rules are generated by the rule generation
agent, an automated direct message is sent to the server administrator containing the rule set, message
context, and regional information. This ensures both traceability and manual oversight where needed.</p>
        <p>This integration showcases the operational feasibility of the proposed architecture in a real-time,
user-facing environment. It highlights how agent-based legal reasoning, combined with LLM-powered
classification, can be deployed to enforce region-specific moderation policies in a scalable and explainable
manner.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>Regarding the generated Prolog clauses, the LLM-based generation did not produce invalid syntax and
handled anonymous variables well (these are present in the majority of facts). Similarly, the
message classification JSON output (which is subsequently passed to Prova) was unproblematic
with respect to syntactic validity and adhered to the predefined taxonomy of problematic content.</p>
      <p><sup>5</sup>https://github.com/fillies/GuardianAgents/tree/main/langflowAgents</p>
      <p>We evaluated the system's ability to generate legal rules and assess hate speech in accordance with
European Union legal standards, using a legally grounded dataset of 158 annotated posts<sup>6</sup> developed
by Zufall et al. (2022) [<xref ref-type="bibr" rid="ref13">13</xref>]. This dataset aligns with the EU Framework Decision 2008/913/JHA and
contains social media messages labeled for legal punishability. Of the 158 messages, 24 were classified
as punishable, while the remaining 134 were considered highly toxic but not illegal under the EU
Framework Decision 2008/913/JHA.</p>
      <sec id="sec-5-1">
        <p>
          To also evaluate performance on non-toxic inputs, the 134 non-punishable but toxic messages were
paired with 134 completely non-toxic messages sourced from [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which contains real-life, non-toxic
chat messages exchanged between teenagers on Discord.
        </p>
        <p>All messages were presented to the application as if they were real-time communications sent within
the EU. The system’s outputs were recorded and compared to the dataset’s ground-truth labels. Precision,
recall, and F1 score were used as evaluation metrics.</p>
        <sec id="sec-5-1-1">
          <title>5.1. Results</title>
          <p>The evaluation focused on three classes: (1) Toxic but not illegal, (2) Toxic and illegal (punishable), and
(3) Non-toxic and non-illegal.</p>
          <p>As shown in Table 1 and Table 2, the system performed well at identifying non-toxic, non-illegal
content (Class 3), achieving high precision (1.000), recall (0.926), and an F1 score of 0.962. However, it
failed to recognize highly toxic but non-punishable messages (Class 1), scoring 0 across all metrics. This
suggests the system struggles to distinguish socially harmful but legal speech from other categories,
possibly because the content policies applied during LLM training also prohibit legal but toxic statements.</p>
          <p>For legally punishable content (Class 2), recall was perfect (1.000), but precision was low (0.143),
leading to many false positives. While the system successfully identified all punishable cases, it
frequently misclassified non-punishable content as illegal.</p>
          <p>Averaged metrics show moderate overall performance, with a macro F1 score of 0.404 and a micro
F1 of 0.507. The high weighted recall (0.902) indicates a conservative bias, prioritizing detection of
potentially illegal content at the expense of precision.</p>
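          <p>The reported macro F1 can be reproduced from the per-class precision and recall figures:</p>

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Per-class (precision, recall) as reported in Tables 1 and 2.
per_class = {1: (0.000, 0.000), 2: (0.143, 1.000), 3: (1.000, 0.926)}
f1s = {c: f1(p, r) for c, (p, r) in per_class.items()}
macro_f1 = sum(f1s.values()) / len(f1s)  # ≈ 0.404, matching the reported value
```
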
          <p>[Table 1: per-class precision — Class 1: 0.000; Class 2: 0.143; Class 3: 1.000.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>The evaluation highlights both the strengths and limitations of our hybrid moderation system. While
the system performs well at identifying non-toxic content and detecting legally punishable speech with
high recall, it tends to over-classify borderline toxic content as illegal, resulting in a significant number
of false positives. This conservative bias is intentional in a system designed for child protection,
where caution is preferred. However, it also underscores the difficulty of translating complex, nuanced
legal standards into executable rules without overgeneralization.</p>
      <p>A key insight is that rule generation based on jurisdiction-specific legislation is possible but remains
incomplete. The problem of automatic rule generation is by no means solved, but this system offers
a first step for settings where human-based rule generation is not feasible due to time and cost constraints. It
will be important to further evaluate the rules at the case level to gain a deeper understanding of the
capabilities of LLM-based rule generation (https://github.com/simulacrum6/op-hate-nlp/tree/main).
Furthermore, certain forms of hate speech, such as subtle or
coded expressions, may evade detection if they are not explicitly covered in the legal framework or not
captured by the current taxonomy. This limitation highlights the need for continuous refinement of
both the hate speech taxonomy and the legal rule generation pipeline.</p>
      <p>Given that the system is designed for environments with minors, privacy and data security are critical
concerns. The handling of sensitive or potentially harmful content must comply with regulations such
as GDPR. Storing or processing user-generated content, especially toxic or illegal material, poses risks
related to inadvertent exposure, misuse, or unauthorized access. In the current prototype, content
handling does not yet incorporate advanced measures for protecting personally identifiable information,
although it also does not explicitly store any. Future scaling to larger platforms will require rigorous
encryption, access control, and data anonymization strategies.</p>
      <p>The system’s reliance on LLMs for classification and rule synthesis introduces several risks that
are particularly relevant in the context of child safety and legal compliance. LLMs can hallucinate,
misinterpret legal standards, or misalign culturally, leading to both false positives and false negatives.
While the system incorporates conservative thresholds and rule-based safeguards, these do not fully
eliminate risks. Critically, legal misinterpretation could result in either over-censorship, which impacts
free expression, or under-censorship, which compromises child safety. Therefore, human oversight
and formal legal validation mechanisms remain necessary to ensure accountability, reliability, and
trustworthiness at scale.</p>
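The conservative, recall-first policy described above can be sketched as a simple decision rule over classifier confidences. This is an illustrative Python sketch, not the paper's implementation; the threshold values and function name are assumptions chosen only to show the shape of such a policy:

```python
def moderation_action(p_illegal: float, p_toxic: float) -> str:
    """Map classifier confidences to a moderation action.

    A deliberately low bar for punishable content favors recall over
    precision, mirroring the child-safety-first bias discussed above;
    flagged and blocked items still go to human review.
    """
    if p_illegal >= 0.3:           # low threshold: accept false positives
        return "block_and_review"  # human oversight remains in the loop
    if p_toxic >= 0.5:
        return "flag_for_review"
    return "allow"
```

Lowering the `p_illegal` threshold trades precision for recall, which is exactly the pattern visible in the Class 2 results (recall 1.000, precision 0.143).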
      <p>Finally, while the Discord-based prototype demonstrates feasibility in a real-life environment, scaling
this approach to larger platforms will require optimizing latency and handling adversarial inputs.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>This paper introduced a scalable, multi-agent moderation framework that integrates LLM-based
classification with symbolic legal and ethical reasoning to ensure context-aware and jurisdiction-specific
content moderation. By combining automated rule generation with real-time detection, the system
addresses key challenges in protecting children from harmful online content while aligning with diverse
legal standards.</p>
      <p>Evaluation results show the system’s strong ability to detect illegal content, though precision remains
a limitation, particularly in distinguishing toxic but lawful speech. This trade-off reflects a deliberate,
risk-averse stance in child protection scenarios, but points to the need for more nuanced moderation
capabilities.</p>
      <p>Future work will address this by refining the hate speech taxonomy, improving rule specificity, and
incorporating more robust validation mechanisms, potentially involving legal experts or structured
legal ontologies. Additionally, we aim to extend the system’s language coverage, handle adversarial or
obfuscated content, and deploy it in more diverse, high-volume communication environments. This
will support broader applicability across global platforms while maintaining transparency and legal
compliance.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The ‘Terminology and Ontology- Based Phenotyping (TOP)’ project is funded by the German Federal
Ministry of Education and Research (grant number: 01ZZ2018).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT for grammar and spelling checking
and for rewording. After using this service, the authors reviewed and edited the content as needed and
take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Toxic Language Schema</title>
      <p>The content moderation schema is displayed in a structured JSON format for categorizing various types
of harmful or policy-violating content. A complete listing of the schema, including tag categories, is
provided below.</p>
      <p>"type": "object",</p>
      <p>]
},
"sexual_or_obscene": {
"type": "array",
"items": {
"type": "string",
"enum": [
"explicit_minor",
"nonconsensual",
"sexual_harassment",
},
"legal_or_policy_violation": {
"type": "array",
"items": {
"type": "string",
"enum": [
"copyright",
"defamation",
"impersonation",
"spam",
"political_campaigning",
"misinformation",
"voter_suppression",
"tos_violation"
},
"context_disruption": {
"type": "array",
"items": {
"type": "string",
"enum": [
"off_topic",
"trolling",
"external_irrelevant",
"external_harmful",
"external_misleading"</p>
      <p>]
},
"coercion_or_grooming": {
"type": "array",
"items": {
"type": "string",
"enum": [
"manipulation",
"normalization",</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Theocharis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kosmidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zilinsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Quint</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pradel</surname>
          </string-name>
          , CONTENT WARNING:
          <article-title>Public Attitudes on Content Moderation and Freedom of Expression</article-title>
          , https://doi.org/10.17605/OSF.IO/F56BH,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gillespie</surname>
          </string-name>
          ,
          <article-title>Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media</article-title>
          , Yale University Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Chandrasekharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gandhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Mustelier</surname>
          </string-name>
          , E. Gilbert,
          <article-title>Crossmod: A Cross-Community Learning-Based System to Assist Reddit Moderators</article-title>
          ,
          <source>Proceedings of the ACM on Human-Computer Interaction</source>
          <volume>3</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-Training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language Models Are Few-Shot Learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Forbus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lao</surname>
          </string-name>
          ,
          <article-title>Neural Symbolic Machines: Learning Semantic Parsers on Freebase With Weak Supervision</article-title>
          ,
          <source>arXiv preprint arXiv:1611.00020</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Besold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Domingos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          , K.-U. Kühnberger,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Lamb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M. V.</given-names>
            <surname>Lima</surname>
          </string-name>
          , L. de Penning,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pinkas</surname>
          </string-name>
          , et al.,
          <article-title>Neural-Symbolic Learning and Reasoning: A Survey and Interpretation, in: Neuro-Symbolic Artificial Intelligence: The State of the Art</article-title>
          , IOS press,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wooldridge</surname>
          </string-name>
          , An Introduction to Multiagent Systems, John Wiley &amp; Sons,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Novelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Casolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hacker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Spedicato</surname>
          </string-name>
          , L. Floridi,
          <article-title>Generative AI in EU Law: Liability, Privacy, Intellectual Property, and Cybersecurity</article-title>
          ,
          <source>Computer Law &amp; Security Review</source>
          <volume>55</volume>
          (
          <year>2024</year>
          )
          <elocation-id>106066</elocation-id>
          . URL: https://www.sciencedirect.com/science/article/pii/S0267364924001328. doi:10.1016/j.clsr.2024.106066.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hartung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bommarito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Katz</surname>
          </string-name>
          , N. Aletras,
          <article-title>LexGLUE: A Benchmark Dataset for Legal Language Understanding in English</article-title>
          ,
          <source>arXiv preprint arXiv:2110.00976</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fillies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitsikas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schäfermeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paschke</surname>
          </string-name>
          ,
          <article-title>Agent-Based Hate Speech Moderation Approach</article-title>
          , in: International Workshop on Causality,
          <source>Agents and Large Models</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kozlenkov</surname>
          </string-name>
          ,
          <source>Prova Rule Language version 3.0 User's Guide</source>
          ,
          <year>2010</year>
          . https://github.com/prova/prova/tree/master/doc.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Zufall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hamacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kloppenborg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zesch</surname>
          </string-name>
          ,
          <article-title>A Legal Approach to Hate Speech - Operationalizing the EU's Legal Framework against the Expression of Hatred as an NLP Task</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrett</surname>
          </string-name>
          , C. Goanță, D. Preoțiuc-Pietro (Eds.),
          <source>Proceedings of the Natural Legal Language Processing Workshop</source>
          <year>2022</year>
          , Association for Computational Linguistics, Abu Dhabi,
          <source>United Arab Emirates (Hybrid)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>64</lpage>
          . URL: https://aclanthology.org/2022.nllp-1.5/. doi:10.18653/v1/2022.nllp-1.5.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fillies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Peikert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paschke</surname>
          </string-name>
          , Hateful Messages:
          <article-title>A Conversational Data Set of Hate Speech Produced by Adolescents on Discord</article-title>
          , in: International Data Science Conference, Springer,
          <year>2023</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>