<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Federated Inductive Logic Programming for Explainable Artificial Intelligence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yasmine Akaichi</string-name>
          <email>yasmine.akaichi@unamur.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jean-Marie Jacquet</string-name>
<email>jean-marie.jacquet@unamur.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabelle Linden</string-name>
<email>isabelle.linden@unamur.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wim Vanhoof</string-name>
<email>wim.vanhoof@unamur.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer Science, University of Namur</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1857</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
<p>Artificial Intelligence (AI) is transforming healthcare through predictive modeling and clinical decision support. However, its adoption remains limited due to two persistent challenges: the lack of model explainability and the sensitivity of patient data, which restricts data sharing across institutions. Federated Learning (FL) offers a privacy-preserving solution by enabling collaborative model training across decentralized health data sources without exposing raw data. Yet, most FL implementations rely on black-box models, such as deep neural networks, which limits clinical transparency. While interpretable models like Decision Trees (DT) exist, they often lack the expressiveness required to capture complex medical relationships. To bridge this gap, we propose FILP: a novel framework that integrates Explainable AI (XAI) via Inductive Logic Programming (ILP) into a Federated Learning setting. FILP is applied to a real-world clinical task: predicting post-intubation complications. It builds on the symbolic and declarative nature of logic programming to derive logical rules from local data. Each client independently learns interpretable theories using background knowledge and examples, then participates in a majority-vote consensus mechanism to produce a global theory that is both privacy-preserving and human-understandable. We instantiate FILP using two ILP systems: Andante, a Progol-based system, and Popper, which is based on Answer Set Programming (ASP). FILP is evaluated against decision tree baselines in both centralized and federated settings. While decision trees achieve strong predictive results on small test sets, ILP methods offer a distinct advantage in interpretability and explainability by producing symbolic, domain-aligned rules. Our results indicate that, in this clinical setting, federated training yields slightly improved performance over centralized learning, while preserving privacy.
This suggests that FL can offer not only confidentiality guarantees but also potential advantages in predictive effectiveness.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Artificial Intelligence (AI) is increasingly integrated into healthcare workflows, where it supports tasks
such as diagnosis, prognosis, and treatment planning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Predictive models trained on electronic
health records (EHRs) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], medical imaging [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and physiological signals [4] have shown promise in
identifying risk patterns and improving outcomes.
      </p>
      <p>Yet, despite these advances, real-world clinical deployment remains limited. Two core challenges
persist: the lack of model explainability and the constraints imposed by patient data privacy [5, 6].
Modern healthcare institutions operate in a distributed setting, where data is siloed across hospitals.
This makes centralized training impractical and creates demand for AI systems that are both
privacy-preserving and clinically interpretable.</p>
      <p>Federated Learning (FL) addresses data sharing challenges by enabling decentralized model training
without exposing raw data [6, 7]. FL has shown promise in healthcare applications, including medical
imaging and EHR analysis [8, 9], but most implementations rely on black-box models such as deep
neural networks [10], limiting transparency and clinical validation. In response, interpretable models
like decision trees have been proposed as white-box alternatives [11]. While decision trees provide
structural interpretability by mapping predictions to feature-based splits, they fall short in modeling rich
relational dependencies and fail to incorporate structured domain knowledge in a reusable, declarative
form.</p>
<p>This introduces a methodological gap: FL systems are often optimized for accuracy, but lack support
for semantically grounded, expert-verifiable reasoning [12, 13]. In high-stakes domains like healthcare,
this trade-off is critical.</p>
<p>Symbolic AI, particularly Inductive Logic Programming (ILP), offers a complementary solution. ILP
induces logic programs from labeled examples and background knowledge, producing hypotheses
in the form of interpretable rules [14, 15]. Unlike decision trees, ILP supports structured, relational
learning and natively integrates prior knowledge into the learning process. ILP has demonstrated
utility in biomedical domains such as toxicology [16], pharmacology [17], and genomics [18]. However,
existing ILP systems typically assume centralized access to data and are not designed to handle the
decentralization, heterogeneity, and coordination challenges of federated environments.</p>
      <p>In this paper, we propose FILP, a novel framework that integrates ILP into Federated Learning to
enable symbolic, interpretable modeling from decentralized medical data. Each client learns logical rules
locally, guided by background knowledge and declarative biases, and shares only symbolic predictions
on its data. A majority-vote aggregation mechanism is then used to derive a global consensus decision.
FILP is privacy-preserving by design and enables human-understandable model inspection, a key factor
in clinical adoption. Our main contributions are as follows:
• We introduce FILP, the first framework to bring ILP into the federated setting, addressing privacy
and regulatory constraints (e.g., GDPR, AI Act) while preserving interpretability.
• We implement FILP with two ILP engines, Andante (Progol-based) and Popper (ASP-based),
demonstrating generality across distinct symbolic reasoning paradigms.
• We evaluate FILP on a real-world clinical dataset, showing that it induces clinically plausible
rules even under strict privacy and limited data availability.
• We extend the evaluation to three established ILP benchmarks (Mutagenesis, Carcinogenesis,
Alzheimer), highlighting FILP’s ability to generalize beyond a single domain.</p>
      <p>This paper is structured as follows. Section 2 introduces the background on Federated Learning and
ILP. Section 3 presents the FILP framework and its components. Section 4 reports experimental results
on both clinical and benchmark datasets. Finally, Section 5 concludes with a summary of contributions
and future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>In this section, we review the key concepts and formal principles that support our work. We begin with
an overview of FL, followed by a discussion of explainable FL in healthcare. We then introduce ILP, a
symbolic machine learning paradigm for interpretable model learning.</p>
      <sec id="sec-2-1">
        <title>2.1. Federated Learning</title>
        <p>Federated Learning is a decentralized machine learning paradigm introduced by McMahan et al. [10, 7],
designed to enable collaborative model training across multiple clients without sharing raw data, thereby
preserving privacy. Each client trains a model locally on its private dataset and sends updates to a
central server, which aggregates them to produce a global model that benefits from the collective
knowledge of all participants.</p>
<p>FL Components. A typical FL setup includes three important components: parties or clients (e.g.,
mobile devices or institutions), the manager or the coordinator (e.g., central server), and the
communication-computation framework responsible for training the machine learning models [19].
FL Training. Federated training proceeds in synchronous communication rounds. At the beginning
of round t, the server S samples a subset of clients S_t ⊆ {1, …, K} and broadcasts the current global
parameters w_t. Each selected client k trains locally on its private data for a fixed number of epochs
and obtains updated parameters w_{t+1}^k. The server then aggregates the client models using Federated
Averaging (FedAvg):
n_t = ∑_{k ∈ S_t} n_k,    w_{t+1} = ∑_{k ∈ S_t} (n_k / n_t) · w_{t+1}^k,
where n_k is the number of local training examples on client k. This weighted average ensures that
clients with larger datasets contribute proportionally to the global update. The procedure repeats for
multiple rounds until convergence.</p>
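For intuition, the FedAvg weighted average described above can be sketched in a few lines of Python. This is a toy illustration with flat parameter vectors; the function name and data layout are ours, not part of any FL library.

```python
# Minimal sketch of one FedAvg aggregation step: each client's parameters
# are a flat list of floats, and n_k (the client's number of training
# examples) weights its contribution, as in the weighted average above.
def fedavg(client_params, client_sizes):
    n_t = sum(client_sizes)          # total examples across selected clients
    dim = len(client_params[0])
    return [
        sum((n_k / n_t) * w[i] for w, n_k in zip(client_params, client_sizes))
        for i in range(dim)
    ]
```

A client holding three times more data thus pulls the global parameters three times harder toward its local update.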
        <p>FL naturally supports partial participation, unbalanced sample sizes, and non-IID data. However,
standard FL assumes differentiable, numeric model parameters and gradient-based optimization. In
contrast, ILP produces discrete symbolic hypotheses (logic programs), which cannot be averaged as
above. This incompatibility motivates our symbolic FL framework (FILP), where we exchange and
aggregate hypotheses rather than gradient-based parameter updates.
2.2. FL for XAI
FL’s applications in healthcare span a broad range of domains [6], including electronic health records
(EHRs), such as arrhythmia detection from single-lead ECG signals collected via IoT devices [20],
medical imaging [21], COVID-19 diagnosis from multimodal data [22], and smart health systems for
early Alzheimer’s detection [23].</p>
        <p>Recent studies tend to focus on either interpretability or privacy preservation, but rarely address
both simultaneously. Many approaches rely on post-hoc explainability techniques such as SHAP [24]
or LIME [5, 25], which attempt to rationalize predictions after model training. For instance, Janzing
et al. [26] integrated gradient-based explanations to help users understand whether a specific feature
positively or negatively influences a decision. Similarly, Xu et al. [27] applied feature selection for
interpretable load forecasting. Zeleke and Bochicchio [28] integrate explainable AI into FL for
arrhythmia detection by using an attention-based temporal convolutional network (TCN), where attention
weights are extracted post-training to highlight which parts of the ECG signal influenced the model’s
predictions.</p>
        <p>While methods like SHAP and LIME attempt to explain black-box predictions through feature
attribution, they remain external to the learning process and rely on approximations. This lack of
integration, coupled with susceptibility to adversarial manipulation, raises concerns in high-stakes
domains like healthcare. As Rudin [29] argues, inherently interpretable models should be preferred
over post-hoc explanations, as they offer faithful and verifiable insights directly grounded in model
structure.</p>
<p>To address these limitations, some recent efforts have introduced symbolic components into FL.
FedNSL [30] and symbolic regression-based methods [31] generate symbolic expressions, but still rely
on neural approximations. PeFLL [32] enhances personalization through client embeddings, but retains
a black-box structure. These approaches offer partial improvements but fall short of full interpretability
and logical rigor. A more principled solution lies in the use of inherently interpretable models.</p>
        <p>Decision trees, for example, have been incorporated into FL frameworks [33, 34] to promote
transparency. Wu et al. [35] introduced FedForest, a federated random forest algorithm designed to balance
predictive accuracy and interpretability. While decision trees can express simple conjunctive rules over
tabular features, such as “if a patient is diabetic and has a short neck, then complications are likely,” they
remain limited in their ability to capture richer relational structures or integrate background domain
knowledge. Decision trees operate through isolated feature splits and lack the representational flexibility
of logic-based systems, making it difficult to model reusable abstractions, relationships between entities,
or derived clinical concepts in a declarative way.</p>
        <p>Of course, rule-based modeling in healthcare encompasses a broader spectrum of approaches than
those discussed here. In this section, however, we focus on representative families that are both widely
applied and closely aligned with the challenges of federated learning and explainability.</p>
        <p>Thus, another widely explored class of rule-based systems in healthcare is fuzzy logic. These models
are among the most established approaches to building interpretable systems, as they capture knowledge
in the form of human-readable IF–THEN rules with linguistic terms (e.g., low, medium, high). Their
ability to handle uncertainty and imprecision has made them particularly attractive in healthcare and
other sensitive domains where transparency is essential. Works such as [36] adopt fuzzy logic in a
centralized way, where interpretability is achieved directly through fuzzy rules. By contrast, Castellano
et al. [37] focus on enhancing the interpretability of deep learning models by integrating fuzzy logic
with graph neural architectures. Their approach first fuzzifies input features into linguistic terms,
constructs graph representations of the data, and then applies a GNN for classification.</p>
        <p>Other work [38, 39, 40, 41] has adopted fuzzy rule-based systems in a federated learning context. In
particular, Ducange et al. [42] focused on Takagi–Sugeno–Kang Fuzzy Rule-Based Systems (TSK-FRBS),
which naturally provide interpretability through fuzzy IF–THEN rules. They contrasted this approach
with a multi-layer perceptron neural network (MLP-NN), where explainability was obtained only
through post-hoc methods. In their framework, local models trained at different clients were aggregated
on the server by merging rule bases and resolving conflicts via a weighted averaging mechanism
that accounts for rule support and confidence. Their experiments on Alzheimer’s disease prediction
highlighted the benefits of interpretable fuzzy rules in FL.</p>
        <p>More recently, Barcena et al. [43] proposed a framework for FL of Fuzzy Regression Trees (FRTs),
aiming to jointly address privacy preservation and model interpretability. Their method builds a single
global tree by aggregating local statistics (e.g., fuzzy means, variances, and membership degrees) instead
of raw data, thereby enabling the construction of an interpretable tree structure across distributed
clients.</p>
        <p>These contributions demonstrate that fuzzy logic has been successfully applied in both centralized
and federated learning, offering interpretable models through fuzzy rules or tree structures. By contrast,
our work introduces ILP into FL, where interpretability arises directly from first-order logical rules
rather than fuzzy sets. Moreover, instead of relying on statistical or weight-based aggregation, we adopt
a majority voting mechanism across distributed hypotheses, emphasizing consensus-driven model
construction. This positions our contribution as a complementary path toward explainable FL, centered
on symbolic logic expressiveness and federated consensus.
2.3. ILP for XAI
In XAI, it is useful to distinguish between interpretability and explainability [44]. Interpretability refers
to the fact that the structure of a model can be directly understood by humans (e.g., reading a set of rules).
Explainability goes one step further: it requires understanding why a specific prediction was made. ILP
naturally supports both dimensions. It induces logic rules that are declarative, human-readable, and
grounded in domain knowledge [45, 46, 14].</p>
        <p>Unlike data-intensive statistical models, ILP can generalize from relatively few examples. Its output is
a set of rules that align with expert reasoning and can be verified logically. This makes ILP particularly
attractive for XAI, since it provides transparency, supports domain knowledge integration, and enables
transfer or lifelong learning [47, 48].</p>
        <p>ILP has been applied in several explainability contexts, for example to extract interpretable rules
from black-box models, explain CNN decisions [49, 50], and uncover biases in real-world tasks. Despite
its strengths, ILP faces challenges with scalability, noise tolerance, and efficient hypothesis search [51].
Recent advances combine ILP with statistical relational learning and neural-symbolic methods to
improve robustness while retaining interpretability. ILP’s versatility extends across domains: in legal
reasoning [52], reinforcement learning [53] and fraud detection [54]. In healthcare, ILP has produced
interpretable models for brain tumor classification [55] and toxicology [18].</p>
<p>However, classical ILP systems face several limitations. First, they assume centralized data and
lack mechanisms for decentralized training. Second, they struggle with scalability, as hypothesis search
quickly becomes computationally expensive when the amount of data or background knowledge grows [51].
Finally, symbolic rules cannot be optimized with gradient descent, which makes ILP difficult to integrate
into standard federated learning architectures. These limitations motivate our proposed framework, FILP,
which integrates ILP into a federated setting to enable decentralized, explainable, and scalable learning.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.4. ILP Systems</title>
        <p>Our proposed system FILP is designed to work with diferent ILP systems. In particular, it can integrate
either Progol, which induces rules by generalizing from examples, or Popper, which searches the
hypothesis space under constraints.</p>
        <p>ILP Problem Setting ILP combines logic programming and machine learning to derive interpretable
rules from structured data. Given background knowledge B, a set of positive examples E+, and a set of
negative examples E−, the task is to induce a hypothesis H (a logic program) such that:
∀e+ ∈ E+ : B ∪ H ⊧ e+</p>
        <p>and ∀e− ∈ E− : B ∪ H ⊭ e−.</p>
<p>In other words, the induced hypothesis should entail all positive examples while excluding all negative
ones.</p>
        <p>2.4.1. Progol.</p>
        <p>Progol [56] is based on inverse entailment. For each positive example  ∈  +, it constructs the most
specific clause that explains the example relative to the background knowledge (called the bottom
clause⊥). It then generalizes this clause into a rule that (i) covers the example, (ii) avoids all negative
examples, and (iii) respects the syntactic bias defined by mode declarations and determinations.</p>
        <p>Algorithm 1 summarizes the learning procedure. The algorithm iteratively processes positive
examples, generates a bottom clause, generalizes it into a rule consistent with the negative examples, and
updates the hypothesis accordingly.</p>
        <p>Algorithm 1: Progol [56]</p>
        <sec id="sec-2-2-1">
<title>Input: B: background knowledge, including mode declarations; E+: set of positive examples; E−: set of negative examples</title>
<p>Output: H, a hypothesis consisting of a set of clauses
S := E+
H := ∅
while S ≠ ∅ do
select example e from S
construct bottom clause ⊥ from B and e
induce clause h generalizing ⊥ and consistent with E−
H := H ∪ {h}
remove all e ∈ S such that (B ∪ H) ⊧ e</p>
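To make the covering loop concrete, here is a deliberately simplified propositional sketch in Python: an example is a frozenset of features, a "rule" is a feature set that covers any example containing all of its features, and greedy feature dropping stands in for bottom-clause generalization. This is our illustration only; Progol itself works with first-order clauses and mode declarations.

```python
# Propositional stand-in for Algorithm 1's covering loop (illustration only).
def covers(rule, example):
    return rule <= example  # all of the rule's conditions hold in the example

def induce_rule(seed, negatives):
    rule = set(seed)  # most specific "bottom" rule: the seed example itself
    for feat in sorted(rule):
        candidate = rule - {feat}
        # generalize only while no negative example becomes covered
        if not any(covers(candidate, n) for n in negatives):
            rule = candidate
    return frozenset(rule)

def covering_loop(positives, negatives):
    remaining, hypothesis = list(positives), []
    while remaining:
        rule = induce_rule(remaining[0], negatives)
        hypothesis.append(rule)
        # drop every positive example the new rule already covers
        remaining = [e for e in remaining if not covers(rule, e)]
    return hypothesis
```

Each iteration adds one rule that covers the selected seed and no negatives, so the set of uncovered positives strictly shrinks and the loop terminates.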
          <p>Progol has been widely applied in relational learning tasks such as biomedical reasoning, natural
language processing [57], and molecular property prediction [58]. Its output is a set of symbolic rules
that are both human-readable and logically verifiable.
2.4.2. Popper.</p>
          <p>Popper [46] approaches ILP as a constraint-guided search problem. Instead of starting from
specific examples, Popper incrementally explores the space of possible hypotheses under syntactic
constraints such as clause length. Each candidate program is tested against positive and negative
examples. When a candidate fails, new constraints are learned to prune similar programs from the
search space. This iterative process continues until a valid program is found or the search space is
exhausted. Algorithm 2 summarizes the procedure. The algorithm increases clause length step by step,
generates candidate programs, tests them against examples, and refines the constraints accordingly.
Algorithm 2: Popper [46]
def popper(BK, E, declaration_bias, constraints, max_literals):
    num_literals ← 1
    while num_literals ≤ max_literals do
        program ← generate(declaration_bias, constraints, num_literals)
        if program == 'space_exhausted' then
            num_literals += 1
            continue
        outcome ← test(BK, E+, E−, program)
        if outcome == ('all_positive', 'none_negative') then</p>
          <p>return program
        constraints += learn_constraints(program, outcome)
    return {}</p>
          <p>Popper is particularly suited for tasks that require compact, minimal programs with explicit control
over the search space. It has been used in program synthesis, strategy learning [59], and relational
reasoning over structured symbolic domains [? ].</p>
          <p>In contrast to Progol’s example-driven strategy, Popper systematically enumerates and prunes
hypotheses using constraints. FILP’s compatibility with both paradigms illustrates its generality and its
capacity to support multiple ILP workflows in federated symbolic learning.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Methodology</title>
      <p>We propose a Federated Inductive Logic Programming (FILP) framework that addresses both data privacy
and explainability in a decentralized learning setting. FILP enables multiple clients to collaboratively
induce symbolic rules without sharing raw data, while maintaining interpretability across both the
training and inference phases. To the best of our knowledge, this is the first FL approach that aggregates
ILP hypotheses without requiring numerical model updates.</p>
      <sec id="sec-3-1">
        <title>3.1. Problem Setting</title>
<p>We consider a scenario with a set of K clients C = {c_1, c_2, …, c_K}. Each client c_k holds a private local
dataset D_k = ⟨E_k^+, E_k^−, B_k⟩, where E_k^+ and E_k^− are the sets of positive and negative examples respectively,
and B_k is the background knowledge expressed in logic programming form. Background knowledge
remains local and is not shared across clients. The datasets share a common schema but remain
decentralized. A central server S coordinates the process without accessing any raw data. Figure 1
summarizes the FILP framework, which operates in two phases: (a) symbolic hypothesis induction at
the client side, and (b) decentralized prediction via logic-based aggregation. This setup is formalized
through three main components:
• L — the ILP learning engine (e.g., Popper or Andante) that returns a hypothesis (a set of
clauses/rules).
• A — the aggregation function (e.g., majority vote over hypotheses on the client side).
• F — an FL framework ensuring the collaboration (e.g., Flower).</p>
<p>Combining ILP with FL introduces unique challenges. Symbolic models, such as those induced
by ILP, are discrete, heterogeneous, and inherently non-differentiable, meaning they cannot be
optimized through gradient descent or averaged as in traditional neural network-based FL. This
non-differentiability prevents the use of standard FL aggregation techniques such as federated averaging.
Instead, logic programs require specialized aggregation strategies that respect their syntactic structure
and semantic meaning. To the best of our knowledge, no standard approach exists for aggregating logic
programs in federated settings.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. FILP Overview</title>
        <p>To address these limitations, FILP combines ILP with a distributed, logic-based voting mechanism. As
shown in Algorithm 3, each client independently induces a set of symbolic rules (Hypothesis, also
called Theory in some work) using an ILP learner (e.g., Andante or Popper) over its local examples
and background knowledge. These local hypotheses, expressed as logic programs, are transmitted to a
central server, which aggregates them not by parameter averaging but by redistributing the full set of
candidate hypotheses to all clients. Rather than fusing rules into a single theory, which may introduce
contradictions, we redistribute the ensemble to preserve local semantic consistency and allow flexible
voting.</p>
<p>During the prediction phase shown in Algorithm 4, each client evaluates whether each hypothesis in
the received ensemble entails the query example under its own background knowledge. The predictions
are then combined via a majority vote, producing a federated decision that preserves privacy, supports
symbolic reasoning, and yields interpretable outputs.
3.2.1. Part (a): Federated ILP Training
At the beginning of the learning phase, the server S broadcasts global ILP settings, such as mode
declarations and type constraints, to all clients. Each client c_k then uses its local dataset D_k = ⟨E_k^+, E_k^−, B_k⟩
to induce a symbolic hypothesis H_k using the ILP learner, i.e., H_k ← L(E_k^+, E_k^−, B_k). In our framework,
the ILP engine L can be instantiated with either Andante or Popper, two well-known systems described
in Section 2.4. Each hypothesis H_k consists of logic rules such as:
H_k = { complication(X) :- hta(X), diabetes(X)., complication(X) :- age(X,A), A&gt;65. }.</p>
<p>These hypotheses are sent to the server, which aggregates them into an ensemble H = {H_1, …, H_K}
and redistributes this set to all clients for inference. Since FILP performs only a single round of federated
training, there is no iterative refinement or communication of gradients, making the process lightweight
and communication-efficient.</p>
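The one-shot exchange can be sketched as follows. This is an illustration of the shape of the training round only: `learn_fn` is a stand-in for a call into Andante or Popper, hypotheses are plain lists of rule strings, and all names are ours rather than the paper's implementation.

```python
# Illustrative single round of FILP: clients induce hypotheses locally and
# the server aggregates by redistribution, so every client ends up holding
# the full ensemble. Only symbolic hypotheses ever leave the clients.
def filp_round(clients, learn_fn):
    # each client learns from its private positives, negatives, and
    # background knowledge; raw data and BK stay local
    local_hyps = [learn_fn(c["pos"], c["neg"], c["bk"]) for c in clients]
    # no parameter averaging: the server broadcasts the ensemble back
    return {i: list(local_hyps) for i in range(len(clients))}
```

Because there is a single round and no gradients, the only communication cost is shipping each client's rule set twice: once up to the server and once back out inside the ensemble.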
<p>Algorithm 3: The FILP(L) learning algorithm</p>
        <p>Input: ILP system L (e.g., Andante or Popper)</p>
        <p>Server S and clients {c_1, …, c_K}
Global ILP settings (mode declarations, type constraints); client B_k remains local</p>
        <p>Datasets D_1, …, D_K, where D_k = ⟨E_k^+, E_k^−, B_k⟩
Output: An ensemble of logic theories H = {H_1, …, H_K}
Server executes:
Broadcast ILP settings to each client c_k
Each Client c_k executes:</p>
        <p>Use L to learn hypothesis H_k ← L(E_k^+, E_k^−, B_k)
Send H_k to the server
Server executes:</p>
        <p>Collect all H_k from clients</p>
        <p>Aggregate into hypothesis ensemble H = (H_1, …, H_K)</p>
        <p>Send H to all clients
3.2.2. Part (b): Prediction via Aggregated Hypotheses
Each client performs a local prediction using the received hypothesis ensemble H, its background
knowledge B_c, and a local test example e. The actual logical inference is delegated to an internal
reasoning module (e.g., a local Prolog engine), which remains fully local and does not require external
communication.</p>
<p>For each hypothesis H_i ∈ H, the client queries whether H_i ∪ B_c ⊧ e, returning v_i ∈ {true, false}.
This entailment is checked via standard logical inference:</p>
        <p>v_i = PrologQuery(H_i ∪ B_c, e).</p>
        <p>This corresponds to checking whether H_i ∪ B_c ⊧ e. The binary outcome v_i ∈ {true, false} reflects
whether the example is entailed by that specific hypothesis under the client’s background.</p>
        <p>Both ILP systems supported in FILP (Andante and Popper) rely on a Prolog-based runtime to evaluate
entailment. While their learning strategies differ, their prediction phase shares this symbolic backbone,
ensuring consistency and traceability.</p>
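For intuition, entailment checking in the ground, function-free case can be emulated with naive forward chaining. This is a toy stand-in for the Prolog engine the clients actually use: a rule is a (head, body) pair over ground atoms, and the names are ours.

```python
# Toy stand-in for PrologQuery: naive forward chaining over ground atoms.
# A rule (head, body) means head :- body_1, ..., body_n. Real FILP clients
# delegate this entailment check to an actual Prolog engine.
def prolog_query(rules, facts, goal):
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # fire a rule whose body atoms are all already derived
            if head not in known and all(atom in known for atom in body):
                known.add(head)
                changed = True
    return goal in known  # does H ∪ B entail the goal atom?
```

For example, the rule "complication :- hta, diabetes" entails "complication" only from a fact base containing both body atoms.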
        <sec id="sec-3-2-1">
<title>Algorithm 4: FILP-Predict(H, B_c, e)</title>
<p>Input: Hypothesis set H = {H_1, …, H_K}</p>
          <p>Local background knowledge B_c</p>
<p>Test example e
Output: Prediction decision for example e
Initialize list of entailment results V ← [ ]
foreach H_i ∈ H do
// Query if the hypothesis entails the example (true or false)
v_i ← PrologQuery(H_i ∪ B_c, e)</p>
<p>Append v_i to V
// Aggregate individual entailment decisions (e.g., majority vote)
Return A(v_1, …, v_K)
Local Reasoning and Privacy This reasoning is executed locally on each client, without transmitting
test data or background knowledge. This ensures that privacy is preserved, as only the symbolic
hypotheses are exchanged in the FL process.</p>
<p>Aggregation via Majority Vote After evaluating all hypotheses, each client applies an aggregation
function (e.g., majority vote) to produce a final prediction:
A(v_1, …, v_K) = true if the majority of hypotheses in H entail e, and false otherwise.</p>
          <p>Although majority voting is used in this work, other aggregation functions (e.g., weighted voting, rule
selection) are possible and left for future investigation.</p>
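The default aggregation reduces to a one-liner; the sketch below treats a tie as a negative prediction, matching the "false, otherwise" branch, and is where a weighted-voting variant would plug in.

```python
# Majority vote over the boolean entailment verdicts v_1..v_K: the example
# is classified positive only if strictly more than half of the hypotheses
# entail it; a tie falls into the "false, otherwise" branch.
def majority_vote(verdicts):
    return sum(verdicts) > len(verdicts) / 2
```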
          <p>Interpretability at Prediction Time This strategy allows clients to collaboratively reach a decision
based on diverse symbolic perspectives, while retaining traceability. FILP provides interpretability at
two levels:
• Model-level: the learned hypotheses consist of rule-based logic programs that are
human-readable and can be directly interpreted by clinicians or domain experts.
• Prediction-level: Although each hypothesis H_i consists of interpretable rules, aggregating their
predictions via majority vote can obscure the specific reasoning behind a classification decision.
To mitigate this, our system supports fine-grained traceability during inference:
– For each test example, clients log which rules from each H_i entail the target predicate. This
enables local explainability, where domain experts can audit the exact rules that were used.
– We display per-rule coverage and fidelity statistics alongside each example, helping assess
both the predictive and explanatory value of each rule. We also report explainability metrics
such as rule length, number of clauses (or rules), and coverage, as detailed in Section 4.</p>
          <p>By replacing opaque prediction with declarative reasoning, FILP ensures that both the global model
behavior and individual predictions are transparent, explainable, and interpretable.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>We empirically evaluate FILP on one real-world clinical dataset and three standard ILP benchmarks.
The evaluation covers both predictive performance and explainability metrics (rule coverage, fidelity,
average rule length, and number of rules), and includes comparisons against a classical machine learning
baseline.</p>
      <sec id="sec-4-1">
        <title>4.1. Datasets and Representations</title>
        <p>We consider four binary classification datasets: Post-Intubation Complications, Mutagenesis,
Carcinogenesis, and Alzheimer Drugs, summarized in Table 1.
4.1.1. Post-Intubation Complications.</p>
        <p>The dataset consists of 118 anonymized patient records, each labeled by the presence or absence of
post-intubation complications. For each patient, structured clinical features are provided, spanning
anatomical factors (e.g., thyromental distance, neck morphology), physiological variables (e.g., age,
obesity), comorbidities (e.g., COPD, diabetes, hypertension), and procedural aspects (e.g., intubation
attempts, operator experience). Due to confidentiality restrictions, the dataset cannot be made publicly
available, but it may be shared upon request under formal institutional data-sharing agreements.
4.1.2. Mutagenesis.</p>
        <p>A relational chemistry dataset [60] where molecules are described as logical structures over atoms,
bonds, and functional groups, with a binary mutagenicity label. The objective is to classify compounds
according to their mutagenic activity. The target predicate is active(Compound), which holds when a
compound induces genetic mutations. The dataset comprises 230 nitroaromatic compounds with rich
relational background knowledge enabling structure–activity reasoning.
4.1.3. Carcinogenesis.</p>
        <p>A relational toxicology dataset [18] where compounds are represented as molecular graphs. The
objective is to predict the carcinogenicity of chemical compounds in rodents. The target predicate
is active(Compound), which holds when a compound exhibits carcinogenic activity. The dataset
comprises 298 compounds described by relational facts encoding molecular substructures, toxicophores,
and physicochemical features.
4.1.4. Alzheimer Drugs.</p>
          <p>A relational pharmacology dataset [61] where compounds are linked to pharmacological properties
through drug–property relations. The objective is to identify molecules with desirable therapeutic
effects for Alzheimer’s treatment, particularly acetylcholinesterase inhibition. The target predicate is
great(Compound,Property), which holds when a compound demonstrates effective pharmacological
activity. The dataset comprises 1,060 molecule–property pairs, described relationally through chemical
groups and structural motifs.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Preprocessing</title>
        <p>For the Medical (Post-Intubation Complications) dataset, preprocessing included cleaning variable
names, handling missing values, and preparing two complementary representations. A tabular view
was derived through categorical encoding, imputation of missing entries, and normalization of numerical
variables to enable decision-tree baselines. In parallel, a logical view was generated by encoding patient
attributes as Prolog predicates consistent with the background knowledge.</p>
        <p>By contrast, the Mutagenesis, Carcinogenesis, and Alzheimer datasets are classical ILP benchmarks
already provided in logical form. For these, preprocessing was limited to minor syntax adaptation
between Popper and Andante, and the derivation of propositional (tabular) versions for baseline
comparison.
4.2.1. Data Representations
Each dataset is prepared in two synchronized views: a Prolog/ILP view, where instances are encoded as
first-order facts with background-knowledge predicates, and a tabular view, where the same information
is propositionalized into a feature matrix for decision trees. The tabular view applies one-hot encoding
to categorical features, imputes missing values with the most frequent value per feature (optionally
standardizing numeric features), removes constant/duplicate columns, and harmonizes the feature
schema across clients. In the ILP representation, the learning task is guided by a declarative bias that
constrains the hypothesis space:
Progol. For Progol (and Andante), the declarative bias is defined by modeh, modeb, and
determination declarations. The modeh declaration specifies the target predicate to be learned (e.g.,
complication(+patient)), while modeb declarations define permissible predicates in rule bodies, and
determination statements control dependency propagation and mode directionality.
Popper. For Popper, the bias is specified through head_pred, which defines the target predicate,
body_pred, which declares the predicates admissible in rule bodies, and direction annotations
(in/out), which constrain the flow of arguments. These syntactic constraints ensure tractability,
prune the search space, and align the induced rules with domain knowledge.</p>
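          <p>As an illustration, a declarative bias for the intubation task might look as follows in the two systems' syntaxes. The body predicates shown here are hypothetical examples chosen for readability, not the paper's actual bias files.</p>

```prolog
% Progol/Andante-style bias (illustrative predicates):
:- modeh(1, complication(+patient)).
:- modeb(*, diabetes(+patient)).
:- modeb(*, shortneck(+patient)).
:- determination(complication/1, diabetes/1).
:- determination(complication/1, shortneck/1).

% Popper-style bias for the same task (normally a separate bias file):
% head_pred(complication, 1).
% body_pred(diabetes, 1).
% body_pred(shortneck, 1).
```

          <p>In both cases, the bias restricts which predicates may appear in rule heads and bodies, which is what keeps the induced hypotheses tractable and comparable across clients.</p>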
        <p>These declarative biases ensure that hypotheses remain logically tractable and comparable across
clients.
4.2.2. Data Partitioning
Each dataset is first split into a centralized training set (80%) and a held-out global test set (20%), using
stratified sampling on the target label to preserve class balance (details in Appendix A). The centralized
training set is then divided into three disjoint client partitions to simulate a cross-silo federated setting.
The global test set is likewise divided into three disjoint subsets, providing a local test set for each
client. Results are reported both per client on local tests and on the global test, which remains disjoint
from all training data. Evaluating also on the global test allows us to assess performance on a larger,
more diverse sample beyond the small client-specific tests. Centralized baselines train on the entire
centralized training split and are evaluated on the same global test. Unless stated otherwise, background
knowledge (BK) is kept local to each client and not shared.</p>
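          <p>The 80/20 stratified split followed by a three-way client partition described above can be sketched with a small stdlib-only routine. This is a simplified stand-in for the paper's actual pipeline; the function name and round-robin dealing strategy are our own assumptions.</p>

```python
import random
from collections import defaultdict

def stratified_client_split(labels, test_frac=0.2, n_clients=3, seed=0):
    """Split example indices 80/20 per class (stratified), then deal the
    training and test indices round-robin into disjoint client partitions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(len(idxs) * (1 - test_frac)))
        train += idxs[:cut]
        test += idxs[cut:]
    # Round-robin deal to simulate three disjoint cross-silo clients,
    # each with its own local test subset of the global test set.
    client_train = [train[c::n_clients] for c in range(n_clients)]
    client_test = [test[c::n_clients] for c in range(n_clients)]
    return client_train, client_test

labels = [1] * 50 + [0] * 50
tr, te = stratified_client_split(labels)
print([len(p) for p in tr], [len(p) for p in te])  # [27, 27, 26] [7, 7, 6]
```

          <p>Stratifying per class before dealing preserves the class balance in every partition, matching the setup used for both the federated clients and the held-out global test.</p>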
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Setup</title>
        <p>We simulate a cross-silo federated learning scenario using a custom orchestration framework implemented
in Python. The environment consists of three virtual clients, each assigned to a distinct data partition
reflecting a federated setup. The experimental pipeline includes the following components:
• Flower [62]: an open-source federated learning framework used to coordinate the training and
aggregation of decision tree models.
• Andante: a custom-built ILP system based on Progol, featuring an interactive user interface that
facilitates manipulation of background knowledge, examples, and induced rules.
• Popper: a state-of-the-art ILP system based on Answer Set Programming, which leverages a
“learning from failures” strategy to explore the hypothesis space.
• ID3 Decision Tree: a classical decision tree learning algorithm used as a baseline model in both
centralized and federated configurations.</p>
        <p>For ILP-based experiments, each client independently learns a local hypothesis using either Andante
or Popper. The resulting sets of logical rules are then evaluated on the corresponding test examples,
and aggregated predictions are computed using a majority voting scheme over the local hypotheses.
For the federated decision tree baseline, each client trains a local decision tree classifier, and predictions
are likewise aggregated via majority voting.</p>
        <p>All experiments adhere to fixed and consistent train/test splits across methods. ILP systems operate
over structured relational representations encoded in Prolog, while decision trees are trained on
conventional tabular data. Federated training and evaluation are conducted within a fully simulated
local environment using our custom orchestration pipeline in Python.
We first present results on the Post-Intubation Complications dataset, used as a real-world clinical test
case for this work. While limited in size, this dataset provides a realistic clinical setting to illustrate
the behavior of FILP when applied with Popper or Andante. Despite its small scale, this case study
highlights the strength of ILP methods in generalizing from very few examples while maintaining
interpretability.
4.4.1. Quantitative Analysis
Descriptive Accuracy Table 2 reports descriptive accuracy, i.e., how faithfully the induced rules
reproduce the logical training distribution. This metric, also referred to as fidelity, evaluates whether
the hypotheses correctly entail positive training examples and reject negative ones.</p>
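          <p>Descriptive accuracy (fidelity) as defined here can be computed directly from entailment counts on the training examples. A minimal sketch, with an assumed entailment interface and toy data:</p>

```python
def fidelity(entails, positives, negatives):
    """Fraction of training examples the hypothesis classifies consistently:
    positives it entails plus negatives it rejects, over all examples.
    `entails(e) -> bool` stands in for an ILP engine's entailment check."""
    tp = sum(1 for e in positives if entails(e))
    tn = sum(1 for e in negatives if not entails(e))
    return (tp + tn) / (len(positives) + len(negatives))

# Toy hypothesis: entail any example whose feature value exceeds a threshold.
entails = lambda e: e > 0.5
print(fidelity(entails, positives=[0.9, 0.7, 0.2], negatives=[0.1, 0.6]))  # 0.6
```
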
        <p>The results show a consistent pattern across methods: false positives are uniformly zero, meaning
neither Popper nor Andante ever misclassifies a negative example as positive. This guarantees that no
false complications are predicted, and precision is therefore always 1.00.</p>
          <p>Differences instead arise from false negatives, which directly lower recall. Popper achieves higher
recall (0.89–0.94 across clients) and accuracies above 0.87, slightly outperforming its centralized
counterpart. Andante performs moderately in the federated case (recall 0.63–0.78), but collapses when
centralized (accuracy 0.50).</p>
          <p>Finally, Popper consistently outperforms Andante in recall and overall accuracy. This difference is
attributable to their underlying search strategies: Popper (ASP-based) employs a constraint-driven,
optimal search that explores hypotheses more systematically, whereas Andante (Progol-based) relies on
heuristic, greedy refinement. As a result, Popper better captures the positive class while maintaining
compact rules, offering a more faithful symbolic representation of the data.</p>
          <p>Predictive Accuracy Table 3 reports predictive accuracy for FILP with Popper and Andante, compared
to decision trees. For ID3, the maximum depth was adjusted according to dataset size: for Alzheimer
(&gt;500 examples), depth was set to 9, while for the smaller datasets (&lt;500 examples), depth varied
between 3 and 5. For Popper, the maximum number of clauses and body literals was fixed to 6 in all
experiments, ensuring a controlled hypothesis space across clients.</p>
        <p>Results show that FILP consistently outperforms centralized ILP: Popper and Andante reach 0.60–0.75
accuracy per client, compared to 0.40–0.56 in the centralized setting. Popper exhibits more stable
accuracy across clients, whereas Andante shows larger variability, reflecting their distinct search strategies.
ID3 achieves accuracies ranging from 0.40 to 0.76 across client partitions, but only reaches moderate
performance in the centralized setting (0.60). Its effectiveness is limited by the propositionalization
step, which flattens relational structure into features and prevents trees from exploiting the richer
semantics available to ILP. These results indicate that FILP can match or surpass centralized ILP in
accuracy, while federated aggregation does not degrade model quality.
4.4.2. Qualitative Analysis
Decision trees are often considered interpretable because their feature-split structure yields
human-readable rules (e.g., IF arret = 0 AND hypotension = 0 THEN complication = 0). However,
these rules remain shallow: they operate on tabular features, optimize impurity reduction, and lack
alignment with domain semantics. In contrast, ILP induces relational rules in first-order logic that
capture clinically meaningful interactions, such as:
complication(X) :- diabetes(X), shortneck(X), intubation_duration(X,T).
complication(X) :- coronary_artery_disease(X), elderly(X).</p>
          <p>Crucially, ILP allows for the incorporation of domain-specific background knowledge directly into
the learning process, which is infeasible for traditional learners. For instance, the following rules encode
clinical facts: if a patient’s LEMON score is greater than or equal to 5, the intubation is expected to be
difficult; likewise, if the Cormack score is greater than 3, a difficult laryngoscopy is expected.
hard_intubation(X) :- lemonscore(X,S), S &gt;= 5.</p>
          <p>hard_laryngo(X) :- cormack(X,S), S &gt;= 3.</p>
        <p>Such rules can serve as constraints or inductive biases, enabling the model to generalize from structured
medical concepts rather than from feature statistics alone.</p>
          <p>While both decision trees and ILP produce readable models, ILP offers deeper plausibility by combining
relational reasoning with explicit domain knowledge. FILP extends this capacity to federated settings,
preserving interpretability and explainability properties essential for adoption in healthcare. An
interesting direction for future evaluation is to involve medical experts to assess whether the induced
rules (from ILP) or paths (from ID3) are more interpretable and clinically useful in practice.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.5. Results on Additional Benchmarks</title>
        <p>To complement the medical case study, we extended our evaluation to three established ILP benchmarks:
Alzheimer, Carcinogenesis, and Mutagenesis. As preliminary experiments indicated that Popper yields
slightly more stable and accurate results than Andante, we report the following benchmark results
using Popper as the ILP engine within FILP.
4.5.1. Explainability Metrics
Descriptive Accuracy. Figure 2 reports descriptive accuracy (fidelity) per client, averaged across
clients, and compared to the centralized baseline.</p>
        <p>Across datasets, client-level fidelity remains consistently high. For example, in Alzheimer (0.67–0.79)
and Carcinogenesis (0.63–0.75), local models capture the training distribution with strong alignment
to domain semantics. Intubation shows near-perfect fidelity (0.87–0.97), indicating that rules almost
completely explain the observed cases. Mutagenesis, by contrast, remains more difficult, with fidelity
around 0.50 across clients, reflecting its higher relational complexity.</p>
        <p>Importantly, the mean fidelity across clients is never lower than the centralized baseline (e.g., 0.73 vs.
0.63 in Alzheimer, 0.70 vs. 0.60 in Carcinogenesis), showing that decentralized learning preserves, and
in some cases enhances, the ability of ILP to faithfully describe training data.
Rule Complexity. Table 4 compares the model complexity of FILP (federated ILP), centralized Popper,
and federated ID3. The comparison is based on two structural metrics: the number of rules and the
average rule length.</p>
        <p>• For ILP systems (FILP/Popper), the number of rules corresponds to the number of induced clauses,
while the average rule length measures the number of body literals per clause. These values are
bounded by the declarative bias parameters (max_clauses, max_body).
• For ID3, the number of rules equals the number of leaves, and the average rule length is the
average depth of root–to–leaf paths. This makes the tree metrics functionally comparable to ILP
parameters: leaves correspond to max_clauses, while path depth parallels max_body.</p>
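          <p>Both structural metrics can be computed uniformly once a model is represented as a list of rules, where each rule is a list of conditions (body literals for an ILP clause, or split tests along an ID3 root-to-leaf path). This sketch is ours, not the paper's evaluation code, and the example theory is illustrative.</p>

```python
def complexity(model_rules):
    """Return (number of rules, average rule length).

    `model_rules` is a list of rules; each rule is a list of conditions
    (body literals for ILP clauses, split tests for ID3 paths)."""
    num_rules = len(model_rules)
    avg_len = sum(len(r) for r in model_rules) / num_rules if num_rules else 0.0
    return num_rules, avg_len

# Hypothetical induced theory: two clauses with 3 and 2 body literals.
ilp_theory = [
    ["diabetes(X)", "shortneck(X)", "intubation_duration(X,T)"],
    ["coronary_artery_disease(X)", "elderly(X)"],
]
print(complexity(ilp_theory))  # (2, 2.5)
```

          <p>Under this shared representation, ILP clause counts and ID3 leaf counts become directly comparable, as the bullet points above argue.</p>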
        <p>Table 4 highlights three findings. First, FILP induces fewer and shorter rules than centralized Popper
(e.g., Carcinogenesis: 12 vs. 27 clauses), showing that federated partitioning regularizes ILP instead of
inflating it. Second, compared to federated ID3, FILP achieves greater compactness: while ID3 often
yields slightly shorter paths, it does so at the cost of generating more rules (e.g., Mutagenesis: 2.67
vs. 7.33).</p>
          <p>Finally, FILP’s relational rules retain semantic alignment with domain knowledge, making them more
interpretable than the propositional splits of decision trees.</p>
          <p>(Table 4: num_rules and rule_len for FILP, centralized Popper, and federated ID3 on the Alzheimer,
Mutagenesis, Carcinogenesis, and Intubation datasets.)</p>
          <p>4.5.2. Predictive Accuracy
Figure 3 reports predictive accuracy for FILP across four datasets, comparing per-client performance,
the federated mean, and the centralized baseline. FILP achieves accuracies in the federated setting that
are comparable to, and in some cases higher than, centralized training. For instance, on Alzheimer (0.60
vs. 0.54) and Intubation (0.64 vs. 0.60), the federated mean exceeds the centralized baseline, showing
that distributed ILP does not compromise predictive utility.</p>
        <p>Accuracy varies across clients, particularly in Carcinogenesis (0.35–0.62), reflecting heterogeneity
in local partitions. Aggregation by majority voting mitigates this variability, producing federated
accuracies that remain close to the centralized baseline. Larger datasets (Alzheimer, Carcinogenesis)
provide more stable estimates, whereas smaller datasets (Mutagenesis, Intubation) exhibit higher
variance due to limited test sizes.</p>
        <p>Overall, these results suggest that ILP retains predictive robustness under federated constraints, with
federated accuracies closely matching or slightly exceeding centralized baselines. However, variability
across clients, particularly in smaller datasets, indicates that dataset scale and partitioning strongly
affect generalization performance.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we introduced FILP, a novel framework that integrates Inductive Logic Programming
into Federated Learning to enable symbolic, interpretable, and privacy-preserving model induction.
FILP allows each client to learn logic-based rules from local data using ILP systems such as Andante
(Progol-based) and Popper (ASP-based), and combines these symbolic models via majority voting to
derive a global consensus theory.</p>
      <p>Our evaluation on a real-world clinical task illustrates the trade-offs between symbolic and statistical
learners. Decision trees achieve competitive predictive performance, but their effectiveness is
constrained by the propositionalization step and their lack of semantic alignment with domain knowledge.
We chose decision trees as a baseline because they are the most established interpretable, rule-based
machine learning model, providing a natural comparison point for ILP in terms of both predictive
power and explanation style. ILP methods, in contrast, yield relational rules that align with clinical
semantics and domain knowledge, offering transparency and explanatory depth. Such models are
naturally aligned with expert validation and reasoning, a critical requirement in high-stakes fields like
medicine.</p>
      <p>FILP represents a step toward trustworthy and interpretable federated AI. The central contribution
of this study is to investigate whether ILP can be effectively integrated within a federated setting,
and whether this integration provides tangible benefits in terms of accuracy and interpretability.
Interestingly, our results indicate that in some cases federated ILP performs better than centralized ILP,
which may be explained by symbolic learners generalizing more effectively from smaller, structurally
coherent subsets. In contrast, centralized aggregation may introduce heterogeneity that impairs the
induction of compact logical rules.</p>
      <p>As future work, we aim to (i) broaden the evaluation to further ILP benchmarks beyond the ones
already considered, including comparisons against other rule-based and explainable learners beyond
decision trees, (ii) investigate symbolic aggregation strategies beyond majority voting, (iii) explore
richer integration of clinician-defined background knowledge to further enhance semantic fidelity and
generalization, and to evaluate potential gains in predictive accuracy, and (iv) assess the scalability and
robustness of FILP on larger and more diverse datasets.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors thank the University of Namur for its support and the Walloon Region for partial support
through the Ariac project (convention 210235) and the CyberExcellence project (convention 2110186).
Special thanks go to Dr. Khaled Ouertani for sharing the medical dataset.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to perform grammar and
spelling checks. After using this tool, the author(s) reviewed and edited the content as needed and take
full responsibility for the publication’s content.</p>
      <p>[4] N. Gahlan, D. Sethia, Federated learning in emotion recognition systems based on physiological
signals for privacy preservation: a review, Multimedia Tools and Applications (2024) 1–69.
[5] M. T. Ribeiro, S. Singh, C. Guestrin, “Why should I trust you?”: Explaining the predictions of
any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2016, pp. 1135–1144.
[6] N. Rieke, J. Hancox, W. Li, F. Milletari, H. R. Roth, S. Albarqouni, S. Bakas, M. Galtier, B. Landman,
K. H. Maier-Hein, et al., The future of digital health with federated learning, NPJ Digital Medicine
3 (2020) 1–7.
[7] T. Li, A. K. Sahu, A. Talwalkar, V. Smith, Federated learning: Challenges, methods, and future
directions, IEEE Signal Processing Magazine 37 (2020) 50–60.
[8] M. J. Sheller, B. Edwards, G. A. Reina, J. Martin, S. Pati, A. Kotrotsou, M. Milchenko, W. Xu,
D. Marcus, R. R. Colen, et al., Federated learning in medicine: facilitating multi-institutional
collaborations without sharing patient data, in: Scientific Reports, volume 10, Nature Publishing
Group, 2020, pp. 1–12.
[9] J. Xu, B. S. Glicksberg, C. Su, P. Walker, J. Bian, F. Wang, Federated learning for healthcare
informatics, Journal of Healthcare Informatics Research 5 (2021) 1–19.
[10] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, B. A. y Arcas, Communication-efficient
learning of deep networks from decentralized data, in: Artificial Intelligence and Statistics, PMLR,
2017, pp. 1273–1282.
[11] S. R. Heiyanthuduwage, I. Altas, M. Bewong, M. Z. Islam, O. B. Deho, Decision trees in federated
learning: Current state and future opportunities, IEEE Access (2024).
[12] A. Holzinger, C. Biemann, C. S. Pattichis, D. B. Kell, What do we need to build explainable ai
systems for the medical domain?, arXiv preprint arXiv:1712.09923 (2017).
[13] B. Goodman, S. Flaxman, European union regulations on algorithmic decision-making and a “right
to explanation”, AI Magazine 38 (2017) 50–57.
[14] S. Muggleton, Inductive logic programming, New generation computing 8 (1991) 295–318.
[15] A. Cropper, S. Dumančić, S. H. Muggleton, Turning 30: New ideas in inductive logic programming,
arXiv preprint arXiv:2002.11002 (2020).
[16] R. D. King, A. Srinivasan, Prediction of rodent carcinogenicity bioassays from molecular structure
using inductive logic programming., Environmental Health Perspectives 104 (1996) 1031–1040.
[17] R. D. King, S. Muggleton, R. A. Lewis, M. Sternberg, Drug design by machine learning: the use
of inductive logic programming to model the structure-activity relationships of trimethoprim
analogues binding to dihydrofolate reductase., Proceedings of the national academy of sciences 89
(1992) 11322–11326.
[18] A. Srinivasan, R. D. King, S. H. Muggleton, M. J. Sternberg, Carcinogenesis predictions using ilp
(1997) 273–287.
[19] Q. Li, Z. Wen, Z. Wu, S. Hu, N. Wang, Y. Li, X. Liu, B. He, A survey on federated learning systems:
Vision, hype and reality for data privacy and protection, IEEE Transactions on Knowledge and
Data Engineering 35 (2019) 3347–3366.
[20] B. Yuan, S. Ge, W. Xing, A federated learning framework for healthcare iot devices, arXiv preprint
arXiv:2005.05083 (2020).
[21] M. H. U. Rehman, W. Hugo Lopez Pinaya, P. Nachev, J. T. Teo, S. Ourselin, M. J. Cardoso, Federated
learning for medical imaging radiology, The British Journal of Radiology 96 (2023) 20220890.
[22] A. Qayyum, K. Ahmad, M. A. Ahsan, A. Al-Fuqaha, J. Qadir, Collaborative federated learning
for healthcare: Multi-modal covid-19 diagnosis at the edge, IEEE Open Journal of the Computer
Society 3 (2022) 172–184.
[23] J. Li, Y. Meng, L. Ma, S. Du, H. Zhu, Q. Pei, X. Shen, A federated learning based privacy-preserving
smart healthcare system, IEEE Transactions on Industrial Informatics 18 (2021).
[24] E. Mosca, F. Szigeti, S. Tragianni, D. Gallagher, G. Groh, Shap-based explanation methods: a review
for nlp interpretability, in: Proceedings of the 29th international conference on computational
linguistics, 2022, pp. 4593–4603.
[25] N. K. Alshammari, A. A. Alhusaini, A. Pasha, S. S. Ahamed, T. R. Gadekallu, M. Abdullah-Al-Wadud,
R. A. Ramadan, M. H. Alrashidi, Explainable federated learning for enhanced privacy in autism
prediction using deep learning, Journal of Disability Research 3 (2024) 20240081.
[26] D. Janzing, L. Minorics, P. Blöbaum, Feature relevance quantification in explainable ai: A causal
problem, in: International Conference on artificial intelligence and statistics, PMLR, 2020, pp.
2907–2916.
[27] C. Xu, G. Chen, C. Li, Federated learning for interpretable short-term residential load forecasting
in edge computing network, Neural Computing and Applications 35 (2023) 8561–8574.
[28] S. N. Zeleke, M. Bochicchio, Towards explainable federated learning in healthcare: A study on
heart arrhythmia detection (2024).
[29] C. Rudin, Stop explaining black box machine learning models for high stakes decisions and use
interpretable models instead, Nature machine intelligence 1 (2019) 206–215.
[30] P. Xing, S. Lu, H. Yu, Federated neuro-symbolic learning, arXiv preprint arXiv:2308.15324 (2023).
[31] M. Billa, Symbolic regression for medical scoring systems: a bayesian and multi-objective
approach, in: Sistemi Evoluti per Basi di Dati, 2024. URL: https://api.semanticscholar.org/CorpusID:
271866221.
[32] J. A. Scott, H. Zakerinia, C. Lampert, Pefll: Personalized federated learning by learning to learn,
in: 12th International Conference on Learning Representations, 2024.
[33] Q. Li, W. Zhaomin, Y. Cai, C. M. Yung, T. Fu, B. He, et al., Fedtree: A federated learning system for
trees, Proceedings of Machine Learning and Systems 5 (2023).
[34] S. Truex, N. Baracaldo, A. Anwar, T. Steinke, H. Ludwig, R. Zhang, Y. Zhou, A hybrid approach to
privacy-preserving federated learning, in: Proceedings of the 12th ACM workshop on artificial
intelligence and security, 2019, pp. 1–11.
[35] T. Dong, S. Li, H. Qiu, J. Lu, An interpretable federated learning-based network intrusion detection
framework, arXiv preprint arXiv:2201.03134 (2022).
[36] K. Gupta, P. Kumar, S. Upadhyaya, M. Poriye, S. Aggarwal, Fuzzy logic and machine learning
integration: Enhancing healthcare decision-making, International Journal of Computer Information
Systems and Industrial Management Applications 16 (2024) 20–20.
[37] G. Castellano, R. Scaringi, G. Vessio, G. Zaza, et al., Integrating graph neural networks and
fuzzy logic to enhance deep learning interpretability, in: Proceedings of the Fourth International
Workshop on Multilingual Semantic Web (MSW 2024), Paris, France, 2024, pp. 11–13.
[38] D. Połap, Fuzzy consensus with federated learning method in medical systems, IEEE Access 9
(2021) 150383–150392.
[39] V. Srivastava, V. Lamba, V. S. Mathada, C. Bulla, N. Gupta, P. Veeramanikandan, et al., An
iotbased framework employing fuzzy logic and federated learning for decentralized decision-making,
International Journal of Information Technology (2025) 1–7.
[40] W. Jiang, Y. Zhang, H. Han, X. Liu, J. Gwak, W. Gu, A. Shankar, C. Maple, Fuzzy ensemble-based
federated learning for eeg-based emotion recognition in internet of medical things, Journal of
Industrial Information Integration 44 (2025) 100789.
[41] A. Wilbik, P. Grefen, Towards a federated fuzzy learning system, 2021.
[42] P. Ducange, F. Marcelloni, A. Renda, F. Ruffini, Federated learning of xai models in healthcare: a
case study on parkinson’s disease, Cognitive Computation 16 (2024) 3051–3076.
[43] J. L. C. Bárcena, P. Ducange, F. Marcelloni, A. Renda, Increasing trust in ai through privacy
preservation and model explainability: Federated learning of fuzzy regression trees, Information
Fusion 113 (2025) 102598.
[44] Z. C. Lipton, The mythos of model interpretability: In machine learning, the concept of
interpretability is both important and slippery., Queue 16 (2018) 31–57.
[45] S. Muggleton, L. De Raedt, Inductive logic programming: Theory and methods, The Journal of</p>
      <p>Logic Programming 19 (1994) 629–679.
[46] A. Cropper, R. Morel, Learning programs by learning from failures, Machine Learning 110 (2021)
801–856.
[47] A. Cropper, Forgetting to learn logic programs, in: Proceedings of the AAAI Conference on</p>
      <p>Artificial Intelligence, volume 34, 2020, pp. 3676–3683.
[48] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, S. J. Gershman, Building machines that learn and think
like people, Behavioral and brain sciences 40 (2017).
[49] I. Bratko, S. Muggleton, Applications of inductive logic programming, Communications of the</p>
      <p>ACM 38 (1995) 65–70.
[50] P. Sen, B. W. de Carvalho, R. Riegel, A. Gray, Neuro-symbolic inductive logic programming
with logical neural networks, in: Proceedings of the AAAI conference on artificial intelligence,
volume 36, 2022, pp. 8212–8219.
[51] Z. Zhang, L. Yilmaz, B. Liu, A critical review of inductive logic programming techniques for
explainable ai, IEEE transactions on neural networks and learning systems 35 (2023) 10220–10236.
[52] T. Dreossi, et al., Bridging deep learning and logic programming for explainability through ilp,</p>
      <p>ELECTRONIC PROCEEDINGS IN THEORETICAL COMPUTER SCIENCE 416 (2025) 314–323.
[53] Z. Ma, Y. Zhuang, P. Weng, H. H. Zhuo, D. Li, W. Liu, J. Hao, Learning symbolic rules for
interpretable deep reinforcement learning, arXiv preprint arXiv:2103.08228 (2021).
[54] B. Wolfson, E. Acar, Differentiable inductive logic programming for fraud detection, arXiv preprint
arXiv:2410.21928 (2024).
[55] A. Siromoney, L. Raghuram, A. Siromoney, I. Korah, G. Prasad, Inductive logic programming for
knowledge discovery from mri data, IEEE Engineering in Medicine and Biology Magazine 19
(2000) 72–77.
[56] S. Muggleton, Inverse entailment and progol, New Generation Computing 13 (1995) 245–286.
[57] J. Cussens, Part-of-speech tagging using progol, in: International conference on inductive logic
programming, Springer, 1997, pp. 93–108.
[58] C. Helma, T. Cramer, S. Kramer, L. De Raedt, Data mining and machine learning techniques for
the identification of mutagenicity inducing substructures and structure activity relationships of
noncongeneric compounds, Journal of chemical information and computer sciences 44 (2004)
1402–1411.
[59] A. Cropper, S. H. Muggleton, Learning efficient logical robot strategies involving composable
objects, in: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence
(IJCAI), 2015, pp. 3423–3429.
[60] A. Srinivasan, Mutagenesis: Ilp experiments in a non-determinate biological domain, in: Proc. of
the Fourth International Workshop on Inductive Logic Programming (ILP’94), 1994.
[61] R. D. King, M. J. Sternberg, A. Srinivasan, Relating chemical activity to structure: an examination
of ilp successes, New Generation Computing 13 (1995) 411–433.
[62] D. J. Beutel, T. Topal, A. Mathur, X. Qiu, J. Fernandez-Marques, Y. Gao, L. Sani, H. L. Kwing,
T. Parcollet, P. P. d. Gusmão, N. D. Lane, Flower: A friendly federated learning research framework,
arXiv preprint arXiv:2007.14390 (2020).
Dataset partitions used for federated training and evaluation (Appendix A). Each dataset is first split into:
• Global split: 80% training vs. 20% global test (e.g., complication80/complication20,
muta80/muta20).
• Client folds: the 80% training portion is further divided into three client folds.
• Local tests: the global test set is partitioned into three disjoint subsets, yielding one local test
set per client (e.g., testcom1--3, mutagentest1--3, test1--3, testalzheimer1--3).
This organization ensures stratified splits for both federated clients and global evaluation.
(a) Post-Intubation Complications Dataset Partitions; (b) Alzheimer Acetyl Dataset Partitions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Topol</surname>
          </string-name>
          ,
          <article-title>High-performance medicine: the convergence of human and artificial intelligence</article-title>
          ,
<source>Nature Medicine 25</source>
          (
          <year>2019</year>
          )
          <fpage>44</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajkomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Oren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hajaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          , et al.,
          <article-title>Scalable and accurate deep learning with electronic health records</article-title>
          ,
<source>npj Digital Medicine 1</source>
          (
          <year>2018</year>
          )
          <fpage>18</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esteva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kuprel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Swetter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Blau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thrun</surname>
          </string-name>
          ,
          <article-title>Dermatologist-level classification of skin cancer with deep neural networks</article-title>
          ,
<source>Nature</source>
          <volume>542</volume>
          (
          <year>2017</year>
          )
          <fpage>115</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>