<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio D'Andreamatteo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea D'Angelo</string-name>
          <email>andrea.dangelo6@graduate.univaq.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Stilo</string-name>
          <email>giovanni.stilo@univaq.it</email>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>KAN, MLP, Machine Learning, Universal Approximation Theorem</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>In recent years, Kolmogorov-Arnold Networks (KANs), based on the theorem of the same name, have been explored as an alternative to the classic Multi-Layer Perceptron (MLP) architecture for deep neural networks. Despite showing some promising results for specific tasks, KANs are still limited by pathological cases where poor gradient behavior causes the network to fail. Moreover, the tuning and training requirements of KANs can be higher than those of MLPs. In this paper, we introduce a novel architecture called HybridKAN, which retains the overall structure of KANs but replaces the B-spline functions with sub-MLPs that approximate them. Our architecture is grounded in the Universal Approximation Theorem (UAT) and avoids abnormal gradient behavior by relying on sub-MLPs instead of B-spline functions. We test our new architecture under the same experimental setting as the original KAN paper and record significant improvements in both accuracy and training time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>HybridKAN: Leveraging</title>
    </sec>
    <sec id="sec-2">
      <title>Multi-Sized Sub-MLPs for</title>
    </sec>
    <sec id="sec-3">
      <title>Enhanced Performance</title>
      <sec id="sec-3-1">
        <title>1. Introduction</title>
        <p>CEUR Workshop Proceedings, ISSN 1613-0073</p>
        <p>With this change, we shift the underlying approximation-theoretic basis back to the Universal
Approximation Theorem (UAT) and improve the model’s expressive power and flexibility, allowing it
to generalize better across more complex, nonlinear relationships in the data while avoiding pathological
cases. HybridKAN merges the best of the two worlds: the more effective architecture of KANs with the
reliability of MLPs. Figure 1 shows a comparison between KAN and HybridKAN architectures.</p>
        <p>[Figure 1: Kolmogorov-Arnold (left) and Hybrid (right) architectures. HybridKAN keeps the
architecture of a KAN network but substitutes B-splines, located on the connections between nodes of different
layers, with Sub-MLPs. The figure on the left is taken from [4].]</p>
        <p>Specifically, Figure 1 shows that a HybridKAN of the same shape as the corresponding KAN maintains
the same architecture, but replaces B-spline functions with Sub-MLPs, small MLPs whose aim is to
approximate the B-spline function they replace.</p>
        <p>To evaluate our proposed architecture in comparison to KAN, we conduct several experiments
on a dataset of physics equations, the same dataset used in the original KAN paper [4]. Specifically:
(i) we perform an in-depth grid search over the parameters to find the best configurations for both KAN
and HybridKAN, and we report the best-performing architecture for each equation in the dataset;
(ii) we report the mean performance across all configurations to assess the stability of each model under
suboptimal settings; (iii) we analyze the loss surfaces to understand the optimization landscape of
HybridKAN compared to KAN.</p>
        <p>Our main contributions are the following.</p>
        <p>• We propose HybridKAN, a novel architecture that replaces spline-based functions in KANs with
small trainable MLP subnetworks (sub-MLPs).
• We empirically demonstrate that HybridKAN outperforms KANs in accuracy, training time, and
robustness to pathological cases.
• We analyze the loss surfaces, showing more reliable optimization for HybridKAN compared to
KAN.</p>
        <p>The remainder of the paper is structured as follows. In Section 2, we review related work. In Section 3,
we lay out preliminaries on KANs and B-spline functions. In Section 4, we illustrate our proposed HybridKAN
architecture. In Section 5, we list all the settings for the experiments discussed in Section 6. Lastly,
in Section 7, we draw conclusions.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2. Related Work</title>
        <p>Kolmogorov–Arnold Networks (KANs) were first proposed in [4] as an interpretable alternative to
traditional Multi-Layer Perceptrons (MLPs). Inspired by the Kolmogorov–Arnold representation
theorem [7, 8], KANs replace standard activation functions with learnable univariate functions and utilize
fixed sparse linear connections for improved interpretability. Since their introduction, KANs have been
applied across several domains, including time-series analysis [9], image classification [10, 11], and
tabular data [4]. In addition to architectural innovations, KANs have begun to attract attention for
their robustness and explainability. For example, recent studies have evaluated KANs under adversarial
conditions and shown that they may offer improved resistance to certain types of attacks compared to
standard MLPs [11]. Furthermore, the structure of KANs lends itself well to post-hoc explainability
analyses [4].</p>
        <p>Despite these advances, KANs remain a relatively young architecture, and their performance in certain
domains has not yet reached parity with more mature models such as convolutional or transformer-based
networks, as reported in different benchmarks [12, 13]. Nonetheless, the architecture has undergone
rapid iteration, with several extensions and enhancements being actively explored [14].</p>
        <p>For instance, FastKAN [15] introduces optimizations that significantly reduce training time and
improve scalability, thereby enhancing the feasibility of KANs for larger datasets. However, because
FastKAN approximates the original KAN formulation rather than preserving its exact architecture, it
does not provide a direct baseline for our comparative evaluation and is therefore excluded from the
scope of this study. Similarly, [16] approximates the B-splines with wavelet functions. Convolutional
KANs (CKANs) [17] integrate convolutional priors into the KAN framework to better handle spatial
information. Autoencoder variants of KANs have also been proposed [18], aiming to leverage KANs’
interpretability in unsupervised learning and representation learning settings.</p>
        <p>However, direct comparisons between KANs and MLPs have consistently highlighted the instability
and inconsistency of KANs, largely due to their problematic gradient behavior. In particular,
B-spline-based KANs often suffer from poor convergence and sensitivity to initialization, leading to divergent
loss trajectories or numerical instability during training. As shown in [5], while KANs can reduce
parameter counts, they struggle with higher training times and degraded performance (especially after
symbolic conversion) on complex classification tasks, and often require significantly more
computational resources. Similarly, [6] shows that certain KAN variants exhibit non-smooth loss landscapes
and divergent gradients (e.g., NaNs), especially with high-order polynomials or deeper architectures,
reinforcing the critical role of gradient behavior in their training stability.</p>
        <p>In this paper, we propose a novel MLP-based architecture that replaces the spline functions in the
original KAN design with small MLPs (sub-MLPs). This approach retains the structural benefits of the
KAN framework while mitigating issues related to unstable gradients during training and significantly
improving training time.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3. Preliminaries</title>
        <p>KAN Background. The Kolmogorov-Arnold Representation Theorem, also known as the Superposition
Theorem, was introduced in 1957 by Andrey Kolmogorov and later refined by his student Vladimir
Arnold [7, 8]. The theorem states: “If f is a multivariate continuous function on a bounded domain, then
f can be written as a finite composition of continuous univariate functions and the binary operation of
addition”.</p>
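        <p>Considering a continuous function $f : [0, 1]^n \to \mathbb{R}$, according to the theorem, there exist continuous
univariate functions $\phi_{q,p} : [0, 1] \to \mathbb{R}$ and $\Phi_q : \mathbb{R} \to \mathbb{R}$ such that:</p>
        <p>
          <disp-formula><label>(1)</label><tex-math>f(x) = f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)</tex-math></disp-formula>
        </p>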
        <p>In other words, the theorem states that any continuous multivariate function can be expressed as a
finite sum of compositions of continuous univariate functions. This powerful result lays the theoretical
foundation for architectures such as Kolmogorov–Arnold Networks (KANs), which model multivariate
mappings using learnable univariate transformations.</p>
        <p>The fundamental difference between Multi-Layer Perceptrons (MLPs) and Kolmogorov-Arnold
Networks (KANs) lies in how the networks handle non-linear transformations. In MLPs, nodes have fixed
activation functions to introduce nonlinearity, while edges carry the learnable linear weights. Conversely,
KANs place learnable activation functions on edges, parametrized as one-dimensional splines
that each process a single input variable, and the binary operation of sum on nodes, without additional
processing other than simply summing incoming signals. The task of introducing nonlinearity is thus shifted
from nodes to edges: in KANs there are no linear weight matrices, since each weight is replaced by a
learnable function $\phi$.</p>
        <p>A KAN layer with $n_{in}$-dimensional inputs and $n_{out}$-dimensional outputs can be expressed as a matrix
of learnable 1D functions: $\Phi = [\phi_{q,p}]$, $p = 1, 2, \ldots, n_{in}$; $q = 1, 2, \ldots, n_{out}$. Generally, the shape of a KAN
network is represented by an integer array $[n_0, n_1, \ldots, n_L]$. Between two consecutive layers, say $l$ and
$l+1$, there are $n_l \cdot n_{l+1}$ activation functions, where the activation function that connects $(l, i)$ and
$(l+1, j)$ is denoted as $\phi_{l,j,i}$, with $l = 0, \ldots, L-1$, $i = 1, \ldots, n_l$, $j = 1, \ldots, n_{l+1}$.</p>
        <p>Each activation function $\phi_{l,j,i}$ processes a pre-activation value $x_{l,i}$ and produces a post-activation value
$\tilde{x}_{l,j,i} = \phi_{l,j,i}(x_{l,i})$, which serves as part of the pre-activation value for the next layer, $x_{l+1,j}$.</p>
        <p>B-splines. Since all learned functions are univariate, it is possible to parameterize them as B-spline
curves, with learnable coefficients of local B-spline basis functions.</p>
        <p>B-spline curves are defined by a set
of control points, which represent the learned parameters. Unlike other composite curves,
control points in B-splines do not affect the entire curve but only provide local control, meaning that
modifying a single control point influences only the portion of the curve that it is associated with.</p>
        <p>In the KAN architecture, the B-spline functions are combined with a basis function $b(x)$, which plays a role
similar to a residual connection, such that the activation function $\phi(x)$ is the sum of the basis function
$b(x)$ and the spline function:</p>
        <p>
          <disp-formula><label>(2)</label><tex-math>\phi(x) = w_b \, b(x) + w_s \, \mathrm{spline}(x)</tex-math></disp-formula>
        </p>
        <p>where:</p>
        <p>
          <disp-formula><label>(3)</label><tex-math>b(x) = \mathrm{silu}(x) = \frac{x}{1 + e^{-x}}</tex-math></disp-formula>
        </p>
        <p>
          <disp-formula><label>(4)</label><tex-math>\mathrm{spline}(x) = \sum_{i} c_i B_i(x)</tex-math></disp-formula>
        </p>
        <p>The spline function is expressed as a sum of weighted basis functions $B_i$, where $c_i$ are the scalar
coefficients corresponding to each B-spline basis function
$B_i$ that determine its contribution to the
overall spline. These coefficients are learned and adjusted during training to fit the B-spline to the
target function that needs to be approximated. In addition to the coefficients
$c_i$, the parameters $w_b$
and $w_s$ are learned to better control the overall contribution of the basis and spline components to the
activation function.</p>
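        <p>As a minimal illustration of a single KAN edge, the sketch below evaluates an activation of the form
basis term plus spline term in plain Python. It is not the reference KAN implementation: the helper names
(make_knots, bspline_basis, kan_activation), the uniform extended knot vector, and the input range [-1, 1]
are assumptions made only for this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def silu(x):
    """Basis function b(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


def make_knots(grid_size, k, lo=-1.0, hi=1.0):
    """Uniform knots on [lo, hi], extended by k intervals per side (assumption)."""
    h = (hi - lo) / grid_size
    return [lo + (j - k) * h for j in range(grid_size + 2 * k + 1)]


def bspline_basis(i, k, knots, x):
    """Cox-de Boor recursion for the i-th B-spline basis function of degree k."""
    if k == 0:
        return 1.0 if knots[i] <= x < knots[i + 1] else 0.0
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    left = 0.0
    if left_den != 0.0:
        left = (x - knots[i]) / left_den * bspline_basis(i, k - 1, knots, x)
    right = 0.0
    if right_den != 0.0:
        right = (knots[i + k + 1] - x) / right_den * bspline_basis(i + 1, k - 1, knots, x)
    return left + right


def kan_activation(x, coeffs, knots, k, w_b=1.0, w_s=1.0):
    """phi(x) = w_b * b(x) + w_s * sum_i c_i * B_i(x)."""
    spline = sum(c * bspline_basis(i, k, knots, x) for i, c in enumerate(coeffs))
    return w_b * silu(x) + w_s * spline


GRID, K = 5, 3                 # 5 grid intervals, cubic splines
KNOTS = make_knots(GRID, K)
N_BASIS = GRID + K             # number of basis functions, hence of coefficients c_i

if __name__ == "__main__":
    coeffs = [0.1 * i for i in range(N_BASIS)]   # illustrative values, not learned
    print(kan_activation(0.3, coeffs, KNOTS, K))
```
]]></preformat>
        <p>In this sketch only the coefficients and the two scales would be trainable; because each basis function is
non-zero on a few neighbouring intervals, changing one coefficient alters the curve only locally.</p>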
      </sec>
      <sec id="sec-3-4">
        <title>4. HybridKAN - Approximating Splines with sub-MLPs</title>
        <p>To address the limitations of KANs mentioned in Section 2, we propose a new architecture, called the
Hybrid Kolmogorov-Arnold Network (HybridKAN), which is a modified version of the Kolmogorov-Arnold
Network that replaces the spline-based learnable activation functions on edges with small trainable
MLP subnetworks. With this new approach, our aim is to maintain the advantages of the structural
composition of KANs, which mainly consists of isolated transformations of the input variables and
pairwise connections to the aggregating nodes, while trying to improve its expressive power, adaptive
learning, and scalability through the use of sub-MLPs rather than spline-based functions.</p>
        <p>HybridKAN diverges from the Kolmogorov-Arnold Representation Theorem (KART), which forces
the decomposition of the underlying multivariate function into a sum of univariate functions, to return
to the classic Universal Approximation Theorem. The aim of shifting the underlying approximation
theory is to entirely avoid the possibility that some multivariate functions may be decomposed into
non-smooth or irregular functions, effectively enhancing the model’s performance and flexibility. While
KANs explicitly follow the theorem by ensuring that each spline-based edge approximates a single
univariate function, HybridKAN replaces this mechanism with small subnetworks, increasing the
expressive power of the network and allowing it to detect more complex patterns and relationships
within the data.</p>
        <p>As shown in Figure 2, the HybridKAN architecture consists of multiple stacked HybridKAN layers.
Each layer contains a set of independent sub-MLPs that transform single input features separately
before aggregating them on the nodes. The number of sub-MLPs per layer is determined by the number
of input nodes and output nodes, with each subnetwork assigned to process one specific input node
and contribute to one specific output node. Consider a layer $l$ that has $n_l$ input nodes and $n_{l+1}$
output nodes. In this layer, the input vector $x^{(l)} \in \mathbb{R}^{n_l}$ is processed to generate an aggregated output
vector $x^{(l+1)} \in \mathbb{R}^{n_{l+1}}$.</p>
        <p>[Figure 2: HybridKAN architecture. A HybridKAN and the corresponding KAN have
the same shape; however, every B-spline is replaced by a sub-MLP, in this case of shape [1,3,3,1]. The aim of
these sub-MLPs is to approximate the original B-splines.]</p>
        <p>To achieve this, the layer is composed of $n_l \times n_{l+1}$ sub-MLPs. These subnetworks are denoted
as $\mathrm{subMLP}^{(l)}_{i,j}$, where $i \in \{1, 2, \ldots, n_l\}$ represents the index of the input node and
$j \in \{1, 2, \ldots, n_{l+1}\}$ corresponds to the index of the output node. Each sub-MLP applies a transformation
to its assigned input variable, and the outputs of all sub-MLPs corresponding to a given node are summed to produce
the aggregated output of that node.</p>
        <p>As an example, in Figure 2, layer 0 has two input nodes ($x^{(0)}_1$ and $x^{(0)}_2$) and three output nodes
($x^{(1)}_1$, $x^{(1)}_2$, $x^{(1)}_3$). As a result, layer 0 is composed of 6 sub-MLPs.</p>
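        <p>The layer structure described above can be sketched in a few lines of plain Python. This is an illustrative
forward pass only, not the authors' implementation: the class names (SubMLP, HybridKANLayer), the tanh
nonlinearity, the use of biases, and the random initialization are all assumptions of this example. Each edge
(i, j) owns one sub-MLP that maps a scalar to a scalar, and each output node sums the sub-MLP outputs that
point to it.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
import random


class SubMLP:
    """Tiny MLP mapping one scalar to one scalar, e.g. shape [1, 3, 3, 1]."""

    def __init__(self, shape, rng):
        self.layers = []
        for fan_in, fan_out in zip(shape[:-1], shape[1:]):
            weights = [[rng.uniform(-1, 1) for _ in range(fan_in)] for _ in range(fan_out)]
            biases = [rng.uniform(-1, 1) for _ in range(fan_out)]
            self.layers.append((weights, biases))

    def __call__(self, x):
        h = [x]
        for depth, (weights, biases) in enumerate(self.layers):
            h = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
            if depth < len(self.layers) - 1:      # no nonlinearity on the output
                h = [math.tanh(v) for v in h]     # tanh is an assumption
        return h[0]


class HybridKANLayer:
    """n_in x n_out independent sub-MLPs; output node j sums over the inputs i."""

    def __init__(self, n_in, n_out, sub_shape, rng):
        self.sub_mlps = [[SubMLP(sub_shape, rng) for _ in range(n_out)] for _ in range(n_in)]

    def __call__(self, x):
        n_out = len(self.sub_mlps[0])
        return [sum(self.sub_mlps[i][j](x[i]) for i in range(len(x))) for j in range(n_out)]


def build_hybridkan(shape, sub_shape, seed=0):
    rng = random.Random(seed)
    return [HybridKANLayer(a, b, sub_shape, rng) for a, b in zip(shape[:-1], shape[1:])]


def forward(layers, x):
    for layer in layers:
        x = layer(x)
    return x


net = build_hybridkan([2, 3, 1], [1, 3, 3, 1])          # shape used in Figure 2
n_sub_layer0 = sum(len(row) for row in net[0].sub_mlps)  # 2 * 3 sub-MLPs
y = forward(net, [0.5, -0.2])

if __name__ == "__main__":
    print(n_sub_layer0, y)
```
]]></preformat>
        <p>For the [2,3,1] network, layer 0 holds 2 x 3 = 6 sub-MLPs and layer 1 holds 3 x 1 = 3, matching the count
given in the text.</p>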
        <p>The most important difference from MLPs occurs in the first layer, where each input feature
undergoes a separate transformation before aggregation, enabling modular and independent feature
representations. The subsequent layers then refine these representations through additional
transformations and summations. Formally, let $x = (x_1, x_2, \ldots, x_n)$ be the input vector. Each feature $x_i$ is passed
to a set of sub-MLPs, one for each node in the next layer. The transformed output of the sub-MLP
associated with the $i$-th input feature and the $j$-th node in the layer is the following:</p>
        <p>
          <disp-formula><label>(5)</label><tex-math>\tilde{x}^{(l)}_{i,j} = \mathrm{subMLP}^{(l-1)}_{i,j}\left(x^{(l-1)}_i\right)</tex-math></disp-formula>
        </p>
        <p>After independent transformations of the input variables, the layer nodes aggregate these transformed
values by summing them. The output of the $j$-th node in the layer is computed as:</p>
        <p>
          <disp-formula><label>(6)</label><tex-math>x^{(l)}_j = \sum_{i=1}^{n_{l-1}} \tilde{x}^{(l)}_{i,j}</tex-math></disp-formula>
        </p>
        <p>Each node in the subsequent layers receives the transformed signals from the previous layer outputs,
which were already aggregated from multiple input features. This implies that while the first layer
processes raw input features independently, all subsequent layers work on already transformed inputs
to refine the representation learned in the previous layers, making the network able to learn increasingly
complex function approximations. For the final output node $x^{(L)}$:</p>
        <p>
          <disp-formula><label>(7)</label><tex-math>x^{(L)} = \sum_{i=1}^{n_{L-1}} \mathrm{subMLP}^{(L-1)}_{i,1}\left(x^{(L-1)}_i\right)</tex-math></disp-formula>
        </p>
        <p>Consider a 2-layer HybridKAN network with shape [2,3,1], like the one in Figure 2. Then, the first
layer computes:</p>
        <p>
          <disp-formula><label>(8)</label><tex-math>\begin{gathered}
x^{(1)}_1 = \mathrm{subMLP}^{(0)}_{1,1}\left(x^{(0)}_1\right) + \mathrm{subMLP}^{(0)}_{2,1}\left(x^{(0)}_2\right) = \sigma\left[W^{(0)}_{1,1}\left(x^{(0)}_1\right) + b^{(0,1)}_1\right] + \sigma\left[W^{(0)}_{2,1}\left(x^{(0)}_2\right) + b^{(0,1)}_2\right] \\
x^{(1)}_2 = \mathrm{subMLP}^{(0)}_{1,2}\left(x^{(0)}_1\right) + \mathrm{subMLP}^{(0)}_{2,2}\left(x^{(0)}_2\right) = \sigma\left[W^{(0)}_{1,2}\left(x^{(0)}_1\right) + b^{(0,2)}_1\right] + \sigma\left[W^{(0)}_{2,2}\left(x^{(0)}_2\right) + b^{(0,2)}_2\right] \\
x^{(1)}_3 = \mathrm{subMLP}^{(0)}_{1,3}\left(x^{(0)}_1\right) + \mathrm{subMLP}^{(0)}_{2,3}\left(x^{(0)}_2\right) = \sigma\left[W^{(0)}_{1,3}\left(x^{(0)}_1\right) + b^{(0,3)}_1\right] + \sigma\left[W^{(0)}_{2,3}\left(x^{(0)}_2\right) + b^{(0,3)}_2\right]
\end{gathered}</tex-math></disp-formula>
        </p>
        <p>The final output is then computed on the previous representations as:</p>
        <p>
          <disp-formula><label>(9)</label><tex-math>x^{(2)}_1 = \mathrm{subMLP}^{(1)}_{1,1}\left(x^{(1)}_1\right) + \mathrm{subMLP}^{(1)}_{2,1}\left(x^{(1)}_2\right) + \mathrm{subMLP}^{(1)}_{3,1}\left(x^{(1)}_3\right)</tex-math></disp-formula>
        </p>
        <p>The sub-MLPs thus allow the network to detect more complex patterns and relationships
within the data, avoiding all pathological cases of non-smooth or irregular univariate functions inherited
from KART. HybridKANs are no longer constrained by the structural limitations of summing univariate
functions, allowing the model to dynamically approximate more complex relationships and adapt
to a broader range of function approximations. In particular, the use of sub-MLPs in the first layer
can provide an adaptive feature transformation, meaning that each input feature can be processed
independently in a more flexible way before being aggregated. Only in the subsequent layers can the aggregated
features be refined to better approximate the function.</p>
        <p>Additionally, HybridKANs introduce a certain degree of regularization through the summation
mechanism at nodes. Since each node aggregates transformed signals from multiple sub-MLPs, it
can reduce redundant information and the risk of over-reliance on individual features, leading to
improved generalization and making the model more robust to noise and more suitable for real-world
applications, where the input often has dependencies between features that are difficult to capture explicitly.</p>
        <p>Computational Complexity. In terms of computational complexity, both models have an
asymptotic complexity of $O(N^2)$, where $N$ denotes the width of the layers. The models share a similar
complexity: HybridKAN’s complexity is expressed as $O(N^2 L P)$, where $L$ denotes the number of layers
and $P$ represents the number of internal parameters of a sub-MLP, while KAN’s complexity is expressed
as $O(N^2 L G)$, where $G$ denotes the number of control points in the splines. Typically, $G$ is smaller than
$P$. However, the increased computational cost required for updating the more complex spline-based
functions can negate this parameter advantage. In contrast, even if HybridKAN involves a larger set of
parameters to update, these parameters are simple linear weights. As a result, HybridKAN’s overall
computational cost often remains comparable to, or even lower than, the computational cost of KAN.</p>
      </sec>
      <sec id="sec-3-5">
        <title>5. Experimental Settings</title>
        <p>To assess the performance of the HybridKAN model, we tested both HybridKAN and KAN and compared
their performance on a regression task to determine whether the modifications introduced in the
HybridKAN model can actually lead to enhanced performance compared to the standard KAN model in
terms of accuracy and/or training time. Both architectures were trained and tested using the same data
samples and a fixed number of epochs to ensure reproducibility of the results.</p>
        <p>Dataset. The task is designed to replicate the experiment performed by the authors of KAN [4], who
evaluated their model on a set of 27 artificially generated datasets from different physics equations,
selected from a larger collection [19]. The equations were chosen to represent a variety of functional forms and
complexities and were used to synthetically generate datasets divided into 10,000 training samples and 10,000
testing samples. In our study, for a fair comparison, we select the same subset of equations that was
selected by the original authors of KAN. In the following, we reference the equations by their ID
in the original Feynman dataset, which implies that some IDs may be missing. See Appendix A for more
details.</p>
        <p>Goal. The datasets are inherently free of noise, since every data point is drawn from a
well-defined underlying multivariate formula. Our primary goal was therefore to
check the models’ ability to approximate these underlying functions as precisely as possible in a reasonable
amount of training time. Any deviation between the model prediction and the true function is
due only to the model’s ability to approximate the function and minimize the loss, rather than
to variability or noise in the data.</p>
        <p>Hardware. All runs were computed on the Caliban HPC cluster of the University of L’Aquila, in
isolation on a Rocky Linux 8 server with a 32-core Intel(R) Xeon(R) CPU @ 2.30GHz, up to 64GB of RAM,
and an NVIDIA A100 GPU.</p>
        <p>Parameters. The experiment involved an extensive grid search over hyperparameters, exploring
different macro-structure configurations, such as variations in the number of layers and nodes in
each layer, as well as different combinations of lower-level parameters such as neurons, grid points,
and spline orders ($k$ values). The search aims to identify the optimal configurations for
both models, so that they can be compared fairly, and to understand the extent to which these settings affect
performance. For the macro-structures, to determine to what extent various depths and widths affect performance,
five different configurations were tested: $[(n_{in}, 3, 1), (n_{in}, 5, 1), (n_{in}, 3, 3, 1), (n_{in}, 5, 3, 1), (n_{in}, 5, 5, 1)]$,
where $n_{in}$ is the number of input variables of the equation.
Additionally, various lower-level hyperparameters were explored. For the HybridKAN model, each
sub-MLP is composed of two layers with a set of predefined combinations of neurons: [(4, 6), (4, 8), (4,
10), (6, 6), (6, 8), (6, 10), (8, 6), (8, 8), (8, 10), (10, 6), (10, 8), (10, 10)]. Given the unprecedented performance
of the HybridKAN model, a broader range of hyperparameter configurations was chosen in order to
understand its variability and potential. All combinations use only four values (4, 6, 8, and
10), which makes it possible to include mirror pairs and check whether an expansion or contraction in the number of
neurons between the layers of the sub-MLP has a different impact on the model’s behavior. For
the KAN model, which was designed specifically for symbolic function approximation, a grid search was
performed on the number of control points for the B-spline and on the $k$ value (spline
order). The grid points were set to 3, 5, 10, and 15, while the $k$ value was set to 3 and 5. For both models,
the learning rate was set to 1 due to the use of the LBFGS optimizer, which adjusts the
effective step size over epochs to facilitate convergence in a full-batch approach.</p>
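        <p>To make the size of the search space concrete, the following sketch enumerates the configurations listed
above and counts the trainable parameters attached to a single edge in each model. The function names are
hypothetical, and two counting conventions are assumptions of this example rather than statements from the
experiments: each sub-MLP is taken to have shape [1, a, b, 1] with biases, and each KAN edge is taken to hold
G + k spline coefficients plus the two scales.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from itertools import product

# Hidden and output widths of the five macro-structures; the input width
# n_in depends on the equation and is prepended per equation.
MACRO_TAILS = [(3, 1), (5, 1), (3, 3, 1), (5, 3, 1), (5, 5, 1)]
SUB_MLP_NEURONS = [(4, 6), (4, 8), (4, 10), (6, 6), (6, 8), (6, 10),
                   (8, 6), (8, 8), (8, 10), (10, 6), (10, 8), (10, 10)]
KAN_GRIDS = (3, 5, 10, 15)
KAN_ORDERS = (3, 5)


def hybridkan_configs(n_in):
    """All (network shape, sub-MLP neurons) pairs for one equation."""
    return [((n_in,) + tail, neurons) for tail, neurons in product(MACRO_TAILS, SUB_MLP_NEURONS)]


def kan_configs(n_in):
    """All (network shape, grid size, spline order) triples for one equation."""
    return [((n_in,) + tail, g, k) for tail, g, k in product(MACRO_TAILS, KAN_GRIDS, KAN_ORDERS)]


def n_edges(shape):
    """Number of edges (one learnable function each) in a network of this shape."""
    return sum(a * b for a, b in zip(shape[:-1], shape[1:]))


def sub_mlp_params(neurons):
    """Weights plus biases of a [1, a, b, 1] sub-MLP (biases are an assumption)."""
    sizes = (1,) + tuple(neurons) + (1,)
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))


def kan_edge_params(grid, k):
    """Assumed count: G + k spline coefficients plus the scales w_b and w_s."""
    return grid + k + 2


if __name__ == "__main__":
    n_in = 2
    print(len(hybridkan_configs(n_in)), "HybridKAN configurations")
    print(len(kan_configs(n_in)), "KAN configurations")
    shape = (n_in, 3, 1)
    print(n_edges(shape) * sub_mlp_params((4, 6)), "HybridKAN parameters for", shape)
    print(n_edges(shape) * kan_edge_params(3, 3), "KAN parameters for", shape)
```
]]></preformat>
        <p>Under these assumptions, the smallest sub-MLP already carries more parameters per edge than the smallest
spline, which is the parameter gap that the complexity discussion weighs against the cost of updating splines.</p>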
      </sec>
      <sec id="sec-3-6">
        <title>6. Experimental Analysis</title>
        <p>In this section, we present a comprehensive analysis of the experimental results obtained by comparing
KAN and HybridKAN architectures. The main objective of this analysis is to evaluate the performance,
stability, and optimization characteristics of the proposed HybridKAN model relative to KAN across a
diverse set of equations from the Feynman dataset.</p>
        <p>We structure our analysis into three main parts. First, we identify and compare the best-performing
configurations for each model on each equation to assess their peak predictive capabilities (Section
6.1). Second, we evaluate model stability and training efficiency by analyzing aggregated statistics,
such as average accuracy and training times across all tested configurations, to understand each model’s
robustness and sensitivity to hyperparameter choices (Section 6.2). Finally, we investigate the loss
surfaces of both models, examining how different macro- and micro-level parameter configurations
influence their optimization landscapes and performance, thereby shedding light on their generalization
behavior (Section 6.3).</p>
        <p>Through this structured analysis, we aim to provide a comprehensive evaluation of HybridKAN’s
advantages over KAN in terms of performance, reliability, and architectural flexibility.</p>
        <sec id="sec-3-6-1">
          <title>6.1. Best performing architecture per equation</title>
          <p>We now report the performance of the best performing networks on the selected equations.</p>
          <p>As detailed in Section 5, we performed a grid search over several hyperparameters for both KAN
and HybridKAN. The optimal configuration for the models was selected based on the highest accuracy
achieved, while keeping training time as a tie-breaker when differences in terms of accuracy were
minimal.</p>
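          <p>The selection rule can be sketched as follows. The tolerance below which two accuracies count as
minimally different is not stated in the text, so the value tol used here, together with the record layout and
the function name select_best, is a hypothetical choice for illustration.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def select_best(runs, tol=1e-3):
    """Pick the most accurate run; among near-ties, pick the fastest one.

    `runs` is a list of dicts with keys "accuracy" and "train_time".
    `tol` is a hypothetical threshold for a "minimal" accuracy difference.
    """
    best_accuracy = max(run["accuracy"] for run in runs)
    near_ties = [run for run in runs if best_accuracy - run["accuracy"] <= tol]
    return min(near_ties, key=lambda run: run["train_time"])


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    runs = [
        {"name": "a", "accuracy": 0.9990, "train_time": 12.0},
        {"name": "b", "accuracy": 0.9995, "train_time": 30.0},
        {"name": "c", "accuracy": 0.9500, "train_time": 1.0},
    ]
    print(select_best(runs)["name"])
```
]]></preformat>
          <p>With tol set to zero, the rule reduces to picking the single most accurate configuration.</p>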
          <p>The individual best performance and the corresponding parameters are reported in Appendix B.</p>
          <p>Figure 3 shows the $R^2$ score (where higher is better) achieved by the best models of KAN and
HybridKAN for each equation. The HybridKAN architecture outperforms KANs on nearly all equations,
as validated by the average across all equations shown in dashed circles. Moreover, the poor gradient
behavior inherent in KANs is clearly demonstrated by their inability to achieve acceptable performance
on Equation 5. In contrast, HybridKAN effectively mitigates this pathological behavior, producing
consistent and reliable results on the same equation, thus highlighting its improved stability and
robustness.</p>
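          <p>For reference, the coefficient of determination can be computed with the standard formula, one minus the
ratio of the residual sum of squares to the total sum of squares. The snippet below is a plain-Python sketch of
that definition; it is not taken from the experimental code, and the function name is chosen for this example.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    targets = [1.0, 2.0, 3.0]
    print(r2_score(targets, targets))            # perfect fit
    print(r2_score(targets, [2.0, 2.0, 2.0]))    # always predicting the mean
```
]]></preformat>
          <p>A perfect fit gives a score of 1, while a model that always predicts the mean of the targets gives 0; a model
that fails to train can fall below 0.</p>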
          <p>Figure 4 shows the test loss (Figure 4a, where lower is better) and the training times (Figure 4b, where
lower is better) of the best performing KAN and HybridKAN architectures over a grid search.</p>
          <p>As Figure 4a shows, the HybridKAN architecture generally achieves lower test losses (aside from a
few exceptions in Equations 29, 30, and 56) with respect to the KAN architecture, which was specifically
tested on these equations by the original authors. The mean test loss, shown as dashed circles, is also
lower for the HybridKAN architecture, supporting the superior performance of HybridKAN.</p>
          <p>Figure 4b shows that the training time of HybridKANs is generally lower than that of KANs, as
indicated by the dashed circle showing the average training times. HybridKAN shows more consistent
and regular training times, whereas KANs sometimes peak abnormally (as in Equations 2, 29, 51, and 82).</p>
          <p>These results show that the HybridKAN architecture is better both in terms of performance ($R^2$ score,
test loss) and training times, generally taking less time to train to reach its best performance.</p>
          <p>[Figure 4: (a) Test losses across equations; (b) Training times across equations. In both panels, a smaller covered area indicates better performance.]</p>
        </sec>
        <sec id="sec-3-6-2">
          <title>6.2. Evaluating Model Stability and Training Efficiency</title>
          <p>In this phase of the analysis, we focus on evaluating aggregated statistics (the average,
standard deviation, minimum, and maximum) of key performance indicators across all configurations
tested in the grid search. By analyzing these metrics, we aim to understand how different architectural
and hyperparameter choices influence the performance of both models. This allows us to evaluate each
model’s sensitivity to such variations and to identify trade-offs between metrics across configurations.</p>
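          <p>A minimal sketch of this aggregation step is shown below, using only the Python standard library. The
function name summarize and the choice of the sample standard deviation are assumptions of this example; the
text does not state which estimator was used.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean, stdev


def summarize(values):
    """Average, standard deviation, minimum and maximum of one metric
    collected over all grid-search configurations of a model."""
    return {
        "mean": mean(values),
        "std": stdev(values) if len(values) > 1 else 0.0,  # sample std (assumption)
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    # Made-up scores of three configurations on one equation.
    print(summarize([0.91, 0.97, 0.99]))
```
]]></preformat>
          <p>Applied per equation and per model, the mean gives the bar height and the standard deviation gives the
error bar of plots such as those discussed next.</p>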
          <p>Figure 5 shows the $R^2$ scores across all models tested in the grid search. The color of the equation
IDs on the X-axis shows the best-performing architecture. HybridKAN surpasses KAN on the
vast majority of equations. Again, KANs collapse on Equation 5, corroborating the pathological behavior.
Moreover, KAN exhibits significantly larger error bars for most of the equations, indicating that it is
the model most subject to variability in performance and suggesting that it is more sensitive
to different initializations or specific equation structures. HybridKAN, on the other hand, has smaller
error bars, meaning that its performance is stable across different configurations, making the model
more flexible with respect to particular conditions.</p>
          <p>Figure 6 shows training times in a similar fashion. Although the highest peaks are reached by the
HybridKAN model, a closer look at the error bars reveals that KAN exhibits greater variability in training
time, with larger error bars in most cases. Despite occasional high peaks,
HybridKAN’s overall training times are more consistent, highlighting that the model is more
predictable and reliable in terms of running time as well, compared to the more erratic behavior
of KAN.</p>
          <p>Overall, these results demonstrate that HybridKAN not only achieves superior performance compared
to KAN across the majority of equations but also exhibits greater robustness and stability. While KAN
models display substantial variability in both accuracy and training times, indicating sensitivity to
initialization and equation structure, HybridKAN maintains consistent performance and training
efficiency across configurations.</p>
        </sec>
        <sec id="sec-3-6-3">
          <title>6.3. Understanding loss surfaces for KAN and HybridKAN</title>
          <p>This final part of the analysis investigates the impact of different initialization choices on model
performance, focusing specifically on test loss and accuracy. In particular, the study distinguishes
between the effects of macro-level parameters, which refer to architectural aspects such as the number
of layers and nodes per layer, and micro-level parameters, which include lower-level configuration
settings such as the number of neurons in each sub-MLP for HybridKAN, or the number of control
points (grid size) for splines in KAN. By examining the influence of both parameter levels, we seek to
understand how their combinations affect each model’s ability to approximate the underlying functions,
as measured by test loss.</p>
          <p>Figure 7 shows the test loss surface and heatmap for HybridKAN across different configurations.
Specifically, width refers to the shape of the larger model, while Neurons is the number of neurons
in each sub-MLP. HybridKAN displays a more structured and layered test loss surface, with marked
variations in loss depending on the architecture and hyperparameters. In particular, lower test losses
(yellow regions) are observed for deeper architectures with more neurons per sub-MLP. This suggests
that increasing the number of neurons in HybridKAN notably improves the performance of the model.
However, there are diminishing returns in increasing the size of the sub-MLPs, since the improvements in
loss from added neurons become small, even if they can still be relevant. Higher
peaks, indicated by purple regions, correspond to shallower networks with fewer
neurons, meaning that, in general, under-parametrized configurations struggle to generalize across
multiple equations. This suggests that the HybridKAN network benefits from, and scales
well with, added complexity, allowing for more control over variability, which is
also suggested by the lack of peaks during the descent.</p>
          <p>[Figure 7: (a) Test Loss Surface for HybridKAN; (b) Test Loss Heatmap for HybridKAN.]</p>
          <p>In KAN, as shown in Figure 8, the test loss surface is more irregular, with sharp jumps between
high and low losses at certain configurations, suggesting that the model is more sensitive to grid
size and architectural width. Unlike HybridKAN, where increasing both the width and the depth of the model
always reduces test loss, KAN exhibits a different pattern for each grid size, indicating that the
right choice of this hyperparameter depends on the task at hand: grid sizes that are too small or too large
may lead to higher test losses and, therefore, reduced performance.</p>
          <p>In fact, there are regions where adding or removing grid points worsens performance, likely due to
overfitting or insufficient capacity to represent the function. These results indicate
that, unlike HybridKAN, where adding neurons consistently improves performance, KAN
requires a specific grid size for each architectural choice, as poor choices can worsen
performance. For better visualization, the surface plots of both models are accompanied by
heatmaps of the average test loss, computed over all equations for every combination of macro-
and micro-level configurations.</p>
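          <p>As an illustration of how such heatmap cells could be obtained, the sketch below averages the test loss over equations for each pair of macro-level width and micro-level value. The record layout, the helper name <monospace>average_loss_grid</monospace> and the numbers are assumptions made for this example; the values are toy data and not results from the experiments.</p>
          <preformat>
```python
from collections import defaultdict
from statistics import mean

# Toy records (equation_id, width, micro_param, test_loss). The values are
# made up for illustration and are NOT results from the paper.
records = [
    (2, (2, 5, 1), 4, 0.30),
    (3, (2, 5, 1), 4, 0.10),
    (2, (2, 5, 1), 8, 0.20),
    (3, (2, 5, 1), 8, 0.06),
    (2, (2, 5, 5, 1), 4, 0.12),
    (3, (2, 5, 5, 1), 4, 0.08),
]


def average_loss_grid(rows):
    """Average the test loss over equations for each (width, micro) cell."""
    cells = defaultdict(list)
    for _equation_id, width, micro, loss in rows:
        cells[(width, micro)].append(loss)
    return {key: mean(values) for key, values in cells.items()}


grid = average_loss_grid(records)
for (width, micro), avg in sorted(grid.items()):
    print(width, micro, round(avg, 3))
```
          </preformat>
          <p>Each key of <monospace>grid</monospace> corresponds to one heatmap cell; plotting the same values over the two axes would give the matching surface.</p>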
        </sec>
      </sec>
      <sec id="sec-3-7">
        <title>7. Conclusion and Future Work</title>
        <p>In this work, we proposed a novel MLP-based architecture called HybridKAN, which retains the
overall structure of Kolmogorov-Arnold Networks (KANs) while replacing their B-spline functions with
trainable sub-MLPs. This architectural shift grounds the model in the Universal Approximation
Theorem, addressing the pathological cases observed in KANs and resulting in improved performance
with more stable and generally shorter training times. We conducted an exhaustive grid search over
network and sub-MLP configurations, providing detailed results and ablation studies to evaluate the
impact of different architectural choices. We tested both KAN and HybridKAN on the same 27 equations
that the original KAN paper sampled from the Feynman dataset, where HybridKAN outperformed KAN on
the majority of equations. Finally, our analysis of the loss surfaces corroborates that HybridKAN not
only achieves higher R<sup>2</sup> scores on regression tasks but also exhibits more consistent
optimization behavior across diverse tasks.</p>
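        <p>To make the architectural idea concrete, the following is a minimal forward-pass sketch of a HybridKAN-style layer in plain Python. It assumes that each edge function is a one-hidden-layer sub-MLP with a tanh activation and that each output node sums its incoming edge functions, as in KAN. The class names <monospace>SubMLP</monospace> and <monospace>HybridKANLayer</monospace>, the activation, the random initialization and the example width are illustrative assumptions, not the implementation used in the experiments, and training is omitted.</p>
        <preformat>
```python
import math
import random


class SubMLP:
    """Univariate function phi: R -> R modelled by a one-hidden-layer MLP.

    Assumption: tanh hidden activation and a linear output; the sub-MLP
    design used in the paper may differ.
    """

    def __init__(self, neurons: int, rng: random.Random):
        self.w1 = [rng.uniform(-1.0, 1.0) for _ in range(neurons)]
        self.b1 = [rng.uniform(-1.0, 1.0) for _ in range(neurons)]
        self.w2 = [rng.uniform(-1.0, 1.0) for _ in range(neurons)]
        self.b2 = 0.0

    def __call__(self, x: float) -> float:
        hidden = [math.tanh(w * x + b) for w, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


class HybridKANLayer:
    """KAN-style layer with one learnable univariate function per edge.

    Output node j computes sum_i phi[j][i](x_i), where every phi is a
    sub-MLP standing in for a B-spline.
    """

    def __init__(self, n_in: int, n_out: int, neurons: int, seed: int = 0):
        rng = random.Random(seed)
        self.n_in = n_in
        self.edges = [
            [SubMLP(neurons, rng) for _ in range(n_in)] for _ in range(n_out)
        ]

    def __call__(self, x):
        if len(x) != self.n_in:
            raise ValueError("input size does not match the layer")
        return [sum(phi(xi) for phi, xi in zip(row, x)) for row in self.edges]


def forward(layers, x):
    """Apply the layers in sequence."""
    for layer in layers:
        x = layer(x)
    return x


# Hypothetical width [2, 5, 1] with 8 neurons per sub-MLP.
model = [
    HybridKANLayer(2, 5, neurons=8, seed=0),
    HybridKANLayer(5, 1, neurons=8, seed=1),
]
y = forward(model, [0.3, -0.7])
```
        </preformat>
        <p>In this sketch the micro-level parameter is <monospace>neurons</monospace>, while the list of layer sizes plays the role of the macro-level width.</p>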
        <p>This study represents a first investigation into the potential of HybridKAN, and future work will
extend this analysis to include classification tasks, additional benchmark datasets, and a significantly
broader range of experiments to further validate and refine the model’s capabilities across various
machine learning domains. While our current evaluation focuses on KAN architectures, future work
will include comparisons with standard MLPs to contextualise HybridKAN’s general performance
advantages. It must be noted that, while HybridKAN outperforms KAN in accuracy and training time, it
remains less explainable. A deeper study of the explainability of HybridKAN will be the subject of future
work.</p>
      </sec>
      <sec id="sec-3-8">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT-4o for grammar and spelling
checks. After using this tool, the author(s) reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
      </sec>
      <sec id="sec-3-9">
        <title>Acknowledgements</title>
        <p>The numerical simulations were run on the HPC cluster of the Department of Information
Engineering, Computer Science and Mathematics (DISIM) at the University of L’Aquila. The work
is partially funded by the European Union - NextGenerationEU under the Italian Ministry of
University and Research (MUR) National Innovation Ecosystem grant ECS00000041 - VITALITY - CUP
E13C22001060006, by the National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza,
PNRR) - Project: “SoBigData.it - Strengthening the Italian RI for Social Mining and Big Data Analytics”
- Prot. IR0000013 - Avviso n. 3264 del 28/12/2021, and by the “ICSC – Centro Nazionale di Ricerca in
High Performance Computing, Big Data and Quantum Computing.”</p>
        <p>[2] H. Mu, B. Ul Tayyab, N. Chua, SpiralMLP: A Lightweight Vision MLP Architecture, in: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), IEEE Computer Society, Los Alamitos, CA, USA, 2025, pp. 8627–8637. URL: https://doi.ieeecomputersociety.org/10.1109/WACV61041.2025.00836. doi:10.1109/WACV61041.2025.00836.</p>
        <p>[3] D. Alvarez-Melis, T. S. Jaakkola, Towards robust interpretability with self-explaining neural networks, 2018. URL: https://arxiv.org/abs/1806.07538. arXiv:1806.07538.</p>
        <p>[4] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, M. Tegmark, KAN: Kolmogorov-Arnold networks, 2025. URL: https://arxiv.org/abs/2404.19756. arXiv:2404.19756.</p>
        <p>[5] V. D. Tran, T. X. H. Le, T. D. Tran, H. L. Pham, V. T. D. Le, T. H. Vu, V. T. Nguyen, Y. Nakashima, Exploring the limitations of Kolmogorov-Arnold networks in classification: Insights to software training and hardware implementation, 2024. URL: https://arxiv.org/abs/2407.17790. arXiv:2407.17790.</p>
        <p>[6] K. Shukla, J. D. Toscano, Z. Wang, Z. Zou, G. E. Karniadakis, A comprehensive and fair comparison between MLP and KAN representations for differential equations and operator networks, Computer Methods in Applied Mechanics and Engineering 431 (2024) 117290. URL: https://www.sciencedirect.com/science/article/pii/S0045782524005462. doi:10.1016/j.cma.2024.117290.</p>
        <p>[7] V. I. Arnold, On the representation of continuous functions of several variables by superpositions of continuous functions of one variable, Uspekhi Mat. Nauk 18 (1963). An influential contribution to the development of the theorem.</p>
        <p>[8] A. N. Kolmogorov, On the representation of continuous functions of several variables by superpositions of continuous functions of one variable and addition, Doklady Akademii Nauk SSSR 114 (1957). Original paper (in Russian).</p>
        <p>[9] C. J. Vaca-Rubio, L. Blanco, R. Pereira, M. Caus, Kolmogorov-Arnold networks (KANs) for time series analysis, 2024. URL: https://arxiv.org/abs/2405.08790. arXiv:2405.08790.</p>
        <p>[10] R. C. Yu, S. Wu, J. Gui, Residual Kolmogorov-Arnold network for enhanced deep learning, 2025. URL: https://arxiv.org/abs/2410.05500. arXiv:2410.05500.</p>
        <p>[11] A. Jamali, S. K. Roy, D. Hong, B. Lu, P. Ghamisi, How to learn more? Exploring Kolmogorov-Arnold networks for hyperspectral image classification, 2024. URL: https://arxiv.org/abs/2406.15719. arXiv:2406.15719.</p>
        <p>[12] E. Poeta, F. Giobergia, E. Pastor, T. Cerquitelli, E. Baralis, A benchmarking study of Kolmogorov-Arnold networks on tabular data, in: 2024 IEEE 18th International Conference on Application of Information and Communication Technologies (AICT), 2024, pp. 1–6. doi:10.1109/AICT61888.2024.10740444.</p>
        <p>[13] A. Dahal, S. A. Murad, N. Rahimi, Efficiency bottlenecks of convolutional Kolmogorov-Arnold networks: A comprehensive scrutiny with ImageNet, AlexNet, LeNet and tabular classification, 2025. URL: https://arxiv.org/abs/2501.15757. arXiv:2501.15757.</p>
        <p>[14] S. Somvanshi, S. A. Javed, M. M. Islam, D. Pandit, S. Das, A survey on Kolmogorov-Arnold network, ACM Comput. Surv. (2025). URL: https://doi.org/10.1145/3743128. doi:10.1145/3743128. Just accepted.</p>
        <p>[15] Z. Li, Kolmogorov-Arnold networks are radial basis function networks, 2024. URL: https://arxiv.org/abs/2405.06721. arXiv:2405.06721.</p>
        <p>[16] S. T. Seydi, Z. Bozorgasl, H. Chen, Unveiling the power of wavelets: A wavelet-based Kolmogorov-Arnold network for hyperspectral image classification, 2024. URL: https://arxiv.org/abs/2406.07869. arXiv:2406.07869.</p>
        <p>[17] A. D. Bodner, A. S. Tepsich, J. N. Spolski, S. Pourteau, Convolutional Kolmogorov-Arnold networks, 2024. URL: https://arxiv.org/abs/2406.13155. arXiv:2406.13155.</p>
        <p>[18] M. Moradi, S. Panahi, E. Bollt, Y.-C. Lai, Kolmogorov-Arnold network autoencoders, 2024. URL: https://arxiv.org/abs/2410.02077. arXiv:2410.02077.</p>
        <p>[19] Kaggle, Kaggle: Your machine learning and data science community, 2025. URL: https://www.kaggle.com.</p>
      </sec>
      <sec id="sec-3-10">
        <title>A. Equations</title>
        <p>As mentioned in Section 5, for this study, we selected the same 27 equations sampled from the Feynman
dataset by the authors of the original KAN paper [4]. Here we report in detail what those equations are
and their IDs.</p>
        <p>Equation IDs: 2, 3, 5, 12, 14, 17, 20, 21, 26, 27, 29, 30, 31, 38, 43, 48, 51, 52, 56, 62, 64, 80, 82, 84, 90, 91, 98.</p>
      </sec>
      <sec id="sec-3-11">
        <title>B. Detailed performance for best hyperparameters</title>
        <p>This appendix reports the best-performing architectures for each equation, along with their hyperparameters. See Section 5 for more details on the
hyperparameter space.</p>
        <p>Table columns: Eq, Width, HybridKAN.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>SA-MLP: A low-power multiplication-free deep network for 3D point cloud classification in resource-constrained environments</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2409.01998. arXiv:2409.01998.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>