<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Remarks on the Universal Approximation Property of Feedforward Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiří Kupka</string-name>
          <email>Jiri.Kupka@osu.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zahra Alijani</string-name>
          <email>Zahra.Alijani@osu.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petra Števuliáková</string-name>
          <email>Petra.Stevuliakova@osu.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITAT'25: Information Technologies - Applications and Theory</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, Centre of Excellence IT4Innovations</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents a structured overview and novel insights into the universal approximation property of feedforward neural networks. We categorize existing results based on the characteristics of activation functions - ranging from strictly monotonic to weakly monotonic and continuous almost everywhere - and examine their implications under architectural constraints such as bounded depth and width. Building on classical results by Cybenko [1], Hornik [2], and Maiorov [3], we introduce new activation functions that enable even simpler neural network architectures to retain universal approximation capabilities. Notably, we demonstrate that single-layer networks with only two neurons and fixed weights can approximate any continuous univariate function, and that two-layer networks can extend this capability to multivariate functions. These findings refine the known lower bounds of neural network complexity and offer constructive approaches that preserve strict monotonicity, improving upon prior work that relied on relaxed monotonicity conditions. Our results contribute to the theoretical foundation of neural networks and open pathways for designing minimal yet expressive architectures.</p>
      </abstract>
      <kwd-group>
        <kwd>Universal Approximation Theorem</kwd>
        <kwd>Neural Network</kwd>
        <kwd>Activation Function</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In this paper, we would like to contribute to the mathematical foundations of neural networks,
which are nowadays used in many areas of our lives, not only in industrial applications (e.g. through
image and video processing tools) but also in many aspects of everyday life, such as automated medical and
psychological diagnosis [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5, 6</xref>
        ], automated detection of mental conditions, etc. We discuss one of the fundamental results on neural networks,
namely the universal approximation theorem.
This theorem, in its purest form, states that a feedforward neural network with a single hidden layer
can approximate any continuous function on a compact subset of ℝⁿ, provided it has a sufficient number
of neurons in the hidden layer. However, these results generally do not specify how many neurons are
required to achieve this approximation.
      </p>
      <p>Most universal approximation results fall into one of two categories: those that bound the depth of
the network and study the width required, and those that bound the width and study the depth required.
Establishing a lower bound for width only requires identifying one function in the class that the neural
network architecture cannot approximate when its width is below the bound. Therefore, proving upper
bounds for the minimum width is generally regarded as more difficult than demonstrating lower bounds.</p>
      <p>While Section 2 provides some fundamental results and notation, Section 3 consists of a short survey
of classical and recent universal approximation theorems, classified mainly according to the type of activation
function, which is, to the best of our knowledge, a new approach. In the last section, we mention,
without proofs, our recent results [7], which enrich some of the results discussed in Section 3. In brief,
our results provide new lower bounds on the complexity of feedforward single-hidden-layer and
two-hidden-layer neural networks that still possess the universal approximation property.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Notation and a Fundamental Result</title>
      <p>In this section, we briefly recall the basic concepts and fundamental facts about neural networks,
activation functions, and universal approximation properties. Our objective is to understand and apply
the universal approximation property using a wide range of activation functions, without restricting
ourselves to specific structural assumptions. The following types of neural networks are the basic
objects of our study.</p>
      <p>Single-Layer Feedforward Neural Network (SLFN):
𝒩(x̄) ∶= ∑ᵢ₌₁ᴺ cᵢ σ(w̄ᵢ ⋅ x̄ − θᵢ),
where:
• x̄ ∈ ℝⁿ is the input vector.
• w̄ᵢ ∈ ℝⁿ are the weights of the i-th neuron in the hidden layer.
• θᵢ ∈ ℝ are bias terms.
• cᵢ ∈ ℝ are output weights.
• σ(⋅) is a univariate activation function.</p>
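      <p>For concreteness, the SLFN above can be written out in a few lines of NumPy. The following sketch is our illustration; the logistic sigmoid and the random parameter values are our choices and are not part of the definition:</p>

```python
import numpy as np

def slfn(x, weights, biases, out_weights, activation):
    """Evaluate N(x) = sum_i c_i * activation(w_i . x - theta_i).

    x           : input vector, shape (n,)
    weights     : rows are the w_i, shape (N, n)
    biases      : the theta_i, shape (N,)
    out_weights : the c_i, shape (N,)
    """
    return out_weights @ activation(weights @ x - biases)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
n, N = 3, 5                      # input dimension, number of hidden neurons
w = rng.normal(size=(N, n))      # inner weights w_i
theta = rng.normal(size=N)       # biases theta_i
c = rng.normal(size=N)           # output weights c_i
x = np.array([0.1, -0.2, 0.3])

print(slfn(x, w, theta, c, sigmoid))  # a single real number
```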
      <p>Two-Layer Feedforward Neural Network (TLFN):
𝒩(x̄) ∶= ∑ᵢ₌₁ᴹ dᵢ σ(∑ⱼ₌₁ᴺ cᵢⱼ σ(w̄ᵢⱼ ⋅ x̄ − θᵢⱼ) − γᵢ),
where x̄, w̄ᵢⱼ ∈ ℝⁿ; cᵢⱼ, θᵢⱼ, dᵢ, γᵢ ∈ ℝ and σ(⋅) is a fixed univariate activation function.</p>
      <p>
        The foundational results on the universal approximation property of neural networks date from 1989
and are due to Cybenko [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Funahashi [8] and Hornik et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The authors independently proved that a neural
network with a single hidden layer can approximate any continuous function on a compact domain. In
addition, Cybenko [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] used a continuous sigmoidal activation function, Funahashi [8] worked with a
continuous activation function that is nonconstant, bounded and monotone increasing, and similarly
Hornik et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] used a continuous nonconstant activation function. Two years later, Hornik (in [9])
proved that it is not the specific choice of the activation function but rather the feedforward architecture
that ensures the required property. The results presented by Cybenko, Funahashi and Hornik et al. are not
constructive in a simple way; constructive approximations were first presented in [10, 11].
      </p>
      <p>
        Below we denote by C([0, 1]ⁿ) the space of continuous functions f ∶ [0, 1]ⁿ → ℝ, and by a C∞ map we
denote a map for which every derivative exists and is continuous.
      </p>
      <p>
        Theorem 2.1 (Cybenko, 1989). For any continuous function f ∈ C([0, 1]ⁿ) and any ε &gt; 0, there exists a
neural network of SLFN type, with a continuous sigmoidal activation function and a finite number of
neurons, whose output 𝒩 satisfies |𝒩(x̄) − f(x̄)| &lt; ε for all x̄ ∈ [0, 1]ⁿ.
      </p>
      <p>
        This result relies on:
• The Stone-Weierstrass theorem, ensuring the density of polynomials in C([0, 1]ⁿ).
• The discriminatory property of sigmoidal functions, which guarantees their ability to separate
measures.
      </p>
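      <p>Although Theorem 2.1 says nothing about the number of neurons, its content is easy to probe numerically: fix random inner weights and biases of a single sigmoidal hidden layer and solve for the output weights by linear least squares. The sketch below is our illustration; the weight scale, neuron count and target function are our assumptions, not taken from [1]:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_outer_weights(f, n_neurons=200, n_train=400):
    """Fix random inner weights/biases of one hidden sigmoid layer on [0, 1]
    and solve for the output weights by linear least squares."""
    x = np.linspace(0.0, 1.0, n_train)
    w = rng.normal(scale=20.0, size=n_neurons)    # fixed inner weights
    b = rng.uniform(-20.0, 20.0, size=n_neurons)  # fixed biases
    features = sigmoid(np.outer(x, w) + b)        # hidden-layer outputs
    c, *_ = np.linalg.lstsq(features, f(x), rcond=None)
    return lambda t: sigmoid(np.outer(np.atleast_1d(t), w) + b) @ c

f = lambda x: np.sin(2.0 * np.pi * x) + x**2   # a target continuous function
net = fit_outer_weights(f)

grid = np.linspace(0.0, 1.0, 1000)
err = np.max(np.abs(net(grid) - f(grid)))
print(err)  # maximum deviation on a dense grid
```

Only the output weights are trained here; the hidden layer is frozen, which mirrors the non-constructive flavour of the classical theorems: enough random sigmoidal units suffice.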
      <p>The case of arbitrary depth has been extensively studied by many authors, particularly in the context
of neural networks with ReLU activation functions (see, e.g., [12], [13], [14], [15]). More recently, Kidger
generalized these results to a broader class of activation functions in [16], extending the applicability of
universal approximation theorems beyond ReLU-based architectures. Building on this, recent work
[17] introduces fractional-order derivatives into activation functions, offering tunable flexibility that
helps networks better capture complex patterns and improve learning performance.</p>
      <p>An intriguing and novel approach was introduced in [18], where the authors demonstrated that
universal approximation can be achieved using a finite set of mappings. This vocabulary-based method
opens new perspectives on the structure and design of neural networks for function approximation.</p>
    </sec>
    <sec id="sec-3">
      <title>3. State-of-the-Art Knowledge - Structured According to Properties of Activation Functions</title>
      <p>Quite a number of theorems on universal approximation can be found in the literature, and the ideas
in their proofs vary considerably. In the following, we structure theorems on the universal approximation
property according to the properties of the activation functions. The idea is to focus mainly on the case of
neural networks with bounded depth and width.</p>
      <sec id="sec-4-1">
        <title>3.1. Strictly Monotone Continuous Activation Functions</title>
        <p>In the bounded-width setting, neural networks are restricted to a fixed number of neurons per layer
but can have arbitrary depth. Kidger and Lyons [16] showed that such deep networks, with width as
small as n + m + 2 (where n is the input dimension and m the output dimension), are still universal
approximators, provided the activation function is continuous, non-affine, and non-polynomial. Their
result significantly generalizes earlier work, which primarily focused on ReLU-based architectures, by
extending universality to a broader class of activation functions.</p>
        <p>
          In the bounded-depth setting, the depth of the neural network is fixed, while the width is allowed to
grow. Classical results by Cybenko [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and Hornik et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] demonstrate that shallow
networks, specifically those with a single hidden layer, can approximate any continuous function on a compact domain,
provided the width is sufficiently large. These foundational results highlight the expressive power of
wide and shallow architectures.
        </p>
        <p>More recently, Cai [18] introduced a novel constructive approach to universal approximation using
a finite set of mappings, referred to as a "vocabulary". This method enables the approximation of
continuous functions by composing a fixed set of nonlinear transformations, offering a compact and
interpretable representation of the function space. While the vocabulary-based method is not limited
to shallow networks, it provides new insights into how expressive power can be achieved even under
architectural constraints.</p>
        <p>Additionally, Kratsios [19] provided a general characterization of universal approximation under
various architectural constraints, including bounded depth. His work shows that with appropriate
modifications—such as sparse connectivity or shifted activation functions—universal approximation
can still be achieved, even when depth is limited.</p>
        <p>
          Maiorov et al. in 1999 ([
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]) proved that there exists an analytic, strictly increasing sigmoidal activation
function for which a neural network with limited width and depth is a universal approximator.
Firstly, they proved that there exists an activation function for which the single-hidden-layer neural
network approximation is essentially identical (has the same approximation order) to approximation by ridge functions
restricted to the unit ball Bⁿ in ℝⁿ with boundary Sⁿ⁻¹. The theoretical lower bound is given by the
approximation order of the manifold
Mᵣ = {∑ᵢ₌₁ʳ gᵢ(w̄ᵢ ⋅ x̄) ∣ w̄ᵢ ∈ Sⁿ⁻¹, gᵢ ∈ C([−1, 1])}.  (1)
        </p>
        <p>
          Theorem 3.1. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] There exists an activation function σ which is real analytic, strictly increasing, and
sigmoidal, satisfying the following. Given f ∈ Mᵣ and ε &gt; 0, there exist constants cᵢ, integers mᵢ and vectors
w̄ᵢ ∈ Sⁿ⁻¹, i = 1, …, 3r, such that
|f(x̄) − ∑ᵢ₌₁³ʳ cᵢ σ(w̄ᵢ ⋅ x̄ − mᵢ)| &lt; ε
for all x̄ ∈ Bⁿ.
        </p>
        <p>
          Secondly in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], they considered the neural network with two hidden layers. They showed that for
their constructed activation function, any continuous function on the unit cube in ℝ can be uniformly
approximated with any error.
        </p>
        <p>
          Theorem 3.2. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] There exists an activation function σ which is real analytic, strictly increasing, and
sigmoidal, and has the following property. For any f ∈ C([0, 1]ⁿ) and ε &gt; 0, there exist real constants
dᵢ, cᵢⱼ, θᵢⱼ, γᵢ, and vectors w̄ᵢⱼ ∈ ℝⁿ for which
|f(x̄) − ∑ᵢ₌₁⁶ⁿ⁺³ dᵢ σ(∑ⱼ₌₁³ⁿ cᵢⱼ σ(w̄ᵢⱼ ⋅ x̄ − θᵢⱼ) − γᵢ)| &lt; ε
for all x̄ ∈ [0, 1]ⁿ.
        </p>
        <p>Proof. The proof is based on an improved version of the Kolmogorov Superposition Theorem and an
activation function constructed from polynomials with rational coefficients.</p>
        <p>Remark 1. If one wishes to replace the demand of analyticity of σ by C∞ only, then Theorem 3.2 can be
stated with 2n + 1 units in the first layer and 4n + 3 units in the second layer. Similarly, for the demand of
strict monotonicity and sigmoidality only, Theorem 3.2 can be proven with n units in the first layer and
2n + 1 units in the second layer. The restriction of Theorem 3.2 to the unit cube is for convenience only; the same
result holds over any compact subset of ℝⁿ.</p>
        <p>Remark 2. The activation function used in the above results is pathological and this demonstrates that the
properties of being analytic, strictly monotone, and sigmoidal may not be as significant as is often assumed.
Essentially, these pathologies can be hidden even within functions that possess such desirable characteristics
because powerful tools like translation and composition can still introduce them.</p>
        <p>Remark 3. The activation function σ is wonderfully smooth, but unacceptably complex. Theoretical results
such as these have a different purpose. They are meant to tell us what is possible and, sometimes more
importantly, what is not. They are also meant to explain why certain things are or are not possible by
highlighting their salient characteristics.</p>
        <p>
          Remark 4. Proposition 1 in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] provides a weaker result than Theorem 2 (in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]), namely, the universal
approximation property (for f ∈ Mᵣ) is proved for an activation function σ being C∞, strictly increasing,
and sigmoidal. The proof of this proposition provides a construction which was later, to a large extent, used
in [20]; however, the authors in [20] lost the strict monotonicity. The construction needs the fact that there
exists a dense C∞ family of functions in C([−1, 1]).
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Weakly Monotone Continuous Activation Functions</title>
        <p>In Guliyev’s research [20, 21], the notion of λ-monotonicity is a relaxed version of traditional
monotonicity and is central to constructing the sigmoidal activation functions used in the approximation
theorems. This concept allows for slight deviations from strict monotonicity while maintaining enough
structure to support function approximation in neural networks.</p>
        <p>Definition 3.3. A function f ∶ X → ℝ, where X ⊆ ℝ, is said to be λ-monotone if there exists a strictly
monotonic function g ∶ X → ℝ such that:</p>
        <p>|f(x) − g(x)| ≤ λ for all x ∈ X,
where λ &gt; 0 is a small positive real number that quantifies the allowable deviation from strict monotonicity.</p>
        <p>In essence:
• For λ = 0, f(x) coincides with g(x) and is strictly monotonic.
• For λ &gt; 0, f(x) can exhibit small oscillations around g(x), but the deviation is controlled and
bounded by λ.</p>
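        <p>Definition 3.3 can be checked numerically. In the sketch below (our illustration; the particular f, g and the value of λ are our choices), f oscillates around the strictly increasing g with deviation at most λ, so f is λ-monotone even though f itself is not monotone:</p>

```python
import numpy as np

lam = 0.05
x = np.linspace(0.0, 1.0, 10_001)

g = x                            # strictly increasing reference function
f = x + lam * np.sin(40.0 * x)   # oscillates around g

dev = np.max(np.abs(f - g))      # deviation from the monotone reference
assert lam + 1e-12 >= dev        # f is lambda-monotone with respect to g
assert 0.0 > np.min(np.diff(f))  # ...although f itself is not monotone
print(dev)
```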
        <p>The Role of λ-Monotonicity in Guliyev’s Papers</p>
        <p>In Guliyev’s construction of the sigmoidal activation function σ(x), λ-monotonicity is a critical
property that balances the following:
• The constructed σ(x) is C∞ (infinitely differentiable), which is crucial for neural network
applications. Unlike strict monotonicity, λ-monotonicity allows the function to be flexible and
computationally efficient.
• λ-monotonicity ensures that the activation function behaves like a monotonic function, with
deviations limited by the parameter λ. This prevents erratic behavior while retaining flexibility.
• The sigmoidal function σ(x) constructed with λ-monotonicity satisfies
h(x) &lt; σ(x) &lt; 1 and |σ(x) − h(x)| ≤ λ for all x ∈ [s, +∞),
where h(x) is a strictly increasing auxiliary sigmoidal function defined as
h(x) = 1 − min{1/2, λ} / (1 + log(x − s + 1)).</p>
        <p>This inequality ensures that σ(x) closely follows h(x), with λ-bounded deviations, making σ(x) a
suitable choice for neural network activation.</p>
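        <p>The auxiliary function h is simple enough to examine directly. The following sketch (ours; the shift parameter s and the value of λ are assumptions) verifies numerically that h is strictly increasing and bounded above by 1 on the half-line starting at s:</p>

```python
import numpy as np

def h(x, s=0.0, lam=0.1):
    """h(x) = 1 - min(1/2, lam) / (1 + log(x - s + 1)), defined for x at or above s."""
    return 1.0 - min(0.5, lam) / (1.0 + np.log(x - s + 1.0))

x = np.linspace(0.0, 100.0, 10_001)   # a grid on [s, s + 100] with s = 0
y = h(x)

assert np.all(np.diff(y) > 0.0)   # strictly increasing on the grid
assert np.all(1.0 - y > 0.0)      # bounded above by 1
print(y[0], y[-1])                # starts at 1 - min(1/2, lam) and grows slowly toward 1
```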
        <sec id="sec-4-2-1">
          <title>3.2.1. Bounded Width of NN</title>
          <p>In scenarios where the width of the neural network is bounded, the depth must increase to maintain
approximation capabilities. Guliyev’s activation functions are particularly useful here because their
smoothness and controlled deviation from monotonicity allow for efficient composition across layers.
This means that even with a fixed number of neurons per layer, the network can still approximate
complex functions by increasing its depth.</p>
          <p>This is consistent with the findings of Ohn and Kim [22], who showed that deep networks with
general smooth activation functions can achieve minimax optimal approximation rates for Hölder
continuous functions. Their results emphasize the importance of smoothness over strict monotonicity
in deep architectures.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>3.2.2. Bounded Depth of NN</title>
          <p>When the depth is fixed, the network must rely on increased width to achieve approximation. In this
case, the flexibility of λ-monotonic activation functions becomes crucial. Their ability to approximate
strictly increasing functions with bounded error allows for constructing wide, shallow networks that
still achieve good approximation performance.</p>
          <p>Recent work by Biswas et al. [23] introduced a smooth, non-monotonic activation function (Sqish)
that performs well in both standard and adversarial settings. This supports the idea that relaxing
monotonicity constraints can lead to practical and effective activation functions. Similarly, Sartor et al.
[24] demonstrated that constrained monotonic neural networks with saturating activations can still
achieve universal approximation, further validating the theoretical foundation of λ-monotonicity.</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>3.2.3. Bounded Width and Depth of NN</title>
          <p>The first paper of Guliyev and Ismailov [20] addresses the problem of approximating continuous
functions using single-hidden-layer feedforward neural networks (SLFNs) with fixed weights. The goal
is to show that such networks, with only two neurons in the hidden layer and fixed weights set to 1,
can approximate any continuous univariate function on a compact interval.</p>
          <p>Theorem 3.4. [20] For any continuous univariate function f(x) on [a, b] and any ε &gt; 0, there exists an SLFN with
two hidden neurons and fixed weights such that:
|f(x) − ∑ᵢ₌₁² cᵢ σ(x + θᵢ)| &lt; ε for all x ∈ [a, b].</p>
          <p>This holds for univariate functions; for multivariate functions, no such approximation is generally
possible with fixed weights, see, e.g., the arguments in the last part of [20].</p>
          <p>Limitation. SLFNs with fixed weights cannot approximate multivariate functions due to a lack of sufficient capacity to
model interactions between multiple variables.</p>
          <p>According to another paper of Guliyev and Ismailov [21], networks with a two-layer
architecture and fixed weights can approximate any continuous multivariate function over compact
domains. This gives them substantially greater expressive power compared to single-layer networks.
The extra layer allows for the modeling of more intricate non-linear combinations of input variables,
which tend to be challenging for single-layer networks to handle.</p>
          <p>Theorem 3.5. [21] For any continuous multivariate function f(x₁, …, xₙ) on [a, b]ⁿ and any ε &gt; 0, there exist
constants cᵢⱼ, θᵢⱼ, dᵢ, γᵢ such that:
|f(x̄) − ∑ᵢ₌₁²ⁿ⁺² dᵢ σ(∑ⱼ₌₁ⁿ cᵢⱼ σ(ēⱼ ⋅ x̄ − θᵢⱼ) − γᵢ)| &lt; ε
for all x̄ ∈ [a, b]ⁿ.</p>
          <p>TLFNs with fixed weights can approximate any continuous multivariate function f(x₁, x₂, …, xₙ) to
arbitrary precision on a compact domain. The network uses fixed weights given by the unit coordinate vectors,
and a sigmoidal activation function is constructed to achieve this approximation.</p>
          <p>Advantage. Introducing a second hidden layer improves the network's ability to capture complex
interactions between variables, thereby allowing it to approximate multivariate functions more
effectively. Unlike SLFNs with fixed weights, which are restricted to approximating univariate
functions, TLFNs with a two-layer architecture and fixed weights can approximate any continuous
multivariate function on compact sets. This design significantly boosts their expressive capacity in
comparison to single-layer networks. The added power stems from the additional layer, which
enables more intricate non-linear combinations of input variables, a capability that single-layer
networks find difficult to achieve.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Continuous Activation Functions</title>
        <p>When there are no constraints on the architecture of the neural network, the classical universal
approximation theorem applies. It states that a feedforward neural network with a continuous,
non-polynomial activation function can approximate any continuous function on a compact subset of ℝⁿ to
arbitrary precision.</p>
        <p>This result holds for a wide class of continuous activation functions, including:
• Sigmoidal functions (e.g., logistic, tanh)
• Smooth approximations of ReLU (e.g., softplus)
• Weakly monotonic or λ-monotonic functions [20, 21]
• General smooth activations as studied in [22]</p>
        <p>When the depth is fixed, the network must rely on increased width to achieve universal approximation.
This setting is more restrictive but still allows for the approximation of continuous functions under
certain conditions. For example, Lu et al. [25] showed that ReLU networks with fixed depth can
approximate any continuous function if the width is sufficiently large.</p>
        <p>Smooth activation functions can also be used in this setting. Zhang et al. [26] demonstrated that
networks using a wide range of smooth activations (e.g., softplus, GELU, Swish) can approximate ReLU
networks with only modest increases in width and depth. This implies that smooth activations retain
expressive power even in shallow architectures, provided the network is wide enough.</p>
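        <p>A small numerical illustration of this point (ours, not taken from [26]): the scaled softplus log(1 + exp(kx))/k converges uniformly to ReLU as k grows, so a smooth activation can mimic a ReLU unit arbitrarily well:</p>

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus_scaled(x, k):
    """log(1 + exp(k*x)) / k, computed stably; tends to relu(x) as k grows."""
    return np.logaddexp(0.0, k * x) / k

x = np.linspace(-5.0, 5.0, 2001)
for k in (1.0, 10.0, 100.0):
    err = np.max(np.abs(softplus_scaled(x, k) - relu(x)))
    print(k, err)  # the uniform error decays like log(2) / k
```

The worst-case gap sits at the kink x = 0 and equals log(2)/k, so sharpening the softplus trades smoothness for fidelity to ReLU.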
        <p>
          The expressive power of deep neural networks with fixed width has been a subject of significant
interest. In particular, Hanin [27] showed that ReLU networks with a width as small as n + 1 are sufficient
to arbitrarily approximate any continuous convex function on the unit cube [0, 1]ⁿ. Furthermore, for
general continuous functions, a width of n + 3 suffices.
        </p>
        <p>Another specification of width bounds is based on the input and output dimensions of the network.
Johnson [28] derived a lower bound on the width for uniformly continuous activations when the output
dimension is one. Later, Cai [29] obtained the optimal minimal width, in terms of the input and output
dimensions, over all classes of activations. Recently, Rochau et al. [30] generalized the universal
approximation results of Johnson [28] to higher output dimensions and achieved a lower bound on the
width that is even tighter than the one stated by Cai [29].</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Continuous Almost Everywhere Activation Functions</title>
        <p>
          Leshno et al. [31] investigate the universal approximation capabilities of neural networks and analyze
the criteria under which single-layer feedforward networks (SLFNs) approximate continuous functions. Their work
demonstrates that SLFNs with a continuous activation function σ
have the universal approximation property if and only if σ is not a polynomial. This finding extends
previous results, such as the one by Cybenko [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], by explicitly outlining the necessary and sufficient
conditions for universal approximation. Theorem 1 in [31] states:
Theorem 3.6. Let σ ∈ M, where M is the set of locally bounded, piecewise continuous functions such that the
closure of the set of their discontinuity points has zero Lebesgue measure. Define:
Σ = span{σ(w̄ ⋅ x̄ + θ) ∶ w̄ ∈ ℝⁿ, θ ∈ ℝ}.
        </p>
        <p>Then Σ is dense in C(ℝⁿ) if and only if σ is not an algebraic polynomial (almost everywhere).
Proof. We provide a sketch of the proof from [31].</p>
        <p>1. If σ is a polynomial, Σ cannot be dense in C(ℝⁿ).
2. If σ is not a polynomial, Σ is dense in C(ℝⁿ).</p>
        <p>The argument relies on:
• Density properties of function spaces.
• Weierstrass’s theorem for polynomial approximation.
• Functional analysis to construct approximations.</p>
        <p>STEP 1. If σ is a polynomial, Σ is not dense. Assume σ is a polynomial of degree k, i.e.,
σ(x) = aₖxᵏ + ⋯ + a₀. Then σ(w̄ ⋅ x̄ + θ) is also a polynomial of degree at most k, as it is a transformation of σ.
The span of the functions σ(w̄ ⋅ x̄ + θ) therefore consists of polynomials of degree ≤ k. This span cannot approximate
functions that require higher complexity, such as non-polynomial or discontinuous functions.
Thus, Σ cannot be dense in C(ℝⁿ).</p>
        <p>STEP 2. If Σ is dense in C(ℝ), it is dense in C(ℝⁿ). Consider the space R = span{g(w̄ ⋅ x̄) ∶ w̄ ∈ ℝⁿ, g ∈ C(ℝ)}.
Known results in functional analysis state that R is dense in C(ℝⁿ) (using ridge functions).
Given f ∈ C(ℝⁿ), for any compact set K ⊂ ℝⁿ there exist gᵢ ∈ C(ℝ) and directions w̄₁, …, w̄ᵣ such that
f(x̄) ≈ ∑ᵢ₌₁ʳ gᵢ(w̄ᵢ ⋅ x̄).
If Σ is dense in C(ℝ), it can approximate any gᵢ(w̄ᵢ ⋅ x̄), making Σ dense in C(ℝⁿ).</p>
        <p>STEP 3. If σ is smooth and not a polynomial, Σ₁ is dense in C(ℝ). Assume σ is C∞, meaning it has
derivatives of all orders, and it is not a polynomial. For any f ∈ C(ℝ), Weierstrass’s theorem
ensures that polynomials are dense in C(ℝ) on compact sets. The derivatives of σ(w ⋅ x + θ)
with respect to the weight are:
(∂ᵏ/∂wᵏ) σ(wx + θ) = xᵏ σ⁽ᵏ⁾(wx + θ).
Since σ is not a polynomial, at least one derivative σ⁽ᵏ⁾(θ) is non-zero for some θ. Hence, Σ₁ can
generate polynomials of all degrees. Combining this with Weierstrass’s theorem, Σ₁ is dense in
C(ℝ).</p>
        <p>STEP 4. Approximation of Continuous Functions: the argument is extended to locally bounded,
piecewise continuous σ (not necessarily smooth). For any f ∈ C(ℝ) and any compact interval
[a, b], convolve σ with a smooth function φ of compact support:
(σ ∗ φ)(x) = ∫ σ(t) φ(x − t) dt.
The convolution σ ∗ φ is smooth, and Σ can approximate it because Σ is dense in the C∞ class.</p>
        <p>STEP 5. Non-polynomiality is Necessary: assume σ is non-polynomial. For any f ∈ C(ℝ), construct
an approximation using:
f(x) ≈ ∑ᵢ₌₁ʳ cᵢ σ(wᵢ x + θᵢ).
If σ were polynomial, this representation would restrict f to a finite-dimensional polynomial
space, which contradicts the generality of f in C(ℝ). Thus, σ must be non-polynomial for Σ to be
dense.</p>
        <p>This theorem implies that a multilayer feedforward network with a non-polynomial activation
function and a threshold in each neuron can approximate any continuous function on a compact domain
to arbitrary precision, given enough hidden units. Polynomial activation functions constrain the
network to operate within a finite-dimensional space of polynomial functions, which is insufficient for
approximating the infinite-dimensional space C(ℝⁿ) of continuous functions. This is why the universal
approximation property requires non-polynomial activation functions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Interesting Facts and New Results</title>
      <p>In this section, we highlight interesting facts that are fundamental for understanding constrains in
universal approximation theorems. In addition, we present here our results that provide new lower
limits of the complexity of neural networks with universal approximation properties.</p>
      <sec id="sec-5-1">
        <title>4.1. Fundamental Reasons Why a Polynomial Activation Function Fails to Guarantee the Universal Approximation Property</title>
        <p>This is due to the restricted expressive ability of polynomials. Here is the explanation:
Polynomials Are Finite-Dimensional. A polynomial of degree k is a function of the form
p(x) = aₖxᵏ + aₖ₋₁xᵏ⁻¹ + ⋯ + a₀,
where the aᵢ are coefficients. The space of all polynomials of degree ≤ k is a finite-dimensional vector
space; for example, in ℝ it has dimension k + 1. If the activation function σ(x) is a polynomial, any
function generated by a neural network with this activation is a linear combination of polynomial
terms of σ. This means that the network output is constrained to a space of polynomials.
Polynomials Cannot Approximate Non-Polynomial Functions. Universal approximation
requires the ability to represent any continuous function on a compact domain K ⊂ ℝⁿ. By the
Weierstrass approximation theorem, polynomials can approximate continuous functions on
compact sets; however, this property applies only if the degree of the polynomial is unbounded. If
σ(x) is a fixed polynomial, the neural network is limited to the finite-dimensional space spanned
by σ and its transformations. Thus, it cannot approximate functions requiring higher complexity
or non-polynomial behavior.
Restrictive Ridge Function Combinations. Neural networks with a single hidden layer approximate
functions using combinations of ridge functions:
f(x̄) ≈ ∑ᵢ₌₁ᴺ cᵢ σ(w̄ᵢ ⋅ x̄ + θᵢ).
If σ(x) is polynomial, this combination is restricted to polynomial ridge functions, which cannot
form a dense set in C(ℝⁿ).</p>
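        <p>The finite-dimensionality argument is easy to verify numerically (our illustration; all parameter values are our choices): with the polynomial activation σ(t) = t², the hidden-layer features span a space of dimension at most 3 on any grid, no matter how many neurons are used, whereas a non-polynomial activation keeps producing new directions:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def feature_matrix(activation, n_neurons=50, n_samples=200):
    """Columns are hidden-unit outputs activation(w*x + b) on a fixed grid."""
    x = np.linspace(-1.0, 1.0, n_samples)
    w = rng.normal(scale=3.0, size=n_neurons)
    b = rng.normal(size=n_neurons)
    return activation(np.outer(x, w) + b)

# Quadratic activation: every column lies in span{1, x, x**2}, so the rank
# cannot exceed 3 no matter how many neurons are used.
rank_poly = np.linalg.matrix_rank(feature_matrix(lambda t: t**2))

# Non-polynomial activation: the features keep generating new directions.
rank_tanh = np.linalg.matrix_rank(feature_matrix(np.tanh))

print(rank_poly, rank_tanh)  # 3 and a value larger than 3
```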
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Main Results</title>
        <p>
          Within this subsection, we would like to emphasize our new results. As we saw in the previous
part of this paper, although Cybenko’s first result and its proof are quite easy and straightforward,
universal approximability can be studied from many points of view, and some of the open directions
are not simple at all. We looked at some classic ([
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]) and also recent results ([20, 21]) in which the
simplest possible structures of feedforward neural networks are considered. Naturally, a neural
network possessing a simple structure must have a complex activation function in order to provide rich
approximation ability.
        </p>
        <p>
          Recently, in [7], we provided new constructions of activation functions σ and Θ different from those
mentioned in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This allowed us to preserve the universal approximation property for even simpler
neural networks than had previously been known. And we achieved this not only for C∞ functions,
but also for analytic functions. In the following, we assume the manifold M defined by (1).
        </p>
        <p>
          Theorem 4.1. Let f ∈ M and let σ ∶ ℝ → [−1, 1] be a C∞, strictly increasing, sigmoidal function. Then
for any ε &gt; 0 there exist constants cᵢ, θᵢ, integers mᵢ and vectors w̄ᵢ ∈ 𝕊ᵈ⁻¹, i = 1, 2, …, 2d, such that
|f(x̄) − ∑ᵢ₌₁²ᵈ cᵢ σ(mᵢ w̄ᵢ ⋅ x̄ − θᵢ)| &lt; ε
for all x̄ from the unit ball Bᵈ.
        </p>
        <p>
          Theorem 4.2. Let f ∈ M and let Θ ∶ ℝ → [−1, 1] be an analytic, strictly increasing, sigmoidal function.
Then for any ε &gt; 0, there are constants cᵢ, θᵢ, integers mᵢ and vectors w̄ᵢ ∈ 𝕊ᵈ⁻¹, i = 1, 2, …, 2d, satisfying
|f(x̄) − ∑ᵢ₌₁²ᵈ cᵢ Θ(mᵢ w̄ᵢ ⋅ x̄ − θᵢ)| &lt; ε
for all x̄ from the unit ball Bᵈ.
        </p>
        <p>Those results provide an approximation property for functions of just one variable. Based on these
results, we are able to provide improved results for multivariate functions as well, with the help of the
Kolmogorov Representation Theorem.</p>
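        <p>For orientation, the Kolmogorov Representation Theorem used in this step states (in one standard formulation; the symbols Φ_q and φ_{q,p} are generic, not the notation of [7]) that every f ∈ C([0, 1]ᵈ) is a superposition of univariate continuous functions:</p>

```latex
f(x_1,\dots,x_d) \;=\; \sum_{q=0}^{2d} \Phi_q\!\left( \sum_{p=1}^{d} \varphi_{q,p}(x_p) \right),
\qquad \Phi_q \in C(\mathbb{R}),\ \varphi_{q,p} \in C([0,1]).
```

        <p>The inner functions φ_{q,p} can be chosen independently of f, while the Φ_q depend on f; the 2d + 1 outer terms are the reason index ranges linear in d appear in the multivariate theorems below.</p>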
        <p>
          Theorem 4.3. Let f ∈ C([0, 1]ᵈ) and let Θ ∶ ℝ → [−1, 1] be an analytic, strictly increasing, sigmoidal
function. Then for any ε &gt; 0, there exist constants cᵢ, dᵢⱼ, θᵢⱼ, ηᵢ and vectors w̄ⱼ ∈ ℝᵈ, i = 1, 2, …, 4d + 2,
j = 1, 2, …, 2d, such that the corresponding network output approximates f to within ε on [0, 1]ᵈ.
        </p>
        <p>
          It should be noted that the news coming from our results is not only positive. Maiorov et al. in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
relied on proofs which are valid in the more general setting of normed linear spaces. We could simplify
the structure of the neural network, but we have lost the validity for normed linear spaces.
        </p>
        <p>However, there are also positive consequences. Above, we discussed the role of λ-monotonicity
in two papers by Guliyev and Ismailov ([20, 21]). They provided a nice constructive approach, with
the help of monic polynomials, which led to universal approximators of both univariate and multivariate
continuous functions. The price of this was that strict monotonicity was replaced by a weaker
monotonicity, namely λ-monotonicity. With the help of our newly constructed activation function ρ
from [7], we could preserve the nice constructive approach motivated by Guliyev and Ismailov ([20, 21]) and
still have a strictly increasing activation function.</p>
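        <p>The notion of λ-monotonicity can be made concrete numerically. In the sense used by Guliyev and Ismailov (our paraphrase: a function is λ-increasing if it lies within uniform distance λ of some nondecreasing function), the smallest such λ for a sampled function equals half the largest drop below the running maximum. A minimal sketch, assuming NumPy:</p>

```python
import numpy as np

def min_lambda(y):
    """Smallest lambda such that the sampled values y (on an increasing grid)
    lie within uniform distance lambda of some nondecreasing function.
    This equals half the largest drop below the running maximum."""
    drops = np.maximum.accumulate(y) - y
    return float(drops.max()) / 2.0

x = np.linspace(-1.0, 1.0, 1001)
print(min_lambda(np.tanh(x)))                         # strictly increasing: 0.0
print(min_lambda(np.tanh(x) + 0.1 * np.sin(20 * x)))  # wiggly: strictly positive
```

        <p>A strictly increasing activation thus corresponds to λ = 0, whereas the constructive approach of [20, 21] tolerates a small positive λ.</p>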
        <p>Using this function, we prove the following results.</p>
        <p>
          Theorem 4.4. Let f ∈ C([−1, 1]) and ε &gt; 0. Then there exist constants c₁, c₂, s ∈ ℝ such that
|f(x) − c₁ρ(−x − s) − c₂ρ(x + s)| &lt; ε
for all x ∈ [−1, 1].
        </p>
        <p>This result implies that a neural network with a single hidden layer and only two neurons, using the
activation function ρ, can approximate any continuous function on a compact interval arbitrarily well.
An analogous result holds for multivariate functions.</p>
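        <p>The two-neuron architecture of Theorem 4.4 can be written down directly. The specially constructed activation from [7] is not reproduced here, so this sketch (assuming NumPy) substitutes the ordinary logistic sigmoid as a placeholder; it shows the form c₁·act(−x − s) + c₂·act(x + s) being fitted, and also why a generic sigmoid does not suffice:</p>

```python
import numpy as np

def sigmoid(t):
    # Ordinary logistic sigmoid: a placeholder for the specially constructed
    # activation of Theorem 4.4, which carries the actual guarantee.
    return 1.0 / (1.0 + np.exp(-t))

def two_neuron_net(x, c1, c2, s):
    # Architecture of Theorem 4.4: one hidden layer with exactly two neurons,
    # input weights fixed to -1 and +1, and a shared shift s.
    return c1 * sigmoid(-x - s) + c2 * sigmoid(x + s)

x = np.linspace(-1.0, 1.0, 201)
target = x ** 2          # an example continuous function on [-1, 1]
s = 0.5                  # illustrative shift, not the theorem's construction

# For a fixed s the model is linear in (c1, c2): fit by least squares.
A = np.column_stack([sigmoid(-x - s), sigmoid(x + s)])
(c1, c2), *_ = np.linalg.lstsq(A, target, rcond=None)
err = np.max(np.abs(two_neuron_net(x, c1, c2, s) - target))
print(f"max error with the logistic placeholder: {err:.3f}")
```

        <p>The residual stays well above zero: with the logistic sigmoid, c₁·sigmoid(−x − s) + c₂·sigmoid(x + s) = c₁ + (c₂ − c₁)·sigmoid(x + s) is always monotone, so the strength of Theorem 4.4 genuinely depends on the special construction of the activation.</p>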
        <p>
          Theorem 4.5. Let f ∈ C([−1, 1]ᵈ) and ε &gt; 0. Then there exist constants cᵢ, dᵢⱼ, θᵢⱼ, ηᵢ and vectors w̄ⱼ ∈ ℝᵈ,
i = 1, 2, …, 4d + 2, j = 1, 2, …, 2d, such that the corresponding network output approximates f to within ε
for all x̄ ∈ [−1, 1]ᵈ, where the weights w̄ⱼ, j = 1, …, 2d, are fixed as follows:
w̄_1 = (1, 0, …, 0), …, w̄_d = (0, 0, …, 1),
w̄_{d+1} = (−1, 0, …, 0), …, w̄_{2d} = (0, 0, …, −1).</p>
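        <p>The fixed weights of Theorem 4.5 are simply the standard basis vectors of ℝᵈ and their negatives, so they can be generated mechanically. A minimal sketch (assuming NumPy; d is a free parameter):</p>

```python
import numpy as np

def fixed_weights(d):
    """The 2*d fixed weight vectors of Theorem 4.5: the standard basis
    vectors of R^d followed by their negatives."""
    eye = np.eye(d, dtype=int)
    return np.vstack([eye, -eye])

W = fixed_weights(3)
print(W.shape)  # (6, 3)
# rows: (1,0,0), (0,1,0), (0,0,1), (-1,0,0), (0,-1,0), (0,0,-1)
```

        <p>Because these weight vectors do not depend on f, only the scalar parameters of the network have to be adapted to the target function, which is the sense in which the weights are "fixed".</p>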
        <p>
          The results obtained by Maiorov et al. in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] are quite old and describe the situation for relatively simple
function spaces; they were further extended and transferred to more complex function spaces. It will be
a part of our future research to study how our new results affect current knowledge in such spaces.
        </p>
        <p>Moreover, it should be highlighted that our results are purely theoretical; as with other purely
theoretical results, we have not considered any application so far, nor were our results motivated by
any application.</p>
        <p>As for possible implementations, readers can follow the part presented in the last section of our
manuscript, where we are motivated by Guliyev and Ismailov’s constructive approach, for which they
also provide code. We do not provide any source code; readers can follow our recommendations
to prepare their own.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>
        In the presented study, we provided a short survey of universal approximation theorems, structured
according to the properties of activation functions. We also presented our new results, which use strictly
increasing activation functions and improve previous constructive results that relied on λ-monotonicity.
Those results are, in some sense, consequences of ideas improving the older classic results of Maiorov et
al. ([
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), in the sense that we showed that even simpler neural networks can still have the property of
universal approximation.
      </p>
      <p>A natural continuation will be to try to extend our results to other, more complicated
spaces, since so far we have dealt with compact domains in ℝᵈ.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>J. Kupka and Z. Alijani have been supported by the project “Research of Excellence on Digital
Technologies and Wellbeing CZ.02.01.01/00/22_008/0004583”, which is co-financed by the European Union.</p>
      <p>The work of P. Števuliáková was financially supported by the European Union under the
REFRESH – Research Excellence For REgion Sustainability and High-tech Industries project number
CZ.10.03.01/00/22_003/0000048 via the Operational Programme Just Transition.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[6] H. Phan, F. Andreotti, N. Cooray, O. Y. Chén, M. De Vos, Joint classification and prediction CNN framework for automatic sleep stage classification, IEEE Transactions on Biomedical Engineering 66 (2018) 1285–1296.</p>
      <p>[7] J. Kupka, Z. Alijani, P. Števuliáková, Simple neural networks do have universal approximation property, 2025. Manuscript, submitted to Neural Networks.</p>
      <p>[8] K.-I. Funahashi, On the approximate realization of continuous mappings by neural networks, Neural Networks 2 (1989) 183–192.</p>
      <p>[9] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks 4 (1991) 251–257.</p>
      <p>[10] E. K. Blum, L. K. Li, Approximation theory and feedforward networks, Neural Networks 4 (1991) 511–515.</p>
      <p>[11] V. Kůrková, Kolmogorov’s theorem and multilayer neural networks, Neural Networks 5 (1992) 501–506.</p>
      <p>[12] G. Gripenberg, Approximation by neural networks with a bounded number of nodes at each level, Journal of Approximation Theory 122 (2003) 260–266.</p>
      <p>[13] D. Yarotsky, Universal approximations of invariant maps by neural networks, Constructive Approximation 55 (2022) 407–474.</p>
      <p>[14] Z. Lu, H. Pu, F. Wang, Z. Hu, L. Wang, The expressive power of neural networks: A view from the width, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[15] B. Hanin, M. Sellke, Approximating continuous functions by ReLU nets of minimal width, arXiv preprint arXiv:1710.11278 (2017).</p>
      <p>[16] P. Kidger, T. Lyons, Universal approximation with deep narrow networks, in: Conference on Learning Theory, PMLR, 2020, pp. 2306–2327.</p>
      <p>[17] V. Molek, Z. Alijani, Fractional concepts in neural networks: Enhancing activation functions, Pattern Recognition Letters 174 (2025) 151–158.</p>
      <p>[18] Y. Cai, Vocabulary for universal approximation: A linguistic perspective of mapping compositions, arXiv preprint arXiv:2305.12205 (2023).</p>
      <p>[19] A. Kratsios, The universal approximation property: Characterization, construction, representation, and existence, Annals of Mathematics and Artificial Intelligence 89 (2021) 435–469.</p>
      <p>[20] N. J. Guliyev, V. E. Ismailov, On the approximation by single hidden layer feedforward neural networks with fixed weights, Neural Networks 98 (2018) 296–304.</p>
      <p>[21] N. J. Guliyev, V. E. Ismailov, Approximation capability of two hidden layer feedforward neural networks with fixed weights, Neurocomputing 316 (2018) 262–269.</p>
      <p>[22] I. Ohn, Y. Kim, Smooth function approximation by deep neural networks with general activation functions, Entropy 21 (2019) 627.</p>
      <p>[23] K. Biswas, M. Karri, U. Bağcı, A non-monotonic smooth activation function, arXiv (2023).</p>
      <p>[24] D. Sartor, A. Sinigaglia, G. A. Susto, Advancing constrained monotonic neural networks: Achieving universal approximation beyond bounded activations, arXiv preprint arXiv:2505.02537 (2025).</p>
      <p>[25] Z. Lu, H. Pu, F. Wang, Z. Hu, L. Wang, The expressive power of neural networks: A view from the width, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[26] S. Zhang, J. Lu, H. Zhao, Deep network approximation: Beyond ReLU to diverse activation functions, Journal of Machine Learning Research 25 (2024) 1–39.</p>
      <p>[27] B. Hanin, Universal function approximation by deep neural nets with bounded width and ReLU activations, Mathematics 7 (2019) 992.</p>
      <p>[28] J. Johnson, Deep, skinny neural networks are not universal approximators, arXiv abs/1810.00393 (2018).</p>
      <p>[29] Y. Cai, Achieve the minimum width of neural networks for universal approximation, ICLR 2023, arXiv:2209.11395 (2023).</p>
      <p>[30] D. Rochau, R. Chan, H. Gottschalk, New advances in universal approximation with neural networks of minimal width, arXiv preprint arXiv:2411.08735 (2024).</p>
      <p>[31] M. Leshno, V. Y. Lin, A. Pinkus, S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks 6 (1993) 861–867.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cybenko</surname>
          </string-name>
          ,
          <article-title>Approximation by superpositions of a sigmoidal function</article-title>
          ,
          <source>Mathematics of control, signals and systems 2</source>
          (
          <year>1989</year>
          )
          <fpage>303</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hornik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stinchcombe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <article-title>Multilayer feedforward networks are universal approximators</article-title>
          ,
          <source>Neural networks 2</source>
          (
          <year>1989</year>
          )
          <fpage>359</fpage>
          -
          <lpage>366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Maiorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinkus</surname>
          </string-name>
          ,
          <article-title>Lower bounds for approximation by mlp neural networks</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>25</volume>
          (
          <year>1999</year>
          )
          <fpage>81</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Guntuku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Yaden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Kern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Ungar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Eichstaedt</surname>
          </string-name>
          ,
          <article-title>Detecting depression and mental illness on social media: an integrative review</article-title>
          ,
          <source>Current Opinion in Behavioral Sciences</source>
          <volume>18</volume>
          (
          <year>2017</year>
          )
          <fpage>43</fpage>
          -
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Duerichen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Marberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Van Laerhoven</surname>
          </string-name>
          ,
          <article-title>Introducing WESAD, a multimodal dataset for wearable stress and affect detection</article-title>
          ,
          <source>in: Proceedings of the 20th ACM international conference on multimodal interaction</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>400</fpage>
          -
          <lpage>408</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>