=Paper=
{{Paper
|id=Vol-3819/short3
|storemode=property
|title=What makes Models Compositional? A Neuro-Symbolic Theoretical View (Extended Abstract)
|pdfUrl=https://ceur-ws.org/Vol-3819/short3.pdf
|volume=Vol-3819
|authors=Parikshit Ram,Tim Klinger,Alexander G. Gray
|dblpUrl=https://dblp.org/rec/conf/lnsai/RamKG24
}}
==What makes Models Compositional? A Neuro-Symbolic Theoretical View (Extended Abstract)==
Parikshit Ram¹·*, Tim Klinger¹ and Alexander G. Gray²·³

¹ IBM Research
² Centaur AI Institute
³ Purdue University
Abstract
Compositionality is thought to be a key component of language, and various compositional benchmarks have
been developed to empirically probe the compositional generalization of existing sequence processing models.
These benchmarks often highlight failures of existing models, but it is not clear why these models fail in this way.
In this paper, we seek to theoretically understand the role the compositional structure of the models plays in
these failures and how this structure relates to their expressivity and sample complexity. We propose a general
neuro-symbolic definition of compositional functions and their compositional complexity. We then show how
various existing general and special-purpose sequence processing models (such as recurrent, convolutional and
attention-based ones) fit this definition, and use it to analyze their compositional complexity.
1. Introduction
Compositionality is assumed to be integral to language processing [1]. Generalizing in a compositional
manner or compositional generalization is of high interest when learning with sequences since it can
enable a (learned) model to generalize well to a possibly infinite domain of sequences while learning
from only a small number of examples. With this motivation, there has been interest in quantifying the
compositional generalization (CG) of sequence or language models. This has led to various language
modeling benchmarks [2, 3, 4, 5]. These benchmarks empirically probe the compositionality of language
models, often demonstrating the lack of CG. These highlight examples on which the models fail, but
there is no precise understanding of why the failures occur. Nonetheless, various novel methods with
improved compositional generalization (as measured by these benchmarks) have been developed [6, 7, 8],
utilizing specialized models with compositional inductive biases. However, the area of CG still lacks a
mathematical definition and measure of compositionality.
Our contributions. Inspired by existing discussions and recent solutions for CG benchmarks, we
make the following contributions:¹
• We propose a general modular definition of “compositional functions” to facilitate concrete under-
standing of the expressiveness and generalization of such functions, and propose the notion of
“compositional complexity” to quantify the complexity of such functions. Our definition teases out
what we believe to be the “neural” (continuous functions) and the “symbolic” (discrete functions)
parts of a model.
• We demonstrate the flexibility of this definition by highlighting how various existing sequence
processing models such as recurrent models [11], convolutional models and transformer based
models [12] fit this definition, and how complex their compositions are.
LNSAI 2024: First International Workshop on Logical Foundations of Neuro-Symbolic AI, August 05, 2024, Jeju, South Korea
* Corresponding author.
parikshit.ram@ibm.com (P. Ram); tklinger@us.ibm.com (T. Klinger); skirmilitor@gmail.com (A. G. Gray)
https://research.ibm.com/people/parikshit-ram (P. Ram); https://research.ibm.com/people/tim-klinger (T. Klinger); https://www.linkedin.com/in/alexander-gray-b554b64/ (A. G. Gray)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ A preliminary version of this work [9] was previously presented at the KBCG@IJCAI23 workshop, while the extended version for this submission [10] can be found at https://www.arxiv.org/abs/2405.02350.
Figure 1: cDAGs for 𝑓(𝑋) (left) and 𝑓(𝑋̄) (center left) in Example 1, and 𝑓(𝑋) (center right) and 𝑓(𝑋̄) (right) in Example 2. Nodes are labeled 𝑙:𝑖 (level 𝑙, index 𝑖). Sources are fuchsia, sinks sepia, and internal nodes blue.
2. Defining and Quantifying Compositionality
We define compositional functions 𝑓 : 𝒳 → 𝒴 with the domain 𝒳 of input sequences 𝑋 = [𝑥1 , . . . , 𝑥𝐿 ]
of atoms or tokens 𝑥𝑖 ∈ ℐ from an input dictionary ℐ. The range 𝒴 of 𝑓 can be ℝ for regression, {0, 1}
for binary classification, or ℐ for next-token prediction.
Definition 1. To define 𝑓 , we need the following components:
• Token encoder 𝑒 : ℐ × N → ℋ (latent space), with 𝑒𝑖 = 𝑒(𝑥𝑖 , 𝑖) ∈ ℋ encoding the 𝑖th token in 𝑋 ∈ 𝒳 .
• A computation directed acyclic graph (DAG) or cDAG 𝐷 : 𝒳 → 𝒟, where 𝒟 is the space of DAGs, and
𝐷(𝑋) defines the hierarchical processing of a sequence 𝑋. 𝐷(𝑋) can also be viewed as the trace of
the program used by the function 𝑓 to process 𝑋. We describe this in further detail below.
• Span processor 𝑔 : ℋ𝑘 → ℋ maps 𝑘 terms in the latent space into a new term in the latent space.
• Read-out function ℎ : ℋ𝑚 → 𝒴 maps the final set of terms in the latent space to the output space 𝒴.
With 𝑔⊗𝐷(𝑋) denoting the recursive operation of 𝑔 over 𝐷(𝑋), we define a compositional function as:

𝑓(𝑋) = ℎ( 𝑔⊗𝐷(𝑋) (𝑒(𝑥1 , 1), . . . , 𝑒(𝑥𝐿 , 𝐿)) ).  (1)
A computation DAG or cDAG 𝐷(𝑋) ≜ {𝑁 (𝑋), 𝐸(𝑋)} for a specific input sequence 𝑋 ∈ 𝒳 can
depend on 𝑋 or be pre-specified. This cDAG is a leveled DAG with set of nodes 𝑁 (𝑋) and edges 𝐸(𝑋).
Each node 𝑛 ≜ (𝑙:𝑖) ∈ 𝑁(𝑋) has a level 𝑙 and index 𝑖. The recursive application of 𝑔 over 𝐷(𝑋)
induces a value 𝑣𝑙:𝑖 ∈ ℋ for each internal node 𝑛 ∈ 𝑁(𝑋). The sources in 𝑁(𝑋) have level 0, and
there is one source for each 𝑥𝑖 ∈ 𝑋, 𝑖 ∈ J𝐿K ≜ {1, . . . , 𝐿}, with index 𝑖 and value 𝑣0:𝑖 = 𝑒(𝑥𝑖 , 𝑖) ∈ ℋ.
There are 𝑚 sinks in 𝑁 (𝑋), and at most 𝑘 incoming edges and 𝑞 outgoing edges at any node. For an
internal node 𝑛 ∈ 𝑁 (𝑋) with 𝑘 parents 𝑃 (𝑛), the value 𝑣𝑙:𝑖 = 𝑔(𝑣𝑙1 :𝑖1 , . . . , 𝑣𝑙𝑘 :𝑖𝑘 ) ∈ ℋ where 𝑣𝑙𝑗 :𝑖𝑗 is
the value of the 𝑗th parent in 𝑃(𝑛). One way to interpret this cDAG is as the trace of the "forward
pass" during inference.
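The recursive application of 𝑔 over 𝐷(𝑋) in Eq. (1) can be sketched in code. The following is an illustrative reconstruction, not the authors' implementation: the cDAG is modeled as a parent map over (level, index) nodes, the latent space ℋ is modeled as floats, and the names `evaluate_cdag`, `parents`, and `sinks` are our own.

```python
from typing import Callable, Dict, List, Tuple

# A node is (level, index); sources are (0, i).
Node = Tuple[int, int]

def evaluate_cdag(
    tokens: List[str],
    parents: Dict[Node, List[Node]],      # edges E(X) of the cDAG D(X)
    sinks: List[Node],                    # the m sink nodes
    e: Callable[[str, int], float],       # token encoder e(x_i, i)
    g: Callable[[List[float]], float],    # span processor (arity k)
    h: Callable[[List[float]], float],    # read-out over the m sink values
) -> float:
    """Recursive operation of g over D(X): f(X) = h(g^{⊗D(X)}(e_1, ..., e_L))."""
    # v_{0:i} = e(x_i, i) for each source node
    values: Dict[Node, float] = {
        (0, i + 1): e(x, i + 1) for i, x in enumerate(tokens)
    }

    def value(n: Node) -> float:
        if n not in values:               # internal node: v_n = g(parent values)
            values[n] = g([value(p) for p in parents[n]])
        return values[n]

    return h([value(s) for s in sinks])

# A 3-token chain with g = sum, e(x_i, i) = i, h = value at the single sink:
out = evaluate_cdag(
    ["a", "b", "c"],
    parents={(1, 1): [(0, 1), (0, 2)], (2, 1): [(1, 1), (0, 3)]},
    sinks=[(2, 1)],
    e=lambda x, i: float(i),
    g=sum,
    h=lambda vs: vs[0],
)
print(out)  # 6.0
```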
We consider the explicit cDAG because it allows us to see how the different elements 𝑥𝑖 , 𝑖 ∈ J𝐿K of
the input sequence 𝑋 are hierarchically composed to obtain the output. This will allow us to study
the complexity of any compositional function. A "simple" cDAG, where all source nodes connect directly
to a single sink node, would be "applicable" to all functions, but it reveals nothing interesting about a
function's compositional structure. When we study the compositional functions induced by general purpose models
(such as recurrent, convolutional or transformer models), we will see that some models have explicit
cDAGs with more structure, while others have less structured explicit cDAGs, but there are implicit
structures induced in the cDAG; whenever possible, we will explicitly state this implicit structure and
study its properties. From a neuro-symbolic perspective [13, 14], this explicit cDAG can be seen as the
symbolic part, while the 𝑒, 𝑔, ℎ are the neural; note that, in some models, this symbolic cDAG might be
created with neural elements, while in others, the cDAG might be obtained with a symbolic grammar.
This neuro-symbolic view offers a novel theoretical understanding of compositionality.
The span processor 𝑔 : ℋ𝑘 → ℋ takes as input 𝑘 elements from the latent space ℋ and outputs an
element in ℋ. While the definition implies that the same 𝑔 needs to be operated recursively over the
cDAG 𝐷(𝑋), there is no restriction on the inputs and output of 𝑔 regarding the information encoded
in the latent space. For example, if the level 𝑙 of any node 𝑙:𝑖 is encoded into its value 𝑣𝑙:𝑖 , then the 𝑔
will behave differently across levels (level-dependent); if the index 𝑖 of the node 𝑙:𝑖 is encoded into its
value, then 𝑔 will be sensitive to the positional information (order-dependent); if the value of a node
includes the type of the node (for example, a non-terminal in a grammar), then 𝑔 can be type-dependent.
Our definition states that the arity of the span processor 𝑔 : ℋ𝑘 → ℋ is 𝑘. We do so for the ease of
exposition, though our definition can incorporate more flexible span processors (see Ram et al. [10,
Appendix A.2]).
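The level-dependent behaviour described above can be made concrete with a small sketch. This is our own illustration (the paper leaves ℋ and 𝑔 abstract): here ℋ is a (level, value) pair, so a single 𝑔, applied uniformly over the cDAG, still acts differently per level because the level is encoded into each latent value.

```python
# Illustrative sketch: encoding the level l into each latent value makes the
# single span processor g level-dependent. H is modeled as a (level, value) pair.
def g(a, b):
    (la, va), (lb, vb) = a, b
    level = max(la, lb) + 1      # output sits one level above its inputs
    if level == 1:
        out = va + vb            # level-dependent behaviour: add at level 1 ...
    else:
        out = va * vb            # ... and multiply at higher levels
    return (level, out)

e = lambda x, i: (0, float(x))   # token encoder: sources sit at level 0

v11 = g(e(2, 1), e(3, 2))        # level 1: 2 + 3 -> (1, 5.0)
v21 = g(v11, e(4, 3))            # level 2: 5 * 4 -> (2, 20.0), same g, new behaviour
print(v21)
```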
The read-out function ℎ : ℋ𝑚 → 𝒴 finally maps 𝑚 elements in the latent space to the output space
𝒴. This separation between 𝑔 and ℎ was necessary in our proposed definition because we require 𝑔 to
be operable recursively, and thus 𝑔 can operate in a latent space ℋ distinct from 𝒴. In some applications,
ℋ ⊇ 𝒴, in which case ℎ can be the identity function. There are a couple of aspects of this read-out
function we wish to discuss explicitly – (i) We assume that ℎ is specifically non-compositional and
processes its input without breaking it up into any sub-problems; we explicitly define the compositional
function 𝑓 separating out 𝑔, 𝐷, ℎ, where 𝑔 (neural) and 𝐷 (symbolic) represent the compositional part.
(ii) We require ℎ to have a fixed arity of 𝑚, since 𝑔 and 𝐷 aggregate the information over the
input.
In the following, we will illustrate Definition 1 with a couple of examples:
Example 1. Figure 1 (left) shows the cDAG 𝐷(𝑋) for a compositional 𝑓 on 𝑋 = [𝑥1 , . . . , 𝑥5 ],
with 𝑓 (𝑋) = ℎ (𝑔 (𝑔 (𝑒1 , 𝑒2 ) , 𝑔 (𝑔(𝑒3 , 𝑒4 ), 𝑒5 ))), 𝑘 = 2 in-degree, 𝑞 = 1 out-degree, 𝑚 = 1
sink, 𝑒𝑖 = 𝑒(𝑥𝑖 , 𝑖) ∈ ℋ, span-processor 𝑔 : ℋ2 → ℋ, and read-out function ℎ : ℋ → 𝒴. The
values 𝑣0:𝑖 = 𝑒𝑖 for sources 0:𝑖, 𝑖 ∈ {1, . . . , 5}, and the internal node values are: 𝑣1:1 ← 𝑔(𝑒1 , 𝑒2 ),
𝑣1:2 ← 𝑔(𝑒3 , 𝑒4 ), 𝑣2:1 ← 𝑔(𝑣1:2 , 𝑒5 ), 𝑣3:1 ← 𝑔(𝑣1:1 , 𝑣2:1 ). ℎ operates on 𝑣3:1 at sink 3:1. Figure 1
(center left) shows the cDAG 𝐷(𝑋̄) of the same 𝑓 on 𝑋̄ ≠ 𝑋, with the same 𝑘 = 2, 𝑞 = 1, 𝑚 = 1
and 𝑓(𝑋̄) = ℎ (𝑔(𝑔(𝑒1 , 𝑔(𝑒2 , 𝑒3 )), 𝑔(𝑒4 , 𝑒5 ))). Note that 𝐷(𝑋) is not the same as 𝐷(𝑋̄).
Example 2. Figure 1 (center right) shows the cDAG 𝐷(𝑋) for a compositional 𝑓 on 𝑋 = [𝑥1 , . . . , 𝑥7 ],
with 𝑓(𝑋) = ℎ (𝑣4:1 , 𝑣3:1 ), 𝑘 = 3 maximum in-degree, 𝑞 = 3 maximum out-degree, 𝑚 = 2
sinks, 𝑒𝑖 = 𝑒(𝑥𝑖 , 𝑖) ∈ ℋ, span processor 𝑔 : ℋ3 → ℋ, and read-out function ℎ : ℋ2 →
𝒴. The source values 𝑣0:𝑖 = 𝑒𝑖 for each 𝑖 ∈ {1, . . . , 7}, and the internal node values are:
𝑣1:1 ← 𝑔(𝑒1 , 𝑒2 , 𝑒3 ), 𝑣1:2 ← 𝑔(𝑒2 , 𝑒3 , 𝑒4 ), 𝑣1:3 ← 𝑔(𝑒3 , 𝑒5 , 𝑒7 ), 𝑣1:4 ← 𝑔(𝑒4 , 𝑒5 , 𝑒6 ), 𝑣1:5 ←
𝑔(𝑒5 , 𝑒6 , 𝑒7 ), 𝑣2:1 ← 𝑔(𝑣1:1 , 𝑣1:2 , 𝑣1:3 ), 𝑣2:2 ← 𝑔(𝑣1:1 , 𝑣1:3 , 𝑣1:4 ), 𝑣2:3 ← 𝑔(𝑣1:2 , 𝑣1:4 , 𝑣1:5 ),
𝑣2:4 ← 𝑔(𝑣1:3 , 𝑣1:4 , 𝑣1:5 ), 𝑣3:1 ← 𝑔(𝑣2:1 , 𝑣2:2 , 𝑣2:3 ), 𝑣3:2 ← 𝑔(𝑣2:2 , 𝑣2:3 , 𝑣2:4 ), 𝑣4:1 ←
𝑔(𝑣3:2 , 𝑣2:3 , 𝑣2:4 ). ℎ operates on 𝑣3:1 and 𝑣4:1 at sinks 3:1 and 4:1. Figure 1 (right) shows the
cDAG 𝐷(𝑋̄) of the same 𝑓 on 𝑋̄ ≠ 𝑋 with the same 𝑘 = 3, 𝑞 = 3, 𝑚 = 2.
While Example 1 is a simple compositional function on a sequence, Example 2 is a more sophisticated
one. This is to highlight that our proposed Definition 1 can handle functions which require more
complex interactions between the tokens in a sequence. Example 1 has a cDAG with a maximum
out-degree 𝑞 = 1, implying a single path from any source to a sink. Example 2 has a cDAG with a
maximum out-degree 𝑞 = 3 across all levels in the DAG, implying that there can be a large number of
paths to any sink from a source. This allows the definition to include functions where certain tokens in
the sequence are of much higher importance to the output than others. These examples also highlight
that edges in the cDAG are allowed to skip levels, and the sinks can be from different levels, further
highlighting the compositional flexibility.
We would like to remark on a couple of points here: (i) Through these examples, we show that our definition
explicitly considers how the problem of sequence processing is broken up into sub-problems – the
cDAG embodies how disjoint or intertwined these “sub-problems” are by explicitly considering the
computation hierarchy. (ii) For input sequences 𝑋, 𝑋 ¯ from the same problem domain, and the same
compositional function 𝑓 , we allow the cDAG to be different – cDAG 𝐷(𝑋) can be input-dependent –
thereby allowing different input sequences to have different sub-problem hierarchies. At a non-technical
level, we also believe that our proposed Definition 1 connects intuitively to existing definitions:
The meaning of the whole (𝑓 : 𝒳 → 𝒴) is a function (ℎ : ℋ𝑚 → 𝒴) of the meanings of the parts (𝑔 : ℋ𝑘 → ℋ) and of the way they are syntactically combined (𝐷 : 𝒳 → 𝒟).
Both Examples 1 and 2 can be seen as compositional functions, but Example 2 is clearly a more complex
composition. In addition to its intuitive nature, our proposed definition allows us to understand how
complex the compositionality is beyond just stating if a function is compositional. The compositional
complexity of 𝑓 depends on the functions 𝑔, ℎ, 𝑒 as well as the cDAG function 𝐷 that drives the
computation. For a sequence 𝑋 of length 𝐿, 𝐷(𝑋) has 𝐿 source nodes, a maximum in-degree of 𝑘
(controlling the span size for 𝑔), 𝑚 sink nodes (controlling the capacity of ℎ), and a maximum out-degree of 𝑞
(quantifying the "localism" of the effect of a node). However, these do not explicitly incorporate the fact
that changes to nodes at lower levels of the cDAG can have a larger effect on the output than changes
to nodes at higher levels of the cDAG. We propose a new quantification – the locus of influence (LoI):
Definition 2 (LoI of a source node). Consider a function 𝑓 with components 𝑒, 𝐷, 𝑔, ℎ (Definition 1). Let
(𝑣𝑛1 , . . . , 𝑣𝑛𝑗 , . . . , 𝑣𝑛𝑘 ) ∈ ℋ𝑘 be any input to the span processor 𝑔, with 𝑣𝑛 = 𝑔(𝑣𝑛1 , . . . , 𝑣𝑛𝑗 , . . . , 𝑣𝑛𝑘 )
its output. Let 𝜀 ∈ ℋ be a "perturbation" to the 𝑗th argument of 𝑔, 𝑗 ∈ J𝑘K, resulting in the perturbed
output 𝑣𝑛(𝜀𝑗) = 𝑔(𝑣𝑛1 , . . . , 𝑣𝑛𝑗 + 𝜀, . . . , 𝑣𝑛𝑘 ). Let 𝑐 > 0 be a constant such that ∀𝑗 ∈ J𝑘K, ∀𝜀 ∈ ℋ,
‖𝑣𝑛 − 𝑣𝑛(𝜀𝑗)‖ ≤ 𝑐‖𝜀‖. For a sequence 𝑋 ∈ 𝒳 of length 𝐿 and a source node 0:𝑖 in 𝐷(𝑋), let 𝑃(𝑥𝑖) be
the set of all unique paths from 0:𝑖 to any of the sink nodes in 𝐷(𝑋). The absolute LoI of index 𝑖 is
𝛿𝑖 = Σ_{𝑃 ∈ 𝑃(𝑥𝑖)} 𝑐^{|𝑃|}, with |𝑃| the length of a path 𝑃 ∈ 𝑃(𝑥𝑖), and the relative LoI is 𝛽𝑖 = 𝛿𝑖 / Σ_{𝑗∈J𝐿K} 𝛿𝑗 .
This definition of the complexity of composition incorporates both the complexity of the cDAG 𝐷(𝑋)
and the complexity of the span processor 𝑔 : ℋ𝑘 → ℋ in terms of its smoothness, with higher values
of 𝑐 indicating more complex (less smooth) 𝑔. The absolute LoI 𝛿𝑖 incorporates the effect of longer
paths, with the effect growing with path length, and corresponds to the sensitivity of the compositional
function output to any one input token in the sequence.
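The LoI of Definition 2 can be computed by brute-force path enumeration. The sketch below is our own illustration for the cDAG of Example 1 (Figure 1, left), with an arbitrary choice of smoothness constant 𝑐; the names `loi`, `children`, `sources`, and `sinks` are ours.

```python
# Locus of influence (Definition 2) by path enumeration.
def loi(children, sources, sinks, c):
    """children: node -> list of child nodes. Returns {source: (delta_i, beta_i)}."""
    def path_lengths(n):
        # counts of root-to-sink paths from n, keyed by length in edges
        if n in sinks:
            return {0: 1}
        counts = {}
        for ch in children.get(n, []):
            for length, cnt in path_lengths(ch).items():
                counts[length + 1] = counts.get(length + 1, 0) + cnt
        return counts

    # delta_i = sum over paths P from 0:i of c^{|P|}
    delta = {
        s: sum(cnt * c ** length for length, cnt in path_lengths(s).items())
        for s in sources
    }
    total = sum(delta.values())
    # beta_i = delta_i / sum_j delta_j
    return {s: (delta[s], delta[s] / total) for s in sources}

# Example 1's cDAG, as child lists (edges point toward the single sink 3:1):
children = {
    (0, 1): [(1, 1)], (0, 2): [(1, 1)],
    (0, 3): [(1, 2)], (0, 4): [(1, 2)], (0, 5): [(2, 1)],
    (1, 1): [(3, 1)], (1, 2): [(2, 1)], (2, 1): [(3, 1)],
}
sources = [(0, i) for i in range(1, 6)]
res = loi(children, sources, sinks={(3, 1)}, c=2.0)
print(res[(0, 1)])  # delta_1 = c^2 = 4.0, beta_1 = 1/(2c+3) = 1/7
print(res[(0, 3)])  # delta_3 = c^3 = 8.0, beta_3 = c/(2c+3) = 2/7
```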
The smaller the absolute LoI 𝛿𝑖 of an input index 𝑖, the more local its effect, and thus the more structure
can be transferred between examples if 𝑥𝑖 is replaced with something else. A relative LoI 𝛽𝑖 greater than
1/𝐿 denotes that the input index 𝑖 (and thus input token 𝑥𝑖 ) has an out-sized effect on 𝐷(𝑋) (and thus
the computation) compared to the other indices (tokens). In Example 1 (left), 𝛿1 = 𝑐2 and 𝛽1 = 1/(2𝑐+3) < 1/5,
while 𝛿3 = 𝑐3 and 𝛽3 = 𝑐/(2𝑐+3) > 1/5, implying that 𝑥3 has more influence (absolute and relative) on the function
than 𝑥1 (assuming 𝑐 > 1). In Example 2 (center right), 𝛿1 = 𝑐4 + 2𝑐3 and 𝛽1 = (𝑐+2)/(27𝑐+39) ≈ 1/22 < 1/7, while
𝛿5 = 7𝑐4 + 9𝑐3 and 𝛽5 = (7𝑐+9)/(27𝑐+39) ≈ 1/4 > 1/7; hence 𝑥5 has a significantly larger influence than 𝑥1 . We
utilize the LoI to define the complexity of a compositional function, and a class of such compositional
functions:
Definition 3. A function 𝑓 : 𝒳 → 𝒴 with components 𝑔, ℎ, 𝑒, 𝐷 is (𝑘, 𝑞, 𝑚, 𝛿, 𝛽)-compositional if, for
any 𝑋 ∈ 𝒳 of length 𝐿, the cDAG 𝐷(𝑋) has a maximum in-degree of 𝑘, a maximum out-degree of 𝑞, and
𝑚 sink nodes, and ∀𝑖 ∈ J𝐿K, 𝛿𝑖 ≤ 𝛿 and 𝛽𝑖 ≤ 𝛽 ∈ [1/𝐿, 1). We denote with ℱ a class of such
(𝑘, 𝑞, 𝑚, 𝛿, 𝛽)-compositional functions.
A small 𝛿 and a 𝛽 close to 1/𝐿 signifies a function that possesses a high level of localism across all
input sequences and tokens in its domain. While this function has the most structure, it might not
be suitable for practical purposes. A high 𝛿 and a 𝛽 close to 1/𝐿 signifies a very complex function
where there is a lot of interaction between all the input tokens in all input sequences, making it hard to
exploit any compositional structure in the function. A high 𝛿 and a 𝛽 significantly higher than 1/𝐿
indicate an interesting class of functions where some input tokens can have a high influence over
the function computation but, for most tokens, there is a compositional structure in the function that
can be exploited. This intuitively seems to be an interesting and more practical class of compositional
functions since assuming all tokens have an equal level of relative influence seems quite restrictive.
Figure 2: The cDAGs for various existing sequence processing models: (a) unidirectional recurrence, (b) bidirectional recurrence, (c) convolve + pool, and (d) attention-based transformer.
| Model | Input-dependent cDAG | Arbitrary Len | (𝑘, 𝑞, 𝑚) | 𝛿 | 𝛽 |
|---|---|---|---|---|---|
| Unidirectional RNN (2a) | ✗ | ✓ | (2, 1, 1) | 𝑐^(𝐿−1) | 1/2 |
| Bidirectional RNN (2b) | ✗ | ✓ | (2, 2, 2) | 𝑐^(𝐿−1) | 1/4 |
| Convolve (width 𝑤) + pool (width 𝑝) (2c) | † | ‡ | (𝑤+𝑝, 𝑤, 𝑚) | 𝑐^(log 𝐿) | (2/𝐿)(1 + 1/𝑝) |
| Transformer (2d) | ✗ | ✓ | (𝐿, 𝐿, 1) | (𝐿𝑐)^𝑀 | 1/𝐿 |
| Transformer w/ 𝐾-sparse attention | ✓ | ✓ | (𝐾, 𝐿, 1) | 𝐿(𝐾𝑐)^𝑀 | 1/𝐾 |

Table 1
Compositional complexities of existing models with input sequences of length 𝐿. LoI is specified approximately
for ease of exposition. †: Convolve+Pool induces input-dependent cDAGs for max/min-pool, but not for avg/sum-pool.
‡: The number of sinks 𝑚 needs to be specified for Convolve+Pool, and the model can handle arbitrary
lengths if it can recursively apply Convolve+Pool until the number of nodes at a level is reduced to 𝑚.
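The unidirectional-RNN row of Table 1 can be sanity-checked numerically. This is our own sketch under a direct reading of the chain cDAG in Figure 2(a) (𝑣𝑙:1 = 𝑔(𝑣𝑙−1:1, 𝑒𝑙+1)); the name `rnn_loi` and the concrete values of 𝐿 and 𝑐 are arbitrary choices.

```python
# Unidirectional RNN (chain cDAG with single sink (L-1):1): the first token's
# path has L-1 edges, so delta_1 = c^{L-1}; for large smoothness constant c,
# its relative LoI beta_1 approaches the tabulated 1/2 (x_1 and x_2 tie).
def rnn_loi(L, c):
    # path lengths (in edges) from each source 0:i to the sink:
    # x_1 and x_2 both enter at level 1; x_i (i >= 3) enters at level i-1
    lengths = [L - 1, L - 1] + [L - j for j in range(2, L)]
    delta = [c ** l for l in lengths]           # absolute LoI per source
    total = sum(delta)
    beta = [d / total for d in delta]           # relative LoI per source
    return delta, beta

delta, beta = rnn_loi(L=8, c=10.0)
print(delta[0])           # 10000000.0 = c^{L-1}, matching Table 1
print(round(beta[0], 3))  # 0.474, approaching the tabulated beta = 1/2 as c grows
```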
In Fig. 2, we re-express existing sequence processing models as per our definition, teasing out the
symbolic cDAG (and the neural 𝑔, ℎ), and we present their (simplified) compositional complexity in
Table 1, assuming that all model classes utilize span processors 𝑔 with the same smoothness parameter
𝑐 for ease of exposition. This highlights the flexibility and utility of our proposed quantification of
compositionality (see Ram et al. [10, Section 4] for more details and models). Beyond the properties
of the cDAG (the in-degree 𝑘, out-degree 𝑞 and number of sink nodes 𝑚) and the upper bounds on
the absolute LoI 𝛿 and relative LoI 𝛽, we also highlight two properties: (i) Whether the model utilizes
(implicitly or explicitly) an input-dependent cDAG (that is, 𝐷(𝑋) is not the same DAG for all 𝑋 of length
𝐿), and (ii) Whether the same model is able to operate on arbitrary length input sequences. The use of
input-dependent cDAG has implications for the expressivity of the model: it can be shown
that functions from a model class with a fixed, input-agnostic cDAG cannot approximate functions
from a model class of matching compositional complexity (that is, the same 𝛿, 𝛽) that utilizes
input-dependent cDAGs. Ram et al. [10, Theorem 1] show that the approximation error is lower and
upper bounded by 𝒪(𝛿) and 𝒪(𝛿/𝛽), respectively. This implies that a higher absolute compositional
complexity 𝛿 and a smaller relative compositional complexity 𝛽 adversely affect the approximation
guarantees. The absolute compositional complexity 𝛿
has been shown to be directly tied to the generalization gap for a learned compositional function, with
higher 𝛿 implying worse systematic generalization guarantee [10, Theorem 2]. The ability to operate
on arbitrary-length sequences is a prerequisite for a model to possess length generalization
or productivity: the ability to generalize to sequences longer than those seen during training.
We will be pursuing length generalization in our future work.
3. Conclusion
In this paper, we briefly present our novel definition of compositional functions that explicitly separates
out the neural and symbolic aspects of a model for the ease of analysis. We also present a notion of
compositional complexity that quantifies the complexity with which the tokens in an input sequence
are put together to get to the output. We briefly highlight the generality and utility of this definition by
demonstrating how existing sequence processing models fit into this definition.
References
[1] P. Pagin, D. Westerståhl, Compositionality I: Definitions and variants, Philosophy Compass 5 (2010)
250–264. URL: https://compass.onlinelibrary.wiley.com/doi/abs/10.1111/j.1747-9991.2009.00228.x.
[2] B. Lake, M. Baroni, Generalization without systematicity: On the compositional skills of sequence-
to-sequence recurrent networks, in: International Conference on Machine Learning, PMLR, 2018,
pp. 2873–2882. URL: https://proceedings.mlr.press/v80/lake18a.html.
[3] N. Kim, T. Linzen, COGS: A compositional generalization challenge based on semantic interpreta-
tion, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2020, pp. 9087–9105. URL: https://aclanthology.org/2020.emnlp-main.731.pdf.
[4] D. Hupkes, V. Dankers, M. Mul, E. Bruni, Compositionality decomposed: how do neural networks
generalise?, Journal of Artificial Intelligence Research 67 (2020) 757–795. URL: https://jair.org/
index.php/jair/article/view/11674.
[5] T. Klinger, D. Adjodah, V. Marois, J. Joseph, M. Riemer, A. S. Pentland, M. Campbell, A study
of compositional generalization in neural models, arXiv preprint arXiv:2006.09437 (2020). URL:
https://arxiv.org/abs/2006.09437.
[6] Y. Li, L. Zhao, J. Wang, J. Hestness, Compositional generalization for primitive substitutions, in:
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and
the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019,
pp. 4284–4293. URL: https://aclanthology.org/D19-1438/.
[7] Q. Liu, S. An, J.-G. Lou, B. Chen, Z. Lin, Y. Gao, B. Zhou, N. Zheng, D. Zhang, Compo-
sitional generalization by learning analytical expressions, Advances in Neural Information
Processing Systems 33 (2020). URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/
83adc9225e4deb67d7ce42d58fe5157c-Paper.pdf.
[8] M. Nye, A. Solar-Lezama, J. Tenenbaum, B. M. Lake, Learning compositional rules
via neural program synthesis, Advances in Neural Information Processing Systems
33 (2020) 10832–10842. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/
7a685d9edd95508471a9d3d6fcace432-Paper.pdf.
[9] P. Ram, T. Klinger, A. G. Gray, How compositional is a model?, in: International Joint Conference
on Artificial Intelligence 2023 Workshop on Knowledge-Based Compositional Generalization, 2023.
URL: https://openreview.net/forum?id=OImyRhNLv3.
[10] P. Ram, T. Klinger, A. G. Gray, What makes Models Compositional? A Theoretical View: With
Supplement, arXiv preprint arXiv:2405.02350 (2024). URL: https://arxiv.org/abs/2405.02350.
[11] S. Hochreiter, J. Schmidhuber, LSTM can solve hard long time lag problems, Advances in neural
information processing systems 9 (1996). URL: https://proceedings.neurips.cc/paper/1996/file/
a4d2f0d23dcc84ce983ff9157f8b7f88-Paper.pdf.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
I. Polosukhin, Attention is all you need, Advances in neural information pro-
cessing systems 30 (2017). URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[13] M. K. Sarker, L. Zhou, A. Eberhart, P. Hitzler, Neuro-symbolic artificial intelligence, AI Communi-
cations 34 (2021) 197–209. URL: https://arxiv.org/pdf/2105.05330.pdf.
[14] A. d. Garcez, S. Bader, H. Bowman, L. C. Lamb, L. de Penning, B. Illuminoo, H. Poon, C. G. Zaverucha,
Neural-symbolic learning and reasoning: A survey and interpretation, Neuro-Symbolic Artificial
Intelligence: The State of the Art 342 (2022) 327. URL: https://arxiv.org/pdf/1711.03902.pdf.