<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>What makes Models Compositional? A Neuro-Symbolic Theoretical View (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Parikshit Ram</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Klinger</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander G. Gray</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centaur AI Institute</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IBM Research</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Purdue University</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Compositionality is thought to be a key component of language, and various compositional benchmarks have been developed to empirically probe the compositional generalization of existing sequence processing models. These benchmarks often highlight failures of existing models, but it is not clear why these models fail in this way. In this paper, we seek to theoretically understand the role the compositional structure of the models plays in these failures and how this structure relates to their expressivity and sample complexity. We propose a general neuro-symbolic definition of compositional functions and their compositional complexity. We then show how various existing general and special purpose sequence processing models (such as recurrent, convolutional and attention-based ones) fit this definition and use it to analyze their compositional complexity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>[Figure 1: The cDAGs 𝒟(x) (left) and 𝒟(x̄) (center left) in Example 1, and 𝒟(x) (center right) and 𝒟(x̄) (right) in Example 2. Nodes are labeled v_{l:i} (level l, index i). Sources are Fuchsia, sinks Sepia, and internal nodes Blue.]</p>
    </sec>
    <sec id="sec-2">
      <title>2. Defining and Quantifying Compositionality</title>
<p>We define compositional functions f : 𝒳 → 𝒴 with the domain 𝒳 of input sequences x = {x₁, . . . , xₙ} of atoms or tokens xᵢ ∈ ℐ from an input dictionary ℐ. The range 𝒴 of f can be ℝ for regression, {0, 1} for binary classification, or ℐ for next token prediction.</p>
      <p>Definition 1. To define f, we need the following components:
• Token encoder e : ℐ × ℕ → ℋ (latent space), with hᵢ = e(xᵢ, i) ∈ ℋ encoding the i-th token in x ∈ 𝒳.
• A computation directed acyclic graph (DAG) or cDAG 𝒟 : 𝒳 → 𝔻, where 𝔻 is the space of DAGs, and 𝒟(x) defines the hierarchical processing of a sequence x. 𝒟(x) can also be viewed as the trace of the program used by the function f to process x. We will describe this in further detail soon.
• Span processor g : ℋ^a → ℋ maps a terms in the latent space into a new term in the latent space.
• Read-out function h : ℋ^s → 𝒴 maps the final set of s terms in the latent space to the output space 𝒴.</p>
<p>With g^{⊗𝒟(x)} denoting the recursive operation of g over 𝒟(x), we define a compositional function as:
f(x) = h( g^{⊗𝒟(x)}( e(x₁, 1), . . . , e(xₙ, n) ) ). (1)</p>
<p>A computation DAG or cDAG 𝒟(x) ≜ {𝒱(x), ℰ(x)} for a specific input sequence x ∈ 𝒳 can
depend on x or be pre-specified. This cDAG is a leveled DAG with set of nodes 𝒱(x) and edges ℰ(x).
Each node v ≜ (l : i) ∈ 𝒱(x) has a level l and index i. The recursive application of g over 𝒟(x)
induces a value h_{l:i} ∈ ℋ for each internal node v ∈ 𝒱(x). The sources in 𝒱(x) have level 0, and
there is one source for each x_i ∈ x, i ∈ ⟦n⟧ ≜ {1, . . . , n}, with index i and value h_{0:i} = e(x_i, i) ∈ ℋ.
There are s sinks in 𝒱(x), and at most a incoming edges and b outgoing edges at any node. For an
internal node v ∈ 𝒱(x) with p parents 𝒫(v), the value h_{l:i} = g(h_{l₁:i₁}, . . . , h_{l_p:i_p}) ∈ ℋ, where h_{l_j:i_j} is
the value of the j-th parent in 𝒫(v). One way to interpret this cDAG is as the trace of the “forward-pass”
for inference.</p>
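      <p>Equation (1) is mechanical enough to spell out as code. The following is a minimal Python sketch of this recursion, assuming the cDAG is encoded as a map from each internal node (level, index) to its ordered list of parents; the name evaluate_cdag and this encoding are our illustrative choices, not notation from the paper.</p>
      <preformat>
# Minimal sketch of Eq. (1): recursively apply the span processor g over
# a leveled cDAG, then apply the read-out function h at the sink nodes.
def evaluate_cdag(x, cdag, sinks, e, g, h):
    """x: token sequence; cdag: {(level, index): ordered parent list};
    sinks: ordered list of sink nodes; e, g, h as in Definition 1."""
    # Sources at level 0: one per token, with value h_{0:i} = e(x_i, i).
    values = {(0, i): e(tok, i) for i, tok in enumerate(x, start=1)}
    # Process internal nodes level by level, so parent values are ready
    # (edges may skip levels, but parents always sit at lower levels).
    for node in sorted(cdag, key=lambda v: v[0]):
        values[node] = g(*(values[p] for p in cdag[node]))
    # Read out the final s terms at the sinks.
    return h(*(values[v] for v in sinks))
      </preformat>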
      <p>
We consider the explicit cDAG because it allows us to see how the different elements x_i, i ∈ ⟦n⟧, of
the input sequence x are hierarchically composed to obtain the output. This will allow us to study
the complexity of any compositional function. A “simple” cDAG, where all source nodes just connect
to a single sink node, would be “applicable” to all functions, but it does not allow us to study functions
in an interesting manner. When we study the compositional functions induced by general purpose models
(such as recurrent, convolutional or transformer models), we will see that some models have explicit
cDAGs with more structure, while others have less structured explicit cDAGs but implicit
structures induced in the cDAG; whenever possible, we will explicitly state this implicit structure and
study its properties. From a neuro-symbolic perspective [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
], this explicit cDAG can be seen as the
symbolic part, while e, g, h are the neural parts; note that, in some models, this symbolic cDAG might be
created with neural elements, while in others, the cDAG might be obtained with a symbolic grammar.
This neuro-symbolic view offers a novel theoretical understanding of compositionality.
      </p>
<p>The span processor g : ℋ^a → ℋ takes as input a elements from the latent space ℋ and outputs an
element in ℋ. While the definition implies that the same g needs to be operated recursively over the
cDAG 𝒟(x), there is no restriction on the inputs and output of g regarding the information encoded
in the latent space. For example, if the level l of any node v_{l:i} is encoded into its value h_{l:i}, then g
will behave differently across levels (level-dependent); if the index i of the node v_{l:i} is encoded into its
value, then g will be sensitive to the positional information (order-dependent); if the value of a node
includes the type of the node (for example, a non-terminal in a grammar), then g can be type-dependent.
Our definition states that the arity of the span processor g : ℋ^a → ℋ is a. We do so for the ease of
exposition, though our definition can incorporate more flexible span processors (see Ram et al. [10,
Appendix A.2]).</p>
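      <p>As a small illustration of this flexibility, a level-dependent g can be obtained by simply carrying the level along in each latent value. The (content, level) pair representation below is a hypothetical choice made only for exposition.</p>
      <preformat>
# Illustrative level-dependent span processor: each latent value is a
# (content, level) pair, so g can branch on the level of its inputs.
def e(tok, i):
    return (float(tok), 0)  # source nodes sit at level 0

def g(*args):
    contents = [c for c, lvl in args]
    level = 1 + max(lvl for c, lvl in args)  # simple level bookkeeping
    # A different operation at level 1 than at higher levels:
    out = sum(contents) if level == 1 else max(contents)
    return (out, level)
      </preformat>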
<p>The read-out function h : ℋ^s → 𝒴 finally maps s elements in the latent space to the output space
𝒴. This separation between g and h was necessary in our proposed definition because we require g to
be operable recursively, and thus g can operate in a latent space ℋ distinct from 𝒴. In some applications,
ℋ ⊇ 𝒴, in which case h can be an identity function. There are a couple of aspects of this read-out
function we wish to discuss explicitly – (i) We assume that h is specifically non-compositional and
processes its input without breaking it up into any sub-problems; we explicitly define the compositional
function f separating out g, 𝒟, h, where g (neural) and 𝒟 (symbolic) represent the compositional part.
(ii) We require h to have a fixed arity of s since g and 𝒟 are aggregating the information over the
input.</p>
<p>In the following, we will illustrate Definition 1 with a couple of examples:</p>
      <p>Example 1. Figure 1 (left) shows the cDAG 𝒟(x) for a compositional f on x = [x₁, . . . , x₅],
with f(x) = h(g(g(x₁, x₂), g(g(x₃, x₄), x₅))), a = 2 in-degree, b = 1 out-degree, s = 1
sink, hᵢ = e(xᵢ, i) ∈ ℋ, span-processor g : ℋ² → ℋ, and read-out function h : ℋ → 𝒴. The
values h_{0:i} = hᵢ for sources v_{0:i}, i ∈ {1, . . . , 5}, and the internal node values are: h_{1:1} ← g(h₁, h₂),
h_{1:2} ← g(h₃, h₄), h_{2:1} ← g(h_{1:2}, h₅), h_{3:1} ← g(h_{1:1}, h_{2:1}). h operates on h_{3:1} at sink v_{3:1}. Figure 1
(center left) shows the cDAG 𝒟(x̄) of the same f on x̄ ≠ x with the same a = 2, b = 1, s = 1
and f(x̄) = h(g(g(x̄₁, g(x̄₂, x̄₃)), g(x̄₄, x̄₅))). Note that 𝒟(x) is not the same as 𝒟(x̄).</p>
      <p>Example 2. Figure 1 (center right) shows the cDAG 𝒟(x) for a compositional f on x = [x₁, . . . , x₇],
with f(x) = h(h_{4:1}, h_{3:1}), a = 3 maximum in-degree, b = 3 maximum out-degree, s = 2
sinks, hᵢ = e(xᵢ, i) ∈ ℋ, span processor g : ℋ³ → ℋ, and read-out function h : ℋ² →
𝒴. The source values h_{0:i} = hᵢ for each i ∈ {1, . . . , 7}, and the internal node values are:
h_{1:1} ← g(h₁, h₂, h₃), h_{1:2} ← g(h₂, h₃, h₄), h_{1:3} ← g(h₃, h₅, h₇), h_{1:4} ← g(h₄, h₅, h₆), h_{1:5} ←
g(h₅, h₆, h₇), h_{2:1} ← g(h_{1:1}, h_{1:2}, h_{1:3}), h_{2:2} ← g(h_{1:1}, h_{1:3}, h_{1:4}), h_{2:3} ← g(h_{1:2}, h_{1:4}, h_{1:5}),
h_{2:4} ← g(h_{1:3}, h_{1:4}, h_{1:5}), h_{3:1} ← g(h_{2:1}, h_{2:2}, h_{2:3}), h_{3:2} ← g(h_{2:2}, h_{2:3}, h_{2:4}), h_{4:1} ←
g(h_{3:2}, h_{2:3}, h_{2:4}). h operates on h_{3:1} and h_{4:1} at sinks v_{3:1} and v_{4:1}. Figure 1 (right) shows the
cDAG 𝒟(x̄) of the same f on x̄ ≠ x with the same a = 3, b = 3, s = 2.</p>
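      <p>As a concrete check of Example 1, its cDAG can be written down directly and fed to the evaluation sketch above. The toy e, g, h below are arbitrary stand-ins of the right arities, chosen only for illustration.</p>
      <preformat>
# Example 1's cDAG, with a = 2, b = 1, s = 1 (toy e, g, h for illustration).
enc = lambda tok, i: float(tok)   # token encoder, ignoring position here
g2 = lambda u, v: u + v           # a toy binary span processor
h1 = lambda u: u                  # identity read-out (here H and Y coincide)

example1_cdag = {
    (1, 1): [(0, 1), (0, 2)],     # h_{1:1} = g(h_1, h_2)
    (1, 2): [(0, 3), (0, 4)],     # h_{1:2} = g(h_3, h_4)
    (2, 1): [(1, 2), (0, 5)],     # h_{2:1} = g(h_{1:2}, h_5); skips level 1
    (3, 1): [(1, 1), (2, 1)],     # h_{3:1} = g(h_{1:1}, h_{2:1})
}
out = evaluate_cdag([1, 2, 3, 4, 5], example1_cdag, [(3, 1)], enc, g2, h1)
print(out)  # 15.0 with these toy components
      </preformat>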
      <p>While Example 1 is a simple compositional function on a sequence, Example 2 is a more sophisticated
one. This is to highlight that our proposed Definition 1 can handle functions which require more
complex interactions between the tokens in a sequence. Example 1 has a cDAG with a maximum
out-degree  = 1, implying a single path from any source to a sink. Example 2 has a cDAG with a
maximum out-degree  = 3 across all levels in the DAG, implying that there can be a large number of
paths to any sink from a source. This allows the definition to include functions where certain tokens in
the sequence are of much higher importance to the output than others. These examples also highlight
that edges in the cDAG are allowed to skip levels, and the sinks can be from different levels, further
highlighting the compositional flexibility.</p>
<p>We would like to remark on a couple of points here: (i) Through these examples, we show that our definition
explicitly considers how the problem of sequence processing is broken up into sub-problems – the
cDAG embodies how disjoint or intertwined these “sub-problems” are by explicitly considering the
computation hierarchy. (ii) For input sequences x, x̄ from the same problem domain, and the same
compositional function f, we allow the cDAG to be different – the cDAG 𝒟(x) can be input-dependent –
thereby allowing different input sequences to have different sub-problem hierarchies. At a non-technical
level, we also believe that our proposed Definition 1 connects intuitively to existing definitions:</p>
      <p>The meaning of the whole [f : 𝒳 → 𝒴] is a function of [h : ℋ^s → 𝒴] the meanings of the parts [g : ℋ^a → ℋ], and of the way they are syntactically combined [𝒟 : 𝒳 → 𝔻].</p>
      <p>
Both Examples 1 and 2 can be seen as compositional functions, but Example 2 is clearly a more complex
composition. In addition to its intuitive nature, our proposed definition allows us to understand how
complex the compositionality is, beyond just stating whether a function is compositional. The compositional
complexity of f depends on the functions g, h, e as well as the cDAG function 𝒟 that drives the
computation. For a sequence x of length n, 𝒟(x) has n source nodes, a maximum in-degree of a
(controlling the span size for g), s sink nodes (controlling the capacity of h), and a maximum out-degree of b
(quantifying the “localism” of the effect of a node). However, these do not explicitly incorporate the fact
that changes to nodes at lower levels of the cDAG can have a larger effect on the output than changes
to nodes at higher levels of the cDAG. We propose a new quantification – the locus of influence (LoI):
Definition 2 (LoI of a source node). Consider a function f with components e, g, 𝒟, h (Definition 1). Let
(u₁, . . . , u_j, . . . , u_a) ∈ ℋ^a be any input to the span processor g, with u = g(u₁, . . . , u_j, . . . , u_a)
its output. Let δ ∈ ℋ be a “perturbation” to the j-th argument to g, j ∈ ⟦a⟧, resulting in the perturbed
output u(δ) = g(u₁, . . . , u_j + δ, . . . , u_a). Let L &gt; 0 be a constant such that ∀j ∈ ⟦a⟧, ∀δ ∈ ℋ,
‖u − u(δ)‖ ≤ L‖δ‖. For a sequence x ∈ 𝒳 of length n, and a source node v_{0:i} in 𝒟(x), let Π_i(x) be
the set of all unique paths from v_{0:i} to any of the sink nodes in 𝒟(x). The absolute LoI of index i is Λ_i =
Σ_{P ∈ Π_i(x)} L^{|P|}, with |P| as the length of a path P ∈ Π_i(x), and the relative LoI is λ_i = Λ_i / Σ_{j ∈ ⟦n⟧} Λ_j.</p>
<p>This definition of the complexity of composition incorporates both the complexity of the cDAG 𝒟(x)
and the complexity of the span processor g : ℋ^a → ℋ in terms of its smoothness, with higher values
of L indicating a more complex (less smooth) g. The absolute LoI Λ_i incorporates the effect of longer
paths, with the effect growing with path length, and corresponds to the sensitivity of the compositional
function output to any one input token in the sequence.</p>
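      <p>Since Π_i(x) is just the set of source-to-sink paths in 𝒟(x), the LoI quantities of Definition 2 can be computed by direct path enumeration. The sketch below uses the same cDAG encoding as earlier and takes |P| to be the number of edges on a path; the function name and signature are, again, illustrative assumptions.</p>
      <preformat>
# Sketch of Definition 2: Lambda_i sums L**|P| over all paths P from the
# source v_{0:i} to any sink; lambda_i normalizes over all n sources.
def loci_of_influence(cdag, sinks, n, L):
    children = {}
    for node, parents in cdag.items():
        for p in parents:
            children.setdefault(p, []).append(node)
    sink_set = set(sinks)

    def weight(v):
        # Sum of L**(edge count) over all paths from node v to a sink.
        if v in sink_set:
            return 1.0
        return sum(L * weight(c) for c in children.get(v, []))

    abs_loi = [weight((0, i)) for i in range(1, n + 1)]
    total = sum(abs_loi)
    return abs_loi, [a / total for a in abs_loi]

# For Example 1's cDAG with L = 2: Lambda = [4, 4, 8, 8, 4], so the
# relative LoI of x_1 is 1/7 (below 1/5) while that of x_3 is 2/7
# (above 1/5), matching the discussion of Example 1 below.
      </preformat>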
<p>The smaller the absolute LoI Λ_i of an input index i, the more local its effect, and thus the more structure that
can be transferred between examples if x_i is replaced with something else. A relative LoI λ_i greater than
1/n denotes that the input index i (and thus input token x_i) has an out-sized effect on 𝒟(x) (and thus
the computation) compared to the other indices (tokens). In Example 1 (left), Λ₁ = L², λ₁ = L²/(2L³ + 3L²) &lt; 1/5
while Λ₃ = L³, λ₃ = L³/(2L³ + 3L²) &gt; 1/5, implying that x₃ has more influence (absolute and relative) on the function
than x₁ (assuming L &gt; 1). In Example 2 (left), Λ₁ = L⁴ + 2L³, λ₁ = (L + 2)/(27L + 39) ≈ 1/22 &lt; 1/7, while
Λ₅ = 7L⁴ + 9L³, λ₅ = (7L + 9)/(27L + 39) ≈ 1/4 &gt; 1/7, hence x₅ has a significantly larger influence than x₁. We
utilize the LoI to define the complexity of a compositional function, and a class of such compositional
functions:
Definition 3. A function f : 𝒳 → 𝒴 with components e, h, g, 𝒟 is (a, b, s, Λ, λ)-compositional if, for
any x ∈ 𝒳 of length n, the cDAG 𝒟(x) has an in-degree of a, maximum outgoing degree of b, and
s sink nodes, and ∀i ∈ ⟦n⟧, Λ_i ≤ Λ and λ_i ≤ λ ∈ [1/n, 1). We denote with ℱ a class of such
(a, b, s, Λ, λ)-compositional functions.</p>
<p>A small Λ and a λ close to 1/n signifies a function that possesses a high level of localism across all
input sequences and tokens in its domain. While this function has the most structure, it might not
be suitable for practical purposes. A high Λ and a λ close to 1/n signifies a very complex function
where there is a lot of interaction between all the input tokens in all input sequences, making it hard to
exploit any compositional structure in the function. A high Λ and a λ significantly higher than 1/n
indicates an interesting class of functions where some input tokens can have a high influence over
the function computation, but, for most tokens, there is a compositional structure in the function that
can be exploited. This intuitively seems to be an interesting and more practical class of compositional
functions, since assuming all tokens have an equal level of relative influence seems quite restrictive.</p>
      <p>[Figure 2 notes: complexities are simplified for the ease of exposition. †: Convolve+Pool induces input-dependent cDAGs for max/min-pool, not for avg/sum-pool. ‡: The number of sinks s needs to be specified for Convolve+Pool, and the model can handle arbitrary length input sequences.]</p>
<p>In Fig. 2, we re-express existing sequence processing models as per our definition, teasing out the
symbolic cDAG (and the neural g, h), and we present their (simplified) compositional complexity
for the ease of exposition. This highlights the flexibility and utility of our proposed quantification of
compositionality (see Ram et al. [10, Section 4] for more details and models). Beyond the properties
of the cDAG (the in-degree a, out-degree b and number of sink nodes s) and the upper bounds on
the absolute LoI Λ and relative LoI λ, we also highlight two properties: (i) Whether the model utilizes
(implicitly or explicitly) an input-dependent cDAG (that is, 𝒟(x) is not the same DAG for all x of length
n), and (ii) Whether the same model is able to operate on arbitrary length input sequences. The use of an
input-dependent cDAG has implications in terms of the expressivity of the model – it can be shown
that functions from a model class (with compositional complexities Λ, λ) with a fixed input-agnostic
cDAG cannot approximate functions from a model class of matching compositional complexity (that
is, same compositional complexity Λ, λ) that utilize input-dependent cDAGs. Ram et al. [10, Theorem
1] show that the approximation is lower and upper bounded by Ω(Λ) and O(Λ/λ) respectively. This
implies that a higher value of the absolute compositional complexity Λ and a smaller relative compositional
complexity λ adversely affect the approximation guarantees. The absolute compositional complexity Λ
has been shown to be directly tied to the generalization gap for a learned compositional function, with
higher Λ implying a worse systematic generalization guarantee [10, Theorem 2]. The ability to operate
on arbitrary length sequences is a prerequisite for a model to possess length generalization
or productivity – the ability to generalize to sequences of larger lengths than those seen during training.
We will pursue length generalization in our future work.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion</title>
<p>In this paper, we briefly present our novel definition of compositional functions, which explicitly separates
out the neural and symbolic aspects of a model for ease of analysis. We also present a notion of
compositional complexity that quantifies how intricately the tokens in an input sequence
are composed to produce the output. We briefly highlight the generality and utility of this definition by
demonstrating how existing sequence processing models fit into it.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Westerståhl</surname>
          </string-name>
          ,
<article-title>Compositionality I: Definitions and variants</article-title>
          ,
          <source>Philosophy Compass</source>
          <volume>5</volume>
          (
          <year>2010</year>
          )
          <fpage>250</fpage>
          -
          <lpage>264</lpage>
. URL: https://compass.onlinelibrary.wiley.com/doi/abs/10.1111/j.1747-9991.2009.00228.x.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          ,
<article-title>Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2873</fpage>
          -
          <lpage>2882</lpage>
          . URL: https://proceedings.mlr.press/v80/lake18a.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kim</surname>
          </string-name>
,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <article-title>COGS: A compositional generalization challenge based on semantic interpretation</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>9087</fpage>
          -
          <lpage>9105</lpage>
. URL: https://aclanthology.org/2020.emnlp-main.731.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hupkes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dankers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mul</surname>
          </string-name>
,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bruni</surname>
          </string-name>
          ,
          <article-title>Compositionality decomposed: how do neural networks generalise?</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>67</volume>
          (
          <year>2020</year>
          )
          <fpage>757</fpage>
          -
          <lpage>795</lpage>
. URL: https://jair.org/index.php/jair/article/view/11674.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Klinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Adjodah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Marois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Joseph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riemer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Pentland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
          <article-title>A study of compositional generalization in neural models</article-title>
,
          <source>arXiv preprint arXiv:2006.09437</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2006.09437.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hestness</surname>
          </string-name>
          ,
          <article-title>Compositional generalization for primitive substitutions</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4284</fpage>
          -
          <lpage>4293</lpage>
          . URL: https://aclanthology.org/D19-1438/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-G.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Compositional generalization by learning analytical expressions</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
). URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/83adc9225e4deb67d7ce42d58fe5157c-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solar-Lezama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tenenbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Lake</surname>
          </string-name>
          ,
          <article-title>Learning compositional rules via neural program synthesis</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>10832</fpage>
          -
          <lpage>10842</lpage>
. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/7a685d9edd95508471a9d3d6fcace432-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Klinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <article-title>How compositional is a model?</article-title>
          ,
          <source>in: International Joint Conference on Artificial Intelligence 2023 Workshop on Knowledge-Based Compositional Generalization</source>
          ,
          <year>2023</year>
. URL: https://openreview.net/forum?id=OImyRhNLv3.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Klinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <article-title>What makes Models Compositional? A Theoretical View: With Supplement</article-title>
          ,
          <source>arXiv preprint arXiv:2405.02350</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2405.02350.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
<article-title>LSTM can solve hard long time lag problems</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>9</volume>
          (
          <year>1996</year>
). URL: https://proceedings.neurips.cc/paper/1996/file/a4d2f0d23dcc84ce983f9157f8b7f88-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
). URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
<string-name>
            <given-names>M. K.</given-names>
            <surname>Sarker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eberhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <article-title>Neuro-symbolic artificial intelligence</article-title>
          ,
<source>AI Communications</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>197</fpage>
          -
          <lpage>209</lpage>
          . URL: https://arxiv.org/pdf/2105.05330.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A. d.</given-names>
            <surname>Garcez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Lamb</surname>
          </string-name>
,
          <string-name>
            <given-names>L.</given-names>
            <surname>de Penning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
<string-name>
            <given-names>G.</given-names>
            <surname>Zaverucha</surname>
          </string-name>
          ,
          <article-title>Neural-symbolic learning and reasoning: A survey and interpretation</article-title>
          ,
          <source>Neuro-Symbolic Artificial Intelligence: The State of the Art</source>
          <volume>342</volume>
          (
          <year>2022</year>
          )
<fpage>327</fpage>
          . URL: https://arxiv.org/pdf/1711.03902.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>