What makes Models Compositional? A Neuro-Symbolic Theoretical View (Extended Abstract)

Defining and Quantifying Compositionality

We define compositional functions $f : \mathcal{X} \to \mathcal{Y}$ with the domain $\mathcal{X}$ of input sequences $X = \{x_1, \ldots, x_L\}$ of atoms or tokens $x_i \in \mathcal{I}$ from an input dictionary $\mathcal{I}$. The range $\mathcal{Y}$ of $f$ can be $\mathbb{R}$ for regression, $\{0, 1\}$ for binary classification, or $\mathcal{I}$ for next token prediction.

Definition 1. To define $f$, we need the following components:

• Token encoder $e : \mathcal{I} \times \mathbb{N} \to \mathcal{H}$ (latent space), with $e_i = e(x_i, i) \in \mathcal{H}$ encoding the $i$-th token in $X \in \mathcal{X}$.

• A computation directed acyclic graph (DAG) or cDAG $D : \mathcal{X} \to \mathcal{D}$, where $\mathcal{D}$ is the space of DAGs, and $D(X)$ defines the hierarchical processing of a sequence $X$. $D(X)$ can also be viewed as the trace of the program used by function $f$ to process $X$. We will describe this in further detail soon.

• Span processor $g : \mathcal{H}^k \to \mathcal{H}$ maps $k$ terms in the latent space into a new term in the latent space.

• Read-out function $h : \mathcal{H}^m \to \mathcal{Y}$ maps the final set of terms in the latent space to the output space $\mathcal{Y}$.

With $g^{\otimes D(X)}$ denoting the recursive operation of $g$ over $D(X)$, we define a compositional function as:

$$f(X) = h\left( g^{\otimes D(X)}\left( e(x_1, 1), \ldots, e(x_L, L) \right) \right). \tag{1}$$

A computation DAG or cDAG $D(X) \triangleq \{N(X), E(X)\}$ for a specific input sequence $X \in \mathcal{X}$ can depend on $X$ or be pre-specified. This cDAG is a leveled DAG with a set of nodes $N(X)$ and edges $E(X)$. Each node $n \triangleq (l : i) \in N(X)$ has a level $l$ and index $i$. The recursive application of $g$ over $D(X)$ induces a value $v_{l:i} \in \mathcal{H}$ for each internal node $n \in N(X)$. The sources in $N(X)$ have level 0, and there is one source for each $x_i \in X$, $i \in \{1, \ldots, L\}$, with index $i$ and value $v_{0:i} = e(x_i, i) \in \mathcal{H}$. There are $m$ sinks in $N(X)$, and at most $k$ incoming edges and $q$ outgoing edges at any node. For an internal node $n \in N(X)$ with $k$ parents $P(n)$, the value is $v_{l:i} = g(v_{l_1:i_1}, \ldots, v_{l_k:i_k}) \in \mathcal{H}$, where $v_{l_j:i_j}$ is the value of the $j$-th parent in $P(n)$. One way to interpret this cDAG is as the trace of the forward pass during inference.
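The recursive evaluation described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not an implementation from the paper: the names `evaluate`, `parents`, and `sinks`, the toy choices for `e`, `g`, and `h`, and the five-token cDAG are all hypothetical and chosen only to make the example runnable.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Node = Tuple[int, int]  # (level, index), written l:i in the text


def evaluate(
    tokens: Sequence[float],
    parents: Dict[Node, List[Node]],
    sinks: List[Node],
    e: Callable[[float, int], float],
    g: Callable[..., float],
    h: Callable[..., float],
) -> float:
    """Compute f(X) = h(g applied recursively over the cDAG)."""
    # Sources sit at level 0: v_{0:i} = e(x_i, i), with 1-based index i.
    values: Dict[Node, float] = {
        (0, i): e(x, i) for i, x in enumerate(tokens, start=1)
    }

    def value(node: Node) -> float:
        # Internal node: apply g to the values of its parents (memoized).
        if node not in values:
            values[node] = g(*(value(p) for p in parents[node]))
        return values[node]

    # The read-out h sees only the sink values.
    return h(*(value(s) for s in sinks))


# Hypothetical cDAG over L = 5 tokens with k = 2, q = 1, m = 1.
# The edges (0, 5) -> (2, 1) and (1, 2) -> (3, 1) skip a level.
PARENTS: Dict[Node, List[Node]] = {
    (1, 1): [(0, 3), (0, 4)],
    (1, 2): [(0, 1), (0, 2)],
    (2, 1): [(1, 1), (0, 5)],
    (3, 1): [(1, 2), (2, 1)],
}

result = evaluate(
    tokens=[1.0, 2.0, 3.0, 4.0, 5.0],
    parents=PARENTS,
    sinks=[(3, 1)],
    e=lambda x, i: x,      # toy encoder that ignores the position
    g=lambda a, b: a + b,  # toy span processor of arity k = 2
    h=lambda v: v,         # identity read-out (latent space equals output space)
)
print(result)  # 15.0
```

Keeping the cDAG as plain data (`PARENTS`) separate from the callables mirrors the split between the symbolic structure and the learned components; swapping `g` for `max` changes the function computed without touching the graph.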

We consider the explicit cDAG because it allows us to see how the different elements $x_i$, $i \in \{1, \ldots, L\}$, of the input sequence $X$ are hierarchically composed to obtain the output. This will allow us to study the complexity of any compositional function. A "simple" cDAG, where all source nodes just connect to a single sink node, would be "applicable" to all functions, but it would not allow us to study them in an interesting manner. When we study the compositional functions induced by general purpose models (such as recurrent, convolutional or transformer models), we will see that some models have explicit cDAGs with more structure, while others have less structured explicit cDAGs but with implicit structures induced in the cDAG; whenever possible, we will explicitly state this implicit structure and study its properties. From a neuro-symbolic perspective [13, 14], this explicit cDAG can be seen as the symbolic part, while $e$, $g$, $h$ are the neural part; note that, in some models, this symbolic cDAG might be created with neural elements, while in others, the cDAG might be obtained with a symbolic grammar. This neuro-symbolic view offers a novel theoretical understanding of compositionality.

The span processor $g : \mathcal{H}^k \to \mathcal{H}$ takes as input $k$ elements from the latent space $\mathcal{H}$ and outputs an element in $\mathcal{H}$. While the definition implies that the same $g$ needs to be operated recursively over the cDAG $D(X)$, there is no restriction on the inputs and output of $g$ regarding the information encoded in the latent space. For example, if the level $l$ of any node $l{:}i$ is encoded into its value $v_{l:i}$, then $g$ will behave differently across levels (level-dependent); if the index $i$ of the node $l{:}i$ is encoded into its value, then $g$ will be sensitive to the positional information (order-dependent); if the value of a node includes the type of the node (for example, a non-terminal in a grammar), then $g$ can be type-dependent. Our definition states that the arity of the span processor $g : \mathcal{H}^k \to \mathcal{H}$ is $k$. We do so for ease of exposition, though our definition can incorporate more flexible span processors (see Ram et al. [10, Appendix A.2]).
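The level-dependent case can be made concrete with a small sketch. Assuming, purely for illustration, a latent value that carries the node level next to a scalar payload (the `Latent` type and the sum/max rule below are hypothetical and not taken from the paper), a single `g` applied everywhere still behaves differently at different levels:

```python
from typing import NamedTuple


class Latent(NamedTuple):
    """Hypothetical latent value: the node level plus a scalar payload."""
    level: int
    payload: float


def g(*args: Latent) -> Latent:
    """One span processor whose behavior depends on the encoded level."""
    # The output node sits one level above its highest-level parent.
    level = max(a.level for a in args) + 1
    payloads = [a.payload for a in args]
    # Same function at every node, but it reads the level from its inputs:
    # odd levels sum their inputs, even levels take the maximum.
    payload = sum(payloads) if level % 2 == 1 else max(payloads)
    return Latent(level, payload)


level1 = g(Latent(0, 1.0), Latent(0, 2.0))  # level 1: sum
level2 = g(level1, Latent(1, 7.0))          # level 2: max
print(level1, level2)
```

Order- or type-dependence could be sketched the same way, by adding an index or a type field to the hypothetical `Latent` value.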

The read-out function $h : \mathcal{H}^m \to \mathcal{Y}$ finally maps $m$ elements in the latent space to the output space $\mathcal{Y}$. This separation between $g$ and $h$ is necessary in our proposed definition because we require $g$ to be operable recursively, and thus $g$ can operate in a latent space $\mathcal{H}$ distinct from $\mathcal{Y}$. In some applications, $\mathcal{H} \supseteq \mathcal{Y}$, in which case $h$ can be an identity function. There are a couple of aspects of this read-out function that we wish to discuss explicitly: (i) We assume that $h$ is specifically non-compositional and processes its input without breaking it up into any sub-problems; we explicitly define the compositional function $f$ separating out $g$, $D$, $h$, where $g$ (neural) and $D$ (symbolic) represent the compositional part. (ii) We require $h$ to have a fixed arity of $m$ since $g$ and $D$ are aggregating the information over the input.

In the following, we illustrate Definition 1 with two examples. While Example 1 is a simple compositional function on a sequence, Example 2 is a more sophisticated one. This highlights that our proposed Definition 1 can handle functions that require more complex interactions between the tokens in a sequence. Example 1 has a cDAG with a maximum out-degree $q = 1$, implying a single path from any source to a sink. Example 2 has a cDAG with a maximum out-degree $q = 3$ across all levels in the DAG, implying that there can be a large number of paths to any sink from a source. This allows the definition to include functions where certain tokens in the sequence are of much higher importance to the output than others. These examples also highlight that edges in the cDAG are allowed to skip levels, and that the sinks can be from different levels, further highlighting the compositional flexibility.

We would like to remark on a couple of points here: (i) Through these examples, we show that our definition explicitly considers how the problem of sequence processing is broken up into sub-problems: the cDAG embodies how disjoint or intertwined these "sub-problems" are by explicitly considering the computation hierarchy. (ii) For input sequences $X, \bar{X}$ from the same problem domain, and the same compositional function $f$, we allow the cDAG to be different, that is, the cDAG $D(X)$ can be input-dependent, thereby allowing different input sequences to have different sub-problem hierarchies.

At a non-technical level, we also believe that our proposed Definition 1 connects intuitively to existing definitions: both Examples 1 and 2 can be seen as compositional functions, but Example 2 is clearly a more complex composition. In addition to its intuitive nature, our proposed definition allows us to understand how complex the compositionality is, beyond just stating whether a function is compositional.

The compositional complexity of $f$ depends on the functions $g$, $h$, $e$ as well as the cDAG function $D$ that drives the computation. For a sequence $X$ of length $L$, $D(X)$ has $L$ source nodes, a maximum in-degree of $k$ (controlling the span size for $g$), $m$ sink nodes (controlling the capacity of $h$), and a maximum out-degree of $q$ (quantifying the "localism" of the effect of a node). However, these do not explicitly incorporate the fact that changes to nodes at lower levels of the cDAG can have a larger effect on the output than changes to nodes at higher levels of the cDAG. We therefore propose a new quantification, the locus of influence (LoI), with an absolute LoI $\delta_i$ and a relative LoI $\beta_i$ for each input index $i$. This definition of the complexity of composition incorporates both the complexity of the cDAG $D(X)$ and the complexity of the span processor $g : \mathcal{H}^k \to \mathcal{H}$ in terms of its smoothness, with higher values of the smoothness parameter $c$ indicating a more complex (less smooth) $g$. The absolute LoI $\delta_i$ incorporates the effect of longer paths, with the effect growing with path length, and corresponds to the sensitivity of the compositional function output to any one input token in the sequence.
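The formal LoI definition is not reproduced in the text above, so the following sketch fixes one reading as an explicit assumption: the absolute LoI of index `i` is the sum of `c ** len(P)` over all distinct paths `P` from source `0:i` to any sink (with `len(P)` the number of edges), and the relative LoI is the absolute LoI divided by the sum over all indices. This reading matches the description of an effect that grows with path length, and it is consistent with the values quoted for Example 1 later in this section, but it should be checked against Ram et al. [10]. The cDAG below is hypothetical: it is chosen to reproduce those quoted values and is not claimed to be the paper's Example 1.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (level, index)


def path_lengths(source: Node, children: Dict[Node, List[Node]]) -> List[int]:
    """Edge counts of all distinct paths from `source` to any sink."""
    lengths: List[int] = []
    stack = [(source, 0)]
    while stack:
        node, depth = stack.pop()
        kids = children.get(node, [])
        if not kids:  # a node without outgoing edges is a sink
            lengths.append(depth)
        for kid in kids:
            stack.append((kid, depth + 1))
    return lengths


def locus_of_influence(
    parents: Dict[Node, List[Node]], num_tokens: int, c: float
) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Assumed LoI: delta_i = sum over paths of c**|P|, beta_i = delta_i / sum_j delta_j."""
    # Invert the parent lists to obtain outgoing edges.
    children: Dict[Node, List[Node]] = {}
    for node, node_parents in parents.items():
        for parent in node_parents:
            children.setdefault(parent, []).append(node)
    delta = {
        i: sum(c ** n for n in path_lengths((0, i), children))
        for i in range(1, num_tokens + 1)
    }
    total = sum(delta.values())
    beta = {i: d / total for i, d in delta.items()}
    return delta, beta


# Hypothetical cDAG over L = 5 tokens (k = 2, q = 1, m = 1).
PARENTS: Dict[Node, List[Node]] = {
    (1, 1): [(0, 3), (0, 4)],
    (1, 2): [(0, 1), (0, 2)],
    (2, 1): [(1, 1), (0, 5)],
    (3, 1): [(1, 2), (2, 1)],
}

delta, beta = locus_of_influence(PARENTS, num_tokens=5, c=2.0)
print(delta[1], delta[3])                      # 4.0 8.0, i.e. c**2 and c**3
print(round(beta[1], 4), round(beta[3], 4))    # 0.1429 0.2857, i.e. 1/(2c+3) and c/(2c+3)
```

Enumerating paths explicitly is exponential in the worst case; it is used here only because it follows the assumed definition literally on a tiny graph.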

The smaller the absolute LoI $\delta_i$ of any input index $i$, the more local its effect, and thus the more structure that can be transferred between examples if $x_i$ is replaced with something else. A relative LoI $\beta_i$ greater than $1/L$ denotes that the input index $i$ (and thus input token $x_i$) has an out-sized effect on $D(X)$ (and thus the computation) compared to the other indices (tokens).

In Example 1 (left), $\delta_1 = c^2$, $\beta_1 = \frac{1}{2c+3} < \frac{1}{5}$, while $\delta_3 = c^3$, $\beta_3 = \frac{c}{2c+3} > \frac{1}{5}$, implying that $x_3$ has more influence (absolute and relative) on the function than $x_1$ (assuming $c > 1$). In Example 2 (left), $\delta_1 = c^4 + 2c^3$, $\beta_1 = \frac{c+2}{27c+39} \approx \frac{1}{22} < \frac{1}{7}$, while $\delta_5 = 7c^4 + 9c^3$, $\beta_5 = \frac{7c+9}{27c+39} \approx \frac{1}{4} > \frac{1}{7}$; hence $x_5$ has a significantly larger influence than $x_1$. We utilize the LoI to define the complexity of a compositional function, and a class of such compositional functions:

Definition 3. A function $f : \mathcal{X} \to \mathcal{Y}$ with components $g$, $h$, $e$, $D$ is $(k, q, m, \delta, \beta)$-compositional if, for any $X \in \mathcal{X}$ of length $L$, the cDAG $D(X)$ has a maximum in-degree of $k$, a maximum out-degree of $q$, and $m$ sink nodes, and for all $i \in \{1, \ldots, L\}$, $\delta_i \le \delta$ and $\beta_i \le \beta \in [1/L, 1)$. We denote with $\mathcal{F}$ a class of such $(k, q, m, \delta, \beta)$-compositional functions.
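The LoI values quoted for Examples 1 and 2 can be cross-checked with a short calculation. The only assumption, labeled as such because the formal LoI definition is not reproduced here, is that the relative LoI is the absolute LoI normalized by the sum over all indices, $\beta_i = \delta_i / \sum_j \delta_j$; the sequence lengths $L = 5$ and $L = 7$ are read off from the $1/5$ and $1/7$ thresholds in the text.

```latex
\begin{align*}
\text{Example 1 } (L = 5):\quad
  & \sum_j \delta_j = \frac{\delta_1}{\beta_1} = c^2 (2c + 3), \\
  & \frac{\delta_3}{\sum_j \delta_j} = \frac{c^3}{c^2 (2c + 3)} = \frac{c}{2c + 3} = \beta_3, \\
  & \beta_1 < \tfrac{1}{5} \iff 2c + 3 > 5 \iff c > 1, \qquad
    \beta_3 > \tfrac{1}{5} \iff 5c > 2c + 3 \iff c > 1. \\[6pt]
\text{Example 2 } (L = 7):\quad
  & \sum_j \delta_j = \frac{\delta_1}{\beta_1}
    = \frac{c^3 (c + 2)(27c + 39)}{c + 2} = c^3 (27c + 39), \\
  & \frac{\delta_5}{\sum_j \delta_j} = \frac{c^3 (7c + 9)}{c^3 (27c + 39)}
    = \frac{7c + 9}{27c + 39} = \beta_5, \\
  & c = 1:\quad \beta_1 = \tfrac{3}{66} = \tfrac{1}{22}, \qquad
    \beta_5 = \tfrac{16}{66} = \tfrac{8}{33} \approx \tfrac{1}{4}.
\end{align*}
```

Under this assumption the two quoted pairs are mutually consistent, the condition $c > 1$ is exactly what the Example 1 inequalities require, and the approximations $1/22$ and $1/4$ in Example 2 appear to correspond to $c$ close to 1.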

A small $\delta$ and a $\beta$ close to $1/L$ signify a function that possesses a high level of localism across all input sequences and tokens in its domain. While this function has the most structure, it might not be suitable for practical purposes. A high $\delta$ and a $\beta$ close to $1/L$ signify a very complex function where there is a lot of interaction between all the input tokens in all input sequences, making it hard to exploit any compositional structure in the function. A high $\delta$ and a $\beta$ significantly higher than $1/L$ indicate an interesting class of functions where some input tokens can have a high influence over the function computation but, for most tokens, there is a compositional structure in the function that can be exploited. This intuitively seems to be an interesting and more practical class of compositional functions, since assuming that all tokens have an equal level of relative influence seems quite restrictive.

We next consider the cDAGs of various existing sequence processing models, such as unidirectional and bidirectional recurrent models, convolutional models, and attention-based transformer models. In Fig. 2, we re-express existing sequence processing models as per our definition, teasing out the symbolic cDAG (and the neural $g$, $h$), and we present their (simplified) compositional complexity in Table 1, assuming for ease of exposition that all model classes utilize span processors $g$ with the same smoothness parameter $c$. This highlights the flexibility and utility of our proposed quantification of compositionality (see Ram et al. [10, Section 4] for more details and models).
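To give a feel for how the cDAG structure alone changes the complexity, the sketch below compares two hypothetical cDAG constructions: a left-to-right chain in the spirit of a unidirectional recurrence, and a balanced binary tree. Both constructions, and the names `chain_cdag`, `tree_cdag`, and `abs_loi`, are illustrative assumptions and are not taken from Fig. 2 or Table 1; the paper's own re-expression of these models may differ (for example, in how an initial state is handled). The absolute LoI is again assumed to be the sum of `c ** len(P)` over all source-to-sink paths `P`.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (level, index)


def chain_cdag(L: int) -> Dict[Node, List[Node]]:
    """Chain sketch: state t = g(state t-1, token t), with a single sink."""
    parents: Dict[Node, List[Node]] = {}
    prev: Node = (0, 1)
    for t in range(2, L + 1):
        node = (t - 1, 1)
        parents[node] = [prev, (0, t)]
        prev = node
    return parents


def tree_cdag(L: int) -> Dict[Node, List[Node]]:
    """Balanced binary tree sketch: adjacent nodes are merged pairwise."""
    parents: Dict[Node, List[Node]] = {}
    frontier: List[Node] = [(0, i) for i in range(1, L + 1)]
    level = 0
    while len(frontier) > 1:
        level += 1
        merged: List[Node] = []
        for j in range(0, len(frontier), 2):
            node = (level, j // 2 + 1)
            parents[node] = frontier[j:j + 2]
            merged.append(node)
        frontier = merged
    return parents


def abs_loi(parents: Dict[Node, List[Node]], L: int, c: float) -> List[float]:
    """Assumed absolute LoI per token: sum over source-to-sink paths of c**|P|."""
    children: Dict[Node, List[Node]] = {}
    for node, node_parents in parents.items():
        for parent in node_parents:
            children.setdefault(parent, []).append(node)
    memo: Dict[Node, float] = {}

    def weight(node: Node) -> float:
        # A sink contributes 1; otherwise each outgoing edge multiplies by c.
        if node not in memo:
            kids = children.get(node, [])
            memo[node] = 1.0 if not kids else sum(c * weight(k) for k in kids)
        return memo[node]

    return [weight((0, i)) for i in range(1, L + 1)]


chain = abs_loi(chain_cdag(8), L=8, c=2.0)
tree = abs_loi(tree_cdag(8), L=8, c=2.0)
print(max(chain), max(tree))  # 128.0 8.0
```

With the same `c` and the same number of tokens, the chain lets the earliest tokens travel along paths of length `L - 1`, while the tree caps every path at `log2(L)` edges, so the bound on the absolute LoI differs by orders of magnitude even though both graphs have `k = 2`, `q = 1`, and `m = 1`.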
Beyond the properties of the cDAG (the in-degree $k$, out-degree $q$ and number of sink nodes $m$) and the upper bounds on the absolute LoI $\delta$ and relative LoI $\beta$, we also highlight two properties: (i) whether the model utilizes (implicitly or explicitly) an input-dependent cDAG (that is, $D(X)$ is not the same DAG for all $X$ of length $L$), and (ii) whether the same model is able to operate on arbitrary length input sequences.

The use of an input-dependent cDAG has implications for the expressivity of the model: it can be shown that functions from a model class (with compositional complexities $\delta$, $\beta$) with a fixed input-agnostic cDAG cannot approximate functions from a model class of matching compositional complexity (that is, the same $\delta$, $\beta$) that utilizes input-dependent cDAGs. Ram et al. [10, Theorem 1] show that the approximation error is lower and upper bounded by $\mathcal{O}(\delta)$ and $\mathcal{O}(\delta/\beta)$ respectively. This implies that a higher value of the absolute compositional complexity $\delta$ and a smaller relative compositional complexity $\beta$ adversely affect the approximation guarantees. The absolute compositional complexity $\delta$ has been shown to be directly tied to the generalization gap for a learned compositional function, with higher $\delta$ implying a worse systematic generalization guarantee [10, Theorem 2].

The ability to operate on arbitrary length sequences is a prerequisite for a model to possess length generalization or productivity, that is, the ability to generalize to sequences of larger lengths than those seen during training. We will be pursuing length generalization in our future work.