<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Encoding Temporal Statistical-space Priors via Augmented Representation under Data Scarcity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Insu Choi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Woosung Koh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gimin Kang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuntae Jang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Woo Chang Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Korea Advanced Institute of Science and Technology (KAIST)</institution>
          ,
          <addr-line>Daejeon</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Yonsei University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Modeling time series data is a fundamental challenge across various domains due to the intrinsic temporal dimension. Despite significant strides in time series forecasting, high noise-to-signal ratios, non-normality, non-stationarity, and a lack of data continue to challenge practitioners. To address these, we introduce a simple representation augmentation technique. Our augmented representation acts as a statistical-space prior encoded at each time step. Accordingly, we term our method Statistical-space Augmented Representation (SSAR). The underlying high-dimensional data-generating process inspires our representation augmentation. We rigorously examine the empirical generalization performance on two data sets with two downstream temporal learning algorithms. Our approach significantly beats all five up-to-date baselines. Furthermore, our approach's modular design facilitates easy adaptation to diverse settings. Lastly, we provide comprehensive theoretical insights throughout the paper to underpin our methodology with a clear and rigorous understanding.</p>
      </abstract>
      <kwd-group>
<kwd>Augmented Representation</kwd>
        <kwd>Spatio-temporal Learning</kwd>
        <kwd>Information Theory</kwd>
        <kwd>Time Series Forecasting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Time series forecasting is crucial across multiple domains such as finance [1], meteorology [2], and manufacturing [3]. Simple time series, ones that are less stochastic and depend on a tractable number of variables, do exist; research, however, primarily targets time series with complex, high-dimensional dependencies. Often, the true set of causal factors, $\mathcal{C}$, is intractable, i.e., unknown, or known but impractical to compute. On top of this, complex time series structures, $P(\mathbf{y} \in \mathcal{Y} \mid \mathbf{x} \in \mathcal{C})$, often exhibit non-stationarity, further challenging modeling.</p>
<p>Initially, methods like the vector autoregressive (VAR) model [4, 5] dominated multivariate forecasting. Extensions like the vector error correction model (VECM) [6] addressed some limitations of VAR models, but non-stationarity still posed challenges. Despite their widespread use, these statistical models have caveats, particularly vis-à-vis their underlying statistical property assumptions. Thus, any analysis using these models requires examination of these assumptions, especially the stationarity assumption, potentially requiring transformations to the data.</p>
<p>In response, neural network-based sequential models have become popular in the past decade. Their main advantage is that a universal function approximator flexibly captures high-dimensional non-linear dependency structures [7]. The most widely tested and verified for time series forecasting are Recurrent Neural Network (RNN) [8] architectures, with flagship examples being Long Short-Term Memory (LSTM) [9] and the Gated Recurrent Unit (GRU) [10]. Both LSTM and GRU are part of our baseline.</p>
<p>More recently, with the out-performance of attention mechanism-based models like transformers in other sequential tasks such as natural language processing (NLP) [11] and speech recognition [12], numerous transformer-based time series forecasting models have been developed. Our main contributions are as follows:
• We develop an easily reproducible augmented representation technique, SSAR, that targets modeling complex non-stationary time series.
• We clearly discuss the theoretical need for augmenting the input space and why it works well against baselines.
• We theoretically discuss the method's inspiration: the data-generating process of high-dimensional time series structures.
• To our knowledge, we are the first to leverage (asymmetric) information-theoretic measures in modeling the statistical-space.
• We show out-sample improvement, in both performance and stability, against up-to-date baselines: (i) LSTM, (ii) GRU, (iii) Linear, (iv) NLinear, (v) DLinear.
• We test out-sample empirical results on two data sets and two downstream temporal graph learning algorithms.
• We present a theoretically unified view with related work, suggesting that SSAR implicitly smooths stochastic data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
<p>Our work is related to temporal graph learning algorithms, as our approach transforms a vector-based time series representation into a graph-based one. A downstream graph learning algorithm is then applied to make predictions. Fundamentally, multi-layer perceptrons (MLPs) are incompatible with graph representations. However, graphs naturally represent various real-world phenomena [19]. E.g., social networks [20], chemical molecules [21], and traffic systems [22] inherently possess graphical structures. Graph Neural Networks (GNNs) bridge this gap, enabling learning directly from graphical structures. Contrary to works with predefined edge sets $\mathcal{E}$, we derive $\mathcal{E}$ from historical vertex values $V$. The closest past work is [23], which generates a Pearson correlation-based $\mathcal{E}$ from $V$. However, their $\mathcal{E}$ specifically proxies inter-company relations, tailored to their domain. Additionally, their edges $e \in \mathcal{E}$ are undirected and symmetric. In contrast, our approach (i) is domain-agnostic, (ii) employs a simple representation augmentation to surpass the state-of-the-art, (iii) is modular with broad algorithm compatibility, (iv) incorporates directed asymmetric measures for $\mathcal{E}$, and (v) emphasizes theoretical analysis of the augmentation mechanism.</p>
    </sec>
    <sec id="sec-2a">
      <title>3. Preliminary: Complex Time Series</title>
      <p>Modeling complex time series via neural networks presents three key challenges: (i) incomplete modeling, (ii) non-stationarity, and (iii) limited access to the data-generating process.</p>
<p>Let $P(\mathbf{y}_t \mid \mathbf{x}_{t-\tau})$ be the true probability structure we want to learn. Here, $\mathbf{y}$ is defined by the modeler as the variables of interest. Unlike $\mathbf{y}$, the causal input set $\mathcal{C}$ is intractable for complex problems as (i) it is too large to be computed realistically, but more pressingly (ii) it is unknown a priori. Therefore, we typically use heuristics or empirical evidence to identify an approximation $\hat{\mathcal{C}}$. Since we are forecasting, we use lagged values, with $\tau$ indicating the temporal magnitude of the most lagged value. Then, with a learner parameterized by $\theta$, via maximum likelihood estimation we train for $\hat{P}_\theta(\mathbf{y}_t \mid \hat{\mathbf{x}}_{t-\tau})$ where $\hat{P}_\theta(\mathbf{y}_t \mid \hat{\mathbf{x}}_{t-\tau}) \approx P(\mathbf{y}_t \mid \hat{\mathbf{x}}_{t-\tau})$. Often, due to $\mathcal{C}$'s intractability, in vector form we set $\hat{\mathbf{x}}[t,:] := \mathbf{y}[t,:]$, using the output-space's lagged values as the input-space. We use this heuristic in our study and explain why it is a reasonable assumption in the Appendix. Since $\hat{\mathbf{x}}$ is a tractable approximation to the true input-space, we face the partial observation and incomplete modeling problem. This underlies much of the stochasticity and poor performance in forecasting high-dimensional structures. For domains that aggregate information on the global level, like financial and climate time series, it is fair to assume that $|\mathcal{C}| \to \infty$, dramatically raising the difficulty.</p>
<p>On top of this, we have a second, more pervasive challenge: non-stationarity. Non-stationarity is defined as $P_t(\mathbf{y}|\mathbf{x}) \neq P_{t'}(\mathbf{y}|\mathbf{x})$ where $t \neq t'$. The cause of non-stationarity could be partial observability. Figure 1's left diagram summarizes this problem. Note that the distributions are 1-dimensional for a simplified visual depiction. This poses a significant challenge to neural-network-based approximators $\hat{P}_\theta(\mathbf{y}|\mathbf{x})$, as MLPs, the building block, inherently work on stationary data sets.</p>
<p>The final challenge involves neural-network-based function approximators $f_\theta: \mathcal{X} \mapsto \mathcal{Y}$. The cost of a highly flexible function approximator $f_\theta$ is the large parameter count $|\theta|$. Consequently, as $|\theta|$ rises, the size of the data set $|\mathcal{D}|$ should also rise, allowing $f_\theta$ to generalize better out-sample, i.e., to better approximate $P(\mathbf{y}|\mathbf{x})$. Ideally, $\partial|\mathcal{D}|/\partial|\theta| > 0$, but raising $|\mathcal{D}|$ arbitrarily is often intractable for complex time series. There are cases where reasonable simulators exist for the data-generating process $P(\mathbf{y}|\mathbf{x})$, especially when $\mathcal{C}$ is tractable and the transition function is well approximated by rules. A representative example is physics simulators in the robotics field [24], where the simulator models the real world, $\tilde{P}(\mathbf{y}|\mathbf{x}) \approx P(\mathbf{y}|\mathbf{x})$. Correspondingly, we would require a world simulator for complex time series with an intractably high-dimensional data-generating process. Since we have no world simulator, raising $|\mathcal{D}|$ requires time to pass. Therefore, we are restricted to a finite, lacking $\mathcal{D}$.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <sec id="sec-3-1">
        <title>4.1. Statistical-space Augmented Representation</title>
<p>In response to these three challenges, we apply our method, SSAR. We rigorously examine how SSAR overcomes each challenge in Section 4.3. A high-level overview of SSAR involves: (i) selecting a statistical measure, (ii) computing this measure $f(\mathbf{y}|\mathbf{x}, t, w)$ for each time $t$ with sliding window $w$, and (iii) generating a graph $G_t$ where vertices $v \in V$ represent variables at $t$, and the weights of edges $W(e)$, where edges $e \in \mathcal{E}$, represent $f(\mathbf{y}|\mathbf{x}, t, w)$. Then, with the spatio-temporal graph $\mathcal{G} := \bigcup_t G_t$, any temporal graph learning algorithm that makes temporal node predictions can be applied.</p>
<p>As seen in Figure 1, right, $\mathrm{SSAR}: \mathcal{D} \mapsto \mathcal{G}$, where the original time series data $\mathcal{D}$ is in vector form, $\mathbf{d} \in \mathcal{D}$. The per-time-step functional view would be $G_t \leftarrow \mathrm{SSAR}(\mathbf{d}[t-w:] \in \mathcal{D})$. Algorithm 1 details the pseudo-code for $\mathrm{SSAR}(\cdot)$. For every $t$, $\mathbf{d}$ is transformed into a weighted, directed graph $G_t = \langle V, \mathcal{E}, W_t \rangle$, where $V$ is the set of nodes with $|V| = N$, $\mathcal{E}$ is the set of directed edges with $|\mathcal{E}| = N^2 - N$, and $W_t \in \mathbb{R}^{N \times N}$ is the weighted adjacency matrix. Here, each node $v \in V$ represents a variable (scalar) in $\hat{\mathbf{x}}$ and $\mathbf{y}$. Each $e \in \mathcal{E}$ is a 2-tuple denoted $\langle v_i, v_j \rangle$, $i \neq j$, with each tuple corresponding to a permutation pair of nodes. $G_t$'s $|\mathcal{E}| = N^2 - N$ because each permutation pair corresponds to a single directed edge and nodes cannot direct to themselves; i.e., $G_t$ is irreflexive. Given that the size of $W_t$ is computed excluding the diagonal elements, with each weight $\geq 0 \in \mathbb{R}$, $W_t$ is equivalent in size to $\mathcal{E}$, as each $e$ maps to a single weight; i.e., $W: \mathcal{E} \mapsto \mathbb{R}_{\geq 0}$. Here, $W(v_i \to v_j) \leftarrow f(\mathbf{y}_j \mid \mathbf{x}_i, t, w)$. An intuitive visualization is available in Figure 2, and a minimal construction sketch follows.</p>
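        <p>The following is a minimal construction sketch of one SSAR graph, assuming Pearson correlation as the measure; the helper name ssar_graph and the use of scipy.stats are our illustration, not the authors' released implementation.</p>
        <preformat>
# Minimal sketch: map a (w x N) window of N variables to the weighted,
# directed adjacency matrix W_t of an irreflexive graph G_t.
import numpy as np
from scipy import stats

def ssar_graph(d_window: np.ndarray) -> np.ndarray:
    w, n = d_window.shape
    W_t = np.zeros((n, n))
    for i in range(n):              # source node v_i
        for j in range(n):          # target node v_j
            if i == j:
                continue            # nodes cannot direct to themselves
            r, _ = stats.pearsonr(d_window[:, i], d_window[:, j])
            W_t[i, j] = abs(r)      # symmetric measures are mapped to |f|
    return W_t

# One graph per time step t over a sliding window of size w:
# G = [ssar_graph(d[t - w:t]) for t in range(w, len(d))]
        </preformat>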
      </sec>
      <sec id="sec-3-1a">
        <title>4.2. Data-generating Process Meta-physics</title>
        <p>SSAR is inspired by the meta-physics of the data-generating process of complex time series. The data-generating process refers to $P(\cdot)$. Access to $P(\cdot)$ allows for sampling data $\mathcal{D} \sim P(\cdot)$ and approximating $\hat{P}(\cdot)$ through maximum likelihood estimation based on $\mathcal{D}$. On a different note, this abstracted discussion aims to shed light on how a true $P(\cdot)$ arises in the real world; i.e., it hypothesizes on the mechanisms underlying $P(\cdot)$, then describes how they inspire our approach.</p>
        <p>Consider complex time series as described in the preliminary section.</p>
        <p>Definition 4.1. A complex time series, causal in nature, is defined as $P(\mathbf{y}_t \mid \mathbf{x}_{t-\tau} \in \mathcal{C})$ where $\mathcal{C}$ is intractable, i.e., $|\mathcal{C}| \to \infty$.</p>
        <p>Similar to the theoretical nature of $P(\cdot)$, the concept of $\mathcal{C}$ is also theoretical, given that its variables are human-defined. This implies that an arbitrary degree of granularity may describe $\mathcal{C}$; i.e., $|\mathcal{C}|$ can be raised arbitrarily until we reach the smallest units of the physical world. For instance, a high-level event like COVID-19 is an example of $c \in \mathcal{C}$, which can be further broken down into granular events like patient zero's contraction of the virus and so forth. Given $P(\mathbf{y}_t \mid \mathbf{x}_{t-\tau} \in \mathcal{C})$, consider $\mathcal{C}_m, \mathcal{C}' \subset \mathcal{C}$, where the former is digitally measured by humans in time series format and the latter comprises the remaining elements: $\mathcal{C}_m \cup \mathcal{C}' \equiv \mathcal{C}$ and $\mathcal{C}_m \cap \mathcal{C}' = \emptyset$. In the case of learning algorithms that require numerical input and output spaces, naturally, $\mathbf{y} \subseteq \mathcal{C}_m$ and $\hat{\mathbf{x}} \subseteq \mathcal{C}_m$. Define any information transfer within $\mathcal{C}_m$ as endogenous and any within $\mathcal{C}'$ as exogenous to the system. As not every real-world physical change is digitally tracked, each endogenous change has its roots in some exogenous change. With this backdrop, the set of numerical variables digitally available to us is a system that absorbs an arbitrary amount of exogenous shocks at every $t$.</p>
        <p>Let $c' \in \mathcal{C}'$ and $c \in \mathcal{C}_m$. Then, a simplified view of the data-generating process can be visualized in Figure 3. Each node at the top of the diagram represents $c' \in \mathcal{C}'$, while each node at the bottom represents $c \in \mathcal{C}_m$. Within the diagram, $|\mathcal{C}'| \to \infty$ is indicated via "...". Blue and purple edges show causal chains in the real physical world. Each dotted edge represents an exogenous shock to the endogenous system. Non-dotted green and red edges at each time step represent $P(\mathbf{y}_t|\mathbf{x}_{t-1})$. However, since there exists an unknown $P(\mathbf{x}_{t-1}|\mathbf{x}'_{t-2})$,
$$P(\mathbf{y}_t|\mathbf{x}_{t-1}) = P(\mathbf{y}_t \mid P(\mathbf{x}_{t-1}|\mathbf{x}'_{t-2})). \quad (1)$$
Under this view, all complex time series are inherently non-stationary and, consequently, incompatible with models assuming stationarity. Consequently, for models that require stationary data, we require some tractable function $g(\cdot)$ such that
$$g(\cdot) \approx P(\mathbf{x}_{t-1}|\mathbf{x}'_{t-2}). \quad (2)$$</p>
        <p>The next section draws inspiration from the inherently directed graphical nature of the data-generating process, as illustrated in Figure 3, to theoretically unpack our method.</p>
      </sec>
      <sec id="sec-3-1b">
        <title>4.3. Prior Encoding: Theoretical View</title>
        <p>By the universal approximation theorem [25, 26], any stationary mapping can be approximated by neural networks. MLPs and their subsequent architectural innovations implicitly model high-dimensional statistical spaces:</p>
        <p>$$O_0 := \sigma(W_0 X + b_0), \quad O_1 := \sigma(W_1 O_0 + b_1), \quad O_2 := \phi(W_2 O_1 + b_2), \quad (3)$$
where $\theta = \{\bigcup_i W_i, \bigcup_i b_i\}$, $X$ is the input tensor, and $\sigma, \phi$ are non-linear activations. Given that neural networks are directed graphs, the explicit representation by SSAR (Figures 1 and 2) can be implicitly captured by (3). Despite this, we opt for an explicit representation encoded as a Bayesian prior $P(\theta)$. Under the Bayesian view of learning from data,
$$P(\theta \mid \mathcal{D}) := \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}. \quad (4)$$</p>
        <p>This inductive bias, if accurate, can be helpful for generalization performance when $|\mathcal{D}| \ll \infty$. As noted earlier, complex time series feature a finite $\mathcal{D}$, making it challenging to increase its size.</p>
        <p>Our prior encoding at every $t$, as visualized in Figure 1, left, aids learning by overcoming non-stationarity. Since we are learning the distribution $\hat{P}_\theta(\mathbf{y}_t|\hat{\mathbf{x}}_{t-\tau})$, a natural approach to capture the non-stationarity would be to add a second parameter, a regime vector $\mathbf{r}$, resulting in learning $\hat{P}_\theta(\mathbf{y}_t \mid \hat{\mathbf{x}}_{t-\tau}, \mathbf{r} \leftarrow f_\phi(\mathbf{y}_t|\hat{\mathbf{x}}_{t-\tau}))$. This involves learning $f_\phi(\cdot)$. In this case,
$$\hat{P}_{\theta'}(\mathbf{y}_t \mid \hat{\mathbf{x}}_{t-\tau},\ \mathbf{r} \leftarrow f_\phi(\mathbf{y}_t|\hat{\mathbf{x}}_{t-\tau})), \quad (5)$$
$$\therefore\ \theta' := \theta \cup \phi\ \Rightarrow\ |\theta'| > |\theta|\ \because\ \phi \neq \emptyset. \quad (6)$$</p>
        <p>Given the small size of $\mathcal{D}$ relative to $\theta$, increasing degrees of freedom without further sampling $\mathcal{D} \sim P(\cdot)$ is not ideal.</p>
        <p>An ideal alternative is letting a statistical-space relationship at $t$ proxy for $f_\phi(\cdot)$, i.e., $f(\mathbf{y} \in \mathcal{Y} \mid \hat{\mathbf{x}} \in \hat{\mathcal{C}}_{t-\tau}, t, w) \approx P(\mathbf{y} \in \mathcal{Y} \mid \hat{\mathbf{x}} \in \hat{\mathcal{C}}_{t-\tau})$. But, like $P(\mathbf{y}|\hat{\mathbf{x}})$, $f(\mathbf{y}|\hat{\mathbf{x}}, t, w)$ is unknown a priori. In this case, like $f_\phi(\cdot)$, we would require a learned approximation $f_{\phi'}(\cdot)$, raising the size of the aggregate parameters.</p>
        <p>A reasonable and tractable approximation known a priori that does not raise the parameter count is
$$f(\mathbf{y}_{t-1}|\hat{\mathbf{x}}, t-1, w) \approx f(\mathbf{y}_t|\hat{\mathbf{x}}, t, w) \approx P(\mathbf{y}_t|\hat{\mathbf{x}}). \quad (7)$$
Assuming sufficient granularity in time steps $t$, $f(\mathbf{y}_{t-1}|\hat{\mathbf{x}}, t-1, w)$ closely approximates $f(\mathbf{y}_t|\hat{\mathbf{x}}, t, w)$. We hypothesize that the trade-off between parameter count and approximation via $t-1$ is advantageous to the learning system.</p>
        <p>Despite identifying a feasible regime-changing approximator, another problem remains. Representing and passing $f(\mathbf{y}_{t-1}|\hat{\mathbf{x}}, t-1, w)$ via Euclidean geometry significantly reduces the spatial information inherent to it. A natural representation is graphical, like Figure 3. Therefore, we approximate (8) with (9) via (10), (11), and (12). This transformation, which augments the representation, theoretically encapsulates SSAR:
$$\hat{P}_\theta(\mathbf{y} \mid \hat{\mathbf{x}},\ \mathbf{r} \leftarrow P(\mathbf{y}|\hat{\mathbf{x}})) \quad (8)$$
$$\approx\ \hat{P}_\theta(\mathbf{v}_t \in V \mid \mathbf{v}_{t-\tau} \in V,\ \mathbf{e}_{t-\tau} \in \mathcal{E}), \quad (9)$$
$$\mathbf{v}_t := \mathbf{y}_t, \quad (10)$$
$$\mathbf{v}_{t-\tau} := \hat{\mathbf{x}}_{t-\tau}, \quad (11)$$
$$\mathbf{e}_{t-\tau} \approx \hat{\mathbf{r}} \approx \mathbf{r},\ \text{where}\ \mathbf{e}_{t-\tau} \leftarrow f(\mathbf{y}_{t-1}|\hat{\mathbf{x}}, t-1, w). \quad (12)$$</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.4. Statistical-space Measures</title>
        <p>Six measures are used to compute $f(\mathbf{y}_{t-1}|\hat{\mathbf{x}}, t-1, w)$. The set of measures and corresponding abbreviations is $\mathcal{M} :=$ {Pearson correlation: Pearson, Spearman rank correlation: Spearman, Kendall rank correlation: Kendall, Granger causality: GC, Mutual information: MI, Transfer entropy: TE}. This set divides into correlation-based $\mathcal{M}_{corr}$ and causal-based $\mathcal{M}_{caus}$ measures, which are symmetric and asymmetric, respectively: $\mathcal{M}_{corr} :=$ {Pearson, Spearman, Kendall} $\subset \mathcal{M}$, $\mathcal{M}_{caus} :=$ {GC, MI, TE} $\subset \mathcal{M}$, $\mathcal{M}_{corr} \cup \mathcal{M}_{caus} \equiv \mathcal{M}$, and $\mathcal{M}_{corr} \cap \mathcal{M}_{caus} = \emptyset$. A symmetric measure satisfies $f(v_j|v_i) = f(v_i|v_j)$ for all $\langle v_i, v_j \rangle \in \mathcal{E}$, $i \neq j$; an asymmetric measure is one where $f(v_j|v_i) \neq f(v_i|v_j)$. The asymmetric case is most appropriate for our use case, as it uses only lagged values, making the measures a proxy for causal effects. Embedding $f(\cdot)$ from $\mathcal{M}_{caus}$ as weights is more natural, as $f_{caus}: V \times V \mapsto \mathbb{R}_{\geq 0}$. On the other hand, $f_{corr}: V \times V \mapsto [-1, 1]$; therefore, we let $W \leftarrow |f_{corr}|$. We empirically test all six.</p>
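        <p>The following is a hedged sketch of the weight mapping for the correlation-based family, assuming scipy.stats implementations; GC, MI, and TE, which are already non-negative, are sketched in the Appendix. The function name edge_weight is ours.</p>
        <preformat>
# Correlation-based measures land in [-1, 1], so we take W <- |f|.
import numpy as np
from scipy import stats

MEASURES = {"pearson": stats.pearsonr,
            "spearman": stats.spearmanr,
            "kendall": stats.kendalltau}

def edge_weight(x_lagged: np.ndarray, y: np.ndarray, measure: str) -> float:
    f_val = MEASURES[measure](x_lagged, y)[0]  # statistic; p-value ignored
    return abs(f_val)                          # non-negative weight in [0, 1]
        </preformat>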
        <p>The hyperparameter $w$ is inherent to SSAR, as it is required to compute $f(\cdot)$. An additional hyperparameter, $k$, exists for all downstream algorithms. The scalar $k$ represents the number of previous time steps fed into the model; in our case, $k$ is the number of historic graphs available at each $t$. Attaching SSAR to a downstream algorithm therefore involves two sliding windows: $w$ and $k$. An intuitive visualization is provided in Figure 4. The computational details for every $f(\cdot) \in \mathcal{M}$ are available in the Appendix.</p>
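        <p>The following is a hedged sketch of how the two sliding windows compose into one training sample; make_sample and measure_fn are hypothetical names, not the authors' interface.</p>
        <preformat>
# k historic graphs, each computed from its own trailing window of size w,
# plus the lagged node values, predict the node values at t.
import numpy as np

def make_sample(d: np.ndarray, t: int, w: int, k: int, measure_fn):
    graphs = [measure_fn(d[s - w:s]) for s in range(t - k, t)]  # k adjacency matrices
    node_feats = d[t - k:t]                                     # k lagged value vectors
    target = d[t]                                               # one-step-ahead target
    return np.stack(graphs), node_feats, target
        </preformat>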
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Empirical Study</title>
      <sec id="sec-4-0">
        <title>5.1. Data</title>
        <p>To empirically test SSAR, we identify representative data sets that fit the definition of complex time series. We chose financial time series, known for their high stochasticity, non-normality, and non-stationarity [27, 28, 29]. Consequently, we sourced two data sets: (i) inter-category and (ii) intra-category variables. Inter- and intra-category data sets exhaustively represent most financial time series. Henceforth, we refer to these data sets as Data Set 1 and 2, respectively. Sourced based on the largest international trading volumes, both data sets serve as representative benchmarks applicable to practitioners. The data sourcing and processing methods are detailed in the Appendix. Notably, extensive preliminary statistical tests, detailed in the Appendix, validate the time series' complexity.</p>
      </sec>
      <sec id="sec-4-1">
        <title>5.2. Experiment Setting</title>
        <p>We first apply SSAR to each data set. To examine the sensitivity to the hyperparameter $w$, we apply SSAR for every $w \in \mathbf{w} := \{20, 30, 40, 50, 60, 70, 80\}$. A minimum $w$ of 20 ensures stability in the information-theoretic measures. Data sets are split into training, validation, and test sets: $0.5 \times 0.7$, $0.5 \times 0.3$, and $0.5$, respectively, for Data Set 1, and $0.6$, $0.2$, and $0.2$, respectively, for Data Set 2. These splits simulate potential real-world scenarios, as sketched below.</p>
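        <p>For concreteness, the following is an illustrative reading of the Data Set 1 split, assuming simple temporally ordered slicing; the placeholder series is ours.</p>
        <preformat>
# 50% of the data forms train+validation (split 70/30); the last 50% is test.
import numpy as np

data = np.arange(1000)                   # placeholder series with T = 1000 steps
train_end = int(0.5 * 0.7 * len(data))   # 0.5 x 0.7 of T
val_end = int(0.5 * len(data))           # 0.5 x 0.3 of T follows the training block
train, val, test = data[:train_end], data[train_end:val_end], data[val_end:]
        </preformat>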
        <p>Five established baselines are included: (i) GRU, (ii) LSTM, (iii) Linear, (iv) NLinear, and (v) DLinear, where (iii), (iv), and (v) have been shown to outperform state-of-the-art transformer-based architectures. The $k$ for baselines corresponds to the temporal dimension size of the input vector. Next, to test the augmented representation, we select two well-known spatio-temporal GNNs: (i) [30]'s Temporal Graph Diffusion Convolution Network (diffusion t-GCN) and (ii) [31]'s Temporal Graph Convolution Network (t-GCN). Notably, SSAR works with any downstream model that supports spatio-temporal data with directed edges and dynamic weights; the number of compatible downstream models is very large. We arbitrarily let diffusion t-GCN be the downstream model for Data Set 1, and t-GCN for Data Set 2.</p>
        <p>For ease of replication, we present the tensor operations of diffusion t-GCN for our representation in the Appendix. We do not diverge from the original method proposed by the authors for either downstream model. All experimental design choices, such as splits, downstream models, and sample sizes, were chosen a priori and were not changed after inference. Also, each empirical sample is independently trained from a random seed; i.e., no two test samples result from inference of the same model $\hat{\theta}$.</p>
        <p>The objective function $\mathcal{L}$ is the mean squared error (MSE) of the prediction of $\mathbf{y}_t$ given $[t-1 : t-k]$. For a fair empirical study, we systematically tune hyperparameters $h \in \mathcal{H}$ for every $\langle w$, method, Data Set$\rangle$ on the training and validation sets. Rigorous details of the training, validation, and inference process are provided in the Appendix.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.3. Results and Ablation</title>
        <p>We observe highly encouraging results, summarized in Figure 5 and Table 1. In Table 1, each column represents a method, and each row represents the $w$. Sample sizes are one for Data Set 1 and 50 for Data Set 2 for each ⟨method, $w$⟩ pair. Note that the sample size for the Constant column does not conform to this pattern, as Constant-weighted edges are not associated with a $w$. However, to match the sample size of each approach, the Constant column presents the 7-sample and 50-sample mean ± 1σ results in Data Sets 1 and 2, respectively.</p>
        <p>The approaches are divided into (i) SSAR (ours), (ii) baselines, and (iii) the ablation. The ablation, Constant, sets edge weights to a constant in place of a statistical measure. This setup assesses the utility of graphical structures independent of statistical measures. In Data Set 1, SSAR achieved the best results for every $w$. Notably, there is a significant improvement from baselines → ablation, and another significant improvement from ablation → SSAR. Moreover, across the 42-sample results for all six SSAR approaches and $w$ values, all 35 samples of the baselines are beaten, a 100% beat rate.</p>
        <p>For Data Set 2, each 50-sample ⟨method, $w$⟩ combination enables the box-and-whisker plot analysis in Figure 5. Each box-and-whisker aggregates across $w$, i.e., each represents 7 · 50 = 350 samples. We observe a dramatic improvement in accuracy across SSAR-based approaches. The box-and-whisker plot follows the standard minimum, quartile-1, median, quartile-3, and maximum values. The x-axis is intentionally not scaled to include Linear, NLinear, and DLinear outliers; scaling would significantly reduce legibility. An enlarged version of Figure 5 is in the Appendix.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Discussion</title>
      <sec id="sec-5-1">
        <title>6.1. Statistical Analysis</title>
        <p>In aggregate, 7 · 12 (row · column) = 84 random-seed out-sample results are available for Data Set 1, and 7 · 11 · 50 (row · column · sample size) = 3850 results are available for SSARs and baselines for Data Set 2. An additional 50 samples for the ablation lead to 3900 result samples for Data Set 2.</p>
        <p>The statistical analysis is highly encouraging. First, we examine in aggregate whether the mean of SSARs beats the aggregate mean of the baselines. Data Set 1's results are 0.7141 ± 0.0253 (42 samples) and 0.8346 ± 0.0179 (35 samples) for SSARs and baselines, respectively. The T-statistic is -23.9022 (P-val → 0). Data Set 2's results are 0.8652 ± 0.0022 (2100 samples) and 1.2740 ± 1.9097 (1750 samples) for SSARs and baselines, respectively. The T-statistic is -9.8117 (P-val → 0). Data Set 2's progression from baselines → Constant → SSARs is 1.2740 ± 1.9097 → 0.8671 ± 0.0027 → 0.8652 ± 0.0022.</p>
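        <p>The following sketches the aggregate comparison as a two-sample t-test; the arrays are illustrative placeholders, and the use of Welch's variant is our assumption.</p>
        <preformat>
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ssar_mse = rng.normal(0.865, 0.002, 2100)      # stand-in for SSAR test MSEs
baseline_mse = rng.normal(1.274, 1.910, 1750)  # stand-in for baseline test MSEs
t_stat, p_val = stats.ttest_ind(ssar_mse, baseline_mse, equal_var=False)
        </preformat>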
        <p>This corresponds to a 31.94% reduction in MSE from baselines to Constant and a 0.22% reduction from Constant to SSARs. From baselines to SSARs, a 32.09% reduction is observed. An additional study on larger $w$ values, detailed in the Appendix, shows that the statistical significance remains robust.</p>
        <p>The second study focuses on adverse outliers in the state-of-the-art methods (Linear, NLinear, DLinear). For robustness, we re-examine the statistical results after excluding these models' adverse outliers. The results are detailed in the Appendix, and the statistical findings remain unchanged. This observation of significant adverse outliers bodes poorly for the baselines and contrarily emphasizes the stability of our proposed approach. Examining the F-test on baselines and SSARs, we observe an F-statistic of 764,534 and a corresponding one-tail F-critical value of 1.08 (P-val → 0). The evidence indicates a significant fall in the variance of SSARs.</p>
        <p>Finally, we discuss the implications of setting $k$. A naïve interpretation might attribute SSAR's improved performance to a larger implicit $k$ (based on Figure 4), but this is contradicted by the lack of a significant relationship between $k$ and MSE (Figure 6). Moreover, if this were true, $\partial\,\mathrm{MSE}/\partial k < 0$ would hold. On the contrary, there seems to be no meaningful relationship between $k$ and MSE for the baselines either. We present two histograms that summarize $P(\mathrm{MSE} \mid k,\ \mathrm{Baseline} \vee \mathrm{SSAR})$, where $\mathrm{Baseline} \vee \mathrm{SSAR}$ denotes a boolean, with some abuse of notation (true: Baseline, false: SSAR).</p>
      </sec>
      <sec id="sec-5-2">
        <title>6.2. Theoretical Implications</title>
        <p>Initially, the performance improvement in the Constant ablation case appears surprising. Based on the theoretical discussion provided by [32], we show that SSAR is not only helpful in modeling the shifting underlying distribution but also implicitly smooths highly stochastic data. These effects are visually summarized in Figure 7. [32] shows that when the causal structure is very high-dimensional and therefore highly stochastic, augmenting the training data via smoothing techniques is helpful when the noise-to-signal ratio is high. The authors use exponential moving averages to smooth the input and target space. We show that SSAR paired with a temporal graph learning algorithm implicitly makes the same augmentations, explaining the improved performance in the Constant ablation case.</p>
        <p>Temporal weighted graph learning algorithms for node prediction aggregate neighbouring weights and nodes for each node. Afterwards, this new encoding is fed into some neural network with a sequential encoding (e.g., RNNs, Transformers). In this context, $W(\cdot)$ represents edge weights, and $\theta$ the learning system's parameters. In its theoretically simplest form, without loss of generality, the algorithm aggregates the weights of edges incident to the node,
$$\forall v, \quad \hat{v} := v + \left[\sum_{e \in \mathcal{I}_v} W(e)\,\theta_W(e)\right], \quad (13)$$
where $\hat{v}$ is the post-encoding node embedding, $\mathcal{I}_v$ is the set of edges incident to $v$, and $\theta_W$ is the learned weight parameter. First, we know that $W(\cdot) \geq 0$ and $\sum_e W(e) > 0$ for both the Constant and SSAR cases. Then, whether $\hat{v} > v$ or $\hat{v} < v$, and the magnitude $|\hat{v} - v|$, depend only on the parameter $\theta_W(e)$. This implies that $\theta_W(e)$ can learn to de-noise the highly stochastic data. De-noising high noise-to-signal series improves results significantly [32]. Essentially, as long as the Constant weight satisfies
$$W(\cdot) := c \in \mathbb{R}_{\neq 0}, \quad (14)$$
$\theta_W$ can implicitly learn to de-noise the input and target space, resulting in improved out-sample performance. This explains why adding no statistical-space prior, but a simple augmented representation with fixed $W(\cdot) := c > 0$ for every edge, resulted in improved performance.</p>
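        <p>The following is a minimal numeric sketch of Equation (13); the values are illustrative only, and theta_w plays the learned de-noising role.</p>
        <preformat>
import numpy as np

def aggregate(v: np.ndarray, W_t: np.ndarray, theta_w: np.ndarray) -> np.ndarray:
    # v: (N,) node values; W_t: (N, N) non-negative weights, zero diagonal;
    # theta_w: (N, N) learned per-edge parameters. Column j sums edges into node j.
    return v + (W_t * theta_w).sum(axis=0)

# With the Constant ablation, W_t = c * (1 - I): theta_w alone decides the
# sign and magnitude of v_hat - v, i.e., it can learn to smooth/de-noise.
v = np.array([0.5, -1.2, 0.3])
W_const = 1.0 * (1 - np.eye(3))
theta_w = np.full((3, 3), 0.05)
print(aggregate(v, W_const, theta_w))
        </preformat>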
        <p>This implicit de-noising partially explains the superior performance of SSAR. The remaining improvements are due to approximating Equation (8) with (9). In short, SSAR can be decomposed into two effects: (i) SS: statistical-space encoding, which tracks the underlying distribution shift, and (ii) AR: augmented representation, which allows a learnable function approximator to implicitly de-noise the stochastic data.</p>
        <p>Decomposing SSAR into SS and AR, unlike the clear-cut effects in Figure 7, is challenging. As seen in Equation (13), $\theta_W(e)$ could not only learn to de-noise the data but also implicitly learn the $\mathbf{r} \leftarrow f_\phi(\mathbf{y}_t|\hat{\mathbf{x}}_{t-\tau})$ in Equation (5). Also, when providing the prior $P(\theta)$ in Equation (4) via $e \in \mathcal{E}$, which is passed through Equation (13), there is no clear way of decomposing the two effects. Thus, while the ablation study aids understanding of SSAR's mechanisms, it is not a rigorous method to quantify the two effects.</p>
      </sec>
      <sec id="sec-5-3">
        <title>6.3. Future Works</title>
        <p>Our work, which compares SSAR with Euclidean input-space-based state-of-the-art models, can be viewed as two ends of an extreme. Euclidean input-space-based models must learn the underlying non-stationary distribution implicitly, while SSAR takes a more deliberate approach.</p>
        <p>SSAR explicitly provides a statistical-space approximation at every $t$, (i) allowing the neural network to use an approximated regime vector and further learn the distribution shift, and (ii) bootstrapping the neural network with priors, given that our data is limited. However, in cases where we have access to $\mathcal{D} \sim P(\cdot)$, or $|\mathcal{D}|$ is already sufficiently large, we can hypothesize that a learned statistical space may be beneficial, i.e., implementing Equation (5) instead of Equation (9). In this case, the statistical space could be learned implicitly via $\theta: \cdots \times \langle v_i \to v_j \rangle \times \cdots \mapsto \mathbb{R}$, $i \neq j$, where edge weights are initialized $W(e) :\neq 0$ in Equation (13). Under the Bayesian view in Equation (4), this would correspond to the prior being a uniform distribution, $P(\theta) := U(\cdot)$. Contrarily, the statistical space could be learned explicitly, where the weights of the edges are learned directly, $\theta: \cdots \times \langle v_i \to v_j \rangle \times \cdots \mapsto \mathbb{R}_{\geq 0}$, $i \neq j$. This would closely mimic the attention mechanism in transformers.</p>
        <p>We encourage future research to explore these middle-ground approaches within the solution-space spectrum presented here. A more nuanced study could theoretically and empirically examine which method in the spectrum is most ideal under specific degrees of access to $P(\cdot)$, equivalently, the amount of data $\mathcal{D}$ available.</p>
      </sec>
      <sec id="sec-5-4">
        <title>A. Assumption: $\hat{\mathbf{x}}[t,:] := \mathbf{y}[t,:]$</title>
        <p>The assumption that the input-space features are equivalent to the output-space features is highly reasonable. Essentially, when training to predict $\mathbf{y}$, since $|\mathcal{D}| \gg 0 \Rightarrow \mathbf{y}[:, i] \gg 0\ \therefore\ \exists\, \hat{\mathbf{x}}[:, i] \gg 0$. Even if $\hat{\mathbf{x}}[t,:] \neq \mathbf{y}[t,:]$, the method and implications presented in this work hold with trivial modifications to the learning system.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>B. Data Source</title>
      <p>We use two representative data sets for financial markets. The first is an array of major macroeconomic exchange-traded funds (ETFs) and variables, available in Table 2. These variables are representative as they were chosen based on the largest worldwide trading volumes. This data set examines the effectiveness of our approach across many financial categories (inter-asset-class). The second data set is an array of major commodity futures, available in Table 3. Again, these features were chosen beforehand based on the largest worldwide trading volumes. This data set examines the effectiveness of our approach within a financial category (intra-asset-class): the commodity futures market.</p>
      <p>Both data sets are easily attainable via public sources. However, we source the data from S&amp;P Capital IQ and Bloomberg for high-quality data that is not adjusted later, to concretely prevent any look-ahead bias. The Bull-Bear Spread is sourced separately from the Investor Sentiment Index of the American Association of Individual Investors (AAII).</p>
      <p>The initial time step is set to the first date where valid data points exist for every variable. Data Set 1 spans from 2006-04-11 to 2022-07-08 in daily units. Data Set 2 spans from 1990-01-01 to 2023-06-26 in daily units.</p>
    </sec>
    <sec id="sec-7">
      <title>C. Data Processing</title>
      <p>The only data processing done from the raw data is transforming price data into return (change) data and pre-processing non-available (NaN) data points. We transform market variables to log returns, as is typical practice in the financial domain. Log returns are used instead of regular differences as the logarithm allows for computational convenience. Other data points are transformed with the regular difference approach, as their values are much smaller in magnitude and require higher levels of precision. The pseudo-code for the data processing is available in Algorithm 2.</p>
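      <p>The following is a minimal sketch of the Algorithm 2 transform; the function name and flag are ours.</p>
      <preformat>
import numpy as np

def process(series: np.ndarray, is_market_variable: bool) -> np.ndarray:
    series = series[~np.isnan(series)]           # drop non-available points
    if is_market_variable:
        return np.log(series[1:] / series[:-1])  # log return
    return np.diff(series)                       # regular difference
      </preformat>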
    </sec>
    <sec id="sec-8">
      <title>D. Computing statistical dependencies</title>
      <p>Algorithm 2 (Data Process) transforms the raw data as described in Appendix C: if a variable is a market price series, $\forall t: D[i][t] \leftarrow \log(D[i][t] / D[i][t-1])$; otherwise, $\forall t: D[i][t] \leftarrow D[i][t] - D[i][t-1]$.</p>
      <p>Given $\mathbf{x}_n := \{x_{t-1}, \ldots, x_{t-1-w}\}$ and $\mathbf{y}_n := \{y_{t-1}, \ldots, y_{t-1-w}\}$, the six measures are computed as follows. We remove the superscript $t$ for improved legibility and let $r_{\mathbf{x}_n,\mathbf{y}_n}$, $\rho_{\mathbf{x}_n,\mathbf{y}_n}$, and $\tau_{\mathbf{x}_n,\mathbf{y}_n}$ denote the Pearson correlation, Spearman rank correlation, and Kendall rank correlation, respectively, with $R(\cdot)$ denoting the rank of a time series:
$$r_{\mathbf{x}_n,\mathbf{y}_n} := \frac{\sum_i (x_{n,i} - \bar{\mathbf{x}}_n)(y_{n,i} - \bar{\mathbf{y}}_n)}{\sqrt{\sum_i (x_{n,i} - \bar{\mathbf{x}}_n)^2}\,\sqrt{\sum_i (y_{n,i} - \bar{\mathbf{y}}_n)^2}}, \quad (15)$$
$$\rho_{\mathbf{x}_n,\mathbf{y}_n} := r_{R(\mathbf{x}_n),\,R(\mathbf{y}_n)}, \quad (16)$$
$$\tau_{\mathbf{x}_n,\mathbf{y}_n} := \frac{N_c - N_d}{\binom{w}{2}}, \quad (17)$$
where $\bar{\mathbf{x}}_n$ denotes the mean of series $\mathbf{x}_n$, $N_c$ is the number of concordant pairs, and $N_d$ is the number of discordant pairs. A pair $\langle x_{n,i}, y_{n,i} \rangle, \langle x_{n,j}, y_{n,j} \rangle$ is concordant if the ranks of both elements agree in their order, $(x_{n,i} - x_{n,j})(y_{n,i} - y_{n,j}) > 0$, and discordant if they disagree, $(x_{n,i} - x_{n,j})(y_{n,i} - y_{n,j}) < 0$.</p>
      <p>We use Granger causality [4] based on Geweke's method [33]. Geweke's Granger causality (GC) is a frequency-domain approach to Granger causality. Geweke's Granger causality from $\mathbf{x}_n$ to $\mathbf{y}_n$ is computed as
$$F_{\mathbf{x}_n \to \mathbf{y}_n} := \ln\!\left(\frac{S_{\mathbf{y}_n\mathbf{y}_n}(\omega)}{S_{\mathbf{y}_n\mathbf{y}_n \mid \mathbf{x}_n}(\omega)}\right), \quad (18)$$
where $S_{\mathbf{y}_n\mathbf{y}_n}(\omega)$ is the spectral density of $\mathbf{y}_n$ and $S_{\mathbf{y}_n\mathbf{y}_n \mid \mathbf{x}_n}(\omega)$ is the spectral density of $\mathbf{y}_n$ given $\mathbf{x}_n$. We use Welch's method to estimate the spectral density, as it improves over periodograms in estimating the power spectral density of a signal [34].</p>
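      <p>The following is a hedged, coherence-based simplification of Equation (18), assuming the identity $S_{yy|x}(\omega) = S_{yy}(\omega)(1 - C_{xy}(\omega))$ with $C_{xy}$ the magnitude-squared coherence estimated via Welch's method; note that this simplification is symmetric, whereas the full Geweke measure is directional, and the function name is ours.</p>
      <preformat>
import numpy as np
from scipy import signal

def spectral_gc(x: np.ndarray, y: np.ndarray, fs: float = 1.0) -> float:
    # Welch-based magnitude-squared coherence C_xy(f)
    f, cxy = signal.coherence(x, y, fs=fs, nperseg=min(64, len(x)))
    eps = 1e-12
    gc_spectrum = -np.log(np.clip(1.0 - cxy, eps, 1.0))  # ln(S_yy / S_yy|x)
    return float(gc_spectrum.mean())                     # aggregate over frequencies
      </preformat>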
      <p>We use two information-theoretic measures: mutual information and transfer entropy. Mutual information (MI) represents the shared information between two variables, indicating their statistical interdependence [35]. In information theory, the behavior of system $\mathbf{x}_n$ can be characterized by the probability distribution $p(\mathbf{x}_n)$ or $\log p(\mathbf{x}_n)$. This measure is equivalent to the Pearson correlation coefficient if both variables have a normal distribution. To compute MI between two variables, we need the information entropy, formulated as
$$H(\mathbf{x}_n) := -\sum_{x \in \mathbf{x}_n} p(x) \log_2 p(x). \quad (19)$$
Shannon entropy quantifies the information required to select random values from a discrete distribution. The joint (information) entropy can be expressed as
$$H(\mathbf{x}_n, \mathbf{y}_n) := -\sum_{x \in \mathbf{x}_n}\sum_{y \in \mathbf{y}_n} p(x, y) \log_2 p(x, y). \quad (20)$$
Finally, we can define MI as the quantity identifying the interaction between subsystems:
$$I(\mathbf{x}_n, \mathbf{y}_n) := H(\mathbf{x}_n) + H(\mathbf{y}_n) - H(\mathbf{x}_n, \mathbf{y}_n). \quad (21)$$
Following Kvålseth (2017), we use normalized MI (NMI) with range $[0, 1]$ to ensure consistency across measures. It is computed as
$$\mathrm{NMI}(\mathbf{x}_n, \mathbf{y}_n) := \frac{I(\mathbf{x}_n; \mathbf{y}_n)}{\min(H(\mathbf{x}_n), H(\mathbf{y}_n))}. \quad (22)$$</p>
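      <p>The following is a minimal histogram-based sketch of Equations (19)-(22); the bin count b is an assumption, mirroring the bin parameter in the complexity discussion.</p>
      <preformat>
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def nmi(x: np.ndarray, y: np.ndarray, b: int = 8) -> float:
    joint, _, _ = np.histogram2d(x, y, bins=b)
    p_xy = joint / joint.sum()          # empirical joint distribution
    h_x = entropy(p_xy.sum(axis=1))     # H(x) from the marginal, Eq. (19)
    h_y = entropy(p_xy.sum(axis=0))     # H(y)
    h_xy = entropy(p_xy.ravel())        # H(x, y), Eq. (20)
    mi = h_x + h_y - h_xy               # Eq. (21)
    return mi / min(h_x, h_y)           # Eq. (22), range [0, 1]
      </preformat>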
      <p>Transfer entropy (TE) is a non-parametric metric leveraging Shannon's entropy, quantifying the amount of information transfer between two variables [36]. It builds on conditional MI,
$$I(\mathbf{x}_n; \mathbf{y}_n \mid \mathbf{z}_n) := \sum_{x, y, z} p(x, y, z)\, \log_2 \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}, \quad (23)$$
from which we can define the general form of $(k, l)$-history TE between two sequences $\mathbf{x}_n$ and $\mathbf{y}_n$, for $\mathbf{y}_{n,t}^{(k)} = (y_{n,t}, \ldots, y_{n,t-k+1})$ and $\mathbf{x}_{n,t}^{(l)} = (x_{n,t}, \ldots, x_{n,t-l+1})$:
$$T_{\mathbf{x}_n \to \mathbf{y}_n}(k, l) := \sum_{\Omega} p\big(y_{n,t+1}, \mathbf{y}_{n,t}^{(k)}, \mathbf{x}_{n,t}^{(l)}\big)\, \log_2 \frac{p\big(y_{n,t+1} \mid \mathbf{y}_{n,t}^{(k)}, \mathbf{x}_{n,t}^{(l)}\big)}{p\big(y_{n,t+1} \mid \mathbf{y}_{n,t}^{(k)}\big)}, \quad (24)$$
where $\Omega := \{y_{n,t+1}, \mathbf{y}_{n,t}^{(k)}, \mathbf{x}_{n,t}^{(l)}\}$ represents the possible sets of those three values. $T_{\mathbf{x}_n \to \mathbf{y}_n}(k, l)$ represents the information about the future state of $\mathbf{y}_n$ that is retrieved by subtracting the information obtained from $\mathbf{y}_{n,t}^{(k)}$ alone from the information gathered from $\mathbf{y}_{n,t}^{(k)}$ and $\mathbf{x}_{n,t}^{(l)}$ together. We set $k$ and $l$ to 1. Under these conditions, the $(1,1)$-history TE is computed with $\Omega = \{y_{n,t+1}, y_{n,t}, x_{n,t}\}$.</p>
      <p>This measure can be perceived as conditional mutual information, considering a variable's influence as a condition. Also, analogous to the established relationship between the Pearson correlation coefficient and mutual information, an equivalent association can be identified when the two variables comply with the premises of a normal distribution [37]. TE measures information flow via uncertainty reduction: "TE from $x$ to $y$" translates to the extent to which $x$ clarifies the future of $y$ beyond what $y$ can clarify about its own future.</p>
      <p>Relatedly, conditional entropy quantifies the requisite information to derive the outcome of a random variable $\mathbf{y}_n$, given that the value of another random variable $\mathbf{x}_n$ is known. It is computed as [38]
$$H(\mathbf{y}_n \mid \mathbf{x}_n) := -\sum_{x \in \mathbf{x}_n,\, y \in \mathbf{y}_n} p(x, y)\, \log_2 p(y \mid x). \quad (25)$$</p>
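      <p>The following is a minimal sketch of the $(1,1)$-history transfer entropy of Equation (24) over discretized series; the bin count b and function name are assumptions.</p>
      <preformat>
import numpy as np
from collections import Counter

def te_x_to_y(x: np.ndarray, y: np.ndarray, b: int = 4) -> float:
    dx = np.digitize(x, np.histogram_bin_edges(x, b))
    dy = np.digitize(y, np.histogram_bin_edges(y, b))
    triples = Counter(zip(dy[1:], dy[:-1], dx[:-1]))  # (y_{t+1}, y_t, x_t)
    pairs_yy = Counter(zip(dy[1:], dy[:-1]))          # (y_{t+1}, y_t)
    pairs_yx = Counter(zip(dy[:-1], dx[:-1]))         # (y_t, x_t)
    singles_y = Counter(dy[:-1].tolist())             # y_t
    n = len(dy) - 1
    te = 0.0
    for (y1, y0, x0), c in triples.items():
        p_joint = c / n
        p_full = c / pairs_yx[(y0, x0)]              # p(y_{t+1} | y_t, x_t)
        p_self = pairs_yy[(y1, y0)] / singles_y[y0]  # p(y_{t+1} | y_t)
        te += p_joint * np.log2(p_full / p_self)
    return te
      </preformat>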
    </sec>
    <sec id="sec-8b">
      <title>E. Descriptive Statistics and Statistical Properties</title>
      <p>As noted in Section 5.1, extensive preliminary statistical tests validate the complexity of both data sets; their descriptive statistics and statistical properties are summarized in the corresponding tables.</p>
    </sec>
    <sec id="sec-8c">
      <title>F. Diffusion Convolution</title>
      <p>We implement a t-GCN powered by diffusion convolutional recurrent neural networks (DCRNN) to learn SSAR's spatial and temporal dependency structure [39]. DCRNN shows state-of-the-art performance in modeling traffic dynamics with a spatial and a temporal dimension, represented graphically.</p>
      <p>The graph signal is $X \in \mathbb{R}^{N \times 1}$, as each node has a single feature. With $X^{(t)}$ representing the signal observed at time $t$, the diffusion t-GCN learns a function $g(\cdot)$:
$$[X^{(t-k)}, \ldots, X^{(t-1)}] \xrightarrow{g(\cdot)} [X^{(t)}]. \quad (26)$$
The diffusion process explicitly captures the spatial dimension and its stochastic features. The diffusion process in generative modeling works by encoding information via increasing noise through a Markov process, while decoding information via reversing the noise process [40]. The diffusion mechanism here is characterized by a random walk on $G_t$ with restart probability $\alpha \in [0, 1]$ and state transition matrix $D_O^{-1} W$, where $D_O = \mathrm{diag}(W \mathbf{1})$ is the out-degree diagonal matrix and $\mathbf{1} \in \mathbb{R}^N$ is the all-one vector. The stationary distribution $\mathcal{P} \in \mathbb{R}^{N \times N}$ of the diffusion process can be computed in closed form:
$$\mathcal{P} := \sum_{k=0}^{\infty} \alpha (1 - \alpha)^k (D_O^{-1} W)^k. \quad (27)$$
After sufficient time steps, as represented by the summation to infinity, the Markov process converges to $\mathcal{P}$. The intuition is as follows: $\mathcal{P}_{i,:} \in \mathbb{R}^N$ represents the diffusion probability from node $v_i$, i.e., it quantifies proximity with respect to that node. $k$ denotes the diffusion steps, and $K$ is typically set to a finite natural number, as each step is analogous to the filter size in a convolution.</p>
      <p>As a result, the diffusion convolution over our signal $X$ and a filter $f_\theta$ is described by
$$X_{:,1} \star f_\theta := \sum_{k=0}^{K-1} \left(\theta_{k,1} (D_O^{-1} W)^k + \theta_{k,2} (D_I^{-1} W^\top)^k\right) X_{:,1}, \quad (28)$$
where $\theta \in \mathbb{R}^{K \times 2}$ are the filter parameters and $D_O^{-1} W$, $D_I^{-1} W^\top$ are the diffusion-process transition matrices, with the latter representing the reverse process. A diffusion convolution layer within a neural network architecture maps the signal's feature size to an output of dimension $Q$. As we are working with a single feature, we denote a parameter tensor as $\Theta \in \mathbb{R}^{Q \times 1 \times K \times 2} = [\theta]_{q,1}$. The parameters for the $q$th output are $\Theta_{q,1} \in \mathbb{R}^{K \times 2}$. In short, the diffusion convolutional layer is described as
$$\mathcal{H}_{:,q} := \sigma(X_{:,1} \star f_{\Theta_{q,1,:,:}}), \quad \text{for } q \in \{1, \ldots, Q\}, \quad (29)$$
where input $X \in \mathbb{R}^N$ is mapped to output $\mathcal{H} \in \mathbb{R}^{N \times Q}$, and $\sigma(\cdot)$ is an activation function. With this GCN structure, we can train the network parameters via stochastic gradient descent.</p>
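      <p>The following is a minimal sketch of the diffusion convolution in Equations (28) and (29) for a single feature; names are ours.</p>
      <preformat>
import numpy as np

def diffusion_conv(X: np.ndarray, W: np.ndarray, theta: np.ndarray) -> np.ndarray:
    # X: (N,) signal; W: (N, N) weighted adjacency; theta: (K, 2) filter parameters.
    K = theta.shape[0]
    T_fwd = W / (W.sum(axis=1, keepdims=True) + 1e-12)      # D_O^{-1} W
    T_rev = W.T / (W.T.sum(axis=1, keepdims=True) + 1e-12)  # D_I^{-1} W^T, reverse
    out = np.zeros_like(X, dtype=float)
    P_f = np.eye(len(X))
    P_r = np.eye(len(X))
    for k in range(K):                                      # sum over diffusion steps
        out += (theta[k, 0] * P_f + theta[k, 1] * P_r) @ X
        P_f, P_r = P_f @ T_fwd, P_r @ T_rev
    return out
      </preformat>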
    </sec>
    <sec id="sec-8d">
      <title>G. Diffusion Convolutional Gated Recurrent Unit</title>
      <p>Next, the temporal dimension is modeled via a GRU, a variant of RNNs that better captures longer-term dependencies. Diffusion convolution replaces the standard matrix multiplication in the GRU architecture:
$$\mathbf{r}^{(t)} := \sigma(\Theta_r \star [X^{(t)}, \mathcal{H}^{(t-1)}] + \mathbf{b}_r), \quad (30)$$
$$\mathbf{u}^{(t)} := \sigma(\Theta_u \star [X^{(t)}, \mathcal{H}^{(t-1)}] + \mathbf{b}_u), \quad (31)$$
$$C^{(t)} := \tanh(\Theta_C \star [X^{(t)}, (\mathbf{r}^{(t)} \odot \mathcal{H}^{(t-1)})] + \mathbf{b}_C), \quad (32)$$
$$\mathcal{H}^{(t)} := \mathbf{u}^{(t)} \odot \mathcal{H}^{(t-1)} + (1 - \mathbf{u}^{(t)}) \odot C^{(t)}, \quad (33)$$
where at time step $t$, $\mathbf{r}^{(t)}$, $\mathbf{u}^{(t)}$, $X^{(t)}$, and $\mathcal{H}^{(t)}$ represent the reset gate, update gate, input tensor, and output tensor, respectively. $\Theta_r$, $\Theta_u$, and $\Theta_C$ represent the corresponding filter parameters [30].</p>
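      <p>The following is a hedged sketch of one DCGRU step corresponding to Equations (30)-(33); it assumes the diffusion_conv helper from the previous sketch is in scope, and splitting the concatenation $[X^{(t)}, \mathcal{H}^{(t-1)}]$ into two separately parameterized streams is our simplification, equivalent up to parameterization.</p>
      <preformat>
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def dcgru_step(X, H, W, th):
    # X, H: (N,) input and hidden signals; W: (N, N) adjacency;
    # th: dict of (K, 2) filter parameters per gate and stream.
    r = sigmoid(diffusion_conv(X, W, th["rx"]) + diffusion_conv(H, W, th["rh"]))      # (30)
    u = sigmoid(diffusion_conv(X, W, th["ux"]) + diffusion_conv(H, W, th["uh"]))      # (31)
    C = np.tanh(diffusion_conv(X, W, th["cx"]) + diffusion_conv(r * H, W, th["ch"]))  # (32)
    return u * H + (1.0 - u) * C                                                      # (33)
      </preformat>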
    </sec>
    <sec id="sec-8e">
      <title>H. Training and Inference Method</title>
      <p>The pseudo-code for the training and inference pipeline is available in Algorithms 3, 4, 5, and 6. The hyperparameter search in Algorithm 3 is done with 260 random-seed trials on 13 parallel CPU cores. In outline, Algorithm 3 (Training t-GCNs) splits the data, tunes hyperparameters for every $\langle f(\cdot), w \rangle$ pair on the training and validation sets, and trains each model for the tuned number of epochs; Algorithm 4 (Training Baselines) mirrors this for the Euclidean baselines; Algorithms 5 and 6 (t-GCN and Baselines Inference) collect the test MSE from one inference per independently seeded model $\hat{\theta}$.</p>
      <p>The hyperparameter search space for the GCNs is as follows.</p>
      <p>• Input Size: [8, 9, ..., 30]
• Hidden Layer Size: [8, 16, ..., 120]
• Learning Rate: [1e-1, 1e-2, ..., 1e-6]</p>
      <p>The tuned hyperparameters for each data set are presented in Tables 8, 9, 10, and 11.</p>
      <p>The diffusion t-GCN has five hyperparameters: (i) input vector size $k$, (ii) hidden layer size, (iii) diffusion steps (filter size) $K$, (iv) learning rate, and (v) training epochs. The $K$ for the set of non-linear causal measures, $\mathcal{M}_{caus}$, is set to 1, as the sparsity in $W(e) > 0$ causes computational errors. This makes the hyperparameter count four for $f(\cdot) \in \mathcal{M}_{caus}$. The output vector size is set to one, as the network predicts one time step into the future. The hyperparameters are equivalently optimized for every $\langle f(\cdot), w \rangle$ combination. The same approach is taken for t-GCN, excluding the hyperparameter $K$, as it is not part of that model.</p>
    </sec>
    <sec id="sec-8f">
      <title>I. Data Set 2 Test Set Quartile Results</title>
      <p>The results in Table 12 are for $w \in \{20, \ldots, 80\}$ in aggregate, corresponding to the main text's Figure 5. Figure 8 is Figure 5 of the main text, enlarged for better legibility. We note that the Constant case is excluded, as its smaller sample size does not allow for a fair statistical comparison.</p>
    </sec>
    <sec id="sec-8g">
      <title>J. Larger w</title>
      <p>The statistical analysis for larger $w$ is equally encouraging. In reference to Table 13, we first examine in aggregate whether SSARs beat the baselines. The aggregate MSE ± 1σ for Data Set 2 is 0.8664 ± 0.0060 (600 samples) and 2.0326 ± 9.0136 (500 samples) for SSARs and baselines, respectively. The T-statistic is -3.1695, corresponding to a one-sided p-value of 0.0008.</p>
      <p>To rigorously assess SSAR, we identify the best-performing baseline. Here, GRU performs best when taking the mean value. The T-statistic against GRU is -344.66 (P-val → 0). The |T-statistic| rises because the variance of GRU is significantly lower than that of the aggregate. In conclusion, the results hold even when raising $w$.</p>
    </sec>
    <sec id="sec-9">
      <title>K. Second Ablation</title>
      <p>Data Set 1 has no outliers due to its lower sample size. Therefore, we analyze the results after controlling for outliers in Data Set 2. First, we identify outliers as $\mathrm{MSE}_i > Q_3 + 3 \cdot \mathrm{IQR}$ or $\mathrm{MSE}_i < Q_1 - 3 \cdot \mathrm{IQR}$, where $Q_q$ represents the $q$th quartile, IQR represents the interquartile range, and $\mathrm{MSE}_i$ is an MSE data point. We observe that all outliers are adverse, i.e., $\mathrm{MSE}_i > Q_3 + 3 \cdot \mathrm{IQR}$. This is expected, as a low-MSE outlier would be numerically impossible since MSE > 0. Therefore, all outliers worsen performance and sharply reduce the stability of the learning system. The outlier study includes the larger $w$ values tested in Appendix J.</p>
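      <p>The following is a minimal sketch of the 3·IQR adverse-outlier rule; the function name is ours.</p>
      <preformat>
import numpy as np

def remove_outliers(mse: np.ndarray) -> np.ndarray:
    q1, q3 = np.percentile(mse, [25, 75])
    iqr = q3 - q1
    keep = (mse >= q1 - 3 * iqr) & (mse <= q3 + 3 * iqr)
    # With MSE > 0, only high-side (adverse) outliers occur in practice.
    return mse[keep]
      </preformat>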
      <p>We summarize the identified outliers in Table 14.</p>
      <p>We examine the results post-outlier-removal in Table 15. First, we examine in aggregate whether SSARs beat the baselines. The aggregate MSE ± 1σ is 0.8654 ± 0.0023 (2700 samples) and 1.1025 ± 0.0320 (2250 samples) for SSARs and baselines, respectively. The T-statistic is -383.82 (P-val → 0).</p>
      <p>To more rigorously assess the out-performance of our approach, we identify the best-performing baseline. Here, LSTM performs best when taking the mean MSE. Against LSTM, the T-statistic is -1,905 (P-val → 0). Correspondingly, we conclude that the results hold even when the adverse outliers in the baselines are removed.</p>
    </sec>
    <sec id="sec-10">
      <title>L. Complexity and Scalability</title>
      <p>The complexity of our representation can be described in two steps: (i) computing the statistical-space matrix and then (ii) generating the graph. Consistent with the main text, $N$ denotes the number of features and $T$ denotes the total number of samples, i.e., time steps. $b$ denotes the number of bins for MI and TE. Table 16 summarizes the time and space complexity for step (i). Each complexity value is multiplied by $N^2$, corresponding to each edge, i.e., each directed pair.</p>
      <p>The time complexity of generating the temporal graph representation is $O(T \times N^2)$. The corresponding space complexity is $O(T \times N^2)$ if stored in an adjacency matrix and $O(T \times (N + |\mathcal{E}|))$ if stored in an adjacency list, where $|\mathcal{E}|$ is the size of the directed edge list. SSAR is highly scalable in both the temporal and feature dimensions, given that the computed measures are provided. By using a finer discrete time step, $T$ can easily rise; however, the complexity rises only linearly with respect to $T$ for both time and space. Despite rising non-linearly, $N^2$, with respect to $N$, we note that $N \ll T$ in practice, a pattern that should hold when scaling to larger data sets to avoid overfitting.</p>
      <p>We used an Nvidia RTX 4070 Ti and an Nvidia RTX 2080 Ti as our GPUs for the baselines that can leverage high-core-count parallel computing. We always used a single-GPU system for each computational task. We used commonly available 6- to 32-virtual-CPU-core systems. Lastly, we used systems with 30 to 32 GB of RAM. Despite a total of 5084 random-seed (ablations and baselines included) training and inference experiments, our total time spent running experiments was within two weeks. We approximate that with five parallel systems, each with 5 CPU cores for the GCNs, and 5 CPU cores and a CUDA-enabled GPU for the baselines, all empirical studies can be conservatively replicated within ten days. We expect our implementation to have no scaling challenges in modern AI clusters.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <article-title>A convolutional neural network based approach to financial time series prediction</article-title>
          ,
          <source>Neural Computing and Applications</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>13319</fpage>
          -
          <lpage>13337</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dimri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sharif</surname>
          </string-name>
          ,
          <article-title>Time series analysis of climate variables using seasonal arima approach</article-title>
          ,
          <source>Journal of Earth System Science</source>
          <volume>129</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H. D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thomassey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hamad</surname>
          </string-name>
          ,
          <article-title>Forecasting and anomaly detection approaches using lstm and lstm autoencoder techniques with the applications in supply chain management</article-title>
          ,
          <source>International Journal of Information Management</source>
          <volume>57</volume>
          (
          <year>2021</year>
          )
          <fpage>102282</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Granger</surname>
          </string-name>
          ,
          <article-title>Investigating causal relations by econometric models and cross-spectral methods</article-title>
          ,
          <source>Econometrica: journal of the Econometric Society</source>
          (
          <year>1969</year>
          )
          <fpage>424</fpage>
          -
          <lpage>438</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lütkepohl</surname>
          </string-name>
          ,
          <source>New introduction to multiple time series analysis</source>
          ,
          <source>Springer Science &amp; Business Media</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Johansen</surname>
          </string-name>
          ,
          <article-title>Estimation and hypothesis testing of cointegration vectors in gaussian vector autoregressive models</article-title>
          ,
          <source>Econometrica: journal of the Econometric Society</source>
          (
          <year>1991</year>
          )
          <fpage>1551</fpage>
          -
          <lpage>1580</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>Neural network as a function approximator and its application in solving differential equations</article-title>
          ,
          <source>Applied Mathematics and Mechanics</source>
          <volume>40</volume>
          (
          <year>2019</year>
          )
          <fpage>237</fpage>
          -
          <lpage>248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sherstinsky</surname>
          </string-name>
          ,
          <article-title>Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network</article-title>
          ,
          <source>Physica D: Nonlinear Phenomena</source>
          <volume>404</volume>
          (
          <year>2020</year>
          )
          <fpage>132306</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural computation</source>
          <volume>9</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>van Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          ,
          <source>arXiv preprint arXiv:1406.1078</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Galassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lippi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torroni</surname>
          </string-name>
          ,
          <article-title>Attention in natural language processing</article-title>
          ,
          <source>IEEE transactions on neural networks and learning systems</source>
          <volume>32</volume>
          (
          <year>2020</year>
          )
          <fpage>4291</fpage>
          -
          <lpage>4308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Samad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vidyaratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Glandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Iftekharuddin</surname>
          </string-name>
          ,
          <article-title>Survey on deep neural networks in speech and vision systems</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>417</volume>
          (
          <year>2020</year>
          )
          <fpage>302</fpage>
          -
          <lpage>321</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting</article-title>
          ,
          <source>in: Proceedings of the 39th International Conference on Machine Learning (ICML 2022)</source>
          , Baltimore, Maryland,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <article-title>Autoformer: Decomposition transformers with Auto-Correlation for long-term series forecasting</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Informer: Beyond efficient transformer for long sequence time-series forecasting</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>35</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>11106</fpage>
          -
          <lpage>11115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dustdar</surname>
          </string-name>
          ,
          <article-title>Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Are transformers effective for time series forecasting?</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>37</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>11121</fpage>
          -
          <lpage>11128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Graph neural networks: Foundation, frontiers and applications</article-title>
          ,
          <source>in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          , KDD '22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>4840</fpage>
          -
          <lpage>4841</lpage>
          . URL: https://doi.org/10.1145/3534678.3542609. doi:10.1145/3534678.3542609.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <article-title>Popularity prediction on social platforms with coupled graph neural networks</article-title>
          ,
          <source>in: Proceedings of the 13th International Conference on Web Search and Data Mining</source>
          , WSDM '20, Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>78</lpage>
          . URL: https://doi.org/10.1145/3336191.3371834. doi:10.1145/3336191.3371834.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barati Farimani</surname>
          </string-name>
          ,
          <article-title>Molecular contrastive learning of representations via graph neural networks</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>4</volume>
          (
          <year>2022</year>
          )
          <fpage>279</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Spatial-temporal fusion graph neural networks for traffic flow forecasting</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>35</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>4189</fpage>
          -
          <lpage>4196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Temporal and heterogeneous graph neural network for financial time series prediction</article-title>
          ,
          <source>in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3584</fpage>
          -
          <lpage>3593</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>V.</given-names>
            <surname>Makoviychuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wawrzyniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Storey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Macklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hoeller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Allshire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Handa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>State</surname>
          </string-name>
          ,
          <article-title>Isaac Gym: High performance GPU-based physics simulation for robot learning</article-title>
          ,
          <source>in: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=fgFBtYgJQX_.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cybenko</surname>
          </string-name>
          ,
          <article-title>Approximation by superpositions of a sigmoidal function</article-title>
          ,
          <source>Mathematics of control, signals and systems</source>
          <volume>2</volume>
          (
          <year>1989</year>
          )
          <fpage>303</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hornik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stinchcombe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <article-title>Multilayer feedforward networks are universal approximators</article-title>
          ,
          <source>Neural networks</source>
          <volume>2</volume>
          (
          <year>1989</year>
          )
          <fpage>359</fpage>
          -
          <lpage>366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maldonado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aguilera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Roldan</surname>
          </string-name>
          ,
          <article-title>Memristor variability and stochastic physical properties modeling from a multivariate time series approach</article-title>
          ,
          <source>Chaos, Solitons &amp; Fractals</source>
          <volume>143</volume>
          (
          <year>2021</year>
          )
          <fpage>110461</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bastianin</surname>
          </string-name>
          ,
          <article-title>Robust measures of skewness and kurtosis for macroeconomic and financial time series</article-title>
          ,
          <source>Applied Economics</source>
          <volume>52</volume>
          (
          <year>2020</year>
          )
          <fpage>637</fpage>
          -
          <lpage>670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>Financial time-series forecasting: Towards synergizing performance and interpretability within a hybrid machine learning approach</article-title>
          ,
          <source>arXiv preprint arXiv:2401.00534</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shahabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Diffusion convolutional recurrent neural network: Data-driven traffic forecasting</article-title>
          ,
          <source>arXiv preprint arXiv:1707.01926</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>T-GCN: A temporal graph convolutional network for traffic prediction</article-title>
          ,
          <source>IEEE transactions on intelligent transportation systems</source>
          <volume>21</volume>
          (
          <year>2019</year>
          )
          <fpage>3848</fpage>
          -
          <lpage>3858</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Curriculum learning and imitation learning for model-free control on financial time-series</article-title>
          ,
          <source>arXiv preprint arXiv:2311.13326</source>
          , AAAI 2024 AI for Time Series Analysis (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Geweke</surname>
          </string-name>
          ,
          <article-title>Measurement of linear dependence and feedback between multiple time series</article-title>
          ,
          <source>Journal of the American Statistical Association</source>
          <volume>77</volume>
          (
          <year>1982</year>
          )
          <fpage>304</fpage>
          -
          <lpage>313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>P.</given-names>
            <surname>Welch</surname>
          </string-name>
          ,
          <article-title>The use of fast fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms</article-title>
          ,
          <source>IEEE Transactions on Audio and Electroacoustics</source>
          <volume>15</volume>
          (
          <year>1967</year>
          )
          <fpage>70</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          ,
          <article-title>A mathematical theory of communication</article-title>
          ,
          <source>The Bell System Technical Journal</source>
          <volume>27</volume>
          (
          <year>1948</year>
          )
          <fpage>379</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schreiber</surname>
          </string-name>
          ,
          <article-title>Measuring information transfer</article-title>
          ,
          <source>Physical review letters</source>
          <volume>85</volume>
          (
          <year>2000</year>
          )
          <fpage>461</fpage>
          -
          <lpage>464</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>L.</given-names>
            <surname>Barnett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Seth</surname>
          </string-name>
          ,
          <article-title>Granger causality and transfer entropy are equivalent for Gaussian variables</article-title>
          ,
          <source>Physical review letters</source>
          <volume>103</volume>
          (
          <year>2009</year>
          )
          <fpage>238701</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>L.</given-names>
            <surname>Barnett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Lizier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Seth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bossomaier</surname>
          </string-name>
          ,
          <article-title>Information flow in a kinetic ising model peaks in the disordered phase</article-title>
          ,
          <source>Physical Review Letters</source>
          <volume>111</volume>
          (
          <year>2013</year>
          )
          <fpage>177203</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shahabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Diffusion convolutional recurrent neural network: Data-driven traffic forecasting</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2018</year>
          . URL: https://openreview.net/forum?id=SJiHXGWAZ.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>10684</fpage>
          -
          <lpage>10695</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>