<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Encoding Temporal Statistical-space Priors via Augmented Representation under Data Scarcity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Insu Choi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Woosung Koh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gimin Kang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuntae Jang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Woo Chang Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Korea Advanced Institute of Science and Technology (KAIST)</institution>
          ,
          <addr-line>Daejeon</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Yonsei University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Modeling time series data is a fundamental challenge across various domains due to the intrinsic temporal dimension. Despite significant strides in time series forecasting, high noise-to-signal ratios, non-normality, non-stationarity, and a lack of data continue to challenge practitioners. To address these, we introduce a simple representation augmentation technique. Our augmented representation acts as a statistical-space prior encoded at each time step. Accordingly, we term our method Statistical-space Augmented Representation (SSAR). The underlying high-dimensional data-generating process inspires our representation augmentation. We rigorously examine the empirical generalization performance on two data sets with two downstream temporal learning algorithms. Our approach significantly beats all five up-to-date baselines. Furthermore, our approach's modular design facilitates easy adaptation to diverse settings. Lastly, we provide comprehensive theoretical insights throughout the paper to underpin our methodology with a clear and rigorous understanding.</p>
      </abstract>
      <kwd-group>
<kwd>Augmented Representation</kwd>
        <kwd>Spatio-temporal Learning</kwd>
        <kwd>Information Theory</kwd>
        <kwd>Time Series Forecasting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Time series forecasting is crucial across multiple domains such as finance [1], meteorology [2], and manufacturing [3]. Simple time series, ones that are less stochastic and depend on a tractable number of variables, do exist; research, however, primarily targets time series with complex, high-dimensional dependencies. Often, the true set of causal factors, $\mathcal{C}$, is intractable, i.e., unknown, or known but impractical to compute. On top of this, complex time series structures, $P(\mathbf{y} \in \mathcal{Y} \mid \mathbf{x} \in \mathcal{C})$, often exhibit non-stationarity, further challenging modeling.</p>
<p>Initially, methods like the vector autoregressive (VAR) model [4, 5] dominated multivariate forecasting. Extensions like the vector error correction model (VECM) [6] addressed some limitations of VAR models, but non-stationarity still posed challenges. Despite their widespread use, these statistical models have caveats, particularly vis-à-vis their underlying statistical property assumptions. Thus, any analysis using these models requires examination of these assumptions, especially the stationarity assumption, potentially requiring transformations to the data.</p>
<p>In response, neural network-based sequential models have become popular in the past decade. Their main advantage is that a universal function approximator flexibly captures high-dimensional non-linear dependency structures [7]. The most widely tested and verified for time series forecasting are Recurrent Neural Network (RNN) [8] architectures, with flagship examples being Long Short-Term Memory (LSTM) [9] and the Gated Recurrent Unit (GRU) [10]. Both LSTM and GRU are part of our baseline.</p>
<p>More recently, with the out-performance of attention mechanism-based models like transformers in other sequential tasks such as natural language processing (NLP) [11] and speech recognition [12], numerous transformer-based time series forecasting models have been developed. Our main contributions are as follows:
• We develop an easily reproducible augmented representation technique, SSAR, that targets modeling complex non-stationary time series.
• We clearly discuss the theoretical need for augmenting the input space and why it works well against baselines.
• We theoretically discuss the method's inspiration: the data-generating process of high-dimensional time series structures.
• To our knowledge, we are the first to leverage (asymmetric) information-theoretic measures in modeling the statistical-space.
• We show out-sample improvement, in both performance and stability, against up-to-date baselines: (i) LSTM, (ii) GRU, (iii) Linear, (iv) NLinear, (v) DLinear.
• We test out-sample empirical results on two data sets and two downstream temporal graph learning algorithms.
• We present a theoretically unified view with related work, suggesting that SSAR implicitly smooths stochastic data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
<p>Our work is related to temporal graph learning algorithms, as our approach transforms a vector-based time series representation into a graph-based one. A downstream graph learning algorithm is then applied to make predictions. Fundamentally, multi-layer perceptrons (MLPs) are incompatible with graph representations. However, graphs naturally represent various real-world phenomena [19]. E.g., social networks [20], chemical molecules [21], and traffic systems [22] inherently possess graphical structures. Graph Neural Networks (GNNs) bridge this gap, enabling learning directly from graphical structures. Contrary to works with predefined edge sets $\mathcal{E}$, we derive $\mathcal{E}$ from historical vertex values $V$. The closest past work is [23], which generates a Pearson correlation-based $\mathcal{E}$ from $V$. However, their $\mathcal{E}$ specifically proxies inter-company relations, tailored to their domain. Additionally, their edges $e \in \mathcal{E}$ are undirected and symmetric. In contrast, our approach (i) is domain-agnostic, (ii) employs a simple representation augmentation to surpass the state-of-the-art, (iii) is modular with broad algorithm compatibility, (iv) incorporates directed asymmetric measures for $\mathcal{E}$, and (v) emphasizes theoretical analysis of the augmentation mechanism.</p>
    </sec>
    <sec id="sec-2a">
      <title>3. Preliminary: Complex Time Series</title>
      <p>Modeling complex time series via neural networks presents three key challenges: (i) incomplete modeling, (ii) non-stationarity, and (iii) limited access to the data-generating process.</p>
<p>Let $P(\mathbf{y}_t \mid \mathbf{x}_{t-\tau})$ be the true probability structure we want to learn. Here, $\mathbf{y}$ is defined by the modeler as the variables of interest. Unlike $\mathbf{y}$, the causal input set $\mathcal{C}$ is intractable for complex problems as (i) it is too large to be computed realistically, but more pressingly (ii) it is unknown a priori. Therefore, we typically use heuristics or empirical evidence to identify an approximation $\hat{\mathcal{C}}$. Since we are forecasting, we use lagged values, with $\tau$ indicating the temporal magnitude of the most lagged value. Then, with a learner parameterized by $\theta$, via maximum likelihood estimation we train for $\hat{P}_\theta(\mathbf{y}_t \mid \hat{\mathbf{x}}_{t-\tau})$ where $\hat{P}_\theta(\mathbf{y}_t \mid \hat{\mathbf{x}}_{t-\tau}) \approx P(\mathbf{y}_t \mid \hat{\mathbf{x}}_{t-\tau})$. Often, due to $\mathcal{C}$'s intractability, in vector form we set $\hat{\mathbf{x}}[t,:] := \mathbf{y}[t,:]$, using the output-space's lagged values as the input-space. We use this heuristic in our study and explain why it is a reasonable assumption in the Appendix. Since $\hat{\mathbf{x}}$ is a tractable approximation to the true input-space, we face the partial observation and incomplete modeling problem. This underlies much of the stochasticity and poor performance in forecasting high-dimensional structures. For domains that aggregate information on the global level, like financial and climate time series, it is fair to assume that $|\mathcal{C}| \to \infty$, dramatically raising the difficulty.</p>
<p>On top of this, we have a second, more pervasive challenge: non-stationarity. Non-stationarity is defined as $P_t(\mathbf{y}|\mathbf{x}) \neq P_{t'}(\mathbf{y}|\mathbf{x})$ where $t \neq t'$. The cause of non-stationarity could be partial observability. Figure 1's left diagram summarizes this problem. Note that the distributions are 1-dimensional for a simplified visual depiction. This poses a significant challenge to neural-network-based approximators $\hat{P}_\theta(\mathbf{y}|\mathbf{x})$, as MLPs, the building block, inherently work on stationary data sets.</p>
<p>The final challenge involves neural-network-based function approximators $f_\theta: \mathcal{X} \mapsto \mathcal{Y}$. The cost of a highly flexible function approximator $f_\theta$ is the large parameter count $|\theta|$. Consequently, as $|\theta|$ rises, the size of the data set $|\mathcal{D}|$ should also rise, allowing $f_\theta$ to generalize better out-sample, i.e., to better approximate $P(\mathbf{y}|\mathbf{x})$. Ideally, $\partial|\mathcal{D}|/\partial|\theta| > 0$, but raising $|\mathcal{D}|$ arbitrarily is often intractable for complex time series. There are cases where reasonable simulators exist for the data-generating process $P(\mathbf{y}|\mathbf{x})$, especially when $\mathcal{C}$ is tractable and the transition function is well approximated by rules. A representative example is physics simulators in the robotics field [24], where the simulator models the real world, $\tilde{P}(\mathbf{y}|\mathbf{x}) \approx P(\mathbf{y}|\mathbf{x})$. Correspondingly, we would require a world simulator for complex time series with an intractably high-dimensional data-generating process. Since we have no world simulator, raising $|\mathcal{D}|$ requires time to pass. Therefore, we are restricted to a finite, lacking $\mathcal{D}$.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <sec id="sec-3-1">
        <title>4.1. Statistical-space Augmented Representation</title>
<p>In response to these three challenges, we apply our method, SSAR. We rigorously examine how SSAR overcomes each challenge in Section 4.3. A high-level overview of SSAR involves: (i) selecting a statistical measure, (ii) computing this measure $f(\mathbf{y}|\mathbf{x}, t, w)$ for each time $t$ with sliding window $w$, and (iii) generating a graph $G_t$ where vertices $v \in V$ represent variables at $t$, and the weights of edges $W(e)$, where edges $e \in \mathcal{E}$, represent $f(\mathbf{y}|\mathbf{x}, t, w)$. Then, with the spatio-temporal graph $\mathcal{G} := \bigcup_t G_t$, any temporal graph learning algorithm that makes temporal node predictions can be applied.</p>
<p>As seen in Figure 1, right, $\mathrm{SSAR}: \mathcal{D} \mapsto \mathcal{G}$, where the original time series data $\mathcal{D}$ is in vector form, $\mathbf{d} \in \mathcal{D}$. The per-time-step functional view would be $G_t \leftarrow \mathrm{SSAR}(\mathbf{d}[t-w:] \in \mathcal{D})$. Algorithm 1 details the pseudo-code for $\mathrm{SSAR}(\cdot)$. For every $t$, $\mathbf{d}$ is transformed into a weighted, directed graph $G_t = \langle V, \mathcal{E}, W_t \rangle$, where $V$ is the set of nodes with $|V| = N$, $\mathcal{E}$ is the set of directed edges with $|\mathcal{E}| = N^2 - N$, and $W_t \in \mathbb{R}^{N \times N}$ is the weighted adjacency matrix. Here, each node $v \in V$ represents a variable (scalar) in $\hat{\mathbf{x}}$ and $\mathbf{y}$. Each $e \in \mathcal{E}$ is a 2-tuple denoted $\langle v_i, v_j \rangle$, $i \neq j$, with each tuple corresponding to a permutation pair of nodes. $G_t$'s $|\mathcal{E}| = N^2 - N$ because each permutation pair corresponds to a single directed edge and nodes cannot direct to themselves; i.e., $G_t$ is irreflexive. Given that the size of $W_t$ is computed excluding the diagonal elements, with each weight $\geq 0 \in \mathbb{R}$, $W_t$ is equivalent in size to $\mathcal{E}$, as each $e$ maps to a single weight; i.e., $W: \mathcal{E} \mapsto \mathbb{R}_{\geq 0}$. Here, $W(v_i \to v_j) \leftarrow f(\mathbf{y}_j \mid \mathbf{x}_i, t, w)$. An intuitive visualization is available in Figure 2, and a minimal construction sketch follows.</p>
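        <p>The following is a minimal construction sketch of one SSAR graph, assuming Pearson correlation as the measure; the helper name ssar_graph and the use of scipy.stats are our illustration, not the authors' released implementation.</p>
        <preformat>
# Minimal sketch: map a (w x N) window of N variables to the weighted,
# directed adjacency matrix W_t of an irreflexive graph G_t.
import numpy as np
from scipy import stats

def ssar_graph(d_window: np.ndarray) -> np.ndarray:
    w, n = d_window.shape
    W_t = np.zeros((n, n))
    for i in range(n):              # source node v_i
        for j in range(n):          # target node v_j
            if i == j:
                continue            # nodes cannot direct to themselves
            r, _ = stats.pearsonr(d_window[:, i], d_window[:, j])
            W_t[i, j] = abs(r)      # symmetric measures are mapped to |f|
    return W_t

# One graph per time step t over a sliding window of size w:
# G = [ssar_graph(d[t - w:t]) for t in range(w, len(d))]
        </preformat>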
      </sec>
      <sec id="sec-3-1a">
        <title>4.2. Data-generating Process Meta-physics</title>
        <p>SSAR is inspired by the meta-physics of the data-generating process of complex time series. The data-generating process refers to $P(\cdot)$. Access to $P(\cdot)$ allows for sampling data $\mathcal{D} \sim P(\cdot)$ and approximating $\hat{P}(\cdot)$ through maximum likelihood estimation based on $\mathcal{D}$. On a different note, this abstracted discussion aims to shed light on how a true $P(\cdot)$ arises in the real world; i.e., it hypothesizes on the mechanisms underlying $P(\cdot)$, then describes how they inspire our approach.</p>
        <p>Consider complex time series as described in the preliminary section.</p>
        <p>Definition 4.1. A complex time series, causal in nature, is defined as $P(\mathbf{y}_t \mid \mathbf{x}_{t-\tau} \in \mathcal{C})$ where $\mathcal{C}$ is intractable, i.e., $|\mathcal{C}| \to \infty$.</p>
        <p>Similar to the theoretical nature of $P(\cdot)$, the concept of $\mathcal{C}$ is also theoretical, given that its variables are human-defined. This implies that an arbitrary degree of granularity may describe $\mathcal{C}$; i.e., $|\mathcal{C}|$ can be raised arbitrarily until we reach the smallest units of the physical world. For instance, a high-level event like COVID-19 is an example of $c \in \mathcal{C}$, which can be further broken down into granular events like patient zero's contraction of the virus and so forth. Given $P(\mathbf{y}_t \mid \mathbf{x}_{t-\tau} \in \mathcal{C})$, consider $\mathcal{C}_m, \mathcal{C}' \subset \mathcal{C}$, where the former is digitally measured by humans in time series format and the latter comprises the remaining elements: $\mathcal{C}_m \cup \mathcal{C}' \equiv \mathcal{C}$ and $\mathcal{C}_m \cap \mathcal{C}' = \emptyset$. In the case of learning algorithms that require numerical input and output spaces, naturally, $\mathbf{y} \subseteq \mathcal{C}_m$ and $\hat{\mathbf{x}} \subseteq \mathcal{C}_m$. Define any information transfer within $\mathcal{C}_m$ as endogenous and any within $\mathcal{C}'$ as exogenous to the system. As not every real-world physical change is digitally tracked, each endogenous change has its roots in some exogenous change. With this backdrop, the set of numerical variables digitally available to us is a system that absorbs an arbitrary amount of exogenous shocks at every $t$.</p>
        <p>Let $c' \in \mathcal{C}'$ and $c \in \mathcal{C}_m$. Then, a simplified view of the data-generating process can be visualized in Figure 3. Each node at the top of the diagram represents $c' \in \mathcal{C}'$, while each node at the bottom represents $c \in \mathcal{C}_m$. Within the diagram, $|\mathcal{C}'| \to \infty$ is indicated via "...". Blue and purple edges show causal chains in the real physical world. Each dotted edge represents an exogenous shock to the endogenous system. Non-dotted green and red edges at each time step represent $P(\mathbf{y}_t|\mathbf{x}_{t-1})$. However, since there exists an unknown $P(\mathbf{x}_{t-1}|\mathbf{x}'_{t-2})$,
$$P(\mathbf{y}_t|\mathbf{x}_{t-1}) = P(\mathbf{y}_t \mid P(\mathbf{x}_{t-1}|\mathbf{x}'_{t-2})). \quad (1)$$
Under this view, all complex time series are inherently non-stationary and, consequently, incompatible with models assuming stationarity. Consequently, for models that require stationary data, we require some tractable function $g(\cdot)$ such that
$$g(\cdot) \approx P(\mathbf{x}_{t-1}|\mathbf{x}'_{t-2}). \quad (2)$$</p>
        <p>The next section draws inspiration from the inherently directed graphical nature of the data-generating process, as illustrated in Figure 3, to theoretically unpack our method.</p>
      </sec>
      <sec id="sec-3-1b">
        <title>4.3. Prior Encoding: Theoretical View</title>
        <p>By the universal approximation theorem [25, 26], any stationary mapping can be approximated by neural networks. MLPs and their subsequent architectural innovations implicitly model high-dimensional statistical spaces:</p>
        <p>$$O_0 := \sigma(W_0 X + b_0), \quad O_1 := \sigma(W_1 O_0 + b_1), \quad O_2 := \phi(W_2 O_1 + b_2), \quad (3)$$
where $\theta = \{\bigcup_i W_i, \bigcup_i b_i\}$, $X$ is the input tensor, and $\sigma, \phi$ are non-linear activations. Given that neural networks are directed graphs, the explicit representation by SSAR (Figures 1 and 2) can be implicitly captured by (3). Despite this, we opt for an explicit representation encoded as a Bayesian prior $P(\theta)$. Under the Bayesian view of learning from data,
$$P(\theta \mid \mathcal{D}) := \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}. \quad (4)$$</p>
        <p>This inductive bias, if accurate, can be helpful for generalization performance when $|\mathcal{D}| \ll \infty$. As noted earlier, complex time series feature a finite $\mathcal{D}$, making it challenging to increase its size.</p>
        <p>Our prior encoding at every $t$, as visualized in Figure 1, left, aids learning by overcoming non-stationarity. Since we are learning the distribution $\hat{P}_\theta(\mathbf{y}_t|\hat{\mathbf{x}}_{t-\tau})$, a natural approach to capture the non-stationarity would be to add a second parameter, a regime vector $\mathbf{r}$, resulting in learning $\hat{P}_\theta(\mathbf{y}_t \mid \hat{\mathbf{x}}_{t-\tau}, \mathbf{r} \leftarrow f_\phi(\mathbf{y}_t|\hat{\mathbf{x}}_{t-\tau}))$. This involves learning $f_\phi(\cdot)$. In this case,
$$\hat{P}_{\theta'}(\mathbf{y}_t \mid \hat{\mathbf{x}}_{t-\tau},\ \mathbf{r} \leftarrow f_\phi(\mathbf{y}_t|\hat{\mathbf{x}}_{t-\tau})), \quad (5)$$
$$\therefore\ \theta' := \theta \cup \phi\ \Rightarrow\ |\theta'| > |\theta|\ \because\ \phi \neq \emptyset. \quad (6)$$</p>
        <p>Given the small size of $\mathcal{D}$ relative to $\theta$, increasing degrees of freedom without further sampling $\mathcal{D} \sim P(\cdot)$ is not ideal.</p>
        <p>An ideal alternative is letting a statistical-space relationship at $t$ proxy for $f_\phi(\cdot)$, i.e., $f(\mathbf{y} \in \mathcal{Y} \mid \hat{\mathbf{x}} \in \hat{\mathcal{C}}_{t-\tau}, t, w) \approx P(\mathbf{y} \in \mathcal{Y} \mid \hat{\mathbf{x}} \in \hat{\mathcal{C}}_{t-\tau})$. But, like $P(\mathbf{y}|\hat{\mathbf{x}})$, $f(\mathbf{y}|\hat{\mathbf{x}}, t, w)$ is unknown a priori. In this case, like $f_\phi(\cdot)$, we would require a learned approximation $f_{\phi'}(\cdot)$, raising the size of the aggregate parameters.</p>
        <p>A reasonable and tractable approximation known a priori that does not raise the parameter count is
$$f(\mathbf{y}_{t-1}|\hat{\mathbf{x}}, t-1, w) \approx f(\mathbf{y}_t|\hat{\mathbf{x}}, t, w) \approx P(\mathbf{y}_t|\hat{\mathbf{x}}). \quad (7)$$
Assuming sufficient granularity in time steps $t$, $f(\mathbf{y}_{t-1}|\hat{\mathbf{x}}, t-1, w)$ closely approximates $f(\mathbf{y}_t|\hat{\mathbf{x}}, t, w)$. We hypothesize that the trade-off between parameter count and approximation via $t-1$ is advantageous to the learning system.</p>
        <p>Despite identifying a feasible regime-changing approximator, another problem remains. Representing and passing $f(\mathbf{y}_{t-1}|\hat{\mathbf{x}}, t-1, w)$ via Euclidean geometry significantly reduces the spatial information inherent to it. A natural representation is graphical, like Figure 3. Therefore, we approximate (8) with (9) via (10), (11), and (12). This transformation, which augments the representation, theoretically encapsulates SSAR:
$$\hat{P}_\theta(\mathbf{y} \mid \hat{\mathbf{x}},\ \mathbf{r} \leftarrow P(\mathbf{y}|\hat{\mathbf{x}})) \quad (8)$$
$$\approx\ \hat{P}_\theta(\mathbf{v}_t \in V \mid \mathbf{v}_{t-\tau} \in V,\ \mathbf{e}_{t-\tau} \in \mathcal{E}), \quad (9)$$
$$\mathbf{v}_t := \mathbf{y}_t, \quad (10)$$
$$\mathbf{v}_{t-\tau} := \hat{\mathbf{x}}_{t-\tau}, \quad (11)$$
$$\mathbf{e}_{t-\tau} \approx \hat{\mathbf{r}} \approx \mathbf{r},\ \text{where}\ \mathbf{e}_{t-\tau} \leftarrow f(\mathbf{y}_{t-1}|\hat{\mathbf{x}}, t-1, w). \quad (12)$$</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.4. Statistical-space Measures</title>
        <p>Six measures are used to compute $f(\mathbf{y}_{t-1}|\hat{\mathbf{x}}, t-1, w)$. The set of measures and corresponding abbreviations is $\mathcal{M} :=$ {Pearson correlation: Pearson, Spearman rank correlation: Spearman, Kendall rank correlation: Kendall, Granger causality: GC, Mutual information: MI, Transfer entropy: TE}. This set divides into correlation-based $\mathcal{M}_{corr}$ and causal-based $\mathcal{M}_{caus}$ measures, which are symmetric and asymmetric, respectively: $\mathcal{M}_{corr} :=$ {Pearson, Spearman, Kendall} $\subset \mathcal{M}$, $\mathcal{M}_{caus} :=$ {GC, MI, TE} $\subset \mathcal{M}$, $\mathcal{M}_{corr} \cup \mathcal{M}_{caus} \equiv \mathcal{M}$, and $\mathcal{M}_{corr} \cap \mathcal{M}_{caus} = \emptyset$. A symmetric measure satisfies $f(v_j|v_i) = f(v_i|v_j)$ for all $\langle v_i, v_j \rangle \in \mathcal{E}$, $i \neq j$; an asymmetric measure is one where $f(v_j|v_i) \neq f(v_i|v_j)$. The asymmetric case is most appropriate for our use case, as it uses only lagged values, making the measures a proxy for causal effects. Embedding $f(\cdot)$ from $\mathcal{M}_{caus}$ as weights is more natural, as $f_{caus}: V \times V \mapsto \mathbb{R}_{\geq 0}$. On the other hand, $f_{corr}: V \times V \mapsto [-1, 1]$; therefore, we let $W \leftarrow |f_{corr}|$. We empirically test all six.</p>
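        <p>The following is a hedged sketch of the weight mapping for the correlation-based family, assuming scipy.stats implementations; GC, MI, and TE, which are already non-negative, are sketched in the Appendix. The function name edge_weight is ours.</p>
        <preformat>
# Correlation-based measures land in [-1, 1], so we take W <- |f|.
import numpy as np
from scipy import stats

MEASURES = {"pearson": stats.pearsonr,
            "spearman": stats.spearmanr,
            "kendall": stats.kendalltau}

def edge_weight(x_lagged: np.ndarray, y: np.ndarray, measure: str) -> float:
    f_val = MEASURES[measure](x_lagged, y)[0]  # statistic; p-value ignored
    return abs(f_val)                          # non-negative weight in [0, 1]
        </preformat>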
        <p>The hyperparameter $w$ is inherent to SSAR, as it is required to compute $f(\cdot)$. An additional hyperparameter, $k$, exists for all downstream algorithms. The scalar $k$ represents the number of previous time steps fed into the model; in our case, $k$ is the number of historic graphs available at each $t$. Attaching SSAR to a downstream algorithm therefore involves two sliding windows: $w$ and $k$. An intuitive visualization is provided in Figure 4. The computational details for every $f(\cdot) \in \mathcal{M}$ are available in the Appendix.</p>
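        <p>The following is a hedged sketch of how the two sliding windows compose into one training sample; make_sample and measure_fn are hypothetical names, not the authors' interface.</p>
        <preformat>
# k historic graphs, each computed from its own trailing window of size w,
# plus the lagged node values, predict the node values at t.
import numpy as np

def make_sample(d: np.ndarray, t: int, w: int, k: int, measure_fn):
    graphs = [measure_fn(d[s - w:s]) for s in range(t - k, t)]  # k adjacency matrices
    node_feats = d[t - k:t]                                     # k lagged value vectors
    target = d[t]                                               # one-step-ahead target
    return np.stack(graphs), node_feats, target
        </preformat>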
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Empirical Study</title>
      <sec id="sec-4-0">
        <title>5.1. Data</title>
        <p>To empirically test SSAR, we identify representative data sets that fit the definition of complex time series. We chose financial time series, known for their high stochasticity, non-normality, and non-stationarity [27, 28, 29]. Consequently, we sourced two data sets: (i) inter-category and (ii) intra-category variables. Inter- and intra-category data sets exhaustively represent most financial time series. Henceforth, we refer to these data sets as Data Set 1 and 2, respectively. Sourced based on the largest international trading volumes, both data sets serve as representative benchmarks applicable to practitioners. The data sourcing and processing methods are detailed in the Appendix. Notably, extensive preliminary statistical tests, detailed in the Appendix, validate the time series' complexity.</p>
      </sec>
      <sec id="sec-4-1">
        <title>5.2. Experiment Setting</title>
        <p>We first apply SSAR to each data set. To examine the sensitivity to the hyperparameter $w$, we apply SSAR for every $w \in \mathbf{w} := \{20, 30, 40, 50, 60, 70, 80\}$. A minimum $w$ of 20 ensures stability in the information-theoretic measures. Data sets are split into training, validation, and test sets: $0.5 \times 0.7$, $0.5 \times 0.3$, and $0.5$, respectively, for Data Set 1, and $0.6$, $0.2$, and $0.2$, respectively, for Data Set 2. These splits simulate potential real-world scenarios, as sketched below.</p>
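        <p>For concreteness, the following is an illustrative reading of the Data Set 1 split, assuming simple temporally ordered slicing; the placeholder series is ours.</p>
        <preformat>
# 50% of the data forms train+validation (split 70/30); the last 50% is test.
import numpy as np

data = np.arange(1000)                   # placeholder series with T = 1000 steps
train_end = int(0.5 * 0.7 * len(data))   # 0.5 x 0.7 of T
val_end = int(0.5 * len(data))           # 0.5 x 0.3 of T follows the training block
train, val, test = data[:train_end], data[train_end:val_end], data[val_end:]
        </preformat>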
        <p>Five established baselines are included: (i) GRU, (ii) LSTM, (iii) Linear, (iv) NLinear, and (v) DLinear, where (iii), (iv), and (v) have been shown to outperform state-of-the-art transformer-based architectures. The $k$ for baselines corresponds to the temporal dimension size of the input vector. Next, to test the augmented representation, we select two well-known spatio-temporal GNNs: (i) [30]'s Temporal Graph Diffusion Convolution Network (diffusion t-GCN) and (ii) [31]'s Temporal Graph Convolution Network (t-GCN). Notably, SSAR works with any downstream model that supports spatio-temporal data with directed edges and dynamic weights; the number of compatible downstream models is very large. We arbitrarily let diffusion t-GCN be the downstream model for Data Set 1, and t-GCN for Data Set 2.</p>
        <p>For ease of replication, we present the tensor operations of diffusion t-GCN for our representation in the Appendix. We do not diverge from the original method proposed by the authors for either downstream model. All experimental design choices, such as splits, downstream models, and sample sizes, were chosen a priori and were not changed after inference. Also, each empirical sample is independently trained from a random seed; i.e., no two test samples result from inference of the same model $\hat{\theta}$.</p>
        <p>The objective function $\mathcal{L}$ is the mean squared error (MSE) of the prediction of $\mathbf{y}_t$ given $[t-1 : t-k]$. For a fair empirical study, we systematically tune hyperparameters $h \in \mathcal{H}$ for every $\langle w$, method, Data Set$\rangle$ on the training and validation sets. Rigorous details of the training, validation, and inference process are provided in the Appendix.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.3. Results and Ablation</title>
        <p>We observe highly encouraging results, summarized in Figure 5 and Table 1. In Table 1, each column represents a method, and each row represents the $w$. Sample sizes are one for Data Set 1 and 50 for Data Set 2 for each ⟨method, $w$⟩ pair. Note that the sample size for the Constant column does not conform to this pattern, as Constant-weighted edges are not associated with a $w$. However, to match the sample size of each approach, the Constant column presents the 7-sample and 50-sample mean ± 1σ results in Data Sets 1 and 2, respectively.</p>
        <p>The approaches are divided into (i) SSAR (ours), (ii) baselines, and (iii) the ablation. The ablation, Constant, sets edge weights to a constant in place of a statistical measure. This setup assesses the utility of graphical structures independent of statistical measures. In Data Set 1, SSAR achieved the best results for every $w$. Notably, there is a significant improvement from baselines → ablation, and another significant improvement from ablation → SSAR. Moreover, across the 42-sample results for all six SSAR approaches and $w$ values, all 35 samples of the baselines are beaten, a 100% beat rate.</p>
        <p>For Data Set 2, each 50-sample ⟨method, $w$⟩ combination enables the box-and-whisker plot analysis in Figure 5. Each box-and-whisker aggregates across $w$, i.e., each represents 7 · 50 = 350 samples. We observe a dramatic improvement in accuracy across SSAR-based approaches. The box-and-whisker plot follows the standard minimum, quartile-1, median, quartile-3, and maximum values. The x-axis is intentionally not scaled to include Linear, NLinear, and DLinear outliers; scaling would significantly reduce legibility. An enlarged version of Figure 5 is in the Appendix.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Discussion</title>
      <sec id="sec-5-1">
        <title>6.1. Statistical Analysis</title>
        <p>In aggregate, 7 · 12 (row · column) = 84 random-seed out-sample results are available for Data Set 1, and 7 · 11 · 50 (row · column · sample size) = 3850 results are available for SSARs and baselines for Data Set 2. An additional 50 samples for the ablation lead to 3900 result samples for Data Set 2.</p>
        <p>The statistical analysis is highly encouraging. First, we examine in aggregate whether the mean of SSARs beats the aggregate mean of the baselines. Data Set 1's results are 0.7141 ± 0.0253 (42 samples) and 0.8346 ± 0.0179 (35 samples) for SSARs and baselines, respectively. The T-statistic is -23.9022 (P-val → 0). Data Set 2's results are 0.8652 ± 0.0022 (2100 samples) and 1.2740 ± 1.9097 (1750 samples) for SSARs and baselines, respectively. The T-statistic is -9.8117 (P-val → 0). Data Set 2's progression from baselines → Constant → SSARs is 1.2740 ± 1.9097 → 0.8671 ± 0.0027 → 0.8652 ± 0.0022.</p>
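        <p>The following sketches the aggregate comparison as a two-sample t-test; the arrays are illustrative placeholders, and the use of Welch's variant is our assumption.</p>
        <preformat>
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ssar_mse = rng.normal(0.865, 0.002, 2100)      # stand-in for SSAR test MSEs
baseline_mse = rng.normal(1.274, 1.910, 1750)  # stand-in for baseline test MSEs
t_stat, p_val = stats.ttest_ind(ssar_mse, baseline_mse, equal_var=False)
        </preformat>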
        <p>This corresponds to a 31.94% reduction in MSE from baselines to Constant and a 0.22% reduction from Constant to SSARs. From baselines to SSARs, a 32.09% reduction is observed. An additional study on larger $w$ values, detailed in the Appendix, shows that the statistical significance remains robust.</p>
        <p>The second study focuses on adverse outliers in the state-of-the-art methods (Linear, NLinear, DLinear). For robustness, we re-examine the statistical results after excluding these models' adverse outliers. The results are detailed in the Appendix, and the statistical findings remain unchanged. This observation of significant adverse outliers bodes poorly for the baselines and contrarily emphasizes the stability of our proposed approach. Examining the F-test on baselines and SSARs, we observe an F-statistic of 764,534 and a corresponding one-tail F-critical value of 1.08 (P-val → 0). The evidence indicates a significant fall in the variance of SSARs.</p>
        <p>Finally, we discuss the implications of setting $k$. A naïve interpretation might attribute SSAR's improved performance to a larger implicit $k$ (based on Figure 4), but this is contradicted by the lack of a significant relationship between $k$ and MSE (Figure 6). Moreover, if this were true, $\partial\,\mathrm{MSE}/\partial k < 0$ would hold. On the contrary, there seems to be no meaningful relationship between $k$ and MSE for the baselines either. We present two histograms that summarize $P(\mathrm{MSE} \mid k,\ \mathrm{Baseline} \vee \mathrm{SSAR})$, where $\mathrm{Baseline} \vee \mathrm{SSAR}$ denotes a boolean, with some abuse of notation (true: Baseline, false: SSAR).</p>
      </sec>
      <sec id="sec-5-2">
        <title>6.2. Theoretical Implications</title>
        <p>Initially, the performance improvement in the Constant ablation case appears surprising. Based on the theoretical discussion provided by [32], we show that SSAR is not only helpful in modeling the shifting underlying distribution but also implicitly smooths highly stochastic data. These effects are visually summarized in Figure 7. [32] shows that when the causal structure is very high-dimensional and therefore highly stochastic, augmenting the training data via smoothing techniques is helpful when the noise-to-signal ratio is high. The authors use exponential moving averages to smooth the input and target space. We show that SSAR paired with a temporal graph learning algorithm implicitly makes the same augmentations, explaining the improved performance in the Constant ablation case.</p>
        <p>Temporal weighted graph learning algorithms for node prediction aggregate neighbouring weights and nodes for each node. Afterwards, this new encoding is fed into some neural network with a sequential encoding (e.g., RNNs, Transformers). In this context, $W(\cdot)$ represents edge weights, and $\theta$ the learning system's parameters. In its theoretically simplest form, without loss of generality, the algorithm aggregates the weights of edges incident to the node,
$$\forall v, \quad \hat{v} := v + \left[\sum_{e \in \mathcal{I}_v} W(e)\,\theta_W(e)\right], \quad (13)$$
where $\hat{v}$ is the post-encoding node embedding, $\mathcal{I}_v$ is the set of edges incident to $v$, and $\theta_W$ is the learned weight parameter. First, we know that $W(\cdot) \geq 0$ and $\sum_e W(e) > 0$ for both the Constant and SSAR cases. Then, whether $\hat{v} > v$ or $\hat{v} < v$, and the magnitude $|\hat{v} - v|$, depend only on the parameter $\theta_W(e)$. This implies that $\theta_W(e)$ can learn to de-noise the highly stochastic data. De-noising high noise-to-signal series improves results significantly [32]. Essentially, as long as the Constant weight satisfies
$$W(\cdot) := c \in \mathbb{R}_{\neq 0}, \quad (14)$$
$\theta_W$ can implicitly learn to de-noise the input and target space, resulting in improved out-sample performance. This explains why adding no statistical-space prior, but a simple augmented representation with fixed $W(\cdot) := c > 0$ for every edge, resulted in improved performance.</p>
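        <p>The following is a minimal numeric sketch of Equation (13); the values are illustrative only, and theta_w plays the learned de-noising role.</p>
        <preformat>
import numpy as np

def aggregate(v: np.ndarray, W_t: np.ndarray, theta_w: np.ndarray) -> np.ndarray:
    # v: (N,) node values; W_t: (N, N) non-negative weights, zero diagonal;
    # theta_w: (N, N) learned per-edge parameters. Column j sums edges into node j.
    return v + (W_t * theta_w).sum(axis=0)

# With the Constant ablation, W_t = c * (1 - I): theta_w alone decides the
# sign and magnitude of v_hat - v, i.e., it can learn to smooth/de-noise.
v = np.array([0.5, -1.2, 0.3])
W_const = 1.0 * (1 - np.eye(3))
theta_w = np.full((3, 3), 0.05)
print(aggregate(v, W_const, theta_w))
        </preformat>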
        <p>This implicit de-noising partially explains the superior performance of SSAR. The remaining improvements are due to approximating Equation (8) with (9). In short, SSAR can be decomposed into two effects: (i) SS: statistical-space encoding, which tracks the underlying distribution shift, and (ii) AR: augmented representation, which allows a learnable function approximator to implicitly de-noise the stochastic data.</p>
        <p>Decomposing SSAR into SS and AR, unlike the clear-cut effects in Figure 7, is challenging. As seen in Equation (13), $\theta_W(e)$ could not only learn to de-noise the data but also implicitly learn the $\mathbf{r} \leftarrow f_\phi(\mathbf{y}_t|\hat{\mathbf{x}}_{t-\tau})$ in Equation (5). Also, when providing the prior $P(\theta)$ in Equation (4) via $e \in \mathcal{E}$, which is passed through Equation (13), there is no clear way of decomposing the two effects. Thus, while the ablation study aids understanding of SSAR's mechanisms, it is not a rigorous method to quantify the two effects.</p>
      </sec>
      <sec id="sec-5-3">
        <title>6.3. Future Works</title>
        <p>Our work, which compares SSAR with Euclidean input-space-based state-of-the-art models, can be viewed as two ends of an extreme. Euclidean input-space-based models must learn the underlying non-stationary distribution implicitly, while SSAR takes a more deliberate approach.</p>
        <p>SSAR explicitly provides a statistical-space approximation at every $t$, (i) allowing the neural network to use an approximated regime vector and further learn the distribution shift, and (ii) bootstrapping the neural network with priors, given that our data is limited. However, in cases where we have access to $\mathcal{D} \sim P(\cdot)$, or $|\mathcal{D}|$ is already sufficiently large, we can hypothesize that a learned statistical space may be beneficial, i.e., implementing Equation (5) instead of Equation (9). In this case, the statistical space could be learned implicitly via $\theta: \cdots \times \langle v_i \to v_j \rangle \times \cdots \mapsto \mathbb{R}$, $i \neq j$, where edge weights are initialized $W(e) :\neq 0$ in Equation (13). Under the Bayesian view in Equation (4), this would correspond to the prior being a uniform distribution, $P(\theta) := U(\cdot)$. Contrarily, the statistical space could be learned explicitly, where the weights of the edges are learned directly, $\theta: \cdots \times \langle v_i \to v_j \rangle \times \cdots \mapsto \mathbb{R}_{\geq 0}$, $i \neq j$. This would closely mimic the attention mechanism in transformers.</p>
        <p>We encourage future research to explore these middle-ground approaches within the solution-space spectrum presented here. A more nuanced study could theoretically and empirically examine which method in the spectrum is most ideal under specific degrees of access to $P(\cdot)$, equivalently, the amount of data $\mathcal{D}$ available.</p>
      </sec>
      <sec id="sec-5-4">
        <title>A. Assumption: $\hat{\mathbf{x}}[t,:] := \mathbf{y}[t,:]$</title>
        <p>The assumption that the input-space features are equivalent to the output-space features is highly reasonable. Essentially, when training to predict $\mathbf{y}$, since $|\mathcal{D}| \gg 0 \Rightarrow \mathbf{y}[:, i] \gg 0\ \therefore\ \exists\, \hat{\mathbf{x}}[:, i] \gg 0$. Even if $\hat{\mathbf{x}}[t,:] \neq \mathbf{y}[t,:]$, the method and implications presented in this work hold with trivial modifications to the learning system.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>B. Data Source</title>
      <p>We use two representative data sets for financial markets. The first is an array of major macroeconomic exchange-traded funds (ETFs) and variables, available in Table 2. These variables are representative as they were chosen based on the largest worldwide trading volumes. This data set examines the effectiveness of our approach across many financial categories (inter-asset-class). The second data set is an array of major commodity futures, available in Table 3. Again, these features were chosen beforehand based on the largest worldwide trading volumes. This data set examines the effectiveness of our approach within a financial category (intra-asset-class): the commodity futures market.</p>
      <p>Both data sets are easily attainable via public sources. However, we source the data from S&amp;P Capital IQ and Bloomberg for high-quality data that is not adjusted later, to concretely prevent any look-ahead bias. The Bull-Bear Spread is sourced separately from the Investor Sentiment Index of the American Association of Individual Investors (AAII).</p>
      <p>The initial time step is set to the first date where valid data points exist for every variable. Data Set 1 spans from 2006-04-11 to 2022-07-08 in daily units. Data Set 2 spans from 1990-01-01 to 2023-06-26 in daily units.</p>
    </sec>
    <sec id="sec-7">
      <title>C. Data Processing</title>
      <p>The only data processing done from the raw data is transforming price data into return (change) data and pre-processing non-available (NaN) data points. We transform market variables to log returns, as is typical practice in the financial domain. Log returns are used instead of regular differences as the logarithm allows for computational convenience. Other data points are transformed with the regular difference approach, as their values are much smaller in magnitude and require higher levels of precision. The pseudo-code for the data processing is available in Algorithm 2.</p>
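      <p>The following is a minimal sketch of the Algorithm 2 transform; the function name and flag are ours.</p>
      <preformat>
import numpy as np

def process(series: np.ndarray, is_market_variable: bool) -> np.ndarray:
    series = series[~np.isnan(series)]           # drop non-available points
    if is_market_variable:
        return np.log(series[1:] / series[:-1])  # log return
    return np.diff(series)                       # regular difference
      </preformat>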
    </sec>
    <sec id="sec-8">
      <title>D. Computing statistical dependencies</title>
      <p>Algorithm 2 (Data Process) transforms the raw data as described in Appendix C: if a variable is a market price series, $\forall t: D[i][t] \leftarrow \log(D[i][t] / D[i][t-1])$; otherwise, $\forall t: D[i][t] \leftarrow D[i][t] - D[i][t-1]$.</p>
      <p>Given $\mathbf{x}_n := \{x_{t-1}, \ldots, x_{t-1-w}\}$ and $\mathbf{y}_n := \{y_{t-1}, \ldots, y_{t-1-w}\}$, the six measures are computed as follows. We remove the superscript $t$ for improved legibility and let $r_{\mathbf{x}_n,\mathbf{y}_n}$, $\rho_{\mathbf{x}_n,\mathbf{y}_n}$, and $\tau_{\mathbf{x}_n,\mathbf{y}_n}$ denote the Pearson correlation, Spearman rank correlation, and Kendall rank correlation, respectively, with $R(\cdot)$ denoting the rank of a time series:
$$r_{\mathbf{x}_n,\mathbf{y}_n} := \frac{\sum_i (x_{n,i} - \bar{\mathbf{x}}_n)(y_{n,i} - \bar{\mathbf{y}}_n)}{\sqrt{\sum_i (x_{n,i} - \bar{\mathbf{x}}_n)^2}\,\sqrt{\sum_i (y_{n,i} - \bar{\mathbf{y}}_n)^2}}, \quad (15)$$
$$\rho_{\mathbf{x}_n,\mathbf{y}_n} := r_{R(\mathbf{x}_n),\,R(\mathbf{y}_n)}, \quad (16)$$
$$\tau_{\mathbf{x}_n,\mathbf{y}_n} := \frac{N_c - N_d}{\binom{w}{2}}, \quad (17)$$
where $\bar{\mathbf{x}}_n$ denotes the mean of series $\mathbf{x}_n$, $N_c$ is the number of concordant pairs, and $N_d$ is the number of discordant pairs. A pair $\langle x_{n,i}, y_{n,i} \rangle, \langle x_{n,j}, y_{n,j} \rangle$ is concordant if the ranks of both elements agree in their order, $(x_{n,i} - x_{n,j})(y_{n,i} - y_{n,j}) > 0$, and discordant if they disagree, $(x_{n,i} - x_{n,j})(y_{n,i} - y_{n,j}) < 0$.</p>
      <p>We use Granger causality [4] based on Geweke's method [33]. Geweke's Granger causality (GC) is a frequency-domain approach to Granger causality. Geweke's Granger causality from $\mathbf{x}_n$ to $\mathbf{y}_n$ is computed as
$$F_{\mathbf{x}_n \to \mathbf{y}_n} := \ln\!\left(\frac{S_{\mathbf{y}_n\mathbf{y}_n}(\omega)}{S_{\mathbf{y}_n\mathbf{y}_n \mid \mathbf{x}_n}(\omega)}\right), \quad (18)$$
where $S_{\mathbf{y}_n\mathbf{y}_n}(\omega)$ is the spectral density of $\mathbf{y}_n$ and $S_{\mathbf{y}_n\mathbf{y}_n \mid \mathbf{x}_n}(\omega)$ is the spectral density of $\mathbf{y}_n$ given $\mathbf{x}_n$. We use Welch's method to estimate the spectral density, as it improves over periodograms in estimating the power spectral density of a signal [34].</p>
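      <p>The following is a hedged, coherence-based simplification of Equation (18), assuming the identity $S_{yy|x}(\omega) = S_{yy}(\omega)(1 - C_{xy}(\omega))$ with $C_{xy}$ the magnitude-squared coherence estimated via Welch's method; note that this simplification is symmetric, whereas the full Geweke measure is directional, and the function name is ours.</p>
      <preformat>
import numpy as np
from scipy import signal

def spectral_gc(x: np.ndarray, y: np.ndarray, fs: float = 1.0) -> float:
    # Welch-based magnitude-squared coherence C_xy(f)
    f, cxy = signal.coherence(x, y, fs=fs, nperseg=min(64, len(x)))
    eps = 1e-12
    gc_spectrum = -np.log(np.clip(1.0 - cxy, eps, 1.0))  # ln(S_yy / S_yy|x)
    return float(gc_spectrum.mean())                     # aggregate over frequencies
      </preformat>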
      <p>We use two information-theoretic measures: mutual information and transfer entropy. Mutual information (MI) represents the shared information between two variables, indicating their statistical interdependence [35]. In information theory, the behavior of system $\mathbf{x}_n$ can be characterized by the probability distribution $p(\mathbf{x}_n)$ or $\log p(\mathbf{x}_n)$. This measure is equivalent to the Pearson correlation coefficient if both variables have a normal distribution. To compute MI between two variables, we need the information entropy, formulated as
$$H(\mathbf{x}_n) := -\sum_{x \in \mathbf{x}_n} p(x) \log_2 p(x). \quad (19)$$
Shannon entropy quantifies the information required to select random values from a discrete distribution. The joint (information) entropy can be expressed as
$$H(\mathbf{x}_n, \mathbf{y}_n) := -\sum_{x \in \mathbf{x}_n}\sum_{y \in \mathbf{y}_n} p(x, y) \log_2 p(x, y). \quad (20)$$
Finally, we can define MI as the quantity identifying the interaction between subsystems:
$$I(\mathbf{x}_n, \mathbf{y}_n) := H(\mathbf{x}_n) + H(\mathbf{y}_n) - H(\mathbf{x}_n, \mathbf{y}_n). \quad (21)$$
Following Kvålseth (2017), we use normalized MI (NMI) with range $[0, 1]$ to ensure consistency across measures. It is computed as
$$\mathrm{NMI}(\mathbf{x}_n, \mathbf{y}_n) := \frac{I(\mathbf{x}_n; \mathbf{y}_n)}{\min(H(\mathbf{x}_n), H(\mathbf{y}_n))}. \quad (22)$$</p>
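      <p>The following is a minimal histogram-based sketch of Equations (19)-(22); the bin count b is an assumption, mirroring the bin parameter in the complexity discussion.</p>
      <preformat>
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def nmi(x: np.ndarray, y: np.ndarray, b: int = 8) -> float:
    joint, _, _ = np.histogram2d(x, y, bins=b)
    p_xy = joint / joint.sum()          # empirical joint distribution
    h_x = entropy(p_xy.sum(axis=1))     # H(x) from the marginal, Eq. (19)
    h_y = entropy(p_xy.sum(axis=0))     # H(y)
    h_xy = entropy(p_xy.ravel())        # H(x, y), Eq. (20)
    mi = h_x + h_y - h_xy               # Eq. (21)
    return mi / min(h_x, h_y)           # Eq. (22), range [0, 1]
      </preformat>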
      <p>Transfer entropy (TE) is a non-parametric metric leveraging Shannon's entropy, quantifying the amount of information transfer between two variables [36]. It builds on conditional MI,
$$I(\mathbf{x}_n; \mathbf{y}_n \mid \mathbf{z}_n) := \sum_{x, y, z} p(x, y, z)\, \log_2 \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}, \quad (23)$$
from which we can define the general form of $(k, l)$-history TE between two sequences $\mathbf{x}_n$ and $\mathbf{y}_n$, for $\mathbf{y}_{n,t}^{(k)} = (y_{n,t}, \ldots, y_{n,t-k+1})$ and $\mathbf{x}_{n,t}^{(l)} = (x_{n,t}, \ldots, x_{n,t-l+1})$:
$$T_{\mathbf{x}_n \to \mathbf{y}_n}(k, l) := \sum_{\Omega} p\big(y_{n,t+1}, \mathbf{y}_{n,t}^{(k)}, \mathbf{x}_{n,t}^{(l)}\big)\, \log_2 \frac{p\big(y_{n,t+1} \mid \mathbf{y}_{n,t}^{(k)}, \mathbf{x}_{n,t}^{(l)}\big)}{p\big(y_{n,t+1} \mid \mathbf{y}_{n,t}^{(k)}\big)}, \quad (24)$$
where $\Omega := \{y_{n,t+1}, \mathbf{y}_{n,t}^{(k)}, \mathbf{x}_{n,t}^{(l)}\}$ represents the possible sets of those three values. $T_{\mathbf{x}_n \to \mathbf{y}_n}(k, l)$ represents the information about the future state of $\mathbf{y}_n$ that is retrieved by subtracting the information obtained from $\mathbf{y}_{n,t}^{(k)}$ alone from the information gathered from $\mathbf{y}_{n,t}^{(k)}$ and $\mathbf{x}_{n,t}^{(l)}$ together. We set $k$ and $l$ to 1. Under these conditions, the $(1,1)$-history TE is computed with $\Omega = \{y_{n,t+1}, y_{n,t}, x_{n,t}\}$.</p>
      <p>This measure can be perceived as conditional mutual information, considering a variable's influence as a condition. Also, analogous to the established relationship between the Pearson correlation coefficient and mutual information, an equivalent association can be identified when the two variables comply with the premises of a normal distribution [37]. TE measures information flow via uncertainty reduction: "TE from $x$ to $y$" translates to the extent to which $x$ clarifies the future of $y$ beyond what $y$ can clarify about its own future.</p>
      <p>Relatedly, conditional entropy quantifies the requisite information to derive the outcome of a random variable $\mathbf{y}_n$, given that the value of another random variable $\mathbf{x}_n$ is known. It is computed as [38]
$$H(\mathbf{y}_n \mid \mathbf{x}_n) := -\sum_{x \in \mathbf{x}_n,\, y \in \mathbf{y}_n} p(x, y)\, \log_2 p(y \mid x). \quad (25)$$</p>
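      <p>The following is a minimal sketch of the $(1,1)$-history transfer entropy of Equation (24) over discretized series; the bin count b and function name are assumptions.</p>
      <preformat>
import numpy as np
from collections import Counter

def te_x_to_y(x: np.ndarray, y: np.ndarray, b: int = 4) -> float:
    dx = np.digitize(x, np.histogram_bin_edges(x, b))
    dy = np.digitize(y, np.histogram_bin_edges(y, b))
    triples = Counter(zip(dy[1:], dy[:-1], dx[:-1]))  # (y_{t+1}, y_t, x_t)
    pairs_yy = Counter(zip(dy[1:], dy[:-1]))          # (y_{t+1}, y_t)
    pairs_yx = Counter(zip(dy[:-1], dx[:-1]))         # (y_t, x_t)
    singles_y = Counter(dy[:-1].tolist())             # y_t
    n = len(dy) - 1
    te = 0.0
    for (y1, y0, x0), c in triples.items():
        p_joint = c / n
        p_full = c / pairs_yx[(y0, x0)]              # p(y_{t+1} | y_t, x_t)
        p_self = pairs_yy[(y1, y0)] / singles_y[y0]  # p(y_{t+1} | y_t)
        te += p_joint * np.log2(p_full / p_self)
    return te
      </preformat>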
    </sec>
    <sec id="sec-8b">
      <title>E. Descriptive Statistics and Statistical Properties</title>
      <p>As noted in Section 5.1, extensive preliminary statistical tests validate the complexity of both data sets; their descriptive statistics and statistical properties are summarized in the corresponding tables.</p>
    </sec>
    <sec id="sec-8c">
      <title>F. Diffusion Convolution</title>
      <p>We implement a t-GCN powered by diffusion convolutional recurrent neural networks (DCRNN) to learn SSAR's spatial and temporal dependency structure [39]. DCRNN shows state-of-the-art performance in modeling traffic dynamics with a spatial and a temporal dimension, represented graphically.</p>
      <p>The graph signal is $X \in \mathbb{R}^{N \times 1}$, as each node has a single feature. With $X^{(t)}$ representing the signal observed at time $t$, the diffusion t-GCN learns a function $g(\cdot)$:
$$[X^{(t-k)}, \ldots, X^{(t-1)}] \xrightarrow{g(\cdot)} [X^{(t)}]. \quad (26)$$
The diffusion process explicitly captures the spatial dimension and its stochastic features. The diffusion process in generative modeling works by encoding information via increasing noise through a Markov process, while decoding information via reversing the noise process [40]. The diffusion mechanism here is characterized by a random walk on $G_t$ with restart probability $\alpha \in [0, 1]$ and state transition matrix $D_O^{-1} W$, where $D_O = \mathrm{diag}(W \mathbf{1})$ is the out-degree diagonal matrix and $\mathbf{1} \in \mathbb{R}^N$ is the all-one vector. The stationary distribution $\mathcal{P} \in \mathbb{R}^{N \times N}$ of the diffusion process can be computed in closed form:
$$\mathcal{P} := \sum_{k=0}^{\infty} \alpha (1 - \alpha)^k (D_O^{-1} W)^k. \quad (27)$$
After sufficient time steps, as represented by the summation to infinity, the Markov process converges to $\mathcal{P}$. The intuition is as follows: $\mathcal{P}_{i,:} \in \mathbb{R}^N$ represents the diffusion probability from node $v_i$, i.e., it quantifies proximity with respect to that node. $k$ denotes the diffusion steps, and $K$ is typically set to a finite natural number, as each step is analogous to the filter size in a convolution.</p>
      <p>As a result, the diffusion convolution over our signal $X$ and a filter $f_\theta$ is described by
$$X_{:,1} \star f_\theta := \sum_{k=0}^{K-1} \left(\theta_{k,1} (D_O^{-1} W)^k + \theta_{k,2} (D_I^{-1} W^\top)^k\right) X_{:,1}, \quad (28)$$
where $\theta \in \mathbb{R}^{K \times 2}$ are the filter parameters and $D_O^{-1} W$, $D_I^{-1} W^\top$ are the diffusion-process transition matrices, with the latter representing the reverse process. A diffusion convolution layer within a neural network architecture maps the signal's feature size to an output of dimension $Q$. As we are working with a single feature, we denote a parameter tensor as $\Theta \in \mathbb{R}^{Q \times 1 \times K \times 2} = [\theta]_{q,1}$. The parameters for the $q$th output are $\Theta_{q,1} \in \mathbb{R}^{K \times 2}$. In short, the diffusion convolutional layer is described as
$$\mathcal{H}_{:,q} := \sigma(X_{:,1} \star f_{\Theta_{q,1,:,:}}), \quad \text{for } q \in \{1, \ldots, Q\}, \quad (29)$$
where input $X \in \mathbb{R}^N$ is mapped to output $\mathcal{H} \in \mathbb{R}^{N \times Q}$, and $\sigma(\cdot)$ is an activation function. With this GCN structure, we can train the network parameters via stochastic gradient descent.</p>
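      <p>The following is a minimal sketch of the diffusion convolution in Equations (28) and (29) for a single feature; names are ours.</p>
      <preformat>
import numpy as np

def diffusion_conv(X: np.ndarray, W: np.ndarray, theta: np.ndarray) -> np.ndarray:
    # X: (N,) signal; W: (N, N) weighted adjacency; theta: (K, 2) filter parameters.
    K = theta.shape[0]
    T_fwd = W / (W.sum(axis=1, keepdims=True) + 1e-12)      # D_O^{-1} W
    T_rev = W.T / (W.T.sum(axis=1, keepdims=True) + 1e-12)  # D_I^{-1} W^T, reverse
    out = np.zeros_like(X, dtype=float)
    P_f = np.eye(len(X))
    P_r = np.eye(len(X))
    for k in range(K):                                      # sum over diffusion steps
        out += (theta[k, 0] * P_f + theta[k, 1] * P_r) @ X
        P_f, P_r = P_f @ T_fwd, P_r @ T_rev
    return out
      </preformat>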
    </sec>
    <sec id="sec-8d">
      <title>G. Diffusion Convolutional Gated Recurrent Unit</title>
      <p>Next, the temporal dimension is modeled via a GRU, a variant of RNNs that better captures longer-term dependencies. Diffusion convolution replaces the standard matrix multiplication in the GRU architecture:
$$\mathbf{r}^{(t)} := \sigma(\Theta_r \star [X^{(t)}, \mathcal{H}^{(t-1)}] + \mathbf{b}_r), \quad (30)$$
$$\mathbf{u}^{(t)} := \sigma(\Theta_u \star [X^{(t)}, \mathcal{H}^{(t-1)}] + \mathbf{b}_u), \quad (31)$$
$$C^{(t)} := \tanh(\Theta_C \star [X^{(t)}, (\mathbf{r}^{(t)} \odot \mathcal{H}^{(t-1)})] + \mathbf{b}_C), \quad (32)$$
$$\mathcal{H}^{(t)} := \mathbf{u}^{(t)} \odot \mathcal{H}^{(t-1)} + (1 - \mathbf{u}^{(t)}) \odot C^{(t)}, \quad (33)$$
where at time step $t$, $\mathbf{r}^{(t)}$, $\mathbf{u}^{(t)}$, $X^{(t)}$, and $\mathcal{H}^{(t)}$ represent the reset gate, update gate, input tensor, and output tensor, respectively. $\Theta_r$, $\Theta_u$, and $\Theta_C$ represent the corresponding filter parameters [30].</p>
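      <p>The following is a hedged sketch of one DCGRU step corresponding to Equations (30)-(33); it assumes the diffusion_conv helper from the previous sketch is in scope, and splitting the concatenation $[X^{(t)}, \mathcal{H}^{(t-1)}]$ into two separately parameterized streams is our simplification, equivalent up to parameterization.</p>
      <preformat>
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def dcgru_step(X, H, W, th):
    # X, H: (N,) input and hidden signals; W: (N, N) adjacency;
    # th: dict of (K, 2) filter parameters per gate and stream.
    r = sigmoid(diffusion_conv(X, W, th["rx"]) + diffusion_conv(H, W, th["rh"]))      # (30)
    u = sigmoid(diffusion_conv(X, W, th["ux"]) + diffusion_conv(H, W, th["uh"]))      # (31)
    C = np.tanh(diffusion_conv(X, W, th["cx"]) + diffusion_conv(r * H, W, th["ch"]))  # (32)
    return u * H + (1.0 - u) * C                                                      # (33)
      </preformat>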
    </sec>
    <sec id="sec-8e">
      <title>H. Training and Inference Method</title>
      <p>The pseudo-code for the training and inference pipeline is available in Algorithms 3, 4, 5, and 6. The hyperparameter search in Algorithm 3 is done with 260 random-seed trials on 13 parallel CPU cores. In outline, Algorithm 3 (Training t-GCNs) splits the data, tunes hyperparameters for every $\langle f(\cdot), w \rangle$ pair on the training and validation sets, and trains each model for the tuned number of epochs; Algorithm 4 (Training Baselines) mirrors this for the Euclidean baselines; Algorithms 5 and 6 (t-GCN and Baselines Inference) collect the test MSE from one inference per independently seeded model $\hat{\theta}$.</p>
      <p>The hyperparameter search space for the GCNs is as follows.</p>
      <p>• Input Size: [8, 9, ..., 30]
• Hidden Layer Size: [8, 16, ..., 120]
• Learning Rate: [1e-1, 1e-2, ..., 1e-6]</p>
      <p>The tuned hyperparameters for each data set are presented in Tables 8, 9, 10, and 11.</p>
      <p>The diffusion t-GCN has five hyperparameters: (i) input vector size $k$, (ii) hidden layer size, (iii) diffusion steps (filter size) $K$, (iv) learning rate, and (v) training epochs. The $K$ for the set of non-linear causal measures, $\mathcal{M}_{caus}$, is set to 1, as the sparsity in $W(e) > 0$ causes computational errors. This makes the hyperparameter count four for $f(\cdot) \in \mathcal{M}_{caus}$. The output vector size is set to one, as the network predicts one time step into the future. The hyperparameters are equivalently optimized for every $\langle f(\cdot), w \rangle$ combination. The same approach is taken for t-GCN, excluding the hyperparameter $K$, as it is not part of that model.</p>
    </sec>
    <sec id="sec-8f">
      <title>I. Data Set 2 Test Set Quartile Results</title>
      <p>The results in Table 12 are for $w \in \{20, \ldots, 80\}$ in aggregate, corresponding to the main text's Figure 5. Figure 8 is Figure 5 of the main text, enlarged for better legibility. We note that the Constant case is excluded, as its smaller sample size does not allow for a fair statistical comparison.</p>
    </sec>
    <sec id="sec-8g">
      <title>J. Larger w</title>
      <p>The statistical analysis for larger $w$ is equally encouraging. In reference to Table 13, we first examine in aggregate whether SSARs beat the baselines. The aggregate MSE ± 1σ for Data Set 2 is 0.8664 ± 0.0060 (600 samples) and 2.0326 ± 9.0136 (500 samples) for SSARs and baselines, respectively. The T-statistic is -3.1695, corresponding to a one-sided p-value of 0.0008.</p>
      <p>To rigorously assess SSAR, we identify the best-performing baseline. Here, GRU performs best when taking the mean value. The T-statistic against GRU is -344.66 (P-val → 0). The |T-statistic| rises because the variance of GRU is significantly lower than that of the aggregate. In conclusion, the results hold even when raising $w$.</p>
    </sec>
    <sec id="sec-9">
      <title>K. Second Ablation</title>
      <p>Data Set 1 has no outliers due to its lower sample size. Therefore, we analyze the results after controlling for outliers in Data Set 2. First, we identify outliers as $\mathrm{MSE}_i > Q_3 + 3 \cdot \mathrm{IQR}$ or $\mathrm{MSE}_i < Q_1 - 3 \cdot \mathrm{IQR}$, where $Q_q$ represents the $q$th quartile, IQR represents the interquartile range, and $\mathrm{MSE}_i$ is an MSE data point. We observe that all outliers are adverse, i.e., $\mathrm{MSE}_i > Q_3 + 3 \cdot \mathrm{IQR}$. This is expected, as a low-MSE outlier would be numerically impossible since MSE > 0. Therefore, all outliers worsen performance and sharply reduce the stability of the learning system. The outlier study includes the larger $w$ values tested in Appendix J.</p>
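      <p>The following is a minimal sketch of the 3·IQR adverse-outlier rule; the function name is ours.</p>
      <preformat>
import numpy as np

def remove_outliers(mse: np.ndarray) -> np.ndarray:
    q1, q3 = np.percentile(mse, [25, 75])
    iqr = q3 - q1
    keep = (mse >= q1 - 3 * iqr) & (mse <= q3 + 3 * iqr)
    # With MSE > 0, only high-side (adverse) outliers occur in practice.
    return mse[keep]
      </preformat>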
      <p>We summarize the identified outliers in Table 14.</p>
      <p>We examine the results post-outlier-removal in Table 15. First, we examine in aggregate whether SSARs beat the baselines. The aggregate MSE ± 1σ is 0.8654 ± 0.0023 (2700 samples) and 1.1025 ± 0.0320 (2250 samples) for SSARs and baselines, respectively. The T-statistic is -383.82 (P-val → 0).</p>
      <p>To more rigorously assess the out-performance of our approach, we identify the best-performing baseline. Here, LSTM performs best when taking the mean MSE. Against LSTM, the T-statistic is -1,905 (P-val → 0). Correspondingly, we conclude that the results hold even when the adverse outliers in the baselines are removed.</p>
    </sec>
    <sec id="sec-10">
      <title>L. Complexity and Scalability</title>
      <p>The complexity of our representation can be described in two steps: (i) computing the statistical-space matrix and then (ii) generating the graph. Consistent with the main text, $N$ denotes the number of features and $T$ denotes the total number of samples, i.e., time steps. $b$ denotes the number of bins for MI and TE. Table 16 summarizes the time and space complexity for step (i). Each complexity value is multiplied by $N^2$, corresponding to each edge, i.e., each directed pair.</p>
      <p>The time complexity of generating the temporal graph representation is $O(T \times N^2)$. The corresponding space complexity is $O(T \times N^2)$ if stored in an adjacency matrix and $O(T \times (N + |\mathcal{E}|))$ if stored in an adjacency list, where $|\mathcal{E}|$ is the size of the directed edge list. SSAR is highly scalable in both the temporal and feature dimensions, given that the computed measures are provided. By using a finer discrete time step, $T$ can easily rise; however, the complexity rises only linearly with respect to $T$ for both time and space. Despite rising non-linearly, $N^2$, with respect to $N$, we note that $N \ll T$ in practice, a pattern that should hold when scaling to larger data sets to avoid overfitting.</p>
      <p>We used an Nvidia RTX 4070 Ti and an Nvidia RTX 2080 Ti as our GPUs for the baselines that can leverage high-core-count parallel computing. We always used a single-GPU system for each computational task. We used commonly available 6- to 32-virtual-CPU-core systems. Lastly, we used systems with 30 to 32 GB of RAM. Despite a total of 5084 random-seed (ablations and baselines included) training and inference experiments, our total time spent running experiments was within two weeks. We approximate that with five parallel systems, each with 5 CPU cores for the GCNs, and 5 CPU cores and a CUDA-enabled GPU for the baselines, all empirical studies can be conservatively replicated within ten days. We expect our implementation to have no scaling challenges in modern AI clusters.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Durairaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <article-title>A convolutional neural network based approach to financial time series prediction</article-title>
          ,
          <source>Neural Computing and Applications</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>13319</fpage>
          -
          <lpage>13337</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dimri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sharif</surname>
          </string-name>
          ,
          <article-title>Time series analysis of climate variables using seasonal arima approach</article-title>
          ,
          <source>Journal of Earth System Science</source>
          <volume>129</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H. D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thomassey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hamad</surname>
          </string-name>
          ,
          <article-title>Forecasting and anomaly detection approaches using lstm and lstm autoencoder techniques with the applications in supply chain management</article-title>
          ,
          <source>International Journal of Information Management</source>
          <volume>57</volume>
          (
          <year>2021</year>
          )
          <fpage>102282</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Granger</surname>
          </string-name>
          ,
          <article-title>Investigating causal relations by econometric models and cross-spectral methods</article-title>
          ,
          <source>Econometrica: journal of the Econometric Society</source>
          (
          <year>1969</year>
          )
          <fpage>424</fpage>
          -
          <lpage>438</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lütkepohl</surname>
          </string-name>
          ,
          <source>New introduction to multiple time series analysis</source>
          ,
          <source>Springer Science &amp; Business Media</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Johansen</surname>
          </string-name>
          ,
          <article-title>Estimation and hypothesis testing of cointegration vectors in gaussian vector autoregressive models</article-title>
          ,
          <source>Econometrica: journal of the Econometric Society</source>
          (
          <year>1991</year>
          )
          <fpage>1551</fpage>
          -
          <lpage>1580</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>Neural network as a function approximator and its application in solving differential equations</article-title>
          ,
          <source>Applied Mathematics and Mechanics</source>
          <volume>40</volume>
          (
          <year>2019</year>
          )
          <fpage>237</fpage>
          -
          <lpage>248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sherstinsky</surname>
          </string-name>
          ,
          <article-title>Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network</article-title>
          ,
          <source>Physica D: Nonlinear Phenomena</source>
          <volume>404</volume>
          (
          <year>2020</year>
          )
          <fpage>132306</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural computation</source>
          <volume>9</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>van Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          ,
          <source>arXiv preprint arXiv:1406.1078</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Galassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lippi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torroni</surname>
          </string-name>
          ,
          <article-title>Attention in natural language processing</article-title>
          ,
          <source>IEEE transactions on neural networks and learning systems</source>
          <volume>32</volume>
          (
          <year>2020</year>
          )
          <fpage>4291</fpage>
          -
          <lpage>4308</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Samad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vidyaratne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Glandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Iftekharuddin</surname>
          </string-name>
          ,
          <article-title>Survey on deep neural networks in speech and vision systems</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>417</volume>
          (
          <year>2020</year>
          )
          <fpage>302</fpage>
          -
          <lpage>321</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <article-title>FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting</article-title>
          ,
          <source>in: Proceedings of the 39th International Conference on Machine Learning (ICML 2022)</source>
          , Baltimore, Maryland,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <article-title>Autoformer: Decomposition transformers with Auto-Correlation for long-term series forecasting</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Informer: Beyond efficient transformer for long sequence time-series forecasting</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>35</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>11106</fpage>
          -
          <lpage>11115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dustdar</surname>
          </string-name>
          ,
          <article-title>Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Are transformers effective for time series forecasting?</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>37</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>11121</fpage>
          -
          <lpage>11128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Graph neural networks: Foundation, frontiers and applications</article-title>
          ,
          <source>in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          , KDD '22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>4840</fpage>
          -
          <lpage>4841</lpage>
          . URL: https://doi.org/10.1145/3534678.3542609. doi:10.1145/3534678.3542609.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <article-title>Popularity prediction on social platforms with coupled graph neural networks</article-title>
          ,
          <source>in: Proceedings of the 13th International Conference on Web Search and Data Mining</source>
          , WSDM '20, Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>78</lpage>
          . URL: https://doi.org/10.1145/3336191.3371834. doi:10.1145/3336191.3371834.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barati Farimani</surname>
          </string-name>
          ,
          <article-title>Molecular contrastive learning of representations via graph neural networks</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>4</volume>
          (
          <year>2022</year>
          )
          <fpage>279</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Spatial-temporal fusion graph neural networks for traffic flow forecasting</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>35</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>4189</fpage>
          -
          <lpage>4196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Temporal and heterogeneous graph neural network for financial time series prediction</article-title>
          ,
          <source>in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3584</fpage>
          -
          <lpage>3593</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>V.</given-names>
            <surname>Makoviychuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wawrzyniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Storey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Macklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hoeller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rudin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Allshire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Handa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>State</surname>
          </string-name>
          ,
          <article-title>Isaac Gym: High performance GPU-based physics simulation for robot learning</article-title>
          ,
          <source>in: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)</source>
          ,
          <year>2021</year>
          . URL: https://openreview.net/forum?id=fgFBtYgJQX_.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cybenko</surname>
          </string-name>
          ,
          <article-title>Approximation by superpositions of a sigmoidal function</article-title>
          ,
          <source>Mathematics of control, signals and systems</source>
          <volume>2</volume>
          (
          <year>1989</year>
          )
          <fpage>303</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hornik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stinchcombe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <article-title>Multilayer feedforward networks are universal approximators</article-title>
          ,
          <source>Neural networks</source>
          <volume>2</volume>
          (
          <year>1989</year>
          )
          <fpage>359</fpage>
          -
          <lpage>366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alonso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maldonado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aguilera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Roldan</surname>
          </string-name>
          ,
          <article-title>Memristor variability and stochastic physical properties modeling from a multivariate time series approach</article-title>
          ,
          <source>Chaos, Solitons &amp; Fractals</source>
          <volume>143</volume>
          (
          <year>2021</year>
          )
          <fpage>110461</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bastianin</surname>
          </string-name>
          ,
          <article-title>Robust measures of skewness and kurtosis for macroeconomic and financial time series</article-title>
          ,
          <source>Applied Economics</source>
          <volume>52</volume>
          (
          <year>2020</year>
          )
          <fpage>637</fpage>
          -
          <lpage>670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>Financial time-series forecasting: Towards synergizing performance and interpretability within a hybrid machine learning approach</article-title>
          ,
          <source>arXiv preprint arXiv:2401.00534</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shahabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Diffusion convolutional recurrent neural network: Data-driven traffic forecasting</article-title>
          ,
          <source>arXiv preprint arXiv:1707.01926</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>T-GCN: A temporal graph convolutional network for traffic prediction</article-title>
          ,
          <source>IEEE transactions on intelligent transportation systems</source>
          <volume>21</volume>
          (
          <year>2019</year>
          )
          <fpage>3848</fpage>
          -
          <lpage>3858</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Curriculum learning and imitation learning for model-free control on financial time-series</article-title>
          ,
          <source>arXiv preprint arXiv:2311.13326</source>
          , AAAI 2024 AI for Time Series Analysis (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Geweke</surname>
          </string-name>
          ,
          <article-title>Measurement of linear dependence and feedback between multiple time series</article-title>
          ,
          <source>Journal of the American Statistical Association</source>
          <volume>77</volume>
          (
          <year>1982</year>
          )
          <fpage>304</fpage>
          -
          <lpage>313</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>P.</given-names>
            <surname>Welch</surname>
          </string-name>
          ,
          <article-title>The use of fast fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms</article-title>
          ,
          <source>IEEE Transactions on Audio and Electroacoustics</source>
          <volume>15</volume>
          (
          <year>1967</year>
          )
          <fpage>70</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          ,
          <article-title>A mathematical theory of communication</article-title>
          ,
          <source>The Bell System Technical Journal</source>
          <volume>27</volume>
          (
          <year>1948</year>
          )
          <fpage>379</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schreiber</surname>
          </string-name>
          ,
          <article-title>Measuring information transfer</article-title>
          ,
          <source>Physical review letters</source>
          <volume>85</volume>
          (
          <year>2000</year>
          )
          <fpage>461</fpage>
          -
          <lpage>464</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>L.</given-names>
            <surname>Barnett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Seth</surname>
          </string-name>
          ,
          <article-title>Granger causality and transfer entropy are equivalent for Gaussian variables</article-title>
          ,
          <source>Physical review letters</source>
          <volume>103</volume>
          (
          <year>2009</year>
          )
          <fpage>238701</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>L.</given-names>
            <surname>Barnett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Lizier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Seth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bossomaier</surname>
          </string-name>
          ,
          <article-title>Information flow in a kinetic ising model peaks in the disordered phase</article-title>
          ,
          <source>Physical Review Letters</source>
          <volume>111</volume>
          (
          <year>2013</year>
          )
          <fpage>177203</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shahabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Diffusion convolutional recurrent neural network: Data-driven traffic forecasting</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2018</year>
          . URL: https://openreview.net/forum?id=SJiHXGWAZ.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>10684</fpage>
          -
          <lpage>10695</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>