<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Joint Hypergraph Rewiring and Memory-Augmented Forecasting Techniques in Digital Twin Technology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sagar Srinivas Sakhinana</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shivam Gupta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krishna Sai Sudhir Aripirala</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Venkataramana Runkana</string-name>
        </contrib>
      </contrib-group>
      <abstract>
<p>Digital Twin technology creates virtual replicas of physical objects, processes, or systems by replicating their properties, data, and behaviors. This advanced technology offers a range of intelligent functionalities, such as modeling, simulation, and data-driven decision-making, that facilitate design optimization, performance estimation, and monitoring operations. Forecasting plays a pivotal role in Digital Twin technology, as it enables the prediction of future outcomes, supports informed decision-making, and minimizes risks, driving improvements in efficiency, productivity, and cost reduction. Recently, Digital Twin technology has leveraged graph forecasting techniques in large-scale complex sensor networks to enable accurate forecasting and simulation of diverse scenarios, fostering proactive and data-driven decision-making. However, existing graph forecasting techniques lack scalability for many real-world applications. They have limited ability to adapt to non-stationary environments and retain past knowledge, and they lack mechanisms to capture higher-order spatio-temporal dynamics and to estimate uncertainty in model predictions. To surmount these challenges, we introduce a hybrid architecture that enhances a hypergraph representation learning backbone by incorporating fast adaptation to new patterns and memory-based retrieval of past knowledge. This balance aims to improve the slowly-learned backbone and achieve better performance in adapting to recent changes. In addition, it models the time-varying uncertainty of multi-horizon forecasts, providing estimates of prediction uncertainty. Our forecasting architecture has been validated through ablation studies and has demonstrated promising results across multiple benchmark datasets, surpassing state-of-the-art forecasting methods by a significant margin.</p>
      </abstract>
      <kwd-group>
<kwd>Digital Twins</kwd>
        <kwd>Deep Learning on Graphs</kwd>
        <kwd>Time-series Forecasting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Digital twins have applications in many domains, including finance, retail and
ecommerce, logistics and transport, and healthcare. Digital Twins are
useful in finance for risk management, trading, and investment decision-making. They enable
financial institutions to simulate different scenarios and identify potential risks before they
occur. They can help traders identify profitable opportunities and optimize their trades, while
also allowing investors to model different economic scenarios and market conditions for better
portfolio allocation strategies. Digital twins are useful in retail and ecommerce for creating virtual
replicas of products, stores, and supply chains. This capability can contribute to transforming
product design and development, streamlining operations, enhancing customer experiences, and
driving sales growth. Digital Twins can be used in electricity pricing, auction, and design to
optimize energy efficiency, reduce costs, and improve electricity markets. They can help energy
analysts detect potential issues and optimize the layout and design of electricity grids to enhance
energy efficiency and reduce costs. They can also assist electricity retailers in optimizing bidding
strategies in electricity auctions to increase profits and reduce costs. Load forecasting is a crucial
application of Digital Twins in electricity pricing, as it enables electricity distributors to accurately
anticipate electricity demand and dynamically adjust pricing in real-time to prevent blackouts
or brownouts. Digital twin technology involves creating a digital counterpart of a tangible
entity, such as a machine, a complex system, or another physical object. The creation of a digital
twin involves utilizing diverse data sources, such as real-time sensor data, historical data, and
other relevant information. By integrating this data into a processing system, the digital twin can
effectively observe and record the key functionalities of the tangible entity. For instance, if the
tangible entity under consideration is a gas turbine, a digital twin of the physical object would be
created to mirror its exact specifications, such as size, shape, and technical features. Real-time
sensor data from the turbine, including fuel injection rate, air-fuel ratio, inlet air temperature, and
exhaust emissions, would be collected and fed into the digital twin. Subsequently, the digital twin
would analyze this data and offer insights into the condition monitoring of the gas turbine. The
digital twin can be employed to run simulations and analyze performance concerns for a wide
range of applications, including fault diagnosis, safety monitoring, and performance optimization.
The digital twin technology offers the opportunity to test potential upgrades to a physical object
in a virtual environment prior to real-world implementation. This approach provides valuable
insights that can be implemented on the physical object, resulting in the ability to improve
operational efficiency, minimize downtime, and reduce maintenance expenses. Of particular
interest in this work is digital twin technology for forecasting of complex dynamical systems.
Forecasting is a critical aspect of digital twin technology as it enables accurate predictions of the
behavior of a physical object, enabling proactive maintenance, operational efficiency improvement
and safety monitoring. Furthermore, the digital twin can forecast the expected behavior of the
physical object in different scenarios, enabling operators to optimize its performance and reduce
downtime, while minimizing risks associated with implementing untested changes on the actual
physical object. As a result, it is imperative to develop accurate models of physical systems in
order to create Digital Twins that can faithfully replicate the behavior of the physical systems for
forecasting purposes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work on Time Series Forecasting</title>
      <p>
        Accurately forecasting the behavior of complex dynamical systems, which are characterized by
high-dimensional multivariate time series(MTS) in interconnected sensor networks, is crucial
for enabling well-informed decision-making in various applications. Forecasting MTS data is
challenging due to the intricate relationships among multiple time series variables and the unique
features of MTS data, including non-linearity, high-dimensionality and non-stationarity. The
spatio-temporal graph neural networks(STGNNs) have become a popular approach to model
the relational dependencies between time series variables in the MTS data for multivariate time
series forecasting. Several researchers (e.g., [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6">1, 2, 3, 4, 5, 6</xref>
        ]) have contributed to this trend, and
their work has significantly advanced the use of GNNs in time series forecasting tasks. Training
STGNNs on the fly is challenging due to their inability to adjust to non-stationary environments
and retain past knowledge. The ability of STGNNs to adapt quickly is critical, and successful
approaches must handle changes to both new and recurring patterns effectively. However,
STGNNs, despite their strong representation learning capabilities, face two major challenges
when dealing with time series data streams. Firstly, training STGNNs on data streams in a
straightforward manner requires a considerable number of samples to converge. This is because
mini-batches or multiple epoch training, commonly used in offline training, are not feasible.
Thus, when there is a distribution shift, such neural architectures can become cumbersome and
require a large number of samples to learn new concepts effectively, which can ultimately result
in suboptimal performance. In essence, the primary challenge lies in the absence of a mechanism
within STGNNs to facilitate learning on continuously generated data streams effectively. As
a result, the STGNNs must adapt to new trends and patterns in data streams over time. The
second challenge arises from the fact that time series data frequently displays recurring patterns
that may cease to exist temporarily and then reappear in the future. STGNNs are prone to the
catastrophic forgetting phenomenon, whereby the model discards previously acquired knowledge
when presented with new data, leading to suboptimal learning of recurring patterns. As a result,
this limitation further hinders the overall performance of STGNNs for time series forecasting.
Existing STGNNs can learn MTS data dynamics by simultaneously inferring discrete dependency
graph structures or by leveraging domain expertise knowledge of predefined relationships among
multiple time series variables. While complex dynamical systems consist of interconnected
networks, these networks may have higher-order structural relations that extend beyond pairwise
associations. Hypergraphs, which provide a more generalized representation of graphs, can
effectively model such relations in high-dimensional MTS data. Furthermore, conventional
STGNNs prioritize pointwise forecasting and do not offer uncertainty estimates associated with
these multi-horizon forecasts. To tackle these challenges, we introduce the Joint Hypergraph
Rewiring and Forecasting Neural Framework, which we will refer to as JHgRF-Net for brevity.
The proposed framework achieves continual learning by balancing two objectives: (i) leveraging
prior knowledge to facilitate rapid learning of current trends and patterns, and (ii) maintaining and
updating previously acquired knowledge. The JHgRF-Net framework achieves dynamic balance
between rapid adaptation to recent changes and retrieval of similar old knowledge by leveraging
the interaction between two complementary components: the Spatio-Temporal Hypergraph
Convolutional Network(STHgCN) and the Spatio-Temporal Transformer Network(STTN). The
Mixture of Experts(MOE) approach is utilized to design the algorithmic architecture for hypergraph
time series forecasting. This approach involves using the aforementioned set of complementary
modeling approaches, whose predictions are combined to create a robust mechanism capable of
improving the overall accuracy of forecasting. The STHgCN neural operator simultaneously infers
discrete dependency hypergraph structure and learns MTS data dynamics. The STHgCN neural
operator consists of two sequentially operating modules: hypergraph-structure learning(HgSL)
and hypergraph representation learning(HgRL). The HgSL module infers the discrete dependency
hypergraph structure and performs hypergraph rewiring to modify the hyperedges so that they
better reflect the dependencies between hypernodes. This can involve adding or removing
hyperedges to optimize the relational structure between hypernodes. The HgRL module models
the spatio-temporal dynamics underlying the hypergraph-structured MTS data for multi-horizon
forecasting. The STTN neural operator learns the underlying dynamics of MTS data beyond the
original sparse relational hypergraph structure through a self-attention mechanism. A gating mechanism is utilized to
regulate the information flow from complementary components. This mechanism further distills
knowledge and improves the accuracy and reliability of the model’s predictions. Moreover, the
framework captures time-varying uncertainty in forecasts. As a result, the framework provides
accurate multi-horizon predictions and reliable uncertainty estimates of forecasts. Furthermore,
the framework is designed to provide superior generalization and scalability for large-scale
spatio-temporal MTS forecasting tasks that are commonly encountered in real-world applications.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Formulation</title>
<p>Let us consider a historical time series dataset with $n$ correlated variables observed over $T$ time steps. The dataset is represented by the notation $X = (x_1, \ldots, x_T)$, where the subscript indicates the time step. The observations of all the variables at time step $t$ are denoted by $x_t = (x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(n)}) \in \mathbb{R}^{n \times d}$, where the superscript refers to the variables. Each sensor can measure multiple physical quantities, denoted by $d$. For example, in intelligent transportation systems, the traffic loop detectors or traffic sensors placed across travel lanes can simultaneously measure three parameters: traffic flow, speed, and volume. Therefore, in this particular case, $d = 3$. In MTSF, we use a rolling-window technique to predict the future values of the $n$ correlated variables over the forecast horizon. At each time step $t$, we define a look-back window which includes the prior $\tau$ steps of time series data to predict the next $\tau$ steps. We use a historical window of the $n$ correlated variables, observed over the previous $\tau$ steps prior to time step $t$, represented by $X_{(t-\tau:t-1)} \in \mathbb{R}^{n \times \tau \times d}$, to predict the future values of the $n$ variables for the next $\tau$ steps, represented by $X_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$. To capture complex higher-order relationships among variables within the MTS data, we represent the historical data as continuous-time spatial-temporal hypergraphs denoted by $\mathcal{G}$. Hypergraphs consist of hypernodes ($\mathcal{V}$), representing time series variables, and hyperedges ($\mathcal{E}$), which capture hierarchical relationships among an arbitrary number of hypernodes. The time-dependent hypernode feature matrix is denoted by $X_{(t-\tau:t-1)}$. We learn the implicit hypergraph structure through an embedding-based similarity metric learning approach. The incidence matrix $I \in \mathbb{R}^{n \times m}$ describes the hypergraph structure, where $I_{i,j} = 1$ if hyperedge $j$ is incident with hypernode $i$, and 0 otherwise. Hypergraph sparsity is determined by the number of hyperedges in the hypergraph. In a sparse hypergraph, the number of hyperedges ($m = |\mathcal{E}|$) is relatively small compared to the number of hypernodes ($n = |\mathcal{V}|$), while in a dense hypergraph, the number of hyperedges is relatively large. Sparser hypergraphs generally result in more efficient algorithms, due to the impact of hypergraph sparsity on computational efficiency and algorithmic complexity. A hypergraph with more hyperedges has a denser and more complex structure, resulting in a higher level of connectivity among the hypernodes. Conversely, a hypergraph with fewer hyperedges has a sparser structure with fewer connections between the hypernodes. The proposed framework aims to learn a differentiable function $f_{\theta}(\cdot)$ that predicts the future estimates $X_{(t:t+\tau-1)}$ from the historical window inputs $X_{(t-\tau:t-1)}$, given a hypergraph $\mathcal{G}$. To put it briefly, the function $f_{\theta}(\cdot)$ takes in the past observations and the hypergraph structure, represented by $[x_{(t-\tau)}, \cdots, x_{(t-1)}; \mathcal{G}]$, and predicts the future observations, denoted as $[x_{(t)}, \cdots, x_{(t+\tau-1)}]$. This is mathematically represented as:</p>
      <p>$\big[\, x_{(t-\tau)}, \cdots, x_{(t-1)}; \mathcal{G} \,\big] \xrightarrow{f_{\theta}(\cdot)} \big[\, x_{(t)}, \cdots, x_{(t+\tau-1)} \,\big]$</p>
      <p>The MTSF task formulated on the implicit hypergraph ($\mathcal{G}$) can be expressed as shown below:</p>
      <p>$\min_{\theta} \; \mathcal{L}_{\mathrm{MAE}}\big( X_{(t:t+\tau-1)}, \hat{X}_{(t:t+\tau-1)}; X_{(t-\tau:t-1)}, \mathcal{G} \big)$</p>
      <p>The function $f_{\theta}(\cdot)$ involves a set of parameters $\theta$ which can be trained to optimize its performance. The predicted future observations are denoted by $\hat{X}_{(t:t+\tau-1)}$. To train the learning algorithm, we minimize the loss function $\mathcal{L}_{\mathrm{MAE}}$, i.e., the mean absolute error (MAE), which is defined as:</p>
      <p>$\mathcal{L}_{\mathrm{MAE}} = \dfrac{1}{\tau} \sum_{i=t}^{t+\tau-1} \big|\, x_{(i)} - \hat{x}_{(i)} \,\big|$</p>
      <p>Here, $X_{(t:t+\tau-1)}$ is the actual future MTS data, and $1/\tau$ is a scaling factor.</p>
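      <p>To make the setup concrete, the following minimal sketch (our illustration, not the authors' released code; array shapes follow the notation above) builds rolling-window input/target pairs and evaluates the MAE objective:</p>
      <preformat>
# A minimal sketch of the rolling-window MTSF setup and the MAE objective.
import numpy as np

def make_windows(X, tau):
    """Slice a (T, n, d) series into (input, target) window pairs of length tau."""
    inputs, targets = [], []
    for t in range(tau, X.shape[0] - tau + 1):
        inputs.append(X[t - tau:t])        # X_(t-tau : t-1)
        targets.append(X[t:t + tau])       # X_(t : t+tau-1)
    return np.stack(inputs), np.stack(targets)

def mae_loss(y_true, y_pred):
    """Mean absolute error, with the 1/tau scaling folded into the mean."""
    return np.mean(np.abs(y_true - y_pred))

# Toy example: T=100 steps, n=5 sensors, d=3 channels per sensor, tau=12.
X = np.random.rand(100, 5, 3).astype(np.float32)
inp, tgt = make_windows(X, tau=12)
print(inp.shape, tgt.shape)                # (77, 12, 5, 3) (77, 12, 5, 3)
print(mae_loss(tgt, tgt.mean(axis=0)))     # error of a naive mean predictor
      </preformat>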
    </sec>
    <sec id="sec-4">
<title>4. Our Approach</title>
      <p>Our proposed neural forecasting framework consists of two key components: the projection
layer and the spatio-temporal feature extractor, as shown in Figure 1. The spatio-temporal
inference component includes two distinct methods for hypergraph representation learning:
the Spatio-Temporal Hypergraph Convolutional Network(STHgCN) and the Spatio-Temporal
Transformer Network(STTN). The STHgCN method employs hypergraph as a mathematical
model for learning the underlying higher-order relations of the time series variables. This is
achieved by optimizing the discrete hypergraph structure underlying the observed data. It then
performs gated hypergraph convolution operations on the hypergraph-structured MTS data to
model the intricate spatio-temporal dynamics within the latent hypernode-level representations.
The final representations can then be used to predict multi-horizon forecasts. The STTN method
is a powerful technique for modeling the hypergraph-structured MTS data. The STTN method
extends transformer networks to handle arbitrary sparse hypergraph structures with full attention
as a useful inductive bias. This enables the model to learn intra- and inter-correlations among the
variables without being limited by the hierarchical structural information underlying the MTS
data. It leverages task-specific relations between variables beyond the original sparse structure
to generate expressive hypernode-level representations that improve forecast accuracy. We use
a gating mechanism to regulate the flow of information from the two methods. This enables us
to learn optimal representations of the hypernode-level representations that capture the accurate
dynamics of complex interconnected sensor networks. To summarize, our framework performs
the joint optimization of the different learning components to generate accurate forecasts across
multiple forecast horizons, while also ensuring reliable estimates of uncertainty for time-series
forecasting tasks.</p>
      <sec id="sec-4-1">
        <title>Windowed</title>
        <p>Time Series</p>
      </sec>
      <sec id="sec-4-2">
        <title>Pointwise</title>
        <p>Forecasts
r
e
y
a
L
n
o
i
t
c
e
j
o
r
P</p>
        <sec id="sec-4-2-1">
          <title>HgSL</title>
          <p>t</p>
          <p>Spatio-Temporal Inference</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>STHgCN</title>
        </sec>
        <sec id="sec-4-2-3">
          <title>STTN</title>
          <p>
            Δ
Δ Gating Mechanism
The proposed framework uses a projection layer with gated linear networks(GLN, [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]) to
obtain non-linear representations of input data. Specifically, the input data $X_{(t-\tau:t-1)} \in \mathbb{R}^{n \times \tau \times d}$ is transformed through a gating mechanism, resulting in $X_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$, which represents the non-linear transformed input data. It is described as follows:
          </p>
      <p>$X_{(t:t+\tau-1)} = \big( \sigma( W_0 X_{(t-\tau:t-1)} ) \otimes W_1 X_{(t-\tau:t-1)} \big)\, W_2$</p>
          <p>
Here, the trainable weight matrices are $W_0$, $W_1$, and $W_2$, and element-wise multiplication is denoted by $\otimes$. The utilization of a non-linear activation function $\sigma$ improves representation learning and enables the framework to effectively learn and model complex patterns present in the MTS data.
          </p>
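          <p>A minimal PyTorch sketch of this projection layer is given below (our illustration, assuming a shared feature dimension $d$ for all three weight matrices; the released implementation may differ):</p>
          <preformat>
# Gated linear projection: X' = (sigma(W0 X) (*) W1 X) W2, applied feature-wise.
import torch
import torch.nn as nn

class GatedLinearProjection(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.w0 = nn.Linear(d, d, bias=False)   # gate branch
        self.w1 = nn.Linear(d, d, bias=False)   # value branch
        self.w2 = nn.Linear(d, d, bias=False)   # output projection
    def forward(self, x):                       # x: (n, tau, d)
        gate = torch.sigmoid(self.w0(x))        # sigma(W0 X)
        return self.w2(gate * self.w1(x))       # (sigma(W0 X) (*) W1 X) W2

proj = GatedLinearProjection(d=16)
print(proj(torch.randn(5, 12, 16)).shape)       # torch.Size([5, 12, 16])
          </preformat>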
          <p>
4.2. Spatial Inference
Figure 2 illustrates the spatio-temporal feature extractor of the framework, which consists of two distinct methods (STHgCN and STTN). Further information regarding each method will be elaborated in the subsequent sections.
4.2.1. Spatio-Temporal Hypergraph Convolutional Network(STHgCN)
The STHgCN method comprises sequentially operating modules, including hypergraph structure
learning(HgSL) and hypergraph representation learning(HgRL) modules. The following sections
will elaborate on each module and provide more details.
4.2.1.1. Implicit Hypergraph Inference
The HgSL module uses an embedding-based similarity metric learning technique to capture
higher-order dependency relationships between different variables in the MTS data and
computes an optimal discrete hypergraph structure for a hypergraph-structured representation of
the MTS data. In short, the implicit hypergraph provides a spatio-temporal inductive bias that
enables a structured representation of the MTS data, capturing the underlying relationships
and dependencies among the variables. The hypernodes and hyperedges of the hypergraph are
represented by differentiable embeddings in a $d$-dimensional vector space, $z_i, z_j \in \mathbb{R}^{d}$, where $1 \le i \le n$ and $1 \le j \le m$. By leveraging the learned embeddings to transform the MTS data
into a hypergraph-structured time series data, the HgSL module computes the optimal hypergraph
topology that captures the task-relevant relationships and dependencies among the variables,
making it a powerful tool for learning relational hypergraph structures from complex MTS data.
The pairwise similarity $S_{i,j}$ between any pair of embeddings $z_i$ and $z_j$ is computed as follows:</p>
          <p>$S_{i,j} = \dfrac{z_i^{\top} z_j + 1}{2\, \lVert z_i \rVert \cdot \lVert z_j \rVert}; \qquad P_{i,j} = \sigma\big( \big[\, S_{i,j} \,\big\|\, 1 - S_{i,j} \,\big] \big)$</p>
          <p>where $\|$ denotes vector concatenation. The differentiable sigmoid activation function $\sigma$ is applied to map the pairwise scores to the interval $[0, 1]$. The hyperedge probability over the hypernodes of the hypergraph is represented as $P \in \mathbb{R}^{n \times m \times 2}$, indexed by $c \in \{0, 1\}$. The scalar value $P_{(i,j,c)} \in [0, 1]$ indicates the relationship between the pair of hypernode and hyperedge indexed by $(i, j)$. To be precise, $P_{(i,j,0)}$ represents the probability of hypernode $i$ being connected to hyperedge $j$, while $P_{(i,j,1)}$ denotes the probability that hypernode $i$ is not connected to hyperedge $j$. To accurately and efficiently sample discrete hypergraph structures from the hyperedge probability distribution $P$, we leverage the Gumbel-softmax trick introduced in [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. This technique is powerful in
capturing complex relationships among variables in MTS data, making the HgSL module more
effective. The connectivity pattern of the hypergraph structure is then represented using an
incidence matrix I∈R× , which captures the relationships between hypernodes and hyperedges
in the hypergraph. By using the Gumbel-softmax trick, we can learn the hypergraph structure
in an end-to-end differentiable manner. Thus, it becomes possible to apply the gradient-based
optimization methods during model training, enabling an inductive-learning approach to learn
complex underlying structures within the MTS data. The Gumbel-Softmax trick involves using
random noise from the Gumbel distribution to perturb the hyperedge probability distribution and
then sampling the optimal discrete structure from the distribution using the Softmax function.
The incidence matrix is obtained as,
          </p>
          <p>$I_{i,j} = \dfrac{\exp\big( ( g_{(i,j,0)} + P_{(i,j,0)} + \epsilon ) / \tau \big)}{\sum_{c \in \{0,1\}} \exp\big( ( g_{(i,j,c)} + P_{(i,j,c)} + \epsilon ) / \tau \big)}$
where the temperature parameter ($\tau$) of the Gumbel-Softmax trick is set to 0.05, and $\epsilon$ is a small constant added to avoid numerical instability. Random noise, denoted by $g \sim \mathrm{Gumbel}(0, 1) = -\log(-\log(U(0, 1)))$, is sampled from the Gumbel distribution, where $U$ represents the uniform distribution with a range of 0 to 1. We optimize the hypergraph distribution
parameters to ensure that the learned hypergraph is sparse, eliminating redundant hyperedges over
hypernodes. The forecasting task provides indirect supervisory information that helps to reveal
the hypergraph relation structure in the observed MTS data. In summary, the HgSL module learns
the latent hypergraph structure of multiple interacting time series variables to create a structured
representation of the time series data, which facilitates downstream multi-horizon forecasting
with predictive uncertainty estimation.
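</p>
          <p>The sketch below illustrates the HgSL computation (our paraphrase; the straight-through Gumbel sampling and the exact logit construction are assumptions based on the equations above):</p>
          <preformat>
# HgSL sketch: embedding similarity -> connection probabilities -> Gumbel-softmax
# sampling of a discrete, differentiable incidence matrix I.
import torch
import torch.nn.functional as F

def hgsl_incidence(z_nodes, z_edges, temperature=0.05):
    # z_nodes: (n, d) hypernode embeddings; z_edges: (m, d) hyperedge embeddings
    sim = (z_nodes @ z_edges.T + 1) / (
        2 * z_nodes.norm(dim=1, keepdim=True) * z_edges.norm(dim=1))   # S_ij
    p_on = torch.sigmoid(sim)                    # P(hypernode i in hyperedge j)
    logits = torch.log(torch.stack([p_on, 1 - p_on], dim=-1))   # (n, m, 2)
    # Straight-through Gumbel-softmax: hard 0/1 samples in the forward pass,
    # soft gradients in the backward pass.
    samples = F.gumbel_softmax(logits, tau=temperature, hard=True)
    return samples[..., 0]                       # (n, m) incidence matrix I

I = hgsl_incidence(torch.randn(8, 16), torch.randn(4, 16))
print(I.shape, I.unique())                       # torch.Size([8, 4]), values 0/1
          </preformat>
          <p>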
4.2.1.2. Hypergraph Attention Network(HgAT)
The HgAT neural operator extends attention-based convolution operations to non-Euclidean
domains, such as hypergraphs. It accurately models the complex hypergraph-structured MTS
data, thereby improving multi-horizon forecast accuracy. The HgAT operator captures spatial
correlations among time-series variables by encoding relational inductive bias within the
hypergraph’s connectivity. It performs message-passing schemes to propagate information through the
hypergraph-structured MTS data, which is characterized by an incidence matrix $I \in \mathbb{R}^{n \times m}$ and a feature matrix $X_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$, to compute the hypernode representation matrix $H_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$. Each row in the matrix $H_{(t:t+\tau-1)}$ represents a hypernode representation, $h_i \in \mathbb{R}^{\tau \times d}$. The HgAT operator captures relationships among time-series variables by encoding structural and feature characteristics of spatio-temporal hypergraphs in hypernode representations. It adapts to changes in time-series variable dependencies over time in the hypernode representations $h_i$. The HgAT operator models spatio-temporal correlations
among time-series variables in hypergraph-structured MTS data using intra-edge and inter-edge
neighborhood aggregation schemes. The intra-edge aggregation considers hypernodes associated
with a specific hyperedge, while inter-edge aggregation considers hyperedges connected to a
specific hypernode. In hypergraph-structured MTS data, hyperedges capture relationships between
multiple time-series variables, which can have varying degrees of correlation and complexity.
Let the notation $\mathcal{N}_j$ represent the subset of hypernodes associated with a specific hyperedge $j$. The intra-edge neighborhood of a hypernode $i$, denoted as $\mathcal{N}_{j \setminus i}$, captures a localized cluster of semantically-correlated time-series variables and their higher-order relationships. The inter-edge neighborhood of a hypernode $i$, represented by $\mathcal{N}_i$, includes the set of hyperedges $j$ connected to that hypernode, reflecting that each variable may have multiple and potentially complex relationships with other time-series variables in the data. We use
attention-based intra-edge neighborhood aggregation to obtain latent hyperedge representations,
which leads to a more comprehensive understanding of the MTS data. This approach can be
described as follows:</p>
          <p>$h_j^{(\ell)} = \sum_{z=1}^{Z} \sigma\Big( \sum_{i \in \mathcal{N}_j} \alpha_{i,j}^{(\ell,z)}\, W_0^{(z)}\, h_i^{(\ell-1,z)} \Big)$</p>
          <p>where the hyperedge representations at layer $\ell$ are denoted by $h_j^{(\ell)} \in \mathbb{R}^{\tau \times d}$. Each hypernode's initial representation is its corresponding feature vector, $h_i^{(0,z)} = x^{(i)}$, where $x^{(i)} \in \mathbb{R}^{\tau \times d}$ represents the $i$th row of the feature matrix $X_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$. At each layer, the HgAT operator produces multiple representations $h_j^{(\ell,z)}$ of the input data, each with its own set of parameters, and combines them by summation. This enables the HgAT operator to capture various aspects of the relations underlying the intra-edge neighborhood in the hypergraph-structured MTS data. To determine the attention coefficient $\alpha_{i,j}$ for the hypernode $i$ incident with hyperedge $j$, we compute its relative importance as follows:</p>
          <p>$\alpha_{i,j}^{(\ell,z)} = \dfrac{\exp\big( e_{i,j}^{(\ell,z)} \big)}{\sum_{i' \in \mathcal{N}_j} \exp\big( e_{i',j}^{(\ell,z)} \big)}; \qquad e_{i,j}^{(\ell,z)} = \mathrm{ReLU}\big( W^{(0)}\, h_i^{(\ell-1,z)} \big)$</p>
          <p>The trained weight matrices are represented as $W^{(0)}$ and $W^{(1)}$. The ReLU activation function is used to introduce non-linearity while updating the hypernode-level representations. The attention scores $\beta_{i,j}$ are normalized and determine the relevance of each hyperedge $j$ that is incident with hypernode $i$. This allows the HgAT operator to focus on the most significant hyperedges, and the attention scores are computed as follows:</p>
          <p>$b_{i,j}^{(\ell,z)} = \mathrm{ReLU}\Big( W^{(3)} \cdot \big( W^{(2)}\, h_i^{(\ell-1,z)} \oplus W^{(2)}\, h_j^{(\ell,z)} \big) \Big); \qquad \beta_{i,j}^{(\ell,z)} = \dfrac{\exp\big( b_{i,j}^{(\ell,z)} \big)}{\sum_{j' \in \mathcal{N}_i \cup \{j\}} \exp\big( b_{i,j'}^{(\ell,z)} \big)}$</p>
          <p>where $W^{(2)} \in \mathbb{R}^{d \times d}$ and $W^{(3)} \in \mathbb{R}^{2d}$ are a trainable weight matrix and vector, respectively. $\oplus$ denotes the concatenation operator. The unnormalized attention score is denoted by $b_{i,j}$. Batch normalization and dropout techniques are used to enhance generalization and mitigate overfitting. A gating mechanism is employed to selectively combine features from $x^{(i)}$ and $h_i^{(\ell)}$ in a differentiable way. These methods improve the HgAT operator's reliability and accuracy for the downstream MTSF task.</p>
          <p>
$g^{(i)} = \sigma\big( \phi(h_i^{(\ell)}) + \psi(x^{(i)}) \big); \qquad h_i^{(\ell)} = g^{(i)} \otimes \phi(h_i^{(\ell)}) + (1 - g^{(i)}) \otimes \psi(x^{(i)})$
where $\phi$ and $\psi$ denote linear projections, enabling the HgAT operator to capture the relationships between time-series variables and their temporal changes, resulting in enhanced forecast accuracy. In summary, the HgAT operator is a powerful technique for encoding and analyzing spatio-temporal hypergraphs.
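</p>
          <p>A simplified, single-head reading of the HgAT equations above is sketched below (our illustration; multi-head summation, batch normalization, and dropout are omitted for brevity):</p>
          <preformat>
# HgAT sketch: attention-weighted intra-edge aggregation (hypernodes -> hyperedges),
# attention-weighted inter-edge aggregation (hyperedges -> hypernodes), then gating.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HgATLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.w0 = nn.Linear(d, d, bias=False)       # hypernode -> hyperedge message
        self.w1 = nn.Linear(d, 1, bias=False)       # intra-edge score e_ij
        self.w2 = nn.Linear(d, d, bias=False)       # shared projection
        self.w3 = nn.Linear(2 * d, 1, bias=False)   # inter-edge score b_ij
        self.phi = nn.Linear(d, d)                  # gating projections
        self.psi = nn.Linear(d, d)

    def forward(self, x, inc):                      # x: (n, d); inc: (n, m) 0/1
        n, m = inc.shape
        neg = torch.finfo(x.dtype).min
        # Intra-edge: attend over the member hypernodes of each hyperedge.
        e = F.relu(self.w1(x)).expand(n, m)         # (n, m) unnormalized scores
        alpha = F.softmax(e.masked_fill(inc == 0, neg), dim=0)
        h_edge = alpha.T @ self.w0(x)               # (m, d) hyperedge representations
        # Inter-edge: attend over the hyperedges incident to each hypernode.
        pair = torch.cat([self.w2(x).unsqueeze(1).expand(n, m, -1),
                          self.w2(h_edge).unsqueeze(0).expand(n, m, -1)], dim=-1)
        b = F.relu(self.w3(pair)).squeeze(-1)       # (n, m) unnormalized scores
        beta = F.softmax(b.masked_fill(inc == 0, neg), dim=1)
        h_node = beta @ h_edge                      # (n, d) hypernode representations
        # Differentiable gating between aggregated and input features.
        g = torch.sigmoid(self.phi(h_node) + self.psi(x))
        return g * self.phi(h_node) + (1 - g) * self.psi(x)

inc = torch.bernoulli(torch.full((8, 4), 0.4))
inc[:, 0] = 1.0   # keep every hypernode covered by at least one hyperedge
inc[0, :] = 1.0   # keep every hyperedge non-empty
layer = HgATLayer(d=16)
print(layer(torch.randn(8, 16), inc).shape)         # torch.Size([8, 16])
          </preformat>
          <p>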
4.2.1.3. Spatio-Temporal Hypergraph Representation Learning
We present the spatio-temporal hypergraph representation learning(HgRL) module to operate on a
sequence of dynamic hypergraphs, where hypergraph structure is fixed, and hypernode attributes
change over time, where each hypergraph represents the hypergraph-structured MTS data at
a specific time step. The HgRL operator utilizes Gated Recurrent Units(GRU, [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]) to model
the spatio-temporal dynamics of the dynamic hypergraph sequence. The computation of the
update gate, reset gate, and hidden state in a traditional GRU involves matrix multiplication with
weight matrices. In the HgRL module, however, these matrix multiplications are replaced with
Hypergraph Attention Networks(HgAT). The HgRL operator analyzes hypergraph-structured
MTS data over time. It propagates information between hypernodes across different time
steps, which enables the model to capture the complex spatio-temporal dependencies between
the hypergraphs. The HgRL operator utilizes the implicit hypergraph topology to propagate
information between hypernodes by averaging the hypernode representations in their local
neighborhood at each time step computed as follows,
          </p>
          <p>
$U_{(t:t+\tau-1)} = \sigma\big( W_U \big[ \mathrm{HgAT}\big( I, X_{(t:t+\tau-1)} \big) \,\big\|\, H_{(t-\tau:t-1)} \big] + B_U \big)$
$R_{(t:t+\tau-1)} = \sigma\big( W_R \big[ \mathrm{HgAT}\big( I, X_{(t:t+\tau-1)} \big) \,\big\|\, H_{(t-\tau:t-1)} \big] + B_R \big)$
where $\mathrm{HgAT}(I, X_{(t:t+\tau-1)})$ denotes the HgAT operator, and $\|$ and $\otimes$ denote the concatenation and element-wise multiplication operations, respectively. The update and reset gates at time $t$ are represented by the matrices $U_{(t:t+\tau-1)}$ and $R_{(t:t+\tau-1)}$, respectively. $W_U$, $W_R$, and $W_C$ are learnable weight matrices and $B_U$, $B_R$, and $B_C$ are learnable biases. In summary, the hypernode representation matrix $H_{(t:t+\tau-1)}$ captures the spatio-temporal dynamics at different scales underlying the discrete-time dynamic hypergraphs, where each row in $H_{(t:t+\tau-1)}$ represents a hypernode representation $h_i \in \mathbb{R}^{\tau \times d}$, $\forall i \in \mathcal{V}$. Some of the key advantages of the HgRL operator over traditional methods
include its ability to handle large and sparse spatio-temporal hypergraphs. The STHgCN method
utilizes useful relational inductive bias encoded in the hypergraph-structured data for modeling
the continuous-time nonlinear dynamics of the complex system to disentangle the various latent
aspects underneath the data for better forecast accuracy.
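</p>
          <p>A sketch of the resulting recurrence follows (assuming the HgATLayer class from the previous sketch is in scope; the candidate-state equation is our assumption, following the standard GRU form, since it is not spelled out above):</p>
          <preformat>
# HgRL sketch: a GRU-style cell whose gates consume [HgAT(I, X) || H].
import torch
import torch.nn as nn

class HgRLCell(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.hgat = HgATLayer(d)            # hypergraph attention operator
        self.w_u = nn.Linear(2 * d, d)      # update gate: W_U, B_U
        self.w_r = nn.Linear(2 * d, d)      # reset gate: W_R, B_R
        self.w_c = nn.Linear(2 * d, d)      # candidate state: W_C, B_C (assumed)

    def forward(self, x, h, inc):           # x, h: (n, d); inc: (n, m)
        msg = self.hgat(x, inc)             # HgAT(I, X)
        u = torch.sigmoid(self.w_u(torch.cat([msg, h], dim=-1)))
        r = torch.sigmoid(self.w_r(torch.cat([msg, h], dim=-1)))
        c = torch.tanh(self.w_c(torch.cat([msg, r * h], dim=-1)))
        return u * h + (1 - u) * c          # next hidden state H

cell = HgRLCell(d=16)
h, inc = torch.zeros(8, 16), torch.ones(8, 4)
for _ in range(12):                         # unroll across the look-back window
    h = cell(torch.randn(8, 16), h, inc)
print(h.shape)                              # torch.Size([8, 16])
          </preformat>
          <p>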
4.2.2. Spatio-Temporal Transformer Network(STTN)
The Spatio-temporal transformer network(STTN) operator is a new extension of transformer
networks ([
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]) that incorporates full attention as a desired inductive bias to model MTS data
with arbitrary sparse hypergraph structures. This capability enables it to capture fine-grained
spatio-temporal dependencies in MTS data, unconstrained by hierarchical structural information
underlying the MTS data. By allowing attention to all hypernodes within the hypergraph, the
neural operator can span large receptive fields and reason globally about complex dependencies
in hypergraph-structured MTS data. As a result, it can serve as a drop-in replacement for
existing methods that model hierarchical relationships among time-series variables in MTS data.
Additionally, the neural operator is particularly suitable for downstream forecasting tasks in
spatio-temporal hypergraphs. The transformer encoder comprises alternating layers of
multi-headed self-attention(MSA) and multi-layer perceptron(MLP) blocks to capture both local and
global contextual information. To enhance performance and regularize the transformer operator,
each block is followed by layer normalization(LN([
            <xref ref-type="bibr" rid="ref11">11</xref>
            ])) and residual connections. The
skip connections are incorporated through an initial connection strategy inspired by ResNets([
            <xref ref-type="bibr" rid="ref12">12</xref>
            ])
to address vanishing gradients and over-smoothing issues and enable the learning of complex
and deep representations of the data. Using a space-then-time(STT, [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]) approach, the STTN
first performs a temporal-encoding step to capture the long-term temporal dependencies (intra-dependencies) within the time series variables. This is followed by a spatial-encoding step, which captures the inter-dependencies among the time series variables. We model the intra- and inter-dependencies through sequentially operating temporal and spatial transformer networks, respectively.
4.2.2.1. Temporal Transformer
In the self-attention mechanism, the input sequence is transformed into three tensors: the query tensor, the key tensor, and the value tensor, where the input tensor is denoted by $X_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$ and has three dimensions: number of time series variables ($n$), forecast horizon ($\tau$), and embedding dimension ($d$). The key tensor is searched using the query tensor to retrieve relevant information, and the value tensor is weighted by the resulting attention weights. The weighted value tensor is then summed to produce the final output. To begin with, we reshape the input tensors to split the embedding dimension into multiple heads:</p>
          <p>$\mathrm{queries}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{queries}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}; \quad \mathrm{keys}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{keys}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}; \quad \mathrm{values}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{values}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}$</p>
          <p>where $h$ and $d_h$ represent the index of the attention head and the head dimension, respectively. Here, $i$ and $j$ represent the indices of the query and key positions, respectively. We compute the energy between queries and keys, as described below:</p>
          <p>$\mathrm{energy}_{(i,j,h)} = \sum_{d_h} \mathrm{queries}_{(i,h)} \cdot \mathrm{keys}_{(j,h)}$</p>
          <p>We compute the attention scores using the softmax function described below:</p>
          <p>$\mathrm{attention}_{(i,j,h)} = \dfrac{\exp\big( \mathrm{energy}_{(i,j,h)} / \sqrt{d_h} \big)}{\sum_{j'} \exp\big( \mathrm{energy}_{(i,j',h)} / \sqrt{d_h} \big)}$</p>
          <p>To calculate the output tensor, we multiply the values tensor with the attention scores, as described below:</p>
          <p>$\mathrm{out}_{(i,h)} = \sum_{j} \mathrm{attention}_{(i,j,h)} \cdot \mathrm{values}_{(j,h)}$</p>
          <p>We perform the concatenation operation along the $h$ dimension, which combines the outputs of all the heads. We apply a linear transformation to obtain the final output, as follows:</p>
          <p>$\mathrm{out}_{(t:t+\tau-1)} = \mathrm{out}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)}\, W^{(h \ast d_h,\, d)}$</p>
          <p>
4.2.2.2. Spatial Transformer
The output of the temporal transformer, denoted by $\mathrm{out}_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$, is passed to the spatial transformer as input, and it consists of three dimensions: number of time series variables ($n$), forecast horizon ($\tau$), and embedding dimension ($d$). The input sequences are first transformed into three tensors, namely the query tensor, the key tensor, and the value tensor, before applying the self-attention mechanism. In order to retrieve relevant information, the query tensor is employed to search through the key tensor. The resulting attention scores are then used to weight the value tensor. Finally, the weighted values are aggregated to produce the final output. We reshape the input tensors to split the embedding dimension into multiple heads:</p>
          <p>$\mathrm{queries}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{queries}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}; \quad \mathrm{keys}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{keys}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}; \quad \mathrm{values}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{values}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}$</p>
          <p>where $h$ and $d_h$ denote the number of heads and the head dimension, respectively. We compute the energy between queries and keys, now attending over the variable dimension, as described below:</p>
          <p>$\mathrm{energy}_{(i,j,h)} = \sum_{d_h} \mathrm{queries}_{(i,h)} \cdot \mathrm{keys}_{(j,h)}$</p>
          <p>We obtain the attention scores using the softmax function:</p>
          <p>$\mathrm{attention}_{(i,j,h)} = \dfrac{\exp\big( \mathrm{energy}_{(i,j,h)} / \sqrt{d_h} \big)}{\sum_{j'} \exp\big( \mathrm{energy}_{(i,j',h)} / \sqrt{d_h} \big)}$</p>
          <p>We then compute the output tensor by multiplying the attention scores with the values tensor:</p>
          <p>$\mathrm{out}_{(i,h)} = \sum_{j} \mathrm{attention}_{(i,j,h)} \cdot \mathrm{values}_{(j,h)}$</p>
          <p>We apply a linear transformation to obtain the final output, as follows:</p>
          <p>$\mathrm{out}_{(t:t+\tau-1)} = \mathrm{out}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)}\, W^{(h \ast d_h,\, d)}$</p>
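          <p>Both transformers share the same multi-head attention computation; a compact einsum-based sketch is given below (our paraphrase; the final linear projection $W$ and the LN/residual blocks are omitted). The temporal and spatial variants differ only in the axis attended over:</p>
          <preformat>
# Multi-head scaled dot-product attention over the second axis of (n, tau, d).
import math
import torch

def multihead_attention(q, k, v, heads):
    n, tau, d = q.shape
    dh = d // heads                                   # head dimension d_h
    q, k, v = (x.reshape(n, tau, heads, dh) for x in (q, k, v))
    energy = torch.einsum("nihc,njhc->nijh", q, k)    # query-key energies
    attn = torch.softmax(energy / math.sqrt(dh), dim=2)   # softmax over keys j
    out = torch.einsum("nijh,njhc->nihc", attn, v)    # weight the values
    return out.reshape(n, tau, d)                     # concatenate the heads

x = torch.randn(5, 12, 16)
print(multihead_attention(x, x, x, heads=4).shape)    # temporal attention
xt = x.transpose(0, 1)                                # swap variables and time
print(multihead_attention(xt, xt, xt, heads=4).transpose(0, 1).shape)  # spatial
          </preformat>
          <p>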
4.2.3. Gating Mechanism
The mixture-of-experts (MOE) mechanism in deep learning combines predictions from multiple subnetworks, such as the STHgCN and STTN representation learning methods, through a gating mechanism that computes a weighted sum of their predictions based on the input. The aim is to find the optimal weight assignment for the gating function and train the experts accordingly using these weights. From a cooperative game theory perspective, the MOE is a cooperative game where experts collaborate to optimize the system's overall performance. The gating mechanism can optimize the weights assigned to each expert by evaluating their individual performance, as well as the system's overall performance. The fused representations in MOE are obtained by combining expert predictions using the gating mechanism weights. This is described below:</p>
          <p>$g'' = \sigma\big( \phi''(H_{(t:t+\tau-1)}) + \psi''(\mathrm{out}_{(t:t+\tau-1)}) \big); \qquad \hat{X}_{(t:t+\tau-1)} = g'' \otimes \phi''(H_{(t:t+\tau-1)}) + (1 - g'') \otimes \psi''(\mathrm{out}_{(t:t+\tau-1)})$</p>
          <p>where $\hat{X}_{(t:t+\tau-1)}$ denotes the model's multi-horizon forecasts, and $H_{(t:t+\tau-1)}$ and $\mathrm{out}_{(t:t+\tau-1)}$ denote the hypernode representation matrices computed by the STHgCN and STTN methods, respectively. $\phi''$ and $\psi''$ are linear projections. Moreover, our framework variant (w/Unc-JHgRF-Net) ensures precise and reliable uncertainty estimates of multi-horizon forecasts by minimizing the negative Gaussian log likelihood. For more details, refer to the appendix. The proposed methods (JHgRF-Net, w/Unc-JHgRF-Net) enable end-to-end modeling of hidden interdependencies and their evolution over time in sensor network-based dynamical systems for highly accurate forecasting.</p>
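          <p>A minimal sketch of the fusion step is shown below (our illustration; the Gaussian negative log-likelihood follows the standard heteroscedastic formulation, which we assume matches the appendix-level details):</p>
          <preformat>
# Gated mixture-of-experts fusion of the STHgCN and STTN outputs, plus the
# negative Gaussian log-likelihood used by the w/Unc-JHgRF-Net variant.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.phi = nn.Linear(d, d)      # projection of the STHgCN expert
        self.psi = nn.Linear(d, d)      # projection of the STTN expert
    def forward(self, h_sthgcn, h_sttn):
        g = torch.sigmoid(self.phi(h_sthgcn) + self.psi(h_sttn))
        return g * self.phi(h_sthgcn) + (1 - g) * self.psi(h_sttn)

def gaussian_nll(mean, log_var, target):
    """Negative Gaussian log-likelihood with a predicted per-point variance."""
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()

fuse = GatedFusion(d=16)
y_hat = fuse(torch.randn(5, 12, 16), torch.randn(5, 12, 16))
print(y_hat.shape)                                  # torch.Size([5, 12, 16])
print(gaussian_nll(y_hat, torch.zeros_like(y_hat), torch.randn_like(y_hat)))
          </preformat>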
    </sec>
    <sec id="sec-5">
      <title>5. Datasets</title>
      <p>
        The study aims to evaluate the effectiveness of two new models, JHgRF-Net and
w/Unc-JHgRF-Net(JHgRF-Net with local-uncertainty estimation), on large-scale spatial-temporal datasets([
        <xref ref-type="bibr" rid="ref14">14</xref>
        ])
containing real-world traffic information. The datasets include PeMSD3, PeMSD4, PeMSD7,
PeMSD7(M), and PeMSD8. The study includes a preprocessing step to ensure consistency with
prior research by aggregating the 30-second interval data into 5-minute averages. Additionally,
publicly accessible METR-LA and PEMS-BAY datasets([
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]) were used for traffic flow prediction.
The preprocessing step involves transforming the time series data into 5-minute interval averages
to ensure a fair comparison with the prior research. For all the above-mentioned traffic datasets, we
possess information about the underlying sensor graph. To create the sensor graph, we calculated
the distances between sensors in the road network and utilized a thresholded Gaussian kernel to
build the adjacency matrix. Our experimental findings, discussed in the next section, support the
rationale of learning the implicit hypergraph relational structure of the variables underlying the
MTS data and modeling the spatial-temporal dynamics for improved forecast accuracy compared
to learning to forecast on predefined (prior-known) sensor graphs. Furthermore, we utilize various multivariate datasets, including Electricity (archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014), Solar-energy (www.nrel.gov/grid/solar-power-data.html), Exchange-rate (github.com/laiguokun/multivariate-time-series-data), and Traffic (pems.dot.ca.gov), for which no prior sensor graph structure exists. Additionally, the SWaT([
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) and WADI([
        <xref ref-type="bibr" rid="ref17">17</xref>
        ])
are sensor datasets that measure water treatment plants and also do not have a predefined sensor
graph structure. They were first used in prior research for anomaly detection due to the presence
of annotated anomalies, but later used in forecasting experiments because their training sets are
anomaly-free. The experimental study conducted on benchmark datasets aims to showcase the
effectiveness and advantages of the proposed methodology( JHgRF-Net and w/Unc-JHgRF-Net)
in analyzing and modeling complex spatio-temporal MTS data, surpassing existing methods.
      </p>
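      <p>For the datasets with a known sensor graph, the adjacency construction described above can be sketched as follows (our illustration, following the thresholded Gaussian kernel convention popularized by DCRNN [28]; the threshold value is an assumption):</p>
      <preformat>
# Thresholded Gaussian kernel adjacency from pairwise road-network distances.
import numpy as np

def gaussian_kernel_adjacency(dist, threshold=0.1):
    """A_ij = exp(-dist_ij^2 / sigma^2), zeroed below the sparsity threshold."""
    sigma = dist.std()
    adj = np.exp(-np.square(dist / sigma))
    return np.where(adj >= threshold, adj, 0.0)

# Toy example: 4 sensors with symmetric pairwise distances (arbitrary units).
dist = np.array([[0.0, 1.0, 5.0, 9.0],
                 [1.0, 0.0, 3.0, 7.0],
                 [5.0, 3.0, 0.0, 2.0],
                 [9.0, 7.0, 2.0, 0.0]])
print(gaussian_kernel_adjacency(dist).round(3))
      </preformat>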
    </sec>
    <sec id="sec-6">
<title>6. Experimental Results</title>
      <p>Table 2 provides a thorough comparison between the proposed models(JHgRF-Net and w/Unc-JHgRF-Net) and several baseline models on the MTSF task across five different benchmark datasets: PeMSD3, PeMSD4, PeMSD7, PeMSD7(M), and PeMSD8. To evaluate the models' effectiveness, we measured forecast errors on a well-established benchmark involving a 12-step-prior to 12-step-ahead forecasting task. We utilize a multi-metric approach to comprehensively evaluate the proposed models' performance against the baseline models. We use several performance metrics, including mean absolute error(MAE), root mean squared error(RMSE), and mean absolute percentage error(MAPE), to provide an accurate estimate of the models' performance. We reported the baseline model results from [18]. Our
experimental findings indicate that the proposed models(JHgRF-Net and w/Unc-JHgRF-Net)
consistently outperformed the baseline models, exhibiting lower forecast errors across the different
benchmark datasets. On the PeMSD3, PeMSD4, PeMSD7, PeMSD8, and PeMSD7(M) datasets,
the proposed model(JHgRF-Net) demonstrated significant improvement over the next-best
baseline models, achieving a reduction of 20.71%, 7.49%, 2.81%, 11.08%, and 1.30% in the RMSE
metric, respectively. Apart from pointwise forecasts, the w/Unc-JHgRF-Net model(which
integrates JHgRF-Net with local uncertainty estimation) predicts time-varying uncertainty estimates
of the multi-horizon forecasts. While it exhibits slightly lower performance than the JHgRF-Net
model, it still outperforms several robust baselines found in the literature, as demonstrated by
the reduced prediction error. Additionally, in Table 1, we show the performance of JHgRF-Net, w/Unc-JHgRF-Net, and several baseline models on the MTSF task across multiple datasets: METR-LA, PEMS-BAY, Solar-energy, Electricity, Exchange-rate, Traffic, SWaT, and WADI. The models were evaluated using various metrics, including MAE, RMSE, and MAPE, and
corresponding forecast errors were reported for 3-, 6-, and 12-steps ahead forecast horizons. The
proposed models, JHgRF-Net and w/Unc-JHgRF-Net, demonstrated superior performance
compared to the baseline models, with significantly lower forecast errors observed on all the datasets.
On the METR-LA, PEMS-BAY, Solar-energy, Electricity, Exchange-rate, Traffic, SWaT, and
WADI datasets, the proposed model(JHgRF-Net) shows superior performance over the next-best
baseline models, achieving a reduction of 37.01%, 21.74%, 54.93%, 3.77%, 37.14%, 18.18%,
51.25%, and 17.63% in the RMSE metric, respectively, for the 6-step-ahead forecast horizon.
Our empirical findings validate the efficacy of the proposed neural forecasting architecture to
capture the complex nonlinear spatio-temporal dynamics that are present in MTS data, leading
to improved forecasting performance. Please refer to the appendix, for further details on the
experimental methodology, ablation studies, and additional experimental results. The appendix
includes a comprehensive analysis of the JHgRF-Net model’s ability to handle missing data,
as well as a more detailed description of the w/Unc-JHgRF-Net model’s ability to estimate
uncertainty. Additionally, the appendix offers comprehensive visualizations of model predictions
with uncertainty estimates in comparison to the ground truth, along with a brief overview of the baseline models.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Our proposed forecasting architecture accurately models the complex spatio-temporal dynamics within MTS data and achieves more accurate multi-horizon forecasts than several baselines.
The experimental results obtained from real-world datasets demonstrate the effectiveness of our
approach, as supported by improved forecast estimates and reliable uncertainty estimations. In
the future, our focus will be on expanding the framework’s capabilities to handle large-scale
graph datasets, enabling its utilization for a wide range of applications, such as anomaly detection,
missing data imputation, etc.</p>
      <p>[18] J. Choi, H. Choi, J. Hwang, N. Park, Graph neural controlled differential equations for
traffic forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence,
volume 36, 2022, pp. 6367–6374.
[19] I. Marisca, A. Cini, C. Alippi, Learning to reconstruct missing data from spatiotemporal
graphs with sparse observations, arXiv preprint arXiv:2205.13479 (2022).
[20] A. Cini, D. Zambon, C. Alippi, Sparse graph learning for spatiotemporal time series, arXiv
preprint arXiv:2205.13492 (2022).
[21] D. A. Nix, A. S. Weigend, Estimating the mean and variance of the target probability
distribution, in: Proceedings of 1994 ieee international conference on neural networks
(ICNN’94), volume 1, IEEE, 1994, pp. 55–60.
[22] A. Deng, B. Hooi, Graph neural network-based anomaly detection in multivariate time
series, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021,
pp. 4027–4035.
[23] J. D. Hamilton, Time series analysis, Princeton university press, 2020.
[24] S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and
recurrent networks for sequence modeling, arXiv:1803.01271 (2018).
[25] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in:</p>
      <p>NeurIPS, 2014, pp. 3104–3112.
[26] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, Y. Bengio, Learning
phrase representations using rnn encoder-decoder for statistical machine translation, in:
EMNLP, 2014.
[27] S. Huang, D. Wang, X. Wu, A. Tang, Dsanet: Dual self-attention network for multivariate
time series forecasting, in: CIKM, 2019.
[28] Y. Li, R. Yu, C. Shahabi, Y. Liu, Diffusion convolutional recurrent neural network: Data-driven traffic forecasting, in: ICLR, 2018.
[29] B. Yu, H. Yin, Z. Zhu, Spatio-temporal graph convolutional networks: A deep learning
framework for traffic forecasting, in: IJCAI, 2018. URL: https://doi.org/10.24963/ijcai.
2018/505. doi:10.24963/ijcai.2018/505.
[30] Z. Wu, S. Pan, G. Long, J. Jiang, C. Zhang, Graph wavenet for deep spatial-temporal graph
modeling, in: IJCAI, 2019, pp. 1907–1913.
[31] S. Guo, Y. Lin, N. Feng, C. Song, H. Wan, Attention based spatial-temporal graph
convolutional networks for traffic flow forecasting, in: AAAI, 2019. doi:10.1609/aaai.
v33i01.3301922.
[32] L. Bai, L. Yao, S. S. Kanhere, X. Wang, Q. Z. Sheng, Stg2seq: Spatial-temporal graph
to sequence model for multi-step passenger demand forecasting, in: IJCAI, 2019. URL:
https://doi.org/10.24963/ijcai.2019/274. doi:10.24963/ijcai.2019/274.
[33] C. Song, Y. Lin, S. Guo, H. Wan, Spatial-temporal synchronous graph convolutional
networks: A new framework for spatial-temporal network data forecasting, in: AAAI, 2020.
doi:10.1609/aaai.v34i01.5438.
[34] R. Huang, C. Huang, Y. Liu, G. Dai, W. Kong, Lsgcn: Long short-term traffic prediction with graph convolutional networks, in: IJCAI, 2020, pp. 2355–2361.
[35] L. Bai, L. Yao, C. Li, X. Wang, C. Wang, Adaptive graph convolutional recurrent network
for traffic forecasting, in: NeurIPS, volume 33, 2020, pp. 17804–17815.
[36] M. Li, Z. Zhu, Spatial-temporal fusion graph neural networks for traffic flow forecasting,
in: AAAI, 2021.
[37] Y. Chen, I. Segovia-Dominguez, Y. R. Gel, Z-gcnets: Time zigzags at graph convolutional
networks for time series forecasting, in: ICML, 2021.
[38] Z. Fang, Q. Long, G. Song, K. Xie, Spatial-temporal graph ODE networks for traffic flow
forecasting, in: KDD, 2021.
[39] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, R. Zemel, Neural relational inference for
interacting systems, in: International Conference on Machine Learning, PMLR, 2018, pp.
2688–2697.</p>
    </sec>
    <sec id="sec-8">
<title>8. Appendix</title>
      <p>8.1. Ablation Study
The JHgRF-Net framework, serving as the baseline for our ablation study, seamlessly integrates
both spatial and temporal inference components to model complex inter- and intra-time series
correlations in interconnected sensor networks. Its spatial inference component comprises two modules: the Spatio-Temporal Hypergraph Convolutional Network(STHgCN) and the Spatio-Temporal Transformer Network(STTN). In an extensive ablation study, we evaluate the impact
of each component in the JHgRF-Net framework on the MTSF task. By selectively removing
components, we can observe the impact of individual components on the overall framework
performance, gaining valuable insight into their unique contributions towards the framework
effectiveness. The study systematically eliminated components to create various ablated variants and identify the critical components that enhance the framework's performance. By comparing the impact of these components on the MTSF task against the baseline, valuable insights were gained into each component's contribution to the overall framework performance. The ablation
study led to an improved understanding of the relationship between the various ablated variants
and the baseline, which resulted in a better understanding of the mechanisms that underlie their
generalization performance. We present detailed information on each ablated variant created by
systematically removing specific components, as follows:
• “w/o - Spatial": A variant of JHgRF-Net framework that excluded the spatial inference
component, and its degraded performance highlights the significance of using STHgCN and
STTN neural operators for effective modeling of inter-series correlations among multiple
time series variables present in complex interconnected sensor networks.
• “w/o - Temporal": A variant of JHgRF-Net that excluded the temporal inference
component, and its deteriorated performance highlighted the importance of incorporating the
temporal inference component for effectively modeling the time-varying inter-series
dependencies within multiple time series variables present in complex sensor network-based
dynamical systems.
• “w/o - STHgCN": A variant of JHgRF-Net that excluded the STHgCN method, and
its substandard performance shed light on the importance of attention-based hypergraph
convolution operation for modeling the spatio-temporal dynamics present in the
high-dimensional sensor network-based dynamical systems.
• “w/o - STTN": A variant of JHgRF-Net that excluded the STTN method, and its subpar
performance emphasized the significance of hypergraph transformer networks, which utilize full attention as a structural inductive bias for modeling the complex dynamics
present in the high-dimensional interconnected sensor networks.</p>
      <p>In Tables 3 - 9, we present the findings of our ablation studies on benchmark datasets. We
employed multiple forecasting accuracy metrics, including Mean Absolute Error(MAE), Root
Mean Squared Error(RMSE), and Mean Absolute Percentage Error(MAPE), to offer a
comprehensive understanding of the relative performance of ablated variants compared to the baseline.
We evaluated the accuracy of multistep-ahead forecasting task by comparing pointwise forecasts
with observed data(ground-truth) during the prediction interval and the results were reported
using the previously mentioned forecast accuracy metrics. For additional clarity, we enclosed
the relative percentage difference between the ablated variants and the baseline performance
within parentheses. To ensure the accuracy of our findings, we conducted multiple experiments
and reported the average results. Moreover, we evaluated the ablated variants' ability to handle
long-term predictions by setting the forecast horizon to 12 and comparing it with the baseline.
Tables 3 - 9, demonstrate that the ablated variants have lower forecast accuracy and perform
considerably worse than the baseline. Upon closer examination, it is apparent that, for achieving
state-of-the-art performance on benchmark datasets, the spatial inference component within the
JHgRF-Net framework is more important than the temporal inference component. The ablation
studies yielded the following observations:
• On the PeMSD8 dataset, analysis indicates that the “w/o - Spatial" variant shows a
significant decline in performance relative to the baseline, with an increase of 21.12% in RMSE,
27.55% in MAE, and 40.77% in MAPE. Conversely, the “w/o - Temporal" variant exhibits
slightly inferior performance compared to the baseline, with a modest rise of 16.23% in
RMSE, 14.02% in MAE, and 10.15% in MAPE.
• Likewise, similar trends are observed on the PeMSD4 dataset. The “w/o - Spatial" variant
significantly underperforms the benchmark, with an increase of 20.86% in RMSE, 22.26%
in MAE, and 22.34% in MAPE. In contrast, the “w/o - Temporal" variant exhibits a minor
reduction in its performance when compared to the baseline, with a marginal rise of 8.76%
in RMSE, 4.47% in MAE, and 2.86% in MAPE.
• Analogous trends are observed for PeMSD7 dataset. In particular, the “w/o - Spatial"
variant displays a notable decline in performance relative to the baseline, with an increase
of 18.33% in RMSE, 21.05% in MAE, and 44.33% in MAPE. On the other hand, the “w/o
- Temporal" variant indicates a minor drop in performance compared to the baseline, with a
slight increase of 8.63% in RMSE, 7.75% in MAE, and 8.74% in MAPE.</p>
      <p>The higher increase in the error metrics of the ablated variants' performance, in comparison
to the baseline, further emphasizes the relative significance of the mechanisms underlying the
excluded components of the baseline. To put it briefly, the spatial inference component serves
as a powerful backbone that fortifies the JHgRF-Net framework for improving forecasting
performance. This component is responsible for capturing the intricate dependencies among
multiple time series variables and learning the dynamics of interacting systems. The crucial role
of the spatial inference component is evident from the substantial decline in performance when
it is excluded as compared to the baseline, emphasizing its indispensable nature. Our proposed
neural forecast architecture is built upon two fundamental methods known as the STHgCN
and STTN neural operators, which collectively make up the spatial inference component. The
following observations were made from the ablation studies:
• The “w/o - STHgCN" variant yielded inferior results compared to the benchmark, with a
difference of 12.03%, 10.66%, and 11.43% in terms of RMSE, MAE, and MAPE metrics,
respectively, for PeMSD4 dataset. Similarly on PeMSD8, the variants exhibited a 10.52%,
9.97%, and 9.89% decrease in performance with respect to the same metrics as compared
to the benchmark. These results provide evidence in support of the notion that incorporating
the STHgCN method in the learning process can result in better performance in multi-horizon
forecasting tasks.
• The “w/o - STTN" variant exhibited a slight increase in the RMSE, MAE, and MAPE
metrics compared to the baseline, with differences of 3.79%, 0.31%, and 0.61% on PeMSD4,
and 1.90%, 1.88%, and 2.29% on PeMSD8, respectively. Nonetheless, the integration
of the STTN method was found to be crucial, as it resulted in a notable improvement in
forecast accuracy.</p>
      <p>Based on the ablation studies, we can conclude that the STHgCN method is more effective than
the STTN method in accurately modeling spatio-temporal dependencies in MTS data, leading to
better multi-horizon forecasts. Additional results from the ablation study on benchmark datasets
are presented in Tables 3 - 9. The results indicate that the proposed JHgRF-Net framework
exhibits strong generalization capabilities, even when dealing with intricate patterns across an
extensive variety of datasets, and it can efficiently scale to handle large-scale graph datasets. In
summary, the ablation studies provide evidence in favor of the hypothesis that joint optimization
of spatial-temporal inference components can lead to enhanced performance in multi-horizon
forecasting tasks. In addition, the experimental findings support the rationale for including the
STHgCN and STTN neural operators to model the interdependencies among the multiple variables
and learn the dynamics of the complex interconnected systems.
8.2. Prediction error for multi-horizon forecasting
We conducted comprehensive experiments to evaluate the capability of the neural forecasting
architecture, JHgRF-Net, to generate accurate multi-horizon forecasts on several benchmark
datasets. The forecast errors of the JHgRF-Net framework on the benchmark datasets are shown
in Figure 10. The framework was evaluated using various metrics, such
as MAPE and MAE. Lower values of forecast errors indicate better model performance. The
results demonstrate that the framework outperformed the baselines on all the prediction horizons.
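For concreteness, the following is a minimal NumPy sketch (our own illustration, not the authors' evaluation code; the array shapes and the epsilon guard are assumptions) of how per-horizon MAE and MAPE can be computed:</p>
      <preformat>
import numpy as np

def horizon_metrics(y_true, y_pred, eps=1e-8):
    """MAE and MAPE per forecast horizon, on the original data scale.
    y_true, y_pred: arrays of shape (samples, horizon, sensors)."""
    abs_err = np.abs(y_true - y_pred)
    mae = abs_err.mean(axis=(0, 2))
    mape = (abs_err / np.maximum(np.abs(y_true), eps)).mean(axis=(0, 2)) * 100.0
    return mae, mape
      </preformat>
      <p>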
[Figure 10: MAE and MAPE versus forecast horizon for HgRNet and the baseline models STGODE, STG-NCDE, Z-GCNETs, and AGCRN on the benchmark datasets.]</p>
      <p>
These findings suggest that the proposed framework has the potential to accurately model the
nonlinear spatio-temporal dependencies and improve multi-horizon forecast accuracy through
effectively exploiting the relational inductive biases within the hypergraph-structured MTS data.
8.3. Irregular time series forecasting
The JHgRF-Net framework's ability to handle missing data in large, complex sensor networks was
evaluated by simulating two commonly observed missingness patterns ([19], [20]): point-missing
and block-missing. These patterns were created to mimic the missingness observed in real-world
data from such complex interconnected sensor networks. In the point-missing pattern, observations
of each variable were randomly dropped within a historical window, with missing ratios of 10%,
30%, and 50%. Similarly, in the block-missing pattern, blocks of available data for each variable
were randomly masked within a historical window, with the same missing ratios of 10%, 30%, and
50%. Moreover, sensor failures were simulated with a probability of 0.15%, leading to blocks of
missing data in the multivariate time series; a sketch of both masking schemes is given at the end
of this subsection. To evaluate
the JHgRF-Net framework performance on MTS data with missing values and to analyze
the impact of increasing missing data percentage on framework performance, we split several
benchmark datasets into three mutually exclusive sets - training, validation, and test - based
on their chronological order. The METR-LA and PEMS-BAY datasets were split in a ratio of
7:1:2, while the other datasets (PeMSD3, PeMSD4, PeMSD7, PeMSD7(M), PeMSD8, Traffic,
Solar-Energy, Electricity, and Exchange-Rate) were split in a ratio of 6:2:2. We utilized multiple
forecasting metrics to evaluate the JHgRF-Net framework performance in handling missing
data. We trained the JHgRF-Net framework on fully observed data to establish a benchmark
for the MTSF task with missing values. Tables 11 - 15 present the forecasting results of
the framework performance on the irregular-time-series datasets. The experimental studies
demonstrate that the JHgRF-Net framework is reliable and robust in handling missing data,
which is widely prevalent in real-world applications. The framework performance deteriorates
slightly compared to the benchmark when there is a lower percentage of missing data. With
a further increase in the percentage of missing data, the framework performance continues to
decline, resulting in lower forecast accuracy across all benchmark datasets, regardless of the
missing data pattern. Instead of relying on imputed values for model predictions, the proposed
framework utilizes only the observed data for multi-horizon forecasting, which demonstrates its
robustness in handling missing data. Moreover, by capturing complex dependencies and patterns within
multivariate time series data present in interconnected networks, the framework generates more
dependable out-of-sample forecasts, resulting in enhanced multi-horizon forecast accuracy.
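As a concrete illustration of the two missingness patterns described above, the following is a minimal NumPy sketch (our own reconstruction under stated assumptions; the block length of 12 steps and the helper names are ours, not the authors'):</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)

def point_missing_mask(T, N, ratio):
    """Randomly drop individual observations of each of the N variables
    within a historical window of length T; True marks a missing entry."""
    return rng.random((T, N)) < ratio

def block_missing_mask(T, N, ratio, failure_prob=0.0015, block_len=12):
    """Mask contiguous blocks of each variable's data and, with probability
    0.15%, simulate a full sensor failure (block length is an assumption)."""
    mask = np.zeros((T, N), dtype=bool)
    n_blocks = max(int(ratio * T / block_len), 1)
    for j in range(N):
        for s in rng.integers(0, max(T - block_len, 1), size=n_blocks):
            mask[s:s + block_len, j] = True
        if rng.random() < failure_prob:
            mask[:, j] = True
    return mask
      </preformat>
      <p>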
8.4. Sensitivity analysis
We carried out a hyperparameter study to evaluate the impact of specific hyperparameters on the
proposed framework performance. Our aim was to find the set of hyperparameter values that
achieves the best possible performance on the benchmark datasets. We tuned four
hyperparameters - embedding size (d), number of hyperedges (|E|), batch size (b), and learning
rate (lr) - within specific ranges of values. The chosen ranges were as follows: d ∈ {2, 6, 10, 18, 24},
|E| ∈ {2, 5, 8}, b ∈ {2, 6, 10, 18, 24, 32, 64}, and lr ∈ {1 × 10−1, 1 × 10−2, 1 × 10−3, 1 × 10−4}.
We have carefully selected ranges for the hyperparameters to prevent memory errors and limit
model size.</p>
      <p>Table: MAE of the JHgRF-Net framework on METR-LA and PEMS-BAY under the point-missing (w/Point) and block-missing (w/Block) patterns at 10%, 30%, and 50% missing ratios; the first row per dataset is the fully observed benchmark.
Dataset    Missing   Model       Horizon@3   Horizon@6   Horizon@12
METR-LA    -         JHgRF-Net   2.039       2.059       4.937
METR-LA    10%       w/Point     2.083       2.093       4.951
METR-LA    10%       w/Block     2.127       2.138       4.983
METR-LA    30%       w/Point     2.137       2.143       5.113
METR-LA    30%       w/Block     2.175       2.189       5.267
METR-LA    50%       w/Point     2.166       2.178       5.671
METR-LA    50%       w/Block     2.181       2.193       5.703
PEMS-BAY   -         JHgRF-Net   0.575       0.868       0.873
PEMS-BAY   10%       w/Point     0.579       0.871       0.877
PEMS-BAY   10%       w/Block     0.588       0.879       0.893
PEMS-BAY   30%       w/Point     0.613       0.911       0.937
PEMS-BAY   30%       w/Block     0.637       0.943       0.981
PEMS-BAY   50%       w/Point     0.689       0.988       1.103
PEMS-BAY   50%       w/Block     0.713       0.997       1.119</p>
      <p>We optimized the framework hyperparameters through grid search (a minimal sketch of this
search is given after the list of optimal configurations below) and measured model performance
using metrics such as MAE and RMSE. These experimental results
offered valuable insights into the effect of these hyperparameters on the framework's ability to
produce accurate forecasts in multivariate time series analysis, enhancing our understanding
of its overall performance. The optimal hyperparameter configurations that yielded the best
performance for each dataset are presented below:
• For PeMSD3, we set the batch size (b) to 18, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 5.
• For PeMSD4, we set the batch size (b) to 32, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 5.
• For PeMSD7, we set the batch size (b) to 6, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 6.
• For PeMSD8, we set the batch size (b) to 48, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 8.
• For PeMSD7(M), we set the batch size (b) to 48, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 6.
• For METR-LA, we set the batch size (b) to 48, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 5.
• For PEMS-BAY, we set the batch size (b) to 12, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 5.
• For SWaT, we set the batch size (b) to 256, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 5.
• For WADI, we set the batch size (b) to 64, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 12. The number of hyperedges is 5.
• For Electricity, we set the batch size (b) to 32, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 2.
• For Solar-Energy, we set the batch size (b) to 32, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 6.
• For Exchange-Rate, we set the batch size (b) to 32, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 6.
• For Traffic, we set the batch size (b) to 8, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 5.</p>
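      <p>As referenced above, the following is a minimal sketch of the grid search over these hyperparameters (our own illustration; train_and_eval is a hypothetical helper that trains the model under one configuration and returns its validation MAE):</p>
      <preformat>
import itertools

# Search space from the ranges reported in this section.
grid = {
    "d":  [2, 6, 10, 18, 24],          # embedding size
    "E":  [2, 5, 8],                   # number of hyperedges
    "b":  [2, 6, 10, 18, 24, 32, 64],  # batch size
    "lr": [1e-1, 1e-2, 1e-3, 1e-4],    # learning rate
}

def grid_search(train_and_eval):
    """Exhaustively evaluate every configuration; lower MAE is better."""
    best_cfg, best_mae = None, float("inf")
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        mae = train_and_eval(**cfg)
        if mae < best_mae:
            best_cfg, best_mae = cfg, mae
    return best_cfg, best_mae
      </preformat>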
      <p>The proportion of hypernodes connected to hyperedges in a hypergraph indicates the network’s
“edge density". A higher fraction of connected hypernodes suggests a denser network, while a
lower fraction implies a sparser network. Modifying the number of hyperedges enables control
over the hypergraph’s density. The hyperparameter study yields the optimal number of hyperedges
for an MTSF task by evaluating the impact of the number of predefined hyperedges on the learned
hypergraph structures. This study sheds light on how the hypergraph’s density changes as the
number of hyperedges increases or decreases for a particular dataset in the MTSF task, with Table
16 presenting the experimental results.</p>
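      <p>To make the notion of edge density concrete, the following minimal sketch (our own illustration; the random embeddings and hard threshold are assumptions, not the learned rewiring mechanism) computes the fraction of active node-hyperedge incidences:</p>
      <preformat>
import torch

def edge_density(H):
    """H: (num_nodes, num_hyperedges) binary incidence matrix; density is
    the fraction of node-hyperedge incidences that are active."""
    return H.float().mean().item()

# Assumed illustration: 307 nodes (as in PeMSD4), |E| = 5 hyperedges, d = 18.
nodes = torch.randn(307, 18)
hyperedges = torch.randn(5, 18)
H = (nodes @ hyperedges.t() > 0).int()  # hard threshold, for illustration only
print(f"edge density: {edge_density(H):.3f}")
      </preformat>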
      <p>
8.5. Time series forecasting visualization
Figure 3 depicts the ground truth, pointwise forecasts, and time-varying uncertainty estimates
obtained from the proposed w/Unc-JHgRF-Net framework. The visualizations provide valuable
insights into the framework performance and facilitate comprehensive analysis and result
interpretation. Existing methods for MTSF can model nonlinear spatio-temporal dependencies within
interconnected sensor networks but often fail to provide accurate measures of uncertainty. In
contrast, the proposed w/Unc-JHgRF-Net framework (the JHgRF-Net framework with local
uncertainty estimation) effectively utilizes relational inductive bias via a spatio-temporal propagation
architecture to quantitatively estimate the uncertainty of multi-horizon forecasts. The framework
accurately estimates uncertainty, outperforming existing methods that solely provide pointwise
forecasts for MTSF. The multifaceted visualizations show the framework's effectiveness in time
series representation learning for the MTSF task, making it a valuable contribution to the field of
multivariate time series analysis with uncertainty estimation.</p>
      <p>[Figure 3: ground truth, pointwise forecasts, and time-varying uncertainty estimates obtained through multi-horizon forecasting on the benchmark datasets. Panels (a)-(d): nodes 12, 99, 108, and 141 in PeMSD3; (e)-(h): nodes 149, 170, 211, and 287 in PeMSD4; (i)-(l): nodes 85, 104, 155, and 162 in PeMSD8; (m)-(p): nodes 16, 99, 196, and 290 in Electricity. Legend: Predictions, Ground Truth, Uncertainty. The vertical axes show traffic flow, power consumption, and solar power; the horizontal axes show the time instance.]</p>
      <p>
8.6. Forecasting uncertainty
The loss function for training the JHgRF-Net framework is the mean absolute error (MAE), which is calculated by comparing the pointwise forecasts of the model, \hat{X}^{(t:t+\tau-1)}, with the corresponding ground-truth data, X^{(t:t+\tau-1)}. It is mathematically described as follows,</p>
      <p>\mathcal{L}_{\mathrm{MAE}}(\theta) = \frac{1}{\tau} \sum_{i=t}^{t+\tau-1} \big| \hat{X}^{(i)} - X^{(i)} \big|</p>
      <p>During the training process, the model parameters \theta are fine-tuned to minimize the MAE loss function, represented as \mathcal{L}_{\mathrm{MAE}}(\theta). The w/Unc-JHgRF-Net is a variant of the JHgRF-Net that estimates the uncertainty in model predictions to enhance the reliability of decision-making. The framework utilizes a heteroscedastic Gaussian distribution to predict the time-varying uncertainty in model predictions, characterized by the mean \mu(X^{(t-\tau:t-1)}) and variance \sigma^{2}(X^{(t-\tau:t-1)}), where X^{(t-\tau:t-1)} denotes the input time series. A neural network \Psi takes the output of the spatio-temporal inference component and predicts the mean and standard deviation of the Gaussian distribution over the future observations,</p>
      <p>\hat{X}^{(t:t+\tau-1)} \sim \mathcal{N}\big( \mu(X^{(t-\tau:t-1)}),\, \sigma^{2}(X^{(t-\tau:t-1)}) \big), \qquad \big( \mu(X^{(t-\tau:t-1)}),\, \sigma^{2}(X^{(t-\tau:t-1)}) \big) = \Psi\big( \hat{X}^{(t:t+\tau-1)} \big)</p>
      <p>To predict a Gaussian (normal) distribution for sampling future observations, we use the maximum likelihood estimates (MLE) of the distribution's parameters, namely the mean and standard deviation, obtained by maximizing the likelihood of the observed data under the predicted distribution. In simpler terms, \mu(X^{(t-\tau:t-1)}) estimates the future values \hat{X}^{(t:t+\tau-1)} from the input time series, while \sigma^{2}(X^{(t-\tau:t-1)}) predicts the uncertainty in the model's predictions over the next \tau time steps, starting from the current time point t. The uncertainty modeling framework optimizes the negative Gaussian log-likelihood ([21]) of the observations, based on the estimates of the mean and variance. This approach provides a more comprehensive understanding and measurement of prediction uncertainty. The negative Gaussian log-likelihood measures the likelihood of the observations given the estimated mean and variance of the Gaussian distribution; a lower value indicates a better fit to the observed values. The Gaussian density used to predict the future observations is described by,</p>
      <p>p\big( X^{(t:t+\tau-1)};\, \mu(X^{(t-\tau:t-1)}),\, \sigma(X^{(t-\tau:t-1)}) \big) = \frac{1}{\sigma(X^{(t-\tau:t-1)})\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} \left( \frac{X^{(t:t+\tau-1)} - \mu(X^{(t-\tau:t-1)})}{\sigma(X^{(t-\tau:t-1)})} \right)^{2} \right)</p>
      <p>Taking the logarithm of both sides gives,</p>
      <p>\log p = -\log \sigma(X^{(t-\tau:t-1)}) - \frac{1}{2}\log(2\pi) - \frac{1}{2} \left( \frac{X^{(t:t+\tau-1)} - \mu(X^{(t-\tau:t-1)})}{\sigma(X^{(t-\tau:t-1)})} \right)^{2}</p>
      <p>Dropping the constant term C = \frac{1}{2}\log(2\pi) and summing over the dataset, the Gaussian negative log-likelihood loss (i.e., the negative log of the Gaussian probability density function) is described by,</p>
      <p>\mathcal{L}_{\mathrm{GaussianNLL}} = \sum_{t=1}^{T} \left[ \log \sigma(X^{(t-\tau:t-1)}) + \frac{\big( X^{(t:t+\tau-1)} - \mu(X^{(t-\tau:t-1)}) \big)^{2}}{2\,\sigma^{2}(X^{(t-\tau:t-1)})} \right]</p>
      <p>The negative Gaussian log-likelihood evaluates the fit of the estimated mean and variance to the observations in the training set, where T denotes the number of time steps. To recapitulate, the objective of the JHgRF-Net framework is to minimize the mean absolute error (MAE), while the w/Unc-JHgRF-Net framework minimizes \mathcal{L}_{\mathrm{GaussianNLL}} to provide quantitative uncertainty estimation. The w/Unc-JHgRF-Net framework uses a Gaussian likelihood function to model the mean and variance, optimizes the model parameters for data fitting, and provides uncertainty estimates.
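For illustration, the following is a minimal PyTorch sketch of a heteroscedastic Gaussian output head trained with this objective (our own reconstruction; the layer shapes and names are assumptions, and PyTorch's built-in torch.nn.GaussianNLLLoss implements an equivalent objective parameterized by the variance):</p>
      <preformat>
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps the spatio-temporal representation to a per-step mean and
    standard deviation of a heteroscedastic Gaussian."""
    def __init__(self, hidden_dim, horizon):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, horizon)
        self.log_sigma = nn.Linear(hidden_dim, horizon)  # log-std keeps sigma positive

    def forward(self, h):
        return self.mu(h), self.log_sigma(h).exp()

def gaussian_nll(y, mu, sigma):
    # log(sigma) + (y - mu)^2 / (2 sigma^2); constant terms dropped as above
    return (torch.log(sigma) + (y - mu) ** 2 / (2 * sigma ** 2)).mean()

head = GaussianHead(hidden_dim=64, horizon=12)
h, y = torch.randn(8, 64), torch.randn(8, 12)  # toy batch
mu, sigma = head(h)
loss = gaussian_nll(y, mu, sigma)
      </preformat>
      <p>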
8.7. Datasets
Our novel JHgRF-Net framework effectiveness was evaluated by comparing it with existing
benchmark models on several real-world datasets, including PeMSD3, PeMSD4, PeMSD7,
PeMSD7(M), PeMSD8, Electricity, Solar-Energy, Exchange-Rate, Traffic, METR-LA,
PEMS-BAY, SWaT, and WADI. Tables 17 - 19 provide additional information regarding these benchmark
datasets. The datasets used in this study comprise Solar-Energy, which records solar power
production from 137 PV plants in Alabama state at 10-minute intervals in 2016; Electricity, which
includes hourly records of electricity consumption(kWh) for 321 clients from 2012 to 2014;
Exchange-Rate, which collects daily exchange rates of eight foreign countries from 1990 to 2016;
and Traffic dataset provides hourly data on road occupancy rates(ranging from 0 to 1) that were
recorded on the various lanes of San Francisco Bay area freeways from 2015 to 2016, spanning
48 months in total. Moreover, METR-LA contains hourly data of traffic speed from loop detectors
on the highways in Los Angeles, while PEMS-BAY includes data on traffic volume and speed
from sensors on San Francisco Bay area freeways. Additionally, PeMS is an open-access dataset
that consists of five traffic network datasets (PeMSD3, PeMSD4, PeMSD7, PeMSD7(M), and
PeMSD8), obtained from the Caltrans Performance Measurement System across five California
districts, with data points available at 5-minute intervals, providing 288 data points per day. The
SWaT dataset consists of 11 days of continuous operation from an industrial water treatment
plant, comprising 7 days of normal operation and 4 days with 41 attack scenarios. Two versions
of the dataset are available, with Version 1 excluding the first 30 minutes and containing only
normal operation data. The WADI dataset is a collection of 16 days of continuous operation from
a testbed, encompassing 14 days of normal operation and 2 days with 15 attack scenarios. An
updated version is available that removes affected readings due to plant instability during certain
periods.</p>
      <p>Dataset     Variables   Training points   Testing points   Anomalies (%)
SWaT        51          47515             44986            11.97
WADI        127         118795            17275            5.99</p>
      <p>
8.8. Experimental Study design
In order to examine the effectiveness of the proposed models (JHgRF-Net and w/Unc-JHgRF-Net)
compared to the baselines, the various benchmark datasets were split into training, validation,
and test sets. The PEMS-BAY and METR-LA datasets were split with a 7/1/2 ratio, while all
other datasets were split with a 6/2/2 ratio. SWaT and WADI datasets have predefined splits,
where the training set is anomaly-free and the test set contains anomalies. We utilize the training
sets for our experiments. To prepare the SWaT and WADI datasets for training and evaluation,
we normalize each variable's data by rescaling it to fit within the range of [0, 1], as in [22]. For
all other benchmark datasets, each variable's data was preprocessed by scaling it to have zero mean
and unit variance. During the training and evaluation of forecasting models, various accuracy
metrics, including MAE, RMSE, and MAPE, were calculated based on the original scale of
the time series data. The JHgRF-Net architecture was trained for 30 epochs on the training
set to minimize forecast error. The validation set was employed to identify the optimal model
that improves overall performance and early stopping was utilized to prevent overfitting. The
framework performance was evaluated on the test set to examine its ability to perform well on
unseen data. To improve convergence, the model training was optimized by using a learning rate
scheduler to effectively learn from the training set. If there was no improvement in the evaluation
metrics on the validation set over five epochs, the learning rate was reduced by half. The Adam
optimizer was employed to fine-tune the trainable parameters of the models. An initial learning
rate of 1 × 10− 3 was set to minimize the MAE loss for the JHgRF-Net model and the negative
Gaussian log-likelihood for the w/Unc-JHgRF-Net model, ensuring a better fit between the
ground truth and the model predictions. The use of powerful GPUs, such as the NVIDIA Tesla T4,
Tesla V100, and GeForce RTX 2080, expedited the training process and allowed for the utilization
of larger models and datasets; all models were implemented in the PyTorch framework. Multiple
independent experimental runs were conducted, and the ensemble average was reported to ensure
reliable model evaluation. In Section 8.4, we reported the optimal hyperparameter values
of the learning algorithm for each dataset, such as embedding size, number of hyperedges, batch
size, and learning rate.
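A skeleton of this training setup is sketched below (our own assumption of the described procedure, not the authors' code; DummyNet, make_batch, and the early-stopping patience of 10 epochs are stand-ins and assumptions):</p>
      <preformat>
import torch
import torch.nn as nn

class DummyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(12, 12)
    def forward(self, x):
        return self.lin(x)

def make_batch():
    x = torch.randn(32, 12)
    return x, x + 0.1 * torch.randn(32, 12)

model = DummyNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial lr 1e-3
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)  # halve lr after five idle epochs
loss_fn = nn.L1Loss()  # MAE loss

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(30):  # 30 training epochs
    model.train()
    x, y = make_batch()
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        xv, yv = make_batch()
        val_mae = loss_fn(model(xv), yv).item()
    scheduler.step(val_mae)
    if val_mae < best_val:
        best_val, bad_epochs = val_mae, 0  # best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
      </preformat>
      <p>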
8.9. Baselines
Established algorithms are commonly used as benchmarks for evaluating the performance of
proposed neural forecasting models such as JHgRF-Net and w/Unc-JHgRF-Net on the MTSF
task. The selection of benchmark algorithms depends on their extensive usage in the literature
and their demonstrated performance on benchmark datasets.
      </p>
      <p>
        • HA [23] is a time series prediction technique that involves using the average of a predefined
historical window of observations to predict the next value in the time series.
• ARIMA is a statistical analysis model commonly used for handling non-stationary time
series data, but it has limitations in handling long-term trends or changing seasonal patterns
over time.
• VAR([23]) is a linear multivariate time series model that extends the univariate
autoregressive(AR) model. It is designed to capture the inter-dependencies among multiple time
series variables for analyzing and forecasting complex systems.
• TCN( [24]) is specifically designed to handle sequential data in multistep-ahead time
series prediction tasks by using causal convolutions and dilation layers to incorporate past
information and learn long-range correlations. The use of these techniques allows the
model to capture and learn relationships between multiple variables in time series data,
making it effective in handling such complex data.
• FC-LSTM( [25]) is an encoder-decoder architecture that employs Long Short-Term
Memory(LSTM) units with peephole connections to perform multistep-ahead time series
prediction. By capturing both short-term and long-term relationships among multiple time
series variables in MTS data, this architecture effectively models the intricate patterns and
relationships, resulting in a highly complex and nonlinear representation of the data and
improved forecasting accuracy.
• GRU-ED( [26]) is an encoder-decoder architecture that utilizes Gated Recurrent Unit
(GRU) units to handle sequential data in multi-horizon time series prediction tasks. By
capturing relevant information from previous time steps, this framework effectively models
the sequential dependencies in the data.
• DSANet( [27]) is a time series forecasting method that utilizes convolutional neural
networks (CNNs) to capture long-range intra-temporal dependencies among multiple
time series, without relying on recurrent networks. In addition, it further incorporates
self-attention blocks to adaptively capture interdependencies and generate multi-horizon
forecasts for MTS data, resulting in an extremely effective and precise forecasting method.
• DCRNN( [28]) is a highly effective technique that combines graph convolution with
recurrent neural networks, utilizing bidirectional random walks on graphs. This unique
approach enables the model to predict multistep-ahead forecasts in MTS data through
an encoder-decoder architecture, which effectively captures complex spatial-temporal
dependencies in the data, resulting in highly accurate predictions.
• STGCN ([29]) is a cutting-edge technique that seamlessly integrates graph convolution
and gated temporal convolution networks. By accurately capturing the spatial-temporal
correlations among multiple time series variables, this approach enables multi-horizon time
series prediction with high precision and accuracy.
• GraphWaveNet( [30]) uses a wave-based propagation mechanism and graph
representations that are computed from dilated causal convolution neural networks to model MTS
data. By jointly learning an adaptive dependency matrix and capturing spatial-temporal
dependencies, this method effectively captures the dependencies between multiple time
series variables, leading to improved multistep-ahead time series prediction accuracy.
• ASTGCN( [31]) utilizes an attention-based spatio-temporal graph convolutional neural
network to capture inter- and intra-dependencies for predicting multihorizon forecasts in
time series data. By using attention mechanisms, this technique effectively models the
spatial-temporal relationships between multiple time series variables, resulting in highly
accurate predictions.
• STG2Seq( [32]) is an advanced technique for predicting multistep-ahead forecasts in
MTS data that incorporates gated graph convolutional networks (GGCNs) with a
sequence-to-sequence (seq2seq) architecture featuring attention mechanisms. Through this unique
combination, the model captures dynamic temporal and cross-channel information to
effectively model the complex relationships among multiple time series variables, resulting
in highly accurate and precise predictions.
• STSGCN( [33]) is a time series prediction technique that utilizes multiple layers of
spatial-temporal graph convolutional networks to predict multistep-ahead forecasts in MTS data.
This approach captures localized intra- and inter-dependencies in the graph-structured MTS
data, effectively modeling the complex relationships between multiple time series variables
and resulting in high precision and accuracy for multi-horizon time series prediction.
• LSGCN( [34]) is a method used for multi-horizon time series forecasting, utilizing a
graph attention mechanism integrated into a spatial gated block to predict multistep-ahead
forecasts in MTS data. By utilizing attention mechanisms, it can effectively model dynamic
spatial-temporal dependencies between multiple time series variables, leading to improved
forecasting accuracy.
• AGCRN( [35]) predicts multistep-ahead forecasts in MTS data by utilizing a data-adaptive
graph structure learning method. This approach captures node-specific intra- and
inter-correlations, effectively modeling complex spatial-temporal dependencies among the
multiple time series variables, resulting in improved forecasting accuracy.
• STFGNN( [36]) is a time series prediction technique that fuses representations obtained
from temporal graph and gated convolutional neural networks to predict multistep-ahead
forecasts in MTS data. By operating these networks in parallel and learning
spatial-temporal correlations, STFGNN effectively models the complex relationships between
multiple time series variables, leading to improved forecasting accuracy.
• Z-GCNETs( [37]) predicts multi-horizon forecasts in MTS data by integrating a time-aware
zigzag topological layer into time-conditioned graph convolutional networks. It captures
hidden spatial-temporal dependencies and salient time-conditioned topological information
to effectively model complex relationships between multiple time series variables while
considering their topological properties. This approach performs well in time series
prediction tasks that require the modeling of complex dependencies and relationships.
• STGODE( [38]) predicts multistep-ahead forecasts in MTS data using a tensor-based
ordinary differential equation (ODE) to capture inter- and intra-dependency dynamics
among multiple time series variables. By effectively representing the MTS data, the
model can capture the complex relationships among variables and their temporal dynamics,
resulting in highly accurate and reliable predictions.
• GDN( [22]) is a graph-based anomaly detection model that leverages graph embeddings to
learn the inherent complex graph structure underlying the MTS data and further employs a
graph forecasting network to compute deviation scores for anomaly detection based on a
threshold on forecast error.
• NRI( [39]) is an unsupervised technique that learns to deduce interactions and dynamics
from observational data within a variational auto-encoder framework. This model
effectively predicts the inherent dynamics of complex systems, making it a valuable tool for a
wide range of applications.
• MTGNN( [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) presents a graph neural network framework for modeling multivariate time
series data. This innovative approach automatically extracts uni-directed relations among
multiple time series variables through a graph learning module, while also incorporating
mix-hop propagation and dilated inception layers to capture spatial and temporal
dependencies within the time series. The result is a highly accurate model capable of multi-horizon
forecasts, making it a suitable tool for various time-series applications.
• The vanilla LSTM ([?]) predicts multistep-ahead forecasts in MTS data using a gating
mechanism. The LSTM-U, consisting of N univariate LSTMs, treats all time series
variables as independent and performs univariate multi-horizon forecasting.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Graph wavenet for deep spatial-temporal graph modeling</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1907</fpage>
          -
          <lpage>1913</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Adaptive graph convolutional recurrent network for traffic forecasting</article-title>
          , in: NeurIPS,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Connecting the dots: Multivariate time series forecasting with graph neural networks</article-title>
          ,
          <source>in: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery &amp; data mining</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>753</fpage>
          -
          <lpage>763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3634</fpage>
          -
          <lpage>3640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Segovia-Dominguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Coskunuzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gel</surname>
          </string-name>
          ,
          <article-title>TAMP-s2GCNets: Coupling time-aware multipersistence knowledge representation with spatio-supra graph convolutional networks for time-series forecasting</article-title>
          , in: International Conference on Learning Representations,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shahabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Diffusion convolutional recurrent neural network: Data-driven traffic forecasting</article-title>
          ,
          <source>in: ICLR (Poster)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          ,
          <article-title>Language modeling with gated convolutional networks</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>933</fpage>
          -
          <lpage>941</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Poole</surname>
          </string-name>
          ,
          <article-title>Categorical reparameterization with gumbel-softmax</article-title>
          ,
          <source>arXiv preprint arXiv:1611.01144</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Van</given-names>
            <surname>Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Learning phrase representations using rnn encoder-decoder for statistical machine translation</article-title>
          ,
          <source>arXiv preprint arXiv:1406.1078</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Layer normalization</article-title>
          ,
          <source>arXiv preprint arXiv:1607.06450</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <article-title>On the equivalence between temporal and static equivariant graph representations</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>7052</fpage>
          -
          <lpage>7076</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Petty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skabardonis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Varaiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Freeway performance measurement system: mining loop detector data</article-title>
          ,
          <source>Transportation Research Record</source>
          <volume>1748</volume>
          (
          <year>2001</year>
          )
          <fpage>96</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shahabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Diffusion convolutional recurrent neural network: Data-driven traffic forecasting</article-title>
          ,
          <source>arXiv preprint arXiv:1707.01926</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. O.</given-names>
            <surname>Tippenhauer</surname>
          </string-name>
          ,
          <article-title>SWaT: A water treatment testbed for research and training on ICS security</article-title>
          ,
          <source>in: 2016 international workshop on cyber-physical systems for smart water networks (CySWater), IEEE</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Palleti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <article-title>WADI: a water distribution testbed for research in the design of secure cyber physical systems</article-title>
          ,
          <source>in: Proceedings of the 3rd international workshop on cyber-physical systems for smart water networks</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>