<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Joint Hypergraph Rewiring and Memory-Augmented Forecasting Techniques in Digital Twin Technology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sagar Srinivas Sakhinana</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shivam Gupta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krishna Sai Sudhir Aripirala</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Venkataramana Runkana</string-name>
        </contrib>
      </contrib-group>
      <abstract>
<p>Digital Twin technology creates virtual replicas of physical objects, processes, or systems by replicating their properties, data, and behaviors. This advanced technology offers a range of intelligent functionalities, such as modeling, simulation, and data-driven decision-making, that facilitate design optimization, performance estimation, and monitoring operations. Forecasting plays a pivotal role in Digital Twin technology, as it enables the prediction of future outcomes, supports informed decision-making, and minimizes risks, driving improvements in efficiency, productivity, and cost reduction. Recently, Digital Twin technology has leveraged graph forecasting techniques in large-scale complex sensor networks to enable accurate forecasting and simulation of diverse scenarios, fostering proactive and data-driven decision-making. However, existing graph forecasting techniques lack scalability for many real-world applications. They have limited ability to adapt to non-stationary environments and retain past knowledge, and they lack mechanisms to capture higher-order spatio-temporal dynamics and to estimate uncertainty in model predictions. To surmount these challenges, we introduce a hybrid architecture that enhances a hypergraph representation learning backbone by incorporating fast adaptation to new patterns and memory-based retrieval of past knowledge. This balance aims to improve the slowly-learned backbone and achieve better performance in adapting to recent changes. In addition, it models the time-varying uncertainty of multi-horizon forecasts, providing estimates of prediction uncertainty. Our forecasting architecture has been validated through ablation studies and has demonstrated promising results across multiple benchmark datasets, surpassing state-of-the-art forecasting methods by a significant margin.</p>
      </abstract>
      <kwd-group>
<kwd>Digital Twins</kwd>
        <kwd>Deep Learning on Graphs</kwd>
        <kwd>Time-series Forecasting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Digital twins have applications in many domains, including finance, retail and
ecommerce, logistics and transport, and healthcare. Digital Twins are
useful in finance for risk management, trading, and investment decision-making. They enable
financial institutions to simulate different scenarios and identify potential risks before they
occur. They can help traders identify profitable opportunities and optimize their trades, while
also allowing investors to model different economic scenarios and market conditions for better
portfolio allocation strategies. Digital twins are useful in retail and ecommerce for creating virtual
replicas of products, stores, and supply chains. This capability can contribute to transforming
product design and development, streamlining operations, enhancing customer experiences, and
driving sales growth. Digital Twins can be used in electricity pricing, auction, and design to
optimize energy efficiency, reduce costs, and improve electricity markets. They can help energy
analysts detect potential issues and optimize the layout and design of electricity grids to enhance
energy efficiency and reduce costs. They can also assist electricity retailers in optimizing bidding
strategies in electricity auctions to increase profits and reduce costs. Load forecasting is a crucial
application of Digital Twins in electricity pricing, as it enables electricity distributors to accurately
anticipate electricity demand and dynamically adjust pricing in real-time to prevent blackouts
or brownouts. Digital twin technology involves creating a digital counterpart of a tangible
entity, such as a machine, a complex system, or another physical object. The creation of a digital
twin involves utilizing diverse data sources, such as real-time sensor data, historical data, and
other relevant information. By integrating this data into a processing system, the digital twin can
effectively observe and record the key functionalities of the tangible entity. For instance, if the
tangible entity under consideration is a gas turbine, a digital twin of the physical object would be
created to mirror its exact specifications, such as size, shape, and technical features. Real-time
sensor data from the turbine, including fuel injection rate, air-fuel ratio, inlet air temperature, and
exhaust emissions, would be collected and fed into the digital twin. Subsequently, the digital twin
would analyze this data and offer insights into the condition monitoring of the gas turbine. The
digital twin can be employed to run simulations and analyze performance concerns for a wide
range of applications, including fault diagnosis, safety monitoring, and performance optimization.
The digital twin technology offers the opportunity to test potential upgrades to a physical object
in a virtual environment prior to real-world implementation. This approach provides valuable
insights that can be implemented on the physical object, resulting in the ability to improve
operational efficiency, minimize downtime, and reduce maintenance expenses. Of particular
interest in this work is digital twin technology for forecasting of complex dynamical systems.
Forecasting is a critical aspect of digital twin technology as it enables accurate predictions of the
behavior of a physical object, enabling proactive maintenance, operational efficiency improvement
and safety monitoring. Furthermore, the digital twin can forecast the expected behavior of the
physical object in different scenarios, enabling operators to optimize its performance and reduce
downtime, while minimizing risks associated with implementing untested changes on the actual
physical object. As a result, it is imperative to develop accurate models of physical systems in
order to create Digital Twins that can faithfully replicate the behavior of the physical systems for
forecasting purposes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work on Time Series Forecasting</title>
      <p>
        Accurately forecasting the behavior of complex dynamical systems, which are characterized by
high-dimensional multivariate time series(MTS) in interconnected sensor networks, is crucial
for enabling well-informed decision-making in various applications. Forecasting MTS data is
challenging due to the intricate relationships among multiple time series variables and the unique
features of MTS data, including non-linearity, high-dimensionality and non-stationarity. The
spatio-temporal graph neural networks(STGNNs) have become a popular approach to model
the relational dependencies between time series variables in the MTS data for multivariate time
series forecasting. Several researchers (e.g., [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6">1, 2, 3, 4, 5, 6</xref>
        ]) have contributed to this trend, and
their work has significantly advanced the use of GNNs in time series forecasting tasks. Training
STGNNs on the fly is challenging due to their inability to adjust to non-stationary environments
and retain past knowledge. The ability of STGNNs to adapt quickly is critical, and successful
approaches must handle changes to both new and recurring patterns effectively. However,
STGNNs, despite their strong representation learning capabilities, face two major challenges
when dealing with time series data streams. Firstly, training STGNNs on data streams in a
straightforward manner requires a considerable number of samples to converge. This is because
mini-batches or multiple epoch training, commonly used in offline training, are not feasible.
Thus, when there is a distribution shift, such neural architectures can become cumbersome and
require a large number of samples to learn new concepts effectively, which can ultimately result
in suboptimal performance. In essence, the primary challenge lies in the absence of a mechanism
within STGNNs to facilitate learning on continuously generated data streams effectively. As
a result, the STGNNs must adapt to new trends and patterns in data streams over time. The
second challenge arises from the fact that time series data frequently displays recurring patterns
that may cease to exist temporarily and then reappear in the future. STGNNs are prone to the
catastrophic forgetting phenomenon, whereby the model discards previously acquired knowledge
when presented with new data, leading to suboptimal learning of recurring patterns. As a result,
this limitation further hinders the overall performance of STGNNs for time series forecasting.
Existing STGNNs can learn MTS data dynamics by simultaneously inferring discrete dependency
graph structures or by leveraging domain expertise knowledge of predefined relationships among
multiple time series variables. While complex dynamical systems consist of interconnected
networks, these networks may have higher-order structural relations that extend beyond pairwise
associations. Hypergraphs, which provide a more generalized representation of graphs, can
effectively model such relations in high-dimensional MTS data. Furthermore, conventional
STGNNs prioritize pointwise forecasting and do not offer uncertainty estimates associated with
these multi-horizon forecasts. To tackle these challenges, we introduce the Joint Hypergraph
Rewiring and Forecasting Neural Framework, which we will refer to as JHgRF-Net for brevity.
The proposed framework achieves continual learning by balancing two objectives: (i) leveraging
prior knowledge to facilitate rapid learning of current trends and patterns, and (ii) maintaining and
updating previously acquired knowledge. The JHgRF-Net framework achieves dynamic balance
between rapid adaptation to recent changes and retrieval of similar old knowledge by leveraging
the interaction between two complementary components: the Spatio-Temporal Hypergraph
Convolutional Network(STHgCN) and the Spatio-Temporal Transformer Network(STTN). The
Mixture of Experts(MOE) approach is utilized to design the algorithmic architecture for hypergraph
time series forecasting. This approach involves using the aforementioned set of complementary
modeling approaches, whose predictions are combined to create a robust mechanism capable of
improving the overall accuracy of forecasting. The STHgCN neural operator simultaneously infers
discrete dependency hypergraph structure and learns MTS data dynamics. The STHgCN neural
operator consists of two sequentially operating modules: hypergraph-structure learning(HgSL)
and hypergraph representation learning(HgRL). The HgSL module infers the discrete dependency
hypergraph structure and performs hypergraph rewiring to modify the hyperedges so that they
better reflect the dependencies between hypernodes. This can involve adding or removing
hyperedges to optimize the relational structure between hypernodes. The HgRL module models
the spatio-temporal dynamics underlying the hypergraph-structured MTS data for multi-horizon
forecasting. The STTN neural operator learns the underlying dynamics of MTS data beyond the
original sparse relational hypergraph structure through a self-attention mechanism. A gating mechanism is utilized to
regulate the information flow from complementary components. This mechanism further distills
knowledge and improves the accuracy and reliability of the model’s predictions. Moreover, the
framework captures time-varying uncertainty in forecasts. As a result, the framework provides
accurate multi-horizon predictions and reliable uncertainty estimates of forecasts. Furthermore,
the framework is designed to provide superior generalization and scalability for large-scale
spatio-temporal MTS forecasting tasks that are commonly encountered in real-world applications.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Formulation</title>
<p>Let us consider a historical time series dataset with $n$ correlated variables observed over $T$ time steps. The dataset is represented by the notation $X = (x_1, \ldots, x_T)$, where the subscript indicates the time step. The observations of all the variables at time step $t$ are denoted by $x_t = (x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(n)}) \in \mathbb{R}^{n \times d}$, where the superscript refers to the variables. Each sensor can measure multiple physical quantities, denoted by $d$. For example, in intelligent transportation systems, the traffic loop detectors or traffic sensors placed across travel lanes can simultaneously measure three parameters: traffic flow, speed, and volume. Therefore, in this particular case, $d = 3$. In MTSF, we use a rolling-window technique to predict the future values of the $n$ correlated variables over the forecast horizon. At each time step $t$, we define a look-back window which includes the prior $\tau$ steps of time series data to predict the next $\tau$ steps. We use a historical window of the $n$ correlated variables, observed over the previous $\tau$ steps prior to time step $t$, represented by $X_{(t-\tau:t-1)} \in \mathbb{R}^{n \times \tau \times d}$, to predict the future values of the $n$ variables for the next $\tau$ steps, represented by $X_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$. To capture complex higher-order relationships among variables within the MTS data, we represent the historical data as continuous-time spatial-temporal hypergraphs denoted by $\mathcal{G}$. Hypergraphs consist of hypernodes ($\mathcal{V}$), representing time series variables, and hyperedges ($\mathcal{E}$), which capture hierarchical relationships among an arbitrary number of hypernodes. The time-dependent hypernode feature matrix is denoted by $X_{(t-\tau:t-1)}$. We learn the implicit hypergraph structure through an embedding-based similarity metric learning approach. The incidence matrix $I \in \mathbb{R}^{n \times m}$ describes the hypergraph structure, where $I_{i,j} = 1$ if hyperedge $j$ is incident with hypernode $i$, and 0 otherwise. Hypergraph sparsity is determined by the number of hyperedges in the hypergraph. In a sparse hypergraph, the number of hyperedges ($m = |\mathcal{E}|$) is relatively small compared to the number of hypernodes ($n = |\mathcal{V}|$), while in a dense hypergraph, the number of hyperedges is relatively large. Sparser hypergraphs generally result in more efficient algorithms, due to the impact of hypergraph sparsity on computational efficiency and algorithmic complexity. A hypergraph with more hyperedges has a denser and more complex structure, resulting in a higher level of connectivity among the hypernodes. Conversely, a hypergraph with fewer hyperedges has a sparser structure with fewer connections between the hypernodes. The proposed framework aims to learn a differentiable function $f_{\theta}(\cdot)$ that predicts the future estimates $X_{(t:t+\tau-1)}$ from the historical window inputs $X_{(t-\tau:t-1)}$, given a hypergraph $\mathcal{G}$. To put it briefly, the function $f_{\theta}(\cdot)$ takes in the past observations and the hypergraph structure, represented by $[x_{(t-\tau)}, \cdots, x_{(t-1)}; \mathcal{G}]$, and predicts the future observations, denoted as $[x_{(t)}, \cdots, x_{(t+\tau-1)}]$. This is mathematically represented as:</p>
      <p>$\big[\, x_{(t-\tau)}, \cdots, x_{(t-1)}; \mathcal{G} \,\big] \xrightarrow{f_{\theta}(\cdot)} \big[\, x_{(t)}, \cdots, x_{(t+\tau-1)} \,\big]$</p>
      <p>The MTSF task formulated on the implicit hypergraph ($\mathcal{G}$) can be expressed as shown below:</p>
      <p>$\min_{\theta} \; \mathcal{L}_{\mathrm{MAE}}\big( X_{(t:t+\tau-1)}, \hat{X}_{(t:t+\tau-1)}; X_{(t-\tau:t-1)}, \mathcal{G} \big)$</p>
      <p>The function $f_{\theta}(\cdot)$ involves a set of parameters $\theta$ which can be trained to optimize its performance. The predicted future observations are denoted by $\hat{X}_{(t:t+\tau-1)}$. To train the learning algorithm, we minimize the loss function $\mathcal{L}_{\mathrm{MAE}}$, i.e., the mean absolute error (MAE), which is defined as:</p>
      <p>$\mathcal{L}_{\mathrm{MAE}} = \dfrac{1}{\tau} \sum_{i=t}^{t+\tau-1} \big|\, x_{(i)} - \hat{x}_{(i)} \,\big|$</p>
      <p>Here, $X_{(t:t+\tau-1)}$ is the actual future MTS data, and $1/\tau$ is a scaling factor.</p>
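      <p>To make the setup concrete, the following minimal sketch (our illustration, not the authors' released code; array shapes follow the notation above) builds rolling-window input/target pairs and evaluates the MAE objective:</p>
      <preformat>
# A minimal sketch of the rolling-window MTSF setup and the MAE objective.
import numpy as np

def make_windows(X, tau):
    """Slice a (T, n, d) series into (input, target) window pairs of length tau."""
    inputs, targets = [], []
    for t in range(tau, X.shape[0] - tau + 1):
        inputs.append(X[t - tau:t])        # X_(t-tau : t-1)
        targets.append(X[t:t + tau])       # X_(t : t+tau-1)
    return np.stack(inputs), np.stack(targets)

def mae_loss(y_true, y_pred):
    """Mean absolute error, with the 1/tau scaling folded into the mean."""
    return np.mean(np.abs(y_true - y_pred))

# Toy example: T=100 steps, n=5 sensors, d=3 channels per sensor, tau=12.
X = np.random.rand(100, 5, 3).astype(np.float32)
inp, tgt = make_windows(X, tau=12)
print(inp.shape, tgt.shape)                # (77, 12, 5, 3) (77, 12, 5, 3)
print(mae_loss(tgt, tgt.mean(axis=0)))     # error of a naive mean predictor
      </preformat>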
    </sec>
    <sec id="sec-4">
<title>4. Our Approach</title>
      <p>Our proposed neural forecasting framework consists of two key components: the projection
layer and the spatio-temporal feature extractor, as shown in Figure 1. The spatio-temporal
inference component includes two distinct methods for hypergraph representation learning:
the Spatio-Temporal Hypergraph Convolutional Network(STHgCN) and the Spatio-Temporal
Transformer Network(STTN). The STHgCN method employs hypergraph as a mathematical
model for learning the underlying higher-order relations of the time series variables. This is
achieved by optimizing the discrete hypergraph structure underlying the observed data. It then
performs gated hypergraph convolution operations on the hypergraph-structured MTS data to
model the intricate spatio-temporal dynamics within the latent hypernode-level representations.
The final representations can then be used to predict multi-horizon forecasts. The STTN method
is a powerful technique for modeling the hypergraph-structured MTS data. The STTN method
extends transformer networks to handle arbitrary sparse hypergraph structures with full attention
as a useful inductive bias. This enables the model to learn intra- and inter-correlations among the
variables without being limited by the hierarchical structural information underlying the MTS
data. It leverages task-specific relations between variables beyond the original sparse structure
to generate expressive hypernode-level representations that improve forecast accuracy. We use
a gating mechanism to regulate the flow of information from the two methods. This enables us
to learn optimal representations of the hypernode-level representations that capture the accurate
dynamics of complex interconnected sensor networks. To summarize, our framework performs
the joint optimization of the different learning components to generate accurate forecasts across
multiple forecast horizons, while also ensuring reliable estimates of uncertainty for time-series
forecasting tasks.</p>
      <sec id="sec-4-1">
        <title>Windowed</title>
        <p>Time Series</p>
      </sec>
      <sec id="sec-4-2">
        <title>Pointwise</title>
        <p>Forecasts
r
e
y
a
L
n
o
i
t
c
e
j
o
r
P</p>
        <sec id="sec-4-2-1">
          <title>HgSL</title>
          <p>t</p>
          <p>Spatio-Temporal Inference</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>STHgCN</title>
        </sec>
        <sec id="sec-4-2-3">
          <title>STTN</title>
          <p>
            Δ
Δ Gating Mechanism
The proposed framework uses a projection layer with gated linear networks(GLN, [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]) to
obtain non-linear representations of input data. Specifically, the input data $X_{(t-\tau:t-1)} \in \mathbb{R}^{n \times \tau \times d}$ is transformed through a gating mechanism, resulting in $X_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$, which represents the non-linear transformed input data. It is described as follows:
          </p>
      <p>$X_{(t:t+\tau-1)} = \big( \sigma( W_0 X_{(t-\tau:t-1)} ) \otimes W_1 X_{(t-\tau:t-1)} \big)\, W_2$</p>
          <p>
Here, the trainable weight matrices are $W_0$, $W_1$, and $W_2$, and element-wise multiplication is denoted by $\otimes$. The utilization of a non-linear activation function $\sigma$ improves representation learning and enables the framework to effectively learn and model complex patterns present in the MTS data.
          </p>
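          <p>A minimal PyTorch sketch of this projection layer is given below (our illustration, assuming a shared feature dimension $d$ for all three weight matrices; the released implementation may differ):</p>
          <preformat>
# Gated linear projection: X' = (sigma(W0 X) (*) W1 X) W2, applied feature-wise.
import torch
import torch.nn as nn

class GatedLinearProjection(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.w0 = nn.Linear(d, d, bias=False)   # gate branch
        self.w1 = nn.Linear(d, d, bias=False)   # value branch
        self.w2 = nn.Linear(d, d, bias=False)   # output projection
    def forward(self, x):                       # x: (n, tau, d)
        gate = torch.sigmoid(self.w0(x))        # sigma(W0 X)
        return self.w2(gate * self.w1(x))       # (sigma(W0 X) (*) W1 X) W2

proj = GatedLinearProjection(d=16)
print(proj(torch.randn(5, 12, 16)).shape)       # torch.Size([5, 12, 16])
          </preformat>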
          <p>
4.2. Spatial Inference
Figure 2 illustrates the spatio-temporal feature extractor of the framework, which consists of two distinct methods (STHgCN and STTN). Further information regarding each method will be elaborated in the subsequent sections.
4.2.1. Spatio-Temporal Hypergraph Convolutional Network(STHgCN)
The STHgCN method comprises sequentially operating modules, including hypergraph structure
learning(HgSL) and hypergraph representation learning(HgRL) modules. The following sections
will elaborate on each module and provide more details.
4.2.1.1. Implicit Hypergraph Inference
The HgSL module uses an embedding-based similarity metric learning technique to capture
higher-order dependency relationships between different variables in the MTS data and
computes an optimal discrete hypergraph structure for a hypergraph-structured representation of
the MTS data. In short, the implicit hypergraph provides a spatio-temporal inductive bias that
enables a structured representation of the MTS data, capturing the underlying relationships
and dependencies among the variables. The hypernodes and hyperedges of the hypergraph are
represented by differentiable embeddings in a $d$-dimensional vector space, $z_i, z_j \in \mathbb{R}^{d}$, where $1 \le i \le n$ and $1 \le j \le m$. By leveraging the learned embeddings to transform the MTS data
into a hypergraph-structured time series data, the HgSL module computes the optimal hypergraph
topology that captures the task-relevant relationships and dependencies among the variables,
making it a powerful tool for learning relational hypergraph structures from complex MTS data.
The pairwise similarity $S_{i,j}$ between any pair of embeddings $z_i$ and $z_j$ is computed as follows:</p>
          <p>$S_{i,j} = \dfrac{z_i^{\top} z_j + 1}{2\, \lVert z_i \rVert \cdot \lVert z_j \rVert}; \qquad P_{i,j} = \sigma\big( \big[\, S_{i,j} \,\big\|\, 1 - S_{i,j} \,\big] \big)$</p>
          <p>where $\|$ denotes vector concatenation. The differentiable sigmoid activation function $\sigma$ is applied to map the pairwise scores to the interval $[0, 1]$. The hyperedge probability over the hypernodes of the hypergraph is represented as $P \in \mathbb{R}^{n \times m \times 2}$, indexed by $c \in \{0, 1\}$. The scalar value $P_{(i,j,c)} \in [0, 1]$ indicates the relationship between the pair of hypernode and hyperedge indexed by $(i, j)$. To be precise, $P_{(i,j,0)}$ represents the probability of hypernode $i$ being connected to hyperedge $j$, while $P_{(i,j,1)}$ denotes the probability that hypernode $i$ is not connected to hyperedge $j$. To accurately and efficiently sample discrete hypergraph structures from the hyperedge probability distribution $P$, we leverage the Gumbel-softmax trick introduced in [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. This technique is powerful in
capturing complex relationships among variables in MTS data, making the HgSL module more
effective. The connectivity pattern of the hypergraph structure is then represented using an
incidence matrix I∈R× , which captures the relationships between hypernodes and hyperedges
in the hypergraph. By using the Gumbel-softmax trick, we can learn the hypergraph structure
in an end-to-end differentiable manner. Thus, it becomes possible to apply the gradient-based
optimization methods during model training, enabling an inductive-learning approach to learn
complex underlying structures within the MTS data. The Gumbel-Softmax trick involves using
random noise from the Gumbel distribution to perturb the hyperedge probability distribution and
then sampling the optimal discrete structure from the distribution using the Softmax function.
The incidence matrix is obtained as,
          </p>
          <p>$I_{i,j} = \dfrac{\exp\big( ( g_{(i,j,0)} + P_{(i,j,0)} + \epsilon ) / \tau \big)}{\sum_{c \in \{0,1\}} \exp\big( ( g_{(i,j,c)} + P_{(i,j,c)} + \epsilon ) / \tau \big)}$
where the temperature parameter ($\tau$) of the Gumbel-Softmax trick is set to 0.05, and $\epsilon$ is a small constant added to avoid numerical instability. Random noise, denoted by $g \sim \mathrm{Gumbel}(0, 1) = -\log(-\log(U(0, 1)))$, is sampled from the Gumbel distribution, where $U$ represents the uniform distribution with a range of 0 to 1. We optimize the hypergraph distribution
parameters to ensure that the learned hypergraph is sparse, eliminating redundant hyperedges over
hypernodes. The forecasting task provides indirect supervisory information that helps to reveal
the hypergraph relation structure in the observed MTS data. In summary, the HgSL module learns
the latent hypergraph structure of multiple interacting time series variables to create a structured
representation of the time series data, which facilitates downstream multi-horizon forecasting
with predictive uncertainty estimation.
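</p>
          <p>The sketch below illustrates the HgSL computation (our paraphrase; the straight-through Gumbel sampling and the exact logit construction are assumptions based on the equations above):</p>
          <preformat>
# HgSL sketch: embedding similarity -> connection probabilities -> Gumbel-softmax
# sampling of a discrete, differentiable incidence matrix I.
import torch
import torch.nn.functional as F

def hgsl_incidence(z_nodes, z_edges, temperature=0.05):
    # z_nodes: (n, d) hypernode embeddings; z_edges: (m, d) hyperedge embeddings
    sim = (z_nodes @ z_edges.T + 1) / (
        2 * z_nodes.norm(dim=1, keepdim=True) * z_edges.norm(dim=1))   # S_ij
    p_on = torch.sigmoid(sim)                    # P(hypernode i in hyperedge j)
    logits = torch.log(torch.stack([p_on, 1 - p_on], dim=-1))   # (n, m, 2)
    # Straight-through Gumbel-softmax: hard 0/1 samples in the forward pass,
    # soft gradients in the backward pass.
    samples = F.gumbel_softmax(logits, tau=temperature, hard=True)
    return samples[..., 0]                       # (n, m) incidence matrix I

I = hgsl_incidence(torch.randn(8, 16), torch.randn(4, 16))
print(I.shape, I.unique())                       # torch.Size([8, 4]), values 0/1
          </preformat>
          <p>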
4.2.1.2. Hypergraph Attention Network(HgAT)
The HgAT neural operator extends attention-based convolution operations to non-Euclidean
domains, such as hypergraphs. It accurately models the complex hypergraph-structured MTS
data, thereby improving multi-horizon forecast accuracy. The HgAT operator captures spatial
correlations among time-series variables by encoding relational inductive bias within the
hypergraph’s connectivity. It performs message-passing schemes to propagate information through the
hypergraph-structured MTS data, which is characterized by an incidence matrix $I \in \mathbb{R}^{n \times m}$ and a feature matrix $X_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$, to compute the hypernode representation matrix $H_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$. Each row in the matrix $H_{(t:t+\tau-1)}$ represents a hypernode representation, $h_i \in \mathbb{R}^{\tau \times d}$. The HgAT operator captures relationships among time-series variables by encoding structural and feature characteristics of spatio-temporal hypergraphs in hypernode representations. It adapts to changes in time-series variable dependencies over time in the hypernode representations $h_i$. The HgAT operator models spatio-temporal correlations
among time-series variables in hypergraph-structured MTS data using intra-edge and inter-edge
neighborhood aggregation schemes. The intra-edge aggregation considers hypernodes associated
with a specific hyperedge, while inter-edge aggregation considers hyperedges connected to a
specific hypernode. In hypergraph-structured MTS data, hyperedges capture relationships between
multiple time-series variables, which can have varying degrees of correlation and complexity.
Let the notation $\mathcal{N}_j$ represent the subset of hypernodes associated with a specific hyperedge $j$. The intra-edge neighborhood of a hypernode $i$, denoted as $\mathcal{N}_{j \setminus i}$, captures a localized cluster of semantically-correlated time-series variables and their higher-order relationships. The inter-edge neighborhood of a hypernode $i$, represented by $\mathcal{N}_i$, includes the set of hyperedges $j$ connected to that hypernode, reflecting that each variable may have multiple and potentially complex relationships with other time-series variables in the data. We use
attention-based intra-edge neighborhood aggregation to obtain latent hyperedge representations,
which leads to a more comprehensive understanding of the MTS data. This approach can be
described as follows:</p>
          <p>$h_j^{(\ell)} = \sum_{z=1}^{Z} \sigma\Big( \sum_{i \in \mathcal{N}_j} \alpha_{i,j}^{(\ell,z)}\, W_0^{(z)}\, h_i^{(\ell-1,z)} \Big)$</p>
          <p>where the hyperedge representations at layer $\ell$ are denoted by $h_j^{(\ell)} \in \mathbb{R}^{\tau \times d}$. Each hypernode's initial representation is its corresponding feature vector, $h_i^{(0,z)} = x^{(i)}$, where $x^{(i)} \in \mathbb{R}^{\tau \times d}$ represents the $i$th row of the feature matrix $X_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$. At each layer, the HgAT operator produces multiple representations $h_j^{(\ell,z)}$ of the input data, each with its own set of parameters, and combines them by summation. This enables the HgAT operator to capture various aspects of the relations underlying the intra-edge neighborhood in the hypergraph-structured MTS data. To determine the attention coefficient $\alpha_{i,j}$ for the hypernode $i$ incident with hyperedge $j$, we compute its relative importance as follows:</p>
          <p>$\alpha_{i,j}^{(\ell,z)} = \dfrac{\exp\big( e_{i,j}^{(\ell,z)} \big)}{\sum_{i' \in \mathcal{N}_j} \exp\big( e_{i',j}^{(\ell,z)} \big)}; \qquad e_{i,j}^{(\ell,z)} = \mathrm{ReLU}\big( W^{(0)}\, h_i^{(\ell-1,z)} \big)$</p>
          <p>The trained weight matrices are represented as $W^{(0)}$ and $W^{(1)}$. The ReLU activation function is used to introduce non-linearity while updating the hypernode-level representations. The attention scores $\beta_{i,j}$ are normalized and determine the relevance of each hyperedge $j$ that is incident with hypernode $i$. This allows the HgAT operator to focus on the most significant hyperedges, and the attention scores are computed as follows:</p>
          <p>$b_{i,j}^{(\ell,z)} = \mathrm{ReLU}\Big( W^{(3)} \cdot \big( W^{(2)}\, h_i^{(\ell-1,z)} \oplus W^{(2)}\, h_j^{(\ell,z)} \big) \Big); \qquad \beta_{i,j}^{(\ell,z)} = \dfrac{\exp\big( b_{i,j}^{(\ell,z)} \big)}{\sum_{j' \in \mathcal{N}_i \cup \{j\}} \exp\big( b_{i,j'}^{(\ell,z)} \big)}$</p>
          <p>where $W^{(2)} \in \mathbb{R}^{d \times d}$ and $W^{(3)} \in \mathbb{R}^{2d}$ are a trainable weight matrix and vector, respectively. $\oplus$ denotes the concatenation operator. The unnormalized attention score is denoted by $b_{i,j}$. Batch normalization and dropout techniques are used to enhance generalization and mitigate overfitting. A gating mechanism is employed to selectively combine features from $x^{(i)}$ and $h_i^{(\ell)}$ in a differentiable way. These methods improve the HgAT operator's reliability and accuracy for the downstream MTSF task.</p>
          <p>
$g^{(i)} = \sigma\big( \phi(h_i^{(\ell)}) + \psi(x^{(i)}) \big); \qquad h_i^{(\ell)} = g^{(i)} \otimes \phi(h_i^{(\ell)}) + (1 - g^{(i)}) \otimes \psi(x^{(i)})$
where $\phi$ and $\psi$ denote linear projections, enabling the HgAT operator to capture the relationships between time-series variables and their temporal changes, resulting in enhanced forecast accuracy. In summary, the HgAT operator is a powerful technique for encoding and analyzing spatio-temporal hypergraphs.
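</p>
          <p>A simplified, single-head reading of the HgAT equations above is sketched below (our illustration; multi-head summation, batch normalization, and dropout are omitted for brevity):</p>
          <preformat>
# HgAT sketch: attention-weighted intra-edge aggregation (hypernodes -> hyperedges),
# attention-weighted inter-edge aggregation (hyperedges -> hypernodes), then gating.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HgATLayer(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.w0 = nn.Linear(d, d, bias=False)       # hypernode -> hyperedge message
        self.w1 = nn.Linear(d, 1, bias=False)       # intra-edge score e_ij
        self.w2 = nn.Linear(d, d, bias=False)       # shared projection
        self.w3 = nn.Linear(2 * d, 1, bias=False)   # inter-edge score b_ij
        self.phi = nn.Linear(d, d)                  # gating projections
        self.psi = nn.Linear(d, d)

    def forward(self, x, inc):                      # x: (n, d); inc: (n, m) 0/1
        n, m = inc.shape
        neg = torch.finfo(x.dtype).min
        # Intra-edge: attend over the member hypernodes of each hyperedge.
        e = F.relu(self.w1(x)).expand(n, m)         # (n, m) unnormalized scores
        alpha = F.softmax(e.masked_fill(inc == 0, neg), dim=0)
        h_edge = alpha.T @ self.w0(x)               # (m, d) hyperedge representations
        # Inter-edge: attend over the hyperedges incident to each hypernode.
        pair = torch.cat([self.w2(x).unsqueeze(1).expand(n, m, -1),
                          self.w2(h_edge).unsqueeze(0).expand(n, m, -1)], dim=-1)
        b = F.relu(self.w3(pair)).squeeze(-1)       # (n, m) unnormalized scores
        beta = F.softmax(b.masked_fill(inc == 0, neg), dim=1)
        h_node = beta @ h_edge                      # (n, d) hypernode representations
        # Differentiable gating between aggregated and input features.
        g = torch.sigmoid(self.phi(h_node) + self.psi(x))
        return g * self.phi(h_node) + (1 - g) * self.psi(x)

inc = torch.bernoulli(torch.full((8, 4), 0.4))
inc[:, 0] = 1.0   # keep every hypernode covered by at least one hyperedge
inc[0, :] = 1.0   # keep every hyperedge non-empty
layer = HgATLayer(d=16)
print(layer(torch.randn(8, 16), inc).shape)         # torch.Size([8, 16])
          </preformat>
          <p>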
4.2.1.3. Spatio-Temporal Hypergraph Representation Learning
We present the spatio-temporal hypergraph representation learning(HgRL) module to operate on a
sequence of dynamic hypergraphs, where hypergraph structure is fixed, and hypernode attributes
change over time, where each hypergraph represents the hypergraph-structured MTS data at
a specific time step. The HgRL operator utilizes Gated Recurrent Units(GRU, [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]) to model
the spatio-temporal dynamics of the dynamic hypergraph sequence. The computation of the
update gate, reset gate, and hidden state in a traditional GRU involves matrix multiplication with
weight matrices. In the HgRL module, however, these matrix multiplications are replaced with
Hypergraph Attention Networks(HgAT). The HgRL operator analyzes hypergraph-structured
MTS data over time. It propagates information between hypernodes across different time
steps, which enables the model to capture the complex spatio-temporal dependencies between
the hypergraphs. The HgRL operator utilizes the implicit hypergraph topology to propagate
information between hypernodes by averaging the hypernode representations in their local
neighborhood at each time step computed as follows,
          </p>
          <p>
$U_{(t:t+\tau-1)} = \sigma\big( W_U \big[ \mathrm{HgAT}\big( I, X_{(t:t+\tau-1)} \big) \,\big\|\, H_{(t-\tau:t-1)} \big] + B_U \big)$
$R_{(t:t+\tau-1)} = \sigma\big( W_R \big[ \mathrm{HgAT}\big( I, X_{(t:t+\tau-1)} \big) \,\big\|\, H_{(t-\tau:t-1)} \big] + B_R \big)$
where $\mathrm{HgAT}(I, X_{(t:t+\tau-1)})$ denotes the HgAT operator, and $\|$ and $\otimes$ denote the concatenation and element-wise multiplication operations, respectively. The update and reset gates at time $t$ are represented by the matrices $U_{(t:t+\tau-1)}$ and $R_{(t:t+\tau-1)}$, respectively. $W_U$, $W_R$, and $W_C$ are learnable weight matrices and $B_U$, $B_R$, and $B_C$ are learnable biases. In summary, the hypernode representation matrix $H_{(t:t+\tau-1)}$ captures the spatio-temporal dynamics at different scales underlying the discrete-time dynamic hypergraphs, where each row in $H_{(t:t+\tau-1)}$ represents a hypernode representation $h_i \in \mathbb{R}^{\tau \times d}$, $\forall i \in \mathcal{V}$. Some of the key advantages of the HgRL operator over traditional methods
include its ability to handle large and sparse spatio-temporal hypergraphs. The STHgCN method
utilizes useful relational inductive bias encoded in the hypergraph-structured data for modeling
the continuous-time nonlinear dynamics of the complex system to disentangle the various latent
aspects underneath the data for better forecast accuracy.
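</p>
          <p>A sketch of the resulting recurrence follows (assuming the HgATLayer class from the previous sketch is in scope; the candidate-state equation is our assumption, following the standard GRU form, since it is not spelled out above):</p>
          <preformat>
# HgRL sketch: a GRU-style cell whose gates consume [HgAT(I, X) || H].
import torch
import torch.nn as nn

class HgRLCell(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.hgat = HgATLayer(d)            # hypergraph attention operator
        self.w_u = nn.Linear(2 * d, d)      # update gate: W_U, B_U
        self.w_r = nn.Linear(2 * d, d)      # reset gate: W_R, B_R
        self.w_c = nn.Linear(2 * d, d)      # candidate state: W_C, B_C (assumed)

    def forward(self, x, h, inc):           # x, h: (n, d); inc: (n, m)
        msg = self.hgat(x, inc)             # HgAT(I, X)
        u = torch.sigmoid(self.w_u(torch.cat([msg, h], dim=-1)))
        r = torch.sigmoid(self.w_r(torch.cat([msg, h], dim=-1)))
        c = torch.tanh(self.w_c(torch.cat([msg, r * h], dim=-1)))
        return u * h + (1 - u) * c          # next hidden state H

cell = HgRLCell(d=16)
h, inc = torch.zeros(8, 16), torch.ones(8, 4)
for _ in range(12):                         # unroll across the look-back window
    h = cell(torch.randn(8, 16), h, inc)
print(h.shape)                              # torch.Size([8, 16])
          </preformat>
          <p>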
4.2.2. Spatio-Temporal Transformer Network(STTN)
The Spatio-temporal transformer network(STTN) operator is a new extension of transformer
networks ([
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]) that incorporates full attention as a desired inductive bias to model MTS data
with arbitrary sparse hypergraph structures. This capability enables it to capture fine-grained
spatio-temporal dependencies in MTS data, unconstrained by hierarchical structural information
underlying the MTS data. By allowing attention to all hypernodes within the hypergraph, the
neural operator can span large receptive fields and reason globally about complex dependencies
in hypergraph-structured MTS data. As a result, it can serve as a drop-in replacement for
existing methods that model hierarchical relationships among time-series variables in MTS data.
Additionally, the neural operator is particularly suitable for downstream forecasting tasks in
spatio-temporal hypergraphs. The transformer encoder comprises alternating layers of
multi-headed self-attention(MSA) and multi-layer perceptron(MLP) blocks to capture both local and
global contextual information. To enhance performance and regularize the transformer operator,
each block is followed by layer normalization(LN([
            <xref ref-type="bibr" rid="ref11">11</xref>
            ])) and residual connections. The
skip connections are incorporated through an initial connection strategy inspired by ResNets([
            <xref ref-type="bibr" rid="ref12">12</xref>
            ])
to address vanishing gradients and over-smoothing issues and enable the learning of complex
and deep representations of the data. Using a space-then-time(STT, [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]) approach, the STTN
first performs a temporal-encoding step to capture the long-term temporal dependencies (intra-dependencies) within the time series variables. This is followed by a spatial-encoding step, which captures the inter-dependencies among the time series variables. We model the intra- and inter-dependencies through sequentially operating temporal and spatial transformer networks, respectively.
4.2.2.1. Temporal Transformer
In the self-attention mechanism, the input sequence is transformed into three tensors: the query tensor, the key tensor, and the value tensor, where the input tensor is denoted by $X_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$ and has three dimensions: number of time series variables ($n$), forecast horizon ($\tau$), and embedding dimension ($d$). The key tensor is searched using the query tensor to retrieve relevant information, and the value tensor is weighted by the resulting attention weights. The weighted value tensor is then summed to produce the final output. To begin with, we reshape the input tensors to split the embedding dimension into multiple heads:</p>
          <p>$\mathrm{queries}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{queries}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}; \quad \mathrm{keys}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{keys}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}; \quad \mathrm{values}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{values}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}$</p>
          <p>where $h$ and $d_h$ represent the index of the attention head and the head dimension, respectively. Here, $i$ and $j$ represent the indices of the query and key positions, respectively. We compute the energy between queries and keys, as described below:</p>
          <p>$\mathrm{energy}_{(i,j,h)} = \sum_{d_h} \mathrm{queries}_{(i,h)} \cdot \mathrm{keys}_{(j,h)}$</p>
          <p>We compute the attention scores using the softmax function described below:</p>
          <p>$\mathrm{attention}_{(i,j,h)} = \dfrac{\exp\big( \mathrm{energy}_{(i,j,h)} / \sqrt{d_h} \big)}{\sum_{j'} \exp\big( \mathrm{energy}_{(i,j',h)} / \sqrt{d_h} \big)}$</p>
          <p>To calculate the output tensor, we multiply the values tensor with the attention scores, as described below:</p>
          <p>$\mathrm{out}_{(i,h)} = \sum_{j} \mathrm{attention}_{(i,j,h)} \cdot \mathrm{values}_{(j,h)}$</p>
          <p>We perform the concatenation operation along the $h$ dimension, which combines the outputs of all the heads. We apply a linear transformation to obtain the final output, as follows:</p>
          <p>$\mathrm{out}_{(t:t+\tau-1)} = \mathrm{out}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)}\, W^{(h \ast d_h,\, d)}$</p>
          <p>
4.2.2.2. Spatial Transformer
The output of the temporal transformer, denoted by $\mathrm{out}_{(t:t+\tau-1)} \in \mathbb{R}^{n \times \tau \times d}$, is passed to the spatial transformer as input, and it consists of three dimensions: number of time series variables ($n$), forecast horizon ($\tau$), and embedding dimension ($d$). The input sequences are first transformed into three tensors, namely the query tensor, the key tensor, and the value tensor, before applying the self-attention mechanism. In order to retrieve relevant information, the query tensor is employed to search through the key tensor. The resulting attention scores are then used to weight the value tensor. Finally, the weighted values are aggregated to produce the final output. We reshape the input tensors to split the embedding dimension into multiple heads:</p>
          <p>$\mathrm{queries}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{queries}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}; \quad \mathrm{keys}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{keys}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}; \quad \mathrm{values}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)} \rightarrow \mathrm{values}^{(n,\tau,h,d_h)}_{(t:t+\tau-1)}$</p>
          <p>where $h$ and $d_h$ denote the number of heads and the head dimension, respectively. We compute the energy between queries and keys, now attending over the variable dimension, as described below:</p>
          <p>$\mathrm{energy}_{(i,j,h)} = \sum_{d_h} \mathrm{queries}_{(i,h)} \cdot \mathrm{keys}_{(j,h)}$</p>
          <p>We obtain the attention scores using the softmax function:</p>
          <p>$\mathrm{attention}_{(i,j,h)} = \dfrac{\exp\big( \mathrm{energy}_{(i,j,h)} / \sqrt{d_h} \big)}{\sum_{j'} \exp\big( \mathrm{energy}_{(i,j',h)} / \sqrt{d_h} \big)}$</p>
          <p>We then compute the output tensor by multiplying the attention scores with the values tensor:</p>
          <p>$\mathrm{out}_{(i,h)} = \sum_{j} \mathrm{attention}_{(i,j,h)} \cdot \mathrm{values}_{(j,h)}$</p>
          <p>We apply a linear transformation to obtain the final output, as follows:</p>
          <p>$\mathrm{out}_{(t:t+\tau-1)} = \mathrm{out}^{(n,\tau,h \ast d_h)}_{(t:t+\tau-1)}\, W^{(h \ast d_h,\, d)}$</p>
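          <p>Both transformers share the same multi-head attention computation; a compact einsum-based sketch is given below (our paraphrase; the final linear projection $W$ and the LN/residual blocks are omitted). The temporal and spatial variants differ only in the axis attended over:</p>
          <preformat>
# Multi-head scaled dot-product attention over the second axis of (n, tau, d).
import math
import torch

def multihead_attention(q, k, v, heads):
    n, tau, d = q.shape
    dh = d // heads                                   # head dimension d_h
    q, k, v = (x.reshape(n, tau, heads, dh) for x in (q, k, v))
    energy = torch.einsum("nihc,njhc->nijh", q, k)    # query-key energies
    attn = torch.softmax(energy / math.sqrt(dh), dim=2)   # softmax over keys j
    out = torch.einsum("nijh,njhc->nihc", attn, v)    # weight the values
    return out.reshape(n, tau, d)                     # concatenate the heads

x = torch.randn(5, 12, 16)
print(multihead_attention(x, x, x, heads=4).shape)    # temporal attention
xt = x.transpose(0, 1)                                # swap variables and time
print(multihead_attention(xt, xt, xt, heads=4).transpose(0, 1).shape)  # spatial
          </preformat>
          <p>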
4.2.3. Gating Mechanism
The mixture-of-experts (MOE) mechanism in deep learning combines predictions from multiple subnetworks, such as the STHgCN and STTN representation learning methods, through a gating mechanism that computes a weighted sum of their predictions based on the input. The aim is to find the optimal weight assignment for the gating function and train the experts accordingly using these weights. From a cooperative game theory perspective, the MOE is a cooperative game where experts collaborate to optimize the system's overall performance. The gating mechanism can optimize the weights assigned to each expert by evaluating their individual performance, as well as the system's overall performance. The fused representations in MOE are obtained by combining expert predictions using the gating mechanism weights. This is described below:</p>
          <p>$g'' = \sigma\big( \phi''(H_{(t:t+\tau-1)}) + \psi''(\mathrm{out}_{(t:t+\tau-1)}) \big); \qquad \hat{X}_{(t:t+\tau-1)} = g'' \otimes \phi''(H_{(t:t+\tau-1)}) + (1 - g'') \otimes \psi''(\mathrm{out}_{(t:t+\tau-1)})$</p>
          <p>where $\hat{X}_{(t:t+\tau-1)}$ denotes the model's multi-horizon forecasts, and $H_{(t:t+\tau-1)}$ and $\mathrm{out}_{(t:t+\tau-1)}$ denote the hypernode representation matrices computed by the STHgCN and STTN methods, respectively. $\phi''$ and $\psi''$ are linear projections. Moreover, our framework variant (w/Unc-JHgRF-Net) ensures precise and reliable uncertainty estimates of multi-horizon forecasts by minimizing the negative Gaussian log likelihood. For more details, refer to the appendix. The proposed methods (JHgRF-Net, w/Unc-JHgRF-Net) enable end-to-end modeling of hidden interdependencies and their evolution over time in sensor network-based dynamical systems for highly accurate forecasting.</p>
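          <p>A minimal sketch of the fusion step is shown below (our illustration; the Gaussian negative log-likelihood follows the standard heteroscedastic formulation, which we assume matches the appendix-level details):</p>
          <preformat>
# Gated mixture-of-experts fusion of the STHgCN and STTN outputs, plus the
# negative Gaussian log-likelihood used by the w/Unc-JHgRF-Net variant.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.phi = nn.Linear(d, d)      # projection of the STHgCN expert
        self.psi = nn.Linear(d, d)      # projection of the STTN expert
    def forward(self, h_sthgcn, h_sttn):
        g = torch.sigmoid(self.phi(h_sthgcn) + self.psi(h_sttn))
        return g * self.phi(h_sthgcn) + (1 - g) * self.psi(h_sttn)

def gaussian_nll(mean, log_var, target):
    """Negative Gaussian log-likelihood with a predicted per-point variance."""
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()

fuse = GatedFusion(d=16)
y_hat = fuse(torch.randn(5, 12, 16), torch.randn(5, 12, 16))
print(y_hat.shape)                                  # torch.Size([5, 12, 16])
print(gaussian_nll(y_hat, torch.zeros_like(y_hat), torch.randn_like(y_hat)))
          </preformat>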
    </sec>
    <sec id="sec-5">
      <title>5. Datasets</title>
      <p>
        The study aims to evaluate the effectiveness of two new models, JHgRF-Net and
w/Unc-JHgRF-Net(JHgRF-Net with local-uncertainty estimation), on large-scale spatial-temporal datasets([
        <xref ref-type="bibr" rid="ref14">14</xref>
        ])
containing real-world traffic information. The datasets include PeMSD3, PeMSD4, PeMSD7,
PeMSD7(M), and PeMSD8. The study includes a preprocessing step to ensure consistency with
prior research by aggregating the 30-second interval data into 5-minute averages. Additionally,
publicly accessible METR-LA and PEMS-BAY datasets([
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]) were used for traffic flow prediction.
The preprocessing step involves transforming the time series data into 5-minute interval averages
to ensure a fair comparison with the prior research. For all the above-mentioned traffic datasets, we
possess information about the underlying sensor graph. To create the sensor graph, we calculated
the distances between sensors in the road network and utilized a thresholded Gaussian kernel to
build the adjacency matrix. Our experimental findings, discussed in the next section, support the
rationale of learning the implicit hypergraph relational structure of the variables underlying the
MTS data and modeling the spatial-temporal dynamics for improved forecast accuracy compared
to learning to forecast on predefined (prior-known) sensor graphs. Furthermore, we utilize various multivariate datasets, including Electricity (archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014), Solar-energy (www.nrel.gov/grid/solar-power-data.html), Exchange-rate (github.com/laiguokun/multivariate-time-series-data), and Traffic (pems.dot.ca.gov), for which no prior sensor graph structure exists. Additionally, the SWaT([
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) and WADI([
        <xref ref-type="bibr" rid="ref17">17</xref>
        ])
are sensor datasets that measure water treatment plants and also do not have a predefined sensor
graph structure. They were first used in prior research for anomaly detection due to the presence
of annotated anomalies, but later used in forecasting experiments because their training sets are
anomaly-free. The experimental study conducted on benchmark datasets aims to showcase the
effectiveness and advantages of the proposed methodology( JHgRF-Net and w/Unc-JHgRF-Net)
in analyzing and modeling complex spatio-temporal MTS data, surpassing existing methods.
      </p>
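      <p>For the datasets with a known sensor graph, the adjacency construction described above can be sketched as follows (our illustration, following the thresholded Gaussian kernel convention popularized by DCRNN [28]; the threshold value is an assumption):</p>
      <preformat>
# Thresholded Gaussian kernel adjacency from pairwise road-network distances.
import numpy as np

def gaussian_kernel_adjacency(dist, threshold=0.1):
    """A_ij = exp(-dist_ij^2 / sigma^2), zeroed below the sparsity threshold."""
    sigma = dist.std()
    adj = np.exp(-np.square(dist / sigma))
    return np.where(adj >= threshold, adj, 0.0)

# Toy example: 4 sensors with symmetric pairwise distances (arbitrary units).
dist = np.array([[0.0, 1.0, 5.0, 9.0],
                 [1.0, 0.0, 3.0, 7.0],
                 [5.0, 3.0, 0.0, 2.0],
                 [9.0, 7.0, 2.0, 0.0]])
print(gaussian_kernel_adjacency(dist).round(3))
      </preformat>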
    </sec>
    <sec id="sec-6">
<title>6. Experimental Results</title>
      <p>Table 2 provides a thorough comparison between the proposed models(JHgRF-Net and w/Unc-JHgRF-Net) and several baseline models on the MTSF task across five different benchmark datasets: PeMSD3, PeMSD4, PeMSD7, PeMSD7(M), and PeMSD8. To evaluate the models' effectiveness, we measured forecast errors on a well-established benchmark involving a 12-step-prior to 12-step-ahead forecasting task. We utilize a multi-metric approach to comprehensively evaluate the proposed models' performance against the baseline models. We use several performance metrics, including mean absolute error(MAE), root mean squared error(RMSE), and mean absolute percentage error(MAPE), to provide an accurate estimate of the models' performance. We reported the baseline model results from [18]. Our
experimental findings indicate that the proposed models(JHgRF-Net and w/Unc-JHgRF-Net)
consistently outperformed the baseline models, exhibiting lower forecast errors across the different
benchmark datasets. On the PeMSD3, PeMSD4, PeMSD7, PeMSD8, and PeMSD7(M) datasets,
the proposed model(JHgRF-Net) demonstrated significant improvement over the next-best
baseline models, achieving a reduction of 20.71%, 7.49%, 2.81%, 11.08%, and 1.30% in the RMSE
metric, respectively. Apart from pointwise forecasts, the w/Unc-JHgRF-Net model(which
integrates JHgRF-Net with local uncertainty estimation) predicts time-varying uncertainty estimates
of the multi-horizon forecasts. While it exhibits slightly lower performance than the JHgRF-Net
model, it still outperforms several robust baselines found in the literature, as demonstrated by
the reduced prediction error. Additionally, in Table 1, we show the performance of JHgRF-Net, w/Unc-JHgRF-Net, and several baseline models on the MTSF task across multiple datasets: METR-LA, PEMS-BAY, Solar-energy, Electricity, Exchange-rate, Traffic, SWaT, and WADI. The models were evaluated using various metrics, including MAE, RMSE, and MAPE, and
corresponding forecast errors were reported for 3-, 6-, and 12-steps ahead forecast horizons. The
proposed models, JHgRF-Net and w/Unc-JHgRF-Net, demonstrated superior performance
compared to the baseline models, with significantly lower forecast errors observed on all the datasets.
On the METR-LA, PEMS-BAY, Solar-energy, Electricity, Exchange-rate, Traffic, SWaT, and
WADI datasets, the proposed model(JHgRF-Net) shows superior performance over the next-best
baseline models, achieving a reduction of 37.01%, 21.74%, 54.93%, 3.77%, 37.14%, 18.18%,
51.25%, and 17.63% in the RMSE metric, respectively, for the 6-step-ahead forecast horizon.
Our empirical findings validate the efficacy of the proposed neural forecasting architecture to
capture the complex nonlinear spatio-temporal dynamics that are present in MTS data, leading
to improved forecasting performance. Please refer to the appendix, for further details on the
experimental methodology, ablation studies, and additional experimental results. The appendix
includes a comprehensive analysis of the JHgRF-Net model’s ability to handle missing data,
as well as a more detailed description of the w/Unc-JHgRF-Net model’s ability to estimate
uncertainty. Additionally, the appendix offers comprehensive visualizations of model predictions
with uncertainty estimates in comparison to the ground truth, along with a brief overview of the baseline models.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Our proposed forecasting architecture accurately models the complex spatio-temporal dynamics within MTS data and achieves more accurate multi-horizon forecasts than several baselines.
The experimental results obtained from real-world datasets demonstrate the effectiveness of our
approach, as supported by improved forecast estimates and reliable uncertainty estimations. In
the future, our focus will be on expanding the framework’s capabilities to handle large-scale
graph datasets, enabling its utilization for a wide range of applications, such as anomaly detection,
missing data imputation, etc.</p>
      <p>[18] J. Choi, H. Choi, J. Hwang, N. Park, Graph neural controlled differential equations for
traffic forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence,
volume 36, 2022, pp. 6367–6374.
[19] I. Marisca, A. Cini, C. Alippi, Learning to reconstruct missing data from spatiotemporal
graphs with sparse observations, arXiv preprint arXiv:2205.13479 (2022).
[20] A. Cini, D. Zambon, C. Alippi, Sparse graph learning for spatiotemporal time series, arXiv
preprint arXiv:2205.13492 (2022).
[21] D. A. Nix, A. S. Weigend, Estimating the mean and variance of the target probability
distribution, in: Proceedings of 1994 ieee international conference on neural networks
(ICNN’94), volume 1, IEEE, 1994, pp. 55–60.
[22] A. Deng, B. Hooi, Graph neural network-based anomaly detection in multivariate time
series, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021,
pp. 4027–4035.
[23] J. D. Hamilton, Time series analysis, Princeton university press, 2020.
[24] S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and
recurrent networks for sequence modeling, arXiv:1803.01271 (2018).
[25] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in:</p>
      <p>NeurIPS, 2014, pp. 3104–3112.
[26] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, Y. Bengio, Learning
phrase representations using rnn encoder-decoder for statistical machine translation, in:
EMNLP, 2014.
[27] S. Huang, D. Wang, X. Wu, A. Tang, Dsanet: Dual self-attention network for multivariate
time series forecasting, in: CIKM, 2019.
[28] Y. Li, R. Yu, C. Shahabi, Y. Liu, Diffusion convolutional recurrent neural network: Data-driven traffic forecasting, in: ICLR, 2018.
[29] B. Yu, H. Yin, Z. Zhu, Spatio-temporal graph convolutional networks: A deep learning
framework for traffic forecasting, in: IJCAI, 2018. URL: https://doi.org/10.24963/ijcai.
2018/505. doi:10.24963/ijcai.2018/505.
[30] Z. Wu, S. Pan, G. Long, J. Jiang, C. Zhang, Graph wavenet for deep spatial-temporal graph
modeling, in: IJCAI, 2019, pp. 1907–1913.
[31] S. Guo, Y. Lin, N. Feng, C. Song, H. Wan, Attention based spatial-temporal graph
convolutional networks for traffic flow forecasting, in: AAAI, 2019. doi:10.1609/aaai.
v33i01.3301922.
[32] L. Bai, L. Yao, S. S. Kanhere, X. Wang, Q. Z. Sheng, Stg2seq: Spatial-temporal graph
to sequence model for multi-step passenger demand forecasting, in: IJCAI, 2019. URL:
https://doi.org/10.24963/ijcai.2019/274. doi:10.24963/ijcai.2019/274.
[33] C. Song, Y. Lin, S. Guo, H. Wan, Spatial-temporal synchronous graph convolutional
networks: A new framework for spatial-temporal network data forecasting, in: AAAI, 2020.
doi:10.1609/aaai.v34i01.5438.
[34] R. Huang, C. Huang, Y. Liu, G. Dai, W. Kong, Lsgcn: Long short-term traffic prediction with graph convolutional networks, in: IJCAI, 2020, pp. 2355–2361.
[35] L. Bai, L. Yao, C. Li, X. Wang, C. Wang, Adaptive graph convolutional recurrent network
for traffic forecasting, in: NeurIPS, volume 33, 2020, pp. 17804–17815.
[36] M. Li, Z. Zhu, Spatial-temporal fusion graph neural networks for traffic flow forecasting,
in: AAAI, 2021.
[37] Y. Chen, I. Segovia-Dominguez, Y. R. Gel, Z-gcnets: Time zigzags at graph convolutional
networks for time series forecasting, in: ICML, 2021.
[38] Z. Fang, Q. Long, G. Song, K. Xie, Spatial-temporal graph ODE networks for traffic flow
forecasting, in: KDD, 2021.
[39] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, R. Zemel, Neural relational inference for
interacting systems, in: International Conference on Machine Learning, PMLR, 2018, pp.
2688–2697.</p>
    </sec>
    <sec id="sec-8">
<title>8. Appendix</title>
      <p>8.1. Ablation Study
The JHgRF-Net framework, serving as the baseline for our ablation study, seamlessly integrates
both spatial and temporal inference components to model complex inter- and intra-time series
correlations in interconnected sensor networks. Its spatial inference component comprises two modules: the Spatio-Temporal Hypergraph Convolutional Network(STHgCN) and the Spatio-Temporal Transformer Network(STTN). In an extensive ablation study, we evaluate the impact
of each component in the JHgRF-Net framework on the MTSF task. By selectively removing
components, we can observe the impact of individual components on the overall framework
performance, gaining valuable insight into their unique contributions towards the framework
effectiveness. The study systematically eliminated components to create various ablated variants and identify the critical components that enhance the framework's performance. By comparing the impact of these components on the MTSF task against the baseline, valuable insights were gained into each component's contribution to the overall framework performance. The ablation
study led to an improved understanding of the relationship between the various ablated variants
and the baseline, which resulted in a better understanding of the mechanisms that underlie their
generalization performance. We present detailed information on each ablated variant created by
systematically removing specific components, as follows:
• “w/o - Spatial": A variant of JHgRF-Net framework that excluded the spatial inference
component, and its degraded performance highlights the significance of using STHgCN and
STTN neural operators for effective modeling of inter-series correlations among multiple
time series variables present in complex interconnected sensor networks.
• “w/o - Temporal": A variant of JHgRF-Net that excluded the temporal inference
component, and its deteriorated performance highlighted the importance of incorporating the
temporal inference component for effectively modeling the time-varying inter-series
dependencies within multiple time series variables present in complex sensor network-based
dynamical systems.
• “w/o - STHgCN": A variant of JHgRF-Net that excluded the STHgCN method, and
its substandard performance shed light on the importance of attention-based hypergraph
convolution operation for modeling the spatio-temporal dynamics present in the
high-dimensional sensor network-based dynamical systems.
• “w/o - STTN": A variant of JHgRF-Net that excluded the STTN method, and its subpar
performance emphasized the significance of hypergraph transformer networks, which utilize full attention as a structural inductive bias for modeling the complex dynamics
present in the high-dimensional interconnected sensor networks.</p>
      <p>In Tables 3 - 9, we present the findings of our ablation studies on benchmark datasets. We
employed multiple forecasting accuracy metrics, including Mean Absolute Error(MAE), Root
Mean Squared Error(RMSE), and Mean Absolute Percentage Error(MAPE), to offer a
comprehensive understanding of the relative performance of ablated variants compared to the baseline.
We evaluated the accuracy of multistep-ahead forecasting task by comparing pointwise forecasts
with observed data(ground-truth) during the prediction interval and the results were reported
using the previously mentioned forecast accuracy metrics. For additional clarity, we enclosed
the relative percentage difference between the ablated variants and the baseline performance
within parentheses. To ensure the accuracy of our findings, we conducted multiple experiments
and reported the average results. Moreover, we evaluated the ablated variants' ability to handle
long-term predictions by setting the forecast horizon to 12 and comparing it with the baseline.
Tables 3 - 9, demonstrate that the ablated variants have lower forecast accuracy and perform
considerably worse than the baseline. Upon closer examination, it is apparent that, for achieving
state-of-the-art performance on benchmark datasets, the spatial inference component within the
JHgRF-Net framework is more important than the temporal inference component. The ablation
studies yielded the following observations:
• On the PeMSD8 dataset, analysis indicates that the “w/o - Spatial" variant shows a
significant decline in performance relative to the baseline, with an increase of 21.12% in RMSE,
27.55% in MAE, and 40.77% in MAPE. Conversely, the “w/o - Temporal" variant exhibits
slightly inferior performance compared to the baseline, with a modest rise of 16.23% in
RMSE, 14.02% in MAE, and 10.15% in MAPE.
• Likewise, similar trends are observed on the PeMSD4 dataset. The “w/o - Spatial" variant
significantly underperforms the benchmark, with an increase of 20.86% in RMSE, 22.26%
in MAE, and 22.34% in MAPE. In contrast, the “w/o - Temporal" variant exhibits a minor
reduction in its performance when compared to the baseline, with a marginal rise of 8.76%
in RMSE, 4.47% in MAE, and 2.86% in MAPE.
• Analogous trends are observed for PeMSD7 dataset. In particular, the “w/o - Spatial"
variant displays a notable decline in performance relative to the baseline, with an increase
of 18.33% in RMSE, 21.05% in MAE, and 44.33% in MAPE. On the other hand, the “w/o
- Temporal" variant indicates a minor drop in performance compared to the baseline, with a
slight increase of 8.63% in RMSE, 7.75% in MAE, and 8.74% in MAPE.</p>
      <p>The higher increase in the error metrics of the ablated variants' performance, in comparison
to the baseline, further emphasizes the relative significance of the mechanisms underlying the
excluded components of the baseline. To put it briefly, the spatial inference component serves
as a powerful backbone that fortifies the JHgRF-Net framework for improving forecasting
performance. This component is responsible for capturing the intricate dependencies among
multiple time series variables and learning the dynamics of interacting systems. The crucial role
of the spatial inference component is evident from the substantial decline in performance when
it is excluded as compared to the baseline, emphasizing its indispensable nature. Our proposed
neural forecast architecture is built upon two fundamental methods known as the STHgCN
and STTN neural operators, which collectively make up the spatial inference component. The
following observations were made from the ablation studies:
• The “w/o - STHgCN" variant yielded inferior results compared to the benchmark, with a
difference of 12.03%, 10.66%, and 11.43% in terms of RMSE, MAE, and MAPE metrics,
respectively, for PeMSD4 dataset. Similarly on PeMSD8, the variants exhibited a 10.52%,
9.97%, and 9.89% decrease in performance with respect to the same metrics as compared
to the benchmark. These results provide evidence in support of the notion that incorporating
the STHgCN method in the learning process can result in better performance in multi-horizon
forecasting tasks.
• The “w/o - STTN" variant exhibited a slight increase in the RMSE, MAE, and MAPE
metrics compared to the baseline, with differences of 3.79%, 0.31%, and 0.61% on PeMSD4,
and 1.90%, 1.88%, and 2.29% on PeMSD8, respectively. Nonetheless, the integration
of the STTN method was found to be crucial, as it resulted in a notable improvement in
forecast accuracy.</p>
      <p>Based on the ablation studies, we can conclude that the STHgCN method is more effective than
the STTN method in accurately modeling spatio-temporal dependencies in MTS data, leading to
better multi-horizon forecasts. Additional results from the ablation study on benchmark datasets
are presented in Tables 3 - 9. The results indicate that the proposed JHgRF-Net framework
exhibits strong generalization capabilities, even when dealing with intricate patterns across an
extensive variety of datasets, and it can efficiently scale to handle large-scale graph datasets. In
summary, the ablation studies provide evidence in favor of the hypothesis that joint optimization
of spatial-temporal inference components can lead to enhanced performance in multi-horizon
forecasting tasks. In addition, the experimental findings support the rationale for including the
STHgCN and STTN neural operators to model the interdependencies among the multiple variables
and learn the dynamics of the complex interconnected systems.
8.2. Prediction error for multi-horizon forecasting
We conducted comprehensive experiments to evaluate the capability of the neural forecasting
architecture, JHgRF-Net, to generate accurate multi-horizon forecasts on several benchmark
datasets. The forecast errors of the JHgRF-Net framework on the benchmark datasets are shown
in Figure 10. The framework was evaluated using various metrics, such
as MAPE and MAE. Lower values of forecast errors indicate better model performance. The
results demonstrate that the framework outperformed the baselines on all the prediction horizons.
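For concreteness, the following is a minimal NumPy sketch (our own illustration, not the authors' evaluation code; the array shapes and the epsilon guard are assumptions) of how per-horizon MAE and MAPE can be computed:</p>
      <preformat>
import numpy as np

def horizon_metrics(y_true, y_pred, eps=1e-8):
    """MAE and MAPE per forecast horizon, on the original data scale.
    y_true, y_pred: arrays of shape (samples, horizon, sensors)."""
    abs_err = np.abs(y_true - y_pred)
    mae = abs_err.mean(axis=(0, 2))
    mape = (abs_err / np.maximum(np.abs(y_true), eps)).mean(axis=(0, 2)) * 100.0
    return mae, mape
      </preformat>
      <p>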
[Figure 10: MAE and MAPE versus forecast horizon for HgRNet and the baseline models STGODE, STG-NCDE, Z-GCNETs, and AGCRN on the benchmark datasets.]</p>
      <p>
These findings suggest that the proposed framework has the potential to accurately model the
nonlinear spatio-temporal dependencies and improve multi-horizon forecast accuracy through
effectively exploiting the relational inductive biases within the hypergraph-structured MTS data.
8.3. Irregular time series forecasting
The JHgRF-Net framework's ability to handle missing data in large, complex sensor networks was
evaluated by simulating two commonly observed missingness patterns ([19], [20]): point-missing
and block-missing. These patterns were created to mimic the missingness observed in real-world
data from such complex interconnected sensor networks. In the point-missing pattern, observations
of each variable were randomly dropped within a historical window, with missing ratios of 10%,
30%, and 50%. Similarly, in the block-missing pattern, blocks of available data for each variable
were randomly masked within a historical window, with the same missing ratios of 10%, 30%, and
50%. Moreover, sensor failures were simulated with a probability of 0.15%, leading to blocks of
missing data in the multivariate time series; a sketch of both masking schemes is given at the end
of this subsection. To evaluate
the JHgRF-Net framework performance on MTS data with missing values and to analyze
the impact of increasing missing data percentage on framework performance, we split several
benchmark datasets into three mutually exclusive sets - training, validation, and test - based
on their chronological order. The METR-LA and PEMS-BAY datasets were split in a ratio of
7:1:2, while the other datasets (PeMSD3, PeMSD4, PeMSD7, PeMSD7(M), PeMSD8, Traffic,
Solar-Energy, Electricity, and Exchange-Rate) were split in a ratio of 6:2:2. We utilized multiple
forecasting metrics to evaluate the JHgRF-Net framework performance in handling missing
data. We trained the JHgRF-Net framework on fully observed data to establish a benchmark
for the MTSF task with missing values. Tables 11 - 15 present the forecasting results of
the framework performance on the irregular-time-series datasets. The experimental studies
demonstrate that the JHgRF-Net framework is reliable and robust in handling missing data,
which is widely prevalent in real-world applications. The framework performance deteriorates
slightly compared to the benchmark when there is a lower percentage of missing data. With
a further increase in the percentage of missing data, the framework performance continues to
decline, resulting in lower forecast accuracy across all benchmark datasets, regardless of the
missing data pattern. Instead of relying on imputed values for model predictions, the proposed
framework utilizes only the observed data for multi-horizon forecasting, which demonstrates its
robustness in handling missing data. Moreover, by capturing complex dependencies and patterns within
multivariate time series data present in interconnected networks, the framework generates more
dependable out-of-sample forecasts, resulting in enhanced multi-horizon forecast accuracy.
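As a concrete illustration of the two missingness patterns described above, the following is a minimal NumPy sketch (our own reconstruction under stated assumptions; the block length of 12 steps and the helper names are ours, not the authors'):</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)

def point_missing_mask(T, N, ratio):
    """Randomly drop individual observations of each of the N variables
    within a historical window of length T; True marks a missing entry."""
    return rng.random((T, N)) < ratio

def block_missing_mask(T, N, ratio, failure_prob=0.0015, block_len=12):
    """Mask contiguous blocks of each variable's data and, with probability
    0.15%, simulate a full sensor failure (block length is an assumption)."""
    mask = np.zeros((T, N), dtype=bool)
    n_blocks = max(int(ratio * T / block_len), 1)
    for j in range(N):
        for s in rng.integers(0, max(T - block_len, 1), size=n_blocks):
            mask[s:s + block_len, j] = True
        if rng.random() < failure_prob:
            mask[:, j] = True
    return mask
      </preformat>
      <p>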
8.4. Sensitivity analysis
We carried out a hyperparameter study to evaluate the impact of specific hyperparameters on the
proposed framework performance. Our aim was to find the set of hyperparameter values that
achieves the best possible performance on the benchmark datasets. We tuned four
hyperparameters - embedding size (d), number of hyperedges (|E|), batch size (b), and learning
rate (lr) - within specific ranges of values. The chosen ranges were as follows: d ∈ {2, 6, 10, 18, 24},
|E| ∈ {2, 5, 8}, b ∈ {2, 6, 10, 18, 24, 32, 64}, and lr ∈ {1 × 10−1, 1 × 10−2, 1 × 10−3, 1 × 10−4}.
We have carefully selected ranges for the hyperparameters to prevent memory errors and limit
model size.</p>
      <p>Table: MAE of the JHgRF-Net framework on METR-LA and PEMS-BAY under the point-missing (w/Point) and block-missing (w/Block) patterns at 10%, 30%, and 50% missing ratios; the first row per dataset is the fully observed benchmark.
Dataset    Missing   Model       Horizon@3   Horizon@6   Horizon@12
METR-LA    -         JHgRF-Net   2.039       2.059       4.937
METR-LA    10%       w/Point     2.083       2.093       4.951
METR-LA    10%       w/Block     2.127       2.138       4.983
METR-LA    30%       w/Point     2.137       2.143       5.113
METR-LA    30%       w/Block     2.175       2.189       5.267
METR-LA    50%       w/Point     2.166       2.178       5.671
METR-LA    50%       w/Block     2.181       2.193       5.703
PEMS-BAY   -         JHgRF-Net   0.575       0.868       0.873
PEMS-BAY   10%       w/Point     0.579       0.871       0.877
PEMS-BAY   10%       w/Block     0.588       0.879       0.893
PEMS-BAY   30%       w/Point     0.613       0.911       0.937
PEMS-BAY   30%       w/Block     0.637       0.943       0.981
PEMS-BAY   50%       w/Point     0.689       0.988       1.103
PEMS-BAY   50%       w/Block     0.713       0.997       1.119</p>
      <p>We optimized the framework hyperparameters through grid search (a minimal sketch of this
search is given after the list of optimal configurations below) and measured model performance
using metrics such as MAE and RMSE. These experimental results
offered valuable insights into the effect of these hyperparameters on the framework's ability to
produce accurate forecasts in multivariate time series analysis, enhancing our understanding
of its overall performance. The optimal hyperparameter configurations that yielded the best
performance for each dataset are presented below:
• For PeMSD3, we set the batch size (b) to 18, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 5.
• For PeMSD4, we set the batch size (b) to 32, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 5.
• For PeMSD7, we set the batch size (b) to 6, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 6.
• For PeMSD8, we set the batch size (b) to 48, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 8.
• For PeMSD7(M), we set the batch size (b) to 48, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 6.
• For METR-LA, we set the batch size (b) to 48, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 5.
• For PEMS-BAY, we set the batch size (b) to 12, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 5.
• For SWaT, we set the batch size (b) to 256, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 5.
• For WADI, we set the batch size (b) to 64, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 12. The number of hyperedges is 5.
• For Electricity, we set the batch size (b) to 32, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 2.
• For Solar-Energy, we set the batch size (b) to 32, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 6.
• For Exchange-Rate, we set the batch size (b) to 32, the initial learning rate (lr) to 1 × 10−3,
and the embedding size (d) to 18. The number of hyperedges is 6.
• For Traffic, we set the batch size (b) to 8, the initial learning rate (lr) to 1 × 10−3, and
the embedding size (d) to 18. The number of hyperedges is 5.</p>
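      <p>As referenced above, the following is a minimal sketch of the grid search over these hyperparameters (our own illustration; train_and_eval is a hypothetical helper that trains the model under one configuration and returns its validation MAE):</p>
      <preformat>
import itertools

# Search space from the ranges reported in this section.
grid = {
    "d":  [2, 6, 10, 18, 24],          # embedding size
    "E":  [2, 5, 8],                   # number of hyperedges
    "b":  [2, 6, 10, 18, 24, 32, 64],  # batch size
    "lr": [1e-1, 1e-2, 1e-3, 1e-4],    # learning rate
}

def grid_search(train_and_eval):
    """Exhaustively evaluate every configuration; lower MAE is better."""
    best_cfg, best_mae = None, float("inf")
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        mae = train_and_eval(**cfg)
        if mae < best_mae:
            best_cfg, best_mae = cfg, mae
    return best_cfg, best_mae
      </preformat>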
      <p>The proportion of hypernodes connected to hyperedges in a hypergraph indicates the network’s
“edge density". A higher fraction of connected hypernodes suggests a denser network, while a
lower fraction implies a sparser network. Modifying the number of hyperedges enables control
over the hypergraph’s density. The hyperparameter study yields the optimal number of hyperedges
for an MTSF task by evaluating the impact of the number of predefined hyperedges on the learned
hypergraph structures. This study sheds light on how the hypergraph’s density changes as the
number of hyperedges increases or decreases for a particular dataset in the MTSF task, with Table
16 presenting the experimental results.</p>
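      <p>To make the notion of edge density concrete, the following minimal sketch (our own illustration; the random embeddings and hard threshold are assumptions, not the learned rewiring mechanism) computes the fraction of active node-hyperedge incidences:</p>
      <preformat>
import torch

def edge_density(H):
    """H: (num_nodes, num_hyperedges) binary incidence matrix; density is
    the fraction of node-hyperedge incidences that are active."""
    return H.float().mean().item()

# Assumed illustration: 307 nodes (as in PeMSD4), |E| = 5 hyperedges, d = 18.
nodes = torch.randn(307, 18)
hyperedges = torch.randn(5, 18)
H = (nodes @ hyperedges.t() > 0).int()  # hard threshold, for illustration only
print(f"edge density: {edge_density(H):.3f}")
      </preformat>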
      <p>
8.5. Time series forecasting visualization
Figure 3 depicts the ground truth, pointwise forecasts, and time-varying uncertainty estimates
obtained from the proposed w/Unc-JHgRF-Net framework. The visualizations provide valuable
insights into the framework performance and facilitate comprehensive analysis and result
interpretation. Existing methods for MTSF can model nonlinear spatio-temporal dependencies within
interconnected sensor networks but often fail to provide accurate measures of uncertainty. In
contrast, the proposed w/Unc-JHgRF-Net framework (the JHgRF-Net framework with local
uncertainty estimation) effectively utilizes relational inductive bias via a spatio-temporal propagation
architecture to quantitatively estimate the uncertainty of multi-horizon forecasts. The framework
accurately estimates uncertainty, outperforming existing methods that solely provide pointwise
forecasts for MTSF. The multifaceted visualizations show the framework's effectiveness in time
series representation learning for the MTSF task, making it a valuable contribution to the field of
multivariate time series analysis with uncertainty estimation.</p>
      <p>[Figure 3: ground truth, pointwise forecasts, and time-varying uncertainty estimates obtained through multi-horizon forecasting on the benchmark datasets. Panels (a)-(d): nodes 12, 99, 108, and 141 in PeMSD3; (e)-(h): nodes 149, 170, 211, and 287 in PeMSD4; (i)-(l): nodes 85, 104, 155, and 162 in PeMSD8; (m)-(p): nodes 16, 99, 196, and 290 in Electricity. Legend: Predictions, Ground Truth, Uncertainty. The vertical axes show traffic flow, power consumption, and solar power; the horizontal axes show the time instance.]</p>
      <p>
8.6. Forecasting uncertainty
The loss function for training the JHgRF-Net framework is the mean absolute error (MAE), which is calculated by comparing the pointwise forecasts of the model, \hat{X}^{(t:t+\tau-1)}, with the corresponding ground-truth data, X^{(t:t+\tau-1)}. It is mathematically described as follows,</p>
      <p>\mathcal{L}_{\mathrm{MAE}}(\theta) = \frac{1}{\tau} \sum_{i=t}^{t+\tau-1} \big| \hat{X}^{(i)} - X^{(i)} \big|</p>
      <p>During the training process, the model parameters \theta are fine-tuned to minimize the MAE loss function, represented as \mathcal{L}_{\mathrm{MAE}}(\theta). The w/Unc-JHgRF-Net is a variant of the JHgRF-Net that estimates the uncertainty in model predictions to enhance the reliability of decision-making. The framework utilizes a heteroscedastic Gaussian distribution to predict the time-varying uncertainty in model predictions, characterized by the mean \mu(X^{(t-\tau:t-1)}) and variance \sigma^{2}(X^{(t-\tau:t-1)}), where X^{(t-\tau:t-1)} denotes the input time series. A neural network \Psi takes the output of the spatio-temporal inference component and predicts the mean and standard deviation of the Gaussian distribution over the future observations,</p>
      <p>\hat{X}^{(t:t+\tau-1)} \sim \mathcal{N}\big( \mu(X^{(t-\tau:t-1)}),\, \sigma^{2}(X^{(t-\tau:t-1)}) \big), \qquad \big( \mu(X^{(t-\tau:t-1)}),\, \sigma^{2}(X^{(t-\tau:t-1)}) \big) = \Psi\big( \hat{X}^{(t:t+\tau-1)} \big)</p>
      <p>To predict a Gaussian (normal) distribution for sampling future observations, we use the maximum likelihood estimates (MLE) of the distribution's parameters, namely the mean and standard deviation, obtained by maximizing the likelihood of the observed data under the predicted distribution. In simpler terms, \mu(X^{(t-\tau:t-1)}) estimates the future values \hat{X}^{(t:t+\tau-1)} from the input time series, while \sigma^{2}(X^{(t-\tau:t-1)}) predicts the uncertainty in the model's predictions over the next \tau time steps, starting from the current time point t. The uncertainty modeling framework optimizes the negative Gaussian log-likelihood ([21]) of the observations, based on the estimates of the mean and variance. This approach provides a more comprehensive understanding and measurement of prediction uncertainty. The negative Gaussian log-likelihood measures the likelihood of the observations given the estimated mean and variance of the Gaussian distribution; a lower value indicates a better fit to the observed values. The Gaussian density used to predict the future observations is described by,</p>
      <p>p\big( X^{(t:t+\tau-1)};\, \mu(X^{(t-\tau:t-1)}),\, \sigma(X^{(t-\tau:t-1)}) \big) = \frac{1}{\sigma(X^{(t-\tau:t-1)})\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} \left( \frac{X^{(t:t+\tau-1)} - \mu(X^{(t-\tau:t-1)})}{\sigma(X^{(t-\tau:t-1)})} \right)^{2} \right)</p>
      <p>Taking the logarithm of both sides gives,</p>
      <p>\log p = -\log \sigma(X^{(t-\tau:t-1)}) - \frac{1}{2}\log(2\pi) - \frac{1}{2} \left( \frac{X^{(t:t+\tau-1)} - \mu(X^{(t-\tau:t-1)})}{\sigma(X^{(t-\tau:t-1)})} \right)^{2}</p>
      <p>Dropping the constant term C = \frac{1}{2}\log(2\pi) and summing over the dataset, the Gaussian negative log-likelihood loss (i.e., the negative log of the Gaussian probability density function) is described by,</p>
      <p>\mathcal{L}_{\mathrm{GaussianNLL}} = \sum_{t=1}^{T} \left[ \log \sigma(X^{(t-\tau:t-1)}) + \frac{\big( X^{(t:t+\tau-1)} - \mu(X^{(t-\tau:t-1)}) \big)^{2}}{2\,\sigma^{2}(X^{(t-\tau:t-1)})} \right]</p>
      <p>The negative Gaussian log-likelihood evaluates the fit of the estimated mean and variance to the observations in the training set, where T denotes the number of time steps. To recapitulate, the objective of the JHgRF-Net framework is to minimize the mean absolute error (MAE), while the w/Unc-JHgRF-Net framework minimizes \mathcal{L}_{\mathrm{GaussianNLL}} to provide quantitative uncertainty estimation. The w/Unc-JHgRF-Net framework uses a Gaussian likelihood function to model the mean and variance, optimizes the model parameters for data fitting, and provides uncertainty estimates.
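For illustration, the following is a minimal PyTorch sketch of a heteroscedastic Gaussian output head trained with this objective (our own reconstruction; the layer shapes and names are assumptions, and PyTorch's built-in torch.nn.GaussianNLLLoss implements an equivalent objective parameterized by the variance):</p>
      <preformat>
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Maps the spatio-temporal representation to a per-step mean and
    standard deviation of a heteroscedastic Gaussian."""
    def __init__(self, hidden_dim, horizon):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, horizon)
        self.log_sigma = nn.Linear(hidden_dim, horizon)  # log-std keeps sigma positive

    def forward(self, h):
        return self.mu(h), self.log_sigma(h).exp()

def gaussian_nll(y, mu, sigma):
    # log(sigma) + (y - mu)^2 / (2 sigma^2); constant terms dropped as above
    return (torch.log(sigma) + (y - mu) ** 2 / (2 * sigma ** 2)).mean()

head = GaussianHead(hidden_dim=64, horizon=12)
h, y = torch.randn(8, 64), torch.randn(8, 12)  # toy batch
mu, sigma = head(h)
loss = gaussian_nll(y, mu, sigma)
      </preformat>
      <p>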
8.7. Datasets
Our novel JHgRF-Net framework effectiveness was evaluated by comparing it with existing
benchmark models on several real-world datasets, including PeMSD3, PeMSD4, PeMSD7,
PeMSD7(M), PeMSD8, Electricity, Solar-Energy, Exchange-Rate, Traffic, METR-LA,
PEMS-BAY, SWaT, and WADI. Tables 17 - 19 provide additional information regarding these benchmark
datasets. The datasets used in this study comprise Solar-Energy, which records solar power
production from 137 PV plants in Alabama state at 10-minute intervals in 2016; Electricity, which
includes hourly records of electricity consumption(kWh) for 321 clients from 2012 to 2014;
Exchange-Rate, which collects daily exchange rates of eight foreign countries from 1990 to 2016;
and Traffic dataset provides hourly data on road occupancy rates(ranging from 0 to 1) that were
recorded on the various lanes of San Francisco Bay area freeways from 2015 to 2016, spanning
48 months in total. Moreover, METR-LA contains hourly data of traffic speed from loop detectors
on the highways in Los Angeles, while PEMS-BAY includes data on traffic volume and speed
from sensors on San Francisco Bay area freeways. Additionally, PeMS is an open-access dataset
that consists of five traffic network datasets (PeMSD3, PeMSD4, PeMSD7, PeMSD7(M), and
PeMSD8), obtained from the Caltrans Performance Measurement System across five California
districts, with data points available at 5-minute intervals, providing 288 data points per day. The
SWaT dataset consists of 11 days of continuous operation from an industrial water treatment
plant, comprising 7 days of normal operation and 4 days with 41 attack scenarios. Two versions
of the dataset are available, with Version 1 excluding the first 30 minutes and containing only
normal operation data. The WADI dataset is a collection of 16 days of continuous operation from
a testbed, encompassing 14 days of normal operation and 2 days with 15 attack scenarios. An
updated version is available that removes affected readings due to plant instability during certain
periods.</p>
      <p>Dataset     Variables   Training points   Testing points   Anomalies (%)
SWaT        51          47515             44986            11.97
WADI        127         118795            17275            5.99</p>
      <p>
8.8. Experimental Study design
In order to examine the effectiveness of the proposed models (JHgRF-Net and w/Unc-JHgRF-Net)
compared to the baselines, the various benchmark datasets were split into training, validation,
and test sets. The PEMS-BAY and METR-LA datasets were split with a 7/1/2 ratio, while all
other datasets were split with a 6/2/2 ratio. SWaT and WADI datasets have predefined splits,
where the training set is anomaly-free and the test set contains anomalies. We utilize the training
sets for our experiments. To prepare the SWaT and WADI datasets for training and evaluation,
we normalize each variable's data by rescaling it to fit within the range of [0, 1], as in [22]. For
all other benchmark datasets, each variable's data was preprocessed by scaling it to have zero mean
and unit variance. During the training and evaluation of forecasting models, various accuracy
metrics, including MAE, RMSE, and MAPE, were calculated based on the original scale of
the time series data. The JHgRF-Net architecture was trained for 30 epochs on the training
set to minimize forecast error. The validation set was employed to identify the optimal model
that improves overall performance and early stopping was utilized to prevent overfitting. The
framework performance was evaluated on the test set to examine its ability to perform well on
unseen data. To improve convergence, the model training was optimized by using a learning rate
scheduler to effectively learn from the training set. If there was no improvement in the evaluation
metrics on the validation set over five epochs, the learning rate was reduced by half. The Adam
optimizer was employed to fine-tune the trainable parameters of the models. An initial learning
rate of 1 × 10− 3 was set to minimize the MAE loss for the JHgRF-Net model and the negative
Gaussian log-likelihood for the w/Unc-JHgRF-Net model, ensuring a better fit between the
ground truth and the model predictions. The use of powerful GPUs, such as the NVIDIA Tesla T4,
Tesla V100, and GeForce RTX 2080, expedited the training process and allowed for the utilization
of larger models and datasets; all models were implemented in the PyTorch framework. Multiple
independent experimental runs were conducted, and the ensemble average was reported to ensure
reliable model evaluation. In Section 8.4, we reported the optimal hyperparameter values
of the learning algorithm for each dataset, such as embedding size, number of hyperedges, batch
size, and learning rate.
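A skeleton of this training setup is sketched below (our own assumption of the described procedure, not the authors' code; DummyNet, make_batch, and the early-stopping patience of 10 epochs are stand-ins and assumptions):</p>
      <preformat>
import torch
import torch.nn as nn

class DummyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(12, 12)
    def forward(self, x):
        return self.lin(x)

def make_batch():
    x = torch.randn(32, 12)
    return x, x + 0.1 * torch.randn(32, 12)

model = DummyNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial lr 1e-3
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)  # halve lr after five idle epochs
loss_fn = nn.L1Loss()  # MAE loss

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(30):  # 30 training epochs
    model.train()
    x, y = make_batch()
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        xv, yv = make_batch()
        val_mae = loss_fn(model(xv), yv).item()
    scheduler.step(val_mae)
    if val_mae < best_val:
        best_val, bad_epochs = val_mae, 0  # best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
      </preformat>
      <p>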
8.9. Baselines
Established algorithms are commonly used as benchmarks for evaluating the performance of
proposed neural forecasting models such as JHgRF-Net and w/Unc-JHgRF-Net on the MTSF
task. The selection of benchmark algorithms depends on their extensive usage in the literature
and their demonstrated performance on benchmark datasets.
      </p>
      <p>
        • HA [23] is a time series prediction technique that involves using the average of a predefined
historical window of observations to predict the next value in the time series.
• ARIMA is a statistical analysis model commonly used for handling non-stationary time
series data, but it has limitations in handling long-term trends or changing seasonal patterns
over time.
• VAR([23]) is a linear multivariate time series model that extends the univariate
autoregressive(AR) model. It is designed to capture the inter-dependencies among multiple time
series variables for analyzing and forecasting complex systems.
• TCN( [24]) is specifically designed to handle sequential data in multistep-ahead time
series prediction tasks by using causal convolutions and dilation layers to incorporate past
information and learn long-range correlations. The use of these techniques allows the
model to capture and learn relationships between multiple variables in time series data,
making it effective in handling such complex data.
• FC-LSTM( [25]) is an encoder-decoder architecture that employs Long Short-Term
Memory(LSTM) units with peephole connections to perform multistep-ahead time series
prediction. By capturing both short-term and long-term relationships among multiple time
series variables in MTS data, this architecture effectively models the intricate patterns and
relationships, resulting in a highly complex and nonlinear representation of the data and
improved forecasting accuracy.
• GRU-ED( [26]) is an encoder-decoder architecture that utilizes Gated Recurrent Unit
(GRU) units to handle sequential data in multi-horizon time series prediction tasks. By
capturing relevant information from previous time steps, this framework effectively models
the sequential dependencies in the data.
• DSANet( [27]) is a time series forecasting method that utilizes convolutional neural
networks (CNNs) to capture long-range intra-temporal dependencies among multiple
time series, without relying on recurrent networks. In addition, it further incorporates
self-attention blocks to adaptively capture interdependencies and generate multi-horizon
forecasts for MTS data, resulting in an extremely effective and precise forecasting method.
• DCRNN( [28]) is a highly effective technique that combines graph convolution with
recurrent neural networks, utilizing bidirectional random walks on graphs. This unique
approach enables the model to predict multistep-ahead forecasts in MTS data through
an encoder-decoder architecture, which effectively captures complex spatial-temporal
dependencies in the data, resulting in highly accurate predictions.
• STGCN ([29]) is a cutting-edge technique that seamlessly integrates graph convolution
and gated temporal convolution networks. By accurately capturing the spatial-temporal
correlations among multiple time series variables, this approach enables multi-horizon time
series prediction with high precision and accuracy.
• GraphWaveNet( [30]) uses a wave-based propagation mechanism and graph
representations that are computed from dilated causal convolution neural networks to model MTS
data. By jointly learning an adaptive dependency matrix and capturing spatial-temporal
dependencies, this method effectively captures the dependencies between multiple time
series variables, leading to improved multistep-ahead time series prediction accuracy.
• ASTGCN( [31]) utilizes an attention-based spatio-temporal graph convolutional neural
network to capture inter- and intra-dependencies for predicting multihorizon forecasts in
time series data. By using attention mechanisms, this technique effectively models the
spatial-temporal relationships between multiple time series variables, resulting in highly
accurate predictions.
• STG2Seq( [32]) is an advanced technique for predicting multistep-ahead forecasts in
MTS data that incorporates gated graph convolutional networks (GGCNs) with a
sequence-to-sequence (seq2seq) architecture featuring attention mechanisms. Through this unique
combination, the model captures dynamic temporal and cross-channel information to
effectively model the complex relationships among multiple time series variables, resulting
in highly accurate and precise predictions.
• STSGCN( [33]) is a time series prediction technique that utilizes multiple layers of
spatial-temporal graph convolutional networks to predict multistep-ahead forecasts in MTS data.
This approach captures localized intra- and inter-dependencies in the graph-structured MTS
data, effectively modeling the complex relationships between multiple time series variables
and resulting in high precision and accuracy for multi-horizon time series prediction.
• LSGCN( [34]) is a method used for multi-horizon time series forecasting, utilizing a
graph attention mechanism integrated into a spatial gated block to predict multistep-ahead
forecasts in MTS data. By utilizing attention mechanisms, it can effectively model dynamic
spatial-temporal dependencies between multiple time series variables, leading to improved
forecasting accuracy.
• AGCRN( [35]) predicts multistep-ahead forecasts in MTS data by utilizing a data-adaptive
graph structure learning method. This approach captures node-specific intra- and
inter-correlations, effectively modeling complex spatial-temporal dependencies among the
multiple time series variables, resulting in improved forecasting accuracy.
• STFGNN( [36]) is a time series prediction technique that fuses representations obtained
from temporal graph and gated convolutional neural networks to predict multistep-ahead
forecasts in MTS data. By operating these networks in parallel and learning
spatial-temporal correlations, STFGNN effectively models the complex relationships between
multiple time series variables, leading to improved forecasting accuracy.
• Z-GCNETs( [37]) predicts multi-horizon forecasts in MTS data by integrating a time-aware
zigzag topological layer into time-conditioned graph convolutional networks. It captures
hidden spatial-temporal dependencies and salient time-conditioned topological information
to effectively model complex relationships between multiple time series variables while
considering their topological properties. This approach performs well in time series
prediction tasks that require the modeling of complex dependencies and relationships.
• STGODE( [38]) predicts multistep-ahead forecasts in MTS data using a tensor-based
ordinary differential equation (ODE) to capture inter- and intra-dependency dynamics
among multiple time series variables. By effectively representing the MTS data, the
model can capture the complex relationships among variables and their temporal dynamics,
resulting in highly accurate and reliable predictions.
• GDN( [22]) is a graph-based anomaly detection model that leverages graph embeddings to
learn the inherent complex graph structure underlying the MTS data and further employs a
graph forecasting network to compute deviation scores for anomaly detection based on a
threshold on forecast error.
• NRI( [39]) is an unsupervised technique that learns to deduce interactions and dynamics
from observational data within a variational auto-encoder framework. This model
effectively predicts the inherent dynamics of complex systems, making it a valuable tool for a
wide range of applications.
• MTGNN( [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) presents a graph neural network framework for modeling multivariate time
series data. This innovative approach automatically extracts uni-directed relations among
multiple time series variables through a graph learning module, while also incorporating
mix-hop propagation and dilated inception layers to capture spatial and temporal
dependencies within the time series. The result is a highly accurate model capable of multi-horizon
forecasts, making it a suitable tool for various time-series applications.
• The vanilla LSTM ([?]) predicts multistep-ahead forecasts in MTS data using a gating
mechanism. The LSTM-U, consisting of N univariate LSTMs, treats all time series
variables as independent and performs univariate multi-horizon forecasting.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Graph wavenet for deep spatial-temporal graph modeling</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1907</fpage>
          -
          <lpage>1913</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Adaptive graph convolutional recurrent network for traffic forecasting</article-title>
          , in: NeurIPS,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Connecting the dots: Multivariate time series forecasting with graph neural networks</article-title>
          ,
          <source>in: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery &amp; data mining</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>753</fpage>
          -
          <lpage>763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3634</fpage>
          -
          <lpage>3640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Segovia-Dominguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Coskunuzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gel</surname>
          </string-name>
          ,
          <article-title>TAMP-s2GCNets: Coupling time-aware multipersistence knowledge representation with spatio-supra graph convolutional networks for time-series forecasting</article-title>
          , in: International Conference on Learning Representations,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shahabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Diffusion convolutional recurrent neural network: Data-driven traffic forecasting</article-title>
          ,
          <source>in: ICLR (Poster)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          ,
          <article-title>Language modeling with gated convolutional networks</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>933</fpage>
          -
          <lpage>941</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Poole</surname>
          </string-name>
          ,
          <article-title>Categorical reparameterization with gumbel-softmax</article-title>
          ,
          <source>arXiv preprint arXiv:1611.01144</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Van</given-names>
            <surname>Merriënboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Learning phrase representations using rnn encoder-decoder for statistical machine translation</article-title>
          ,
          <source>arXiv preprint arXiv:1406.1078</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Layer normalization</article-title>
          ,
          <source>arXiv preprint arXiv:1607.06450</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <article-title>On the equivalence between temporal and static equivariant graph representations</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>7052</fpage>
          -
          <lpage>7076</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Petty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Skabardonis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Varaiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Freeway performance measurement system: mining loop detector data</article-title>
          ,
          <source>Transportation Research Record</source>
          <volume>1748</volume>
          (
          <year>2001</year>
          )
          <fpage>96</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shahabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Diffusion convolutional recurrent neural network: Data-driven traffic forecasting</article-title>
          ,
          <source>arXiv preprint arXiv:1707.01926</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. O.</given-names>
            <surname>Tippenhauer</surname>
          </string-name>
          ,
          <article-title>SWaT: A water treatment testbed for research and training on ICS security</article-title>
          ,
          <source>in: 2016 international workshop on cyber-physical systems for smart water networks (CySWater), IEEE</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Palleti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <article-title>WADI: a water distribution testbed for research in the design of secure cyber physical systems</article-title>
          ,
          <source>in: Proceedings of the 3rd international workshop on cyber-physical systems for smart water networks</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>