=Paper=
{{Paper
|id=Vol-3762/480
|storemode=property
|title=Empowering Time-Series Forecasting in Official Statistics through Transformers
|pdfUrl=https://ceur-ws.org/Vol-3762/480.pdf
|volume=Vol-3762
|authors=Alberico Emanuele,Francesco Pugliese,Massimo De Cubellis,Angela Pappagallo
|dblpUrl=https://dblp.org/rec/conf/ital-ia/EmanuelePCP24
}}
==Empowering Time-Series Forecasting in Official Statistics through Transformers==
Alberico Emanuele¹, Francesco Pugliese², Massimo De Cubellis², Angela Pappagallo²

¹ Whitehall Reply, Via del Giorgione, 59, Rome, 00147, Italy
² Italian National Institute of Statistics – Istat, Via Cesare Balbo, 16, Rome, 00184, Italy
Abstract
Artificial Intelligence (AI) plays a crucial role in promoting innovation in public administrations. Extensive research on the use of AI to support and improve traditional statistical production processes has been carried out at Istat. This paper presents an approach based on Transformer neural networks for time series forecasting, applying a neural architecture originally developed for Natural Language Processing to a new context. The experiment analyses four indicators of significant socio-economic interest, namely Gross Domestic Product (GDP), unemployment rate, inflation, and the consumer confidence index, using both Transformers and established methods and models. The paper provides a comparative analysis of the performance of Transformers against other models used for time series forecasting, namely PROPHET, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). The analysis shows that Transformers outperform the other methods in all the experiments considered.
Keywords
Deep Learning, Transformers, Forecasting, Time Series
1. Introduction

In recent years, the rapid progress of information technology has significantly advanced artificial intelligence (AI), especially machine learning and deep learning, transforming problem-solving and obtaining results that were previously unachievable. The innovations brought by the development of AI have also been applied in the context of official statistics. In time series analysis, in addition to traditional statistical methods such as ARIMA, deep neural network models such as LSTM and GRU have proven to be particularly effective in capturing long-term dependencies in sequential data.

Transformer models, initially developed for Natural Language Processing (NLP), are now also being used in time series analysis. These models eliminate the need for recurrent architectures, relying on attention mechanisms to capture contextual relationships, which extends their applications beyond NLP. This paper analyses Transformers applied to the forecasting of relevant Istat socio-economic indicators: Gross Domestic Product (GDP), unemployment rate, inflation (CPI) and the consumer confidence index. The results obtained with Transformers are compared with those obtained with the PROPHET, LSTM and GRU models for each of the indicators. In the following sections, we first provide an overview of the related work, then describe the methods applied and the results obtained, and finally draw conclusions.
2. Related Works

Temporal data is ubiquitous in today's data-driven world. Time Series Forecasting (TSF) [1] is a long-standing task with a wide range of applications. Over the past decades, TSF solutions have evolved from traditional statistical methods (such as ARIMA [2]) to deep learning-based solutions such as Recurrent Neural Networks (RNNs) [3] and Temporal Convolutional Networks (TCNs) [4]. Another commonly employed method is Exponential Smoothing, including variants such as the Holt-Winters seasonal method [5], which is effective for capturing trends and seasonality in time series data. Recently, there has been a surge in Transformer-based solutions for time series analysis, as highlighted in [6]. The main strength of Transformers lies in their multi-head self-attention mechanism, which has a remarkable ability to extract semantic correlations between elements in a long sequence. However, although the use of different positional encoding techniques can preserve some information about the order, there is still an inevitable loss of temporal information after the self-attention mechanism is applied. This is usually not a serious problem for semantically rich applications such as NLP [7], where the semantic meaning of a sentence is largely preserved even if some words are reordered [8]. In time series analysis, however, the semantic context of the numerical data itself is typically absent, and we are primarily interested in modelling temporal changes between a continuous set of points [9]. In other words, the order itself plays the most important role. Consequently, some researchers have asked the following question: are Transformers really effective for long-term forecasting of time series? [10].
3. Methods

As mentioned in the previous sections, the aim of this paper is to demonstrate the effectiveness of Transformer models for time series forecasting. To do so, a comparative analysis has been carried out against methods that have been the state of the art in time series forecasting for many years. Specifically, the performance of a Transformer model has been compared with that of three other models: a Long Short-Term Memory (LSTM) model, a Gated Recurrent Unit (GRU), and a PROPHET model.

LSTM networks [11] are a specialised type of Recurrent Neural Network (RNN) tailored to the challenge of capturing long-term dependencies. While RNNs excel at using past information for current tasks, they can struggle when the time gap between relevant information and its application is substantial. Although they share a general structure with RNNs, LSTMs have a distinct architecture in their repeating module: unlike RNNs, which typically consist of a single neural network layer, LSTMs have four interacting layers. A central component of LSTMs is the cell state (C), which persists throughout the chain and is modified by gate mechanisms. These gates facilitate the selective retention or addition of information to the cell state, thereby enhancing the network's ability to capture and maintain long-term dependencies.
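As a concrete illustration, a minimal one-step-ahead LSTM forecaster might look as follows. This is a sketch only: the paper does not publish its architecture, so the hidden size and the 12-step input window are assumptions.

```python
# Minimal one-step-ahead LSTM forecaster (illustrative sketch; sizes are assumptions).
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, hidden_size=64):
        super().__init__()
        # Univariate input: each time step carries a single (scaled) value.
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, window, 1)
        out, _ = self.lstm(x)             # out: (batch, window, hidden_size)
        return self.head(out[:, -1])      # forecast the next value from the last state

model = LSTMForecaster()
windows = torch.randn(8, 12, 1)           # a batch of 12-step windows (assumed length)
prediction = model(windows)               # shape: (8, 1)
```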
On the other hand, the GRU [12] can be seen as a simplification of the LSTM in which no explicit cell state is used. Another difference is that the LSTM directly controls the flow of information in the hidden state using separate forget and output gates, whereas a GRU uses a single reset gate to achieve the same goal. The underlying idea of Gated Recurrent Units is nevertheless quite similar to that of LSTMs in terms of how hidden states are partially reset. Just as the LSTM uses input, output and forget gates to decide how much information to carry over from one time step to the next, the GRU uses update and reset gates. The GRU has no separate internal memory and needs fewer gates to perform the update from one hidden state to another. This raises the question of the specific function of the update and reset gates. The reset gate determines how much of the hidden state from the previous time step enters the matrix-based update, as in a plain RNN. The update gate determines the relative strength of the contributions of this matrix-based update and of a more direct copy of the hidden vector from the previous time step. By allowing a direct (partial) copy of hidden states from the previous step, the gradient flow becomes more stable during backpropagation; the update gate thus simultaneously plays the roles of the LSTM's input and forget gates. Although the GRU is a simplification of the LSTM, it should not be considered a special case of it. Research has shown that the two models perform similarly, with relative performance depending on the task. The GRU is easier to implement and more efficient, and it may generalise slightly better with less data due to its fewer parameters, while the LSTM may be preferable with larger amounts of data.
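For concreteness, the standard GRU update from [12] can be written as follows (bias terms omitted; $\sigma$ is the logistic sigmoid and $\odot$ the element-wise product):

\[
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
\tilde{h}_t &= \tanh\!\big(W x_t + U (r_t \odot h_{t-1})\big) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
\]

The reset gate $r_t$ scales the previous hidden state inside the matrix-based candidate update $\tilde{h}_t$, while the update gate $z_t$ interpolates between copying $h_{t-1}$ directly and adopting the candidate, which is what stabilises the gradient flow described above.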
PROPHET, developed by Facebook (now Meta) in 2017 [13], is a time series forecasting model specifically designed to handle the common characteristics of economic time series. Notably, the model exposes intuitive parameters that can be adjusted without requiring knowledge of the underlying model details, which allows analysts to tune it effectively. It uses a decomposable time series model consisting of three main components, combined additively: trend, seasonality, and holidays. The first two components are also encountered in ARIMA modelling, while the third, holidays, represents the effects of holidays that occur at potentially irregular intervals over one or more days. In this model, only time is used as a regressor. Forecasting is approached as a curve-fitting exercise, which is fundamentally distinct from time series models that explicitly consider the temporal dependence structure of the data. Although this formulation sacrifices some important inferential benefits of a generative model like ARIMA, it offers several practical advantages: it easily accommodates seasonality with multiple periods, it enables the analyst to make different assumptions about trends, and, unlike ARIMA models, which are designed specifically for forecasting univariate time series, it does not require regularly spaced measurements. The fitting process is fast and allows for interactive exploration of various model specifications.
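In the notation of [13], the model is $y(t) = g(t) + s(t) + h(t) + \varepsilon_t$, with trend $g$, seasonality $s$ and holiday effects $h$. As an illustration of how little configuration this requires, a minimal Prophet fit might look as follows; the file name, column names, monthly frequency and 12-month horizon are assumptions, since the paper does not report its configuration:

```python
# Minimal PROPHET sketch (illustrative; the paper's actual configuration is not published).
import pandas as pd
from prophet import Prophet

# Prophet expects a dataframe with columns 'ds' (dates) and 'y' (values).
df = pd.read_csv("inflation_monthly.csv")              # hypothetical input file
df = df.rename(columns={"date": "ds", "value": "y"})   # hypothetical column names

m = Prophet()                 # defaults: additive trend + seasonality (+ holidays if given)
m.fit(df)

future = m.make_future_dataframe(periods=12, freq="MS")  # 12 months ahead (assumed horizon)
forecast = m.predict(future)  # yields yhat with yhat_lower / yhat_upper intervals
```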
Finally, Transformers are a type of machine learning model introduced in 2017 by Google researchers [14]. They are artificial neural networks designed to process sequences of data, such as words in a text. Transformers differ from other neural network architectures such as RNNs in that they rely on self-attention rather than recurrence. Self-attention allows relevant parts of an input sequence to be given more weight during the processing of a specific element, which enables Transformers to process information in parallel rather than sequentially, unlike RNNs. The Transformer model has been successfully applied in various natural language processing tasks, including automatic translation, text summarization, text generation, and speech recognition. Its architecture consists of an encoder-decoder structure: the encoder maps a sequence of input symbol representations to a sequence of continuous representations, and the decoder generates an output sequence one element at a time based on these inner representations. The model is autoregressive: at each step, it uses previously generated values as additional input for generating the next ones. The Transformer implements this architecture using multiple self-attention layers and fully connected point-wise layers for both the encoder and the decoder.
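Since the paper does not publish its implementation, the following encoder-only PyTorch sketch merely illustrates the ingredients just described: an input projection, a sinusoidal temporal encoding, stacked self-attention layers, and a forecasting head. The model size, number of heads, depth and window length are all illustrative assumptions.

```python
# Encoder-only Transformer forecaster (illustrative sketch; all sizes are assumptions).
import math
import torch
import torch.nn as nn

class TransformerForecaster(nn.Module):
    def __init__(self, window=12, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)              # project scalar values to d_model
        # Fixed sinusoidal positional (temporal) encoding, as in [14].
        pos = torch.arange(window).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(window, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                               # x: (batch, window, 1)
        h = self.embed(x) + self.pe                     # add temporal encoding
        h = self.encoder(h)                             # parallel self-attention layers
        return self.head(h[:, -1])                      # one-step-ahead forecast

model = TransformerForecaster()
prediction = model(torch.randn(8, 12, 1))               # shape: (8, 1)
```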
An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The query vector, derived from the input sequence, represents the current input token or element and is used to determine which parts of the sequence are relevant to it. The key vector, also derived from the input sequence, represents the context or information contributed by the other tokens and serves as a reference against which the query is compared to determine the relevance of each token in the sequence. The value vector represents the information or content associated with each token in the sequence; like the query and key vectors, it is derived from the input sequence and provides the actual representation of each token. The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function between the query and the corresponding key. The input comprises queries and keys of dimension $d_k$ and values of dimension $d_v$. The dot product of the query with all keys is computed, each result is divided by $\sqrt{d_k}$, and a SoftMax function is applied to obtain the weights on the values. In practice, the attention function is computed on a matrix $Q$ containing a set of queries, with the keys and values likewise packed into matrices $K$ and $V$. This allows attention to be computed for all queries simultaneously, making the process fully parallelizable.
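In the notation of [14], this is the scaled dot-product attention:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \]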
4. Experiment

In this experiment we assess the effectiveness of Transformer models in time series forecasting by comparing them with three other established models: PROPHET, GRU, and LSTM.

4.1. Datasets and Pre-processing

The experiment analysed four socio-economic time series collected from Istat. Gross Domestic Product (GDP) measures a country's economic activity over a period, usually a quarter or a year. The unemployment rate reflects labour market conditions, with increases signalling economic contraction and decreases indicating recovery. Inflation reflects continuous price increases and is crucial for assessing consumer purchasing power. The Consumer Confidence Index measures public economic sentiment, which influences spending and investment decisions and often anticipates future economic trends, guiding policy. The GDP data cover the period from March 1, 1990, to March 1, 2023, on a quarterly basis. The unemployment data range from March 1, 2004, to March 1, 2023, on a quarterly basis. The inflation data range from January 1, 1997, to June 1, 2023, on a monthly basis. The Consumer Confidence Index data range from January 1, 1998, to May 1, 2023, on a monthly basis.
Before feeding the data into the Transformer, a pre-processing phase was carried out. This involved analysing the data and then cleaning, transforming, and preparing the raw series to make them suitable for further analysis and for use in machine learning models and algorithms. This phase is crucial because real data can be dirty, incomplete, or in formats unsuitable for analysis; pre-processing aims to make the data more accurate, consistent, and usable. In the experiment at hand, the pre-processing of the available data essentially consists of three steps: interpolation to convert the quarterly series into monthly series, thereby increasing the sample size; data normalisation; and a final transformation of the series into a format suitable for supervised learning.
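A minimal sketch of these three steps with pandas/NumPy might look as follows; the interpolation method, the [0, 1] scaling and the 12-step window are assumptions, since the paper does not report them:

```python
# Illustrative pre-processing: interpolation, normalisation, supervised windowing.
import numpy as np
import pandas as pd

# Hypothetical file and column names.
series = pd.read_csv("gdp_quarterly.csv", index_col="date", parse_dates=True)["value"]

# 1) Quarterly -> monthly via linear interpolation, increasing the sample size.
monthly = series.resample("MS").interpolate(method="linear")

# 2) Min-max normalisation to [0, 1] (assumed scaling).
scaled = (monthly - monthly.min()) / (monthly.max() - monthly.min())

# 3) Sliding windows: 12 past values (X) predict the next one (y).
WINDOW = 12
values = scaled.to_numpy(dtype=np.float32)
X = np.stack([values[i : i + WINDOW] for i in range(len(values) - WINDOW)])
y = values[WINDOW:]
X = X[..., None]   # shape (n_samples, WINDOW, 1) for the sequence models above
```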
4.2. Results

This section presents the results obtained by the PROPHET, GRU, LSTM, and Transformer models on the GDP, Inflation, Consumer Confidence Index, and Unemployment Rate data. The models' performance was evaluated using the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R² (coefficient of determination) metrics. RMSE measures the dispersion between the observed values and the values predicted by a model. MAE is the average of the absolute differences between predicted and observed values. R² measures how well a regression model fits the data.
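For a test set of $n$ observations $y_i$ with predictions $\hat{y}_i$ and mean $\bar{y}$, these standard metrics are:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \]

Note that $R^2 \le 0$ indicates a fit no better than predicting the mean, which is how the poor PROPHET results below can be read.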
Table 1 displays the metrics calculated on the denormalised GDP dataset.

Table 1
Metrics calculated on the GDP dataset.

Model        RMSE      MAE      R²
PROPHET      15899.18  9208.55  -0.003
GRU          11207.66  7490.36  0.775
LSTM         10492.32  6611.40  0.803
Transformer  4080.38   2296.78  0.970
The metrics indicate that all the models perform reasonably well on the GDP dataset, with the Transformer architecture performing best, while the LSTM's performance is comparable to that of the GRU. However, the PROPHET model performs poorly on the test data, as its errors exceed those generated by the mean. These results suggest that the attention mechanism, along with the introduction of temporal encoding, offers significant advantages over standard approaches such as recurrence. Figure 1 shows the predictions of the Transformer, LSTM, and GRU models on the GDP test set. All predictions capture the trend of the series, with some deviations, particularly in the case of the LSTM and GRU. The Transformer model's prediction is precise and captures the local maxima and minima of the series effectively. It is important to note that the test dataset includes the period of the Covid-19 pandemic, identifiable as the depression exhibited by the curve in the figure. Despite this being an unpredictable and anomalous event, the Transformer model captures the trend of the curve better than the two recurrent networks, especially in the period following the pandemic, albeit with a slight delay. While the LSTM and GRU models lose their predictive capability after the Covid-19 period, the Transformer continues to track the trend of the series accurately.

Figure 1: Forecasting on GDP for GRU, LSTM, and Transformer models.

Table 2, Table 3, and Table 4 display the metrics calculated on the denormalised Unemployment Rate, Inflation, and Consumer Confidence Index (CCI) datasets, respectively.
Table 2
Metrics calculated on the Unemployment Rate dataset.

Model        RMSE  MAE   R²
PROPHET      0.30  0.22  0.07
GRU          0.26  0.19  0.91
LSTM         0.26  0.20  0.91
Transformer  0.21  0.14  0.94

Despite a slightly lower performance compared to the previous series, the Transformer model still outperforms the other models on the unemployment rate dataset. Table 2 shows that the Transformer metrics are close to those of the LSTM and GRU, though consistently better. Figure 2 displays the predictions made by the Transformer, LSTM, and GRU models on the Unemployment Rate test set. As with the GDP series, the test dataset includes the Covid-19 pandemic period. The Transformer model remains the reference model in terms of behaviour, particularly in the period after the pandemic, while the LSTM and GRU models produce less accurate predictions.

Figure 2: Forecasting on Unemployment Rate for GRU, LSTM, and Transformer models.

Table 3
Metrics calculated on the Inflation dataset.

Model        RMSE  MAE   R²
PROPHET      1.93  0.81  -0.00003
GRU          1.31  0.63  0.88
LSTM         1.33  0.65  0.87
Transformer  1.22  0.59  0.89

Table 4
Metrics calculated on the CCI dataset.

Model        RMSE  MAE   R²
PROPHET      4.62  3.56  -0.001
GRU          3.76  2.87  0.75
LSTM         3.77  2.87  0.74
Transformer  3.62  2.70  0.76
From Table 3 and Table 4 it can once again be observed that the Transformer model outperforms the other models, albeit on smaller samples than in the previous examples. This could be attributed to the fact that Transformer models perform better with larger amounts of data: with a reduced sample size, the performance of this architecture moves closer to that of the LSTM and GRU networks, although it remains better at the predictive level. Due to space limitations, we do not display the predictions made on the Inflation and Consumer Confidence Index test sets.
5. Conclusions

In this work, we investigated the use of a Transformer-based architecture for Time Series Forecasting (TSF). The architecture was applied to four different problems, and its performance was compared with that of three established TSF models: PROPHET, GRU, and LSTM. The results indicate that the Transformer architecture outperforms the traditional methods in all the experiments, demonstrating that it is effective for forecasting time series, beyond its original field of Natural Language Processing. However, in some cases the Transformer's performance approached that of the traditional recurrent methods. This suggests that, in the TSF domain, the benefits of the attention mechanism become more evident as the amount of available data grows: the Transformer performed best on the GDP series, which had the highest number of observations among the series considered. Thus, Transformers have proven to be effective in long-term time series forecasting. It is nevertheless important to emphasise the need for careful pre-processing and thorough data examination before feeding the data into the Transformer model. Additionally, the size of the dataset plays a crucial role in the Transformer's performance: larger datasets tend to produce better results, owing to the greater amount of information available for learning.
References
[1] Chatfield, Chris. Time-Series Forecasting. Chapman and Hall/CRC, 2000.
[2] Shumway, Robert H., et al. "ARIMA models." Time Series Analysis and Its Applications: With R Examples (2017): 75-163.
[3] Hewamalage, Hansika, Christoph Bergmeir, and Kasun Bandara. "Recurrent neural networks for time series forecasting: Current status and future directions." International Journal of Forecasting 37.1 (2021): 388-427.
[4] Wan, Renzhuo, et al. "Multivariate temporal convolutional network: A deep neural networks approach for multivariate time series forecasting." Electronics 8.8 (2019): 876.
[5] Hyndman, Rob J., and George Athanasopoulos. Forecasting: Principles and Practice. OTexts, 2018.
[6] Wen, Qingsong, et al. "Transformers in time series: A survey." arXiv preprint arXiv:2202.07125 (2022).
[7] Sun, Chi, et al. "How to fine-tune BERT for text classification?" Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, October 18-20, 2019, Proceedings. Springer International Publishing, 2019.
[8] Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).
[9] Boussif, Oussama, et al. "Improving day-ahead solar irradiance time series forecasting by leveraging spatio-temporal context." Advances in Neural Information Processing Systems 36 (2024).
[10] Zeng, Ailing, et al. "Are transformers effective for time series forecasting?" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. No. 9. 2023.
[11] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.
[12] Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
[13] Taylor, Sean J., and Benjamin Letham. "Forecasting at scale." The American Statistician 72.1 (2018): 37-45.
[14] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).