Empowering Time-Series Forecasting in Official Statistics
                                through Transformers
                                Alberico Emanuele1, Francesco Pugliese2, Massimo De Cubellis2, Angela
                                Pappagallo2

                                1Whitehall Reply, Via del Giorgione, 59, Rome, 00147, Italy
                                2Italian National Institute of Statistics – Istat, Via Cesare Balbo, 16, Rome, 00184, Italy


                                                    Abstract
                                                    Artificial Intelligence (AI) is playing a crucial role in the promotion of innovation in public
                                                    administrations. Extensive research and studies on the use of AI to support and improve
                                                    traditional statistical production processes have been carried out at Istat. This paper presents an
                                                    approach based on Transformer neural networks for forecasting time series. The
                                                    experiment applies a neural network architecture from Natural Language Processing to a new
                                                    context, namely the prediction of time series. The experiment analyzes four indicators of
                                                    significant socio-economic interest, namely Gross Domestic Product (GDP), unemployment rate,
                                                    inflation, and consumer confidence rate, using both Transformers and traditional methods and
                                                    models. This paper provides a comparative analysis between the performance of Transformers
                                                    and other methods used in the context of time series forecasting, such as Auto
                                                    Regressive Integrated Moving Average (ARIMA), Long Short-Term Memory (LSTM) and Gated
                                                    Recurrent Unit (GRU). The analysis shows that Transformers outperform the other methods
                                                    on all four indicators in the chosen experiment.


                                                    Keywords
                                                    Deep Learning, Transformers, Forecasting, Time Series


                                1. Introduction

                                In recent years, the rapid progress of information technology has significantly advanced
                                artificial intelligence (AI), especially machine learning and deep learning, transforming
                                problem-solving and obtaining results that were previously unachievable. The innovations
                                brought by the development of AI have also been applied in the context of official
                                statistics. In time series analysis, in addition to traditional statistical methods such as
                                ARIMA, deep neural network models such as LSTM and GRU have proven to be particularly
                                effective in capturing long-term dependencies in sequential data.
                                Transformer models, initially developed for Natural Language Processing (NLP), are now also
                                being used in time series analysis. These models eliminate the need for recurrent
                                architectures, relying on attention mechanisms to capture contextual relationships, which
                                extends their applications beyond NLP. This paper analyses Transformers applied to the time
                                series forecasting of relevant Istat socio-economic indicators, such as Gross Domestic
                                Product (GDP), unemployment rate, inflation (CPI) and consumer confidence index. The results
                                obtained with Transformers will be compared with those obtained with the traditional ARIMA,
                                LSTM and GRU techniques for each of the mentioned indicators. In the following paragraphs,
                                we will first provide an overview of the related work and then describe the


                                Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI,
                                May 29-30, 2024, Naples, Italy
                                ∗ Corresponding author.
                                † These authors contributed equally.
                                angela.pappagallo@istat.it (A. Pappagallo); al.emanuele@reply.it (A. Emanuele);
                                francesco.pugliese@istat.it (F. Pugliese); massimo.decubellis@istat.it (M. De Cubellis)
                                © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons
                                License Attribution 4.0 International (CC BY 4.0).
                                CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

methods applied, results obtained and draw conclusions.

2. Related Works

Temporal data is ubiquitous in today's data-driven world. Time Series Forecasting (TSF) [1] is a
long-standing task with a wide range of applications. Over the past decades, TSF solutions have
evolved from traditional statistical methods (such as ARIMA [2]) to deep learning-based solutions
such as Recurrent Neural Networks (RNNs) [3] and Temporal Convolutional Networks (TCNs) [4].
Another commonly employed method is Exponential Smoothing, including variants such as the
Holt-Winters seasonal method [5], which is effective for capturing trends and seasonality in time
series data. Recently, there has been a surge in Transformer-based solutions for time series
analysis, as highlighted in [6]. The main strength of Transformers lies in their multi-head
self-attention mechanism, which has a remarkable ability to extract semantic correlations between
elements in a long sequence. However, although the use of different positional encoding techniques
can preserve some information about the order, there is still an inevitable loss of temporal
information after the self-attention mechanism is applied. This is usually not a serious problem for
semantically rich applications such as NLP [7], where the semantic meaning of a sentence is largely
preserved even if some words are reordered [8]. However, in time series analysis, the semantic
context of the numerical data itself is typically absent, and we are primarily interested in
modelling temporal changes between a continuous set of points [9]. In other words, the order itself
plays the most important role. Consequently, some researchers have asked the following question: Are
transformers really effective for long-term forecasting of time series? [10].

3. Methods

As mentioned in the previous sections, the aim of this paper is to demonstrate the effectiveness of
Transformer models for time series forecasting. In order to do this, a comparative analysis of these
models has been carried out with methods that have been the state of the art in time series
forecasting for many years. Specifically, the performance of a Transformer model has been compared
with that of three other models: a Long Short-Term Memory (LSTM) model, a Gated Recurrent Unit (GRU)
and a PROPHET model.

LSTM networks [11] are a specialised type of Recurrent Neural Network (RNN) tailored to the
challenge of capturing long-term dependencies. While RNNs excel at using past information for
current tasks, they can struggle when the time gap between relevant information and its application
is substantial. Although they share a general structure with RNNs, LSTMs have a distinct
architecture in their repeating module. Unlike standard RNNs, whose repeating module typically
consists of a single neural network layer, LSTMs have four interconnected layers. A central
component of LSTMs is the cell state (C), which persists throughout the chain and is modified by gate
mechanisms. These gates facilitate the selective retention or addition of information to the cell
state, thereby enhancing the network's ability to capture and maintain long-term dependencies.

On the other hand, the GRU [12] can be seen as a simplification of the LSTM where explicit cell
states are not used. Another difference is that the LSTM directly controls the flow of information
exchanged in the hidden state using separate forget and output gates. Instead, a GRU uses a single
reset gate to achieve the same goal. However, the underlying idea of Gated Recurrent Units is quite
similar to that of LSTMs in terms of how hidden states are partially reset. Just as LSTM uses input,
output and forget gates to decide how much information to carry over from the previous time step to
the next, GRU uses update and reset gates. GRU has no separate internal memory and also needs fewer
gates to perform the update from one hidden state to another. This raises the question of the
specific function of the update and reset gates. The reset gate determines the amount of the hidden
state to transfer from the previous time step for a matrix-based update, as in a standard RNN. The
update gate determines the 'relative strength' of the contributions from this matrix-based update
and a more direct contribution from the hidden vector at the previous time step. By allowing a
direct (partial) copy of hidden states from the previous time step, the gradient flow becomes more
stable during backpropagation. The update gate of the GRU simultaneously plays the roles of the
input and forget gates in LSTMs. Although GRU is a related simplification of LSTMs, it should not be
considered a special case of them. Research has shown that the two models perform similarly, with
relative performance depending on the task. GRU is easier to implement and more efficient. It may
generalize slightly better with less data due to fewer parameters, while LSTM would be preferable
with a larger amount of data.

PROPHET, which was developed by Facebook (Meta) in 2017 [13], is a time series forecasting model
that is specifically designed to handle the common characteristics of economic time series. It is
important to note that the model was designed with intuitive
parameters that can be adjusted without requiring knowledge of the underlying model details. This
allows analysts to effectively tune the model. The model uses a decomposable time series model
consisting of three main components, which are combined additively, like the ARIMA model. These
components are trend, seasonality, and holidays. The first two components have already been
encountered in the ARIMA model, while the third component, holidays, represents the effects of
holidays that occur at potentially irregular intervals over one or more days. In this model, only
time is used as a regressor. The problem of forecasting is approached as a curve fitting exercise,
which is fundamentally distinct from time series models that explicitly consider the temporal
dependence structure of the data. Although this formulation sacrifices some important inferential
benefits of a generative model like ARIMA, it offers several practical advantages. It can easily
accommodate seasonality with multiple periods and enables the analyst to make different assumptions
about trends. PROPHET is capable of handling multiple cases and, unlike ARIMA models, which are
designed specifically for forecasting univariate time series, does not require regularly spaced
measurements. The fitting process is fast and allows for interactive exploration of various model
specifications.

Finally, Transformers are a type of machine learning model that was introduced in 2017 by Google
researchers [14]. They are artificial neural networks designed to process sequences of data, such as
words in text. Transformers differ from other neural network architectures such as RNNs in that they
rely on self-attention rather than recurrence functions. Self-attention allows for relevant parts of
an input sequence to be given more weight during the processing of a specific data instance. This
enables Transformers to process information in parallel, rather than sequentially, unlike RNNs. The
Transformer model has been successfully applied in various natural language processing tasks,
including automatic translation, text summarization, text generation, and speech recognition. Its
architecture consists of an encoder-decoder structure. The encoder maps a sequence of input symbolic
representations to a sequence of continuous representations. The decoder generates an output
sequence one element at a time based on these inner representations. The model is autoregressive at
each step, using previously generated values as additional input for generating the next ones. The
Transformer implements this architecture using multiple self-attention layers and fully connected
point-wise layers for both the encoder and the decoder.

An attention function maps a query and a set of key-value pairs to an output, where the query, keys,
values, and output are all vectors. The query vector is derived from the input sequence and
represents the current input token or element; it is used to determine which parts of the sequence
are relevant for that token. The key vector is also derived from the input sequence and represents
the context or information from the other tokens; it serves as a reference against which the query
vector is compared to determine the relevance of each token in the sequence. The value vector, like
the query and key vectors, is derived from the input sequence and carries the information or content
associated with each token. The output is determined by a weighted sum of the values. The weight
assigned to each value is computed by a compatibility function of the query with the corresponding
key. The input comprises queries and keys of size dk and values of size dv. The dot product of the
query with all keys is calculated, each divided by √dk, and a SoftMax function is applied to obtain
weights on the values. The attention function is calculated on a matrix Q, which contains a set of
queries. Matrices K and V contain the keys and values, respectively, which are also grouped
together. In matrix form, Attention(Q, K, V) = SoftMax(Q K^T / √dk) V. This ensures a simultaneous
calculation of the attention function, and so this process is fully parallelizable.

4. Experiment

In this experiment we assess the effectiveness of Transformer models in time series forecasting by
comparing them with three other established models: PROPHET, GRU, and LSTM.

4.1. Datasets and Pre-processing
The experiment analysed four socio-economic time series collected from Istat. Gross Domestic Product
(GDP) measures a country's economic activity over a period, usually a quarter or year. The
Unemployment Rate reflects labour market conditions, with increases signaling economic contraction
and decreases indicating recovery. Inflation reflects continuous price increases and is crucial for
assessing consumer purchasing power. The Consumer Confidence Index measures public economic
sentiment, which influences spending and investment decisions and often predicts future economic
trends, guiding policy. The GDP data covers the period from March 1, 1990, to March 1, 2023, on a
quarterly basis. The
unemployment data range from March 1, 2004, to            mean. These results suggest that the attention
March 1, 2023, on a quarterly basis. Inflation data       mechanism, along with the introduction of temporal
range from January 1, 1997, to June 1, 2023, on a         encoding, offers significant advantages over standard
monthly basis. The Consumer Confidence Index data         methods, such as recurrence. Figure 1 shows the
range from January 1, 1998, to May 1, 2023, on a          predictions of the Transformer, LSTM, and GRU
monthly basis.                                            models on the GDP test set. All predictions capture the
Before entering the data into the Transformer, a pre-     trend of the series, with some deviations, particularly
processing phase was carried out. This involved           in the case of LSTM and GRU. The Transformer
analysing the data, including cleaning, transforming,     model's prediction is precise and captures the local
and preparing the raw data to make it suitable for        maxima and minima of the series effectively. It is
further analysis or for use in machine learning models    important to note that the test dataset includes the
and algorithms. This phase is crucial because real data   period of the Covid-19 pandemic, which is identifiable
can be dirty, incomplete, or in formats unsuitable for    in the depression exhibited by the curve in the figure.
analysis. Pre-processing aims to make the data more       Despite being an unpredictable and anomalous event,
accurate, consistent, and usable. In the experiment at    the Transformer model manages to capture the trend
hand, the pre-processing techniques used for the          of the curve better than the two recurrent networks,
available data essentially consist of three steps:        especially in the period following the pandemic, albeit
interpolation to convert quarterly series into monthly    with a slight delay. Although LSTM and GRU models
series to increase the sample size, data normalization,   lose their predictive capability after the Covid-19
and a final transformation of the series to a format      period, the Transformer continues to accurately
suitable for supervised learning.                         capture the trend of the series.
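
The three pre-processing steps listed above (interpolation, normalization, and conversion to a supervised format) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the paper does not specify the interpolation scheme, the normalization method, or the window length, so linear interpolation, min-max scaling, and a three-step input window are assumed here. All function names and the toy values are hypothetical.

```python
from typing import List, Tuple


def quarterly_to_monthly(quarterly: List[float]) -> List[float]:
    """Insert two linearly interpolated points between consecutive quarters.

    Linear interpolation is an assumption; the paper does not name the method.
    """
    monthly: List[float] = []
    for start, end in zip(quarterly, quarterly[1:]):
        step = (end - start) / 3.0
        monthly.extend([start, start + step, start + 2.0 * step])
    monthly.append(quarterly[-1])  # keep the final observed quarter
    return monthly


def min_max_scale(series: List[float]) -> Tuple[List[float], float, float]:
    """Scale values to [0, 1] and return the bounds needed to denormalise."""
    low, high = min(series), max(series)
    span = (high - low) or 1.0  # avoid division by zero on a constant series
    return [(value - low) / span for value in series], low, high


def make_supervised(
    series: List[float], n_lags: int
) -> Tuple[List[List[float]], List[float]]:
    """Turn a series into (input window, next value) training pairs."""
    n_samples = len(series) - n_lags
    inputs = [series[i:i + n_lags] for i in range(n_samples)]
    targets = [series[i + n_lags] for i in range(n_samples)]
    return inputs, targets


# Toy quarterly values, made up for illustration (not Istat data).
quarterly = [100.0, 103.0, 109.0]
monthly = quarterly_to_monthly(quarterly)
scaled, low, high = min_max_scale(monthly)
windows, targets = make_supervised(scaled, n_lags=3)
print(len(monthly), len(windows))
```

In a real experiment the scaling bounds would normally be computed on the training portion only, so that no information from the test period leaks into the model inputs; the returned bounds are what allows predictions to be denormalised before computing metrics.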

4.2. Results
This section presents the results obtained by the
PROPHET, GRU, LSTM, and Transformer models on
GDP, Inflation, Consumer Confidence Index, and
Unemployment Rate data. The models' performance
was evaluated using Root Mean Square Error (RMSE),
Mean Absolute Error (MAE), and R2 (Coefficient of
Determination) metrics. RMSE is a measure of the
dispersion between observed values and values
predicted by a model. MAE calculates the average of
the absolute differences between the predicted and
observed values. R2 is a measure of how well a
regression model fits the data.
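
The three metrics can be written down in a few lines. The sketch below uses the standard textbook definitions, RMSE = sqrt(mean((y - ŷ)^2)), MAE = mean(|y - ŷ|), and R2 = 1 - SS_res / SS_tot; the paper does not state which implementation was used, so the function names and the toy values are assumptions for illustration only.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root Mean Square Error: square root of the mean squared residual."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Error: mean of the absolute residuals."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r2(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy values, made up for illustration.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
print(rmse(observed, predicted), mae(observed, predicted), r2(observed, predicted))

# Predicting the observed mean everywhere gives R2 = 0 by construction.
mean_baseline = [sum(observed) / len(observed)] * len(observed)
print(r2(observed, mean_baseline))
```

With this definition, an R2 of zero corresponds to a model that always predicts the mean of the observed values, and a negative R2 indicates errors larger than those of that mean baseline.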
Table 1 displays the metrics calculated on the
denormalised GDP dataset.

Table 1
Metrics calculated on GDP dataset.
Model          RMSE        MAE        R2
PROPHET        15899.18    9208.55    -0.003
GRU            11207.66    7490.36     0.775
LSTM           10492.32    6611.40     0.803
Transformer     4080.38    2296.78     0.970
The metrics indicate that the neural models perform reasonably well on the GDP dataset, with the
Transformer architecture performing the best, while LSTM has comparable performance to that of GRU.
However, the PROPHET model performs poorly on the test data, as its errors exceed those generated by
the mean.

Figure 1: Forecasting on GDP for GRU, LSTM, and Transformer models.

Table 2, Table 3, and Table 4 display the metrics calculated respectively on the denormalised
Unemployment Rate, Inflation and Consumer Confidence Index (CCI) datasets.
Table 2
Metrics calculated on Unemployment Rate dataset.
Model          RMSE    MAE     R2
PROPHET        0.30    0.22    0.07
GRU            0.26    0.19    0.91
LSTM           0.26    0.20    0.91
Transformer    0.21    0.14    0.94

Despite a slightly lower performance compared to the previous series, the Transformer model still
outperforms the other models on the unemployment rate dataset. Table 2 shows that the Transformer
metrics are comparable to those of LSTM and GRU, although they are better. Figure 2 displays the
predictions made by the Transformer, LSTM, and GRU models on the Unemployment Rate test set. As with
the previous series, the test dataset includes the Covid-19 pandemic period. The Transformer model
remains the reference model in terms of behavior, particularly when considering the period after the
pandemic. The LSTM and GRU models produce less accurate predictions.

Figure 2: Forecasting on Unemployment Rate for GRU, LSTM, and Transformer models.

Table 3
Metrics calculated on Inflation dataset.
Model          RMSE    MAE     R2
PROPHET        1.93    0.81    -0.00003
GRU            1.31    0.63     0.88
LSTM           1.33    0.65     0.87
Transformer    1.22    0.59     0.89

Table 4
Metrics calculated on CCI dataset.
Model          RMSE    MAE     R2
PROPHET        4.62    3.56    -0.001
GRU            3.76    2.87     0.75
LSTM           3.77    2.87     0.74
Transformer    3.62    2.70     0.76

From Table 3 and Table 4, once again, it can be observed that the Transformer model outperforms the
other models, albeit by a narrower margin and with a smaller sample size compared to the previous
examples. This could be attributed to the fact that Transformer models perform better with larger
amounts of data. With a reduced sample size, the performance of this architecture is closer to that
of the LSTM and GRU neural networks, but still more accurate at a predictive level. Due to space
limitations, we will not display the predictions made on the Inflation and Consumer Confidence Index
test sets.

5. Conclusions

In this work, an investigation was conducted on the use of a Transformer-based architecture for Time
Series Forecasting (TSF). The architecture has been applied to four different problems, and its
performance has been compared with that of three classical TSF models, namely PROPHET, GRU, and
LSTM. The results indicate that the Transformer architecture outperforms traditional methods in all
experiments, demonstrating its effectiveness in forecasting historical series as well as in the
field of Natural Language Processing. However, in some cases, the Transformer's performance
approached that of traditional recurrent methods. This suggests that, in the TSF domain, the
attention mechanism's benefits are more evident when processing high-dimensional data, specifically
datasets with a large number of features. The Transformer model performed best on the historical
series of GDP, which had the highest number of observations compared to
the other series. Thus, Transformers have proven to be effective in long-term time series
forecasting. However, careful pre-processing and thorough data examination remain essential before
the data are fed into the Transformer model. Additionally, the size of the dataset plays a crucial
role in the performance of the Transformer. Specifically, larger datasets tend to produce better
results due to the abundance of information available for analysis.

References
[1]  Chatfield, Chris. Time-series forecasting. Chapman and Hall/CRC, 2000.
[2]  Shumway, Robert H., et al. "ARIMA models." Time series analysis and its applications: with R
     examples (2017): 75-163.
[3]  Hewamalage, Hansika, Christoph Bergmeir, and Kasun Bandara. "Recurrent neural networks for time
     series forecasting: Current status and future directions." International Journal of
     Forecasting 37.1 (2021): 388-427.
[4]  Wan, Renzhuo, et al. "Multivariate temporal convolutional network: A deep neural networks
     approach for multivariate time series forecasting." Electronics 8.8 (2019): 876.
[5]  Hyndman, Rob J., and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2018.
[6]  Wen, Qingsong, et al. "Transformers in time series: A survey." arXiv preprint
     arXiv:2202.07125 (2022).
[7]  Sun, Chi, et al. "How to fine-tune bert for text classification?." Chinese computational
     linguistics: 18th China national conference, CCL 2019, Kunming, China, October 18–20, 2019,
     proceedings 18. Springer International Publishing, 2019.
[8]  Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint
     arXiv:2307.09288 (2023).
[9]  Boussif, Oussama, et al. "Improving day-ahead Solar Irradiance Time Series Forecasting by
     Leveraging Spatio-Temporal Context." Advances in Neural Information Processing Systems 36
     (2024).
[10] Zeng, Ailing, et al. "Are transformers effective for time series forecasting?." Proceedings of
     the AAAI conference on artificial intelligence. Vol. 37. No. 9. 2023.
[11] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8
     (1997): 1735-1780.
[12] Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for
     statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
[13] Taylor, Sean J., and Benjamin Letham. "Forecasting at scale." The American Statistician 72.1
     (2018): 37-45.
[14] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing
     systems 30 (2017).