Empowering Time-Series Forecasting in Official Statistics through Transformers

Alberico Emanuele¹, Francesco Pugliese², Massimo De Cubellis², Angela Pappagallo²

¹ Whitehall Reply, Via del Giorgione, 59, Rome, 00147, Italy
² Italian National Institute of Statistics – Istat, Via Cesare Balbo, 16, Rome, 00184, Italy

Abstract
Artificial Intelligence (AI) is playing a crucial role in promoting innovation in public administrations. Extensive research on the use of AI to support and improve traditional statistical production processes has been carried out at Istat. This paper presents an approach to time series forecasting based on Transformer neural networks. The experiment applies a neural network architecture originally developed for Natural Language Processing to a new context, namely the prediction of time series. It analyzes four indicators of significant socio-economic interest, namely Gross Domestic Product (GDP), unemployment rate, inflation, and consumer confidence rate, using both Transformers and traditional methods and models. The paper provides a comparative analysis of the performance of Transformers and of other methods used in time series forecasting, such as Auto Regressive Integrated Moving Average (ARIMA), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). The analysis shows that Transformers outperform the other methods on all four indicators in the chosen experiment.

Keywords
Deep Learning, Transformers, Forecasting, Time Series

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
† These authors contributed equally.
angela.pappagallo@istat.it (A. Pappagallo); al.emanuele@reply.it (A. Emanuele); francesco.pugliese@istat.it (F. Pugliese); massimo.decubellis@istat.it (M. De Cubellis)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

In recent years, the rapid progress of information technology has significantly advanced artificial intelligence (AI), especially machine learning and deep learning, transforming problem-solving and producing results that were previously unachievable. The innovations brought by the development of AI have also been applied in the context of official statistics. In time series analysis, in addition to traditional statistical methods such as ARIMA, deep neural network models such as LSTM and GRU have proven to be particularly effective in capturing long-term dependencies in sequential data. Transformer models, initially developed for Natural Language Processing (NLP), are now also being used in time series analysis. These models eliminate the need for recurrent architectures and rely instead on attention mechanisms to capture contextual relationships, which extends their applications beyond NLP.

This paper analyses Transformers applied to the time series forecasting of relevant Istat socio-economic indicators, namely Gross Domestic Product (GDP), unemployment rate, inflation (CPI) and consumer confidence index. The results obtained with Transformers will be compared with those obtained with the traditional ARIMA, LSTM and GRU techniques for each of the mentioned indicators. In the following sections, we first provide an overview of the related work, then describe the methods applied and the results obtained, and finally draw conclusions.

2. Related Works

Temporal data is ubiquitous in today's data-driven world. Time Series Forecasting (TSF) [1] is a long-standing task with a wide range of applications. Over the past decades, TSF solutions have evolved from traditional statistical methods (such as ARIMA [2]) to deep learning-based solutions such as Recurrent Neural Networks (RNNs) [3] and Temporal Convolutional Networks (TCNs) [4]. Another commonly employed method is Exponential Smoothing, including variants such as the Holt-Winters seasonal method [5], which is effective for capturing trends and seasonality in time series data. Recently, there has been a surge in Transformer-based solutions for time series analysis, as highlighted in [6]. The main strength of Transformers lies in their multi-head self-attention mechanism, which has a remarkable ability to extract semantic correlations between elements in a long sequence. However, although different positional encoding techniques can preserve some information about the order, there is still an inevitable loss of temporal information after the self-attention mechanism is applied. This is usually not a serious problem for semantically rich applications such as NLP [7], where the semantic meaning of a sentence is largely preserved even if some words are reordered [8]. In time series analysis, by contrast, the numerical data itself typically carries no semantic context, and we are primarily interested in modelling temporal changes between a continuous set of points [9]. In other words, the order itself plays the most important role. Consequently, some researchers have asked the following question: are Transformers really effective for long-term forecasting of time series? [10].

3. Methods

As mentioned in the previous sections, the aim of this paper is to demonstrate the effectiveness of Transformer models for time series forecasting. To do this, a comparative analysis has been carried out between these models and methods that have been the state of the art in time series forecasting for many years. Specifically, the performance of a Transformer model has been compared with that of three other models: a Long Short-Term Memory (LSTM) model, a Gated Recurrent Unit (GRU) model and a PROPHET model.

LSTM networks [11] are a specialised type of Recurrent Neural Network (RNN) tailored to the challenge of capturing long-term dependencies. While RNNs excel at using past information for current tasks, they can struggle when the time gap between relevant information and its application is substantial. Although they share a general structure with RNNs, LSTMs have a distinct architecture in their repeating module. Unlike RNNs, whose repeating module typically consists of a single neural network layer, LSTMs have four interconnected layers. A central component of LSTMs is the cell state (C), which persists throughout the chain and is modified by gate mechanisms. These gates facilitate the selective retention or addition of information to the cell state, thereby enhancing the network's ability to capture and maintain long-term dependencies.

The GRU [12], on the other hand, can be seen as a simplification of the LSTM in which explicit cell states are not used. Another difference is that the LSTM directly controls the flow of information exchanged in the hidden state using separate forget and output gates, whereas a GRU uses a single reset gate to achieve the same goal. However, the underlying idea of the GRU is quite similar to that of the LSTM in terms of how hidden states are partially reset. Just as the LSTM uses input, output and forget gates to decide how much information to carry over from the previous time step to the next, the GRU uses update and reset gates. The GRU has no separate internal memory and needs fewer gates to perform the update from one hidden state to another. This raises the question of the specific function of the update and reset gates. The reset gate determines how much of the hidden state from the previous time step is passed to a matrix-based update, like the one in a plain RNN. The update gate determines the 'relative strength' of the contributions from this matrix-based update and from a more direct contribution of the hidden vector of the previous time step. By allowing a direct (partial) copy of the hidden state from the previous step, the gradient flow becomes more stable during backpropagation. The update gate thus simultaneously plays the role of the input and forget gates in LSTMs. Although the GRU is a related simplification of the LSTM, it should not be considered a special case of it. Research has shown that the two models perform similarly, with relative performance depending on the task. The GRU is easier to implement and more efficient. It may generalize slightly better with less data because it has fewer parameters, while the LSTM would be preferable with a larger amount of data.

PROPHET, which was developed by Facebook (Meta) in 2017 [13], is a time series forecasting model specifically designed to handle the common characteristics of economic time series. The model was designed with intuitive parameters that can be adjusted without requiring knowledge of the underlying model details, which allows analysts to tune it effectively. It uses a decomposable time series model consisting of three main components, which are combined additively, like the ARIMA model. These components are trend, seasonality, and holidays. The first two components have already been encountered in the ARIMA model, while the third component, holidays, represents the effects of holidays that occur at potentially irregular intervals over one or more days. In this model, only time is used as a regressor. The forecasting problem is approached as a curve-fitting exercise, which is fundamentally distinct from time series models that explicitly consider the temporal dependence structure of the data. Although this formulation sacrifices some important inferential benefits of a generative model like ARIMA, it offers several practical advantages. It can easily accommodate seasonality with multiple periods and enables the analyst to make different assumptions about trends. PROPHET can handle a variety of cases and, unlike ARIMA models, which are designed specifically for forecasting univariate time series, does not require regularly spaced measurements. The fitting process is fast and allows for interactive exploration of various model specifications.

Finally, Transformers are a type of machine learning model introduced in 2017 by researchers at Google [14]. They are artificial neural networks designed to process sequences of data, such as words in text. Transformers differ from other neural network architectures such as RNNs in that they rely on self-attention rather than recurrence. Self-attention allows relevant parts of an input sequence to be given more weight during the processing of a specific data instance. This enables Transformers to process information in parallel, rather than sequentially as RNNs do. The Transformer model has been successfully applied in various natural language processing tasks, including automatic translation, text summarization, text generation, and speech recognition. Its architecture consists of an encoder-decoder structure. The encoder maps a sequence of input symbolic representations to a sequence of continuous representations. The decoder generates an output sequence one element at a time based on these inner representations. The model is autoregressive at each step, using previously generated values as additional input for generating the next ones. The Transformer implements this architecture using multiple self-attention layers and fully connected point-wise layers for both the encoder and the decoder.

An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The query vector is derived from the input sequence and represents the current input token or element; it is used to determine which parts of the sequence are relevant for that token. The key vector is also derived from the input sequence; it represents the context or information from the other tokens and serves as a reference against which the query vector is compared to determine the relevance of each token in the sequence. The value vector, likewise derived from the input sequence, represents the information or content associated with each token. The output is a weighted sum of the values, where the weight assigned to each value is calculated by a compatibility function of the query with the corresponding key. The input comprises queries and keys of size d_k and values of size d_v. The dot product of the query with all keys is calculated, each is divided by √d_k, and a SoftMax function is applied to obtain the weights on the values. In practice, the attention function is calculated on a matrix Q, which contains a set of queries, while the keys and values are likewise grouped together into matrices K and V. This allows the attention function to be computed simultaneously for all queries, so the process is fully parallelizable.

4. Experiment

In this experiment, we assess the effectiveness of Transformer models in time series forecasting by comparing them with three other established models: PROPHET, GRU, and LSTM.

4.1. Datasets and Pre-processing

The experiment analysed four socio-economic time series collected from Istat. Gross Domestic Product (GDP) measures a country's economic activity over a period, usually a quarter or a year. The Unemployment Rate reflects labour market conditions, with increases signaling economic contraction and decreases indicating recovery. Inflation reflects continuous price increases and is crucial for assessing consumer purchasing power. The Consumer Confidence Index measures public economic sentiment, which influences spending and investment decisions and often predicts future economic trends, guiding policy. The GDP data cover the period from March 1, 1990, to March 1, 2023, on a quarterly basis. The unemployment data range from March 1, 2004, to March 1, 2023, on a quarterly basis. Inflation data range from January 1, 1997, to June 1, 2023, on a monthly basis. The Consumer Confidence Index data range from January 1, 1998, to May 1, 2023, on a monthly basis.

Before entering the data into the Transformer, a pre-processing phase was carried out. This involved analysing, cleaning, transforming, and preparing the raw data to make it suitable for further analysis or for use in machine learning models and algorithms. This phase is crucial because real data can be dirty, incomplete, or in formats unsuitable for analysis. Pre-processing aims to make the data more accurate, consistent, and usable. In this experiment, the pre-processing of the available data essentially consists of three steps: interpolation to convert quarterly series into monthly series to increase the sample size, data normalization, and a final transformation of the series into a format suitable for supervised learning.

4.2. Results

This section presents the results obtained by the PROPHET, GRU, LSTM, and Transformer models on GDP, Inflation, Consumer Confidence Index, and Unemployment Rate data. The models' performance was evaluated using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R² (Coefficient of Determination) metrics. RMSE is a measure of the dispersion between observed values and values predicted by a model. MAE is the average of the absolute differences between the predicted and observed values. R² is a measure of how well a regression model fits the data. Table 1 displays the metrics calculated on the denormalised GDP dataset.

Table 1
Metrics calculated on GDP dataset.

Model         RMSE       MAE       R²
PROPHET       15899.18   9208.55   -0.003
GRU           11207.66   7490.36    0.775
LSTM          10492.32   6611.40    0.803
Transformer    4080.38   2296.78    0.970

The metrics indicate that the three neural network models perform reasonably well on the GDP dataset, with the Transformer architecture performing best and LSTM performing comparably to GRU. The PROPHET model, however, performs poorly on the test data: its negative R² means that its errors exceed those of a constant prediction equal to the mean of the series. These results suggest that the attention mechanism, along with the introduction of temporal encoding, offers significant advantages over standard methods, such as recurrence. Figure 1 shows the predictions of the Transformer, LSTM, and GRU models on the GDP test set. All predictions capture the trend of the series, with some deviations, particularly in the case of LSTM and GRU. The Transformer model's prediction is precise and captures the local maxima and minima of the series effectively. It is important to note that the test dataset includes the period of the Covid-19 pandemic, which is identifiable in the depression exhibited by the curve in the figure. Although the pandemic was an unpredictable and anomalous event, the Transformer model manages to capture the trend of the curve better than the two recurrent networks, especially in the period following the pandemic, albeit with a slight delay. While the LSTM and GRU models lose their predictive capability after the Covid-19 period, the Transformer continues to accurately capture the trend of the series.

Figure 1: Forecasting on GDP for GRU, LSTM, and Transformer models.

Table 2, Table 3, and Table 4 display the metrics calculated respectively on the denormalised Unemployment Rate, Inflation and Consumer Confidence Index (CCI) datasets.

Table 2
Metrics calculated on Unemployment Rate dataset.

Model         RMSE   MAE    R²
PROPHET       0.30   0.22   0.07
GRU           0.26   0.19   0.91
LSTM          0.26   0.20   0.91
Transformer   0.21   0.14   0.94

Table 3
Metrics calculated on Inflation dataset.

Model         RMSE   MAE    R²
PROPHET       1.93   0.81   -0.00003
GRU           1.31   0.63    0.88
LSTM          1.33   0.65    0.87
Transformer   1.22   0.59    0.89

Table 4
Metrics calculated on CCI dataset.

Model         RMSE   MAE    R²
PROPHET       4.62   3.56   -0.001
GRU           3.76   2.87    0.75
LSTM          3.77   2.87    0.74
Transformer   3.62   2.70    0.76

Despite a slightly lower performance compared to the previous series, the Transformer model still outperforms the other models on the Unemployment Rate dataset. Table 2 shows that the Transformer metrics are comparable to, although better than, those of LSTM and GRU. Figure 2 displays the predictions made by the Transformer, LSTM, and GRU models on the Unemployment Rate test set. As with the previous series, the test dataset includes the Covid-19 pandemic period. The Transformer model remains the reference model in terms of behavior, particularly when considering the period after the pandemic. The LSTM and GRU models produce less accurate predictions.

Figure 2: Forecasting on Unemployment Rate for GRU, LSTM, and Transformer models.

From Table 3 and Table 4, it can once again be observed that the Transformer model outperforms the other models, albeit by a narrower margin and on smaller samples than in the previous examples. The narrower margin could be attributed to the fact that Transformer models perform better with larger amounts of data. With a reduced sample size, the performance of this architecture is closer to that of the LSTM and GRU neural networks, but still more accurate at a predictive level. Due to space limitations, we do not display the predictions made on the Inflation and Consumer Confidence Index test sets.

5. Conclusions

In this work, an investigation was conducted on the use of a Transformer-based architecture for Time Series Forecasting (TSF). The architecture has been applied to four different problems, and its performance has been compared with that of three established TSF models: PROPHET, GRU, and LSTM. The results indicate that the Transformer architecture outperforms these methods in all experiments, demonstrating its effectiveness in forecasting time series as well as in the field of Natural Language Processing. However, in some cases, the Transformer's performance approached that of the traditional recurrent methods. This suggests that, in the TSF domain, the attention mechanism's benefits are more evident when processing high-dimensional data, specifically datasets with a large number of features. The Transformer model performed best on the GDP series, which had the highest number of observations among the series considered. In this experiment, Transformers therefore proved effective in long-term time series forecasting. However, it is important to emphasize the need for careful pre-processing and thorough examination of the data before feeding them into the Transformer model. Additionally, the size of the dataset plays a crucial role in the performance of the Transformer. Specifically, larger datasets tend to produce better results due to the abundance of information available for analysis.

References

[1] Chatfield, Chris. Time-series forecasting. Chapman and Hall/CRC, 2000.
[2] Shumway, Robert H., et al. "ARIMA models." Time series analysis and its applications: with R examples (2017): 75-163.
[3] Hewamalage, Hansika, Christoph Bergmeir, and Kasun Bandara. "Recurrent neural networks for time series forecasting: Current status and future directions." International Journal of Forecasting 37.1 (2021): 388-427.
[4] Wan, Renzhuo, et al. "Multivariate temporal convolutional network: A deep neural networks approach for multivariate time series forecasting." Electronics 8.8 (2019): 876.
[5] Hyndman, Rob J., and George Athanasopoulos. Forecasting: principles and practice. OTexts, 2018.
[6] Wen, Qingsong, et al. "Transformers in time series: A survey." arXiv preprint arXiv:2202.07125 (2022).
[7] Sun, Chi, et al. "How to fine-tune bert for text classification?." Chinese computational linguistics: 18th China national conference, CCL 2019, Kunming, China, October 18–20, 2019, proceedings 18. Springer International Publishing, 2019.
[8] Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).
[9] Boussif, Oussama, et al. "Improving day-ahead Solar Irradiance Time Series Forecasting by Leveraging Spatio-Temporal Context." Advances in Neural Information Processing Systems 36 (2024).
[10] Zeng, Ailing, et al. "Are transformers effective for time series forecasting?." Proceedings of the AAAI conference on artificial intelligence. Vol. 37. No. 9. 2023.
[11] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.
[12] Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).
[13] Taylor, Sean J., and Benjamin Letham. "Forecasting at scale." The American Statistician 72.1 (2018): 37-45.
[14] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
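
As a supplement to the description of the attention function in Section 3 (Methods), the following is a minimal NumPy sketch of scaled dot-product attention: the dot products of the queries with all keys are divided by √d_k, passed through a SoftMax, and used as weights on the values. It is an illustration only, not the implementation used in the experiment; the function names, the toy dimensions and the random inputs are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compatibility of each query with each key
    weights = softmax(scores, axis=-1)  # one row of weights per query, summing to 1
    return weights @ V, weights         # output = weighted sum of the values

# Toy input (hypothetical): 4 time steps, d_k = d_v = 3, random values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))
K = rng.normal(size=(4, 3))
V = rng.normal(size=(4, 3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=1))
```

Because all queries are stacked in the matrix Q, the whole computation reduces to two matrix products and one row-wise SoftMax, which is what makes it parallelizable.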
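
The three pre-processing steps listed in Section 4.1 (interpolation from quarterly to monthly frequency, normalization, and transformation into a supervised learning format) can be sketched as below. The paper does not state which interpolation method, scaler or window length was used, so linear interpolation, min-max scaling and a window of 3 lags are assumptions chosen for illustration, as are the function names and the toy quarterly values.

```python
import numpy as np

def quarterly_to_monthly(values):
    """Linearly interpolate a quarterly series onto a monthly grid (assumed method)."""
    values = np.asarray(values, dtype=float)
    quarter_pos = np.arange(len(values)) * 3    # months at which observations exist
    month_pos = np.arange(quarter_pos[-1] + 1)  # every month from first to last quarter
    return np.interp(month_pos, quarter_pos, values)

def min_max_scale(series):
    """Rescale to [0, 1]; returns the bounds needed to denormalise predictions.
    Assumes the series is not constant."""
    lo, hi = series.min(), series.max()
    return (series - lo) / (hi - lo), (lo, hi)

def make_windows(series, n_lags):
    """Turn a series into (X, y) pairs: n_lags past values -> next value."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

# Hypothetical quarterly values, for illustration only.
quarterly = [100.0, 103.0, 109.0, 106.0]
monthly = quarterly_to_monthly(quarterly)   # 4 quarters -> 10 monthly points
scaled, (lo, hi) = min_max_scale(monthly)
X, y = make_windows(scaled, n_lags=3)
print(monthly)
print(X.shape, y.shape)
```

In a real experiment the scaling bounds would normally be computed on the training portion only, so that no information from the test period leaks into the model inputs.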
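
The three evaluation metrics used in Section 4.2 can be written out directly from their standard definitions. The sketch below uses plain Python and made-up numbers; it is not the evaluation code of the experiment. It also shows why a negative R², as reported for PROPHET, corresponds to a forecast that is worse than always predicting the mean of the observed series.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observations and predictions, for illustration only.
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 8.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))

# Always predicting the mean gives R² = 0; anything worse gives R² < 0.
print(r2(y_true, [5.0, 5.0, 5.0, 5.0]))
print(r2(y_true, [8.0, 6.0, 4.0, 2.0]))
```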
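
The roles of the reset and update gates described in Section 3 can be made concrete with a single GRU step. The sketch below follows the commonly used formulation in which the new hidden state is z * h_prev + (1 - z) * h_candidate; some references swap the roles of z and 1 - z. The parameter names, the small random initialisation and the toy input sequence are assumptions for illustration and do not reproduce the GRU model trained in the experiment.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_in, n_hidden, rng):
    """Small random weights for the reset (r), update (z) and candidate (h) parts."""
    p = {}
    for gate in ("r", "z", "h"):
        p[f"W_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p[f"U_{gate}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        p[f"b_{gate}"] = np.zeros(n_hidden)
    return p

def gru_step(x, h_prev, p):
    # Reset gate: how much of the previous hidden state enters the matrix-based update.
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])
    # Update gate: relative strength of the direct copy versus the candidate state.
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])
    # Candidate state: an RNN-style update computed on the reset hidden state.
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])
    # New state: partial copy of h_prev blended with the candidate.
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
params = init_params(n_in=1, n_hidden=4, rng=rng)
h = np.zeros(4)
for x_t in [0.1, 0.2, 0.3]:  # hypothetical univariate input sequence
    h = gru_step(np.array([x_t]), h, params)
print(h)
```

When the update gate saturates at 1, the previous hidden state is copied through unchanged, which is the direct path that helps keep the gradient flow stable during backpropagation.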