<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Empowering Time-Series Forecasting in Official Statistics through Transformers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberico Emanuele</string-name>
          <email>al.emanuele@reply.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Pugliese</string-name>
          <email>francesco.pugliese@istat.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo De Cubellis</string-name>
          <email>massimo.decubellis@istat.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angela Pappagallo</string-name>
          <email>angela.pappagallo@istat.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ital-IA 2024: 4th National Conference on Artificial Intelligence</institution>
          ,
          <addr-line>organized by CINI</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Italian National Institute of Statistics - Istat</institution>
          ,
          <addr-line>Via Cesare Balbo, 16, Rome, 00184</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Whitehall Reply</institution>
          ,
          <addr-line>Via del Giorgione, 59, Rome, 00147</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Artificial Intelligence (AI) is playing a crucial role in promoting innovation in public administrations. Extensive research on the use of AI to support and improve traditional statistical production processes has been carried out at Istat. This paper presents a pioneering approach based on Transformer neural networks for time series forecasting. The experiment applies a neural architecture originally developed for Natural Language Processing to a new context, namely the prediction of time series. It analyzes four indicators of significant socio-economic interest, namely Gross Domestic Product (GDP), unemployment rate, inflation, and the consumer confidence index, using both Transformers and traditional methods. The paper provides a comparative analysis of the performance of Transformers and other methods used for time series forecasting, such as Auto Regressive Integrated Moving Average (ARIMA), Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). The analysis shows that Transformers outperform the other methods in the chosen experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Transformers</kwd>
        <kwd>Forecasting</kwd>
        <kwd>Time Series</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the rapid progress of information
technology has significantly advanced artificial
intelligence (AI), especially machine learning and
deep learning, transforming problem-solving and
obtaining results that were previously unachievable.
The innovations brought by the development of AI
have also been applied in the context of official
statistics. In time series analysis, in addition to
traditional statistical methods such as ARIMA, deep
neural network models such as LSTM and GRU have
proven to be particularly effective in capturing
long-term dependencies in sequential data.</p>
      <p>Transformer models, initially developed for Natural
Language Processing (NLP), are now also being used
in time series analysis. These models eliminate the
need for recurrent architectures, relying on attention
mechanisms to capture contextual relationships,
extending their applications beyond NLP. This paper
analyses Transformers applied to the time series
forecasting of Istat’s relevant socio-economic
indicators, such as Gross Domestic Product (GDP),
unemployment rate, inflation (CPI) and consumer
confidence index. The results obtained with
Transformers will be compared with those obtained
with the traditional ARIMA, LSTM and GRU
techniques for each of the mentioned indicators. In
the following paragraphs, we will first provide an
overview of the related work and then describe the
methods applied, present the results obtained, and draw
conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Temporal data is ubiquitous in today's data-driven
world. Time Series Forecasting (TSF) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a
longstanding task with a wide range of applications. Over
the past decades, TSF solutions have evolved from
traditional statistical methods (such as ARIMA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) to
deep learning-based solutions such as Recurrent
Neural Networks (RNNs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Temporal
Convolutional Networks (TCNs) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Another
commonly employed method is Exponential
Smoothing, including variants such as Holt-Winters
seasonal method [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which is effective for capturing
trends and seasonality in time series data. Recently,
there has been a surge in Transformer-based
solutions for time series analysis, as highlighted in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
The main strength of Transformers lies in their
multi-head self-attention mechanism, which has a
remarkable ability to extract semantic correlations
between elements in a long sequence. However,
although the use of different positional encoding
techniques can preserve some information about the
order, there is still an inevitable loss of temporal
information after the self-attention mechanism is
applied. This is usually not a serious problem for
semantically rich applications such as NLP [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where
the semantic meaning of a sentence is largely
preserved even if some words are reordered [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
However, in time series analysis, the semantic context
of the numerical data itself is typically absent, and we
are primarily interested in modelling temporal
changes between a continuous set of points [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In
other words, the order itself plays the most important
role. Consequently, some researchers have asked the
following question: Are transformers really effective
for long-term forecasting of time series? [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
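      <p>
        As a concrete reference for the positional encoding techniques
mentioned above, the following sketch shows the sinusoidal encoding used
in the original Transformer; the sequence length and model dimension are
arbitrary example values.
      </p>
      <preformat>
# Sketch of the sinusoidal positional encoding commonly used to inject order
# information before self-attention (one of the techniques alluded to above).
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                        # (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])               # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])               # odd dimensions
    return encoding


pe = sinusoidal_positional_encoding(seq_len=36, d_model=64)   # added to the inputs
      </preformat>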
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>
        As mentioned in the previous sections, the aim of this
paper is to demonstrate the effectiveness of
Transformer models for time series forecasting. In
order to do this, a comparative analysis of these
models has been carried out with methods that have
been the state of the art in time series forecasting for
many years. Specifically, the performance of a
Transformer model has been compared with that of
three other models: a Long Short-Term Memory
(LSTM) model, a Gated Recurrent Unit (GRU) and a
PROPHET model.
      </p>
      <p>
        LSTM networks [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] are a
specialised type of Recurrent Neural Networks
(RNNs) tailored to the challenge of capturing
long-term dependencies. While RNNs excel at using past
information for current tasks, they can struggle when
the time gap between relevant information and its
application is substantial. Although they share a
general structure with RNNs, LSTMs have a distinct
architecture in their repeating module. Unlike RNNs,
which typically consist of a single neural network
layer, LSTMs have four interconnected layers. A
central component of LSTMs is the cell state (C) which
persists throughout the chain and is modified by gate
mechanisms. These gates facilitate the selective
retention or addition of information to the cell state,
thereby enhancing the network's ability to capture
and maintain long-term dependencies.
      </p>
      <p>
        The GRU [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], on the other hand, can be seen as a simplification of
the LSTM where explicit cell states are not used.
Another difference is that the LSTM directly controls
the flow of information exchanged in the hidden state
using separate forget and output gates. Instead, a GRU
uses a single reset gate to achieve the same goal.
However, the underlying idea of Gated Recurrent
Units is quite similar to that of LSTMs in terms of how
hidden states are partially reset. Just as LSTM uses
input, output and forget gates to decide how much
information to carry over from the previous time step
to the next, GRU uses update and reset gates. GRU has
no separate internal memory and also needs fewer
gates to perform the update from one hidden state to
another. This raises the question of the specific
function of the update and reset gates. The reset gate
determines how much of the hidden state from the
previous time step is passed into a matrix-based
update, as in an RNN. The update gate determines
the relative strength of this matrix-based update and
of a more direct contribution from the hidden vector
at the previous time step. By
allowing a direct (partial) copy of hidden states from
the previous time step, the gradient flow becomes more
stable during backpropagation. The update gate thus
plays the role of both the input and forget gates of an
LSTM. Although the GRU is a simplification of the
LSTM, it should not be considered a special case of
it. Research has shown that the two models
perform similarly, with relative performance
depending on the task. GRU is easier to implement and
more efficient. It may generalize slightly better with
less data due to fewer parameters, while LSTM would
be preferable with a larger amount of data.
      </p>
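      <p>
        To make the comparison concrete, the sketch below shows how a
one-step-ahead recurrent forecaster based on either an LSTM or a GRU
can be set up in PyTorch. The layer sizes and window length are
illustrative assumptions, not the exact configuration used in our
experiments.
      </p>
      <preformat>
# Minimal sketch (illustrative hyper-parameters): a one-step-ahead
# recurrent forecaster that can use either an LSTM or a GRU cell.
import torch
import torch.nn as nn


class RecurrentForecaster(nn.Module):
    def __init__(self, cell="lstm", input_size=1, hidden_size=32):
        super().__init__()
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        # batch_first=True: inputs have shape (batch, window_length, features)
        self.rnn = rnn_cls(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.rnn(x)          # out: (batch, window_length, hidden_size)
        return self.head(out[:, -1])  # forecast the next value from the last step


# Example: a batch of 8 windows, each containing 12 past (monthly) observations.
model = RecurrentForecaster(cell="gru")
window = torch.randn(8, 12, 1)
prediction = model(window)            # shape: (8, 1)
      </preformat>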
      <p>
        PROPHET, which was developed by Facebook (Meta) in 2017
        [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], is a time series forecasting model that is
specifically designed to handle the common
characteristics of economic time series. It is important
to note that the model was designed with intuitive
parameters that can be adjusted without requiring
knowledge of the underlying model details. This
allows analysts to tune the model effectively. PROPHET
uses a decomposable time series model consisting of
three main components, which are
combined additively, like the ARIMA model. These
components are trend, seasonality, and holidays. The
first two components have already been encountered
in the ARIMA model, while the third component,
holidays, represents the effects of holidays that occur
at potentially irregular intervals over one or more
days. In this model, only time is used as a regressor.
The problem of forecasting is approached as a curve
fitting exercise, which is fundamentally distinct from
time series models that explicitly consider the
temporal dependence structure of the data. Although
this formulation sacrifices some important inferential
benefits of a generative model like ARIMA, it offers
several practical advantages. It can easily
accommodate seasonality with multiple periods and
enable the analyst to make different assumptions
about trends. PROPHET can handle a variety of cases
and does not require regularly spaced measurements,
unlike ARIMA models, which are designed specifically
for forecasting univariate time series. The fitting
process is fast and allows for interactive exploration
of various model specifications.
      </p>
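      <p>
        As an illustration of the intuitive interface described above, the
following sketch fits a PROPHET model to a monthly series and produces a
12-month-ahead forecast. The data frame contents and the horizon are
illustrative assumptions; the column names 'ds' and 'y' are required by the
library.
      </p>
      <preformat>
# Minimal sketch of fitting PROPHET to a monthly series (illustrative data).
import pandas as pd
from prophet import Prophet

# PROPHET expects a data frame with a date column 'ds' and a value column 'y'.
df = pd.DataFrame({
    "ds": pd.date_range("1997-01-01", periods=318, freq="MS"),
    "y": range(318),  # placeholder values; use the actual indicator here
})

model = Prophet()  # decomposes the series into trend, seasonality and holidays
model.fit(df)

future = model.make_future_dataframe(periods=12, freq="MS")
forecast = model.predict(future)  # includes 'yhat', 'yhat_lower', 'yhat_upper'
      </preformat>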
      <p>
        Finally, Transformers are a type of machine learning
model that was introduced in 2017 by Google’s
researchers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. They are artificial neural networks
designed to process sequences of data, such as words
in a text. Transformers differ from other neural
network architectures such as RNNs in that they rely
on self-attention rather than recurrence.
Self-attention allows for relevant parts of an input
sequence to be given more weight during the
processing of a specific data instance. This enables
Transformers to process information in parallel,
rather than sequentially, unlike RNNs. The
Transformer model has been successfully applied in
various natural language processing tasks, including
automatic translation, text summarization, text
generation, and speech recognition. Its architecture
consists of an encoder-decoder structure. The
encoder maps a sequence of input symbolic
representations to a sequence of continuous
representations. The decoder generates an output
sequence one element at a time based on these inner
representations. The model is autoregressive at each
step, using previously generated values as additional
input for generating the next ones. The Transformer
implements this architecture using multiple
self-attention layers and point-wise fully connected layers
in both the encoder and the decoder. An attention
function maps a query and a set of key-value pairs to
an output, where the query, keys, values, and output
are all vectors. The query vector represents the current
input token or element in the sequence and is used to
determine which other parts of the sequence are
relevant to it. The key vector, also derived from the
input sequence, serves as a reference against which the
query is compared to determine the relevance of each
token. The value vector carries the actual content
associated with each token and, like the query and key
vectors, is derived from the input sequence. The output is
determined by a weighted sum of the values. The
weight assigned to each value is calculated from a
query's compatibility function with the corresponding
key. The input comprises queries and keys of dimension
d<sub>k</sub> and values of dimension d<sub>v</sub>. The dot product of the query
with all keys is computed, each result is divided by √d<sub>k</sub>, and a
softmax function is applied to obtain the weights on the
values. The attention function is calculated on a
matrix Q, which contains a set of queries. Matrices K
and V contain the keys and values, respectively, which
are also grouped together. This ensures a
simultaneous calculation of the attention function,
and so this process is fully parallelizable.
      </p>
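      <p>
        The scaled dot-product attention described above can be written
compactly. The following NumPy sketch is a minimal illustration of the
computation (a softmax over the scaled query-key dot products applied to the
values) and is not tied to any particular library implementation; the matrix
sizes in the example are arbitrary.
      </p>
      <preformat>
# Minimal sketch of scaled dot-product attention: each output is a weighted
# sum of the values, with weights given by a softmax over query-key dot
# products divided by the square root of d_k.
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of values


# Example: 4 tokens, queries/keys of dimension 8, values of dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
output = scaled_dot_product_attention(Q, K, V)                 # shape: (4, 8)
      </preformat>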
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>In this experiment we evaluate the effectiveness of
Transformer models in time series forecasting by
comparing them with three other established models,
namely PROPHET, GRU, and LSTM.</p>
      <sec id="sec-4-1">
        <title>4.1. Datasets and Pre-processing</title>
        <p>The experiment analysed four socio-economic time
series collected from Istat. Gross Domestic Product
(GDP) measures a country's economic activity over a
period, usually a quarter or year. The Unemployment
Rate reflects labour market conditions, with increases
signaling economic contraction and decreases
indicating recovery. Inflation reflects continuous
price increases and is crucial for assessing consumer
purchasing power. The Consumer Confidence Index
measures public economic sentiment, which
influences spending and investment decisions and
often predicts future economic trends, guiding policy.
The GDP data covers the period from March 1, 1990,
to March 1, 2023, on a quarterly basis. The
unemployment data range from March 1, 2004, to
March 1, 2023, on a quarterly basis. Inflation data
range from January 1, 1997, to June 1, 2023, on a
monthly basis. The Consumer Confidence Index data
range from January 1, 1998, to May 1, 2023, on a
monthly basis.</p>
        <p>Before feeding the data into the Transformer, a
pre-processing phase was carried out. This involved
analysing the data, including cleaning, transforming,
and preparing the raw data to make it suitable for
further analysis or for use in machine learning models
and algorithms. This phase is crucial because real data
can be dirty, incomplete, or in formats unsuitable for
analysis. Pre-processing aims to make the data more
accurate, consistent, and usable. In the experiment at
hand, the pre-processing techniques used for the
available data essentially consist of three steps:
interpolation to convert quarterly series into monthly
series to increase the sample size, data normalization,
and a final transformation of the series to a format
suitable for supervised learning.</p>
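        <p>
          As an illustrative example, the three pre-processing steps can be
sketched as follows. The window length and the normalisation range are
assumptions made for the sake of the example rather than the exact settings
used in the experiment.
        </p>
        <preformat>
# Sketch of the three pre-processing steps (illustrative parameters):
# quarterly-to-monthly interpolation, min-max normalisation, and a
# sliding-window transformation into a supervised-learning format.
import numpy as np
import pandas as pd


def preprocess(series, window=12):
    # series: pandas Series with a quarterly DatetimeIndex.
    # 1) Interpolate the quarterly series to a monthly frequency.
    monthly = series.resample("MS").asfreq().interpolate(method="linear")

    # 2) Min-max normalisation to the [0, 1] range.
    values = monthly.to_numpy(dtype=float)
    values = (values - values.min()) / (values.max() - values.min())

    # 3) Sliding windows: each sample holds `window` past values and the
    #    target is the following value (one-step-ahead supervised learning).
    X = np.stack([values[i:i + window] for i in range(len(values) - window)])
    y = values[window:]
    return X, y
        </preformat>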
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>This section presents the results obtained by the
PROPHET, GRU, LSTM, and Transformer models on
GDP, Inflation, Consumer Confidence Index, and
Unemployment Rate data. The models' performance
was evaluated using Root Mean Square Error (RMSE),
Mean Absolute Error (MAE), and R2 (Coefficient of
Determination) metrics. RMSE is a measure of the
dispersion between observed values and values
predicted by a model. MAE calculates the average of
the absolute differences between the predicted and
observed values. R2 is a measure of how well a
regression model fits the data.</p>
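        <p>
          For reference, the three evaluation metrics can be computed as in
the sketch below, which uses scikit-learn; any equivalent implementation
would do.
        </p>
        <preformat>
# Sketch of the evaluation metrics used in the comparison: RMSE, MAE and R2.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean square error
    mae = mean_absolute_error(y_true, y_pred)            # mean absolute error
    r2 = r2_score(y_true, y_pred)                        # coefficient of determination
    return {"RMSE": rmse, "MAE": mae, "R2": r2}


# Example with dummy values (illustrative only).
print(evaluate([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
        </preformat>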
        <p>Table 1 displays the metrics calculated on the
denormalised GDP dataset.
The metrics indicate that all the models perform
reasonably well on the GDP dataset, with the
Transformer architecture performing the best while
LSTM has comparable performance to that of GRU.
However, the PROPHET model performs poorly on the
test data as its errors exceed those generated by the
mean. These results suggest that the attention
mechanism, along with the introduction of temporal
encoding, offers significant advantages over standard
methods, such as recurrence. Figure 1 shows the
predictions of the Transformer, LSTM, and GRU
models on the GDP test set. All predictions capture the
trend of the series, with some deviations, particularly
in the case of LSTM and GRU. The Transformer
model's prediction is precise and captures the local
maxima and minima of the series effectively. It is
important to note that the test dataset includes the
period of the Covid-19 pandemic, which is identifiable
as the dip exhibited by the curve in the figure.
Despite being an unpredictable and anomalous event,
the Transformer model manages to capture the trend
of the curve better than the two neural networks,
especially in the period following the pandemic, albeit
with a slight delay. Although LSTM and GRU models
lose their predictive capability after the Covid-19
period, the Transformer continues to accurately
capture the trend of the series.</p>
        <p>Despite a slightly lower performance compared to the
previous series, the Transformer model still
outperforms the other models on the
unemployment rate dataset. Table 2 shows that the
Transformer metrics are comparable to, though better
than, those of LSTM and GRU. Figure 2 displays
the predictions made by the Transformer, LSTM, and
GRU models on the Unemployment Rate test set. As
with the previous series, the test dataset includes the
Covid-19 pandemic period. The
Transformer model remains the reference model in
terms of behavior, particularly when considering the
period after the pandemic. The LSTM and GRU models
produce less accurate predictions.</p>
        <p>From Table 3 and Table 4, once again, it can be
observed that the Transformer model outperforms
the other models, albeit in a smaller sample size
compared to the previous examples. This could be
attributed to the fact that Transformer models
perform better with larger amounts of data. With a
reduced sample size, the performance of this
architecture is closer to that of LSTM and GRU neural
networks, but it still achieves better predictive accuracy.
Due to space limitations, we will not display the
predictions made on the Inflation and Consumer
Confidence Index test sets.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this work, an investigation was conducted on
the use of a Transformer-based architecture for Time
Series Forecasting (TSF). The architecture has been
applied to four different problems, and its
performance has been compared with that of three
classical TSF models, namely PROPHET, GRU, and
LSTM. The results indicate that the Transformer
architecture outperforms traditional methods in all
experiments, demonstrating its effectiveness in
time series forecasting as well as in the field of
Natural Language Processing. However, in some
cases, the Transformer’s performance approached
that of traditional recurrent methods. This suggests
that, in the TSF domain, the attention mechanism's
benefits are more evident when processing
high-dimensional data, specifically datasets with a large
number of features. The Transformer model
performed best on the GDP time series, which
had the highest number of observations among
the series considered. Thus, Transformers have been
proven to be effective in long-term time series
forecasting. However, it is important to emphasize the
importance of careful pre-processing and thorough
data examination before integrating them into the
Transformer model. Additionally, the size of the
dataset plays a crucial role in the performance of the
Transformer. Specifically, larger datasets tend to
produce better results due to the abundance of
information available for analysis.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Chatfield</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chris.</surname>
          </string-name>
          Time-series forecasting. Chapman and Hall/CRC,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Shumway</surname>
            ,
            <given-names>Robert H.</given-names>
          </string-name>
          , et al.
          <article-title>"ARIMA models." Time series analysis and its applications: with R examples (</article-title>
          <year>2017</year>
          ):
          <fpage>75</fpage>
          -
          <lpage>163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Hewamalage</surname>
            , Hansika,
            <given-names>Christoph</given-names>
          </string-name>
          <string-name>
            <surname>Bergmeir</surname>
            , and
            <given-names>Kasun</given-names>
          </string-name>
          <string-name>
            <surname>Bandara</surname>
          </string-name>
          .
          <article-title>"Recurrent neural networks for time series forecasting: Current status and future directions."</article-title>
          <source>International Journal of Forecasting 37.1</source>
          (
          <year>2021</year>
          ):
          <fpage>388</fpage>
          -
          <lpage>427</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Renzhuo</surname>
          </string-name>
          , et al.
          <article-title>"Multivariate temporal convolutional network: A deep neural networks approach for multivariate time series forecasting</article-title>
          .
          <source>" Electronics 8</source>
          .8 (
          <year>2019</year>
          ):
          <fpage>876</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Hyndman</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Athanasopoulos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Forecasting: principles and practice</article-title>
          .
          <source>OTexts.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <surname>Qingsong</surname>
          </string-name>
          , et al.
          <article-title>"Transformers in time series: A survey."</article-title>
          <source>arXiv preprint arXiv:2202.07125</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
          </string-name>
          , et al.
          <article-title>"How to fine-tune bert for text classification?." Chinese computational linguistics: 18th China national conference</article-title>
          ,
          <source>CCL</source>
          <year>2019</year>
          , Kunming, China,
          <source>October 18-20</source>
          ,
          <year>2019</year>
          , proceedings 18. Springer International Publishing,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <surname>Hugo</surname>
          </string-name>
          , et al.
          <article-title>"Llama 2: Open foundation and fine-tuned chat models</article-title>
          .
          <source>" arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Boussif</surname>
          </string-name>
          ,
          <string-name>
            <surname>Oussama</surname>
          </string-name>
          , et al.
          <article-title>"Improving* day-ahead* Solar Irradiance Time Series Forecasting by Leveraging Spatio-Temporal Context."</article-title>
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ailing</surname>
          </string-name>
          , et al.
          <article-title>"Are transformers effective for time series forecasting?."</article-title>
          <source>Proceedings of the AAAI conference on artificial intelligence</source>
          . Vol.
          <volume>37</volume>
          . No.
          <issue>9</issue>
          .
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Hochreiter</surname>
            , Sepp, and
            <given-names>Jürgen</given-names>
          </string-name>
          <string-name>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>"Long short-term memory</article-title>
          .
          <source>" Neural computation 9.8</source>
          (
          <year>1997</year>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <surname>Kyunghyun</surname>
          </string-name>
          , et al.
          <article-title>"Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>" arXiv preprint arXiv:1406.1078</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , Sean J.,
          <source>and Benjamin Letham. "Forecasting at scale." The American Statistician 72.1</source>
          (
          <year>2018</year>
          ):
          <fpage>37</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ashish</surname>
          </string-name>
          , et al.
          <article-title>"Attention is all you need</article-title>
          .
          <source>" Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>