Forecasting Corporate Financial Time Series using Multi-phase Attention Recurrent Neural Networks

Shuhei Yoshimi, Kobe University, Kobe, Hyogo, Japan (shuhei@cs25.scitec.kobe-u.ac.jp)
Koji Eguchi, Hiroshima University, Higashi-Hiroshima, Hiroshima, Japan (eguchi@acm.org)

ABSTRACT
Attention-based Recurrent Neural Networks (RNNs) have been widely used for learning the hidden temporal structure of raw time series data. More recently, attention-based RNNs have been further enhanced to represent the multivariate temporal or spatio-temporal structure underlying multivariate time series. The latest such study achieved more effective prediction by employing an attention structure that simultaneously captures the spatial relationships among multiple different time series and the temporal structure of those time series. However, that method assumes a single time-series sample of multivariate or univariate explanatory variables; no prediction method has been designed for multiple time-series samples of multivariate explanatory variables. Moreover, to our knowledge, such previous studies have not explored financial time series that incorporate macroeconomic time series, such as Gross Domestic Product (GDP) and stock market indexes. Nor has any neural network structure been designed to focus on a specific industry. In this paper, we aim at effective forecasting of corporate financial time series from multiple time-series samples of multivariate explanatory variables. We propose a new industry-specific model that appropriately captures corporate financial time series, incorporating industry trends and macroeconomic time series as side information. We demonstrate the performance of our model through experiments with Japanese corporate financial time series on the task of predicting the return on assets (ROA) for each company.

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 INTRODUCTION
In recent years, a huge amount of information has been accumulating day by day with developing information technology. One such kind of information is corporate financial time series data in the economic and financial fields, and many economic experts are interested in gaining new insights from such data. Corporate financial time series are particularly complex, since they are often affected by various factors, such as the business conditions of each company, trends in its industry, and business sentiment in society. Traditional time series analysis and modern deep learning have both addressed the problem of time-series prediction (or forecasting); however, there is plenty of room for new research on complex multivariate time series, such as corporate financial time series. Among the widely recognized time series analysis methods, the Autoregressive Integrated Moving Average (ARIMA) model and kernel methods can each capture one aspect of spatio-temporal patterns, but it is not easy for them to forecast multivariate time series accurately [1, 10]. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which take time dependencies into account, have lately been well accepted even in financial time series analysis, such as stock price forecasting [8]. Even so, accurate long-term prediction for multivariate time series remains difficult, since some of the multivariate explanatory variables may not contribute to the prediction and may even harm prediction accuracy: explanatory variables with relatively small contributions can act as noise. In another line of research, a time-series prediction model was proposed that uses an attention-based RNN to learn attention weights over the raw time series, further enhancing the ability to represent spatio-temporal features [2]. Qin et al. [12] and Liang et al. [9] combined attention mechanisms with encoder-decoder models to achieve better performance in predicting one or several steps ahead. Liu et al. [11] developed the Dual-Stage Two-Phase (DSTP) attention-based RNN model, which captures correlations among multivariate explanatory variables and embeds past observations of the target time series via multiple levels of attention. However, none of those previous studies addressed prediction from multiple time-series samples with multivariate explanatory variables. Moreover, to our knowledge, no previous study has explored deep learning models for financial time series that incorporate macroeconomic time series, such as GDP and stock market indexes. Nor has any deep learning model structure been designed to focus on a specific industry, even though the industry trend can be influential.

This paper aims to establish a useful method for forecasting corporate financial time series by appropriately learning from multiple time-series samples with multivariate explanatory variables. We propose a new industry-specific model that captures business and industry trends, as well as macroeconomic time series, as an extension of attention-based RNNs. Through experiments with Japanese corporate financial time series, we demonstrate that our proposed model, focusing on the wholesale industry, works effectively in the task of predicting the return on assets (ROA) for each company.

2 RELATED WORK
This paper attempts to predict corporate financial indicators one step ahead by deep learning. This section covers four topics. The first is RNNs, which have been among the most popular deep learning methods for predicting time series data. The second is LSTMs, which extend RNNs to capture long- and short-term dependencies. The third is attention mechanisms, which have recently attracted much interest due to their promising prediction performance. These topics provide the basic techniques for deep-learning-based time series analysis. As the last topic, we briefly review state-of-the-art related work on deep-learning-based financial time series prediction.

2.1 Recurrent Neural Networks
Sequential data refers to any kind of data where the order of the samples is important. In particular, sequential data is called time series data when the order is based on time. It is known that prediction performance can be improved by considering the dependencies between current and past samples. One of the most popular methods for this is the RNN, an extension of feed-forward neural networks for handling sequential data [3].
Suppose that the RNN receives an input x_t at each time t and returns one output y_t at that time. At time t = n, the output y_n is produced from the input sequence x_1, x_2, x_3, ..., x_n. This is possible because the RNN is a neural network with a directed closed loop, called a 'return path', which makes it possible to store temporal information and change behavior accordingly. Figure 1 shows the structure of the RNN and the same structure unrolled along the time dimension.

[Figure 1: Structure of RNN.]

We now describe the computation in RNNs. Suppose x_t is the input to the network; h_t is the output of the middle layer; y_t is the output of the output layer; W^(in) is the input weight matrix representing the connections from the input layer to the middle layer; and W^(out) is the output weight matrix representing the connections from the middle layer to the output layer. Through the return path, the RNN feeds the output of the middle layer back into its own input: the middle layer at time t−1 is connected to the middle layer at time t, and a weight w is assigned to each recurrent connection from an arbitrary unit of the middle layer at time t−1 to an arbitrary unit at time t. We write W for the recurrent weight matrix consisting of these connection weights. Figure 2 shows one unit of the middle layer in an RNN.

[Figure 2: A unit of the middle layer in an RNN.]

The hidden variables of the middle layer at time t, h_t, are obtained from x_t, W^(in), h_{t−1}, W, an activation function f, and a bias b, as follows:

    h_t = f(W^(in) x_t + W h_{t−1} + b)    (1)

The output y_t is then obtained from h_t through W^(out):

    y_t = f(W^(out) h_t)    (2)

In this paper, we assume that the activation function f is the hyperbolic tangent (tanh), and that the model is trained with the squared error loss on the outputs of Eq. (2).
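To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of a forward pass through this recurrence. The dimensions, the random initialization, and the function name are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def rnn_forward(xs, W_in, W, W_out, b):
    """Run Eqs. (1)-(2) over an input sequence xs of shape (T, input_dim).

    Returns the outputs y_1..y_T. The activation f is taken to be tanh
    for both layers in this sketch, as assumed in the paper for Eq. (1).
    """
    hidden_dim = W.shape[0]
    h = np.zeros(hidden_dim)                    # h_0: initial hidden state
    ys = []
    for x_t in xs:
        h = np.tanh(W_in @ x_t + W @ h + b)     # Eq. (1)
        ys.append(np.tanh(W_out @ h))           # Eq. (2)
    return np.array(ys)

# Illustrative dimensions: 4 input features, 8 hidden units, 1 output.
rng = np.random.default_rng(0)
W_in, W = rng.normal(size=(8, 4)), rng.normal(size=(8, 8))
W_out, b = rng.normal(size=(1, 8)), np.zeros(8)
print(rnn_forward(rng.normal(size=(12, 4)), W_in, W, W_out, b).shape)  # (12, 1)
```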
2.2 Long Short-Term Memory
RNNs can capture the context of sequential data. In doing so, it is important to understand how long a stretch of the past sequence should be captured in the model, in other words, how far back past inputs should be reflected when predicting the output. However, in RNNs the gradient usually vanishes or explodes after a certain number of iterations during learning [4, 5]. This limitation is caused by the so-called vanishing gradient problem: when calculating the weights, the value of the gradient shrinks or grows explosively as it is propagated backward through the network. To address this problem, LSTMs [6] were proposed to realize both long- and short-term memory. Compared to the RNN, each unit in the middle layer of the LSTM has a memory cell and three gates (an input gate, an output gate, and a forget gate), while the rest of the structure is basically the same as that of the RNN. Figure 3 shows one unit of the middle layer in the LSTM.

[Figure 3: A unit of the middle layer in an LSTM.]

Let W_i, W_o, W_f, and W_s be input weight matrices, where the subscript indicates the input gate i, the output gate o, the forget gate f, or the memory cell s. Also, let U_i, U_o, U_f, and U_s be recurrent weight matrices, and b_i, b_o, b_f, and b_s be biases. Let σ be the sigmoid function and tanh the hyperbolic tangent. At time t, let i_t be the output of the input gate; ŝ_t a new candidate state of the memory cell; o_t the output of the output gate; f_t the output of the forget gate; s_t the state of the memory cell; and h_t the output of the memory cell. These variables are obtained as follows:

    i_t = σ(W_i x_t + U_i h_{t−1} + b_i)    (3)
    ŝ_t = tanh(W_s x_t + U_s h_{t−1} + b_s)    (4)
    f_t = σ(W_f x_t + U_f h_{t−1} + b_f)    (5)
    s_t = i_t ⊙ ŝ_t + f_t ⊙ s_{t−1}    (6)
    o_t = σ(W_o x_t + U_o h_{t−1} + b_o)    (7)
    h_t = o_t ⊙ tanh(s_t)    (8)

where ⊙ denotes the Hadamard (element-wise) product. An LSTM unit works in the following four steps:

• Step 1: update the output of the forget gate f_t. First, the model determines what to forget from the cell state, as in Eq. (5). f_t is computed from the output of the previous step h_{t−1} and the input x_t; σ gives output values between 0 and 1. When the value is 1, the current cell state is kept completely; when it is 0, it is forgotten completely.
• Step 2: update the output of the input gate i_t and the new candidate cell state ŝ_t. Second, the model determines what information to add to the cell state, as in Eqs. (3) and (4). Again the computation uses h_{t−1} and x_t. The input gate i_t applies σ to determine which parts of the cell state will be updated, and tanh is used to obtain a new candidate value ŝ_t. In the next step these two are combined to update the cell state s_{t−1}.
• Step 3: update the state of the memory cell s_t. Third, the model updates the cell state, as in Eq. (6). The old cell state s_{t−1} is multiplied by f_t to forget unnecessary information, and then the product of i_t and ŝ_t is added to the cell memory.
• Step 4: update the output of the memory cell h_t and the output of the output gate o_t. Finally, the model determines the output, as in Eqs. (7) and (8). h_t is based on the cell state, but in a filtered version. First, σ is applied to the previous output h_{t−1} and the input x_t to obtain the output gate o_t, whose values in [0, 1] indicate which parts of the cell state will be output. The cell state s_t is transformed by tanh into the range [−1, 1], and this transformed state is multiplied by o_t, yielding h_t. This output is forwarded to the next step in the network.

These structures overcome the limitation of RNNs, in which only limited-term memory can be captured, and achieve more accurate estimation by capturing longer-term memory.
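As a companion to Eqs. (3)-(8), here is a small NumPy sketch of a single LSTM step under the same notation; the parameter container and the illustrative sizes are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, P):
    """One LSTM step, Eqs. (3)-(8). P holds the weight matrices
    W_*, U_* and biases b_* for the gates i, o, f and the memory cell s."""
    i_t = sigmoid(P['W_i'] @ x_t + P['U_i'] @ h_prev + P['b_i'])     # Eq. (3), input gate
    s_hat = np.tanh(P['W_s'] @ x_t + P['U_s'] @ h_prev + P['b_s'])   # Eq. (4), candidate cell state
    f_t = sigmoid(P['W_f'] @ x_t + P['U_f'] @ h_prev + P['b_f'])     # Eq. (5), forget gate
    s_t = i_t * s_hat + f_t * s_prev                                  # Eq. (6), Hadamard products
    o_t = sigmoid(P['W_o'] @ x_t + P['U_o'] @ h_prev + P['b_o'])     # Eq. (7), output gate
    h_t = o_t * np.tanh(s_t)                                          # Eq. (8), cell output
    return h_t, s_t

# Illustrative dimensions: 4 inputs, 8 hidden units.
rng = np.random.default_rng(0)
P = {f'W_{g}': rng.normal(size=(8, 4)) for g in 'iofs'}
P.update({f'U_{g}': rng.normal(size=(8, 8)) for g in 'iofs'})
P.update({f'b_{g}': np.zeros(8) for g in 'iofs'})
h, s = lstm_step(rng.normal(size=4), np.zeros(8), np.zeros(8), P)
```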
2.3 Attention mechanisms
Attention mechanisms have been used successfully with LSTMs [11, 12]. In time series analysis with LSTMs, attention mechanisms can simultaneously capture the spatial relationships among multiple different time series and the temporal structure of those series. In the rest of this subsection, we briefly review the attention mechanism developed by Liu et al. [11].

2.3.1 Spatial-attention LSTM. The purpose of the spatial attention mechanism is to obtain the spatial correlations among multivariate input time series. Given the time series (of length T) of the k-th explanatory attribute at time t, x^k_t = (x^k_{t−T+1}, ..., x^k_t) ∈ R^T, the following attention mechanism can be used:

    a^k_t = v_a^T tanh(W_a [h_{t−1}; s_{t−1}] + U_a x^k_t + b_a)    (9)
    α^k_t = exp(a^k_t) / Σ_{j=1}^{n} exp(a^j_t)    (10)

where [∗; ∗] is a concatenation operation; v_a, b_a ∈ R^T, W_a ∈ R^{T×2m}, and U_a ∈ R^{T×T} are parameters to learn; h_{t−1} ∈ R^m and s_{t−1} ∈ R^m are the hidden state and cell state vectors at time t−1, respectively; and m is the number of hidden states of this attention module. The spatial attention weights at time t, (α^1_t, ..., α^n_t), are determined by the hidden and cell states at time t−1 and by the inputs of the explanatory attributes at time t. They represent the effect of each explanatory attribute on the forecast of the target. Using the attention weight associated with each explanatory attribute, the multivariate input at time t, x_t = (x^1_t, ..., x^n_t), is weighted as follows:

    x̃_t = (α^1_t x^1_t, α^2_t x^2_t, ..., α^n_t x^n_t)^T    (11)

Let f_spatial denote the LSTM with the spatial attention mechanism described above. Then we obtain:

    (h_t, s_t) = f_spatial(h_{t−1}, s_{t−1}, x̃_t)    (12)
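The spatial attention of Eqs. (9)-(11) can be sketched as follows in NumPy, treating each attribute's length-T window as a row of a matrix; names and sizes are illustrative assumptions only.

```python
import numpy as np

def spatial_attention(X_window, h_prev, s_prev, v_a, W_a, U_a, b_a):
    """Spatial attention over n explanatory attributes, Eqs. (9)-(11).

    X_window: (n, T) array; row k is x^k_t, the length-T series of attribute k.
    Returns the attention weights alpha (n,) and the reweighted input
    x_tilde for the current time step.
    """
    hs = np.concatenate([h_prev, s_prev])              # [h_{t-1}; s_{t-1}]
    scores = np.array([
        v_a @ np.tanh(W_a @ hs + U_a @ x_k + b_a)      # Eq. (9), one score per attribute
        for x_k in X_window
    ])
    alpha = np.exp(scores) / np.exp(scores).sum()      # Eq. (10), softmax over attributes
    x_t = X_window[:, -1]                              # current observations x_t^1..x_t^n
    return alpha, alpha * x_t                          # Eq. (11)

# Illustrative sizes: n = 5 attributes, window T = 12, m = 8 hidden states.
rng = np.random.default_rng(0)
T, m, n = 12, 8, 5
alpha, x_tilde = spatial_attention(
    rng.normal(size=(n, T)), np.zeros(m), np.zeros(m),
    rng.normal(size=T), rng.normal(size=(T, 2 * m)),
    rng.normal(size=(T, T)), np.zeros(T))
```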
2.3.2 Temporal-attention LSTM. The purpose of the temporal attention mechanism is to maintain the temporal relationships of the spatial attention. The spatio-temporal relationships within a fixed-size window are extracted using the spatial relationships among the multivariate time series in a time window of length T, as described above. Because it is not sufficient to capture the temporal relationships within a fixed-size window alone, an attention mechanism that selects hidden states is promising: the hidden state most relevant to the target (or objective) variable is selected. For each i-th hidden state, the attention mechanism gives temporal attention weights (β^1_t, ..., β^T_t), as follows:

    b^i_t = v_b^T tanh(W_b [d_{t−1}; s'_{t−1}] + U_b h_i + b_b)    (13)
    β^i_t = exp(b^i_t) / Σ_{j=1}^{T} exp(b^j_t)    (14)

where h_i is the i-th hidden state vector obtained in the spatial attention module described above; d_{t−1} ∈ R^p and s'_{t−1} ∈ R^p are the hidden state and cell state vectors at time t−1, respectively; p is the number of hidden states of this attention module; and v_b, b_b ∈ R^p, W_b ∈ R^{p×2p}, and U_b ∈ R^{p×m} are parameters to learn. Next, the context vector c_t is defined as follows:

    c_t = Σ_{j=1}^{T} β^j_t h_j    (15)

The context vector c_t summarizes the information of all the hidden states and represents the temporal relationships within a time window. This context vector is then aligned with the target variable y_t, as follows:

    ỹ_t = w̃^T [y_t; c_t] + b̃    (16)

where w̃ ∈ R^{m+1} and b̃ ∈ R are parameters that map the concatenation to the target variable. Aligning the target time series with the context vector makes it easier to maintain the temporal relationships and to use the result to update the hidden state and cell state. Let f_temporal denote the LSTM with the temporal attention mechanism described above. Then we obtain:

    (d_t, s'_t) = f_temporal(d_{t−1}, s'_{t−1}, ỹ_{t−1})    (17)

Given the T-length multivariate explanatory time series X_T = (x_1, ..., x_t, ..., x_T), where x_t = (x^1_t, ..., x^n_t), and the target time series y_T = (y_1, ..., y_t, ..., y_T), the context vector c_T and the hidden state vector d_T are concatenated to make the final prediction of the target variable one step ahead, y_{T+1}, as follows:

    ŷ_{T+1} = F(X_T, y_T)    (18)
            = v_y^T (W_y [d_T; c_T] + b_y) + b'_y    (19)

where F denotes the predictor; W_y ∈ R^{p×(p+m)} and b_y ∈ R^p map the concatenation [d_T; c_T] ∈ R^{p+m} to a p-dimensional latent space; and a linear function with weights v_y ∈ R^p and bias b'_y ∈ R produces the final prediction ŷ_{T+1}.
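Analogously, here is a minimal NumPy sketch of the temporal attention weighting and the context vector of Eqs. (13)-(15), again with assumed shapes and names.

```python
import numpy as np

def temporal_attention(H, d_prev, sp_prev, v_b, W_b, U_b, b_b):
    """Temporal attention over the T hidden states from the spatial module,
    Eqs. (13)-(15). H: (T, m) array whose i-th row is hidden state h_i.
    Returns the weights beta (T,) and the context vector c_t (m,)."""
    ds = np.concatenate([d_prev, sp_prev])             # [d_{t-1}; s'_{t-1}]
    scores = np.array([
        v_b @ np.tanh(W_b @ ds + U_b @ h_i + b_b)      # Eq. (13)
        for h_i in H
    ])
    beta = np.exp(scores) / np.exp(scores).sum()       # Eq. (14), softmax over time steps
    c_t = beta @ H                                     # Eq. (15): sum_j beta^j_t h_j
    return beta, c_t

# Illustrative sizes: window T = 12, m = 8, p = 8.
rng = np.random.default_rng(0)
T, m, p = 12, 8, 8
beta, c = temporal_attention(
    rng.normal(size=(T, m)), np.zeros(p), np.zeros(p),
    rng.normal(size=p), rng.normal(size=(p, 2 * p)),
    rng.normal(size=(p, m)), np.zeros(p))
```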
2.4 State-of-the-art financial time series prediction with LSTM
Quite recently, Liu et al. [11] focused on spatial correlations and incorporated the target time series to develop the Dual-Stage Two-Phase (DSTP) attention-based LSTM model. Figure 4 sketches the structure of their model. Let X_t be the multivariate explanatory time series and Y_t the target time series. At the second phase of the spatial-attention LSTM, x^k in Eq. (9) is replaced with [x̃^k; y^k], where the (observed) target time series is concatenated with the corresponding explanatory time series.

[Figure 4: Structure of the DSTP-based model.]

They enhanced the attention mechanisms to incorporate both spatial correlations and temporal relationships. However, we have three concerns when employing the DSTP-based model for our goal. First, the dataset they used was the NASDAQ 100 stock data [https://cseweb.ucsd.edu/~yaq007/NASDAQ100_stock_data.html], which involves multiple time-series samples with a univariate explanatory variable, not with multivariate explanatory variables. Second, their model did not incorporate macroeconomic time series. Third, their model did not consider industry-wide trends.

3 DATA
In this study, we use the "Surveys for the Financial Statements Statistics of Corporations by Industry" [https://www.mof.go.jp/english/pri/reference/ssc/outline.htm] collected by the Ministry of Finance Japan. The surveys are based on sampling in which the target commercial corporations are general partnership companies, limited partnership companies, limited liability companies, and stock companies, all of whose head offices are located in Japan. We excluded companies in the 'finance and insurance' industry from our study. The collection period is from the first quarter of 2003 to the fourth quarter of 2016. The surveys consist of an annual survey and a quarterly survey; the total number of companies is 57,775 in the quarterly survey and 60,516 in the annual survey. The surveyed items include the financial indexes shown in Table 1.

Table 1: Financial statements. '†' and '‡' indicate that the designated item is recorded as of the beginning and as of the end of every term (fiscal year or quarter), respectively.

Liabilities
  Quarterly surveys: Notes, accounts payable, and trade†,‡
  Annual surveys: Notes†,‡; Accounts payable and trade†,‡
Fixed assets
  Quarterly surveys: Land†,‡; Construction in progress†,‡; Other tangible assets†,‡; Intangible assets†,‡; Total liabilities and net assets†,‡
  Annual surveys: Land†,‡; Construction in progress†,‡; Others†,‡; Excluded software†,‡; Software†,‡; Total assets†,‡
Personnel
  Quarterly surveys: Number of employees‡
  Annual surveys: Number of employees‡
Profit and loss
  Quarterly surveys: Depreciation and amortization‡; Sales‡; Cost of sales‡; Operating profit‡; Ordinary profit‡
  Annual surveys: Depreciation and amortization‡; Extraordinary depreciation and amortization‡; Sales‡; Cost of sales‡; Operating profit‡; Ordinary profit‡

We use the financial statements in the quarterly survey and calculate various financial ratios as explanatory and target variables for our analysis. The data need preprocessing before the time-series analysis. First, the survey dataset contains both long-lived and short-lived companies; it is not easy to include short-lived companies in a time-series analysis, so we excluded them. Second, the survey dataset contains a number of missing values, which must be handled before calculating the financial ratios. In summary, we employed the following three preprocessing steps, in this order: (1) extraction of survey items for long-term companies (exclusion processing), (2) imputation of missing values in survey items (imputation processing), and (3) calculation of financial ratios from the survey items (calculation processing). We describe the details in the following subsections.

3.1 Exclusion processing
We first perform exclusion processing, before imputation processing, because some companies' data would diverge greatly from the true data even if imputation were performed at this point. We excluded companies in the following cases:
• Case 1: companies that do not have all of the 56 time steps.
• Case 2: companies whose financial statements have no data over the entire time span.
After the exclusion, the number of companies in the dataset for the experiments became 2,296.

3.2 Imputation processing
Second, before calculation processing, we perform imputation processing, because some companies' data have missing values, without which the financial ratios cannot be calculated. The details of the imputation processing are given in the appendix.

3.3 Calculation processing
In this study, a number of financial ratios are used as explanatory variables. Each financial ratio is based on the formula of the corresponding financial ratio in the corporate enterprise statistics, as defined in Table 2.

Table 2: Financial ratios. '∗' indicates that the designated item is obtained by averaging its values as of the beginning and the end of each quarter. '∗∗' indicates that the designated item is obtained as the amount of increase (or decrease) from the beginning to the end of each quarter.

X0  Operating return on assets = (Operating profit) / (Total liabilities and net assets∗)
X1  Ordinary return on assets = (Ordinary profit) / (Total liabilities and net assets∗)
X2  Operating profit ratio = (Operating profit) / Sales
X3  Ordinary profit ratio = (Ordinary profit) / Sales
X4  Total asset turnover ratio = Sales / (Total liabilities and net assets∗)
X5  Tangible fixed assets turnover ratio = Sales / (Land∗ + (Other tangible assets∗))
X6  Accounts payable turnover ratio = Sales / (Notes, accounts payable, and trade∗)
X7  Depreciation and amortization ratio = (Depreciation and amortization) / ((Other tangible assets) + (Intangible assets) + (Depreciation and amortization))
X8  Capital equipment = (Land∗ + (Other tangible assets∗)) / (Number of employees)
X9  Cash flow ratio = ((Ordinary profit) + (Depreciation and amortization)) / (Total liabilities and net assets∗)
X10 Capital investment ratio = ((Construction in progress∗∗) + (Other tangible assets∗∗) + (Intangible assets∗∗) + (Depreciation and amortization)) / (Total liabilities and net assets∗)
X11 Gross profit ratio = (Sales − (Cost of sales)) / Sales
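To illustrate the '∗' convention in Table 2 (averaging the beginning- and end-of-quarter values), here is a hypothetical helper for the ordinary return on assets X1; the argument names and figures are invented for the example.

```python
def ordinary_roa(ordinary_profit, total_assets_begin, total_assets_end):
    """X1 in Table 2: ordinary profit over the average of total liabilities
    and net assets ('*' = mean of beginning- and end-of-quarter values)."""
    avg_assets = (total_assets_begin + total_assets_end) / 2.0
    return ordinary_profit / avg_assets

# One quarter of a fictitious company (units are arbitrary):
print(ordinary_roa(ordinary_profit=120.0,
                   total_assets_begin=9500.0,
                   total_assets_end=10500.0))  # 0.012
```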
3.4 External data
In addition to the financial ratios, we use two macroeconomic time series as external data: the Nikkei Average closing price (N) [https://indexes.nikkei.co.jp/nkave//index/profile?idx=nk225] and the Japanese GDP (G) [https://www.esri.cao.go.jp/jp/sna/data/data_list/sokuhou/files/2019/toukei_2019.html]. The Nikkei Average closing price (N) is extracted from January 2003 to December 2016 on a monthly basis. The Japanese GDP (G) is extracted from the first quarter of 2003 to the fourth quarter of 2016 on a quarterly basis.

4 PROPOSED MODEL
In this paper, we propose a model for predicting the corporate financial time series of a target industry one step ahead. Figure 5 shows the model's flow from input time series to output time series. The 1st phase aims to extract the spatial correlations among the multivariate time series. The 2nd and 3rd phases aim to predict the target variable at time T+1, given the past explanatory and target time series of window size T. More specifically, the 2nd phase extracts the spatial correlations between the target time series and the multivariate explanatory time series, while the 3rd phase extracts the temporal relationships.

[Figure 5: Structure of the proposed model.]

1st phase. This phase captures the spatial correlations in all of the multivariate time series (including both explanatory and target variables observed as time series) over their entire length. By applying the attention mechanisms to the entire time series before splitting the time-series data, the correlations between the time-series samples can be captured well. For multiple time-series samples with multivariate variables, where each time-series sample corresponds to a company, it is not easy to capture the spatial correlations well after the data splitting, since splitting does not distinguish which time-series slice belongs to which company.

First, the spatial-attention LSTM, denoted F_spatial, is applied to all company time series (A) to obtain the spatial correlations among the multivariate time series for every company, as in Eqs. (9), (10), and (11):

    Á = F_spatial(A)    (20)

We then use a linear function to aggregate all samples; this linear function summarizes all companies' features into an aggregated feature space. Let W be the weights and b the biases of the linear function:

    Ã = W Á + b    (21)

Eqs. (20) and (21) allow the model to incorporate all financial statements. Here, Á with shape (J, L, K), a J × L × K third-order tensor, is aggregated into Ã with shape (1, L, K), where J, L, and K indicate the number of time-series samples (companies), the length of the training data, and the number of dimensions of the multivariate variables (including both explanatory and target variables), respectively. Since the Nikkei Average closing price (N) is on a monthly basis and its length is Ĺ = L × 3, an attention mechanism is applied to aggregate it to a quarterly basis; N with shape (1, Ĺ, 1) is aggregated into Ñ with shape (1, L, 1). More specifically, the following formulas are applied, where i ∈ {1, 2, 3} indicates the first to third month of quarter t:

    γ^i_t = exp(N^i_t) / Σ_{j=1}^{3} exp(N^j_t)    (22)
    Ñ_t = (γ^1_t N^1_t, γ^2_t N^2_t, γ^3_t N^3_t)^T    (23)

Then the quarterly representation of the Nikkei Average closing price, Ñ, is concatenated with the aggregated company data Ã, the Japanese GDP G, and the multivariate explanatory time series in the target industry, I. Here, I denotes all the financial-ratio time series other than the target time series in that industry:

    Z = [Ã; Ñ; G; I]    (24)

The shape of G is (1, L, 1) and the shape of I is (B, L, K−1), where B is the mini-batch size for I when learning the model. Therefore, the shape of the concatenated samples Z is (B, L, K × 2 + 1). By applying the spatial-attention LSTM to Z, we obtain the spatial correlations between the time series:

    Ẑ = F_spatial(Z)    (25)

In this paper, the instances of F_spatial in Eqs. (20) and (25) are learned independently from A and Z, respectively. As the final stage of the 1st phase, Ẑ is sliced by shifting windows of length T one time step at a time. Therefore, the number of time-series samples becomes B × (L − T), and the shape is (B × (L − T), T, K × 2 + 1).
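Eqs. (22) and (23) amount to a softmax weighting of the three monthly values within each quarter. The sketch below assumes that the three weighted values are summed to produce one value per quarter, which matches the stated (1, L, 1) shape of Ñ; the paper does not spell this reduction out, so treat it as our reading.

```python
import numpy as np

def quarterly_from_monthly(n_monthly):
    """Attention-style aggregation of a monthly series to quarterly,
    after Eqs. (22)-(23). n_monthly has length L * 3."""
    n_monthly = np.asarray(n_monthly, dtype=float).reshape(-1, 3)  # one row per quarter
    gamma = np.exp(n_monthly) / np.exp(n_monthly).sum(axis=1, keepdims=True)  # Eq. (22)
    # Eq. (23) lists the three weighted values gamma_i * N_i; summing them
    # gives one value per quarter. A real Nikkei series would be normalized
    # first to keep exp() numerically stable (our assumption).
    return (gamma * n_monthly).sum(axis=1)

print(quarterly_from_monthly([1.0, 2.0, 3.0, 2.0, 2.0, 2.0]))  # two quarterly values
```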
2nd phase. First, Ẑ of length T and the target time series Y_T of the same length are concatenated at each time t ∈ {1, ..., T}. Here, Y_T is obtained by slicing in the same manner as Z above:

    Ż = [Ẑ; Y_T]    (26)

After the concatenation, the shape of the time-series samples Ż is (B × (L − T), T, K × 2 + 2). By applying the spatial-attention LSTM to this, we obtain the spatial correlations between the target time series and the other time series:

    Ż̂ = F_spatial(Ż)    (27)

3rd phase. At this phase, we apply the temporal-attention LSTM (denoted F_temporal) to capture the temporal relationships in the spatial attentions, as in Eq. (18). In other words, it captures the spatio-temporal relationships of multiple time series starting at different times:

    ŷ_{T+1} = F_temporal(Ż̂, Y_T)    (28)

The generated ŷ_{T+1} is the final prediction. All the models in this paper use a back-propagation algorithm for training. During the training process, the mean squared error (MSE) between the predicted target ŷ_{T+1} and the ground truth y_{T+1} is minimized using the Adam optimizer [7].
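The simple LSTM baseline used in Section 5 can be sketched as below, in PyTorch (the paper does not name a framework), with the hyperparameters from Section 5.1: two LSTM layers, window T = 12, K = 12 variables, mini-batch B = 64, learning rate 0.001, 500 epochs, and MSE loss with Adam. The class name and the synthetic tensors are placeholders; this is not the full proposed model.

```python
import torch
import torch.nn as nn

class SimpleForecaster(nn.Module):
    """A plain LSTM baseline: maps a (batch, T, K) window to a one-step-ahead scalar."""
    def __init__(self, n_features, n_units=16):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_units, num_layers=2, batch_first=True)
        self.head = nn.Linear(n_units, 1)

    def forward(self, x):
        out, _ = self.lstm(x)           # out: (batch, T, n_units)
        return self.head(out[:, -1])    # use the last hidden state

model = SimpleForecaster(n_features=12)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 12, 12)   # mini-batch B = 64, window T = 12, K = 12 variables
y = torch.randn(64, 1)        # synthetic targets for illustration
for epoch in range(500):      # 500 epochs, as in Section 5.1
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```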
5 EXPERIMENTS
5.1 Settings
For the experiments, we carry out a validation step and a test step, as described below. In both steps, we slide the data one step forward at a time, resulting in 15 datasets.
• Validation step: We set the length of the training data L to 40 steps, as shown in Figure 6. Training is carried out to search for the best value of the number of LSTM units, with search range U ∈ {16, 32, 64, 128}.
• Test step: We set the length of the training data L to 41 steps, as shown in Figure 7. For training, we use the best hyperparameter selected in the validation step.

[Figure 6: Data splitting for validation step.]
[Figure 7: Data splitting for test step.]

The other hyperparameters are set as follows: window length T = 12, number of epochs 500, learning rate 0.001, and mini-batch size B = 64. These were determined empirically.

In the setting above, we take 'Wholesale and trade' as the target industry, and the return on assets (ROA), denoted X1 in Table 2, as the target variable Y. As the multivariate explanatory variables, we use X = {X0, X2, X3, ..., X11}. Therefore, the number of variables in the time series is K = 12, including both explanatory and target variables.

To demonstrate the effectiveness of the proposed model, we compare its prediction performance with that of the DSTP-based model and a simple LSTM model. For the DSTP-based and simple LSTM models, the Nikkei Average closing price (N) was converted to quarterly values by averaging every three months. We set the number of LSTM layers to two.

For evaluation, we use the mean squared error (MSE) together with the sample standard deviation. We further test whether the improvement of the proposed model is statistically significant via the Wilcoxon signed-rank test at the 0.05 level, compared with the baselines.

5.2 Results and discussions
Table 3 shows the evaluation results in terms of mean squared error (MSE) and sample standard deviation (SD) for the proposed model, the DSTP-based model, and the simple LSTM model under various settings. Here, I indicates the target industry data, with I = [X; Y] ∈ A, and '# of units' indicates the number of LSTM units for each model, determined in the validation step.

Table 3: Mean squared errors (MSE) and sample standard deviations (SD) of the proposed model, the DSTP-based model, and the simple LSTM model in various settings.

Model 1 | Proposed model | (I), A, N, G | 16 units | MSE 2.08 × 10^−4 | SD 8.48 × 10^−5
Model 2 | Proposed model (w/o 'A') | I, N, G | 32 units | MSE 2.11 × 10^−4 | SD 7.43 × 10^−5
Model 3 | Proposed model (w/o 'N' and 'G') | (I), A | 128 units | MSE 2.28 × 10^−4 | SD 7.64 × 10^−5
Model 4 | DSTP-based model | A, N, G | 16 units | MSE 2.12 × 10^−4 | SD 6.47 × 10^−5
Model 5 | DSTP-based model | I, N, G | 16 units | MSE 2.26 × 10^−4 | SD 9.40 × 10^−5
Model 6 | DSTP-based model | A | 16 units | MSE 2.11 × 10^−4 | SD 7.43 × 10^−5
Model 7 | Simple LSTM model | A, N, G | 16 units | MSE 3.86 × 10^−4 | SD 6.07 × 10^−4
Model 8 | Simple LSTM model | A | 16 units | MSE 3.10 × 10^−4 | SD 3.72 × 10^−4
Model 9 | Simple LSTM model | I | 16 units | MSE 2.95 × 10^−4 | SD 3.43 × 10^−4

The following discussion clarifies the contributions from three points of view:
• Using macroeconomic time series: Comparing the proposed model with macroeconomic time series (Model 1) and without them (Model 3) in Table 3, Model 1 works more effectively than Model 3 on average, successfully capturing properties of the macroeconomic time series. We confirmed that the improvement brought by Model 1 over Model 3 was statistically significant via the Wilcoxon signed-rank test at the 0.05 level.
• Focusing on a specific industry: Comparing the proposed model with all the time series (Model 1) against the variant that does not consider the spatial correlations among all companies' time series (Model 2), Model 1 works moderately more effectively than Model 2 on average. We confirmed that this improvement was statistically significant in the same manner as above.
• Using multiple time-series samples with multivariate explanatory variables: Given all the time series, Model 1 was more effective on average (with statistical significance) than Models 4 and 7; however, this was not the case in the other situations. Our proposed models thus work modestly more effectively than, or comparably to, the DSTP-based models depending on the situation, and they work far more effectively than the simple LSTM models. More detailed evaluation is left for future work.
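A sketch of this evaluation protocol, mean MSE with sample SD over the 15 sliding datasets plus the paired Wilcoxon signed-rank test at the 0.05 level, is given below; the per-dataset scores are synthetic placeholders for illustration, not the paper's results.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-dataset MSEs for two models over the 15 sliding datasets.
rng = np.random.default_rng(0)
mse_proposed = rng.uniform(1.5e-4, 2.5e-4, size=15)
mse_baseline = mse_proposed + rng.uniform(0.0, 1.0e-4, size=15)

print('proposed: %.2e +/- %.2e' % (mse_proposed.mean(), mse_proposed.std(ddof=1)))
print('baseline: %.2e +/- %.2e' % (mse_baseline.mean(), mse_baseline.std(ddof=1)))

# Paired Wilcoxon signed-rank test at the 0.05 level, as in the paper.
stat, p = wilcoxon(mse_proposed, mse_baseline)
print('significant at 0.05:', p < 0.05)
```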
6 CONCLUSIONS
In order to establish a useful method for forecasting corporate financial time series data, we aimed in this paper to forecast one step ahead using multiple time-series samples of multivariate explanatory variables. For this objective, we proposed an industry-specific model that simultaneously captures corporate financial time series and industry trends, with a model structure that also captures macroeconomic time series appropriately. In particular, we showed the effectiveness of the proposed model with respect to three points. First, the model can capture macroeconomic time series, such as GDP, more appropriately than the DSTP- and LSTM-based models. Second, the model can be focused on a specific industry. Third, the model is designed to be learned from multiple time-series samples of multivariate explanatory variables, producing modestly more effective or comparable prediction performance compared to the DSTP-based model. We developed a model for predicting corporate ROA in the wholesale trade industry; our model may also be effective for forecasting other target variables in other industries, but this extension is left for future work.

ACKNOWLEDGMENTS
We thank Takuji Kinkyo and Shigeyuki Hamori for valuable discussions and comments. This work was supported in part by the Grant-in-Aid for Scientific Research (#15H02703) from JSPS, Japan.

REFERENCES
[1] M. Hadi Amini, Amin Kargarian, and Orkun Karabasoglu. 2016. ARIMA-based decoupled time series forecasting of electric vehicle charging demand for stochastic power system operation. Electric Power Systems Research 140 (2016), 378-390.
[2] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. 2016. RETAIN: Interpretable predictive model in healthcare using reverse time attention mechanism. Advances in Neural Information Processing Systems (NIPS 2016) 29 (2016), 3504-3512.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[4] Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München.
[5] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen (Eds.). IEEE Press.
[6] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735-1780.
[7] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).
[8] Hao Li, Yanyan Shen, and Yanmin Zhu. 2018. Stock Price Prediction Using Attention-based Multi-Input LSTM. Proceedings of Machine Learning Research 95 (2018), 454-469.
[9] Yuxuan Liang, Songyu Ke, Junbo Zhang, Xiuwen Yi, and Yu Zheng. 2018. GeoMAN: Multi-level attention networks for geo-sensory time series prediction. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI-18). 3428-3434.
[10] Jie Liu and Enrico Zio. 2017. SVM hyperparameters tuning for recursive multi-step-ahead prediction. Neural Computing & Applications 28, 12 (2017), 3749-3763.
[11] Yeqi Liu, Chuanyang Gong, Ling Yang, and Yingyi Chen. 2019. DSTP-RNN: A dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Systems with Applications 143 (2019).
[12] Yao Qin, Dongjin Song, Haifeng Cheng, Wei Cheng, Guofei Jiang, and Garrison W. Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17). 2627-2633.

A IMPUTATION PROCESSING
We briefly describe the imputation processing used to handle time series with missing values. Consider a time series g in the quarterly survey dataset and a time series G in the annual survey dataset for each financial index. We carry out the imputation for each financial index in the two steps below. Here, let q^y_j be the value as of quarter j of year y in a specific financial index time series, and G_y the corresponding annual value, as shown in Table 4.

Table 4: Notations of years and terms. Year y consists of quarters j = 1, ..., 4, with quarterly survey values q^y_1, q^y_2, q^y_3, q^y_4 and annual survey value G_y.

• Step 1: using annual data.
For the imputation, we make use of the annual survey dataset. When a value q^y_j is missing, we attempt to look up G_y in the annual survey dataset. If G_y is also missing, this step is skipped for q^y_j. The imputation consists of the following six cases, depending on the financial index:
– Case 1: q^y_1 is equal to G_y.
    q^y_1 = G_y    (29)
– Case 2: q^y_4 is equal to G_y.
    q^y_4 = G_y    (30)
– Case 3: the sum of q^y_1, q^y_2, q^y_3, and q^y_4 is equal to G_y. The imputation in this case depends on the number of missing values among {q^y_1, q^y_2, q^y_3, q^y_4}. We assume Cases 3-1 to 3-4 below, where a, b, c, d ∈ {1, 2, 3, 4}:
    ∗ Case 3-1 (one missing value): suppose q^y_a is missing. Then
        q^y_a = G_y − q^y_b − q^y_c − q^y_d    (31)
    ∗ Case 3-2 (two missing values): suppose q^y_a and q^y_b are missing. Then
        q^y_a = q^y_b = (G_y − q^y_c − q^y_d) / 2    (32)
    ∗ Case 3-3 (three missing values): suppose q^y_a, q^y_b, and q^y_c are missing. Then
        q^y_a = q^y_b = q^y_c = (G_y − q^y_d) / 3    (33)
    ∗ Case 3-4 (four missing values): suppose q^y_a, q^y_b, q^y_c, and q^y_d are missing. Then
        q^y_a = q^y_b = q^y_c = q^y_d = G_y / 4    (34)
– Case 4: q^y_1 is equal to the sum of G_y and another financial index G'_y.
    q^y_1 = G_y + G'_y    (35)
– Case 5: q^y_4 is equal to the sum of G_y and another financial index G'_y.
    q^y_4 = G_y + G'_y    (36)
– Case 6: the sum of q^y_1, q^y_2, q^y_3, and q^y_4 is equal to the sum of G_y and another financial index G'_y. The imputation in this case again depends on the number of missing values among {q^y_1, q^y_2, q^y_3, q^y_4}. We assume Cases 6-1 to 6-4 below, where a, b, c, d ∈ {1, 2, 3, 4}:
    ∗ Case 6-1 (one missing value): suppose q^y_a is missing. Then
        q^y_a = G_y + G'_y − q^y_b − q^y_c − q^y_d    (37)
    ∗ Case 6-2 (two missing values): suppose q^y_a and q^y_b are missing. Then
        q^y_a = q^y_b = (G_y + G'_y − q^y_c − q^y_d) / 2    (38)
    ∗ Case 6-3 (three missing values): suppose q^y_a, q^y_b, and q^y_c are missing. Then
        q^y_a = q^y_b = q^y_c = (G_y + G'_y − q^y_d) / 3    (39)
    ∗ Case 6-4 (four missing values): suppose q^y_a, q^y_b, q^y_c, and q^y_d are missing. Then
        q^y_a = q^y_b = q^y_c = q^y_d = (G_y + G'_y) / 4    (40)
• Step 2: linear interpolation and extrapolation.
We perform commonly used linear interpolation and extrapolation for the remaining missing values in each financial index time series in the quarterly survey dataset.
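A compact sketch of the Step 1 imputation for the 'sum' style indexes (Cases 3 and 6), where the missing quarters share the remainder of the annual total equally, per Eqs. (31)-(34) and (37)-(40); the function and argument names are our own.

```python
import numpy as np

def impute_from_annual(quarters, G_y, G_other=0.0):
    """Step 1 imputation for the 'sum' indexes (Cases 3 and 6):
    the missing quarterly values share the remainder of the annual total
    equally, per Eqs. (31)-(34) and (37)-(40).

    quarters: list of four values with np.nan for missing entries.
    G_y: annual value; G_other: the second annual index for Case 6
    (leave 0.0 for Case 3)."""
    q = np.asarray(quarters, dtype=float)
    missing = np.isnan(q)
    if missing.any():
        remainder = G_y + G_other - np.nansum(q)
        q[missing] = remainder / missing.sum()
    return q

# Case 3-2: two of four quarters missing, annual total known.
print(impute_from_annual([10.0, np.nan, np.nan, 30.0], G_y=100.0))  # [10. 30. 30. 30.]
```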