Forecasting Corporate Financial Time Series using Multi-phase Attention Recurrent Neural Networks

Shuhei Yoshimi, Kobe University, Kobe, Hyogo, Japan (shuhei@cs25.scitec.kobe-u.ac.jp)
Koji Eguchi, Hiroshima University, Higashi-Hiroshima, Hiroshima, Japan (eguchi@acm.org)

ABSTRACT
Attention-based Recurrent Neural Networks (RNNs) have been widely used for learning the hidden temporal structure of raw time series data. More recently, attention-based RNNs have been further enhanced to represent the multivariate temporal or spatio-temporal structure underlying multivariate time series. The latest such study achieved more effective prediction by employing an attention structure that simultaneously captures the spatial relationships among multiple different time series and the temporal structure of those time series. However, that method assumes a single time-series sample of multivariate or univariate explanatory variables; no prediction method has been designed for multiple time-series samples of multivariate explanatory variables. Moreover, to our knowledge, such previous studies have not explored financial time series that incorporate macroeconomic time series, such as Gross Domestic Product (GDP) and stock market indexes. Nor has any neural network structure been designed to focus on a specific industry. In this paper, we aim at effective forecasting of corporate financial time series from multiple time-series samples of multivariate explanatory variables. We propose a new industry-specific model that appropriately captures corporate financial time series, incorporating industry trends and macroeconomic time series as side information. We demonstrate the performance of our model through experiments with Japanese corporate financial time series on the task of predicting the return on assets (ROA) for each company.

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 INTRODUCTION
In recent years, a huge amount of information has been accumulating day by day with developing information technology. One such kind of information is corporate financial time series data in the economic and financial fields, and many economic experts are interested in gaining new insights from such data. Corporate financial time series are particularly complex, since they are often affected by various factors, such as the business conditions of each company, trends in its industry, and business sentiment in society. Traditional time series analysis and modern deep learning have both addressed the problem of time-series prediction (or forecasting); however, there is plenty of room for new research on complex multivariate time series, such as corporate financial time series. Among the widely recognized time series analysis methods, the Autoregressive Integrated Moving Average (ARIMA) model and kernel methods can each capture one aspect of spatio-temporal patterns, but it is not easy for them to forecast multivariate time series accurately [1, 10]. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, which take time dependencies into account, have lately been well accepted even in financial time series analysis, such as stock price forecasting [8]. Even so, accurate long-term prediction for multivariate time series remains difficult, since some of the multivariate explanatory variables may not contribute to the prediction and may even harm prediction accuracy: explanatory variables with relatively small contributions can act as noise. In another line of research, a time-series prediction model was proposed that uses an attention-based RNN to learn attention weights over the raw time series, further enhancing the ability to represent spatio-temporal features [2]. Qin et al. [12] and Liang et al. [9] combined attention mechanisms with encoder-decoder models to achieve better performance in predicting one or several steps ahead. Liu et al. [11] developed the Dual-Stage Two-Phase (DSTP) attention-based RNN model, which captures correlations among multivariate explanatory variables and embeds past observations of the target time series via multiple levels of attention. However, none of those previous studies addressed prediction from multiple time-series samples with multivariate explanatory variables. Moreover, to our knowledge, no previous study has explored deep learning models for financial time series that incorporate macroeconomic time series, such as GDP and stock market indexes. Nor has any deep learning model structure been designed to focus on a specific industry, even though the industry trend can be influential.

This paper aims to establish a useful method for forecasting corporate financial time series by appropriately learning from multiple time-series samples with multivariate explanatory variables. We propose a new industry-specific model that captures business and industry trends, as well as macroeconomic time series, as an extension of attention-based RNNs. Through experiments with Japanese corporate financial time series, we demonstrate that our proposed model, focusing on the wholesale industry, works effectively in the task of predicting the return on assets (ROA) for each company.

2 RELATED WORK
This paper attempts to predict corporate financial indicators one step ahead by deep learning. This section covers four topics. The first is RNNs, which have been among the most popular deep learning methods for predicting time series data. The second is LSTMs, which extend RNNs to capture long- and short-term dependencies. The third is attention mechanisms, which have recently attracted much interest due to their promising prediction performance. These topics provide the basic techniques for deep-learning-based time series analysis. As the last topic, we briefly review state-of-the-art related work on deep-learning-based financial time series prediction.

2.1 Recurrent Neural Networks
Sequential data refers to any kind of data where the order of the samples is important. In particular, sequential data is called time series data when the order is based on time. It is known that prediction performance can be improved by considering the dependencies between current and past samples. One of the most popular methods for this is the RNN, an extension of feed-forward neural networks for handling sequential data [3].
Suppose that the RNN receives an input x_t at each time t and returns one output y_t at that time. At time t = n, the output y_n is produced from the input sequence x_1, x_2, x_3, ..., x_n. This is possible because the RNN is a neural network with a directed closed loop, called a 'return path', which makes it possible to store temporal information and change behavior accordingly. Figure 1 shows the structure of the RNN and the same structure unrolled along the time dimension.

[Figure 1: Structure of RNN.]

We now describe the computation in RNNs. Suppose x_t is the input to the network; h_t is the output of the middle layer; y_t is the output of the output layer; W^(in) is the input weight matrix representing the connections from the input layer to the middle layer; and W^(out) is the output weight matrix representing the connections from the middle layer to the output layer. Through the return path, the RNN feeds the output of the middle layer back into its own input: the middle layer at time t−1 is connected to the middle layer at time t, and a weight w is assigned to each recurrent connection from an arbitrary unit of the middle layer at time t−1 to an arbitrary unit at time t. We write W for the recurrent weight matrix consisting of these connection weights. Figure 2 shows one unit of the middle layer in an RNN.

[Figure 2: A unit of the middle layer in an RNN.]

The hidden variables of the middle layer at time t, h_t, are obtained from x_t, W^(in), h_{t−1}, W, an activation function f, and a bias b, as follows:

    h_t = f(W^(in) x_t + W h_{t−1} + b)    (1)

The output y_t is then obtained from h_t through W^(out):

    y_t = f(W^(out) h_t)    (2)

In this paper, we assume that the activation function f is the hyperbolic tangent (tanh), and that the model is trained with the squared error loss on the outputs of Eq. (2).
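To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of a forward pass through this recurrence. The dimensions, the random initialization, and the function name are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def rnn_forward(xs, W_in, W, W_out, b):
    """Run Eqs. (1)-(2) over an input sequence xs of shape (T, input_dim).

    Returns the outputs y_1..y_T. The activation f is taken to be tanh
    for both layers in this sketch, as assumed in the paper for Eq. (1).
    """
    hidden_dim = W.shape[0]
    h = np.zeros(hidden_dim)                    # h_0: initial hidden state
    ys = []
    for x_t in xs:
        h = np.tanh(W_in @ x_t + W @ h + b)     # Eq. (1)
        ys.append(np.tanh(W_out @ h))           # Eq. (2)
    return np.array(ys)

# Illustrative dimensions: 4 input features, 8 hidden units, 1 output.
rng = np.random.default_rng(0)
W_in, W = rng.normal(size=(8, 4)), rng.normal(size=(8, 8))
W_out, b = rng.normal(size=(1, 8)), np.zeros(8)
print(rnn_forward(rng.normal(size=(12, 4)), W_in, W, W_out, b).shape)  # (12, 1)
```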
2.2 Long Short-Term Memory
RNNs can capture the context of sequential data. In doing so, it is important to understand how long a stretch of the past sequence should be captured in the model, in other words, how far back past inputs should be reflected when predicting the output. However, in RNNs the gradient usually vanishes or explodes after a certain number of iterations during learning [4, 5]. This limitation is caused by the so-called vanishing gradient problem: when calculating the weights, the value of the gradient shrinks or grows explosively as it is propagated backward through the network. To address this problem, LSTMs [6] were proposed to realize both long- and short-term memory. Compared to the RNN, each unit in the middle layer of the LSTM has a memory cell and three gates (an input gate, an output gate, and a forget gate), while the rest of the structure is basically the same as that of the RNN. Figure 3 shows one unit of the middle layer in the LSTM.

[Figure 3: A unit of the middle layer in an LSTM.]

Let W_i, W_o, W_f, and W_s be input weight matrices, where the subscript indicates the input gate i, the output gate o, the forget gate f, or the memory cell s. Also, let U_i, U_o, U_f, and U_s be recurrent weight matrices, and b_i, b_o, b_f, and b_s be biases. Let σ be the sigmoid function and tanh the hyperbolic tangent. At time t, let i_t be the output of the input gate; ŝ_t a new candidate state of the memory cell; o_t the output of the output gate; f_t the output of the forget gate; s_t the state of the memory cell; and h_t the output of the memory cell. These variables are obtained as follows:

    i_t = σ(W_i x_t + U_i h_{t−1} + b_i)    (3)
    ŝ_t = tanh(W_s x_t + U_s h_{t−1} + b_s)    (4)
    f_t = σ(W_f x_t + U_f h_{t−1} + b_f)    (5)
    s_t = i_t ⊙ ŝ_t + f_t ⊙ s_{t−1}    (6)
    o_t = σ(W_o x_t + U_o h_{t−1} + b_o)    (7)
    h_t = o_t ⊙ tanh(s_t)    (8)

where ⊙ denotes the Hadamard (element-wise) product. An LSTM unit works in the following four steps:

• Step 1: update the output of the forget gate f_t. First, the model determines what to forget from the cell state, as in Eq. (5). f_t is computed from the output of the previous step h_{t−1} and the input x_t; σ gives output values between 0 and 1. When the value is 1, the current cell state is kept completely; when it is 0, it is forgotten completely.
• Step 2: update the output of the input gate i_t and the new candidate cell state ŝ_t. Second, the model determines what information to add to the cell state, as in Eqs. (3) and (4). Again the computation uses h_{t−1} and x_t. The input gate i_t applies σ to determine which parts of the cell state will be updated, and tanh is used to obtain a new candidate value ŝ_t. In the next step these two are combined to update the cell state s_{t−1}.
• Step 3: update the state of the memory cell s_t. Third, the model updates the cell state, as in Eq. (6). The old cell state s_{t−1} is multiplied by f_t to forget unnecessary information, and then the product of i_t and ŝ_t is added to the cell memory.
• Step 4: update the output of the memory cell h_t and the output of the output gate o_t. Finally, the model determines the output, as in Eqs. (7) and (8). h_t is based on the cell state, but in a filtered version. First, σ is applied to the previous output h_{t−1} and the input x_t to obtain the output gate o_t, whose values in [0, 1] indicate which parts of the cell state will be output. The cell state s_t is transformed by tanh into the range [−1, 1], and this transformed state is multiplied by o_t, yielding h_t. This output is forwarded to the next step in the network.

These structures overcome the limitation of RNNs, in which only limited-term memory can be captured, and achieve more accurate estimation by capturing longer-term memory.
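As a companion to Eqs. (3)-(8), here is a small NumPy sketch of a single LSTM step under the same notation; the parameter container and the illustrative sizes are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, P):
    """One LSTM step, Eqs. (3)-(8). P holds the weight matrices
    W_*, U_* and biases b_* for the gates i, o, f and the memory cell s."""
    i_t = sigmoid(P['W_i'] @ x_t + P['U_i'] @ h_prev + P['b_i'])     # Eq. (3), input gate
    s_hat = np.tanh(P['W_s'] @ x_t + P['U_s'] @ h_prev + P['b_s'])   # Eq. (4), candidate cell state
    f_t = sigmoid(P['W_f'] @ x_t + P['U_f'] @ h_prev + P['b_f'])     # Eq. (5), forget gate
    s_t = i_t * s_hat + f_t * s_prev                                  # Eq. (6), Hadamard products
    o_t = sigmoid(P['W_o'] @ x_t + P['U_o'] @ h_prev + P['b_o'])     # Eq. (7), output gate
    h_t = o_t * np.tanh(s_t)                                          # Eq. (8), cell output
    return h_t, s_t

# Illustrative dimensions: 4 inputs, 8 hidden units.
rng = np.random.default_rng(0)
P = {f'W_{g}': rng.normal(size=(8, 4)) for g in 'iofs'}
P.update({f'U_{g}': rng.normal(size=(8, 8)) for g in 'iofs'})
P.update({f'b_{g}': np.zeros(8) for g in 'iofs'})
h, s = lstm_step(rng.normal(size=4), np.zeros(8), np.zeros(8), P)
```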
2.3 Attention mechanisms
Attention mechanisms have been used successfully with LSTMs [11, 12]. In time series analysis with LSTMs, attention mechanisms can simultaneously capture the spatial relationships among multiple different time series and the temporal structure of those series. In the rest of this subsection, we briefly review the attention mechanism developed by Liu et al. [11].

2.3.1 Spatial-attention LSTM. The purpose of the spatial attention mechanism is to obtain the spatial correlations among multivariate input time series. Given the time series (of length T) of the k-th explanatory attribute at time t, x^k_t = (x^k_{t−T+1}, ..., x^k_t) ∈ R^T, the following attention mechanism can be used:

    a^k_t = v_a^T tanh(W_a [h_{t−1}; s_{t−1}] + U_a x^k_t + b_a)    (9)
    α^k_t = exp(a^k_t) / Σ_{j=1}^{n} exp(a^j_t)    (10)

where [∗; ∗] is a concatenation operation; v_a, b_a ∈ R^T, W_a ∈ R^{T×2m}, and U_a ∈ R^{T×T} are parameters to learn; h_{t−1} ∈ R^m and s_{t−1} ∈ R^m are the hidden state and cell state vectors at time t−1, respectively; and m is the number of hidden states of this attention module. The spatial attention weights at time t, (α^1_t, ..., α^n_t), are determined by the hidden and cell states at time t−1 and by the inputs of the explanatory attributes at time t. They represent the effect of each explanatory attribute on the forecast of the target. Using the attention weight associated with each explanatory attribute, the multivariate input at time t, x_t = (x^1_t, ..., x^n_t), is weighted as follows:

    x̃_t = (α^1_t x^1_t, α^2_t x^2_t, ..., α^n_t x^n_t)^T    (11)

Let f_spatial denote the LSTM with the spatial attention mechanism described above. Then we obtain:

    (h_t, s_t) = f_spatial(h_{t−1}, s_{t−1}, x̃_t)    (12)
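The spatial attention of Eqs. (9)-(11) can be sketched as follows in NumPy, treating each attribute's length-T window as a row of a matrix; names and sizes are illustrative assumptions only.

```python
import numpy as np

def spatial_attention(X_window, h_prev, s_prev, v_a, W_a, U_a, b_a):
    """Spatial attention over n explanatory attributes, Eqs. (9)-(11).

    X_window: (n, T) array; row k is x^k_t, the length-T series of attribute k.
    Returns the attention weights alpha (n,) and the reweighted input
    x_tilde for the current time step.
    """
    hs = np.concatenate([h_prev, s_prev])              # [h_{t-1}; s_{t-1}]
    scores = np.array([
        v_a @ np.tanh(W_a @ hs + U_a @ x_k + b_a)      # Eq. (9), one score per attribute
        for x_k in X_window
    ])
    alpha = np.exp(scores) / np.exp(scores).sum()      # Eq. (10), softmax over attributes
    x_t = X_window[:, -1]                              # current observations x_t^1..x_t^n
    return alpha, alpha * x_t                          # Eq. (11)

# Illustrative sizes: n = 5 attributes, window T = 12, m = 8 hidden states.
rng = np.random.default_rng(0)
T, m, n = 12, 8, 5
alpha, x_tilde = spatial_attention(
    rng.normal(size=(n, T)), np.zeros(m), np.zeros(m),
    rng.normal(size=T), rng.normal(size=(T, 2 * m)),
    rng.normal(size=(T, T)), np.zeros(T))
```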
2.3.2 Temporal-attention LSTM. The purpose of the temporal attention mechanism is to maintain the temporal relationships of the spatial attention. The spatio-temporal relationships within a fixed-size window are extracted using the spatial relationships among the multivariate time series in a time window of length T, as described above. Because it is not sufficient to capture the temporal relationships within a fixed-size window alone, an attention mechanism that selects hidden states is promising: the hidden state most relevant to the target (or objective) variable is selected. For each i-th hidden state, the attention mechanism gives temporal attention weights (β^1_t, ..., β^T_t), as follows:

    b^i_t = v_b^T tanh(W_b [d_{t−1}; s'_{t−1}] + U_b h_i + b_b)    (13)
    β^i_t = exp(b^i_t) / Σ_{j=1}^{T} exp(b^j_t)    (14)

where h_i is the i-th hidden state vector obtained in the spatial attention module described above; d_{t−1} ∈ R^p and s'_{t−1} ∈ R^p are the hidden state and cell state vectors at time t−1, respectively; p is the number of hidden states of this attention module; and v_b, b_b ∈ R^p, W_b ∈ R^{p×2p}, and U_b ∈ R^{p×m} are parameters to learn. Next, the context vector c_t is defined as follows:

    c_t = Σ_{j=1}^{T} β^j_t h_j    (15)

The context vector c_t summarizes the information of all the hidden states and represents the temporal relationships within a time window. This context vector is then aligned with the target variable y_t, as follows:

    ỹ_t = w̃^T [y_t; c_t] + b̃    (16)

where w̃ ∈ R^{m+1} and b̃ ∈ R are parameters that map the concatenation to the target variable. Aligning the target time series with the context vector makes it easier to maintain the temporal relationships and to use the result to update the hidden state and cell state. Let f_temporal denote the LSTM with the temporal attention mechanism described above. Then we obtain:

    (d_t, s'_t) = f_temporal(d_{t−1}, s'_{t−1}, ỹ_{t−1})    (17)

Given the T-length multivariate explanatory time series X_T = (x_1, ..., x_t, ..., x_T), where x_t = (x^1_t, ..., x^n_t), and the target time series y_T = (y_1, ..., y_t, ..., y_T), the context vector c_T and the hidden state vector d_T are concatenated to make the final prediction of the target variable one step ahead, y_{T+1}, as follows:

    ŷ_{T+1} = F(X_T, y_T)    (18)
            = v_y^T (W_y [d_T; c_T] + b_y) + b'_y    (19)

where F denotes the predictor; W_y ∈ R^{p×(p+m)} and b_y ∈ R^p map the concatenation [d_T; c_T] ∈ R^{p+m} to a p-dimensional latent space; and a linear function with weights v_y ∈ R^p and bias b'_y ∈ R produces the final prediction ŷ_{T+1}.
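Analogously, here is a minimal NumPy sketch of the temporal attention weighting and the context vector of Eqs. (13)-(15), again with assumed shapes and names.

```python
import numpy as np

def temporal_attention(H, d_prev, sp_prev, v_b, W_b, U_b, b_b):
    """Temporal attention over the T hidden states from the spatial module,
    Eqs. (13)-(15). H: (T, m) array whose i-th row is hidden state h_i.
    Returns the weights beta (T,) and the context vector c_t (m,)."""
    ds = np.concatenate([d_prev, sp_prev])             # [d_{t-1}; s'_{t-1}]
    scores = np.array([
        v_b @ np.tanh(W_b @ ds + U_b @ h_i + b_b)      # Eq. (13)
        for h_i in H
    ])
    beta = np.exp(scores) / np.exp(scores).sum()       # Eq. (14), softmax over time steps
    c_t = beta @ H                                     # Eq. (15): sum_j beta^j_t h_j
    return beta, c_t

# Illustrative sizes: window T = 12, m = 8, p = 8.
rng = np.random.default_rng(0)
T, m, p = 12, 8, 8
beta, c = temporal_attention(
    rng.normal(size=(T, m)), np.zeros(p), np.zeros(p),
    rng.normal(size=p), rng.normal(size=(p, 2 * p)),
    rng.normal(size=(p, m)), np.zeros(p))
```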
2.4 State-of-the-art financial time series prediction with LSTM
Quite recently, Liu et al. [11] focused on spatial correlations and incorporated the target time series to develop the Dual-Stage Two-Phase (DSTP) attention-based LSTM model. Figure 4 sketches the structure of their model. Let X_t be the multivariate explanatory time series and Y_t the target time series. At the second phase of the spatial-attention LSTM, x^k in Eq. (9) is replaced with [x̃^k; y^k], where the (observed) target time series is concatenated with the corresponding explanatory time series.

[Figure 4: Structure of the DSTP-based model.]

They enhanced the attention mechanisms to incorporate both spatial correlations and temporal relationships. However, we have three concerns when employing the DSTP-based model for our goal. First, the dataset they used was the NASDAQ 100 stock data [https://cseweb.ucsd.edu/~yaq007/NASDAQ100_stock_data.html], which involves multiple time-series samples with a univariate explanatory variable, not with multivariate explanatory variables. Second, their model did not incorporate macroeconomic time series. Third, their model did not consider industry-wide trends.

3 DATA
In this study, we use the "Surveys for the Financial Statements Statistics of Corporations by Industry" [https://www.mof.go.jp/english/pri/reference/ssc/outline.htm] collected by the Ministry of Finance Japan. The surveys are based on sampling in which the target commercial corporations are general partnership companies, limited partnership companies, limited liability companies, and stock companies, all of whose head offices are located in Japan. We excluded companies in the 'finance and insurance' industry from our study. The collection period is from the first quarter of 2003 to the fourth quarter of 2016. The surveys consist of an annual survey and a quarterly survey; the total number of companies is 57,775 in the quarterly survey and 60,516 in the annual survey. The surveyed items include the financial indexes shown in Table 1.

Table 1: Financial statements. '†' and '‡' indicate that the designated item is recorded as of the beginning and as of the end of every term (fiscal year or quarter), respectively.

Liabilities
  Quarterly surveys: Notes, accounts payable, and trade†,‡
  Annual surveys: Notes†,‡; Accounts payable and trade†,‡
Fixed assets
  Quarterly surveys: Land†,‡; Construction in progress†,‡; Other tangible assets†,‡; Intangible assets†,‡; Total liabilities and net assets†,‡
  Annual surveys: Land†,‡; Construction in progress†,‡; Others†,‡; Excluded software†,‡; Software†,‡; Total assets†,‡
Personnel
  Quarterly surveys: Number of employees‡
  Annual surveys: Number of employees‡
Profit and loss
  Quarterly surveys: Depreciation and amortization‡; Sales‡; Cost of sales‡; Operating profit‡; Ordinary profit‡
  Annual surveys: Depreciation and amortization‡; Extraordinary depreciation and amortization‡; Sales‡; Cost of sales‡; Operating profit‡; Ordinary profit‡

We use the financial statements in the quarterly survey and calculate various financial ratios as explanatory and target variables for our analysis. The data need preprocessing before the time-series analysis. First, the survey dataset contains both long-lived and short-lived companies; it is not easy to include short-lived companies in a time-series analysis, so we excluded them. Second, the survey dataset contains a number of missing values, which must be handled before calculating the financial ratios. In summary, we employed the following three preprocessing steps, in this order: (1) extraction of survey items for long-term companies (exclusion processing), (2) imputation of missing values in survey items (imputation processing), and (3) calculation of financial ratios from the survey items (calculation processing). We describe the details in the following subsections.

3.1 Exclusion processing
We first perform exclusion processing, before imputation processing, because some companies' data would diverge greatly from the true data even if imputation were performed at this point. We excluded companies in the following cases:
• Case 1: companies that do not have all of the 56 time steps.
• Case 2: companies whose financial statements have no data over the entire time span.
After the exclusion, the number of companies in the dataset for the experiments became 2,296.

3.2 Imputation processing
Second, before calculation processing, we perform imputation processing, because some companies' data have missing values, without which the financial ratios cannot be calculated. The details of the imputation processing are given in the appendix.

3.3 Calculation processing
In this study, a number of financial ratios are used as explanatory variables. Each financial ratio is based on the formula of the corresponding financial ratio in the corporate enterprise statistics, as defined in Table 2.

Table 2: Financial ratios. '∗' indicates that the designated item is obtained by averaging its values as of the beginning and the end of each quarter. '∗∗' indicates that the designated item is obtained as the amount of increase (or decrease) from the beginning to the end of each quarter.

X0  Operating return on assets = (Operating profit) / (Total liabilities and net assets∗)
X1  Ordinary return on assets = (Ordinary profit) / (Total liabilities and net assets∗)
X2  Operating profit ratio = (Operating profit) / Sales
X3  Ordinary profit ratio = (Ordinary profit) / Sales
X4  Total asset turnover ratio = Sales / (Total liabilities and net assets∗)
X5  Tangible fixed assets turnover ratio = Sales / (Land∗ + (Other tangible assets∗))
X6  Accounts payable turnover ratio = Sales / (Notes, accounts payable, and trade∗)
X7  Depreciation and amortization ratio = (Depreciation and amortization) / ((Other tangible assets) + (Intangible assets) + (Depreciation and amortization))
X8  Capital equipment = (Land∗ + (Other tangible assets∗)) / (Number of employees)
X9  Cash flow ratio = ((Ordinary profit) + (Depreciation and amortization)) / (Total liabilities and net assets∗)
X10 Capital investment ratio = ((Construction in progress∗∗) + (Other tangible assets∗∗) + (Intangible assets∗∗) + (Depreciation and amortization)) / (Total liabilities and net assets∗)
X11 Gross profit ratio = (Sales − (Cost of sales)) / Sales
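To illustrate the '∗' convention in Table 2 (averaging the beginning- and end-of-quarter values), here is a hypothetical helper for the ordinary return on assets X1; the argument names and figures are invented for the example.

```python
def ordinary_roa(ordinary_profit, total_assets_begin, total_assets_end):
    """X1 in Table 2: ordinary profit over the average of total liabilities
    and net assets ('*' = mean of beginning- and end-of-quarter values)."""
    avg_assets = (total_assets_begin + total_assets_end) / 2.0
    return ordinary_profit / avg_assets

# One quarter of a fictitious company (units are arbitrary):
print(ordinary_roa(ordinary_profit=120.0,
                   total_assets_begin=9500.0,
                   total_assets_end=10500.0))  # 0.012
```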
3.4 External data
In addition to the financial ratios, we use two macroeconomic time series as external data: the Nikkei Average closing price (N) [https://indexes.nikkei.co.jp/nkave//index/profile?idx=nk225] and the Japanese GDP (G) [https://www.esri.cao.go.jp/jp/sna/data/data_list/sokuhou/files/2019/toukei_2019.html]. The Nikkei Average closing price (N) is extracted from January 2003 to December 2016 on a monthly basis. The Japanese GDP (G) is extracted from the first quarter of 2003 to the fourth quarter of 2016 on a quarterly basis.

4 PROPOSED MODEL
In this paper, we propose a model for predicting the corporate financial time series of a target industry one step ahead. Figure 5 shows the model's flow from input time series to output time series. The 1st phase aims to extract the spatial correlations among the multivariate time series. The 2nd and 3rd phases aim to predict the target variable at time T+1, given the past explanatory and target time series of window size T. More specifically, the 2nd phase extracts the spatial correlations between the target time series and the multivariate explanatory time series, while the 3rd phase extracts the temporal relationships.

[Figure 5: Structure of the proposed model.]

1st phase. This phase captures the spatial correlations in all of the multivariate time series (including both explanatory and target variables observed as time series) over their entire length. By applying the attention mechanisms to the entire time series before splitting the time-series data, the correlations between the time-series samples can be captured well. For multiple time-series samples with multivariate variables, where each time-series sample corresponds to a company, it is not easy to capture the spatial correlations well after the data splitting, since splitting does not distinguish which time-series slice belongs to which company.

First, the spatial-attention LSTM, denoted F_spatial, is applied to all company time series (A) to obtain the spatial correlations among the multivariate time series for every company, as in Eqs. (9), (10), and (11):

    Á = F_spatial(A)    (20)

We then use a linear function to aggregate all samples; this linear function summarizes all companies' features into an aggregated feature space. Let W be the weights and b the biases of the linear function:

    Ã = W Á + b    (21)

Eqs. (20) and (21) allow the model to incorporate all financial statements. Here, Á with shape (J, L, K), a J × L × K third-order tensor, is aggregated into Ã with shape (1, L, K), where J, L, and K indicate the number of time-series samples (companies), the length of the training data, and the number of dimensions of the multivariate variables (including both explanatory and target variables), respectively. Since the Nikkei Average closing price (N) is on a monthly basis and its length is Ĺ = L × 3, an attention mechanism is applied to aggregate it to a quarterly basis; N with shape (1, Ĺ, 1) is aggregated into Ñ with shape (1, L, 1). More specifically, the following formulas are applied, where i ∈ {1, 2, 3} indicates the first to third month of quarter t:

    γ^i_t = exp(N^i_t) / Σ_{j=1}^{3} exp(N^j_t)    (22)
    Ñ_t = (γ^1_t N^1_t, γ^2_t N^2_t, γ^3_t N^3_t)^T    (23)

Then the quarterly representation of the Nikkei Average closing price, Ñ, is concatenated with the aggregated company data Ã, the Japanese GDP G, and the multivariate explanatory time series in the target industry, I. Here, I denotes all the financial-ratio time series other than the target time series in that industry:

    Z = [Ã; Ñ; G; I]    (24)

The shape of G is (1, L, 1) and the shape of I is (B, L, K−1), where B is the mini-batch size for I when learning the model. Therefore, the shape of the concatenated samples Z is (B, L, K × 2 + 1). By applying the spatial-attention LSTM to Z, we obtain the spatial correlations between the time series:

    Ẑ = F_spatial(Z)    (25)

In this paper, the instances of F_spatial in Eqs. (20) and (25) are learned independently from A and Z, respectively. As the final stage of the 1st phase, Ẑ is sliced by shifting windows of length T one time step at a time. Therefore, the number of time-series samples becomes B × (L − T), and the shape is (B × (L − T), T, K × 2 + 1).
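Eqs. (22) and (23) amount to a softmax weighting of the three monthly values within each quarter. The sketch below assumes that the three weighted values are summed to produce one value per quarter, which matches the stated (1, L, 1) shape of Ñ; the paper does not spell this reduction out, so treat it as our reading.

```python
import numpy as np

def quarterly_from_monthly(n_monthly):
    """Attention-style aggregation of a monthly series to quarterly,
    after Eqs. (22)-(23). n_monthly has length L * 3."""
    n_monthly = np.asarray(n_monthly, dtype=float).reshape(-1, 3)  # one row per quarter
    gamma = np.exp(n_monthly) / np.exp(n_monthly).sum(axis=1, keepdims=True)  # Eq. (22)
    # Eq. (23) lists the three weighted values gamma_i * N_i; summing them
    # gives one value per quarter. A real Nikkei series would be normalized
    # first to keep exp() numerically stable (our assumption).
    return (gamma * n_monthly).sum(axis=1)

print(quarterly_from_monthly([1.0, 2.0, 3.0, 2.0, 2.0, 2.0]))  # two quarterly values
```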
2nd phase. First, Ẑ of length T and the target time series Y_T of the same length are concatenated at each time t ∈ {1, ..., T}. Here, Y_T is obtained by slicing in the same manner as Z above:

    Ż = [Ẑ; Y_T]    (26)

After the concatenation, the shape of the time-series samples Ż is (B × (L − T), T, K × 2 + 2). By applying the spatial-attention LSTM to this, we obtain the spatial correlations between the target time series and the other time series:

    Ż̂ = F_spatial(Ż)    (27)

3rd phase. At this phase, we apply the temporal-attention LSTM (denoted F_temporal) to capture the temporal relationships in the spatial attentions, as in Eq. (18). In other words, it captures the spatio-temporal relationships of multiple time series starting at different times:

    ŷ_{T+1} = F_temporal(Ż̂, Y_T)    (28)

The generated ŷ_{T+1} is the final prediction. All the models in this paper use a back-propagation algorithm for training. During the training process, the mean squared error (MSE) between the predicted target ŷ_{T+1} and the ground truth y_{T+1} is minimized using the Adam optimizer [7].
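The simple LSTM baseline used in Section 5 can be sketched as below, in PyTorch (the paper does not name a framework), with the hyperparameters from Section 5.1: two LSTM layers, window T = 12, K = 12 variables, mini-batch B = 64, learning rate 0.001, 500 epochs, and MSE loss with Adam. The class name and the synthetic tensors are placeholders; this is not the full proposed model.

```python
import torch
import torch.nn as nn

class SimpleForecaster(nn.Module):
    """A plain LSTM baseline: maps a (batch, T, K) window to a one-step-ahead scalar."""
    def __init__(self, n_features, n_units=16):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_units, num_layers=2, batch_first=True)
        self.head = nn.Linear(n_units, 1)

    def forward(self, x):
        out, _ = self.lstm(x)           # out: (batch, T, n_units)
        return self.head(out[:, -1])    # use the last hidden state

model = SimpleForecaster(n_features=12)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 12, 12)   # mini-batch B = 64, window T = 12, K = 12 variables
y = torch.randn(64, 1)        # synthetic targets for illustration
for epoch in range(500):      # 500 epochs, as in Section 5.1
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```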
5 EXPERIMENTS
5.1 Settings
For the experiments, we carry out a validation step and a test step, as described below. In both steps, we slide the data one step forward at a time, resulting in 15 datasets.
• Validation step: We set the length of the training data L to 40 steps, as shown in Figure 6. Training is carried out to search for the best value of the number of LSTM units, with search range U ∈ {16, 32, 64, 128}.
• Test step: We set the length of the training data L to 41 steps, as shown in Figure 7. For training, we use the best hyperparameter selected in the validation step.

[Figure 6: Data splitting for validation step.]
[Figure 7: Data splitting for test step.]

The other hyperparameters are set as follows: window length T = 12, number of epochs 500, learning rate 0.001, and mini-batch size B = 64. These were determined empirically.

In the setting above, we take 'Wholesale and trade' as the target industry, and the return on assets (ROA), denoted X1 in Table 2, as the target variable Y. As the multivariate explanatory variables, we use X = {X0, X2, X3, ..., X11}. Therefore, the number of variables in the time series is K = 12, including both explanatory and target variables.

To demonstrate the effectiveness of the proposed model, we compare its prediction performance with that of the DSTP-based model and a simple LSTM model. For the DSTP-based and simple LSTM models, the Nikkei Average closing price (N) was converted to quarterly values by averaging every three months. We set the number of LSTM layers to two.

For evaluation, we use the mean squared error (MSE) together with the sample standard deviation. We further test whether the improvement of the proposed model is statistically significant via the Wilcoxon signed-rank test at the 0.05 level, compared with the baselines.

5.2 Results and discussions
Table 3 shows the evaluation results in terms of mean squared error (MSE) and sample standard deviation (SD) for the proposed model, the DSTP-based model, and the simple LSTM model under various settings. Here, I indicates the target industry data, with I = [X; Y] ∈ A, and '# of units' indicates the number of LSTM units for each model, determined in the validation step.

Table 3: Mean squared errors (MSE) and sample standard deviations (SD) of the proposed model, the DSTP-based model, and the simple LSTM model in various settings.

Model 1 | Proposed model | (I), A, N, G | 16 units | MSE 2.08 × 10^−4 | SD 8.48 × 10^−5
Model 2 | Proposed model (w/o 'A') | I, N, G | 32 units | MSE 2.11 × 10^−4 | SD 7.43 × 10^−5
Model 3 | Proposed model (w/o 'N' and 'G') | (I), A | 128 units | MSE 2.28 × 10^−4 | SD 7.64 × 10^−5
Model 4 | DSTP-based model | A, N, G | 16 units | MSE 2.12 × 10^−4 | SD 6.47 × 10^−5
Model 5 | DSTP-based model | I, N, G | 16 units | MSE 2.26 × 10^−4 | SD 9.40 × 10^−5
Model 6 | DSTP-based model | A | 16 units | MSE 2.11 × 10^−4 | SD 7.43 × 10^−5
Model 7 | Simple LSTM model | A, N, G | 16 units | MSE 3.86 × 10^−4 | SD 6.07 × 10^−4
Model 8 | Simple LSTM model | A | 16 units | MSE 3.10 × 10^−4 | SD 3.72 × 10^−4
Model 9 | Simple LSTM model | I | 16 units | MSE 2.95 × 10^−4 | SD 3.43 × 10^−4

The following discussion clarifies the contributions from three points of view:
• Using macroeconomic time series: Comparing the proposed model with macroeconomic time series (Model 1) and without them (Model 3) in Table 3, Model 1 works more effectively than Model 3 on average, successfully capturing properties of the macroeconomic time series. We confirmed that the improvement brought by Model 1 over Model 3 was statistically significant via the Wilcoxon signed-rank test at the 0.05 level.
• Focusing on a specific industry: Comparing the proposed model with all the time series (Model 1) against the variant that does not consider the spatial correlations among all companies' time series (Model 2), Model 1 works moderately more effectively than Model 2 on average. We confirmed that this improvement was statistically significant in the same manner as above.
• Using multiple time-series samples with multivariate explanatory variables: Given all the time series, Model 1 was more effective on average (with statistical significance) than Models 4 and 7; however, this was not the case in the other situations. Our proposed models thus work modestly more effectively than, or comparably to, the DSTP-based models depending on the situation, and they work far more effectively than the simple LSTM models. More detailed evaluation is left for future work.
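A sketch of this evaluation protocol, mean MSE with sample SD over the 15 sliding datasets plus the paired Wilcoxon signed-rank test at the 0.05 level, is given below; the per-dataset scores are synthetic placeholders for illustration, not the paper's results.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-dataset MSEs for two models over the 15 sliding datasets.
rng = np.random.default_rng(0)
mse_proposed = rng.uniform(1.5e-4, 2.5e-4, size=15)
mse_baseline = mse_proposed + rng.uniform(0.0, 1.0e-4, size=15)

print('proposed: %.2e +/- %.2e' % (mse_proposed.mean(), mse_proposed.std(ddof=1)))
print('baseline: %.2e +/- %.2e' % (mse_baseline.mean(), mse_baseline.std(ddof=1)))

# Paired Wilcoxon signed-rank test at the 0.05 level, as in the paper.
stat, p = wilcoxon(mse_proposed, mse_baseline)
print('significant at 0.05:', p < 0.05)
```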
6 CONCLUSIONS
In order to establish a useful method for forecasting corporate financial time series data, we aimed in this paper to forecast one step ahead using multiple time-series samples of multivariate explanatory variables. For this objective, we proposed an industry-specific model that simultaneously captures corporate financial time series and industry trends, with a model structure that also captures macroeconomic time series appropriately. In particular, we showed the effectiveness of the proposed model with respect to three points. First, the model can capture macroeconomic time series, such as GDP, more appropriately than the DSTP- and LSTM-based models. Second, the model can be focused on a specific industry. Third, the model is designed to be learned from multiple time-series samples of multivariate explanatory variables, producing modestly more effective or comparable prediction performance compared to the DSTP-based model. We developed a model for predicting corporate ROA in the wholesale trade industry; our model may also be effective for forecasting other target variables in other industries, but this extension is left for future work.

ACKNOWLEDGMENTS
We thank Takuji Kinkyo and Shigeyuki Hamori for valuable discussions and comments. This work was supported in part by the Grant-in-Aid for Scientific Research (#15H02703) from JSPS, Japan.

REFERENCES
[1] M. Hadi Amini, Amin Kargarian, and Orkun Karabasoglu. 2016. ARIMA-based decoupled time series forecasting of electric vehicle charging demand for stochastic power system operation. Electric Power Systems Research 140 (2016), 378-390.
[2] Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. 2016. RETAIN: Interpretable predictive model in healthcare using reverse time attention mechanism. Advances in Neural Information Processing Systems (NIPS 2016) 29 (2016), 3504-3512.
[3] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[4] Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München.
[5] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen (Eds.). IEEE Press.
[6] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735-1780.
[7] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).
[8] Hao Li, Yanyan Shen, and Yanmin Zhu. 2018. Stock Price Prediction Using Attention-based Multi-Input LSTM. Proceedings of Machine Learning Research 95 (2018), 454-469.
[9] Yuxuan Liang, Songyu Ke, Junbo Zhang, Xiuwen Yi, and Yu Zheng. 2018. GeoMAN: Multi-level attention networks for geo-sensory time series prediction. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI-18). 3428-3434.
[10] Jie Liu and Enrico Zio. 2017. SVM hyperparameters tuning for recursive multi-step-ahead prediction. Neural Computing & Applications 28, 12 (2017), 3749-3763.
[11] Yeqi Liu, Chuanyang Gong, Ling Yang, and Yingyi Chen. 2019. DSTP-RNN: A dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Systems with Applications 143 (2019).
[12] Yao Qin, Dongjin Song, Haifeng Cheng, Wei Cheng, Guofei Jiang, and Garrison W. Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17). 2627-2633.

A IMPUTATION PROCESSING
We briefly describe the imputation processing used to handle time series with missing values. Consider a time series g in the quarterly survey dataset and a time series G in the annual survey dataset for each financial index. We carry out the imputation for each financial index in the two steps below. Here, let q^y_j be the value as of quarter j of year y in a specific financial index time series, and G_y the corresponding annual value, as shown in Table 4.

Table 4: Notations of years and terms. Year y consists of quarters j = 1, ..., 4, with quarterly survey values q^y_1, q^y_2, q^y_3, q^y_4 and annual survey value G_y.

• Step 1: using annual data.
For the imputation, we make use of the annual survey dataset. When a value q^y_j is missing, we attempt to look up G_y in the annual survey dataset. If G_y is also missing, this step is skipped for q^y_j. The imputation consists of the following six cases, depending on the financial index:
– Case 1: q^y_1 is equal to G_y.
    q^y_1 = G_y    (29)
– Case 2: q^y_4 is equal to G_y.
    q^y_4 = G_y    (30)
– Case 3: the sum of q^y_1, q^y_2, q^y_3, and q^y_4 is equal to G_y. The imputation in this case depends on the number of missing values among {q^y_1, q^y_2, q^y_3, q^y_4}. We assume Cases 3-1 to 3-4 below, where a, b, c, d ∈ {1, 2, 3, 4}:
    ∗ Case 3-1 (one missing value): suppose q^y_a is missing. Then
        q^y_a = G_y − q^y_b − q^y_c − q^y_d    (31)
    ∗ Case 3-2 (two missing values): suppose q^y_a and q^y_b are missing. Then
        q^y_a = q^y_b = (G_y − q^y_c − q^y_d) / 2    (32)
    ∗ Case 3-3 (three missing values): suppose q^y_a, q^y_b, and q^y_c are missing. Then
        q^y_a = q^y_b = q^y_c = (G_y − q^y_d) / 3    (33)
    ∗ Case 3-4 (four missing values): suppose q^y_a, q^y_b, q^y_c, and q^y_d are missing. Then
        q^y_a = q^y_b = q^y_c = q^y_d = G_y / 4    (34)
– Case 4: q^y_1 is equal to the sum of G_y and another financial index G'_y.
    q^y_1 = G_y + G'_y    (35)
– Case 5: q^y_4 is equal to the sum of G_y and another financial index G'_y.
    q^y_4 = G_y + G'_y    (36)
– Case 6: the sum of q^y_1, q^y_2, q^y_3, and q^y_4 is equal to the sum of G_y and another financial index G'_y. The imputation in this case again depends on the number of missing values among {q^y_1, q^y_2, q^y_3, q^y_4}. We assume Cases 6-1 to 6-4 below, where a, b, c, d ∈ {1, 2, 3, 4}:
    ∗ Case 6-1 (one missing value): suppose q^y_a is missing. Then
        q^y_a = G_y + G'_y − q^y_b − q^y_c − q^y_d    (37)
    ∗ Case 6-2 (two missing values): suppose q^y_a and q^y_b are missing. Then
        q^y_a = q^y_b = (G_y + G'_y − q^y_c − q^y_d) / 2    (38)
    ∗ Case 6-3 (three missing values): suppose q^y_a, q^y_b, and q^y_c are missing. Then
        q^y_a = q^y_b = q^y_c = (G_y + G'_y − q^y_d) / 3    (39)
    ∗ Case 6-4 (four missing values): suppose q^y_a, q^y_b, q^y_c, and q^y_d are missing. Then
        q^y_a = q^y_b = q^y_c = q^y_d = (G_y + G'_y) / 4    (40)
• Step 2: linear interpolation and extrapolation.
We perform commonly used linear interpolation and extrapolation for the remaining missing values in each financial index time series in the quarterly survey dataset.
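A compact sketch of the Step 1 imputation for the 'sum' style indexes (Cases 3 and 6), where the missing quarters share the remainder of the annual total equally, per Eqs. (31)-(34) and (37)-(40); the function and argument names are our own.

```python
import numpy as np

def impute_from_annual(quarters, G_y, G_other=0.0):
    """Step 1 imputation for the 'sum' indexes (Cases 3 and 6):
    the missing quarterly values share the remainder of the annual total
    equally, per Eqs. (31)-(34) and (37)-(40).

    quarters: list of four values with np.nan for missing entries.
    G_y: annual value; G_other: the second annual index for Case 6
    (leave 0.0 for Case 3)."""
    q = np.asarray(quarters, dtype=float)
    missing = np.isnan(q)
    if missing.any():
        remainder = G_y + G_other - np.nansum(q)
        q[missing] = remainder / missing.sum()
    return q

# Case 3-2: two of four quarters missing, annual total known.
print(impute_from_annual([10.0, np.nan, np.nan, 30.0], G_y=100.0))  # [10. 30. 30. 30.]
```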