Comparison of neural network models with GRU and LSTM layers for earthquake prediction

Wiktoria Plechta
Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice, Poland
wp311004@student.polsl.pl

IVUS 2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania

Abstract
The prediction task is an important element of research as it allows for forecasting future events. In the case of earthquakes, such a prediction can be made based on specific attributes of historical data. In this paper, we compare two neural network solutions that allow for such an analysis. More specifically, we model two recurrent networks using LSTM and GRU layers. The comparison is based on sequences of one-element and two-element data to determine which model is more accurate. This methodology also allows us to examine how much sequence data is needed to train a recurrent classifier for the earthquake prediction task.

Keywords: RNN, LSTM, GRU, earthquake prediction

1. Introduction

Data prediction is an important tool that uses algorithms and mathematical models to predict a specific event. For this purpose, historical data is used, i.e. records of previously observed occurrences of such phenomena. There is a great need for such solutions because they can be applied in a wide range of domains. An example is weather prediction, which is based on the analysis of historical data and current weather conditions. Note that weather models also use other data, such as solar irradiance [1]. Another example is the construction of models that predict the dynamics of the development of various diseases, such as computational models of COVID-19 dynamics [2] or Markov chains for epidemic simulations [3]. Medical solutions also show the potential to predict the occurrence of various diseases [4]. An interesting approach is also the use of methods such as k-nearest neighbors and decision trees to predict telecommunication customer churn [5]. Rapid technological development also allows for the automation of many tasks; one example is predicting the throughput of automated guided vehicles [6].

The construction of predictive systems makes it possible to build solutions that combine artificial intelligence methods with other techniques for presenting results and supporting decision-making. An example is a hybrid approach that analyzes many different techniques [7]. There are also ranking methods based on weighting techniques [8]. Attention should also be paid to the temporal analysis of various objects, an example of which is the change in water level [9], where the authors proposed segmentation tools and edge analysis based on image processing techniques.

The most commonly used prediction techniques include recurrent neural networks. An example is the construction of such a network for time series prediction [10]. Recurrent neural networks are quite often built from recurrent layers, of which there are currently two main types: Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU). The difference between them lies in their gate structure, with the GRU having fewer gates.
An example of the use of GRU layers is the development of aircraft assembly technology [11] or the prediction of goaf coal temperatures [12]. Long-term prediction of energy consumption can be performed with LSTM layers combined with a multi-attention mechanism [13], and LSTM layers are also used in photovoltaic energy forecasting [14].

In this paper, we compare the two most popular recurrent layer architectures: LSTM and GRU. For this purpose, a publicly available database of significant earthquakes was used. Based on this database, recurrent neural network models were built and trained on one-element and two-element input sequences.

2. Neural network architecture

Recurrent neural networks are extremely effective and capable of learning complicated problems. However, the simple RNN is known to suffer from vanishing and exploding gradients. That is why the two recurrent architectures used in this paper, LSTM and GRU, are more commonly applied. While there is a lot of research comparing these two methods, this work focuses on forecasting earthquakes, which are far more random and therefore less commonly modeled with these networks. Predicting earthquakes is a complex challenge because seismic events are inherently unpredictable and influenced by many factors. While GRU and LSTM networks excel at capturing patterns in sequential data over time, their comparison on this database may bring interesting conclusions.

2.1. Long Short-Term Memory

Long Short-Term Memory cells maintain two states: the cell state and the hidden state. The first is responsible for encoding information from all previous steps and extracting important features. The second focuses on the latest time step, giving a prediction for the next time step; however, this prediction is encoded and therefore is not identical to the output. Another essential element of the LSTM is its three gates: the forget gate, the input gate, and the output gate. The forget gate filters data in the cell state: the closer a value is to 1, the more important the information; conversely, the closer it is to 0, the more of the information is forgotten. The input gate works similarly, except that it is combined with a tanh function and is in charge of adding information to the cell state. In turn, the output gate decides what the next hidden state will be.

The operation of an LSTM network can be described using several key equations. The forget gate f_t is computed using a sigmoid activation function:

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    (1)

Where:
• x_t - input at time t,
• h_{t-1} - previous hidden state,
• W_{xf}, W_{hf}, and b_f - weights and biases for the forget gate,
• \sigma - sigmoid function.

The input gate is also computed using a sigmoid activation function, resulting in values between 0 and 1:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    (2)

Where:
• W_{xi}, W_{hi}, and b_i - weights and biases for the input gate.

The output gate o_t decides which information is passed on to the output; the hidden state h_t is obtained by filtering through this gate:

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    (3)

Where:
• W_{xo}, W_{ho}, and b_o - weights and biases for the output gate.

The cell state C_t is then updated. This involves updating the previous cell state C_{t-1} based on the result of the input gate i_t and removing unnecessary information based on the result of the forget gate f_t:

C_t = f_t \cdot C_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)    (4)

Where:
• W_{xc}, W_{hc}, and b_c - weights and biases for the cell state update,
• \tanh - hyperbolic tangent function.

The final hidden state h_t is calculated from the updated cell state C_t and the output gate result o_t, giving the output of the LSTM cell:

h_t = o_t \cdot \tanh(C_t)    (5)
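To make Eqs. (1)-(5) concrete, below is a minimal NumPy sketch of a single LSTM step. It is only an illustration of the equations above, not the implementation used in the experiments; the weight and bias names (W["xf"], b["f"], etc.) are assumed for the example.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Eqs. (1)-(5).

    x_t is the input vector at time t, h_prev and c_prev are the previous hidden
    and cell states, and W and b are dictionaries of weight matrices and biases.
    """
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])     # forget gate, Eq. (1)
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])     # input gate, Eq. (2)
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])     # output gate, Eq. (3)
    c_cand = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # candidate update
    c_t = f_t * c_prev + i_t * c_cand                            # cell state, Eq. (4)
    h_t = o_t * np.tanh(c_t)                                     # hidden state, Eq. (5)
    return h_t, c_t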
2.2. Gated Recurrent Unit

The Gated Recurrent Unit has only two gates and does not contain a cell state. All of the data passes through the hidden state, where the reset gate determines which information from the previous hidden-state time step will be forgotten, and the update gate determines how much of the current input will be used to update the hidden state. The GRU is another type of recurrent neural network designed to address the vanishing gradient problem and improve learning on sequential data. The operation of a GRU network can be described using several key equations. The reset gate is computed using a sigmoid activation function:

r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r)    (6)

Where:
• x_t - input at time t,
• h_{t-1} - previous hidden state,
• W_{xr}, W_{hr}, and b_r - weights and biases for the reset gate,
• \sigma - sigmoid function.

Similarly to the reset gate, the update gate is computed using a sigmoid activation function:

z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z)    (7)

Where:
• W_{xz}, W_{hz}, and b_z - weights and biases for the update gate.

The candidate hidden state \tilde{h}_t is calculated using the current input x_t and the reset gate r_t. The candidate hidden state represents the new information that could be added to the current hidden state:

\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh}(r_t \odot h_{t-1}) + b_h)    (8)

Where:
• W_{xh}, W_{hh}, and b_h - weights and biases for the candidate hidden state,
• \odot - element-wise multiplication,
• \tanh - hyperbolic tangent function.

Finally, the hidden state h_t is updated using the update gate z_t, which determines how much of the candidate hidden state \tilde{h}_t is incorporated into the current hidden state:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t    (9)

The differences between the simple RNN, LSTM, and GRU are shown in the simplified visualization in Fig. 1.

Figure 1: Visualization of neurons' architecture: LSTM and GRU

2.3. Proposed neural architectures for earthquake prediction

To predict earthquakes, we propose a network architecture consisting of four layers: a recurrent layer and three dense layers. The networks accept sequences of historical data of one or two elements. This allows the solution to be analyzed in terms of predictions based on a single value, as well as on two values that additionally convey the time distance between the quakes. The modeled networks are presented in Tab. 1-2, together with the numbers of parameters, and are structured as follows (a code sketch of this architecture is given after the list):

1. Input data: the shape was defined as (1, 3) or (2, 3), meaning that each input sequence has one or two time steps and three features (Latitude, Longitude, Depth).
2. LSTM/GRU layer with 10 units.
3. Dense layers:
• Following the recurrent layer, there are three dense (fully connected) layers.
• The first dense layer has 30 units and uses the ReLU activation function.
• The second dense layer has 10 units with ReLU activation and dropout regularization with a rate of 0.4.
• The third dense layer has 1 unit corresponding to the output value, i.e. the 'Magnitude'.
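For reference, the architecture described above can be written down as the following Keras sketch. This is an approximation based on the description and Tables 1-2, not the authors' code: the output layer is given 1 unit here for the 'Magnitude' value (the tables list a 2-unit output layer), and the dropout is added as a separate layer after the second dense layer.

from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import LSTM, GRU, Dense, Dropout

SEQ_LEN = 1      # 1 for one-element sequences, 2 for two-element sequences
N_FEATURES = 3   # Latitude, Longitude, Depth

def build_model(cell="lstm"):
    """Recurrent layer (LSTM or GRU, 10 units) followed by three dense layers."""
    recurrent = LSTM(10) if cell == "lstm" else GRU(10)
    model = Sequential([
        Input(shape=(SEQ_LEN, N_FEATURES)),  # (time steps, features)
        recurrent,
        Dense(30, activation="relu"),
        Dense(10, activation="relu"),
        Dropout(0.4),                        # dropout rate 0.4 on the second dense layer
        Dense(1),                            # predicted Magnitude (regression output)
    ])
    # Adam optimizer with the MSE loss described in the next section, Eq. (10)
    model.compile(optimizer="adam", loss="mse")
    return model

With a 2-unit output layer, as listed in the tables, model.summary() yields 1,222 trainable parameters for the LSTM variant and 1,112 for the GRU variant.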
Table 1
Summary of the LSTM model, regardless of the input variant

Layer (type)   Output Shape   Param #
LSTM           (None, 10)     560
dense          (None, 30)     330
dense_1        (None, 10)     310
dense_2        (None, 2)      22

Total params: 1,222
Trainable params: 1,222
Non-trainable params: 0

Table 2
Summary of the GRU model, regardless of the input variant

Layer (type)   Output Shape   Param #
GRU            (None, 10)     450
dense          (None, 30)     330
dense_1        (None, 10)     310
dense_2        (None, 2)      22

Total params: 1,112
Trainable params: 1,112
Non-trainable params: 0

Both models were trained using the Adam optimizer and the following loss function:

L(y_{true}, y_{pred}) = \frac{1}{n} \sum_{i=1}^{n} (y_{true,i} - y_{pred,i})^2    (10)

where y_{true} is the true label and y_{pred} is the predicted value returned by the network. The choice of this loss function is motivated by the regression nature of the prediction task. In regression tasks, where the goal is to predict continuous numerical values (such as Magnitude), MSE is a standard and effective loss function. MSE calculates the average squared difference between predicted and true values, providing a measure of how well the model's predictions align with the actual target values.

The network architecture is composed of 52 neurons, which is a small number for a prediction task. However, the network accepts three numerical values that carry the most important information about a given phenomenon, which should make it possible to obtain correct results. It is worth noting that the use of a layer of recurrent neurons contributes to making predictive decisions based on context, thanks to sequential analysis. Additionally, the recurrent layer makes it possible to capture long-term dependencies between values. Consequently, the modeled architecture, consisting of only 52 neurons arranged in four layers and two types of neurons (LSTM/GRU and classic neurons in the dense layers), is a model prepared for predictive analysis.

3. Experiments

The database used contained information about earthquakes with a magnitude of 4.0 or higher that have occurred in Turkey since 1900. It consists of 14 columns in total, of which four were used in this paper: the location of the earthquake's epicenter ("Longitude" and "Latitude"), the hypocenter depth ("Depth"), and the measure of the amount of energy released during the earthquake ("Magnitude"). A significant portion of the remaining columns contained NaN (missing) values or string data. Furthermore, in the other columns the data was identical for most or all entries, which limited their relevance for the analysis. The database is available online on Kaggle (https://www.kaggle.com/datasets/kmlyldrn/earthquakes).

Building the model did not involve scaling or normalizing the data. This decision aimed to examine the raw relationships and patterns in the dataset without changing the original feature distributions. Skipping normalization and scaling lets the model's performance provide insight into how it handles varying data ranges and variances, revealing its inherent strengths and weaknesses. However, it is important to note that in practical applications, or for improved generalization, preprocessing steps such as normalization and scaling are typically recommended to enhance model performance and stability across varying datasets.

Figure 2: Network model with an LSTM layer for single-element sequences
Figure 3: Network model with a GRU layer for single-element sequences

The data set was divided into two subsets: training and validation. For this purpose, a 70:30 split was used, i.e. 70% of the data was allocated to the training set and the remaining 30% to the validation set.
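For illustration, the data preparation described above could look roughly like the sketch below. The file name earthquakes.csv, the pairing of each sequence with the magnitude of its last event, and the number of epochs are assumptions made for the example; the paper does not spell out these details.

import numpy as np
import pandas as pd

SEQ_LEN = 1  # length of the input sequence: 1 or 2 events

# Hypothetical file name; the CSV comes from the Kaggle dataset linked above.
df = pd.read_csv("earthquakes.csv")[["Latitude", "Longitude", "Depth", "Magnitude"]].dropna()

features = df[["Latitude", "Longitude", "Depth"]].to_numpy(dtype="float32")
magnitude = df["Magnitude"].to_numpy(dtype="float32")

# Build sequences of SEQ_LEN consecutive events; each sequence is paired with the
# magnitude of its last event (an assumed labeling, not stated in the paper).
X = np.stack([features[i:i + SEQ_LEN] for i in range(len(features) - SEQ_LEN + 1)])
y = magnitude[SEQ_LEN - 1:]

# Raw values (no scaling or normalization) and a 70:30 train/validation split.
split = int(0.7 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

# Training with the model sketch from Section 2.3 (build_model is the helper defined there):
# model = build_model("lstm")
# history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=100)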
The first step was to check how the classifier learns on single-element data sequences. The results for both networks are shown in Fig. 2 and 3. In both cases, the MSE quickly decreases to a low value. It should be noted that there are small fluctuations in the obtained values; this is especially visible in the network with GRU neurons. Analysis of the loss curves indicates very small values and a slow decline, with a larger decrease in the loss value visible for the LSTM. Despite this, for both architectures the error values drop below 0.24, but for the network with the GRU layer, spikes increase this value on the validation set.

Figure 4: Network model with an LSTM layer for two-element sequences
Figure 5: Network model with a GRU layer for two-element sequences

In the case of sequences consisting of two vectors, the learning results are not more stable, as can be seen in Fig. 4-5. For the model with LSTM neurons, the MSE decreases for both sets; however, for the validation set, the loss value drops below 0.22 only after more than 50 training iterations. A similar situation was observed for the second network, with the GRU layer, although the decline in value for this architecture is faster. The main difference noticed is that the network with the LSTM layer shows no large jumps, unlike the one with the GRU layer. However, in the network with the GRU layer, the values decrease faster than with LSTM. Moreover, on the training set, the GRU classifier achieved a lower value after 100 iterations.

4. Conclusion

Prediction is a task based on an appropriate database of historical values. A very large number of records may contribute to more accurate results. This work used a publicly available earthquake database. A neural network model with recurrent neurons was proposed as the classifier, and LSTM and GRU neurons were analyzed. The analysis showed that both solutions can achieve very good results. The advantage of the GRU-based model is a smaller number of trainable parameters. However, it should be noted that the database used allowed high results to be achieved quickly, which may indicate a risk of overfitting. Based on the experiments performed, it can be concluded that for the network with the LSTM layer the drop in error values is more stable, while for the network with the GRU layer it contains larger jumps. However, with two-element sequences, the network with the GRU layer achieved better prediction results. In future work, we plan to use other tools and create an ensemble model with solutions such as Monte Carlo methods and Markov chains.

Acknowledgments

This work was supported by the Rector's mentoring project "Spread your wings" at the Silesian University of Technology.

References

[1] S. Pereira, P. Canhoto, R. Salgado, Development and assessment of artificial neural network models for direct normal solar irradiance forecasting using operational numerical weather prediction data, Energy and AI 15 (2024) 100314.
[2] A. Kloczkowski, J. L. Fernández-Martínez, Z. Fernández-Muñiz, Computational models for COVID-19 dynamics prediction, in: International Conference on Artificial Intelligence and Soft Computing, Springer, 2023, pp. 228–238.
[3] K. Kesik, Markov chains as a simulation technique for epidemic growth, in: Proceedings of the International Conference on Information Society and University Studies (IVUS 2019), 2019, pp. 1–4.
[4] J. Rashid, S. Batool, J. Kim, S. Juneja, An augmented artificial intelligence approach for chronic diseases prediction, Frontiers in Public Health 10 (2022) 860396.
[5] M. Zdanavičiūtė, R. Juozaitienė, T. Krilavičius, Telecommunication customer churn prediction using machine learning methods, in: Proceedings of the 27th International Conference on Information Society and University Studies (IVUS 2022), 2022.
[6] K. Prokop, D. Połap, G. Srivastava, AGV quality of service throughput prediction via neural networks, in: 2023 IEEE International Conference on Big Data (BigData), IEEE, 2023, pp. 2493–2498.
[7] P. Mahajan, S. Uddin, F. Hajati, M. A. Moni, Ensemble learning for disease prediction: A review, Healthcare 11 (2023) 1808.
[8] A. Jaszcz, The impact of entropy weighting technique on MCDM-based rankings on patients using ambiguous medical data, in: International Conference on Information and Software Technologies, Springer, 2023, pp. 329–340.
[9] K. Prokop, K. Połap, M. Włodarczyk-Sielicka, A. Jaszcz, End-to-end system for monitoring the state of rivers using a drone, Frontiers in Environmental Science (2023).
[10] J. Siłka, M. Wieczorek, M. Woźniak, Recurrent neural network model for high-speed train vibration prediction from time series, Neural Computing and Applications 34 (2022) 13305–13318.
[11] H. Zhang, L. Feng, J. Wang, N. Gao, Development of technology predicting based on EEMD-GRU: An empirical study of aircraft assembly technology, Expert Systems with Applications 246 (2024) 123208.
[12] J. Guo, C. Chen, H. Wen, G. Cai, Y. Liu, Prediction model of goaf coal temperature based on PSO-GRU deep neural network, Case Studies in Thermal Engineering 53 (2024) 103813.
[13] D. Połap, G. Srivastava, A. Jaszcz, Energy consumption prediction model for smart homes via decentralized federated learning with LSTM, IEEE Transactions on Consumer Electronics (2023).
[14] C. Xu, J. Yu, W. Chen, J. Xiong, Deep learning in photovoltaic power generation forecasting: CNN-LSTM hybrid neural network exploration and research, in: Proceedings of the 3rd International Scientific and Practical Conference "Technologies in Education in Schools and Universities" (January 23–26, 2024, Athens, Greece), International Science Group, 2024, p. 295.