Multivariate Time Series Regression on Seismic Data Using Adaptive Residual CNNs

Jurgen O.D. van den Hoogen*1,2, Stefan D. Bloemheuvel*1,2 and Martin Atzmueller3,4

1 Tilburg University, TSHD: Cognitive Science and Artificial Intelligence, Tilburg, The Netherlands
2 Jheronimus Academy of Data Science (JADS), 's-Hertogenbosch, The Netherlands
3 Semantic Information Systems Group, Osnabrück University, Osnabrück, Germany
4 German Research Center for Artificial Intelligence (DFKI), Osnabrück, Germany

* Both authors contributed equally to this work.
LWDA'22: Lernen, Wissen, Daten, Analysen. October 05–07, 2022, Hildesheim, Germany

Abstract
Developments in Machine Learning, and more specifically Deep Learning, have boosted the analysis of large-scale datasets. These methods for automated learning are particularly useful for complex data originating from sensors, i.e., multivariate time series data. In this work, we present an adaptive residual CNN (ARes-CNN) that is able to process such multivariate time series data. The model utilizes an adaptive input layer that processes every time series (e.g., channels or sensors) separately, learning their key individual characteristics. Furthermore, the model applies stacked residual learning throughout each layer. We compare ARes-CNN with traditional Machine Learning applications, as well as with a CNN and an LSTM developed for the task at hand. The models' performance is compared on two datasets retrieved from sensors in a network of seismic stations located in Italy. Across all experiments, ARes-CNN reports a performance increase of 17% (MSE) on average compared to the best performing baseline. Therefore, processing the channels independently (by employing adaptive input layers), together with residual learning, is well suited for multivariate time series regression, e.g., for the analysis of seismic sensor data.

Keywords
Time Series Regression, Deep Learning, Sensors, Convolutional Neural Networks, Residual Networks

1. Introduction

Advances in Deep Learning (DL) have revolutionized data processing due to their ability to deal with raw data, automatically learn its structure and recognize patterns [1, 2, 3], e.g., for multivariate time series analysis. Thus, many researchers have used DL techniques to analyze time series data in a variety of tasks such as anomaly detection and forecasting. A less common approach in time series analysis is Time Series Extrinsic Regression (TSER) [4, 5, 6]. Here, the predicted or target value relies on the whole time series, instead of only the last few observations.

In this paper, we focus on a TSER problem using DL methods, and more specifically Convolutional Neural Networks (CNN), to improve model performance. Our proposed model is tested on two seismic datasets, both recorded in regions of Italy [7, 8]. We attempt to predict maximum ground-shaking, expressed as intensity measurements (IMs), which can be described as a multivariate time series regression problem, similar to TSER-oriented approaches [5, 9].
The data is particularly interesting for applying neural networks due to its high-frequency sampling rate, its high dimensionality and the multiple regression outputs that need to be generated. We combine the strength of one-dimensional (1D) CNNs for learning feature representations of the raw time series with residual stacks that are used to create a deeper neural network architecture, able to learn more complex structures of the raw data. In addition, we utilize an adaptive input layer that separates the time series based on its dimensionality (e.g., channels or sensors). Thus, the model can be scaled depending on the number of time series variables available. Our contributions are summarized as follows:

1. We propose a method to perform multivariate regression on time series originating from seismic sensor data. For this, we present a new model called ARes-CNN that utilizes convolutional layers with residual stacks (both one-dimensional and two-dimensional).
2. We implement adaptive convolutional layers that are able to separate the multivariate time series data into distinct input layers, therefore learning unique characteristics for each individual time series (e.g., channels or sensors).
3. Finally, we evaluate our model thoroughly on two seismological datasets that differ significantly from one another, evidencing the generality and potential of the proposed model. We discuss our results in detail and perform a comparison against the baseline model proposed in [5], an LSTM model, and traditional Machine Learning (ML) methods with rich feature engineering.

The rest of the paper is structured as follows: in Section 2, we discuss related work on DL for time series data and summarize the necessary background on Convolutional Neural Networks and Residual Networks. Next, we introduce our proposed model in Section 3, where we also describe the baseline models, the problem setting and the data. After that, Section 4 presents and discusses the results of our experiments. Finally, Section 5 concludes with a summary and outlines interesting directions for future research.

2. Related Work and Background

This section covers relevant work on Deep Learning in general and for time series in particular, focusing on the use of Convolutional Neural Networks and Residual Networks. In addition, this section also covers the application of Deep Learning to seismic analysis.

2.1. Deep Learning

With the advances in computational power and hardware, traditional Machine Learning (ML) approaches make way for more automated learning techniques, i.e., Deep Learning (DL). A reason for this transition is that ML requires considerable effort to acquire representative features, which is typically time-intensive and can be a rather error-prone task. DL provides an alternative solution, relying on "built-in" feature construction: it is able to automatically extract features from raw data using multiple layers of nonlinear processing, which is a powerful approach for complex data such as multivariate time series. At first, the multi-layer perceptron (MLP) was introduced by [1]. Here, layers are fully connected to one another, which is computationally expensive, in particular for large models with large amounts of training data. Therefore, in recent years more advanced methods have been developed, such as the Convolutional Neural Network (CNN) [2] and the Recurrent Neural Network (RNN).
Especially RNNs seemed to be well-suited for time series data, since they memorize long-term dependencies. Nonetheless, storing time-related dependencies requires vast amounts of memory, resulting in long training times. Therefore, RNNs are less applicable to large-scale time series retrieved from sensors.

2.1.1. Convolutional Neural Networks

Generally speaking, CNNs are regularized MLPs that are designed to process two-dimensional (2D) data. They were first proposed by LeCun in 1989 [2] and are frequently used in Computer Vision and Image Recognition, processing images with multiple color channels [2, 10]. Compared to an MLP, the main advantages of a CNN are the use of weight-sharing, sub-sampling and local receptive fields. Sharing weights in particular lowers the memory requirements, boosting algorithmic efficiency [11]. A convolutional layer typically includes three stages: convolutions are performed in the first stage, an activation function is applied in the second stage to account for nonlinear relationships, and pooling follows in the third stage [11].

However, in the domain of time series, and more specifically signal data from sensors, CNNs were employed less often than classic ML techniques, due to their 2D nature. Initially, to use one-dimensional (1D) data with CNNs, one had to reshape the data into a matrix using signal processing techniques. Such 2D CNNs, combined with the conversion of the 1D data, increase computational complexity drastically, which requires dedicated hardware for training. Then, [12] developed a 1D CNN that is able to directly process raw signal data. After that, the popularity of CNNs for time series increased significantly in various fields such as biomedical analysis [12] and fault diagnosis [3, 13, 14]. These 1D models are less computationally intensive than their 2D counterparts and are suitable for high-frequency and noisy time series data [13]. Hence, 1D CNNs are increasingly applied to time series data.

To create output features, a convolutional layer convolves the input using filter kernels, followed by an activation function. Through weight-sharing, each of these filters extracts local attributes from a local region of the input. These outputs are then fed to the activation unit that produces the final output features. The convolution operation is defined as:

$y_i^{l+1}(j) = k_i^l * M^l(j) + b_i^l$  (1)

Here, $M^l(j)$ denotes the $j$-th local region in layer $l$, $b_i^l$ denotes the bias, and $k_i^l$ denotes the weights of the $i$-th filter kernel in layer $l$. In the convolution operation, $*$ signifies the dot product of the kernel and the local region, and $y_i^{l+1}(j)$ represents the input of the $j$-th neuron in feature map $i$ of layer $l+1$.
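To make Eq. (1) concrete, the following is a minimal NumPy sketch of this convolution operation, with a ReLU activation applied to the output. It illustrates our reading of the equation rather than the implementation used in this paper (which relies on Keras layers); the function and variable names are ours.

```python
import numpy as np

def conv1d(x, kernels, biases, stride=1):
    """Eq. (1): the input of each neuron in the next feature map is the
    dot product of filter kernel k_i with the j-th local region M(j) of
    the input, plus a bias b_i."""
    n_filters, width = kernels.shape
    n_out = (len(x) - width) // stride + 1
    y = np.empty((n_filters, n_out))
    for i in range(n_filters):                 # one feature map per filter
        for j in range(n_out):                 # slide over local regions M(j)
            region = x[j * stride : j * stride + width]
            y[i, j] = kernels[i] @ region + biases[i]
    return np.maximum(y, 0.0)                  # ReLU as the activation stage

# Toy usage: a single-channel signal (10 s at 100 Hz), 3 filters of width 5.
signal = np.random.randn(1000)
kernels = np.random.randn(3, 5) * 0.1
features = conv1d(signal, kernels, np.zeros(3))   # shape (3, 996)
```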
2.1.2. Residual Networks

Generally, the more layers are added to a neural network, the better its potential performance [10, 15]. However, models with a deeper structure take longer to train, meaning more computational resources are needed when increasing model size. In addition, [16] showed that when adding more layers to the network, it becomes more difficult for the layers to propagate information from earlier layers, which is often called the degradation problem. As a result, deeper models tend to have a saturation point for their performance, after which performance rapidly decreases when more layers are added. To tackle this problem, [16] developed a deep residual learning framework that eases the training of deeper networks. With this method, neural networks are able to scale up to hundreds of layers.

A residual network (ResNet), see Figure 1, incorporates a 'shortcut' or 'skip connection' between layers. In other words, the outcome of the previous neuron (the identity) is directly added to the corresponding neuron of a follow-up layer. Between these layers, intermediate layers (residuals), which can be layers of any kind (CNN, RNN, LSTM, ...), learn only the changes with respect to their input $x$. Afterwards, these residuals are added to the original identity and act as input for the follow-up layer. This provides an alternative shortcut for the gradients to pass through. As a result, one obtains a building block for residual learning, visualized in Figure 1, which can be denoted as:

$Y(x) = R(x, W_i) + x$  (2)

Here, $x$ represents the input from the original layer, i.e., the 'skip connection', and $R(x, W_i)$ represents the residual part to be learned. Combining both parts results in the output vector $Y(x)$ that is used by the follow-up layer. The benefit of including skip connections is that if any layer degrades the model's performance, it can be skipped.

Figure 1: Example of a residual block.

2.2. Deep Learning for Seismic Analysis

With the developments in data gathering and storage over the last decades, the seismic community has collected enormous amounts of sensor data [17, 18]. The availability of these vast collections of time series data has piqued the interest of seismologists in ML and DL applications throughout the last years [19]. The data, representing waveform signals, are frequently used for earthquake detection [20], predicting maximum ground-shaking [5] and magnitude estimation [21]. In addition, [22] analyzed waveforms from multiple seismic stations, i.e., multivariate time series, for earthquake location estimation. In this paper, we propose a technique that applies 1D and 2D CNNs combined with residual stacks to perform multivariate time series regression. We show how to improve the performance of an existing CNN model for seismic data proposed by [5], while significantly adapting their architecture.

3. Method

This section starts with a description of the problem setting and introduces the datasets used. Afterwards, we present our ARes-CNN model, separating between the construction of the residual stack and the overall architecture of the model. At last, we describe model training, our baseline models for comparison, and the software and computational resources that we utilized. The code for the models can be accessed on GitHub: https://github.com/JvdHoogen/ARes-CNN.

3.1. Problem Setting

We perform regression on two seismic datasets recorded in Italy [7, 8]. The datasets consist of sensor readings retrieved from seismometers or accelerometers that are installed on seismic stations across Italy. During earthquake events, these sensors continually recorded the amplitude of seismic waves along three dimensions of ground motion, i.e., north-south, east-west and up-down. These recordings help seismologists to better understand the behavior of earthquakes. Due to the fast transmission of information through telecommunication, seismologists were able to develop algorithms that predict the maximum intensity measurement (IM) of ground-shaking before the actual seismic wave arrives. This can be achieved by analyzing the data transmitted from the stations that recorded the earthquake first, e.g., stations close to the epicenter/hypocenter.
Within the seismological community this is called "Earthquake Early Warning" (EEW), which is important for the evacuation of residents affected by an earthquake. These IMs are divided into five separate metrics: peak ground velocity (PGV), peak ground acceleration (PGA), and spectral acceleration (SA) at 0.3, 1 and 3 second periods. The actual values of these metrics are used as target values for every earthquake. Hence, every earthquake recording contains three waveforms from the seismic stations surrounding the epicenter. We used data with a length of 10 seconds sampled at 100 Hz for every recording, related to the origin of the earthquake. Based on this information, we predict the five IMs for all stations in the geographical region represented in the dataset. Depending on the location of the earthquake, certain stations nearby will reveal more information related to the earthquake than stations far away. However, it is worth mentioning that for many earthquakes, most stations have not yet recorded the earthquake-related ground-shaking within this 10-second period. In Figure 2, we visualize the task of our experiments based on an example.

3.2. The Data

Concerning the two datasets, the first dataset (CW) consists of three-component waveforms of 266 earthquakes recorded at 39 seismic stations in Central-Western Italy. The earthquake epicenters and station locations lie within the coordinates [41.13°, 46.13°] (latitude) and [8.5°, 13.1°] (longitude), with recordings from 1-1-2013 until 20-11-2017. The earthquakes' magnitudes vary between 2.9 < M ≤ 5.1, with crustal depth ranging between 3.3 km and 64.7 km. The dataset is therefore characterized by earthquake recordings scattered across a large geographical region.

Figure 2: An example earthquake is shown on the left as a red star, with its P- and S-wave, green and orange respectively, cf. [4]. We predict the Y values that characterize the earthquake for each seismic station (with 3 channels of time series data) using the input X (10 seconds).

The second dataset (CI) contains more recorded earthquakes (915) on a set of 39 stations with three-component waveforms in Central Italy. The coordinates of the earthquake epicenters and station locations lie within [42°, 42.75°] (latitude) and [12.3°, 14°] (longitude), with recordings from 01-01-2016 until 29-11-2016. The crustal depth ranges between 1.6 km ≤ z ≤ 28.9 km, with magnitudes varying between 2.9 < M ≤ 6.5. The earthquakes are more concentrated than those in the CW dataset and are spread across a smaller geographical region. Thus, given the many more samples, we hypothesize that the performance on the CI dataset will be better than on the CW dataset, since the task is expected to be less complex. Both datasets are normalized using the input maximum, i.e., the highest amplitude detected across multiple stations during the occurrence of an earthquake [5].
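As a minimal sketch, this normalization could look as follows in NumPy; the (stations, timesteps, channels) array layout and the function name are our assumptions for illustration.

```python
import numpy as np

def normalize_event(waveforms):
    """Scale one earthquake recording by its input maximum, i.e., the
    highest absolute amplitude observed across all stations and channels
    of this event, following the normalization described in [5].
    Assumed array layout: (stations, timesteps, channels)."""
    peak = np.max(np.abs(waveforms))
    return waveforms / peak

# Toy usage: 39 stations, 10 s at 100 Hz, 3 channels per station.
event = np.random.randn(39, 1000, 3)
scaled = normalize_event(event)   # amplitudes now lie within [-1, 1]
```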
3.3. Residual Stack

Within the architecture of our proposed model, we implement residual stacks between the convolutional layers. For this, we developed two types of residual stacks, a 1D and a 2D version. Generally, a residual unit consists of one or multiple layers with the same dimensionality as its input data, i.e., the same number of filters, with kernel size and stride both set to 1. However, based on experimentation, we customized a residual stack (multiple units) to yield better performance.

Our residual stack includes two residual units. The first residual unit consists of two convolutional layers with ReLU and Tanh activation, processing the output of the previous layer. Next, the residual output and the identity are combined by addition, as shown in Figure 3. Then, the new representation is used for the second residual unit, which has the same settings as the first one. However, the second residual unit learns the residuals of the identity plus the residuals from the previous unit. Therefore, we call it a residual stack. After processing, the residuals of the second unit are added to the next identity, which already contains a certain amount of residual learning from the previous unit. The final output is then fed to the follow-up layer in the model. Figure 3 visualizes the architecture of our residual stack, also showing the skip connections from its input to the output.

Figure 3: Residual stack architecture: both residual units have a skip connection with identity X and ReLU/Tanh activation. The number of filters for each residual layer is the same as for its input X.

3.4. ARes-CNN Architecture

Our model is inspired by [5], which we also apply as a baseline for comparison. We exploit an adaptive input layer that processes multivariate time series in a univariate way, learning the characteristics of each time series individually (e.g., channels or sensors), which has proven to enhance model performance in [3, 14]. Hence, the number of input layers needs to be adapted to the number of variables available, which are processed in a univariate way. The model can then be scaled towards use cases containing more channels or sensors by adding the respective 1D adaptive layers (see Figure 4).

Thereby, the first part of the CNN architecture consists of three separate 1D convolutional layers with 10 filters each, processing in parallel the first 10 seconds of the three channels of the seismic waveform data, each followed by a 1D residual stack and max pooling. After that, the outputs of the separate layers are concatenated. Next, a second 1D convolutional layer with 64 filters is utilized to further process the outputs of the separate layers, again followed by a 1D residual stack and max pooling. These convolutional layers specifically function as feature extractors to learn the underlying temporal patterns of every station. Both convolutional layers use wide kernels combined with small strides and an increasing number of filters, which has proven to be effective for signal data [3, 5, 14]. After the temporal patterns are extracted, a 2D convolutional layer with 64 filters is initialized to capture the inter-station relationships [5]. Then, a 2D residual stack is utilized as well, followed by a flatten layer and dropout (0.4) [11]. Empirical testing showed that the respective residual stacks between the convolutional layers improved model performance. However, adding more residual stacks did not improve performance further, which provides some indication of a potential saturation point for residual learning within the existing architecture.
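To illustrate the residual stack, the following is a minimal sketch of the 1D variant in the Keras functional API, reflecting our reading of Figure 3; the released code on GitHub is authoritative, and the helper names here are ours. The 2D variant follows the same pattern with Conv2D layers.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_unit(x):
    """One residual unit: two Conv1D layers (ReLU, then Tanh), each with
    kernel size and stride set to 1 and as many filters as the identity
    has channels; the result is added back onto the identity x."""
    filters = x.shape[-1]
    r = layers.Conv1D(filters, kernel_size=1, strides=1, activation="relu")(x)
    r = layers.Conv1D(filters, kernel_size=1, strides=1, activation="tanh")(r)
    return layers.Add()([x, r])

def residual_stack_1d(x):
    """Two stacked residual units (cf. Figure 3): the second unit learns
    residuals on top of the identity-plus-residuals of the first."""
    return residual_unit(residual_unit(x))

# Toy usage on the output of a 10-filter adaptive input layer.
inp = tf.keras.Input(shape=(1000, 10))
out = residual_stack_1d(inp)   # same shape as the input: (None, 1000, 10)
```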
In the last stage of the model, we concatenate the flattened output with a metadata vector consisting of latitude and longitude information for every station, which has proven to be valuable in [9]. Afterwards, a fully connected (FC) layer with 128 units processes the previous outputs, followed by 5 fully connected regression layers representing the five distinct IMs. The outputs of the five regression layers represent the IMs for each of the 39 stations in the sensor network for every earthquake. Figure 4 shows a full overview of our proposed model.

Figure 4: Architecture of the ARes-CNN model with adaptive input layers for each channel, residual stacks and pooling. The model utilizes 1D convolutional layers to learn time series features and a 2D convolutional layer to learn inter-station relationships.

3.5. Model Training

We follow a standard train, validation and test split of the dataset. First, we split the full dataset into 80% training and 20% test data. Next, the training set is divided into 80% training and 20% validation data. We apply five-fold cross-validation, where we predict on the unseen test set for every fold/split. In total, we repeated the process five times with different random splits for the initial train/test set. We report the value averaged across all predictions on the test set. To rule out an early cutoff during training of the models, no early stopping was used in the experiments. We used MSE as the loss function, RMSprop for optimization, and a batch size of 20 with 100 epochs for training.

3.6. Baseline Models

Our ARes-CNN model is compared with traditional ML algorithms such as K-Nearest Neighbors (K-NN), Random Forest (RF), XGBoost and Support Vector Machine (SVM). These models do not lend themselves to processing raw multidimensional signals. Therefore, the data is preprocessed into a set of features in both the frequency and the time domain, inspired by the feature sets described in [23, 24]. Features in the frequency domain are calculated after transforming the time series to the frequency spectrum using the FFT algorithm, which computes the one-dimensional discrete Fourier transform (DFT). In the frequency domain, the signal energy $E = \sum_i (\mathrm{FFT}\,x_i)^2$ and the signal power $P = \sum_i (\mathrm{FFT}\,x_i)^2 / t$ were computed. In the time domain, the mean, median, maximum, minimum, range (difference between maximum and minimum), variance and standard deviation were computed. The features were calculated for every station and earthquake. In total, the feature set consists of 9 features.
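The following is a minimal NumPy sketch of this 9-dimensional feature vector for a single waveform. Taking the magnitude of the complex DFT before squaring, and dividing the energy by the series length for the power, reflect our reading of the formulas above; the function name is ours.

```python
import numpy as np

def feature_vector(x):
    """The 9 hand-crafted features of Section 3.6 for one waveform:
    signal energy and power from the one-dimensional DFT, plus seven
    time-domain statistics."""
    spectrum = np.abs(np.fft.fft(x))   # magnitude of the 1D DFT (via FFT)
    energy = np.sum(spectrum ** 2)     # E = sum_i (FFT x_i)^2
    power = energy / len(x)            # P = E / t
    return np.array([
        energy, power,
        x.mean(), np.median(x), x.max(), x.min(),
        x.max() - x.min(),             # range
        x.var(), x.std(),
    ])

# Toy usage: one 10-second waveform sampled at 100 Hz.
feats = feature_vector(np.random.randn(1000))   # shape (9,)
```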
We used grid search optimization with five-fold cross-validation for all ML models to assess the best performing parameter settings. The settings of the best performing model vary across the different experiments and between the two datasets. The properties of the grid search optimization are described in Table 1.

Table 1: Parameter settings used for grid search optimization per ML model.

Model   | Parameter          | Grid Search Range
K-NN    | K                  | 1-20
        | Weight Options     | Distance / Uniform
SVM     | C                  | 25, 30, 40
        | Kernel             | Linear / RBF
        | Gamma              | 0.0001, 0.001
RF      | Nr. of Estimators  | 800, 900, 1000
        | Feature Estimation | Log2 / Square Root
XGBoost | Nr. of Estimators  | 800, 900, 1000
        | Max Depth          | 5, 10, 15
        | Gamma              | 0.0-0.4 (0.1 per step)
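For illustration, such a search could be set up with scikit-learn as in the following sketch for the RF baseline, using the ranges from Table 1; X_feat (the engineered feature matrix) and y (the IM targets) are hypothetical placeholders.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Grid from Table 1 for RF; 'Feature Estimation' maps to max_features.
param_grid = {
    "n_estimators": [800, 900, 1000],
    "max_features": ["log2", "sqrt"],
}
search = GridSearchCV(
    RandomForestRegressor(),
    param_grid,
    cv=5,                               # five-fold cross-validation
    scoring="neg_mean_squared_error",
)
# search.fit(X_feat, y); search.best_params_ then holds the chosen setting.
```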
Our model is also compared with the CNN developed by [5], which functions as a starting point for the development of our model. Furthermore, we compare our model with a shallow LSTM model, replacing the convolutional layers with two LSTM layers (containing 100 units each) that directly process the time series data. Empirical testing showed no significant performance increase when adding more LSTM layers or increasing the number of units. For all DL models, we added the metadata vector containing the latitude and longitude coordinates of every station.

3.7. Software and Resources

We used Python with TensorFlow and Keras to develop the models. All other algorithms, such as the ML models, are derived from the Sklearn package, together with Numpy/Pandas. The models were trained on a dedicated server with two Intel Xeon CPUs (3.2 GHz), 256 GB RAM and an Nvidia Quadro RTX 6000 (24 GB) GPU.

4. Results

Table 2 shows the averaged results across the five IM prediction values for every model. We observe a substantial gap between the algorithmic performance on the CI and the CW dataset. This behavior was expected, since the CI dataset contains more earthquakes than the CW dataset, i.e., 915 against 266. Also, the earthquakes recorded in the CW dataset cover a larger geographical surface, with greater distances between the seismological stations. Furthermore, the earthquakes in the CW dataset have a broader crustal depth range of the hypocenters.

Table 2: Result metrics for IM prediction for every model on the CW and CI datasets.

Dataset | Model    | MSE  | RMSE | MAE
CW      | LSTM     | 0.61 | 0.76 | 0.61
        | K-NN     | 0.50 | 0.71 | 0.55
        | SVM      | 0.49 | 0.70 | 0.54
        | XGBoost  | 0.47 | 0.69 | 0.53
        | RF       | 0.46 | 0.68 | 0.52
        | CNN [5]  | 0.35 | 0.57 | 0.44
        | ARes-CNN | 0.29 | 0.53 | 0.40
CI      | LSTM     | 0.41 | 0.63 | 0.46
        | SVM      | 0.39 | 0.63 | 0.45
        | K-NN     | 0.36 | 0.60 | 0.43
        | XGBoost  | 0.31 | 0.56 | 0.40
        | RF       | 0.31 | 0.56 | 0.40
        | CNN [5]  | 0.22 | 0.46 | 0.33
        | ARes-CNN | 0.18 | 0.41 | 0.30

In Table 2, we observe that the CNNs clearly outperform the other models on the described metrics, displaying the value of CNNs in multivariate time series regression tasks. However, the LSTM model performs worse than any other model, indicating that RNNs do not lend themselves to directly processing high-frequency time series data as in our application context. Overall, our proposed ARes-CNN improves performance by a large margin, suggesting that processing the time series separately, using adaptive input layers, combined with residual layers beneficially influences model performance. The average improvement of ARes-CNN over the second best performing model (the CNN by [5]) is 17% in MSE, 9% in RMSE and 8% in MAE, while increasing the number of trainable parameters by only ≈ 4%, from 1,364,291 (CNN [5]) to 1,421,271 (ARes-CNN).

Another interesting observation from the results described in Table 2 is the high deviation in performance among the Machine Learning algorithms. We see that the ensemble methods, i.e., RF and XGBoost, perform significantly better than the SVM (which shows low performance on both datasets). It therefore seems that tree-based methods are quite capable of processing multivariate time series, provided that the data is sufficiently preprocessed into a rich feature set.

5. Conclusion

This work proposes a Deep Learning (DL) model for multivariate time series regression in the context of predicting maximum seismic ground-shaking. Our model processes multivariate time series (e.g., channels or sensors) in a univariate way, using adaptive input layers combined with residual learning. The experiments are conducted on two seismic datasets with varying characteristics. Our proposed model is compared to traditional Machine Learning (ML) applications, a shallow LSTM model, and a Convolutional Neural Network (CNN) specifically designed for the task, developed by [5], which functions as a starting point for our model.

As the results in Table 2 indicate, the CNN applications outperform the other models, emphasizing the capabilities of CNNs for such complex and high-frequency time series data. In contrast, we also observe a relatively poor performance of the LSTM model, suggesting that a shallow LSTM model is not well suited for such contexts. Overall, our model improves performance by 17% (MSE) compared to the best performing baseline. This indicates that adding adaptive input layers and residual learning to the existing CNN architecture proposed by [5] is a valuable extension for multivariate time series regression tasks.

For future work, we plan to conduct additional experiments on large-scale multivariate time series datasets from various fields and with different tasks, e.g., weather prediction and traffic forecasting. In addition, we will investigate the scalability of ARes-CNN when increasing the number of time series channels, and thus the number of separate input layers. Furthermore, we aim to develop new methods for processing sensor data from multiple stations, e.g., using ideas from graph signal processing [25, 26] and by combining residual stacks with the Graph Neural Network structure proposed by [4], to improve model performance.

Acknowledgments

This work has been funded by the Interreg North-West Europe program (Interreg NWE), project Di-Plast - Digital Circular Economy for the Plastics Industry (NWE729). We want to thank A. Michelini (Istituto Nazionale di Geofisica e Vulcanologia) and D. Jozinović (Swiss Seismological Service) for their domain knowledge concerning the CW and CI datasets.

References

[1] D. E. Rumelhart, G. E. Hinton, R. J. Williams, Learning representations by back-propagating errors, Nature 323 (1986) 533–536.
[2] Y. LeCun, Y. Bengio, et al., Convolutional networks for images, speech, and time series, The Handbook of Brain Theory and Neural Networks 3361 (1995).
[3] J. van den Hoogen, S. Bloemheuvel, M. Atzmueller, Classifying multivariate signals in rolling bearing fault detection using adaptive wide-kernel CNNs, Applied Sciences 11 (2021).
[4] S. Bloemheuvel, J. van den Hoogen, D. Jozinović, A. Michelini, M. Atzmueller, Graph neural networks for multivariate time series regression with application to seismic data, International Journal of Data Science and Analytics (2022).
[5] D. Jozinović, A. Lomax, I. Štajduhar, A. Michelini, Rapid prediction of earthquake ground shaking intensity using raw waveform data and a convolutional neural network, Geophysical Journal International 222 (2020) 1379–1389.
[6] C. W. Tan, C. Bergmeir, F. Petitjean, G. I. Webb, Time series extrinsic regression, Data Mining and Knowledge Discovery 35 (2021) 1032–1060.
[7] A. Michelini, L. Margheriti, M. Cattaneo, G. Cecere, G. D'Anna, A. Delladio, et al., The Italian National Seismic Network and the earthquake and tsunami monitoring and surveillance systems, Advances in Geosciences 43 (2016) 31–38.
[8] P. Danecek, S. Pintore, S. Mazza, A. Mandiello, M. Fares, I. Carluccio, E. Della Bina, D. Franceschi, M. Moretti, V. Lauciani, M. Quintiliani, A. Michelini, The Italian Node of the European Integrated Data Archive, Seismological Research Letters 92 (2021) 1726–1737.
[9] D. Jozinović, A. Lomax, I. Štajduhar, A. Michelini, Transfer learning: Improving neural network based prediction of earthquake ground shaking for an area with insufficient training data, Geophysical Journal International (2021).
[10] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: 3rd International Conference on Learning Representations, 2015.
[11] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, Cambridge, MA, USA, 2016.
[12] S. Kiranyaz, T. Ince, R. Hamila, M. Gabbouj, Convolutional neural networks for patient-specific ECG classification, in: Proc. IEEE EMBC, IEEE, 2015, pp. 2608–2611.
[13] A. Zhang, S. Li, Y. Cui, W. Yang, R. Dong, J. Hu, Limited data rolling bearing fault diagnosis with few-shot learning, IEEE Access 7 (2019) 110895–110904.
[14] J. van den Hoogen, S. Bloemheuvel, M. Atzmueller, An improved wide-kernel CNN for classifying multivariate signals in fault diagnosis, in: International Conference on Data Mining Workshops, 2020, pp. 275–283.
[15] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 25 (2012).
[16] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[17] S. M. Mousavi, Y. Sheng, W. Zhu, G. C. Beroza, Stanford Earthquake Dataset (STEAD): A global data set of seismic signals for AI, IEEE Access 7 (2019) 179464–179476.
[18] A. Strollo, D. Cambaz, J. Clinton, P. Danecek, C. P. Evangelidis, A. Marmureanu, et al., EIDA: The European Integrated Data Archive and Service Infrastructure within ORFEUS, Seismological Research Letters 92 (2021) 1788–1795.
[19] P. Jiao, A. H. Alavi, Artificial intelligence in seismology: Advent, performance and future trends, Geoscience Frontiers 11 (2020) 739–744.
[20] S. M. Mousavi, W. L. Ellsworth, W. Zhu, L. Y. Chuang, G. C. Beroza, Earthquake transformer—an attentive deep-learning model for simultaneous earthquake detection and phase picking, Nature Communications 11 (2020) 1–12.
[21] A. Lomax, A. Michelini, D. Jozinović, An investigation of rapid earthquake characterization using single-station waveforms and a convolutional neural network, Seismological Research Letters 90 (2019) 517–529.
[22] M. P. van den Ende, J.-P. Ampuero, Automated seismic source characterization using deep graph neural networks, Geophysical Research Letters 47 (2020) e2020GL088690.
[23] S. Mazilu, A. Calatroni, E. Gazit, D. Roggen, J. M. Hausdorff, G. Tröster, Feature learning for detection and prediction of freezing of gait in Parkinson's disease, in: International Workshop on Machine Learning and Data Mining in Pattern Recognition, Springer, 2013, pp. 144–158.
[24] S. Masiala, W. Huijbers, M. Atzmueller, Feature-set-engineering for detecting freezing of gait in Parkinson's disease using deep recurrent neural networks, arXiv preprint arXiv:1909.03428 (2019).
[25] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, P. Vandergheynst, The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains, IEEE Signal Processing Magazine 30 (2013) 83–98.
[26] S. Bloemheuvel, J. van den Hoogen, M. Atzmueller, A computational framework for modeling complex sensor network data using graph signal processing and graph neural networks in structural health monitoring, Applied Network Science 6 (2021) 97.