Data Preprocessing for Machine Learning in Seismology

Vladimir Chernykh a, Andrey Stepnov b and Olga Lukyanova a

a Computing Center of Far-Eastern Branch, Russian Academy of Sciences, 65 Kim Yo Cheng st., Khabarovsk, 680000, Russia
b Geophystech LLC, 1b Nauki, Yuzhno-Sakhalinsk, 693022, Russia

Abstract

The problem of preliminary processing of data on P and S seismic wave arrivals is formulated. The data were preprocessed for subsequent classification with machine learning models. A comparative analysis of the following neural networks was carried out: GPD, EQTransformer, and PhaseNet. An automated process for applying machine learning methods to seismic wave detection is demonstrated.

Keywords

Machine learning, earthquake, time series, waveform data

1. Introduction

Present-day earthquake detection and analysis require staff trained in the visual detection of different seismic waves in a continuous stream of data from local seismic networks. Local earthquakes generate different types of seismic waves, which travel away from the source. The fastest among these are P-waves and S-waves (primary and secondary). Accurate detection of P and S waves is used to locate the earthquake source by computing its parameters: coordinates, hypocenter depth, and origin time. The continuous growth of seismic networks increases the demand for trained staff. Threshold methods [1] are very popular as a partial approach to automating seismic event detection. However, these methods have proven ineffective in low-magnitude earthquake detection, especially in noisy environments. The machine learning approach has shown detection accuracy comparable to (or even surpassing) that of trained staff [2]. Preprocessing of seismic data is the first and a critical step toward fully automated classification of seismic wave arrivals.
The present paper demonstrates seismic data preprocessing for subsequent use in machine learning methods of earthquake detection and describes the method employed to automate seismic wave detection with these methods.

VI International Conference Information Technologies and High-Performance Computing (ITHPC-2021), September 14–16, 2021, Khabarovsk, Russia
EMAIL: syler1983.9@gmail.com (A. 1); a.stepnov@geophystech.ru (A. 2)
ORCID: 0000-0002-8383-8620 (A. 1); 0000-0003-3401-6687 (A. 2)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. Data description

For neural network training and evaluation, we used a dataset of hand-picked local earthquake data from the Sakhalin island seismic network. The dataset consists of 3045 P-arrivals, 3737 S-arrivals, and 3045 noise fragments collected from 2014 to 2021. Each seismic record is a 3-component (north, east, and vertical) 4-second slice of ground motion data with a sampling rate of 100 Hz. Continuous seismic data streams are usually stored as day-long files with gaps where a station was offline. Data gathered from stations with sampling rates other than 100 Hz were resampled to 100 Hz using the Fourier method. Accelerometer data were integrated over time to convert them to seismograms. Seismic events were selected with a minimum magnitude of 1 and a maximum distance to the earthquake source of 300 km. The data were first detrended and high-pass filtered above 2 Hz and then normalized by the maximum absolute amplitude across all three components. Figure 1 illustrates preprocessing on an actual earthquake from 01.04.2021.
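The preprocessing chain described above (Fourier resampling to 100 Hz, detrending, high-pass filtering above 2 Hz, and normalization by the maximum absolute amplitude) can be sketched with NumPy alone. This is a minimal illustration and not the authors' code: the function names are hypothetical, the detrend is assumed to be linear, and a brick-wall frequency-domain high-pass stands in for whatever filter design was actually used (in practice a designed filter such as a Butterworth would likely be preferred). Nyquist-bin handling in the resampler is ignored for brevity.

```python
import numpy as np

FS = 100.0         # target sampling rate in Hz (from the text)
HIGHPASS_HZ = 2.0  # high-pass corner in Hz (from the text)


def fourier_resample(data, fs_in, fs_out=FS):
    """Resample along axis 0 by cropping or zero-padding the spectrum."""
    n_in = data.shape[0]
    n_out = int(round(n_in * fs_out / fs_in))
    spectrum = np.fft.rfft(data, axis=0)
    # irfft with n=n_out crops or zero-pads the spectrum as needed;
    # the factor n_out / n_in keeps the original amplitudes.
    return np.fft.irfft(spectrum, n=n_out, axis=0) * (n_out / n_in)


def detrend_linear(data):
    """Remove a least-squares straight line from every component (assumed linear)."""
    t = np.arange(data.shape[0])
    # polyfit accepts a 2-D y and fits each column independently.
    slope, intercept = np.polyfit(t, data, deg=1)
    return data - (np.outer(t, slope) + intercept)


def highpass_fft(data, fs=FS, corner=HIGHPASS_HZ):
    """Zero all spectral content below `corner` (brick-wall stand-in filter)."""
    n = data.shape[0]
    spectrum = np.fft.rfft(data, axis=0)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum[freqs < corner] = 0.0
    return np.fft.irfft(spectrum, n=n, axis=0)


def normalize(data):
    """Scale by the largest absolute amplitude over all three components."""
    peak = np.max(np.abs(data))
    return data / peak if peak > 0 else data


def preprocess(data, fs_in):
    """data: array of shape (n_samples, 3) holding the N, E and Z components."""
    if fs_in != FS:
        data = fourier_resample(data, fs_in)
    return normalize(highpass_fft(detrend_linear(data)))


# Synthetic 4 s, 3-component record sampled at 50 Hz: 5 Hz signal + drift + noise.
rng = np.random.default_rng(0)
t = np.arange(200) / 50.0
raw = np.stack(
    [np.sin(2 * np.pi * 5 * t) + 0.5 * t + rng.normal(0, 0.1, t.size) for _ in range(3)],
    axis=1,
)
record = preprocess(raw, fs_in=50.0)
print(record.shape)  # (400, 3)
```

The output has the 400x3 shape of a 4-second record at 100 Hz, with a maximum absolute value of 1 across the three components.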
Figure 1: Seismic data before (a) and after (b) detrending and filtering

We also used a larger dataset of southern California seismic network records [2], composed of 4,773,750 3-component records with an equal number of P-wave arrivals, S-wave arrivals, and noise windows. Its preprocessing is similar to the methods described above for the Sakhalin data.

We used day-long continuous data from 3-component seismographs to evaluate the program integration of the machine learning methods.

3. Models

The integration is designed for classification neural networks and supports customization of the number of output classes, their labels, and which classes count as positive (seismic event vs. noise). In this study, the following models were evaluated: GPD [2], EQTransformer [3], and PhaseNet [4]. The GPD input is a 3-component, 4-second record with a sampling rate of 100 Hz, in other words, an array of shape 400x3. The model output is a set of three probabilities, one for each class: P-wave, S-wave, and noise. The EQTransformer and PhaseNet inputs are 3-component records 60 and 90 seconds long, respectively. To evaluate PhaseNet and EQTransformer, we rebuilt the datasets to meet these input shape requirements. The new datasets are composed of the same P and S wave arrivals and use the same preprocessing methods as described in section 2 (Data description). All models were trained on southern California data and evaluated on 20% of the Sakhalin data (table 1, pre-trained), then fine-tuned on 80% of the Sakhalin data and evaluated on 20% of the Sakhalin data (table 1, fine-tuned). GPD displayed the best results and was therefore chosen as the target model for automating seismic event detection.
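For reference, overall accuracy and per-class F1 scores of the kind used to compare the models can be computed from the three-class probability outputs as in the sketch below. The class order (P, S, N), the function names, and the toy probabilities are assumptions made for illustration. They are not taken from the evaluation itself.

```python
import numpy as np

LABELS = ("P", "S", "N")  # assumed class order: P-wave, S-wave, noise


def accuracy(y_true, y_pred):
    """Fraction of records whose predicted class matches the label."""
    return float(np.mean(y_true == y_pred))


def f1_per_class(y_true, y_pred, n_classes=len(LABELS)):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(float(2 * tp / denom) if denom else 0.0)
    return scores


# Toy example: six records, one row of (P, S, N) probabilities per record.
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
    [0.5, 0.2, 0.3],
])
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = probs.argmax(axis=1)  # most probable class per record

acc = accuracy(y_true, y_pred)
f1 = f1_per_class(y_true, y_pred)
print(round(acc, 3), [round(s, 3) for s in f1])
```

Reporting F1 per class rather than a single average keeps a weak class visible, which matters when one class (for example noise) behaves very differently from the others.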
Table 1
Model evaluation results on the local seismic event data

Model          Accuracy,    Accuracy,   F1 score, pre-trained   F1 score, fine-tuned
               pre-trained  fine-tuned  P      S      N         P      S      N
GPD            0.87         0.94        0.89   0.87   0.86      0.94   0.95   0.93
EQTransformer  0.51         0.81        0.34   0.20   0.34      0.86   0.89   0.34
PhaseNet       0.61         0.49        0.69   0.66   0.66      0.60   0.61   0.50

4. Automation of seismic waves detection

The program integration is designed to work with seismogram databases generated by the SEISAN [5] software, a package for analyzing earthquakes. The system provides the means to maintain a database containing the configuration of the seismic station network, earthquake records, and data stream archives from the seismic station network. A SEISAN database includes the following directories:

• REA – earthquake readings and full epicenter solutions in a database
• WOR – the user's work directory
• DAT – default and parameter files, system configuration files
• WAV – digital waveform data files
• archives – database of continuous seismic data from stations split into day-long files

The product of the automation development is a program that analyzes SEISAN database files and searches for earthquakes in the data stream from specified stations. The program reads the network configuration, including information on active stations, from the DAT directory and searches the daily archives of the continuous stream from seismic stations. Currently, the automation is not used in real time; instead, a daily analysis of the new seismic data from the specified monitoring stations is performed. The workflow of the earthquake detection automation is displayed in figure 2.

Figure 2: Earthquake detection automation workflow

The automation employs the following workflow: the program scans the SEISAN database structure and picks the correct seismogram archive for analysis. Next, preliminary data processing is performed: detrending and high-pass filtering above 2 Hz.
The seismic data stream is then split by a sliding window with a length of 4 seconds and a step of 0.1 seconds; each window is normalized and used as input to the target neural network. The class predictions are then restored to the input data frequency (from 10 Hz to 100 Hz) using linear interpolation, resulting in three probability curves: the P-wave curve, the S-wave curve, and the noise curve. An example of the probability curves is displayed in figure 3, b alongside the raw input data (figure 3, a).

Figure 3: Probability curves example for an earthquake prediction (N, E, Z – input components; P, S – seismic waves probability, N – noise probability)

Positive wave arrival predictions are marked as red "stars" on the P and S probability curves. For each positive class (P and S waves), peaks of the probability curve are found with a minimum allowed distance of 10 seconds between adjacent peaks and a threshold probability of 0.95. Then, the mean value in a 1-second span (a quarter of the window length) around the peak position is calculated for every class probability curve. Finally, the mean values are compared, and if the mean of the class to which the peak belongs is the highest, the peak position is taken as a positive prediction. Positives are then written to a text file in order of occurrence, together with the time, probability, type (P-arrival or S-arrival), and seismic station of each positive.

In addition, the program supports visualization of data and predictions as graphs, including score visualization (figure 3), preprocessed data plotting, and raw data plotting (figure 3). Launch options were also implemented for measuring the run time of the entire automation and of the neural network computation alone, to support future comparative analysis of models.

5. Conclusion

The development of the present study yielded the application for the automation of seismic waves detection using machine learning methods.
Also, during the automation evaluation, new flaws in classification neural network training were revealed, which may lead to further studies and improvements. Extensive effort was put into gathering and processing data for model training and evaluation; these data may be used in future projects and for new classification neural networks. In addition, a program package was developed for gathering data (P and S wave and noise records) from SEISAN databases. The package can also filter events by magnitude, source depth and distance, and seismic monitoring station properties (such as the number of components and instrument types).

6. Acknowledgements

The work was done with partial support by the Russian Foundation for Basic Research, project No. 18-29-03196. The studies were carried out using the resources of the Center for Shared Use of Scientific Equipment "Center for Processing and Storage of Scientific Data of the Far Eastern Branch of the Russian Academy of Sciences" [6], funded by the Russian Federation represented by the Ministry of Science and Higher Education of the Russian Federation under project No. 075-15-2021-663.

7. References

[1] Withers M., Aster R., Young C., Beiriger J., Harris M., Moore S., Trujillo J. A Comparison of Select Trigger Algorithms for Automated Global Seismic Phase and Event Detection. Bulletin of the Seismological Society of America, Vol. 88, No. 1, pp. 95-106, February 1998.
[2] Ross Z., Meier M., Hauksson E., Heaton T. Generalized Seismic Phase Detection with Deep Learning. Bulletin of the Seismological Society of America, Vol. 108, pp. 2894-2901, 2018.
[3] Mousavi, S.M., Ellsworth, W.L., Zhu, W. et al. Earthquake transformer—an attentive deep-learning model for simultaneous earthquake detection and phase picking. Nat Commun 11, 3952 (2020). https://doi.org/10.1038/s41467-020-17591-w.
[4] Zhu, W., & Beroza, G. C. (2018). PhaseNet: A Deep-Neural-Network-Based Seismic Arrival Time Picking Method. arXiv preprint arXiv:1803.03211.
[5] SEISAN - earthquake analysis software. URL: http://seisan.info.
[6] A. A. Sorokin, S. V. Makogonov, S. P. Korolev, The information infrastructure for collective scientific work in the Far East of Russia, Scientific and Technical Information Processing 44(4) (2017) 302-304, doi:10.3103/S0147688217040153.
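As a closing illustration of the detection steps in section 4 (4-second sliding windows with a 0.1-second step, linear interpolation of the class scores back to 100 Hz, and peak selection with a 0.95 threshold, a 10-second minimum peak distance, and a 1-second mean comparison), the following is a minimal sketch. Several things here are assumptions, not statements about the actual program: the function names are hypothetical, each window's score is assumed to be placed at the window center, a simple greedy search stands in for the peak finder, the class order is taken to be P, S, N, and the scores are synthetic stand-ins for the network output.

```python
import numpy as np

FS = 100                 # input sampling rate, Hz
WINDOW = 4 * FS          # 4 s window -> 400 samples
STEP = FS // 10          # 0.1 s step -> 10 samples
MIN_DISTANCE = 10 * FS   # 10 s between adjacent peaks
THRESHOLD = 0.95         # minimum peak probability
HALF_SPAN = FS // 2      # half of the 1 s span around a peak


def sliding_windows(stream):
    """Cut (n_samples, 3) data into normalized windows of shape (n, WINDOW, 3)."""
    starts = np.arange(0, stream.shape[0] - WINDOW + 1, STEP)
    windows = np.stack([stream[s:s + WINDOW] for s in starts])
    peak = np.max(np.abs(windows), axis=(1, 2), keepdims=True)
    return windows / np.where(peak > 0, peak, 1.0), starts


def restore_rate(scores, starts, n_samples):
    """Linearly interpolate per-window scores onto the 100 Hz sample grid.
    Assumption: each score is attached to the center of its window."""
    centers = starts + WINDOW // 2
    grid = np.arange(n_samples)
    return np.stack(
        [np.interp(grid, centers, scores[:, c]) for c in range(scores.shape[1])],
        axis=1,
    )


def pick_peaks(curve):
    """Greedy stand-in peak search: local maxima >= THRESHOLD, strongest first,
    keeping only peaks at least MIN_DISTANCE samples from those already kept."""
    mid = curve[1:-1]
    is_max = (mid >= curve[:-2]) & (mid > curve[2:]) & (mid >= THRESHOLD)
    candidates = np.flatnonzero(is_max) + 1
    picked = []
    for idx in candidates[np.argsort(curve[candidates])[::-1]]:
        if all(abs(idx - p) >= MIN_DISTANCE for p in picked):
            picked.append(int(idx))
    return sorted(picked)


def detect(curves, positive=(0, 1)):
    """curves: (n_samples, 3) probabilities in the assumed order P, S, N."""
    detections = []
    for c in positive:
        for idx in pick_peaks(curves[:, c]):
            lo, hi = max(idx - HALF_SPAN, 0), idx + HALF_SPAN + 1
            means = curves[lo:hi].mean(axis=0)  # 1 s mean of every class
            if np.argmax(means) == c:           # the peak's class must win
                detections.append((idx / FS, "PS"[c], float(curves[idx, c])))
    return sorted(detections)


# Demo on 60 s of random data; the scores below are synthetic, not model output.
rng = np.random.default_rng(0)
stream = rng.normal(size=(60 * FS, 3))
windows, starts = sliding_windows(stream)
centers = starts + WINDOW // 2
p = np.exp(-0.5 * ((centers - 20 * FS) / 50.0) ** 2)  # fake P arrival at 20 s
s = np.exp(-0.5 * ((centers - 25 * FS) / 50.0) ** 2)  # fake S arrival at 25 s
scores = np.stack([p, s, 1.0 - p - s], axis=1)
curves = restore_rate(scores, starts, stream.shape[0])
detections = detect(curves)
for time_s, kind, prob in detections:
    print(f"{time_s:7.2f} s  {kind}  {prob:.2f}")
```

Stacking every window at once is only practical for short streams; for day-long archives the windows would need to be generated and scored in batches.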