Advanced Signal Reconstruction in Tunka-Rex with Matched Filtering and Deep Learning Pavel Bezyazeekov1 , Nikolai Budnev1 , Oleg Fedorov1 , Oleg Gress1 , Oleg Grishin1 , Andreas Haungs2 , Tim Huege2,3 , Yulia Kazarina1 , Matthias Kleifges4 , Dmitriy Kostunin5 , Elena Korosteleva6 , Leonid Kuzmichev6 , Vladimir Lenok2 , Nima Lubsandorzhiev6 , Stanislav Malakhov1 , Tatiana Marshalkina1 , Roman Monkhoev1 , Elena Osipova6 , Alexander Pakhorukov1 , Leonid Pankov1 , Vasily Prosin6 , Frank Gerhard Schröder2,7 , Dmitry Shipilov1 , and Alexey Zagorodnikov1 1 Institute of Applied Physics ISU, Irkutsk, Russia 2 Institut für Kernphysik, KIT, Karlsruhe, Germany 3 Astrophysical Institute, Vrije Universiteit Brussel, Pleinlaan 2, Brussels, Belgium 4 Institut für Prozessdatenverarbeitung und Elektronik, KIT, Karlsruhe, Germany 5 DESY, Zeuthen, Germany 6 Skobeltsyn Institute of Nuclear Physics MSU, Moscow, Russia 7 Bartol Research Inst., Dept. of Phys. and Astron., Univ. of Delaware, Newark, USA Abstract. The Tunka Radio Extension (Tunka-Rex) is a digital antenna array operating in the frequency band of 30-80 MHz, measuring the radio emission of air-showers induced by ultra-high energy cosmic rays. Tunka- Rex is co-located with the TAIGA experiment in Siberia and consists of 63 antennas, 57 of them in a densely instrumented area of about 1 km2 . The signals from the air showers are short pulses, which have a duration of tens of nanoseconds and are recorded in traces of about 5 µs length. The Tunka-Rex analysis of cosmic-ray events is based on the reconstruction of these signals, in particular, their positions in the traces and amplitudes. This reconstruction suffers at low signal-to-noise ratios, i.e. when the recorded traces are dominated by background. To lower the threshold of the detection and increase the efficiency, we apply advanced methods of signal reconstruction, namely matched filtering and deep neural networks with autoencoder architecture. In the present work we show the comparison between the signal reconstructions obtained with these techniques, and give an example of the first reconstruction of the Tunka-Rex signals obtained with a deep neural networks. Keywords: Tunka-Rex · Matched filtering · Autoencoder · Denoising 1 Introduction Cosmic rays (CR) are high-energy charged particles with extra-terrestrial origin. The sources of ultra-high energy CR are still unknown due to complexity of tracing charged particles deflected by galactic and extragalactic magnetic fields. However, the energy spectrum and mass composition of CR can shed light on the most powerful cosmic accelerators. Due to low flux of ultra-high energy CR it is impossible to measure them directly (in space or higher layers of atmosphere), and they are detected by large ground detectors measuring cascades produced by their interaction with the atmosphere. These cascades, called air-showers, consist of many secondary particles, including electrons and positrons, which produce short radio pulses due to deflection in the Earth’s magnetic field. These pulses have a broadband spectrum mostly in the MHz domain and a duration of tens of nanoseconds [1]. Tunka-Rex [2] is a sparse antenna array located at the TAIGA facility [3, 4] in the Tunka valley (Eastern Siberia). It consists of 63 antennas measuring radio emission from air showers in the frequency band of 30-80 MHz. Since Tunka- Rex is placed in a relatively radio-quiet location, the main background is from the Galaxy and has a power law behavior with an index of ≈ −2.5. However, there are plenty of non-stationary background sources, which complicate the re- construction of events with low energies and may distort the reconstruction of the signal. This background is caused by hardware RFI (especially after several upgrades of TAIGA facility) and by anthropogenic influence. A simple study of the background shows that the noise in the Tunka-Rex traces has distinguish- able features and cannot be approximated as white noise (see Fig. (1)). In this work we discuss different approaches of signal reconstruction and background suppression, particularly a standard one used in many experiments and more sophisticated ones: matched filtering (MF) and autoencoder (AE). We discuss their performance and future applications. 2 Signal reconstruction The air-shower pulse at each antenna station is measured in two independent channels and digitized in two traces of 1024 samples each with a rate of 200 MHz (i.e. binsize of 5 ns). For the reconstruction of cosmic-ray air showers the two main properties of radio pulses are used: the amplitude of the signal and its arrival time. Details of this reconstruction and its application to cosmic-ray science are given in Refs. [5, 6]. In this paper we focus on the signal reconstruction itself and methods of its improvement. Before the reconstruction of the signals we perform several preprocessing transformations. Spectra of the signals obtained with Fourier transform are cut by a digital bandpass to 35-80 MHz and filtered with a median filter, which removes narrow-band RFI and equalizes the noise using a sliding window of 3 MHz width. Then the traces are upsampled in order to increase the timing resolution by factors of 4, 16, and 64 for the standard method, AE, and MF, respectively. As last step, the electric field along the two polarization directions in the plane perpendicular to the shower axis are reconstructed, namely v × B (along the Lorentz force, where v is the direction of the air shower and B the direction of the geomagnetic field) and v × v × B perpendicular to it. Since the radio emission of air-showers is mostly in the v × B polarization we consider only this one hereafter. averaged signal before pulse: 400 RMS (mean) = 21.84 standard deviation RMS (aver.) = 4.09 after pulse: 200 amplitude (µV/m) RMS (mean) = 24.56 RMS (aver.) = 4.21 0 −200 −400 −1.5 −1.0 −0.5 0.0 0.5 1.0 1.5 time (µs) Fig. 1. Average of 400 Tunka-Rex traces centered at the pulse. The black line indi- cates the mean value, the shaded area indicates √ one standard deviation. The expected reduction factor of the noise by averaging is 400 = 20, however it is significantly less (about 5), moreover one can see prominent noise feature after the pulse. The main quantity used for the definition of the threshold of signal recon- struction is the signal-to-noise ratio (SNR). The SNR is defined using the fol- lowing formula: SNR = S 2 /N 2 , (1) where S is the amplitude of the peak of the signal envelope S = max(s(t)) inside of the signal window, a 200 ns window defined by the trigger time. The envelope s(t) is defined as s(t) = u(t) + iH[u(t)] , (2) where u(t) is the initial trace of the electric field at the antenna as a function of time t, and H denotes the Hilbert transformation.The noise level N is defined as RMS in the noise window. SNRth is defined as minimal RMS by sliding the noise window, a 500 ns window, scanning the trace in order to pass over occasional RFI. We also use the additional definition of neighborhood SN Rneighb. for the cases of overlapping signal with RFI (when noise is inside the signal window) based on the estimation of the noise level around the signal (± 100 ns). Thus, to keep the ratio of false positive detection at the level of 5%, the thresholds are set as following: SNRth ≥ 16 and SNRneighb. ≥ 10. 2.1 Signal and background dataset for simulation study In this study we use a dataset of 650 000 samples of measured background (2014-2017) recorded by Tunka-Rex and 25 000 CoREAS [7] simulations. The air-shower pulse is randomly located within the signal window, summed with −3 3 6 ×10 1.6 ×10 threshold = 854 µV/m CC with noise 1.4 5 1.2 True amplitude (µV/m) Probability density 4 1.0 3 0.8 0.6 2 0.4 threshold 1 0.2 signal w/noise signal without noise 0 0.0 0.2 0.4 0.6 √ 0.8 1.0 1.2 1.4 0.0 0.5 1.0 1.5√ 2.0 2.5 3.0 3.5 4.0 Acc (µV/m) ×103 Acc (µV/m) ×103 √ Fig. 2. Left: Distribution of MF values Acc applied on traces of pure noise without air-shower signal. The detection threshold √ (red line) is set to have < 5% of false positive detections. Right: Correlation between Acc and the amplitude of the simulated signal. noise and folded with the Tunka-Rex hardware response taking into account the geometry of the air shower and the calibration of the detector. As was discussed in Ref. [6], the simulated signals reproduce real ones with satisfactory accuracy. As shown below, the methods developed for simulated pulses can be applied to the real data without additional tuning. 2.2 Matched filtering MF is a widely used technique of extracting signal from noisy data based on the convolution of a signal template T with the time trace u(t): X scc [t] = T [t0 − t]u[t], Acc = max(scc [t0 ]) , (3) t0 where Acc is the maximum of the convolution, which defines the peak of the filtered signal scc [t]. MF is very effective for the detection of the signal position inside a trace con- taminated significantly by white noise and can even detect signals with SNR  1. However, MF is unable to reconstruct the pulse shape and can only give an ap- proximate estimation of the amplitude of initial signal. For testing the filter we use a single template with a length of 60 ns obtained from averaging √ over many CoREAS simulations. The threshold of signal detec- tion is set to Acc = 854 µV/m (see Fig. (2), left), allowing for a 5% probability of false positives (similar to our standard method [2]). 2.3 Autoencoder The AE [8] is an unsupervised convolutional network used for learning the coded representation of the data and removing specific features from it. We train the AE in order to learn noise-related features and clean the input traces leaving 8 filters 16 filters 32 filters 8 filters 16 filters 32 filters Number of layers 3 0.34 0.28 0.27 3 0.78 0.74 0.68 Number of layers 4 0.24 0.31 0.27 4 0.74 0.72 0.74 5 0.26 overfit overfit 5 0.92 overfit overfit Number of filters on 1st layer Number of filters on 1st layer Fig. 3. Performance of different configurations of the autoencoder (see main text for explanation). Left: efficiency (fraction of reconstructed events Nrec /Ntot ). The rela- tively low efficiency is due to the detection threshold of the AE at the very low SNRs. Right: purity (fraction of events with properly reconstructed peak position, namely < 5 ns from true value: Nhit /Nrec ). The reconstruction quality of the AE increases with complexity. However, the AE has been overfitted when the number of degrees of freedom exceeded the size of the training data. cosmic-ray signals. The input array for our AE consists of 4096 values, what corresponds to a trace length of 1280 ns and 0.3125 ns sampling in order to contain the signal window of 200 ns as well as surrounding background. For the minimization of the loss, we normalize the input data to the [0:1] range with a baseline level at 0.5. The encoder, the encoding part of the AE, distinguishes features of noise contained in the input data by applying a set of filters. The filters perform the convolution of characteristic noise-related features with the input data, estimate its contribution as a result of the convolution, and afterwards send it to the max-pooling layer. The max-pooling layer performs a discrete downsampling of the data and sends it to the next convolution layer with the next set of filters. With each layer of the encoder, the data becomes more abstract and reduced in size. The result of encoding is a map of features of the input data. After encoding, noise-related features are removed and the map of denoised data proceeds to the next part of the AE, the decoder, which produces a reverse reconstruction and returns a data array of the same dimension as the input. In case of success, the resulting output is the denoised trace containing only the air-shower pulse. We have explicitly selected a subsample with low amplitudes and low SNR for training to test the possibility of lowering the threshold. We implemented and trained our AE with Keras [9] and Tensorflow [10] in a uDocker container with GPU support. 80 signal+noise 60 signal denoised signal 40 Amplitude (µV/m) 20 0 −20 −40 −60 −80 500 600 700 800 900 1000 1100 t (ns) 150 signal+noise signal 100 denoised signal Amplitude (µV/m) 50 0 −50 −100 500 600 700 800 900 1000 t (ns) 80 60 40 Amplitude (µV/m) 20 0 −20 −40 signal+noise −60 signal denoised signal −80 500 600 700 800 900 1000 t (ns) Fig. 4. Examples of the AE performance on simulated data (from top to bottom): 1) correct identification (true positive, signal is perfectly denoised); 2) no identifica- tion (false negative, due to heavy distortion by the noise), 3) double identification (true+false positive, passing signal-like RFI). 103 18 reco. (std.) reconstructed (std.), µ = 0.75◦ reco. (MF) 16 reconstructed (MF), µ = 0.90◦ reco. (AE) reconstructed (AE), µ = 1.02◦ 14 simulated 2 10 12 number of events number of events 10 8 1 10 6 4 2 100 0 16.0 16.5 17.0 17.5 18.0 0 1 2 3 4 5 energy log(E/eV) 6 (Atrue, Arec) (degrees) Fig. 5. Comparison of the performance of different methods on simulated data. Left: distribution of the reconstructed events as function of the energy for the standard, MF and AE pipelines; right: angular resolution using the different methods. We have tested several AE with different sizes defined by the depth (D) and number of filters per layer (N ), where the configuration of the i-th (i = 1, ..., D) encoding layer is defined as follows: Si = Smin × 2D−i , ni = 2i+N −1 , (4) where Si is the size of the i-th filter, ni is the number of filters per layer; Smin = 16 is the minimum size of a layer. To estimate the performance of the AE we use a cross-entropy loss function with the following metrics: – efficiency Nrec. /Ntot. , the fraction of reconstructed events. This metric indi- cates how many non-zero traces are returned by the AE. However this metric alone is insufficient. To evaluate the performance of the AE one has to check also the purity. – purity Nhit /Nrec. , the fraction of events with a peak position reconstructed within 5 ns from the true position. This metric indicates the true positive detections and filters misreconstructed signals. Fig. (3) shows a comparison chart of the different AE settings used. Although only a quarter of the testing sample is reconstructed (due to low SNR), the purity of the reconstruction is very high, i.e. the reconstruction of background- dominated signals is improved. Moreover, the performance of the AE increases with the complexity of the network and we plan to train more sophisticated AE with larger datasets (our present AE is limited by the relatively small training set). Examples of the reconstruction are given in Fig. (4). 0.8 expected pulse position raw signal 0.6 denoised signal 0.4 Amplitude (arbitrary) 0.2 0 -0.2 -0.4 -0.6 -0.8 300 400 500 600 700 800 t (ns) Fig. 6. Example of the autoencoder performance on a measured Tunka-Rex trace show- ing successful denoising of the typical RFI after the signal, cf. Fig. (1). 2.4 Comparison of the methods We have integrated both, the MF and AE algorithms, with the Auger Offline framework [11], used for the standard reconstruction of Tunka-Rex. This enables us to check the performance of MF and AE integrated in Tunka-Rex reconstruc- tion pipeline, and to use the identical benchmark for comparison between the standard signal reconstruction, matched filtering, and denoising with the au- toencoder. As metrics we selected the reconstruction of air-shower events (which requires the reconstruction of the signal at several antenna station per event), and the accuracy of the air-shower reconstruction. We performed this test on simulations using a small subset of the Tunka-Rex library. The results of the comparison in Fig. (5) show that both, MF and AE, are ready for the appli- cation to the reconstruction of real data. Moreover, the threshold is slightly lowered. With future improvements of MF (using a library of templates instead of a single one, and better whitening of the noise) and the AE (developing a more sophisticated network using a larger training dataset) we expect a fur- ther improvement of the reconstruction. At the moment we are working on the application of the AE on real Tunka-Rex data, an example is shown in Fig. (6). 3 Discussion and conclusion The signal reconstruction in the Tunka-Rex experiment is continuously improved (see, for example, Refs. [12, 13]). One of the most advanced method, namely deep learning, represented by the autoencoder technique is discussed in this work (it is worth noticing that a similar technique was recently used for a similar prob- lem [14]). We have shown that the methods described in this paper are ready for the implementation for the reconstruction of real data, and we are currently working on this reconstruction. Besides the application of these methods to Tunka-Rex data we plan to apply them in a Tunka-Rex child engineering ex- periment, Tunka-21cm, for RFI tagging and denoising, and to the Tien-Shan array [15] recently upgraded with SALLA. Acknowledgements This work has been supported by the Russian Foundation for Basic Research › › (grants 18-32-00460 and 18-32-20220), by the Helmholtz grant HRSF-0027, › and by grant 18-41-06003 Russian Science Foundation (Section 2.3). References 1. Schröder, Frank G., “Radio detection of Cosmic-Ray Air Showers and High-Energy Neutrinos,” Prog. Part. Nucl. Phys., vol. 93, pp. 1–68, 2017. 2. P. A. Bezyazeekov et al., “Measurement of cosmic-ray air showers with the Tunka Radio Extension (Tunka-Rex),” Nucl. Instrum. Meth., vol. A802, pp. 89–96, 2015. 3. N. Budnev et al., “The TAIGA experiment: From cosmic-ray to gamma-ray as- tronomy in the Tunka valley,” Nucl. Instrum. Meth., vol. A845, pp. 330–333, 2017. 4. D. Kostunin et al., “Tunka Advanced Instrument for cosmic rays and Gamma As- tronomy,” in 18th International Baikal Summer School on Physics of Elementary Particles and Astrophysics: Exploring the Universe through multiple messengers (ISAPP-Baikal 2018) Bolshie Koty, Lake Baikal, Russia, July 12-21, 2018, 2019. 5. P. A. Bezyazeekov et al., “Radio measurements of the energy and the depth of the shower maximum of cosmic-ray air showers by Tunka-Rex,” JCAP, vol. 1601, no. 01, p. 052, 2016. 6. P. A. Bezyazeekov et al., “Reconstruction of cosmic ray air showers with Tunka- Rex data using template fitting of radio pulses,” Phys. Rev., vol. D97, no. 12, p. 122004, 2018. 7. T. Huege, M. Ludwig, and C. James, “Simulating radio emission from air showers with CoREAS,” AIP Conf.Proc., vol. 1535, p. 128, 2013. 8. Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, pp. 1798– 1828, Aug. 2013. 9. F. Chollet, “keras.” https://github.com/fchollet/keras, 2015. 10. M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous sys- tems,” 2015. Software available from tensorflow.org. 11. P. Abreu et al., “Advanced functionality for radio analysis in the Offline soft- ware framework of the Pierre Auger Observatory,” Nucl.Instrum.Meth., vol. A635, pp. 92–102, 2011. 12. P. A. Bezyazeekov et al., “Imporoving reconstrucion methods for radio measure- ments with Tunka-Rex,” in 25th European Cosmic Ray Symposium (ECRS 2016) Turin, Italy, September 04-09, 2016, eConf C16-09-04.3, 2017. 13. D. Kostunin et al., “Improved measurements of the energy and shower maximum of cosmic rays with Tunka-Rex,” PoS, vol. ICRC2017, p. 400, 2017. 14. M. Erdmann, F. Schlüter, and R. Smida, “Classification and Recovery of Radio Sig- nals from Cosmic Ray Induced Air Showers with Deep Learning,” JINST, vol. 14, no. 04, p. P04005, 2019. 15. A. Beisenova et al., “Search for EAS radio-emission at the Tien-Shan shower instal- lation at a height of 3340 m above sea level,” EPJ Web Conf., vol. 145, p. 11003, 2017.