Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

BENCHMARK OF GENERATIVE ADVERSARIAL NETWORKS FOR FAST HEP CALORIMETER SIMULATIONS

F. Rehm1,2,a, S. Vallecorsa1, K. Borras2,3, D. Krücker3

1 CERN, Esplanade des Particules 1, Geneva, Switzerland
2 RWTH Aachen University, Templergraben 55, Aachen, Germany
3 DESY, Notkestraße 85, Hamburg, Germany

E-mail: a florian.matthias.rehm@cern.ch

Highly precise simulations of elementary particle interactions and processes are fundamental to accurately reproducing and interpreting experimental results in High Energy Physics (HEP) detectors and to correctly reconstructing the particle flows. Today, detector simulations typically rely on Monte Carlo-based methods, which are extremely demanding in terms of computing resources. The need for simulated data at future experiments, such as those that will run at the High Luminosity Large Hadron Collider (HL-LHC), is expected to increase by orders of magnitude, drastically increasing the computational challenge. This expectation motivates research into alternative deep learning-based simulation strategies. In this research we speed up HEP detector simulations for the specific case of calorimeters using Generative Adversarial Networks (GANs) by a factor of over 150 000x compared to standard Monte Carlo simulations. This could only be achieved by designing dedicated convolutional 2D network architectures that generate 3D images representing the detector volume. A detailed physics evaluation shows an accuracy similar to that of the Monte Carlo simulation. Furthermore, we quantize the data format of the neural network (float32) to a reduced-precision format (int8) with the Intel Low Precision Optimization Tool (LPOT). This results in an additional 1.8x speed-up on modern Intel hardware while maintaining the physics accuracy.
These results support the use of GANs for future fast detector simulations.

Keywords: Generative Adversarial Networks, Calorimeter Simulation, Fast Simulation, Reduced Precision Computing

Florian Rehm, Sofia Vallecorsa, Kerstin Borras, Dirk Krücker

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. HEP Calorimeter Simulations

At present, detector simulations are primarily performed with the Geant4 toolkit [1], which relies on Monte Carlo-based methods. Calorimeters are detectors that measure the energy of particles in high energy physics experiments such as those at the Large Hadron Collider (LHC). Due to their considerable complexity and high granularity, calorimeter simulations are the tasks that consume the largest fraction of computational resources. In the future High Luminosity LHC (HL-LHC) phases, the amount of data to be simulated will increase significantly due to the higher luminosities. Furthermore, calorimeter detectors are becoming progressively more complex, with higher granularities. The resulting computational requirements are expected to exceed the extrapolated resources of the Worldwide LHC Computing Grid [2] by far.

In this research, Generative Adversarial Networks (GANs), a modern deep learning approach, are applied to speed up calorimeter simulations. Recent publications have already demonstrated speed-ups of orders of magnitude [3, 4] while maintaining physics accuracy [5, 6]. The training data consist of 200 000 three-dimensional, high-granularity shower images with a size of 25x25x25 pixels. An example shower image is shown in Figure 1.

Figure 1.
(Left) An example electromagnetic calorimeter 3D shower image with a primary particle energy of 500 GeV. (Right) Inference of the Conv2D model run with different batch sizes. A batch size of 2 048 yields the highest throughput, with 9 347 showers per second (a 158 000x speed-up versus Geant4).

2. 3D Generative Adversarial Network

Deep learning approaches are today an appropriate choice for computationally demanding problems. Generative Adversarial Networks (GANs) are an established class of models that generate realistic data similar to the data in a training set. In a GAN, two models play adversarial roles in a setup based on game theory. The generator network tries to fool the discriminator network by passing off fake (generated) images as real training images. The discriminator, on the other hand, tries to distinguish between real data (images from the training data set) and fake data (generated images). Training is successful when the discriminator is no longer able to distinguish between original and synthetic images and outputs a classification probability of 50% for each class. Both the generator and the discriminator are parameterized by deep neural networks.

Since we interpret the calorimeter output as a three-dimensional image, we can build neural networks consisting primarily of convolutional layers. Although the generated images are three-dimensional, we designed an architecture that uses only 2D convolutional (Conv2D) layers in order to reduce the computation time. The generator architecture is shown in Figure 2 and the discriminator architecture in Figure 3.

Figure 2. Conv2D generator architecture

The networks consist of three branches corresponding to the three canonical image axes.
The generator input latent space comprises 200 random numbers drawn from a uniform distribution between zero and one, multiplied by the primary particle energy 𝐸𝑝. The generator output is the three-dimensional image with 25x25x25 pixels. In addition to Conv2D layers, the generator network includes transposed 2D convolutional (Conv2D_transpose) layers to increase the image size, batch normalization (BatchNorm), rectified linear unit (ReLU) and leaky ReLU (LeakyReLU) activation functions, and dropout layers (Dropout). With the help of the three branches, the network is capable of learning the correlations between all three image dimensions.

For the discriminator we employ a model that likewise consists of three branches. Its input is either a real image from the training set or a generated image. The discriminator has three outputs. The first is the typical GAN true/fake probability [7], which is used to calculate a binary cross-entropy loss [8]. The second output (named AUX, for AUXiliary) is the result of a regression task on the initial particle energy 𝐸𝑝, which the discriminator estimates from the image using a dense layer; its loss is implemented as a Mean Absolute Percentage Error (MAPE) [9]. The third output comes from a Lambda layer that calculates the sum over the pixels of the input image and therefore corresponds to the total energy of the input image. It is named ECAL and likewise uses the MAPE loss function.

Figure 3. Conv2D discriminator architecture

3. GAN Evaluation

We evaluate the Conv2D GAN model in terms of physics accuracy and computational speed and compare it to a previous architecture from Ref. [10], which uses Conv3D layers for the same simulation use case. Finally, the new Conv2D model is compared to the Geant4 simulation that it is intended to replace.
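The generator input and the two MAPE-based discriminator quantities described in Section 2 can be illustrated with a minimal NumPy sketch. It is not the authors' TensorFlow implementation: the function names (make_latent, ecal_sum, mape), the example energies and the random stand-in for a generated image are assumptions made purely for illustration.

```python
import numpy as np

LATENT_DIM = 200  # latent size stated in the text


def make_latent(ep, rng):
    """Uniform noise in [0, 1), scaled by the primary energy of each shower."""
    ep = np.asarray(ep, dtype=np.float32).reshape(-1, 1)
    noise = rng.uniform(0.0, 1.0, size=(ep.shape[0], LATENT_DIM)).astype(np.float32)
    return noise * ep


def ecal_sum(images):
    """Total energy per image: sum over all pixels (the role of the Lambda layer)."""
    return images.reshape(images.shape[0], -1).sum(axis=1)


def mape(y_true, y_pred, eps=1e-7):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.maximum(np.abs(y_true), eps))


rng = np.random.default_rng(0)
ep = np.array([100.0, 500.0])  # hypothetical primary energies
z = make_latent(ep, rng)       # generator input, one row per shower

# Random stand-in for two generated 25x25x25 images (assumption, not real output).
fake = rng.uniform(0.0, 1.0, size=(2, 25, 25, 25))

print(z.shape)                  # latent batch shape
print(ecal_sum(fake).shape)     # one total energy per image
print(mape([100.0, 200.0], [110.0, 180.0]))  # both predictions are off by 10%
```

In a training loop, the AUX loss would compare the regressed energy with the true 𝐸𝑝 through mape, and the ECAL loss would compare the pixel sums of generated and reference images in the same way.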
The goal is to reduce the simulation time while providing the level of accuracy needed to evaluate the physics results.

The inference is run on an Nvidia Tesla V100 GPU with Python 3.6.8 and TensorFlow 2.2.0. We run 20 warm-up batches and afterwards evaluate 100 inference steps of 20 batches each. The inference of the Conv2D model is run with different batch sizes, and we measure the speed-up versus the Geant4 simulation and the GPU utilization in percent. The results are presented in Figure 1 (right). For increasing batch sizes, the GPU utilization and the number of showers per second rise almost linearly up to a batch size of 2 048, where the throughput reaches its peak of 9 347 showers per second. This results in a 158 000x speed-up compared to the Geant4 simulation, which requires 17 seconds to produce a single shower image (taken from a previous measurement in Ref. [11]). At a batch size of 2 048 the GPU is almost completely utilized, which results in a drop in showers per second for larger batch sizes.

In Table 1 we compare the new Conv2D network to the previous Conv3D architecture. The Conv2D model provides a much larger speed-up versus Geant4 than the Conv3D model, even though it has a much higher number of parameters and convolutional layers. It should be noted that no batch size optimization was performed for the Conv3D model. However, the GPU utilization of the Conv3D model at a batch size of 128 is already quite high, so no significant further speed-up of the Conv3D model is expected from such an optimization.

Table 1. The number of parameters and the number of convolutional layers for the Conv2D and Conv3D generator models.
The speed-up is given with respect to Geant4, and the last column shows the GPU utilization during inference.

Model     Parameters   Nb. Conv Layers   Speed-up vs Geant4   GPU Utilization
Conv3D       752 000          4                 6 200x             78.75%
Conv2D     2 052 000         28               158 000x             98.50%

To quantify the physics agreement of the GAN output with Geant4, we define an accuracy metric based on the mean squared error (MSE). It is calculated by building two-dimensional projections of the particle shower distributions along the 𝑥-, 𝑦- and 𝑧-axes (averaged over 20 000 samples) and measuring the MSE between the GAN projections and the corresponding Geant4 projections. The Conv2D architecture has an MSE of 0.027, which is lower than the 0.065 of the previous Conv3D architecture (since the MSE measures the error, a lower value means better accuracy). The same behavior can be observed in the shower shape plots in Figure 4 (left): the Conv2D model (green) is closer to Geant4 (red) than the Conv3D model (blue). In particular, the new Conv2D model is able, for the first time, to correctly reproduce the low-energy tails of the shower shape distributions, which GANs usually over- or underestimate by a large margin; see Ref. [12].

4. Reduced Precision Research

Modern hardware dedicated to deep learning (DL), developed by various vendors to accelerate DL workloads, implements different kinds of reduced-precision strategies. To evaluate the effect of reduced precision (int8 in particular) on the inference of our GAN model, we quantize the neural network parameters from float32 down to the int8 format. We want to verify whether it is possible to further speed up the inference and reduce the memory consumption while maintaining the physics accuracy. To quantize the model, we use the Intel Low Precision Optimization Tool (LPOT) [13]. In an iterative process based on a predefined accuracy metric, LPOT optimizes how many and which weights are quantized.
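As a rough illustration of what converting float32 parameters to int8 involves, the sketch below applies symmetric per-tensor quantization to a single weight tensor with NumPy. This is a generic textbook scheme, not LPOT's actual algorithm or API; the function names, the kernel shape and the random weights are assumptions.

```python
import numpy as np


def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 -> (int8 values, float scale)."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
# Hypothetical Conv2D kernel (height, width, in_channels, out_channels).
w = rng.normal(0.0, 0.1, size=(3, 3, 16, 16)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(w.nbytes / q.nbytes)                # storage ratio of float32 to int8
print(float(np.max(np.abs(w - w_hat))))   # worst-case rounding error, about scale / 2
```

The rounding error per weight is bounded by roughly half the scale, which is why a tool that selects which tensors to quantize against an accuracy metric can keep the sensitive ones in float32.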
We compare the results with models quantized by the TensorFlow Lite (TFLite) library [14].

We run inference on an Intel 2S Xeon 8280 CPU ("Cascade Lake" architecture) with various numbers of data streams and cores. The best result is achieved with a configuration of 8 streams and 56 cores. We gain a speed-up of 1.8x from the initial float32 Conv2D model to the int8 Conv2D model (float32: 2 372 showers/second, int8: 4 158 showers/second). On this CPU, the speed-up of the quantized int8 model with respect to Geant4 is 68 000x (this value differs from the one in the previous section because inference is run on a CPU here and was run on a GPU there). There are multiple reasons why we do not achieve the theoretically expected 4x speed-up. The first is that the operations for quantizing the input and de-quantizing the output already take 20% of the computation time. Additionally, the batch normalization layers alone require around 30% of the computation time. In a future LPOT version, the batch normalization layer will be combined with the convolutional layer and the activation function, which is expected to considerably decrease the simulation time. Due to the quantization, the model memory size is reduced by a factor of 2.26x, from 8.08 MB down to 3.57 MB.

Figure 4. Shower shape plots for measuring physics accuracy. (Left) Comparison of the new Conv2D architecture with a previous Conv3D architecture. (Right) Quantization of the Conv2D model into lower precision.

Concerning the physics accuracy evaluation, we consider the physics metric introduced in the previous section. The MSE of the initial float32 model is 0.061, that of the LPOT int8 model is 0.053, that of the TFLite float16 model is 0.253 and that of the TFLite int8 model is 0.340.
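The projection-based MSE metric referred to above can be sketched as follows. The source does not specify how the projections are normalized or how the three per-axis errors are combined, so this version makes two assumptions: a "projection along an axis" is taken to mean summing the sample-averaged shower over that axis, and the three resulting MSE values are simply averaged. The function names and the synthetic data are hypothetical, and only 100 samples are used here instead of the 20 000 mentioned in the text.

```python
import numpy as np


def mean_projections(showers):
    """Average over samples, then sum out one axis at a time: three 2D maps."""
    mean_shower = showers.mean(axis=0)  # shape (25, 25, 25)
    return [mean_shower.sum(axis=a) for a in range(3)]


def projection_mse(model_showers, reference_showers):
    """Mean of the per-axis MSE between model and reference projections."""
    model_proj = mean_projections(model_showers)
    ref_proj = mean_projections(reference_showers)
    per_axis = [np.mean((m - r) ** 2) for m, r in zip(model_proj, ref_proj)]
    return float(np.mean(per_axis))


rng = np.random.default_rng(0)
# Synthetic stand-ins (assumption): not real Geant4 or GAN showers.
g4 = rng.exponential(1.0, size=(100, 25, 25, 25))
gan = g4 + rng.normal(0.0, 0.1, size=g4.shape)

print(projection_mse(g4, g4))    # identical samples give 0.0
print(projection_mse(gan, g4))   # perturbed samples give a small positive value
```

With real data, the same function would be evaluated once per model (float32, LPOT int8, TFLite float16, TFLite int8) against the same Geant4 sample so that the resulting values are directly comparable.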
The quantized LPOT model thus reaches an even lower MSE, and therefore a higher accuracy, than the float32 model. This can be explained by the fact that the same MSE metric was used as the optimization target in LPOT. The TFLite models perform worse, possibly because TFLite quantizes the network parameters without any such optimization. Figure 4 (right) shows the shower distributions for the different quantized models. The LPOT model follows Geant4 very closely, whereas the TFLite models are clearly off for lower-energy cells.

5. Conclusion

We introduced a novel Conv2D neural network architecture that successfully solves a 3D image generation task using GANs for the simulation of high-granularity calorimeters in HEP experiments. Our GAN model achieves a 158 000x speed-up compared to the Geant4 simulation, which we aim to replace. The physics accuracy evaluation showed that the Conv2D GAN model reaches an accuracy comparable to that of the Geant4 simulation. In addition, we investigated the effect of quantization from float32 down to the int8 format using the Intel Low Precision Optimization Tool. We obtained a further 1.8x speed-up as well as a 2.26x reduction in model memory size while retaining a good level of physics accuracy.

6. Acknowledgements

This work has been sponsored by the Wolfgang Gentner Programme of the German Federal Ministry of Education and Research.

References

[1] S. Agostinelli et al., "GEANT4--a simulation toolkit," Nucl. Instrum. Meth. A, 2003.
[2] Worldwide LHC Computing Grid [Online]. Available: https://wlcg-public.web.cern.ch/. [Accessed 2021].
[3] F. Rehm, S. Vallecorsa, K. Borras and D. Krücker, "Physics Validation of Novel Convolutional 2D Architectures for Speeding Up High Energy Physics Simulations," 2021.
[4] M. Erdmann, J.
Glombitza and T. Quast, "Precise Simulation of Electromagnetic Calorimeter Showers Using a Wasserstein Generative Adversarial Network," in Comput Softw Big Sci 3, 2019.
[5] R. Di Sipio, M. Faucci Giannelli et al., "DijetGAN: a Generative-Adversarial Network approach for the simulation of QCD dijet events at the LHC," in Journal of High Energy Physics, 2019.
[6] F. Rehm, S. Vallecorsa, K. Borras and D. Krücker, "Validation of Deep Convolutional Generative Adversarial Networks for High Energy Physics Calorimeter Simulations," in AAAI 2021 - Association for the Advancement of Artificial Intelligence, 2021.
[7] I. Goodfellow, "Generative Adversarial Networks," 2014.
[8] G. E. Nasr, "Cross Entropy Error Function in Neural Networks: Forecasting Gasoline Demand," FLAIRS Conference, 2002.
[9] P. Swamidass, "Encyclopedia of Production and Manufacturing Management," Springer US, pp. 462-462, 2000.
[10] G. Khattak et al., "Three Dimensional Energy Parametrized Generative Adversarial Networks for Electromagnetic Shower Simulation," 2018.
[11] S. Vallecorsa and F. Carminati, "Distributed Training of Generative Adversarial Networks for Fast Detector Simulation," in High Performance Computing, Springer International Publishing, 2018, pp. 487-503.
[12] G. Khattak, S. Vallecorsa, F. Carminati and M. Khan, "Particle Detector Simulation using Generative Adversarial Networks with Domain Related Constraints," in 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), 2019.
[13] F. Tian and W. Chuanqi, "Intel® Low Precision Optimization Tool LPOT," 2020. [Online]. Available: https://github.com/intel/lp-opt-tool.
[14] "TensorFlow Lite," TensorFlow For Mobile & IoT, [Online]. Available: https://www.tensorflow.org/lite.
[15] F. Rehm, S. Vallecorsa, V. Saletore, H. Pabst, A. Chaibi, V. Codreanu, K. Borras and D.
Krücker, "Reduced Precision Strategies for Deep Learning: A High Energy Physics Generative Adversarial Network Use Case," in Proceedings of the 10th International Conference on Pattern Recognition Applications and Methods, 2021.