Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10 - 14, 2018 SENSITIVITY ANALYSIS IN A PROBLEM OF REAXFF MOLECULAR-DYNAMIC FORCE FIELD OPTIMIZATION K. S. Shefov a, M. M. Stepanova b St. Petersburg State University, 7-9 University emb, St. Petersburg, 199034, Russia E-mail: a k.s.shefov@gmail.com, b mstep@mms.nw.ru In a wide range of modern problems, it is required to estimate an influence of uncertainty of input parameters on uncertainty of an output value of a modeling function. In this contribution, we present algorithms for analyzing the sensitivity of a target function with respect to parameters in the problem of optimization of ReaxFF molecular-dynamic force field. In this particular case it allows one to effectively decrease the number of simultaneously optimized parameters. We compare the Sobol's global sensitivity indexes (SI) approach and the correlation analysis. Both methods are based on computations of the target function value on the set of pseudo- or quasi-randomly distributed points. The distribution derived is used for further computations of SI using Monte-Carlo technique and correlation coefficients. In the case of optimized ReaxFF force field one may spend up to several seconds to compute a value of the target function in a particular point. That is why it is important to perform calculations in parallel for multiple points. A parallel algorithm has been implemented in C++ using MPI. We compute Sobol's SI and coefficients of correlation of parameters variation and target function values variation while we optimize the force field for molecules and crystals of zinc hydroxide. We show that using of parameter set sorted by influence allows one to significantly increase convergence speed of the optimization algorithm and even completely exclude those parameters with relatively small influence. Keywords: sensitivity analysis, reactive force field, molecular dynamics, parameter optimization, parallel algorithm, scalability ยฉ 2018 Konstantin S. Shefov, Margarita M. Stepanova 595 Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10 - 14, 2018 1. Introduction In a wide range of modern problems, it is required to estimate the influence of the uncertainty of the input parameters on the uncertainty in the output value of the modeling function. In this paper there are presented algorithms of sensitivity analysis of an objective function to parameters in the problem of optimization of the molecular-dynamic force field ReaxFF. In this particular case, this effectively reduces the number of parameters simultaneously participating in the optimization. Two approaches are compared: one based on Sobolโ€™s global sensitivity indexes (SI) by and the other one uses correlation analysis. A parallel program is presented, its scalability is studied, and the calculation results for a particular task are given. 2. The Problem A molecular-dynamic force field is characterized by a number of parameters: ๐‘ˆ = ๐‘“(๐‘1 , ๐‘2 , โ€ฆ , ๐‘๐‘ ). These parameters are set before solving the Newtonโ€™s equations of motion and do not alter during the simulation. Their number (๐‘) may vary from 3 โ€“ 4 to several dozens. A procedure of a force field parameters search for a particular simulated system is referred to as a force field optimization. As a measure of optimality we use the parameter-dependent objective function (OF) ๐‘‡(๐‘1 , ๐‘2 , โ€ฆ , ๐‘๐‘ ). The OF is determined by the deviation of any characteristics of the simulated system obtained using the methods of MD, from those obtained by more accurate methods. The process of optimization is to search the force field parameters that bring the OF the minimal value. As the MD force field ๐‘ˆ(๐‘Ÿโƒ—) we use ReaxFF (Reactive Force Field) that is able to simulate chemical reactions [1]. To optimize the OF we use the multifactorial global search algorithm (MGSA) suggested by Strongin [2]. The MGSA is able to simultaneously optimize relatively small number of parameters (the optimal variant is 4). However, optimizing ReaxFF for a particular system may need the number of parameters to be of several dozens. Therefore the parameters being searched are divided into groups, e. g. of 4 pieces, which participate in the optimization. The question arises of how to sort the parameters by groups, and also how to exclude parameters that almost do not affect the change in the OF. This task could be solved with a help of sensitivity analysis. In this paper we consider the correlation analysis and the method of Sobolโ€™s sensitivity indices. 3. Correlation analysis The objective function for searching MD force field parameters is a sum of terms depending on the parameters ๐‘1 , ๐‘2 , โ€ฆ , ๐‘๐‘ . The procedure of correlation analysis requires the following steps. 1. Generate a sample of ๐‘… (pseudo-)random sets of parameters ๐‘1,๐‘– , ๐‘2,๐‘– , โ€ฆ , ๐‘๐‘,๐‘– , 1 โ‰ค ๐‘– โ‰ค ๐‘…, that are uniformly distributed over the search domain of the OF. 2. For each set of parameters compute the OF value and record the values of particular terms. One will obtain a ๐‘€ ร— ๐‘… matrix of OF termsโ€™ values, where ๐‘€ is the number of terms of the OF. In total, one has two matrices: the ๐‘ ร— ๐‘… matrix of parameters ๐‘ƒ and the ๐‘€ ร— ๐‘… matrix of OF terms ๐น. 3. Normalize matrices ๐‘ƒ and ๐น: ๐‘๐‘–๐‘— โˆ’ ๐‘ฬ…๐‘– ฬ…๐‘– ๐‘“๐‘–๐‘— โˆ’ ๐‘“ ๐‘๐‘–๐‘— = , ๐‘“๐‘–๐‘— = . โˆš๐‘…๐œŽ๐‘,๐‘– โˆš๐‘…๐œŽ๐‘“,๐‘– ฬ…๐‘– are averages by columns, ๐œŽ๐‘,๐‘– , ๐œŽ๐‘“,๐‘– are dispersions by columns, ๐‘… is the sample size. Here ๐‘ฬ…๐‘– , ๐‘“ 4. Calculate the ๐‘€ ร— ๐‘ matrix ๐ถ = ๐‘ƒ๐‘‡ ยท ๐น of cross-correlations of parameters change and OF values change. The elements of the matrix, ๐‘๐‘–๐‘— , are cross-correlation coefficients. 5. Sort the rows of the matrix ๐ถ by the greatest absolute value of element in a row in descending order. Each row corresponds to a single parameter. The correlation analysis will give good result only if the OF terms depend on the parameters monotonically, which is not always the case. 596 Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10 - 14, 2018 4. Sobol sensitivity analysis A more efficient technique for estimating the influence of variables on a function is the calculation of global sensitivity indices (SI) suggested by I. Sobol and A. Saltelli [3]. The method of sensitivity indices, in contrast to the correlation analysis, does not require the presence of a monotonic dependence of the function on the variables. Let ๐‘“(๐‘ฅ) be a function of several variables ๐‘ฅ = (๐‘ฅ1 , ๐‘ฅ2 , โ€ฆ , ๐‘ฅ๐‘›) defined and square integrable in the unit cube ๐พ๐‘›. Then its ANOVA (Analysis of Variances) decomposition is ๐‘› ๐‘“(๐‘ฅ) = ๐‘“0 + โˆ‘ โˆ‘ ๐‘“๐‘–1 ,โ€ฆ,๐‘–๐‘  (๐‘ฅ๐‘–1 , โ€ฆ , ๐‘ฅ๐‘–๐‘  ) = ๐‘ =1 ๐‘–1 <โ‹ฏ<๐‘–๐‘  = ๐‘“0 + โˆ‘ ๐‘“๐‘– (๐‘ฅ๐‘– ) + โˆ‘ ๐‘“๐‘–๐‘— (๐‘ฅ๐‘– , ๐‘ฅ๐‘— ) + โ‹ฏ + ๐‘“1,2,โ€ฆ,๐‘›(๐‘ฅ1 , โ€ฆ , ๐‘ฅ๐‘›), (1) ๐‘– ๐‘–<๐‘— 1 if ๐‘“0 = โˆซ๐พ ๐‘“(๐‘ฅ)๐‘‘๐‘ฅ, and โˆซ0 ๐‘“๐‘–1 ,โ€ฆ,๐‘–๐‘  ๐‘‘๐‘ฅ๐‘–๐‘ = 0, when 1 โ‰ค ๐‘ โ‰ค ๐‘ . The inner sum in (1) is done by ๐‘› all the ๐‘–1 , โ€ฆ , ๐‘–๐‘  , that satisfy inequalities 1 โ‰ค ๐‘–1 < โ‹ฏ < ๐‘–๐‘› โ‰ค ๐‘›. The decomposition (1) could be made with any complete orthonormal system of functions ๐œ“0 (๐‘ฅ), ๐œ“1 (๐‘ฅ), โ€ฆ , ๐œ“๐‘˜ (๐‘ฅ), โ€ฆ , that includes the function ๐œ“0 (๐‘ฅ) โ‰ก 1. The quantities ๐ท๐‘–1 ,โ€ฆ,๐‘–๐‘  = โˆซ ๐‘“๐‘–21 ,โ€ฆ,๐‘–๐‘  (๐‘ฅ๐‘–1 , โ€ฆ , ๐‘ฅ๐‘–๐‘  )๐‘‘๐‘ฅ๐‘–1 , โ€ฆ , ๐‘‘๐‘ฅ๐‘–๐‘  are called the dispersions. Here and below the sign โˆซ means integration from 0 to 1 by the corresponding variables. The quantity ๐ท = โˆซ ๐‘“ 2 (๐‘ฅ)๐‘‘๐‘ฅ โˆ’ ๐‘“02 is called the total dispersion. It is also true that. ๐ท = โˆ‘๐‘›๐‘ =1 โˆ‘๐‘–1 <โ‹ฏ<๐‘–๐‘  ๐‘“๐‘–1 ,โ€ฆ,๐‘–๐‘  . The global sensitivity indices are the dispersion ratios ๐‘†๐‘–1 <โ‹ฏ<๐‘–๐‘  = ๐ท๐‘–1 <โ‹ฏ<๐‘–๐‘  โ„๐ท. In application single-dimension sensitivity indices ๐‘†๐‘– are most often used. With their help one is able to sort variables ๐‘ฅ๐‘– : the bigger is ๐‘†๐‘– , the more influential is the variable ๐‘ฅ๐‘– . Let us consider an arbitrary group of variables ๐‘ฅ๐‘˜1 , โ€ฆ , ๐‘ฅ๐‘˜๐‘š , where 1 โ‰ค ๐‘˜1 < โ‹ฏ < ๐‘˜๐‘š โ‰ค ๐‘›, 1 โ‰ค ๐‘š โ‰ค ๐‘› โˆ’ 1. We shall denote them by a single letter ๐‘ฆ = (๐‘ฅ๐‘˜1 , โ€ฆ , ๐‘ฅ๐‘˜๐‘š ); and let ๐‘ง be the aggregate of all the rest ๐‘› โ€“ ๐‘š variables. Thus, ๐‘ฅ = (๐‘ฆ, ๐‘ง). With ๐‘€ let us denote the aggregate of indices (๐‘˜1 , โ€ฆ , ๐‘˜๐‘š ). For the set ๐‘ฆ let us introduce two types of global SI: ๐‘†๐‘ฆ = โˆ‘ ๐‘†๐‘–1 ,โ€ฆ,๐‘–๐‘  , ๐‘†๐‘ฆtot = โˆ‘ ๐‘†๐‘–1 ,โ€ฆ,๐‘–๐‘  . In ๐‘†๐‘ฆ the summation is produced for all the groups of ๐‘–1 , โ€ฆ , ๐‘–๐‘  , where all the ๐‘–๐‘ โˆˆ ๐‘€. In ๐‘†๐‘ฆtot the summation is produced for all the groups of ๐‘–1 , โ€ฆ , ๐‘–๐‘  so that at least one index ๐‘–๐‘ โˆˆ ๐‘€. One is able to calculate the SI ๐‘†๐‘ฆ and ๐‘†๐‘ฆtot by integrating using (pseudo/quasi) Monte Carlo technique. Let ๐‘ฅ and ๐‘ฅโ€ฒ be the points of ๐พ๐‘› and let ๐‘ฅ = (๐‘ฆ, ๐‘ง), and ๐‘ฅโ€ฒ = (๐‘ฆ โ€ฒ , ๐‘งโ€ฒ). In the paper [3] it is shown that ๐ท๐‘ฆ = โˆซ ๐‘“(๐‘ฅ)๐‘“(๐‘ฆ, ๐‘ง โ€ฒ )๐‘‘๐‘ฅ๐‘‘๐‘ง โ€ฒ โˆ’ ๐‘“02 , 1 ๐ท๐‘ฆtot = โˆซ[๐‘“(๐‘ฅ) โˆ’ ๐‘“(๐‘ฆ โ€ฒ , ๐‘ง)]2๐‘‘๐‘ฅ๐‘‘๐‘ฆ โ€ฒ = ๐ท + ๐‘“02 โˆ’ โˆซ ๐‘“(๐‘ฅ โ€ฒ )๐‘“(๐‘ฆ, ๐‘ง โ€ฒ )๐‘‘๐‘ฅ โ€ฒ๐‘‘๐‘ฆ. 2 In order to compute all the one-dimensional (๐‘ฆ = (๐‘ฅ๐‘– )) SI ๐‘†๐‘ฆ and ๐‘†๐‘ฆtot , in each sample of the Monte Carlo method one needs to use two n-dimensional random points ๐‘ฅ and ๐‘ฅโ€ฒ and calculate the value of the function ๐‘› + 2 times: ๐‘“(๐‘ฅ), ๐‘“(๐‘ฅโ€ฒ), and ๐œ”๐‘– = ๐‘“(๐‘ฅ1โ€ฒ, โ€ฆ , ๐‘ฅ๐‘›โˆ’1โ€ฒ โ€ฒ , ๐‘ฅ๐‘›, ๐‘ฅ๐‘›+1 , โ€ฆ , ๐‘ฅ๐‘›โ€ฒ), when 1 โ‰ค ๐‘– โ‰ค ๐‘›. 5. LPฯ„ Sequences One may select the points of parameters in a pseudo-random way, but it will not provide a uniform cover of the whole unit cube ๐พ๐‘›. The alternative is to use LPฯ„ sequences [4]. These strictly ordered sequences are based on multiple binary division of the domain by all the dimensions. They allow one to construct a mesh on a hypercube so that the mesh nodes fill it as uniformly as possible. 597 Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10 - 14, 2018 Let us compare a mesh of 16 points of this sequence with the simple cubic mesh for the square (Figure 1). The points of the simple mesh allow one to calculate function values only in 4 different values of each of the two variables. The mesh built by LPฯ„ allows one to obtain function values in 16 different values of each variable for the same total number of points. Thus, LPฯ„ provides more efficient distribution of points. 1 1 0 1 0 1 Figure 1. Simple mesh (left) and LPฯ„ mesh (right) 6. Implementation and Scalability We implemented parallel programs for sensitivity analysis in C++ using MPI technology. The use of sensitivity analysis in the problem of ReaxFF force field optimization has an important feature that the time of calculation of a single OF value significantly exceeds the time of all other operations, i. e. data exchange between processes of interaction with the file system. OF values are computed with the help of LAMMPS molecular dynamic simulation library [5]: it calculates values of energies and forces for several dozens of molecules. In Figure 2 the general scheme of the implemented programs is presented. In both cases (correlation analysis and sensitivity indices approach) the scheme is the same. At first one generates an array of N-dimensional points (ReaxFF parameters). The array is divided into equal parts between all the parallel processes and is sent between them to compute the OF values. In the stage of OF values calculation the data exchange is not needed. Each process computes the values independently. The calculated OF values are gathered by the main process which computes correlation coefficients and Sobolโ€™s SI. Due to no data exchange in the main loop of the Figure 2. General scheme of programs programs good scalability is expected. This was verified on a cluster of the following configuration. 14 nodes: 2 ร— 4 core CPU Intelยฎ Xeonยฎ E5335 2.00 GHz, 16 GB RAM; 112 cores in total; OS: CentOS Linux 7 (Core); MPICH2 v1.4.1p1. Hardware is provided by Resource Center Computer Center of St. Petersburg State University. We measured the dependence of acceleration of programsโ€™ time of operation on the number of CPU cores in use for 1, 7, 14, 28, 56, and 112 pieces. The computations we done for 112 points of the search domain; when using all the 112 cores, we had one point per process. The average time of computation of a single OF value by a single process is 12.4 seconds. 598 Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10 - 14, 2018 Figure 3 shows the measured dependence 84 for both programs (CC and SI), since the values 77 70 of acceleration match with high accuracy. Linear 63 approximation is done with the least squares 56 technique. The efficiency of parallelism (the ratio Acceleration 49 of acceleration increase and cores number 42 increase) is (71.2ยฑ1.3) %. 35 28 The time of operation of the serial part of 21 the programs, i. e. the time of SI or CC 14 computation is less than 1 second for 112 points 7 and is about 30 seconds for 130000 points, which 0 0 7 14 21 28 35 42 49 56 63 70 77 84 91 98 105 112 119 is significantly smaller than the time of the P parallel part of the programs. Figure 3. Accel. dependence on the cores number 7. Application The programs presented have been applied to sensitivity analysis of the objective function to the parameters of ReaxFF force field in its optimization for Zn โ€“ O โ€“ H compounds. The sensitivity indices (SI) and correlation coefficients (CC) calculation is performed for 58 parameters on 130000 points. The diagram on Figure 4 depicts relative values of SI and CC for 24 parameters of ReaxFF. For the unity we take the greatest SI in the first case, and the greatest CC in the second case. The parameters on the diagram are sorted by SI in descending order. On the chart for CC white hatching highlights those parameters that were not included in the first 24 in the correlation analysis. The number of such parameters is not large relative to the length of the whole list. One has to note that 5 parameters with the greatest CC are also located in the beginning of the list by SI. In general, the order of the parameters for the SI and CC is noticeably different. AlphavdW_Zn-O RvdW_Zn-O Rsig_Zn-O EvdW_Zn-O RvdW_Zn ChiEEM_O AlphavdW_Zn EtaEEM_O EtaEEM_Zn Pbo1_Zn-O GammaEEM_O Rsig_Zn ChiEEM_Zn GammaEEM_Zn Pbo2_Zn-O RvdW_O EvdW_Zn Pbe2_Zn-O Pbe1_Zn-O AlphavdW_O Rsig_O Pval2_O-Zn-O Pval1_Zn-O-Zn Pval4_O-Zn-O 1E-3 0.01 0.1 1 0.0 0.5 1.0 Relative Sensitivity Indices Relative Correlation Coefficients Figure 4. SI and CC for ReaxFF force field parameters for ZnO Using MGSA method we perform ReaxFF optimization by 24 parameters: first with the greatest CC, then with the greatest SI. The optimization is done in groups of 4 parameters. The sequence of the groups is defined by sorting the parameters by CC and SI, respectively. The loop of groups is repeated until the difference of OF values of the two consecutive iterations becomes less than a predefined tolerance. If the order of the groups in the optimization is defined by CC, the algorithm requires 2.5 loop passes to converge. When the order is defined by SIs, the algorithm needs 599 Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10 - 14, 2018 1.5 passes. Thus, sorting the parameters using the Sobolโ€™s SI gives for this problem a better result than using the correlation analysis. 8. Conclusion Sensitivity analysis is useful in the problem of optimizing of a large number of parameters of ReaxFF force field in groups. The approach allows one to exclude those parameters that have little influence on the objective function and also effectively sort them by influence. The developed parallel program for Sobol sensitivity indices and correlation coefficients calculation is effectively scalable on a computational cluster. To sort the parameters in ReaxFF optimization for zinc โ€” oxygen โ€” hydrogen compounds it is better to use Sobol sensitivity analysis than the correlation analysis, since the procedure converges faster in the first case. References [1] K. Nomura, R. K. Kalia, A. Nakano, P. Vashishta, J. L. Landa. A scalable parallel algorithm for large-scale reactive force-field molecular dynamics simulation // Comp. Phys. Comm. โ€” 2008. โ€” Vol. 178(2). โ€” P. 73 โ€“ 87. [2] Stepanova M.M., Shefov K.S., Slavyanov S.Yu. Multifactorial global search algorithm in the problem of optimizing a reactive force field // Theoretical and Mathematical Physics. โ€” 2016. โ€” Vol. 187, Issue 1. โ€” P. 603 โ€“ 617. [3] Sobolโ€™ I. M, Globalโ€™nye pokazateli chuvstvitelnosti dlya izucheniya nelineynykh matematicheskikh modeley [Global sensitivity indices for exploration of non-linear mathematical models] // Matematicheskoye modelirovanie [Mathematical modeling] โ€” 2005. โ€” Vol. 17 (9). โ€” P. 43 โ€“ 52. (in Russian) [4] Sobolโ€™ I. M., Statnikov R. B. Vybor optimalโ€™nykh parametrov v zadachakh so mnogimi kriteriyami [Choise of optimal parameters in problems with multiple criteria]. โ€” M.: ยซDrofaยป. 2006. โ€” 175 p. (in Russian) [5] LAMMPS package: https://lammps.sandia.gov/ (accessed 30.09.2018) 600