=Paper=
{{Paper
|id=Vol-2638/paper10
|storemode=property
|title=Predicting runtime of computational jobs in distributed computing environment
|pdfUrl=https://ceur-ws.org/Vol-2638/paper10.pdf
|volume=Vol-2638
|authors=Alexander Feoktistov,Olga Basharina
|dblpUrl=https://dblp.org/rec/conf/iccs-de/FeoktistovB20
}}
==Predicting runtime of computational jobs in distributed computing environment==
A G Feoktistov¹ and O Yu Basharina²

¹ Matrosov Institute for System Dynamics and Control Theory of SB RAS, Lermontov St. 134, Irkutsk, Russia, 664033
² Irkutsk State University, Karl Marx St. 1, Irkutsk, Russia, 664003

agf@icc.ru

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. The paper addresses the relevant problem of predicting the runtime of jobs that execute problem-solving schemes of large-scale applications in a heterogeneous distributed computing environment. Such an environment includes nodes that have various hardware architectures, different system software, and diverse computational capabilities. We believe that increasing the accuracy of predicting the job runtime can significantly improve the efficiency of problem-solving and the rational use of resources in the heterogeneous environment. To this end, we propose new models that make it possible to take into account various estimates of the module runtime for all modules included in a problem-solving scheme. These models were developed using the special computational model of distributed applied software packages (large-scale scientific applications). In addition, we compare the prediction results (job runtimes and their errors) obtained with different estimates. Among them are estimates obtained through module testing, user estimates, and estimates based on the computational history. The results were obtained during the continuous integration, delivery, and deployment of applied and system software of a package for solving warehouse logistics problems. They show that the highest accuracy is achieved through module testing.

1. Introduction

Today, scientific applications focus on carrying out large-scale scientific experiments in a heterogeneous distributed computing environment. They play a significant role in solving important practical problems based on mathematical modeling of complex systems under study [1]. Often, such applications are implemented as distributed applied software packages. The environment heterogeneity means that its nodes (PCs, compute servers, HPC-clusters, and cloud resources) have various hardware architectures, different system software, and diverse computational capabilities. Various local resource managers (LRMs) operate in the nodes of the environment.

In distributed applied software packages, problem-solving is described by schemes that specify the computing process in terms of the subject domain. We apply methods of computation and information planning to construct such problem-solving schemes on the special computational model [2]. This model is a special case of a semantic network.

To execute a problem-solving scheme in the heterogeneous distributed computing environment, a computational job is generated. The job enters the environment. The meta-scheduler selects an environment node suitable for this job. It then submits the job to the LRM located on that node. The job falls into the LRM's queue. When the resources of the node are freed, the job is launched.

An improvement in the efficiency of problem-solving and the rational use of resources depends largely on the ability to estimate the job runtime. In this regard, we propose new models for predicting the job runtime in the environment.
Unlike well-known similar models [3-5], the proposed models make it possible to take into account various estimates of the module runtime for all modules included in a problem-solving scheme. Among them are estimates obtained on the basis of the methodology proposed by the authors. This methodology allows us to test the program runtime. It is also used in the process of continuous integration of applied software [6].

The rest of the paper is structured as follows. In Section 2, we briefly review related work on the problem under study. Section 3 provides the models for predicting the job runtime. An example of applying the proposed models is considered in Section 4. Section 5 concludes the paper.

2. Related work

When starting jobs in a heterogeneous distributed computing environment, it is necessary to solve the following two problems:
• Forming a rational configuration of heterogeneous resources of the environment,
• Planning a suboptimal schedule of job execution on the formed configuration of resources.

Obviously, a qualitative solution to these problems for a large spectrum of practical scientific applications requires an estimate of the execution time of applied programs [7]. For example, such an estimate is used to cluster jobs when allocating resources to them [8].

Generally, job runtime estimates are provided by the user or by some runtime prediction procedure [9]. Most traditional job management systems and many workflow management systems rely on estimates of program execution time specified by the user. This is a simple and very flexible approach. However, the errors of such estimates are usually quite large in practice.

There are various methods for predicting program execution time [10]. Among them are static and dynamic methods of program analysis. The use of methods and tools for static analysis of program code, without real program execution in heterogeneous environments, is characterized by high overheads for additional programming. Such overheads are due to the need to simulate the operation of the processor of the target computing node for a large spectrum of programs written in various programming languages.

In practice, the method of frequency characteristics has proven itself well [11]. It is based on the use of special tools for dynamic analysis of programs [12, 13]. Algorithms based on such analysis differ in the sets of studied software and hardware characteristics [14]. Each algorithm determines specific relations between these characteristics. In this regard, the method of profiling the program on some reference node is most applicable in practice [15]. Its results are then extrapolated relative to the characteristics of the target node. Within this method, the accuracy of estimating the program execution time largely depends on the selection of scaling coefficients for the computation speedup. These coefficients are determined by the ratios of characteristics of the reference and target nodes (a sketch of this extrapolation is given at the end of this section).

However, our practical experience in applying continuous integration of applied software, presented in [6], allows us to draw the following conclusion: if the target nodes are available, then testing the programs on them can simplify the program runtime prediction in comparison with dynamic analysis.
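To make the extrapolation step concrete, the following Python sketch (illustrative only; the single aggregate performance score and the function name are assumptions, not taken from the cited works) scales a runtime profiled on a reference node by the ratio of node characteristics.

```python
# Illustrative sketch of reference-node extrapolation: a runtime measured on the
# reference node is rescaled by the ratio of node performance characteristics.
# Using one aggregate performance score per node is an assumption for brevity.

def extrapolate_runtime(t_reference: float, perf_reference: float, perf_target: float) -> float:
    """Scale a runtime profiled on the reference node to the target node."""
    speedup = perf_target / perf_reference   # > 1 means the target node is faster
    return t_reference / speedup

# Example: 120 s on the reference node, target node rated 1.6x faster -> 75 s.
print(extrapolate_runtime(120.0, 1.0, 1.6))
```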
3. Models for predicting computational jobs runtime

The computational model of the environment is determined by the structure C = <A, Z, F, M, S, J, R, Q, O, CH>, where A, Z, F, M, S, J, R, and Q are, respectively, the sets of applied software packages, parameters, operations, program modules, problem-solving schemes, jobs, resources, and constraints on the job execution and resource use. O is the set of relations between the above-listed objects. The data structure CH represents the computational history that reflects the functioning of modules from M. If necessary, a detailed description of the components of the model C can be found in [2].

Operations from F determine computational actions on the set Z of parameters. Parameters can be represented by scalars, vectors, and matrices of various basic data types, or by arbitrary data structures. Each operation f_i ∈ F is implemented by a module m_j ∈ M, where i = 1..n_f, j = 1..n_m, n_f is the number of operations, and n_m is the number of modules. One module can implement several operations. Each operation f_i has two subsets Z_i^in, Z_i^out ⊆ Z of parameters. The subset Z_i^in consists of input parameters. Their values must be known in order to calculate the values of parameters from Z_i^out. The parameters of the subsets Z_i^in and Z_i^out reflect the purpose and semantics of the formal parameters of the module m_j that implements the operation f_i. Parameters are transferred between modules in the form of data files.

Schemes from S represent problem-solving processes in packages. A scheme s ∈ S that performs operations from F is an analogue of the tiered-parallel form of the algorithm graph. Within the computational model under consideration, a scheme is a connected subgraph of a directed acyclic bipartite graph. Such a graph represents schematic knowledge about algorithms for studying a subject domain. An example of such a graph is represented in Figure 1, where operations and parameters are depicted by filled and unfilled circles, respectively. The problem-solving scheme represented by the graph includes 3 tiers. Tier 1 contains the operation f_3. The second tier consists of the operations f_4, f_5, and f_6. Finally, two operations, f_7 and f_8, are placed on tier 3.

Figure 1. Bipartite directed graph of the problem-solving scheme (tiers 1-3, operations f_3-f_8, parameters z_1-z_9).

Two special operations, f_1 and f_2, are emphasized in the computational model. This allows us to maintain commonality in computation planning when constructing problem-solving schemes from their formulations [16]. The operations f_1 and f_2 model the conditions of the problem. They respectively define a subset of parameters whose values are known and a subset of parameters whose values need to be calculated. Thus, the operation f_1 defines the subsets Z_1^in = ∅ and Z_1^out ≠ ∅. The operation f_2 determines the subsets Z_2^in ≠ ∅ and Z_2^out = ∅. In the above example, Z_1^in = ∅ and Z_1^out = {z_1} (Figure 2). At the same time, Z_2^in = {z_7, z_8, z_9} and Z_2^out = ∅.

The execution of a problem-solving scheme in the environment is specified by a computational job. A job includes a list of modules of a problem-solving scheme. In addition, it contains requirements to the environment that determine the computing resources needed to execute the listed modules. These requirements include the number of processors or cores, the sizes of RAM and disk memory, the execution time, etc.

Figure 2. The operations f_1 and f_2.
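For illustration, the following Python sketch (an assumed representation, not code from the package) builds the example scheme of Figure 1 as a precedence mapping and derives its tiered-parallel form by topological layering.

```python
# Assumed representation of the scheme in Figure 1: each operation is mapped to the
# set of operations that precede it (i.e., whose output parameters it consumes).
scheme = {
    "f3": set(),
    "f4": {"f3"}, "f5": {"f3"}, "f6": {"f3"},
    "f7": {"f4", "f5"}, "f8": {"f5", "f6"},
}

def tiers(predecessors: dict[str, set[str]]) -> dict[str, int]:
    """Assign each operation a tier: one tier after the deepest of its predecessors."""
    tier: dict[str, int] = {}
    while len(tier) < len(predecessors):        # repeat until all operations are placed
        for op, preds in predecessors.items():
            if op not in tier and preds <= tier.keys():
                tier[op] = 1 + max((tier[p] for p in preds), default=0)
    return tier

print(tiers(scheme))  # {'f3': 1, 'f4': 2, 'f5': 2, 'f6': 2, 'f7': 3, 'f8': 3}
```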
We propose new models for predicting the job runtime in the heterogeneous environment. In these models, we consider cases of job execution in the data-readiness (asynchronous) and fork/join modes, resource virtualization, and job restarts.

Let the matrix P of dimension n x n represent information on the precedence of the n operations of the problem-solving scheme. The element p_ji = 1 (p_ji = 0) of the matrix P means that the operation f_j in the problem-solving scheme precedes (does not precede) the operation f_i and Z_i^in ∩ Z_j^out ≠ ∅ (Z_i^in ∩ Z_j^out = ∅). The matrix D of dimension n x n reflects estimates of the data volumes transmitted between operations. The matrix element d_ji ≥ 0 shows the amount of data transmitted by the operation f_j to the operation f_i. The matrix W of dimension n x n provides information about the interconnect bandwidth between nodes where the modules that implement the scheme operations are launched. The matrix element w_ji ≥ 0 is the interconnect bandwidth between the nodes in which the modules implementing the operations f_j and f_i operate.

The estimate E_a of the job runtime in the asynchronous mode based on data availability is defined as follows:

$E_a = \max_{i=\overline{1,n}} t_i^a$,  $t_i^a = t_i^q + t_i + \max_{j=\overline{1,n}:\, p_{ji}=1,\, j \ne i} (t_j^a + u)$,  $u = \dfrac{d_{ji}}{c_{ji}(t)\, w_{ji}}$.  (1)

The variables in formulas (1) are interpreted as follows:
• t_i^a (t_j^a) is the estimate of the period elapsed from the beginning of the job execution to the completion of the operation f_i (f_j),
• t_i^q is the estimate of the wait time in a queue for the module that implements the operation f_i (at the time when all the data necessary to execute it is ready),
• t_i is the predictive estimate of the module runtime,
• n is the number of scheme operations,
• 0 < c_ji(t) ≤ 1 is the coefficient of decrease in the interconnect bandwidth between the nodes, in which the modules implementing the operations f_j and f_i operate, at the time t.

Let the matrix L of dimension q x n be the tiered-parallel form describing the scheme execution in the fork/join mode, where q is the number of scheme tiers. The element l_li = 1 means that the operation f_i must be performed on the l-th tier. The transition to the operations of the (l+1)-th tier is possible provided that all operations of the l-th tier are completed. The estimate E'_a of the job runtime in the fork/join mode is defined as follows:

$E'_a = \sum_{l=1}^{q} t_l^s$,  $t_l^s = \max_{i=\overline{1,n}:\, l_{li}=1} (t_i^q + t_i + v)$,  $v = \sum_{j=\overline{1,n}:\, p_{ji}=1,\, j \ne i} \dfrac{d_{ji}}{c_{ji}(t)\, w_{ji}}$.  (2)

We use formulas (1) and (2) to define the other estimates. Among them are the estimates of the job runtime with virtualized resources, with module restarts, with user estimates of the module runtime, and with estimates adjusted on the basis of the computational history.
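A minimal Python sketch of estimates (1) and (2) is given below. It assumes plain nested lists for the matrices P, D, and W, lists t_q and t for the queue-wait and module-runtime estimates, and fixes the coefficient c_ji(t) to 1 for brevity; the function names are illustrative and do not come from the package.

```python
# Sketch of estimates (1) and (2). P, D, W are n x n nested lists (precedence,
# data volumes, bandwidths); t_q and t are per-operation queue-wait and runtime
# estimates; tier_of[i] is the tier number of operation i. c_ji(t) is taken as 1.

def transfer_time(D, W, j, i, c=1.0):
    """Time to transfer the data produced by operation j for operation i."""
    return D[j][i] / (c * W[j][i]) if D[j][i] > 0 else 0.0

def job_runtime_async(P, D, W, t_q, t):
    """Estimate E_a of formula (1): completion time of the latest operation."""
    n = len(t)
    memo = {}
    def t_a(i):                                   # completion time of operation i
        if i not in memo:
            waits = [t_a(j) + transfer_time(D, W, j, i)
                     for j in range(n) if P[j][i] == 1 and j != i]
            memo[i] = t_q[i] + t[i] + (max(waits) if waits else 0.0)
        return memo[i]
    return max(t_a(i) for i in range(n))

def job_runtime_fork_join(P, D, W, t_q, t, tier_of):
    """Estimate E'_a of formula (2): sum of per-tier maxima."""
    n = len(t)
    total = 0.0
    for l in sorted(set(tier_of)):
        total += max(t_q[i] + t[i]
                     + sum(transfer_time(D, W, j, i)
                           for j in range(n) if P[j][i] == 1 and j != i)
                     for i in range(n) if tier_of[i] == l)
    return total
```

The virtualization, restart, user-estimate, and history-based variants introduced below differ from (1) and (2) only in the additional time terms placed next to t_i inside the maxima, so the same code structure applies to them.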
In the environment with virtualized resources, the estimate E_va of the job runtime in the asynchronous mode is defined as follows:

$E_{va} = \max_{i=\overline{1,n}} t_i^a$,  $t_i^a = t_i^q + t_i^{vml} + t_i + t_i^{vmr} + \max_{j=\overline{1,n}:\, p_{ji}=1,\, j \ne i} (t_j^a + u)$.

The variables t_i^{vml} and t_i^{vmr} are, respectively, the estimates of the time it takes to launch and remove the virtual machine with the module that implements the operation f_i. These estimates are determined experimentally for virtual machines of various configurations.

For the same environment, the estimate E'_va of the job runtime in the fork/join mode is defined as follows:

$E'_{va} = \sum_{l=1}^{q} t_l^s$,  $t_l^s = \max_{i=\overline{1,n}:\, l_{li}=1} (t_i^q + t_i^{vml} + t_i + t_i^{vmr} + v)$.

In the asynchronous mode with restarting modules, the estimate E_ra of the job runtime is defined as follows:

$E_{ra} = \max_{i=\overline{1,n}} t_i^a$,  $t_i^a = t_i^q + t_i^d + t_i + t_i^{run} + t_i^{res} + \max_{j=\overline{1,n}:\, p_{ji}=1,\, j \ne i} (t_j^a + u)$.

The variables in the formula above are interpreted as follows:
• t_i^d is the estimate of the time it takes to detect and identify a failure,
• t_i^{res} is the estimate of the time it takes to restart the module that implements the operation f_i,
• t_i^{run} is the module runtime up to the failure.

These variables are required in the case of a hardware or software failure. The estimates t_i^d and t_i^{res} are determined by the average execution time of such processes by a meta-monitoring system for different types of software and hardware failures.

At the same time, in the fork/join mode with restarting modules, the estimate E'_ra of the job runtime is defined as follows:

$E'_{ra} = \sum_{l=1}^{q} t_l^s$,  $t_l^s = \max_{i=\overline{1,n}:\, l_{li}=1} (t_i^q + t_i^d + t_i + t_i^{run} + t_i^{res} + v)$.

Let us now consider the case of applying the user's estimates. In the asynchronous mode, the estimate E_ua of the job runtime is defined as follows:

$E_{ua} = \max_{i=\overline{1,n}} t_i^a$,  $t_i^a = t_i^q + t_i' + \max_{j=\overline{1,n}:\, p_{ji}=1,\, j \ne i} (t_j^a + u)$,

where t_i' is the user's estimate of the module runtime. The estimate E'_ua of the job runtime in the fork/join mode is defined as follows:

$E'_{ua} = \sum_{l=1}^{q} t_l^s$,  $t_l^s = \max_{i=\overline{1,n}:\, l_{li}=1} (t_i^q + t_i' + v)$.

In the environment with virtualized resources, the estimate E_uva of the job runtime in the asynchronous mode is defined as follows:

$E_{uva} = \max_{i=\overline{1,n}} t_i^a$,  $t_i^a = t_i^q + t_i^{vml} + t_i' + t_i^{vmr} + \max_{j=\overline{1,n}:\, p_{ji}=1,\, j \ne i} (t_j^a + u)$.

For the same environment, the estimate E'_uva of the job runtime in the fork/join mode is defined as follows:

$E'_{uva} = \sum_{l=1}^{q} t_l^s$,  $t_l^s = \max_{i=\overline{1,n}:\, l_{li}=1} (t_i^q + t_i^{vml} + t_i' + t_i^{vmr} + v)$.

In the asynchronous mode with restarting modules, the estimate E_ura of the job runtime is defined as follows:

$E_{ura} = \max_{i=\overline{1,n}} t_i^a$,  $t_i^a = t_i^q + t_i^d + t_i' + t_i^{res} + \max_{j=\overline{1,n}:\, p_{ji}=1,\, j \ne i} (t_j^a + u)$.

Meanwhile, the estimate E'_ura of the job runtime in the fork/join mode is defined as follows:

$E'_{ura} = \sum_{l=1}^{q} t_l^s$,  $t_l^s = \max_{i=\overline{1,n}:\, l_{li}=1} (t_i^q + t_i^d + t_i' + t_i^{res} + v)$.

Finally, let us consider the use of estimates based on the computational history. In the asynchronous mode, the estimate E_ha of the job runtime is defined as follows:

$E_{ha} = \max_{i=\overline{1,n}} t_i^a$,  $t_i^a = t_i^q + k_i t_i'' + \max_{j=\overline{1,n}:\, p_{ji}=1,\, j \ne i} (t_j^a + u)$.

Here t_i'' is an estimate of the module runtime, and k_i = h(M, CH, Δt) > 0 is its correction coefficient calculated on the basis of the computational history, so that the product k_i t_i'' is the estimate adjusted to the history. The function h(M, CH, Δt) computes k_i using the average or median values of the sample of execution times of the module that implements the operation f_i over the period Δt.
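The following Python sketch shows one possible reading of the function h(M, CH, Δt); the record layout of the computational history and the choice of rescaling a base estimate t_i'' toward the observed runtimes are assumptions, since the text above only states that h uses average or median values over the period Δt.

```python
# Assumed layout of the computational history CH: a list of records
# (module_name, finished_at, runtime_seconds). The coefficient k_i rescales a base
# estimate t_i'' toward the runtimes observed for the module over the period dt.
from statistics import median
from datetime import timedelta

def correction_coefficient(history, module, base_estimate, now,
                           dt=timedelta(days=30), use_median=True):
    sample = [runtime for (name, finished_at, runtime) in history
              if name == module and now - finished_at <= dt]
    if not sample:
        return 1.0                      # no observations: keep the base estimate
    observed = median(sample) if use_median else sum(sample) / len(sample)
    return observed / base_estimate     # k_i, so that k_i * t_i'' tracks the history
```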
The estimate E'_ha of the job runtime in the fork/join mode is defined as follows:

$E'_{ha} = \sum_{l=1}^{q} t_l^s$,  $t_l^s = \max_{i=\overline{1,n}:\, l_{li}=1} (t_i^q + k_i t_i'' + v)$.

In the environment with virtualized resources, the estimate E_hva of the job runtime in the asynchronous mode is defined as follows:

$E_{hva} = \max_{i=\overline{1,n}} t_i^a$,  $t_i^a = t_i^q + t_i^{vml} + k_i t_i'' + t_i^{vmr} + \max_{j=\overline{1,n}:\, p_{ji}=1,\, j \ne i} (t_j^a + u)$.

For the same environment, the estimate E'_hva of the job runtime in the fork/join mode is defined as follows:

$E'_{hva} = \sum_{l=1}^{q} t_l^s$,  $t_l^s = \max_{i=\overline{1,n}:\, l_{li}=1} (t_i^q + t_i^{vml} + k_i t_i'' + t_i^{vmr} + v)$.

In the asynchronous mode with restarting modules, the estimate E_hra of the job runtime is defined as follows:

$E_{hra} = \max_{i=\overline{1,n}} t_i^a$,  $t_i^a = t_i^q + t_i^d + k_i t_i'' + t_i^{run} + t_i^{res} + \max_{j=\overline{1,n}:\, p_{ji}=1,\, j \ne i} (t_j^a + u)$.

At the same time, the estimate E'_hra of the job runtime in the fork/join mode is defined as follows:

$E'_{hra} = \sum_{l=1}^{q} t_l^s$,  $t_l^s = \max_{i=\overline{1,n}:\, l_{li}=1} (t_i^q + t_i^d + k_i t_i'' + t_i^{run} + t_i^{res} + v)$.

The estimates of the job runtime for virtual machines can be adapted to the case of containerized applications.

4. Example

As an example, we consider the problem of improving the processes of loading and unloading goods in a warehouse through their simulation. To solve this problem, a distributed applied software package has been developed using the Orlando Tools framework [17]. This package is a parameter sweep application [18]. The simulation models (modules) are created using a special toolkit [19]. The heterogeneous distributed computing environment in which this package is applied was created on the basis of the resources of the public access Irkutsk Supercomputer Center SB RAS [20].

We compare the prediction results (job runtimes and their errors) obtained with different estimates. The experiments were carried out in the continuous integration process of the package modules. Such integration includes the modification, version control, build, testing, delivery, and deployment of applied and system software on different nodes of the heterogeneous environment. The experiments were carried out on two different clusters: a PC-cluster and an HPC-cluster. The characteristics of both clusters are provided in [17].

Figure 3 and Figure 4 show the actual and predicted job runtimes on the PC-cluster and HPC-cluster, respectively. The job runtimes are shown for different data sizes (31.73, 61.57, 121.98, and 246.73 MB) determined by the number of problem parameter variants. The runtime was predicted using the three methods discussed above: module testing, the computational history, and user estimates. The most accurate prediction in both cases was obtained on the basis of module testing.

Figure 3. Predicted and actual results for the PC-cluster (job runtime, s, versus data size, MB).

Figure 4. Predicted and actual results for the HPC-cluster (job runtime, s, versus data size, MB).

Figure 5 and Figure 6 demonstrate the runtime prediction errors obtained with the various methods for estimating the module runtime. The errors are shown as percentages relative to the actual job runtime on the PC-cluster and HPC-cluster, respectively.

Figure 5. Runtime prediction error for the PC-cluster (error, %, versus data size, MB).

Figure 6. Runtime prediction error for the HPC-cluster (error, %, versus data size, MB).

These results show that, in both cases, the error of the job runtime predicted on the basis of module testing decreases as the data size increases. In the example, it does not exceed 10%. At the same time, the change of the runtime prediction errors obtained from user estimates or the computational history can be non-monotonic. In practice, the job runtime predicted on the basis of user estimates is usually overstated. The use of the computational history can slightly reduce such errors.
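For reference, the error metric of Figures 5 and 6 can be computed as below; the numbers in the example are placeholders, not the measured values of the experiment.

```python
# Prediction error as a percentage of the actual job runtime (cf. Figures 5 and 6).
def prediction_error_percent(predicted: float, actual: float) -> float:
    return abs(predicted - actual) / actual * 100.0

print(prediction_error_percent(predicted=1100.0, actual=1000.0))  # -> 10.0
```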
5. Conclusions

The rational allocation of resources when solving large problems in a heterogeneous distributed computing environment depends on the effectiveness of the job maintenance schedules planned by LRMs. Usually, these schedules are based on estimates of program execution time. In the paper, we propose new models for predicting the job runtime using a variety of estimates. Such a job specifies the execution of a problem-solving scheme of a distributed applied software package (large-scale scientific application) in the environment. Within these models, we take into account job execution in the data-readiness and fork/join modes, resource virtualization, and job restarts.

The practical significance of the study lies in improving the quality of computation planning and resource allocation in heterogeneous environments. Such an improvement is due to the reduction of the error in estimating the job runtime through the application of the proposed models.

Acknowledgments

The study is supported by the Russian Foundation for Basic Research, project no. 19-07-00097. The development of the heterogeneous distributed computing environment was supported in part by the Basic Research Program of SB RAS, project no. IV.38.1.1.

References

[1] Deelman E, Peterka T, Altintas I, Carothers C D, van Dam K K, Moreland K, Parashar M, Ramakrishnan L, Taufer M and Vetter J 2018 The future of scientific workflows Int. J. High Perform. C. 32(1) 159–175
[2] Bychkov I, Oparin G, Tchernykh A, Feoktistov A, Bogdanova V and Gorsky S 2017 Conceptual Model of Problem-Oriented Heterogeneous Distributed Computing Environment with Multi-Agent Management Procedia Comput. Sci. 103 162–167
[3] Cao F, Zhu M M and Ding D 2013 Distributed workflow scheduling under throughput and budget constraints in grid environments Lect. Notes Comput. Sci. 8429 62–80
[4] Qin J and Fahringer T 2012 Semantic-based scientific workflow composition Scientific Workflows (Berlin and Heidelberg: Springer) 115–134
[5] Deelman E, Vahi K, Juve G, Rynge M, Callaghan S, Maechling P J, Mayani R, Chen W, da Silva R F, Livny M and Wenger K 2015 Pegasus, a workflow management system for science automation Future Gener. Comp. Sy. 46 17–35
[6] Feoktistov A, Gorsky S, Sidorov I and Tchernykh A 2019 Continuous Integration in Distributed Applied Software Packages Proc. of the 42nd Int. Convention on Information and Communication Technology, Electronics and Microelectronics (Rijeka: IEEE) pp 1775–1780
[7] Voevodin V V 2007 The solution of large problems in distributed computational media Automat. Rem. Contr.+ 68(5) 773–786
[8] Pegasus WMS – Automate, recover, and debug scientific computations. Available at: https://pegasus.isi.edu/ (accessed: 02.03.2020)
[9] da Silva R F, Juve G, Deelman E, Glatard T, Desprez F, Thain D, Tovar B and Livny M 2013 Toward fine-grained online task characteristics estimation in scientific workflows Proc. of the 8th Workshop on Workflows in Support of Large-Scale Science pp 58–67
[10] Wilhelm R et al. 2008 The worst-case execution-time problem – overview of methods and survey of tools ACM T. Embed. Comput. S. 7(3) 1–52
[11] Wang W, Guan X, Zhang X and Yang L 2006 Profiling program behavior for anomaly intrusion detection based on the transition and frequency property of computer audit data Comput. Secur. 25(7) 539–550
[12] Intel® VTune™ Profiler (formerly Intel® VTune™ Amplifier). Available at: https://software.intel.com/en-us/vtune/ (accessed: 02.03.2020)
[13] OProfile – A System Profiler for Linux. Available at: https://oprofile.sourceforge.io/news/ (accessed: 02.03.2020)
[14] Adhianto L, Banerjee S, Fagan M, Krentel M, Marin G, Mellor-Crummey J and Tallent N R 2010 HPCToolkit: Tools for performance analysis of optimized parallel programs Concurr. Comp.-Pract. E. 22(6) 685–701
[15] Ivannikov V P, Gaisaryan S S, Avetisyan A I and Padaryan V A 2006 Estimation of dynamical characteristics of a parallel program on a model Program. Comput. Soft.+ 32(4) 203–214
[16] Novopashin A P and Oparin G A 2004 Boolean Modeling of Action Planning in Distributed Computing Systems J. Comput. Sys. Sc. Int.+ 43(5) 763–766
[17] Tchernykh A, Feoktistov A, Gorsky S, Sidorov I, Kostromin R, Bychkov I, Basharina O, Alexandrov A and Rivera-Rodriguez R 2019 Orlando Tools: Development, Training, and Use of Scalable Applications in Heterogeneous Distributed Computing Environments Comm. Com. Inf. Sc. 979 265–279
[18] Bychkov I, Oparin G, Tchernykh A, Feoktistov A, Bogdanova V, Dyadkin Yu, Andrukhova V and Basharina O 2017 Simulation Modeling in Heterogeneous Distributed Computing Environments to Support Decisions Making in Warehouse Logistics Procedia Eng. 201 524–533
[19] Feoktistov A G, Kostromin R O, Fereferov E S, Tchernykh A, Basharina O Yu, Dmitriev V I and Kurzybova Ya V 2019 Toolkit for simulation modeling of queue systems in Grid Proc. of the 1st International Workshop on Information, Computation, and Control Systems for Distributed Environments (ICCS-DE) (CEUR-WS Proceedings) 2430 pp 51–59
[20] Public access center Irkutsk Supercomputer Center. Available at: http://hpc.icc.ru (accessed: 02.03.2020)