=Paper= {{Paper |id=Vol-1536/paper16 |storemode=property |title= Parallel Discrete Event Simulation as a Paradigm for Large Scale Modeling Experiments |pdfUrl=https://ceur-ws.org/Vol-1536/paper16.pdf |volume=Vol-1536 |dblpUrl=https://dblp.org/rec/conf/rcdl/ShchurS15 }} == Parallel Discrete Event Simulation as a Paradigm for Large Scale Modeling Experiments == https://ceur-ws.org/Vol-1536/paper16.pdf
        Parallel Discrete Event Simulation as a Paradigm for
                 Large Scale Modeling Experiments
                       © Lev Shchur                           © Liudmila Shchur
                                   Science Center in Chernogolovka,
                                            142432 Russia
                          shchur@chg.ru ,                        lvs@chg.ru


                        Abstract                                     assigns a logical process (LP) to each subsystem.
                                                                     Therefore, each LP is just a sequential event-driven
   Parallel Discrete Event Simulation (PDES) is a                    simulator. Sending an event to a LP is equivalent to
method introduced to conduct simulation of a complex                 scheduling that event in the receiving list of LP pending
system, which consists of a large number of objects. The             events.
PDES has been known for about 30 years and we                           The PDES is a space-parallel approach with an
discuss why further analysis of the method is important              execution of a single discrete event simulation program
nowadays. We briefly introduce the main features of the              on a parallel computer or on a cluster of computers [1].
method and discuss the current state of PDES. We                     It is widely used in economics and logistics, and
present a survey of the method as well as promising                  sometimes applied in physics and computer science. A
applications.                                                        system that should be simulated is divided into disjoint
                                                                     subsystems, which are mapped onto the programming
1 Introduction                                                       objects [2]. Subsystems are not isolated during the
                                                                     simulations, and dependencies between objects should
   Methods of parallel algorithms developed in the
                                                                     be resolved properly. There are three main essentials of
middle of the last century are weakly applicable on the
                                                                     PDES. First, it is assumed that changes of possible, or
modern supercomputers. During the last two decades
                                                                     potential, dependencies occur at some particular
the high-performance computing evolved from the big
                                                                     moments of time. It is supposed that these moments of
mainframes with a complicated central processing unit
                                                                     time are spread on a scale that is large enough compared
(CPU) into the parallel processing of large numbers of
                                                                     to an elementary unit of simulation time. Therefore,
CPUs and GPGPUs (general-purpose graphics
                                                                     changes are considered to be discrete (although random)
processing units). Running hundreds of thousands or
                                                                     in time, and are called discrete events. The objects
even millions of CPUs in a single task is one of the
                                                                     evolve independently in time, as soon as there are no
challenging problems of high-performance computing
                                                                     dependencies generated. We need to have some protocol
(HPC). We will focus our discussion on how the parallel
                                                                     in order to process dependencies correctly, in other
discrete event simulation (PDES) method may be used
                                                                     words we would like to keep causality. Using general
in large-scale simulations.
                                                                     language we can say that we need some protocol for
   There are two ways to parallelize a simulation model:
                                                                     synchronization of discrete events.
time-parallel and space-parallel [1]. The time-parallel
                                                                        The main idea of PDES is to use an asynchronous
approach partitions the simulated time axis into intervals
                                                                     mechanism and to implement the concept of Virtual
[T1,T2], [T2,T3], …, [Ti,Ti+1], … and assigns each                   Time [2]. A local virtual time variable Wi is associated
interval to a separate process. A process i computes a               with each object i. As soon as a new time event is
sample path for the interval [Ti,Ti+1]. The final state of           generated by an object i, that event is stamped with the
the system in the sample path [Ti-1,Ti] must be the                  local virtual time (LVT) Wi. A message containing
initial state in the sample path for [Ti,Ti+1]. It is clear          information on the event is generated and is sent to
that the time-parallel simulations are restricted to special         other objects. A message sending mechanism is the
models that fulfil that requirement.                                 second essential feature of PDES. The third essential
   The space-parallel approach partitions the system                 feature of PDES - there is no information exchange
being modelled into a collection of subsystems, and                  between subsystems through common variables and
                                                                     there is no access of objects to a shared memory.
                                                                     Instead, all the information is distributed via messages.
Proceedings of the XVII International Conference
                                                                     The ensemble of LVT Wi generates a profile of local
«Data Analytics and Management in Data Intensive
                                                                     virtual times. A minimal value of the profile defines
Domains» (DAMDID/RCDL’2015), Obninsk, Russia,
                                                                     time horizon, and an evaluation of this value defines
October 13 - 16, 2015
                                                                     Global Virtual Time (GVT) of simulations. An evolution




                                                               107
of LVT profile and of GVT in time depends on a                      case of local algorithms it is possible to prove existence
particular scheme of managing causality, and the                    of a lower bound of that speed [10].
analysis of time profile properties (which is related to               We use an extension [11] of Virtual Time concept on
the problem of synchronization!) is one of subjects of              the current multi-core and multi-thread architecture of
the present paper.                                                  hardware and software. In our discussion, the LVT is
   Originally, the PDES was designed to deal with                   associated with logical processes (LP) and not with
complex and heavy problems. Suppose, one has to                     processing elements, as it was the case in “before-core
simulate a complex system with the condition that                   era” of hardware [9]. A processing element (PE) is
simulation is resource consuming (huge processor time,              associated with hardware, and a LP is more flexible
big RAM, and big data storage, large number of                      because it is the virtual. Nowadays this extension is
sensors) and could not be fit in a single computer. An              even more important with GPGPU in which threads
appropriate simulation method should be able to handle              may run LP in parallel and concurrently.
any problem. The main requirement was not in the                       Let us define an abstract model of simulation in
effective use of CPU time, or storage, or other hardware            PDES. The whole physical system is decomposed into
loading level. The main requirement was just to have                N subsystems. Each physical subsystem is associated
some possibility to simulate. It is not a usual                     with a logical process. Each logical process develops in
requirement of scalability, which is traditionally                  time, changing its internal state at some moments of
associated with the effective use of hardware. Contrary,            local simulation time. Time of simulation is measured in
scalability in the PDES is just a possibility to evolve a           the Global Virtual Time [2], which is a minimal time of
system regardless of how large and complex this system              the LVT profile. We call a change in the internal state of
is. During the last years, the PDES method attracts                 simulated subsystem i as an event that happens at logical
interest of a computational physics community, which is             process i, it is denoted LP(i). We suppose that intervals
looking for methods to simulate huge systems that can               of time between events are distributed according to a
be simulated on computers with millions of nodes                    Poisson law. A logical process generates a message with
[4,5,6]. A discussion of that issue is another subject of           information on the changes in the internal state, marked
the paper.                                                          up (stamped) by the value of LVT. Sequence of LVT
   The third subject of the paper is a discussion of the            times Wi is not decreasing. Each logical process LP(i)
possibility to use the PDES method on a lower level, on             receives a message and may include the received
a level of system software that will link together billions         information in simulation, or ignore information if it is
of nodes (computational units, CPUs, cores, sensors,                not necessary for simulation. In the case when LPs
etc.) within one task. This was mentioned in the White              perform simulations in parallel, the following problem
Paper “Performance Technologies for Peta-Scale                      may occur. It may happen that LVT Wj of LP(j) is higher
Systems” prepared by a group of researchers from US                 than Wi in the received message from LP(i), and LP(j)
National Laboratories and associated Universities [7].              does depend on the information contained in the
   The paper is organized as follows. In Section 2 we               message generated by LP(i). Therefore, causality is
describe briefly the PDES method. In Section 3 we                   violated! Fujimoto classified PDES algorithms into two
review the research on the evolution of LVT time                    groups [1,9]. They are called optimistic algorithms and
profiles. In Section 4 we discuss an application of PDES            conservative algorithms reflecting the way in which
for in physics in informatics. In Section 5 we discuss a            synchronization is performed. FaS algorithm is
possible application of the PDES method in system                   introduced and described in the paper [10], it may be
software. In Section 6 we summarize our discussion and              viewed as representatives of the third group of PDES.
point out possible future research.
                                                                    2.1 Conservative PDES algorithm
2 Time evolutions in Parallel Discrete Event
                                                                       In the case of a conservative algorithm, a subsystem
Simulation method
                                                                    waits for all events to happen in every subsystem from
   We discuss an evolution of the local time profile and            each it depends on, before proceed in time. Clearly it is
of the GVT in PDES algorithms [8,9] using simplified                possible only for the simulations in which we know in
models of that evolution. In a sense, it is a modelling of          advance all dependencies between subsystems in the
the possible PDES algorithms. As we already                         physical system (named as arcs on graphs of LPs).
emphasized, the evolution of the LVT and the GVT is                 Therefore, in a conservative algorithm LVT Wi of LP(i)
associated with synchronization problems. The                       should not be greater than LVT Wk of all LPs from which
importance of the model-of-model approach becomes                   LP(i) depends on [1].
clear if we take into account that using that approach we              In realizations, one has to define matrix of
may prove, for example, that deadlocks are never                    dependencies M(i,j) with elements equal to 0 or 1. If an
happens in conservative algorithms. In terms of LVT                 element M(i,j) equals to 1, than LP(j) does may depend
profile time evolution, it means that the speed of the              on events generated by LP(i). This matrix may be
time profile is always positive because there exists at             viewed as a matrix of sender-receiver for message
least one object with the LVT lower than LVTs of other              sending. For each LP(i) the corresponding column of a
objects, therefore that object may proceed in time. In the          matrix is defined, and each time event happens in




                                                              108
simulations of LP(i), it sends messages to all LP(j) for           with the logical processes (LP) rather than with the
which element of matrix M(i,j) is equal to one.                    processing elements (PE). Second, it is assumed that it
Accordingly, LP(j) will never go forward in simulation             is possible to handle several LPs within one PE, and that
with LVT W j larger than minimal W k in its queue of               communication within PE may be efficient using
messages stamped with the LVTs.                                    conservative algorithm. In that case we can use
   In subsection 3.1 we will show that a conservative              correspondence of time profile evolution in the
algorithm never leads to a deadlock and will                       conservative PDES [10] with the evolution of surface
demonstrate how simulation speed of GVT behaves for                growth on substrate [12], described by Kardar-Parisi-
some simple cases                                                  Zhang partial differential equation, named KPZ
                                                                   equation. There is one-to-one correspondence between
2.2 Optimistic PDES algorithm                                      classification of boundary conditions in KPZ equation
    In an optimistic algorithm, all LPs are evaluated in           and classification of PDES algorithms [11]. In KPZ
time with assumption that causality is fulfilled. Clearly,         equation, we may divide space into domains, and
LVT of subsystems increased independently, and it may              classify possible boundary conditions (BC) between
happen that LP(j) will receive message with time stamp             domains. There are three possibilities of BC:
W i smaller than W l in the last processed message.                continuous, free, and fixed. Mapping of KPZ onto
Causality is broken! LP(j) proceeds using the wrong                PDES is as follows. Domain corresponds to PE, and BC
information, while ignoring an event which happened                between domains corresponds to the communication
                                                                   protocol. Therefore, with many LPs running within one
earlier in simulation time, W i < W l. An optimistic
                                                                   PE and updated using conservative algorithm, there are
algorithm introduces a protocol of rollback in time that
                                                                   three possible algorithms for communication protocol
resolves the problem of causality. This is done again
                                                                   between PEs (associated with corresponding BC):
using a message sending procedure. The most well
                                                                   conservative, optimistic, and FaS. FaS is abbreviation of
known optimistic protocol is Time Warp paradigm [2].
                                                                   the Freeze-and-Shift algorithm.
The event causing rollback is called a straggler. A
                                                                      Let us consider the simplest parallel simulation task,
recovery is organized by undoing effects of all events
                                                                   in which LPs are placed along the horizontal lines (see
that have been processed prematurely by the LP
                                                                   Figure 1), and send time-stamped messages only to the
receiving the straggler.
                                                                   left and right LP. Let us assume we distribute l logical
    In addition to the Input Message Queue and Output
                                                                   processes on each of N processor elements.
Message Queue, each LP keeps State Queue, which is a
list of the recently processed events. It gives a
possibility to restore an old state vector on rollback. LP
sends anti-message which annihilates the original one
when it reaches its destination LP. So, rollback
generates an anti-message (negative), which annihilates
while meeting the corresponding primary (positive)
message. This happens like annihilation in particle
physics, with an electron and a positron. If LP(i)
receives an anti-message that corresponds to a positive            Fig. 1. Possible realization of the first two “steps” of the
message, which it has already processed, then the                    time profile evolution in the conservative algorithm.
process must be rolled back itself to undo the effect of            Number of LPs is equal to the number of PEs. Each LP
processing that positive message. In the case if both an           sends message only to the neighboring LP at the left and
anti-message and a positive message are in LP(i) queue                                      right side.
they just annihilate within queue and do not cause any
rollback by LP(i). Recursive repetition of this procedure          Figure 1 shows time profile of simulation with 12 LPs
allows cancelling all erroneous calculations. This                 on the 12 PEs. We start with all LVT W i = 0. The system
recursive procedure may lead to the avalanche in the               evaluates to the second raw of black LPs. At that
rollback, but the length of the avalanche is limited by            “moment” GVT is defined by the process with the
GVT. No event with time-stamp smaller than GVT will                minimal LVT time, it is process number 3. At the
ever be rolled back, so LPs with GVT may discard that              “next” step of simulation only LPs with the minimal
event from the State Queue.                                        local virtual time (number 1, 3, 6, 8, 10, and 12) are
    In subsection 3.2 we will describe a simple model of           allowed to proceed to the possible time marked by pink,
evolution of LVT profile, and demonstrate an analogy               according with the conservative algorithm. The third
with a process known in physics as an unrestricted                 “step” consists in possible evaluation of those processes,
surface growth.                                                    which are sitting in local minima of LVT profile; it is
                                                                   processes with number 4, 7, 9, and 11.
2.3 Freeze-and-Shift (FaS) algorithm
To make the classification of the PDES algorithms
possible, two steps are performed. First, concept of
Virtual Time [2] was generalized, as LVT is associated




                                                             109
                                                                     by group of Korniss, mainly on the conservative
                                                                     algorithm. Next, we discuss results of our research with
                                                                     Mark Novotny, mainly for the optimistic algorithm.

                                                                     3.1 Conservative PDES and Kardar-Parisi-Zhang
                                                                     equation
                                                                        Let us consider the simplest case of application of
                                                                     conservative algorithm to the chain of LPs
Fig. 2. Possible realization of the first two “steps” of the
                                                                     communicating with neighbors only as already
time profile evolution in the FaS algorithm. Number of
                                                                     considered in Subsection 2.3 and depicted schematically
LP=4 in each of PE=3. Each LP sends message only to
                                                                     in the Figure 1. LVT Wi is associated with each LP(i). We
the neighboring LPs at the left and right side: it is
conservative algorithm between LP in each PE, and                    start with all Wi = 0 (LVT time profile is flat at t = 0,
boundary LPs are frozen.                                             where t is artificial time, and the discrete one for
                                                                     simplicity1). Evolution of LVT profile may be written
Figure 2 demonstrate FaS algorithm applied to the chain              as iterative process: W i (t+1) = W i (t)+K i(t) if W i (t) d min
of LPs communicating only with the neighbors. 12 LP                  {Wi-1 (t),W i+1 (t)}, and W i (t+1) = W i (t), otherwise. Here K i(t)
distributed over 3 PE, with 4 LP in each PE. Boundary                are random variables. When LP(i) advanced in time it
LP (number 1,4,5,8,9, and 12) is not allowed to proceed              sends messages to the right LP(i+1) and left LP(i-1)
after step 1 - they are fixed. On the “next step” only LP            stamped with the LVT W i (t+1). In this process causality
in the local minima and not on the border of PE are                  is never violated.
allowed to proceeds simulation, it is processes 3, 6, and               This algorithm is free of deadlock. This can be seen
10, marked by pink. Compare Figure 2 and Figure 1 –                  qualitatively from the fact that statistically there are four
the difference only on the border of PE. We have to                  possible local configurations for LVT of LP(i), and only
note, that no LP could proceed after that step – there are           one of them with the local minima. Therefore, average
no local minima of LVT time profile. The next step of                speed of GVT (time horizon) should be 1/4. In fact,
FaS algorithm is to perform shift phase, i.e. perform                numerical simulation of Korniss et al model gives
exchange of LPs between PEs. For the case depicted in                average speed of the time horizon is 0.246 410(7) [10]
Figure 2, we have to “shift” LPs cyclically by two                   for the Poisson distribution of Ki(t), 0.267(4) [11] for the
positions. So, PE number 1 will keep LPs 3,4,5, and 6;               uniformly distributed Ki(t), and 0.258(5) [11] for the
PE number 2 – LPs 7,8,9, and 10; PE number 3 – LPs                   Gaussian noise. So, average utilization of CPU time in
11,12,1, and 2. So, from that boundary LPs (the fixed                this simplified local version of conservative algorithm is
ones) will be 3, 6, 7, 10, 11, and 2, this is the freeze             close to 25 per cent.
phase of FaS algorithm. After that, processes 4, 8, 12,                 Korniss et al argued that in the continuum limit (the
and 1 could proceed.                                                 increment of the artificial time t is infinitesimal)
   The forward process depends on the number l of                    iteration process for the LVT time profile obeys the
logical processes on each of N processor elements, and               equation introduced by Kardar-Parisi-Zhang (KPZ) for
time to stop can be estimated as l2/4 for the linear chain           the surface growth on the solid substrate [10,11]. This is
of LPs discussed for simplicity.                                     very important observation because it provides mapping
   FaS allows for an effective realization on the parallel           of conservative PDES algorithm with local message
computers, clusters of computers, and in grid                        exchange onto KPZ problem. In turn, KPZ problem is
computing. It effectively decouples the computation                  well investigated [12], and solution demonstrates
phase and communication phase for these parallel                     universal properties of the profile fluctuations
computations.        This allows, for example, the                   depending on the time and system size.
programmer to utilize fast block-memory-transfer                        First, from the fluctuations of surface profile it is
commands. Furthermore, it should allow the efficient                 possible to estimate the width growth of the profile.
simultaneous execution of both computation part and                  Mapping this result from KPZ onto the properties of
communication part when they are performed by                        LVT time profile in PDES shows us that width of LVT
independent hardware.                                                grows with time as t3/2. This means that LPs are
                                                                     unsynchronized with simulation time growth. This is a
3 Evolution of time profile in PDES                                  bad news, but from KPZ analysis it follows, that width
                                                                     growth is saturated at some moment of time. Saturation
   In this section we will discuss evolution of LVT time             level of LVT width is proportional to the square root
profile in PDES. We do not discuss case studies in                   from the number of LPs. This is a good news.
which particular realization of PDES are investigated
empirically. Instead, we will discuss properties of                  3.2 Optimistic PDES and unrestricted surface
models which mimics essential features of PDES and
which can be investigated using methods of applied                   1
mathematics and mathematical physics. First, we                        One should not confuse artificial time t we introduce for the sake of
                                                                     simplicity with the time at which events happens and time-stamped
discuss results of the research done in series of papers             messages are generated.




                                                               110
growth                                                              papers that may be of general interest.
                                                                       SPPARKS is the project running by Sandia National
   Model for the evolution of time profile, which mimics
                                                                    Laboratory and developing kinetic Monte Carlo
essential features of optimistic algorithm, was
                                                                    simulator [16]. SPPARKS runs on single processors or
introduced first in [13]. Let us consider again for the
                                                                    in parallel using message-passing techniques and a
sake of simplicity the linear chain of LPs. The first step
                                                                    spatial-decomposition of the simulation domain. The
is the optimistic unrestricted evolution of LVT at which
                                                                    code is designed to be easy to modify or extend with
simulation evolves forward in time and the second is the
                                                                    new functionality. Main idea is in partitioning of
rollback (or backward) algorithm of sending anti-
                                                                    simulation space into the computational domains, and
messages. For this purpose one may introduce two
                                                                    performing simulations in parallel within some time
parameters, F and B associated with two steps of
                                                                    window. In some sense, this idea is similar to FaS
optimistic scheme. Namely, at the first step one
                                                                    algorithm. It implements several KMC solvers whose
evaluates the time horizon by updating time of the
                                                                    serial computational complexity ranges from O(N) to
randomly chosen LPs with value F following the
                                                                    O(NlogN) to O(1) in the number of events N owned by a
Poisson distribution, and every LP can be chosen with
                                                                    processor. In a generic sense the solvers are catalog a
the non-zero probability. Then one have to relax
                                                                    list of "events", each with an associated probability,
(rollback!) LVT profile B times: for all LP(i) value of
                                                                    choose a single event to perform, and advance time by
LVT time changed to the value of LVT time of the
                                                                    the correct amount. Events may be chosen individually
nearest left LP(i-1) or right LP(i+1). Value of B satisfies
                                                                    at random, or by sweeping over sites in a more ordered
the Poisson distribution. The average value of the speed
                                                                    fashion. Problems they addressed are magnetic spin
profile u(t) evolves proportional to (q-qc)a, with
                                                                    models, surface growth on substrate, etc. One of the
q=1/(1+B), with a value of a=1.74 close to the critical
                                                                    simulations is connected with the microstructural
exponent of directed percolation and describing
                                                                    evolution during sintering [17]. The model is a grain
roughening transition in one-dimensional unrestricted
                                                                    growth with physical effects of particle diffusion,
growth process [14]. The value of qc=0.23(3)
                                                                    vacancy diffusion, and vacancy annihilation. This
corresponds to the value of the parameter B=3.3. It
                                                                    sintering model is able to capture all the necessary
should be mentioned that the width of the LVT profile
                                                                    mechanism to simulate simple solid-state sintering
acts better for values q far away from the critical value
                                                                    correctly. It was demonstrated by comparing it to the
qc. When close to qc the system practically does not
                                                                    three--dimensional, in-situ images taken in a high-
evolve in time. The growth of width provides an
                                                                    energy synchrotron during the sintering of Cu particles.
analogy with the roughening transition, which is in the
                                                                    Similar project is running by Lawrence Livermore
same universality class as a directed percolation [14].
                                                                    National Laboratory, also for the parallel kinetic Monte
Practically this means that if the length of avalanche
                                                                    Carlo [18,19]. It is based on ideas of virtual time and
(see subsection 2.2) is larger than B=3.3 in average,
                                                                    random order of simulating computational domains
optimistic algorithm will not proceed in time but will be
                                                                    within time window. It is tested on the model of billion
rather stock. It is quite important to perform case studies
                                                                    atoms. The approach was tested with the Ising model
of optimistic algorithm in order to understand that
                                                                    (the simplest ferromagnetic model) and demonstrated
prediction in details.
                                                                    that scalability with the number of nodes is close to the
                                                                    ideal one.
4 Using PDES in physics and informatics                                A number of projects connected with simulation of
   The most important feature of PDES for using in                  communication networks are using PDES. One of such
physics is that PDES algorithms are scaled very                     projects is running by INRIA, and it is connected with
naturally with the physical system size (number of                  the simulation of Border Gateway Protocol using
objects, or logical processes) and with the hardware size           optimistic PDES [20] with number of nodes from 10 to
(number of nodes, cores, threads) with the only                     100 thousands. In the last case simulation lasts up to 36
requirement that the time for message distribution is               hours on 4-core Intel Xeon W5580 3.2Ghz with 64GB
smaller than computation time. This is valid for the                of RAM running 64-bit Fedora 12 on a Linux kernel
simulation in which time to simulate object is much                 2.6.32.
larger than time to send message. It is clear that more
complex and hard simulations the more gain can be                   5 PDES and system software
reached.
                                                                      The goal of high end computing (HEC) initiative in
   Last years a number of papers with the analysis of
                                                                    US and Europe is to deploy large-scale computing
implementation of PDES ideas in the large-scale
                                                                    platforms with hundreds thousands of nodes. In the
computing of models in physics have been published.
                                                                    White Paper “Performance Technologies for Peta-Scale
Most of discussions were done for the analysis of
                                                                    Systems” the group of US researchers emphasized
application of PDES algorithm to the Ising model,
                                                                    importance of the research on the program tools and
which is the simplest model of ferromagnetism, and
                                                                    software facilities, and in particular in system
which is standard model to test algorithms in Monte
                                                                    simulation [7]. They see that as “an open-source
Carlo simulations [15]. Here we review some of the
                                                                    architectural simulation framework and API that enables




                                                              111
plug-and-play between separately-developed simulators             6 Summary and discussion
for different architectural features… and would also
enable zoom-out and zoom-in between statistically-                   Parallel Discrete Event Simulation paradigm is one of
based and cycle-accurate simulation techniques”.                  the paradigms for the large-scale high performance
Solution seen by them as “the simulations will be                 computing. It is under extensive development by many
decomposed into logical processes, and will be                    groups of researches. PDES was successfully tested on
synchronized by either conservative or optimistic                 the new architectures, from Blue Gene and conventional
methods … as developed in PDES community”. The                    clusters and CPUs to the Virtual Machine/Cloud
reason is that by such approach the high degree of                approach. Efficiency of PDES does depend on the
computational parallelism in the simulation will                  problem and it is promising to investigate properties of
perfectly match the high degree of real parallelism of            synchronization, scalability and efficiency.
HEC systems.
   There are several simulators available as open source          7 Acknowledgments
software.
   The most famous realization of Time Warp                         Russian Science Foundation supported this work
synchronization algorithm is Rensselaer's Optimistic              under the grant number 14-21-00158.
Simulation System, named as ROSS simulator [21]. It is
public domain software, which can be installed on the             References
cluster. Recently group of researchers from Lawrence
Livermore National Laboratory, Rensselaer Polytechnic              [1] R.M. Fujimoto. Parallel Discrete Event
Institute and University of Illinois at Urbana-                        Simulation. Comm. of the ACM, 33(10), p. 30-53,
Champaign tested ROSS on the Blue Gene architecture                    1990.
with successful use of almost 2 billions of cores [22]             [2] A.M. Law and W.D. Kelton. Simulation modeling
yielding speed of 504 billion events per second.                       and analysis. McGraw-Hill. Third edition. 2000.
Modification of optimistic algorithm named shock-                  [3] D.R. Jefferson. Virtual Time. ACM Trans.
resistant Time WARP (SRTW) designed by group in                        Programming Languages and Systems, 7(3), p.
Westminster University and analysis was performed                      404-425, 1985.
using the Grid’5000 platform [23].                                 [4] S. Miller and S. Luding. Event-Driven Molecular
Group from Bologna University developed PDES                           Dynamics in Parallel. J. Comp. Phys. 193(1), p.
software named ErlangTW based on Time Warp                             306-316, 2004.
algorithm. Work was presented at 1st ACM SIGPLAN
                                                                   [5] D.C. Richardson, K.J. Walsh, N. Murdoch, and P.
workshop on Functional high-performance computing
                                                                       Michel P. Numerical simulations of granular
(FHPC '12) [24], where some preliminary performance
                                                                       dynamics: I. Hard-sphere discrete element method
results on multicore and distributed architectures using
                                                                       and tests. Icarus 212, p. 427-437, 2011.
the PHOLD benchmark.
There are at least two attempts to implement PDES in               [6] J.P. Nilmeier, J. Marian J. A Rigorous Sequential
Cloud architecture. Fujimoto group introduced TW-                      Update Strategy for Parallel Kinetic Monte Carlo
SMIP (the Time Warp Straggler Message Identification                   Simulation. (accepted to Communications in
Protocol) protocol resigned for environment with shared                Computer Physics. 2014) arXiv link:
hardware resources for which it is known that traditional              http://arxiv.org/pdf/1403.4674.pdf
Time Warp (TW) algorithm shows poor performance                    [7] D.H. Bailey et al. Performance Technologies for
[25].     TW-SMIP       protocol     defines     dynamic               Peta-Scale Systems: A White Paper Prepared by
synchronization points for individual LPs based on                     the Performance Evaluation Research Center and
straggler messages (a new event with time stamp                        Collaborators. May 14 2003.
smaller than other events it has already processed),               [8] R. Fujimoto. Parallel and Distributed Simulation
which improves efficiency of simulation.                               Systems. Wiley Interscience, 2000.
Second example is realization of PDES simulator on                 [9] G. Korniss, Z. Toroczkai, M.A. Novotny, and
Cloud/Virtual machine platforms developed in Oak                       P.A. Rikvold. From Massively Parallel
Ridge National Laboratory [26]. The scalability of                     Algorithms and Fluctuating Time Horizons to
virtual time scheduler has been tested on 128 virtual                  Nonequilibrium Surface Growth. Phys. Rev. Lett.,
machines multiplexed on 32 cores, showing                              84, p. 1351-1354, 2000.
improvement in the runtime relative to the default                [10] L.N. Shchur and M.A. Novotny. Evolution of time
Cloud/VM scheduler. Authors believe that algorithmic                   horizons in parallel and grid simulations. Phys.
design, observations, and results of their work are                    Rev. E, 70, 026703, 2004.
timely for emerging cloud/VM-based installations,
                                                                  [11] M. Kardar, G. Parisi, and Y.C. Zhang. Dynamic
highlighting the need for PDES-specific support in high
                                                                       Scaling of Growing Interfaces. Phys. Rev. Lett.,
performance PDES on the cloud/VM platforms.
                                                                       56, p. 889-892, 1986.
                                                                  [12] L.N. Shchur, M.A. Novotny. Evolution of time
                                                                       horizon in parallel discrete event simulations.




                                                            112
     Proceedings of Conference “Scientific services in           [20] ROSS - Rensselaer's Optimistic Simulation
     Internet: multicore computer world”, Abrau-                      System. http://www.cs.rpi.edu/~chrisc/ross.html
     Dyurso, p. 197-200, 2007 (in Russian).                      [21] P.D. Barnes Jr, D.R. Jefferson, M. Schordan, D.
[13] U. Alon, M.R. Evans, H. Hinrinchsen, and D.                      Quinlan, C.D. Carothers, L.V. Kale. Extreme scale
     Mukamel. Roughening Transition in a One-                         optimistic parallel discrete event simulation with
     Dimensional Growth Process. Phys. Rev. Lett.,                    dynamic load balancing. Proceedings of the 2014
     76, p. 2746-2749, 1996.                                          Winter Simulation Conference. 2014.
[14] D.P. Landau and K. Binder. A Guide to Monte                 [22] G. Krafft and V. Getov. Transaction-oriented
     Carlo Simulations in Statistical Physics. 3d                     simulation in Ad Hoc Grids: design and
     edition. Cambridge: Cambridge University Press.                  experience. Procedings of HPCS, pp. 38-44, 2008.
     2013.                                                       [23] L. Toscano, G. D’Angelo, and M. Marzolla.
[15] SPPARKS Kinetic Monte Carlo Simulator                            Parallel discrete event simulation with Erlang.
[16] . Sandia National Laboratory.                                    FHPC '12 Proceedings of the 1st ACM SIGPLAN
     http://spparks.sandia.gov/index.html                             workshop on Functional high-performance
[17] C. G. Cardona, V. Tikare, S. J. Plimpton. Parallel               computing. p. 83-92, 2012.
     Simulation of 3D Sintering. Int J Comp Materials            [24] A.W. Malik, A.J. Park, R.M. Fujimoto. An
     Science and Surface Engineering, 4, p. 37-54,                    Optimistic Parallel Simulation Protocoal for Cloud
     2011.                                                            Computing Environments. SCS M&S Magazin, 4,
[18] J.P. Nimeier and J. Marian. A rigorous sequential                p. 1-9, 2010.
     update strategy for parallel kinetic Monte Carlo            [25] S.B. Yoginati and K.S. Permulla. Efficient
     simulation. Computer Physics Communications,                     Parallel Discrete Event Simulation on
     185, p. 2479-2486, 2014.                                         Cloud/Virtual Machine platform. ACM
[19] D. Coudert, L. Hogie, A. Lancin, D.                              Transactions on Modeling and Computer
     Papadimitrou, S. Perennes, and Isaam Tahiri.                     Simulation. 2014.
     Feasibility study on distributed simulation of BGP.
     Preprint arxiv: 1304.4750.




                                                           113