=Paper= {{Paper |id=Vol-1643/paper-01 |storemode=property |title=Resource-Aware Application Execution Exploiting the BarbequeRTRM |pdfUrl=https://ceur-ws.org/Vol-1643/paper-01.pdf |volume=Vol-1643 |authors=Giuseppe Massari,Simone Libutti,William Fornaciari,Federico Reghenzani,Gianmario Pozzi |dblpUrl=https://dblp.org/rec/conf/date/MassariLFRP16 }} ==Resource-Aware Application Execution Exploiting the BarbequeRTRM== https://ceur-ws.org/Vol-1643/paper-01.pdf
Proceedings of 1st Workshop on Resource Awareness and Application Autotuning in Adaptive and Heterogeneous Computing (RES4ANT) 2016




                                            Resource-Aware
                                        Application Execution
                                    Exploiting the BarbequeRTRM

                                            Giuseppe Massari, Simone Libutti,
                                          William Fornaciari, Federico Reghenzani
                                                   and Gianmario Pozzi

                                                  Politecnico di Milano
                              DEIB: Dipartimento di Elettronica, Informazione e Bioingegneria
                                             giuseppe.massari@polimi.it




                           Abstract. Energy efficiency and thermal management have become major
                           concerns in both embedded and HPC systems. The progress of silicon
                           technology and the resulting growth of the dark silicon phenomenon are
                           negatively affecting the reliability of computing systems. As a result, in
                           the near future we expect run-time variability to increase in terms of both
                           performance and computing resource availability. To address these issues,
                           systems and applications must be able to adapt to such scenarios.
                           This work provides a brief overview of the Barbeque Run-Time Resource
                           Manager (BarbequeRTRM) and the application execution model that it
                           exploits to deal with run-time variability in performance and resource
                           availability.



                    1    Introduction

                    The need for resource-aware and adaptive applications is driven by several issues
                    and requirements typical of modern computing systems. For instance, embedded
                    mobile devices must cope with the limited energy budget provided by the battery,
                    while HPC centers must bear huge costs for the power consumption and the cooling
                    of their infrastructure. Furthermore, the dark silicon phenomenon affecting modern
                    processors is becoming prominent [1], since it increases the amount of silicon
                    area that must be turned off to keep the processor within its power envelope.
                    For all these reasons, a continuous and full usage of the whole set of system
                    computing resources is often impossible to achieve.
                        On the application side, we can gain efficiency by implementing suitable
                    adaptive behaviors, such as enabling/disabling the execution of a task, or scaling
                    the accuracy of the output depending on the availability of computing resources.
                    A run-time resource management framework can implement such an approach by
                    constraining the resource allocation according to system-level requirements or
                    run-time conditions, and by providing applications with suitable interfaces to
                    check and negotiate the resource assignment.




                    [Figure 1 diagram, from top to bottom: Applications (C, C++, OpenCL); Run-time
                    application library (Recipes, AEM API, AS-RTM API, Plain API, RPC Channel over
                    FIFOs/Binder or DBus, Synchronization Protocol); Resource Manager daemon
                    (Application Proxy, Application Manager, Synchronization Policy and Manager,
                    Scheduler Policy and Manager, Resource Accounter, Resource Manager, Power
                    Manager, Platform Proxy); Linux kernel-space (CPUfreq, Platform Drivers,
                    Control Groups).]



                    Fig. 1. The BarbequeRTRM Architecture. On top, the programming languages
                    supported by the application Run-Time Library (RTLib). In red, the resource manager
                    core, running on top of the support provided by the Linux OS to control the system
                    resources.


                    2     Run-time Resource Management
                    The BarbequeRTRM is a modular and portable run-time resource manager targeting
                    both embedded and High-Performance Computing (HPC) systems. From the hardware
                    resources perspective, the framework can manage homogeneous and heterogeneous
                    multi-core processors, as well as heterogeneous systems including devices
                    characterized by completely different ISAs (e.g., CPU and GPU).
                        The modularity of the BarbequeRTRM comes from a software architecture
                    that distinguishes between core components and plugin modules. Typically, the
                    latter are platform-specific extensions and selectable resource management
                    policies.
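To make the core/plugin split concrete, the following minimal C++ sketch shows selectable policies created through a registry. All names (`Policy`, `FairPolicy`, `PolicyRegistry`, `Schedule`) are hypothetical and do not reproduce the BarbequeRTRM plugin interface; here a policy simply maps application IDs to a number of CPU cores.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical policy interface: given the applications to schedule and the
// number of available cores, return the cores granted to each application.
class Policy {
public:
    virtual ~Policy() = default;
    virtual std::map<int, int> Schedule(const std::vector<int>& appIds,
                                        int availableCores) = 0;
};

// Example plugin: split the cores evenly, giving any remainder to the
// first applications in the list.
class FairPolicy : public Policy {
public:
    std::map<int, int> Schedule(const std::vector<int>& appIds,
                                int availableCores) override {
        assert(availableCores >= 0);
        std::map<int, int> grant;
        if (appIds.empty()) return grant;
        const int n = static_cast<int>(appIds.size());
        for (int i = 0; i < n; ++i)
            grant[appIds[i]] = availableCores / n + (i < availableCores % n ? 1 : 0);
        return grant;
    }
};

// Core-side registry: policies are selected by name, so new ones can be
// added without modifying the core components.
class PolicyRegistry {
public:
    using Factory = std::function<std::unique_ptr<Policy>()>;

    void Register(const std::string& name, Factory f) {
        factories_[name] = std::move(f);
    }

    // Returns nullptr when no policy with that name was registered.
    std::unique_ptr<Policy> Create(const std::string& name) const {
        auto it = factories_.find(name);
        return it == factories_.end() ? nullptr : it->second();
    }

private:
    std::map<std::string, Factory> factories_;
};
```

Swapping the policy then only requires registering another factory and selecting it by name, while the core classes stay untouched.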
                        Portability, instead, is guaranteed by relying on underlying Linux
                    operating system frameworks, such as cpufreq and cgroups, which allow the
                    BarbequeRTRM to enforce its resource allocation decisions [2].
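As an illustration of how cgroups can enforce a CPU allocation, the sketch below writes to cgroup v1 control files of the cpuset and cpu controllers (`cpuset.cpus`, `cpu.cfs_period_us`, `cpu.cfs_quota_us`). The helper functions, the directory layout and the "percentage of one core" convention are assumptions made for this example, not the BarbequeRTRM implementation. Moving tasks into the group, setting `cpuset.mems`, and cgroup v2 (which uses `cpu.max` instead) are not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical convention: CPU bandwidth is a percentage of one core
// (e.g. 150 = one and a half cores). The cgroup v1 "cpu" controller
// enforces it as a runtime quota per scheduling period.
std::int64_t cfsQuotaUs(int cpuPercent, std::int64_t periodUs) {
    assert(cpuPercent > 0 && periodUs > 0);
    return periodUs * cpuPercent / 100;
}

// Write one value into a cgroup control file inside groupDir.
// Returns false if the file cannot be opened or written.
bool writeCgroupFile(const std::string& groupDir, const std::string& file,
                     const std::string& value) {
    std::ofstream out(groupDir + "/" + file);
    if (!out) return false;
    out << value << '\n';
    return static_cast<bool>(out);
}

// Apply a CPU allocation to already-existing groups (assumed layout, e.g.
// /sys/fs/cgroup/cpuset/<app> and /sys/fs/cgroup/cpu/<app>): restrict the
// usable cores, then cap the bandwidth.
bool applyCpuAllocation(const std::string& cpusetDir, const std::string& cpuDir,
                        const std::string& cores, int cpuPercent,
                        std::int64_t periodUs = 100000) {
    return writeCgroupFile(cpusetDir, "cpuset.cpus", cores) &&
           writeCgroupFile(cpuDir, "cpu.cfs_period_us", std::to_string(periodUs)) &&
           writeCgroupFile(cpuDir, "cpu.cfs_quota_us",
                           std::to_string(cfsQuotaUs(cpuPercent, periodUs)));
}
```

For instance, `applyCpuAllocation("/sys/fs/cgroup/cpuset/app", "/sys/fs/cgroup/cpu/app", "0", 50)` would restrict the group to core 0 and half of its bandwidth, assuming those groups exist and the caller has write permission.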

                    2.1    Abstract Execution Model
                    The resource manager exposes its services to the applications through a run-time
                    library (RTLib). The library serves a two-fold objective: 1) to provide a
                    communication channel between the resource manager and the applications; 2)
                    to expose an execution model that supports the implementation of resource-aware
                    adaptive execution in the applications [3].

                                               Fig. 2. Abstract Execution Model

                        Figure 2 shows the Abstract Execution Model (AEM), which run-time manageable
                    applications must implement. This execution model is put in place by defining
                    and implementing a suitable C++ class, derived from the BbqueEXC class provided
                    by the RTLib.
                        At run-time, the BbqueEXC member functions are called by a control thread,
                    which is responsible for synchronizing the application execution with the
                    decision process of the resource manager. The rationale behind each member
                    function implementation is the following:
                      - onSetup(): set up the application (initialize variables and data structures,
                        start threads, ...).
                      - onConfigure(): check the amount of assigned resources and configure the
                        application accordingly.
                      - onRun(): a single cycle of computation (e.g., encoding a single frame of a
                        video).
                      - onMonitor(): performance and QoS monitoring.
                      - onRelease(): cleanup and termination code.
                        Once the application completes the initialization step (onSetup), the
                    control thread waits for the resource allocation decision coming from the
                    BarbequeRTRM. As soon as the decision is received, the onConfigure function is
                    called. In this function, the application can check the amount of assigned
                    resources and configure itself accordingly, before starting (or continuing) the
                    execution, as sketched below.




                    RTLIB_ExitCode_t BlackscholesEXC::onConfigure(int8_t awm_id) {
                       // Get the number of CPU cores assigned
                       // (nr_cpu is assumed to be a data member of BlackscholesEXC)
                       GetAssignedResources(RTLIB_ResourceType::PROC_NR, nr_cpu);

                       // Configure...
                       return RTLIB_OK;
                    }
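For a data-parallel kernel such as blackscholes, configuring the application could mean sizing the worker pool on the number of assigned cores and splitting the input options among the workers. The helper below is a hypothetical sketch (`partitionWork` is not part of the RTLib or of PARSEC); in onConfigure it would be called with the `nr_cpu` value obtained through GetAssignedResources.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper for onConfigure(): split numOptions work items into
// nrCpu contiguous [begin, end) ranges, one per worker thread.
// Range sizes differ by at most one item.
std::vector<std::pair<std::size_t, std::size_t>>
partitionWork(std::size_t numOptions, int nrCpu) {
    // Guard against a zero or negative allocation: fall back to one worker.
    const std::size_t workers = nrCpu > 0 ? static_cast<std::size_t>(nrCpu) : 1;
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t begin = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        // The first (numOptions % workers) workers take one extra item.
        const std::size_t len = numOptions / workers + (w < numOptions % workers ? 1 : 0);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}
```

When the resource manager later changes the allocation, onConfigure is called again and the ranges can simply be recomputed for the new core count.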

                       The onRun and onMonitor functions are then called in sequence, in a loop,
                    until the entire computation is over.
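The AEM call sequence can be mimicked by a small, self-contained toy class. `ToyEXC` and `FrameCounter` are hypothetical and greatly simplified: there is no separate control thread, the allocation is passed in directly instead of being received from the resource manager, and onRun signals completion with a `bool`, whereas the RTLib member functions return `RTLIB_ExitCode_t` values.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, single-threaded imitation of the AEM control loop.
// A real control thread would wait for the resource manager's decision
// before calling onConfigure().
class ToyEXC {
public:
    virtual ~ToyEXC() = default;

    // Setup, configure with the given allocation, then alternate
    // onRun()/onMonitor() until onRun() reports that no work is left.
    void Run(int assignedCores) {
        onSetup();
        onConfigure(assignedCores);
        while (onRun())
            onMonitor();
        onRelease();
    }

protected:
    virtual void onSetup() {}
    virtual void onConfigure(int assignedCores) = 0;
    virtual bool onRun() = 0;  // false when the computation is over
    virtual void onMonitor() {}
    virtual void onRelease() {}
};

// Example application: processes a fixed number of "frames" and records
// which callback was invoked, in order.
class FrameCounter : public ToyEXC {
public:
    explicit FrameCounter(int frames) : remaining_(frames) {}
    std::vector<std::string> trace;

protected:
    void onSetup() override { trace.push_back("setup"); }
    void onConfigure(int cores) override {
        trace.push_back("configure:" + std::to_string(cores));
    }
    bool onRun() override {
        if (remaining_ == 0) return false;
        --remaining_;
        trace.push_back("run");
        return true;
    }
    void onMonitor() override { trace.push_back("monitor"); }
    void onRelease() override { trace.push_back("release"); }

private:
    int remaining_;
};
```

Calling `FrameCounter(2).Run(2)` records setup, configure:2, two run/monitor pairs and finally release.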
                       The RTLib estimates the current performance of the application in terms of
                    cycles per second (CPS), so that the application can check the gap between the
                    required performance level and the one currently achieved. The application can
                    then notify the resource manager of this gap.
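As a worked example, assume (the text does not give the exact formulas) that CPS is measured as the number of cycles n completed in a window divided by their total duration, with t_i the duration of cycle i, and that the gap is expressed relative to the goal:

```latex
\mathrm{CPS} = \frac{n}{\sum_{i=1}^{n} t_i},
\qquad
\Delta = \frac{\mathrm{CPS} - \mathrm{CPS}_{\mathrm{goal}}}{\mathrm{CPS}_{\mathrm{goal}}} \times 100\%
```

With hypothetical numbers: if each of the last 10 cycles took 0.5 s, CPS = 10 / 5 s = 2; with a goal of 2.5 CPS, the gap is (2 - 2.5) / 2.5 = -20%, i.e. the application is running 20% below its target and would benefit from more resources.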
                       Since the performance goal can also vary with input data and external events,
                    an effective approach is to call the SetCPSGoal function to specify the
                    performance goal and the notification rate, as in the following example of an
                    onMonitor implementation:
                    RTLIB_ExitCode_t BlackscholesEXC::onMonitor() {
                       // Specific event condition triggering the
                       // change of performance requirements
                       if (...)
                          SetCPSGoal(2.5, 10);  // goal: 2.5 CPS; notification rate: 10 cycles
                       // ...
                       return RTLIB_OK;
                    }
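The event condition in the example is left unspecified. One hypothetical trigger is a deadline: if the application knows how many cycles remain and how much time is left, the CPS it needs is their ratio. The two helpers below are illustrative assumptions, not RTLib functions; the second one avoids calling SetCPSGoal for negligible changes.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: CPS needed to complete remainingCycles within
// secondsLeft seconds. Returns -1.0 when the deadline has already passed,
// so the caller can handle that case explicitly.
double requiredCps(long remainingCycles, double secondsLeft) {
    if (secondsLeft <= 0.0) return -1.0;
    return static_cast<double>(remainingCycles) / secondsLeft;
}

// Hypothetical filter: request a new goal only when it differs from the
// current one by more than `tolerance` (relative), to avoid flooding the
// resource manager with tiny updates.
bool goalChanged(double currentGoal, double newGoal, double tolerance) {
    if (currentGoal <= 0.0) return newGoal > 0.0;
    return std::fabs(newGoal - currentGoal) / currentGoal > tolerance;
}
```

For instance, 250 remaining cycles and 100 s left give `requiredCps(250, 100.0) == 2.5`, the goal value used in the SetCPSGoal example.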

                        In the example, the application sets a performance goal of 2.5 CPS and a
                    notification rate of 10 cycles. The library keeps track of the application
                    performance, computing the average CPS value over a (configurable) number of
                    the most recent execution cycles. Whenever the performance gap exceeds a given
                    (configurable) threshold, the gap value is sent to the resource manager. As a
                    consequence, the amount of assigned resources can be adjusted accordingly. The
                    notification rate is then used to bound the application reconfiguration rate,
                    and hence the related overhead. In other words, the application asks the resource
                    manager not to send back a new reconfiguration request until at least 10
                    execution cycles have elapsed.
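The monitoring logic just described can be sketched as a small class that keeps a sliding window of cycle durations, computes the average CPS and decides when a notification is due. `CpsMonitor`, its parameters and the relative definition of the gap are assumptions; the actual RTLib bookkeeping may differ, and here the notification rate is modeled as a minimum number of cycles between two notifications.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <deque>
#include <numeric>

// Hypothetical application-side CPS monitor.
class CpsMonitor {
public:
    CpsMonitor(double goalCps, std::size_t notifyRate,
               std::size_t window, double threshold)
        : goal_(goalCps), rate_(notifyRate), window_(window), threshold_(threshold) {}

    // Record the duration (seconds) of one completed cycle. Returns true when
    // the gap should be sent to the resource manager: the relative gap exceeds
    // the threshold AND at least notifyRate cycles have passed since the
    // previous notification.
    bool OnCycle(double cycleSeconds) {
        times_.push_back(cycleSeconds);
        if (times_.size() > window_) times_.pop_front();  // keep the last `window` cycles
        ++sinceNotify_;
        if (sinceNotify_ < rate_ || std::fabs(Gap()) <= threshold_) return false;
        sinceNotify_ = 0;
        return true;
    }

    // Average CPS over the window: cycles divided by their total duration.
    double Cps() const {
        const double total = std::accumulate(times_.begin(), times_.end(), 0.0);
        return total > 0.0 ? static_cast<double>(times_.size()) / total : 0.0;
    }

    // Relative gap with respect to the goal: negative means too slow.
    double Gap() const { return (Cps() - goal_) / goal_; }

private:
    double goal_;
    std::size_t rate_, window_;
    double threshold_;
    std::deque<double> times_;
    std::size_t sinceNotify_ = 0;
};
```

With a goal of 2.5 CPS, a rate of 10, a window of 5 cycles and a 10% threshold, cycles lasting 0.5 s (2 CPS, a -20% gap) trigger a notification at the 10th cycle and not again before the 20th.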


                    3    Experimental Scenario

                    In this section we show the results of the resource-aware adaptive execution of
                    blackscholes, from the PARSEC benchmark suite [4], on an embedded development
                    board featuring a dual-core ARM Cortex-A9 CPU. The benchmark was modified to fit
                    the Abstract Execution Model. The CPU frequency was set to its maximum value,
                    920 MHz. Full CPU usage, shown in Figure 3a, causes the chip temperature to rise
                    above 100 °C, thus triggering the thermal throttling response of the operating
                    system.




                    [Figure 3 plots: two panels titled "The BarbequeRTRM: Power data trace plot", each
                    showing CPU load [%], temperature [C] and frequency [MHz] over time [s];
                    (a) CPS = 4, (b) CPS = 1.]

                    Fig. 3. PARSEC blackscholes execution: CPU load, temperature and clock frequency
                    variations according to two performance requirements: a) 4 cycles-per-second; b) 1
                    cycle-per-second.


                    The operating system then continuously scales the CPU frequency to cool down the
                    chip, which in turn causes performance variability.
                        In Figure 3b, the application sets a performance goal of CPS = 1. The resource
                    manager takes this information into account and shrinks the amount of CPU time
                    assigned. The result is a lower but more stable performance level, along with
                    reduced thermal stress.
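The text does not detail how the policy maps the CPS goal to CPU time. A simple hypothetical rule, shown only to illustrate the direction of the adjustment, scales the current CPU share by the ratio between the goal and the measured CPS, assuming performance grows roughly linearly with CPU time, and clamps the result to the allowed range. It is not the BarbequeRTRM scheduling policy.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical proportional rule (not the BarbequeRTRM policy).
// Shares are percentages of one core (e.g. 200 = two full cores).
int adjustCpuShare(int currentShare, double measuredCps, double goalCps,
                   int minShare, int maxShare) {
    assert(minShare <= maxShare);
    if (measuredCps <= 0.0) return maxShare;  // no progress measured yet
    // Linear-scaling assumption: CPS proportional to CPU share.
    const double scaled = currentShare * goalCps / measuredCps;
    // Clamp in floating point first, then round to the nearest integer.
    const double bounded = std::max<double>(minShare, std::min<double>(maxShare, scaled));
    return static_cast<int>(bounded + 0.5);
}
```

For example, `adjustCpuShare(200, 3.0, 1.5, 10, 200)` returns 100: an application holding two full cores and measuring 3 CPS against a goal of 1.5 CPS would be cut to one core, under the linearity assumption.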


                    References
                    1. H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger,
                       “Dark silicon and the end of multicore scaling,” in Proceedings of the
                       38th Annual International Symposium on Computer Architecture, ser. ISCA
                       ’11. New York, NY, USA: ACM, 2011, pp. 365–376. [Online]. Available:
                       http://doi.acm.org/10.1145/2000064.2000108
                    2. P. Bellasi, G. Massari, and W. Fornaciari, “Effective Runtime Resource
                       Management Using Linux Control Groups with the BarbequeRTRM Framework,” ACM
                       Transactions on Embedded Computing Systems (TECS), vol. 14, no. 2, p. 39, 2015.
                    3. G. Massari, E. Paone, P. Bellasi, G. Palermo, V. Zaccaria, W. Fornaciari, and
                       C. Silvano, “Combining application adaptivity and system-wide resource management
                       on multi-core platforms,” in Embedded Computer Systems: Architectures, Modeling,
                       and Simulation (SAMOS XIV), 2014 International Conference on. IEEE, 2014,
                       pp. 26–33.
                    4. C. Bienia, S. Kumar, J. P. Singh, and K. Li, “The PARSEC benchmark
                       suite: characterization and architectural implications,” in Proceedings of the 17th
                       international conference on Parallel architectures and compilation techniques, ser.
                       PACT ’08. New York, NY, USA: ACM, 2008, pp. 72–81. [Online]. Available:
                       http://doi.acm.org/10.1145/1454115.1454128