=Paper= {{Paper |id=Vol-1464/ewili15_14 |storemode=property |title=Autonomic Thread Scaling Library for QoS Management |pdfUrl=https://ceur-ws.org/Vol-1464/ewili15_14.pdf |volume=Vol-1464 |dblpUrl=https://dblp.org/rec/conf/ewili/DurelliS15 }} ==Autonomic Thread Scaling Library for QoS Management== https://ceur-ws.org/Vol-1464/ewili15_14.pdf
 Autonomic Thread Scaling Library for QoS Management∗

                      Gianluca C. Durelli                                  Marco D. Santambrogio
                     Politecnico di Milano                                  Politecnico di Milano
           Dipartimento di Elettronica, Informazione e            Dipartimento di Elettronica, Informazione e
                         Bioingegneria                                          Bioingegneria
               gianlucacarlo.durelli@polimi.it                        marco.santambrogio@polimi.it


ABSTRACT                                                         line resource management. Taking in consideration a wide-
Over the last years embedded system industry is facing a         spread programming model for parallel applications, such
revolution with the introduction of multicores and hetero-       as OpenMP [2], we have that applications written follow-
geneous devices. The availability of these new platforms         ing this standard generally try to exploit all the available
opens new paths for these devices that can be nowadays           cores in the system, conflicting with the other running ap-
used for more high demand tasks, exploiting the parallelism      plications. Academia and industries tackled this problem by
made available by the muticore processors. Nonetheless the       devising algorithms and solutions to minimize interference
progress of the HW technology are not backed up by im-           among colocated applications [3] and to perform dynamic
provements of the SW side, and runtime mechanisms to             resource management in multiple context, ranging from em-
manage resource allocation and contention on resources are       bedded devices to multi-tenants infrastructures [4]. Many
still lacking the proper effectiveness. This paper tackles the   works proposed to substitute the current programming mod-
problem of dynamic resource management from the applica-         els with a resource on-demand oriented model where the ap-
tion point of view and presents a user space library to con-     plications ask the system the resources they need at runtime
trol application performance. The control knob exploited         (Section 5).
by the library is the possibility of scaling the number of
threads used by an application and seamlessly integrates         This paper tackles the problem of dynamic resource man-
with OpenMP. A case study illustrates the benefits that this     agement from the application point of view and presents a
library has in a classic embedded system scenario, introduc-     user space library to control application performance. The
ing an overhead of less than 0.3%.                               control knob exploited by the library is the possibility of scal-
                                                                 ing the number of threads used by an application and this is
1. INTRODUCTION                                                  carried out by seamlessly integrating with OpenMP (Section
                                                                 2). To achieve this goal the proposed solution monitors ap-
Over the last years embedded system industry is facing a
                                                                 plication performance at runtime through a heartbeat-like
revolution with the introduction of multicores and heteroge-
                                                                 API [5] and builds an estimation of the application scalabil-
neous devices. The representative platform in the field has
                                                                 ity considering the contention caused by other applications
then moved from the classic micro-controller or single core
                                                                 concurrently running. To summarize this work proposes:
processor with relatively low performance and low power
consumption to a more high performance family of chips,
featuring a high degree of parallelism. An example of this          • a library to automatically infer the right number of
revolution is represented by the mobile world which moved in          threads to use to respect the application Quality of
few years from a single core processor working at few hun-            Service (QoS)
dred megahertz, to nowadays systems featuring octa-cores
asymmetric processor working at more than 2 GHz. This               • a model to infer the scalability of an application at
increased computational capability led to a drastic change            runtime considering resource contention
in the applicability of this devices; if once they were used
only for simple functions, now they can be used to perform
highly demanding tasks such as multimedia, video process-        Details about the implementation and experimental results
ing, or other kinds of high performance computations. As         are presented in sections 3 and 4 respectively. The library,
a further example of this wide applicability of embedded         being implemented in user space, can be used as is on every
system devices the EU project Mont Blanc [1] is aimed at         system without further system modification, and the code is
developing the first High Performance Computing (HPC)            available as open source ([6]).
cluster based on ARM processors. If on one hand the up-
dates on the hardware side marked this big revolution, on        2. PROBLEM DEFINITION
the other the software side didn’t keep up and, generally,
                                                                 The availability of multicore architectures in embedded sys-
applications in this domain rarely exploit parallelism (as for
                                                                 tems, allows these systems to execute computational inten-
instance mobile device applications), and, more importantly,
                                                                 sive tasks taking advantage of the available parallelism. As a
there is a big limitation in how the system performs on-
                                                                 consequence multiple applications will be executed, or con-
∗EWiLi’15, October 8th, 2015, Amsterdam, The Nether-             solidated, on the same system to increase its utilization.
lands. Copyright retained by the authors.                        Consolidation, as we learned from the utility computing
                             Start                                eration of the loop respects the throughput constraint then
                                                                  the whole application respects the throughput constraint as
                                                                  well:
                             Fork                                               PD         D           D       D
                                                                                        <        =⇒         <
                                                                             Titeration   TQoS        Treal   TQoS
        Thread      Thread                  Thread
                                     …                            Otherwise if we do not know the amount of data to be com-
          0           1                       N
                                                                  puted we can not set a deadline, but we can instead set a
                                                                  minimum average throughput for our application. In this
                             Join                                 second case we can express the goal over the execution of
                                                                  a single iteration of the loop and so we have to guarantee
                                                                  that PD /Titeration < T hrQoS . As we can see, we can enforce
                             End                                  deadline and throughput constraints in the same way in our
                                                                  applications; this means that we can use a single control
Figure 1: Task graph model of supported applica-                  policy to manage both requirements at the same time.
tions.
                                                                  The performance in our application model is expressed as
                                                                  performance of the computational intensive loop of the ap-
scenario, will lead to conflict among the concurrent appli-       plication. Expressing quality of service constraints in this
cations. The reference programming languages to achieve           way allows the library to be application independent and
high performance (at least for low level libraries) are with      more importantly allows the final users to express perfor-
no doubt C and C++, and pthread and OpenMP are the two            mance in a high level manner. For instance if we consider
main solutions to implement parallel applications. This sec-      a video processing application, we have that each iteration
tion presents the problem tackled by our work and illustrates     of the loop will compute one frame; in this case the perfor-
the different components that participate at its solution.        mance can be set to 30 Frame/s which means that the loop
                                                                  has to execute at least 30 times per second. This concept is
2.1    Application Model                                          similar to the one expressed by the Application Heartbeat
This work focuses on the management of CPU intensive,             framework [5] from which we derive our monitoring compo-
data parallel applications; these applications have the ad-       nent. Application Heartbeat proposes that the application
vantage of achieving an high speedup when parallelized us-        is instrumented with proper function calls that allow to ex-
ing multi-threaded implementations.                               press progress of the application in terms of high level met-
                                                                  rics. Our application will then be instrumented to provide
A simple model of these applications is the one presented         this information to the autonomic library; Section 3 details
in Figure 1. This model is a cyclic graph that represents         how the monitoring has been implemented and will clarify
the execution of a hot loop of a program. Each iteration of       the steps that have to be taken to instrument an application.
a loop executes a computational kernel over a set of data
and it can be parallelized following a fork-join model. This      2.3    Performance Resource Assignment
might seem a fairly simple model to consider but it finds real    Given a set of applications that express requirements in
application in a wide range of fields, such as image analysis     terms of deadline or throughput, the goal of the work is
where the loop may work on different images or frames, or         to perform a dynamic resource assignment such that each
physical simulations where each iteration represents a time       application gets the resources it needs to fulfill its require-
step of the simulation or the analysis of a partition of the      ment. The resource management policy envisioned in this
input dataset.                                                    work aims at managing the number of threads of an appli-
                                                                  cation and it works on a per application basis, meaning that
2.2    Application Performance Requirements                       each application will try to reach the solution to the resource
The model for the applications so presented allows to de-         management problem on its own. The reason behind this
fine application’s performance both in terms of constant          choice is that it solves many of the issues present in state of
throughput and deadline. In particular we can consider two        the art works without losing in effectiveness. For instance it
cases, whether the size of the input dataset is known a pri-      permits to run on a given system both application exploit-
ori or not. In the first situation supposing that D is the size   ing this dynamic control and legacy applications without any
of the dataset and each iteration computes a partition PD         problem. Furthermore distributing the control allows to ex-
of it, we can express on the application two kinds of QoS         ecute the control step only on given events that cause the
levels. If we want to express a deadline, TQoS , for our ap-      availability of new data to tune the applications (i.e. when
plication we might express it as a time constraint or as an       a single iteration of the loop terminates), instead of having
average throughput goal as well; in fact a deadline of TQoS       an external entity that executes with a given period. This
second is the same as an average throughput constraint of         will drastically reduce the impact of the controlling mech-
D/TQoS = T hrQoS actions/s (note that we used the term            anism over the system since now the computation happens
actions because this metric is application dependent). To         only when it is needed and it happens inside every applica-
enforce such deadline we have to guarantee that the desired       tion impacting only its performance. These presented here
D / T hrQoS ratio is greater than the one really achieved         are a few hints of the benefit that can be achieved with this
by the application D / Treal . Considering the structure of       distributed approach while a full comparison with the state
the described applications, we have that if every single it-      of the art is drawn in Section 5.
2.4    Thread Scaling                                                end of the iteration). With respect to standard Heartbeat
The reason behind the choice of thread scaling instead of            framework we allowed the possibility to emit not only the
controlling other resources as number of cores to assign to          heartbeat signal but also to specify the number of heartbeats
one application relies on the fact that controlling the number       to be sent with the signal. This allows one to specify how
of threads solves many management issues while allowing at           much computation has been effectively done in an iteration
the same time for better results, a comparison with other            of the loop, allowing the final user to specify a true high level
approaches is drawn in Section 5. The biggest problem to             metric that does not change with respect to how the input
solve in order to find the correct number of threads to use to       is partitioned across the iterations. Consider a kernel that
execute a kernel is to infer how the application scales when         has a throughput of T hrk operations/s and that takes ∆t
it is assigned a different amount of cores. A first solution         seconds. In the standard library a single heartbeat would
would be to exploit offline profiling to know the performance        be emitted leading to a measured throughput of:
profile of the application. However we do not think this is
                                                                                                1    1           T hrk
the right way to approach the problem since it would require                   T hrstandard =      =   × T hrk =
a different profiling of the application for all the possible tar-                              ∆t   N             N
get machines and for the different size of input data used in        while if we consider the proposed solution the heartbeat will
each iteration of the for loop. Furthermore the scaling curve        signal the computation of N units during the iteration lead-
observed with offline profiling might change at runtime de-          ing to a throughput of:
pending on other applications running in the system. Since
                                                                                                N    N
profiling all the possible applications combinations would                     T hrproposed =      =   × T hrk = T hrk
not be possible, we decided to exploit the same monitoring                                      ∆t   N
information used for detecting application performance, and          Considering a video processing application N might be a
then build online a scalability model of the application.            function of the resolution of the input video and a given
                                                                     goal of 30 Frames/s should not vary while changing the input
The method to predict the performance, which is better ex-           size. The current implementation of the monitoring library
plained in Section 3, starts supposing that the performance          exports three different values to the controller: the instant
of an application scales linearly with the number of the avail-      throughput (i.e. the throughput of the last iteration), the
able threads. The first run of the loop will then provide            window throughput (i.e. the throughput over a set of itera-
an expectation of the actual scalability of the applications         tions), and the global throughput (i.e. the throughput from
and an hint for the policy on the number of threads to use           the start of the program). When an application emits a
in the next iteration. After each iteration the performance          heartbeat signal, the heartbeat value and the emission time
measured for the number of threads used in that execution            are stored in a FIFO buffer having a length defined by the
are trained using an exponential moving average that allows          programmer; this length corresponds to the length of the
the prediction to be adapted to runtime conditions. Points           window used to compute the window throughput. Supposing
not yet tested are predicted interpolating the model start-          a FIFO of length N the three metrics, at iteration i, are so
ing from available values. Threads configurations that are           computed:
not monitored for a certain number of iterations are progres-
sively forgotten so that the model does not get stuck with                       T hrinstant = Heartbeatsi /(ti − ti−1 )
older estimations that are no longer valid.                                              Pi
                                                                                           k=i−N +1 Heartbeatsk /(tk − tk−1 )
                                                                           T hrwndow =
3. IMPLEMENTATION                                                                                          N −1
This section provides implementation details for the differ-
                                                                                               X
ent components of the proposed library. As a general note                       T hrglobal =         Heartbeatsk /(ti − t0 )
all the components here discussed are programmed using                                         k∈I
C++ and the entire library is implemented as an userspace
component. A programmer can write its own application                These three metrics have different meanings, the instant
and compile it against the library adding the proper func-           throughput captures the actual performance of the applica-
tion calls in its program to take advantage of the autonomic         tion and plays a key role in building the performance model
functionalities as further described. Interested readers can         of the kernel. The window throughput instead can be used
look at the complete code for this work cloning this GIT             to monitor applications that specify a QoS that consists in
repository [6].                                                      having a constant throughput throughout all the execution;
                                                                     finally the global throughput can be used to check if an ap-
                                                                     plication is meeting its deadline.
3.1    Monitoring
As a first step, in order to make the library aware of the
performance of the applications, and be able to adapt to             3.2    Model Update
changing runtime conditions, we need a monitoring compo-             To predict which is the right number of threads to use to ex-
nent that captures application performance. This compo-              ecute a kernel fulfilling the requirements, the library needs
nent is modeled after the Heartbeats framework [5]. The              to know how the performance of the application scales with
application can send heartbeats through a proper API and             respect to the number of threads used. One approach fol-
the monitor collects these heartbeats and summarizes them            lowed by many state of the art solutions [7, 3] relies on of-
in high level metrics which are proper of the application. As        fline profiling information to capture both single application
previously highlighted, since the policy controls the num-           performance and also interference caused by colocated work-
ber of threads, the application should emit a heartbeat only         loads. However this solution is unfeasible since the possible
when the parallel section of the kernel ends (i.e. at the            number of application combinations increases exponentially.
s t d : : map>                                   LibThreadScale libQoS ;
        threadPerformanceHistory ;                                                    l i b Q o S . setMinQoS ( MIN QoS ) ;
                                                                                      for ( i . . . ) {
void L i b T h r e a d S c a l e : : updateModel ( )                                  #pragma omp p a r a l l e l f o r
{                                                                                          for ( j . . . ) {
    s t d : : map tempThreadToPerformance ;                                       // Kernel code h e r e
                                                                                           }
      f o r ( auto &v : t h r e a d P e r f o r m a n c e H i s t o r y ) {                l i b Q o S . s e n d H e a r t b e a t (NUM HEARTBEATS) ;
              tempThreadToPerformance [ v . f i r s t ] =                             }
                  e x p o n e n t i a l M o v i n g A v e r a g e ( v . se co n d ,
                   forgettingFactor ) ;
      }                                                                                           Listing 2: Instrumented Application
      model . c r e a t e M o d e l ( tempThreadToPerformance , s t d
          : : p a i r (1 , maxThreads ) ) ;
                                                                                      of data has been processed. Listing 2 illustrates the instru-
}                                                                                     mentation process of the application. Only 3 lines of code
                                                                                      have been added, meaning that the usage of the proposed
               Listing 1: Model update routine                                        library does not heavily impact the application code and
                                                                                      programmability. Upon the execution of the sendHeartbeat
Our solution tries instead to avoid offline profiling and re-                         function the number of threads to use for the next execution
lies only on online monitoring information, collected by the                          is set automatically if OpenMP is used; this number is also
heartbeat-like API, like other approaches [8, 9]. Since the                           returned to the caller if manual implementation of thread
library controls the number of threads to assign to the ap-                           dispatching is needed, for instance when pthreads are used.
plication it has to build a model that estimates for each
possible number of threads the performance of the kernel.                             4. RESULTS
The algorithm to train this model is rather intuitive, and                            This section illustrates the experimental campaign carried
it is based on a linear interpolation of the online monitored                         out to test the functionalities and performance of the auto-
instant throughput. As we discussed at each iteration the                             nomic library proposed in this work. The tests have been
monitor updates the value of the instant throughput that                              performed on the Odroid XU3 development board, which
represents the real performance the kernel can achieve in a                           features an ARM big.LITTLE processor mounting 4×A15,
given moment with the number of threads it used. The in-                              and 4×A17 cores. For sake of clarity, we limit the discussion
stant throughput values are collected in FIFO buffers as it                           to only one case study; an interested reader can download
was for the monitoring data, and the user can set the length                          the source code at [6] and replicate the tests shown here on
of the FIFO. Listing 1 reports the updateModel() routine                              the other benchmarks we instrumented.
which computes the new model on the basis of the last obser-
vations. The routine is divided in two steps; the first one ad-
justs the performance estimation for monitored points on the                          4.1      Case Study
basis of the performance history. The performance history is                          The case study presented here is rather simple, but it is
encoded as a structure (a map) that connects each possible                            quite effective in illustrating the potential of the proposed
thread number configuration to a list of performance mea-                             solution. Consider the case where an embedded device, as
surements. This part computes, for each of the data in the                            the one we used here, is employed on cameras for surveil-
performance history, an exponential moving average over the                           lance purposes. Considering the hardware available on the
list of performance measurements. Depending on the value                              device, is quite natural to perform as much computation as
of the forgettingFactor parameter the last values inserted in                         possible on board instead of streaming the data to external
the FIFO might impact more than the older ones, meaning                               workstations. The system has to perform the task of moni-
that last observations are more meaningful for determin-                              toring the environment and detecting movements; once the
ing the current performance. This behavior is explained by                            movement is detected the device performs a deeper analysis
the fact that if runtime conditions change we want that our                           of the interested frames, while continuing with the motion
model converges quickly to the new estimation. After the es-                          detection task. Since motion detection is critical to initiate
timation for the already monitored point is determined, the                           the second phase and to trigger possible alarms, we need
function creates the final model (model.createModel(...)) in-                         for this task to maintain a constant performance of at least
terpolating linearly the just computed points over a specified                        30 frames per second (fps). The subsequent image analy-
interval (from 1 to the maximum number of threads that can                            ses instead have less priority and it is acceptable to have a
be used). If only one value is present in the tempThreadToP-                          lower performance, let’s say 15 fps. Figure 2 illustrates what
erformance map, the point (0, 0) is used as second point to                           happens in the system when the running applications rely
build the linear interpolation function.                                              on the Odroid native scheduler. As we can see motion de-
                                                                                      tection works at higher performance than the one requested
                                                                                      but when the image analysis is requested its performance
3.3     API                                                                           drops and stays below the desired limit for all the time the
The library exports two APIs that the programmer has to                               image analysis task is running. On the contrary Figure 3
use to instrument its application. All the monitoring and                             illustrates how the proposed solution is able not only to keep
model estimation complexity is hidden behind these APIs,                              the motion detection task in an acceptable working condi-
allowing to minimize the impact of the library on the tar-                            tion range when colocated with the image analysis tasks,
get application. The APIs are the following: (1) setMinQoS                            but it also keeps its performance near to the one requested
: allows to set a minimum QoS requirement (measured as                                using fewer threads to run the task. This decision to scale
throughput) that the library tries to meet; (2) sendHeart-                            the number of threads is done automatically, and helps for
beat : communicates to the library that a certain amount                              instance to reconfigure this particular application when the
                                                                Application                                                                         M
                40                                                Image Analysis 0                       ~ 20% Load   ~ 50% Load    ~ 75% Load
                                                                  Image Analysis 1
                                                                                                                                                        W    w
                                                                  Motion Detection                                                                      G
                30



         Frames/s




                                                                                                MO
                20




                                                                                                T
                                                                                                                                                 QoS Requirement
                                                                                                                                                            m
                10

                    0                                                                                                          Tm
                        0   10        20              30         40                  50
                                           Time [s]
                                                                                          F gure 4 Adaptat on of the dev sed                                 brary to ex-
Figure 2:   Colocated applications managed by                                             terna system oad
Odroid scheduler.
                                                                Application
                                                                                          Tab e 1 Execut on t me for the most nterest ng
                40                                                Image Analysis 0
                                                                  Image Analysis 1        funct on ca s
                30                                                Motion Detection
                                                                                                Funct on    Average µs Std Dev at on µs
         Frames/s




                20                                                                             upda eMode      64 86         45 01
                10                                                                          ge ThreadsToUse    81 82         51 36
                    0
                        0        20                        40
                                                                                              sendHear bea    140 14        105 02
                                           Time [s]

Figure 3: Colocated applications managed by our                                            s about 140 14 µs on average or each terat on1 Note that
solution.                                                                                 s nce the actuat on var es the number o threads the tera-
                                                                                          t ons need to be qu te ong to not ncur n excess ve overhead
new generation of surveillance camera has to be put on the                                wh e spawn ng the threads In our exper ment w th a dea
market. Furthermore, in this case, lowering the number of                                 kerne durat on o 33 33ms (correspond ng to a requ rement
threads implies a reduction in the number of cores active in                              o 30 rame/s) the brary ntroduced an overhead o 0 42%
the system, and subsequently the power consumption and                                    on average wh ch s neg g b e
overall temperature are reduced.
                                                                                          5    RELATED WORKS
4.2    Adapting to Different System Loads                                                 Autonom c resource management has been a hot research
                                                                                          top c over the ast ew years The capab ty o a system
In this experiment we used an instance of the Black and Sc-
                                                                                          to adapt to runt me vary ng cond t ons has been exp ored
holes benchmark, which runs for about one minute, and then
                                                                                           n d fferent context rang ng rom embedded systems 10
we forced on the CPU three different loads using Linux stress
                                                                                          to workstat ons 8 and c oud n rastructures us ng v rtua
and cpulimit utilities in combination. This experiment has
                                                                                          mach nes 7 Many works proposed resource management
the double goal of illustrating how the library reacts to vary-
                                                                                          contro ers w th the goa o manag ng co ocated app cat on
ing system load and that the devised solution can be used
                                                                                          try ng to opt m ze the r per ormance 3 or other figures such
alongside legacy applications without the need for them to
                                                                                          as per ormance per watt 11 D fferent contro techn ques
be instrumented. We injected the load in the system bound-
                                                                                          have been exp ored and a a r y comprehens ve summary can
ing 8 stress workers to the 8 CPUs and we then limit the
                                                                                          be ound n 12 To atta n a good contro mu t p e types
CPU time of these workers to the load we desired. The load
                                                                                          o act ons have been nvest gated such as CPU cores affin-
was kept for 10s and then we left the system at rest (with-
                                                                                           ty management 9 DVFS contro 10 task mapp ng 13
out any external load except for our benchmark) for 5s; the
                                                                                          CPU t me a ocat on 14 and cache co or ng 15 Some
loads used to stress the system where 25%, 50%, and 75%.
                                                                                          o the re ated works re y on offl ne app cat on profi ng to
The goal set for the application was of 5 MOptions/s (i.e.
                                                                                          determ ne the r character st cs (e g I/O or CPU bound) or
number of stock options to process in one second). Figure
                                                                                          how co ocated work oads nter ere one w th the other 7 A -
4 illustrates the performance perceived by the application.
                                                                                          ter the offl ne profi ng phase the on ne a gor thm ensures a
If we look at the throughput perceived by the application
                                                                                          g ven QoS or max m zes a g ven metr c (e g per ormance)
we see that global throughput is always maintained above
                                                                                          However th s approach does not sca e w th the number o
the required QoS threshold regardless of the system load in-
                                                                                          poss b e runn ng app cat ons and n genera on ne profi ng
jected. To achieve this result a different number of threads
                                                                                          so ut ons are better su ted when the set o app cat ons s not
are provisioned to the application at each iteration so that
                                                                                          known n advance as done by 9 14 12 and the proposed
for higher loads the library will consistently use a higher
                                                                                          approach
number of threads.
                                                                                          Cons der ng the resource contro ed to atta n a g ven QoS
4.3    Library Overhead                                                                   the works most s m ar to the one we propose contro s at
                                                                                          runt me e ther the number o CPU cores or the CPU t me
The last part of this section is aimed at showing the overhead
                                                                                          ass gned to app cat ons n order to grant a g ven QoS 3 9
introduced by the proposed library. To measure the over-
                                                                                          14 A these so ut ons as we as our work take n account
head we instrumented our library to collect the execution
                                                                                          the nter erence caused by co ocat on among d fferent app -
times of the most interesting functions. We want to deter-
                                                                                          cat ons when they est mate the resource to a ocate to the
mine the impact of the library on the application, and for
                                                                                          app cat ons However n the r eva uat on these researches
this reason we collected the execution time of the getThread-
                                                                                          used a fixed number o threads per app cat on and then
sToUse, sendHeartbeat and updateModel functions that are
                                                                                          constra ned these threads to execute on a var ab e number
executed when a heartbeat is emitted. Table 1 reports the
                                                                                          o cores depend ng on resource manager dec s on Th s so-
average, standard deviation and maximum execution time
measured over 4500 invocation of the library. As it can be                                1
                                                                                            The three unct on ca s n Tab e 1 are nested on y the ast
noticed from the table the overhead is relatively small and it                            one has been taken n account to compute the overhead)
lution causes slowdown due to the useless context switch         Acknowledgment
overhead, which degrades performance even more when the          This work has been partially funded by the European Com-
context switch delays thread synchronization; in fact a delay    mission in the context of the FP7 SAVE project (#610996-
in executing one instruction from one thread might block an-     SAVE).
other thread that is waiting on a synchronization point. In
the same way [3, 9] possibly cause slowdown at synchroniza-
tion point because they change the thread mapping which          7. REFERENCES
might be suboptimal with respect to memory hierarchy ar-          [1] N. Rajovic et al., “Supercomputing with commodity
chitecture. Usually the operating system uses some heuris-            cpus: are mobile socs ready for hpc?” in High
tics to optimize threads placement among cores and chang-             Performance Computing, Networking, Storage and
ing the affinity manually rarely leads to improvements. The           Analysis (SC), International Conf. for. IEEE, 2013.
approach proposed in this paper aims at changing the num-         [2] L. Dagum et al., “Openmp: an industry standard api
ber of threads used at runtime by an application avoiding             for shared-memory programming,” Computational
context switches among threads of the same application and            Science & Engineering, IEEE, vol. 5, no. 1, 1998.
relying on operating system heuristics for threads placement.     [3] A. Sharifi et al., “Mete: meeting end-to-end qos in
                                                                      multicores through system-wide resource
Finally the last aspect that has to be considered when com-           management,” in Proc. of the ACM SIGMETRICS
paring our approach to the state of the art is the fact that          joint international conference on Measurement and
the control policy of our solution is implemented for each            modeling of computer systems. ACM, 2011.
application running in the system. Most of the state of the       [4] D. B. Bartolini et al., “Towards a
art approaches [14, 3, 7] relies instead on a central entity          performance-as-a-service cloud,” in Proc. of the 4th
that controls the resource assignment for all the application         annual Symposium on Cloud Computing. ACM, 2013.
running in the system. A central entity has the advantage         [5] H. Hoffmann et al., “Application heartbeats: a generic
that it can better manage the resources when the system               interface for specifying program performance and
is overloaded allowing applications with higher priorities to         goals in autonomous computing environments,” in
get more resources, but in this situation the system is al-           Proceedings of the 7th international conference on
ready not respecting QoS so it is just a fact of mitigating           Autonomic computing. ACM, 2010.
the fact that the system is failing in its goal rather than a     [6] G. Durelli, “Source code.” [Online]. Available:
real advantage. Embedding the control inside the single ap-           https://bitbucket.org/durellinux/libthreadscale-code
plication has instead multiple advantages, while maintaining      [7] R. Nathuji et al., “Q-clouds: managing performance
the same effectiveness when the system is not overloaded. A           interference effects for qos-aware clouds,” in
first advantage is the fact that the overhead in which the            Proceedings of the 5th European conference on
application incurs is responsible only by the application it-         Computer systems. ACM, 2010.
self and not on external entities. Another important aspect       [8] D. B. Bartolini et al., “The autonomic operating
is the fact that a central entity can seldom be used with             system research project: achievements and future
legacy applications running in the system, which the con-             directions,” in Proceedings of the 50th Annual Design
troller is not aware of; instead our approach does not have           Automation Conference. ACM, 2013.
any issue with legacy application and can be deployed in any      [9] F. Sironi et al., “Metronome: operating system level
system. Connected to this we can argue that implementing              performance management via self-adaptive
the control inside the application will allow a faster adop-          computing,” in Design Automation Conference (DAC),
tion of the autonomic control since our approach does not             2012 49th. IEEE, 2012.
need any change in the system (which might not belong to
                                                                 [10] F. X. Lin et al., “K2: a mobile operating system for
the final user) and that the action taken by the autonomic
                                                                      heterogeneous coherence domains,” in Proc. of the 19th
library can be customized for each different applications.
                                                                      int. conf. on Architectural support for programming
                                                                      languages and operating systems. ACM, 2014.
6. CONCLUSION                                                    [11] H. Hoffmann and M. Maggio, “Pcp: A generalized
In this paper we presented an autonomic thread scaling li-            approach to optimizing performance under power
brary aimed at guaranteeing to an application certain QoS             constraints through resource management,” in 11th
expressed as a throughput. The library is implemented in              Int. Conf. on Autonomic Computing, 2014.
userspace and compiled against the application, which has        [12] H. Hoffmann, “Seec: A general and extensible
to be instrumented with a few function calls, and seamlessly
                                                                      framework for self-aware computing,âĂİ massachusetts
integrates with OpenMP library. The proposed solution al-
                                                                      institute of technology,” Tech. Rep, Tech. Rep., 2011.
lows to adapt at runtime the number of threads used to com-
pute an iteration of an hot loop to guarantee the desired QoS    [13] J. R. Wernsing and G. Stitt, “Elastic computing: a
exploiting a model of the scalability of the application with         framework for transparent, portable, and adaptive
respect to the number of threads used that is built online. A         multi-core heterogeneous computing,” in ACM
testing campaign illustrated how the library behaves under            SIGPLAN Notices, vol. 45. ACM, 2010.
different working conditions and highlighted how the over-       [14] D. B. Bartolini et al., “Automated fine-grained cpu
head introduced on the application is in the order of 140 µs          provisioning for virtual machines,” ACM Trans. on
which is negligible; furthermore the impact on programma-             Architecture and Code Optimization, vol. 11, 2014.
bility is very low since the library requires the insertion of   [15] A. Sharifi et al., “Courteous cache sharing: Being nice
only 3 lines of code in the application. The library devised          to others in capacity management,” in Design
in this work is available at this link [6].                           Automation Conf. (DAC), 2012 49th. IEEE, 2012.