Autonomic Thread Scaling Library for QoS Management∗

Gianluca C. Durelli, Marco D. Santambrogio
Politecnico di Milano, Dipartimento di Elettronica, Informazione e Bioingegneria
gianlucacarlo.durelli@polimi.it, marco.santambrogio@polimi.it

∗EWiLi'15, October 8th, 2015, Amsterdam, The Netherlands. Copyright retained by the authors.

ABSTRACT
Over the last few years the embedded system industry has been facing a revolution with the introduction of multicore and heterogeneous devices. The availability of these new platforms opens new paths for these devices, which can nowadays be used for more demanding tasks that exploit the parallelism made available by multicore processors. Nonetheless, the progress of the hardware technology is not backed up by improvements on the software side, and runtime mechanisms to manage resource allocation and contention on resources still lack the proper effectiveness. This paper tackles the problem of dynamic resource management from the application point of view and presents a user space library to control application performance. The control knob exploited by the library is the possibility of scaling the number of threads used by an application, and it seamlessly integrates with OpenMP. A case study illustrates the benefits that this library brings in a classic embedded system scenario, introducing an overhead of less than 0.3%.

1. INTRODUCTION
Over the last few years the embedded system industry has been facing a revolution with the introduction of multicore and heterogeneous devices. The representative platform in the field has moved from the classic micro-controller, or single core processor with relatively low performance and low power consumption, to a family of high performance chips featuring a high degree of parallelism. An example of this revolution is represented by the mobile world, which moved in a few years from single core processors working at a few hundred megahertz to systems featuring octa-core asymmetric processors working at more than 2 GHz. This increased computational capability led to a drastic change in the applicability of these devices: if once they were used only for simple functions, now they can be used to perform highly demanding tasks such as multimedia, video processing, or other kinds of high performance computations. As a further example of this wide applicability of embedded devices, the EU project Mont Blanc [1] aims at developing the first High Performance Computing (HPC) cluster based on ARM processors.

If on one hand the updates on the hardware side marked this big revolution, on the other the software side did not keep up: applications in this domain rarely exploit parallelism (as is the case, for instance, for mobile applications) and, more importantly, there is a big limitation in how the system performs online resource management. Considering a widespread programming model for parallel applications such as OpenMP [2], applications written following this standard generally try to exploit all the available cores in the system, conflicting with the other running applications. Academia and industry tackled this problem by devising algorithms and solutions to minimize interference among colocated applications [3] and to perform dynamic resource management in multiple contexts, ranging from embedded devices to multi-tenant infrastructures [4]. Many works proposed to substitute the current programming models with a resource on-demand oriented model, where the applications ask the system for the resources they need at runtime (Section 5).

This paper tackles the problem of dynamic resource management from the application point of view and presents a user space library to control application performance. The control knob exploited by the library is the possibility of scaling the number of threads used by an application, and this is carried out by seamlessly integrating with OpenMP (Section 2). To achieve this goal the proposed solution monitors application performance at runtime through a heartbeat-like API [5] and builds an estimation of the application scalability that accounts for the contention caused by other applications running concurrently. To summarize, this work proposes:

• a library to automatically infer the right number of threads to use to respect the application Quality of Service (QoS);
• a model to infer the scalability of an application at runtime considering resource contention.

Details about the implementation and experimental results are presented in Sections 3 and 4 respectively. The library, being implemented in user space, can be used as is on every system without further system modification, and the code is available as open source [6].

2. PROBLEM DEFINITION
The availability of multicore architectures in embedded systems allows these systems to execute computationally intensive tasks taking advantage of the available parallelism. As a consequence, multiple applications will be executed, or consolidated, on the same system to increase its utilization. Consolidation, as we learned from the utility computing scenario, will lead to conflicts among the concurrent applications. The reference programming languages to achieve high performance (at least for low level libraries) are without doubt C and C++, and pthreads and OpenMP are the two main solutions to implement parallel applications. This section presents the problem tackled by our work and illustrates the different components that participate in its solution.

[Figure 1: Task graph model of supported applications — a loop whose body forks into threads 0…N and joins before the next iteration.]

2.1 Application Model
This work focuses on the management of CPU intensive, data parallel applications; these applications have the advantage of achieving a high speedup when parallelized using multi-threaded implementations.

A simple model of these applications is the one presented in Figure 1. This model is a cyclic graph that represents the execution of a hot loop of a program. Each iteration of the loop executes a computational kernel over a set of data and can be parallelized following a fork-join model. This might seem a fairly simple model, but it finds real application in a wide range of fields, such as image analysis, where the loop may work on different images or frames, or physical simulations, where each iteration represents a time step of the simulation or the analysis of a partition of the input dataset.

2.2 Application Performance Requirements
The application model so presented allows to define an application's performance both in terms of constant throughput and in terms of deadline. In particular we can consider two cases, depending on whether the size of the input dataset is known a priori or not. In the first situation, supposing that D is the size of the dataset and each iteration computes a partition P_D of it, we can express two kinds of QoS levels on the application. If we want to express a deadline, T_QoS, for our application, we might express it either as a time constraint or as an average throughput goal; in fact a deadline of T_QoS seconds is the same as an average throughput constraint of D/T_QoS = Thr_QoS actions/s (we use the term actions because this metric is application dependent). To enforce such a deadline we have to guarantee that the time actually taken by the application, T_real, does not exceed the deadline D/Thr_QoS = T_QoS. Considering the structure of the described applications, if every single iteration of the loop respects the throughput constraint, then the whole application respects the throughput constraint as well:

  \frac{P_D}{T_{iteration}} \geq \frac{D}{T_{QoS}} \;\Longrightarrow\; \frac{D}{T_{real}} \geq \frac{D}{T_{QoS}}

since the partitions P_D processed by the iterations add up to D, so the iteration times sum to at most T_QoS. Otherwise, if we do not know the amount of data to be computed, we cannot set a deadline, but we can instead set a minimum average throughput for our application. In this second case we can express the goal over the execution of a single iteration of the loop, and so we have to guarantee that P_D / T_{iteration} \geq Thr_QoS. As we can see, deadline and throughput constraints can be enforced in the same way in our applications; this means that we can use a single control policy to manage both requirements at the same time.
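To make the equivalence concrete, the following minimal sketch (invented numbers and names, not part of the library) shows that both QoS styles reduce to the same per-iteration throughput test:

    #include <cstdio>

    // Illustrative sketch (not library code): the same per-iteration test
    // enforces both a deadline and a minimum-throughput constraint.
    // One iteration processed p_d items in t_iter seconds; check whether it
    // sustains the minimum throughput thr_qos (items per second).
    static bool iterationMeetsQoS(double p_d, double t_iter, double thr_qos) {
        return p_d / t_iter >= thr_qos;
    }

    int main() {
        // Deadline case: D = 9000 frames within T_QoS = 300 s is equivalent
        // to a minimum average throughput of D / T_QoS = 30 frames/s.
        const double thr_qos = 9000.0 / 300.0;
        // An iteration that computed 32 frames in 1 s respects the constraint.
        std::printf("%s\n", iterationMeetsQoS(32.0, 1.0, thr_qos) ? "ok" : "miss");
        return 0;
    }

Running it prints "ok", since 32 frames in one second exceeds the derived minimum of 30 frames/s.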
The performance in our application model is expressed as the performance of the computationally intensive loop of the application. Expressing quality of service constraints in this way allows the library to be application independent and, more importantly, allows the final users to express performance in a high level manner. For instance, if we consider a video processing application where each iteration of the loop computes one frame, the performance goal can be set to 30 frames/s, which means that the loop has to execute at least 30 times per second. This concept is similar to the one expressed by the Application Heartbeats framework [5], from which we derive our monitoring component. Application Heartbeats proposes that the application be instrumented with proper function calls that express the progress of the application in terms of high level metrics. Our applications will then be instrumented to provide this information to the autonomic library; Section 3 details how the monitoring has been implemented and clarifies the steps that have to be taken to instrument an application.

2.3 Performance Resource Assignment
Given a set of applications that express requirements in terms of deadline or throughput, the goal of this work is to perform a dynamic resource assignment such that each application gets the resources it needs to fulfill its requirement. The resource management policy envisioned in this work manages the number of threads of an application and works on a per application basis, meaning that each application tries to reach the solution to the resource management problem on its own. The reason behind this choice is that it solves many of the issues present in state of the art works without losing effectiveness. For instance, it permits to run on a given system both applications exploiting this dynamic control and legacy applications without any problem. Furthermore, distributing the control allows to execute the control step only on the events that make new data available to tune the applications (i.e. when a single iteration of the loop terminates), instead of having an external entity that executes with a given period. This drastically reduces the impact of the controlling mechanism on the system, since the computation happens only when it is needed and it happens inside every application, impacting only its own performance. These are a few hints of the benefits that can be achieved with this distributed approach; a full comparison with the state of the art is drawn in Section 5.
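As an illustration of such an event-driven control step, the sketch below picks a thread count at the end of an iteration under the linear-scalability assumption introduced in the next subsection; all names are illustrative, and the actual policy relies on the interpolated model of Section 3.2 rather than on this single-point estimate.

    #include <algorithm>

    // Minimal sketch of a per-application, event-driven control step, run
    // only when an iteration ends. Assumes throughput scales roughly
    // linearly with the thread count; names are illustrative.
    static int controlStep(double measuredThr, double thrQoS,
                           int curThreads, int maxThreads) {
        double perThread = measuredThr / curThreads;       // observed contribution
        int needed = static_cast<int>(thrQoS / perThread) + 1; // conservative round-up
        return std::min(std::max(needed, 1), maxThreads);
    }

    int main() {
        // e.g. measured 12 ops/s with 2 threads, goal 30 ops/s, 8 cores available
        return controlStep(12.0, 30.0, 2, 8); // -> 6 threads
    }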
2.4 Thread Scaling
The choice of thread scaling, instead of controlling other resources such as the number of cores assigned to an application, relies on the fact that controlling the number of threads solves many management issues while allowing at the same time for better results; a comparison with other approaches is drawn in Section 5. The biggest problem to solve in order to find the correct number of threads to use to execute a kernel is to infer how the application scales when it is assigned a different amount of cores. A first solution would be to exploit offline profiling to know the performance profile of the application. However, we do not think this is the right way to approach the problem, since it would require a different profiling of the application for every possible target machine and for the different sizes of input data used in each iteration of the loop. Furthermore, the scaling curve observed with offline profiling might change at runtime depending on the other applications running in the system. Since profiling all the possible application combinations would not be feasible, we decided to exploit the same monitoring information used for detecting application performance and build a scalability model of the application online.

The method to predict the performance, better explained in Section 3, starts by supposing that the performance of an application scales linearly with the number of available threads. The first run of the loop then provides an expectation of the actual scalability of the application and a hint for the policy on the number of threads to use in the next iteration. After each iteration, the performance measured for the number of threads used in that execution trains an exponential moving average that allows the prediction to adapt to runtime conditions. Points not yet tested are predicted by interpolating the model from the available values. Thread configurations that are not monitored for a certain number of iterations are progressively forgotten, so that the model does not get stuck with older estimations that are no longer valid.

3. IMPLEMENTATION
This section provides implementation details for the different components of the proposed library. As a general note, all the components discussed here are programmed in C++ and the entire library is implemented as a userspace component. A programmer can write their own application and compile it against the library, adding the proper function calls to take advantage of the autonomic functionalities as described below. Interested readers can look at the complete code for this work by cloning its Git repository [6].

3.1 Monitoring
As a first step, in order to make the library aware of the performance of the applications and able to adapt to changing runtime conditions, we need a monitoring component that captures application performance. This component is modeled after the Heartbeats framework [5]. The application sends heartbeats through a proper API, and the monitor collects these heartbeats and summarizes them in high level metrics which are proper to the application. As previously highlighted, since the policy controls the number of threads, the application should emit a heartbeat only when the parallel section of the kernel ends (i.e. at the end of the iteration). With respect to the standard Heartbeats framework, we added the possibility to emit not only the heartbeat signal but also the number of heartbeats to be sent with the signal. This allows one to specify how much computation has effectively been done in an iteration of the loop, so the final user can adopt a true high level metric that does not change with respect to how the input is partitioned across the iterations. Consider a kernel that has a throughput of Thr_k operations/s and that takes Δt seconds to process the N operations of one iteration. In the standard library a single heartbeat would be emitted, leading to a measured throughput of

  Thr_{standard} = \frac{1}{\Delta t} = \frac{1}{N} \times Thr_k = \frac{Thr_k}{N}

while with the proposed solution the heartbeat signals the computation of N units during the iteration, leading to a throughput of

  Thr_{proposed} = \frac{N}{\Delta t} = \frac{N}{N} \times Thr_k = Thr_k

Considering a video processing application, N might be a function of the resolution of the input video, while a given goal of 30 frames/s should not vary when the input size changes.
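The following self-contained sketch (with invented numbers, not library code) illustrates the effect: the single-heartbeat metric changes with the partitioning N, while the weighted metric always reports the kernel speed.

    #include <cstdio>
    #include <initializer_list>

    // Sketch: why weighting the heartbeat by the number of processed units
    // makes the reported metric independent of the input partitioning.
    static double throughput(double units, double dt) { return units / dt; }

    int main() {
        const double thr_k = 120.0;                 // kernel speed, operations/s
        for (double n : {60.0, 120.0, 240.0}) {     // units processed per iteration
            double dt = n / thr_k;                  // iteration duration
            std::printf("N=%6.0f  single beat: %6.2f  N beats: %6.2f ops/s\n",
                        n, throughput(1.0, dt), throughput(n, dt));
        }
        return 0;
    }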
The current implementation of the monitoring library exports three different values to the controller: the instant throughput (i.e. the throughput of the last iteration), the window throughput (i.e. the throughput over a set of iterations), and the global throughput (i.e. the throughput from the start of the program). When an application emits a heartbeat signal, the heartbeat value and the emission time are stored in a FIFO buffer whose length is defined by the programmer; this length corresponds to the length of the window used to compute the window throughput. Supposing a FIFO of length N, the three metrics at iteration i are computed as:

  Thr_{instant} = \frac{Heartbeats_i}{t_i - t_{i-1}}

  Thr_{window} = \frac{1}{N-1} \sum_{k=i-N+1}^{i} \frac{Heartbeats_k}{t_k - t_{k-1}}

  Thr_{global} = \frac{\sum_{k \in I} Heartbeats_k}{t_i - t_0}

where Heartbeats_k is the value emitted at iteration k, t_k its emission time, and I the set of all the iterations executed so far. These three metrics have different meanings: the instant throughput captures the actual performance of the application and plays a key role in building the performance model of the kernel; the window throughput can be used to monitor applications whose QoS consists in keeping a constant throughput throughout the execution; finally, the global throughput can be used to check whether an application is meeting its deadline.
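A minimal sketch of how such a monitor could be organized is shown below, assuming each sample pairs the emitted heartbeat count with its emission time in seconds; names and internal layout are illustrative, not the library's actual implementation.

    #include <cstddef>
    #include <deque>
    #include <utility>

    // Hedged sketch of the three metrics above.
    class ThroughputMonitor {
        std::deque<std::pair<double, double>> fifo; // (heartbeats, time), length N
        std::size_t window;                         // FIFO length N
        double t0;                                  // program start time
        double totalBeats = 0.0;
    public:
        ThroughputMonitor(std::size_t n, double startTime)
            : window(n), t0(startTime) {}

        void push(double beats, double t) {
            totalBeats += beats;
            fifo.emplace_back(beats, t);
            if (fifo.size() > window) fifo.pop_front();
        }
        // Throughput of the last iteration (requires at least two samples)
        double instant() const {
            std::size_t n = fifo.size();
            return fifo[n - 1].first / (fifo[n - 1].second - fifo[n - 2].second);
        }
        // Mean of the per-iteration throughputs in the window (>= 2 samples)
        double windowThr() const {
            double sum = 0.0;
            for (std::size_t k = 1; k < fifo.size(); ++k)
                sum += fifo[k].first / (fifo[k].second - fifo[k - 1].second);
            return sum / static_cast<double>(fifo.size() - 1);
        }
        // Throughput measured from the start of the program
        double global() const { return totalBeats / (fifo.back().second - t0); }
    };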
3.2 Model Update
To predict the right number of threads to use to execute a kernel while fulfilling the requirements, the library needs to know how the performance of the application scales with respect to the number of threads used. One approach followed by many state of the art solutions [7, 3] relies on offline profiling information to capture both single application performance and the interference caused by colocated workloads. However, this solution is unfeasible in our setting, since the number of possible application combinations increases exponentially. Our solution instead avoids offline profiling and relies only on online monitoring information collected by the heartbeat-like API, like other approaches [8, 9]. Since the library controls the number of threads assigned to the application, it has to build a model that estimates the performance of the kernel for each possible number of threads. The algorithm to train this model is rather intuitive and is based on a linear interpolation of the instant throughput monitored online. As discussed, at each iteration the monitor updates the value of the instant throughput, which represents the real performance the kernel can achieve in a given moment with the number of threads it used. The instant throughput values are collected in FIFO buffers, as was the case for the monitoring data, and the user can set the length of the FIFO. Listing 1 reports the updateModel() routine, which computes the new model on the basis of the last observations.

    // Performance history: thread count -> recent throughput measurements
    std::map<int, std::list<double>> threadPerformanceHistory;

    void LibThreadScale::updateModel()
    {
        std::map<int, double> tempThreadToPerformance;
        // Step 1: summarize the history of each monitored thread count
        // with an exponential moving average
        for (auto &v : threadPerformanceHistory) {
            tempThreadToPerformance[v.first] =
                exponentialMovingAverage(v.second, forgettingFactor);
        }
        // Step 2: interpolate the summarized points over [1, maxThreads]
        model.createModel(tempThreadToPerformance,
                          std::pair<int, int>(1, maxThreads));
    }

    Listing 1: Model update routine.

The routine is divided in two steps. The first one adjusts the performance estimation for the monitored points on the basis of the performance history. The performance history is encoded as a structure (a map) that connects each possible thread number configuration to a list of performance measurements. This step computes, for each entry in the performance history, an exponential moving average over the list of performance measurements. Depending on the value of the forgettingFactor parameter, the last values inserted in the FIFO may weigh more than the older ones, meaning that the latest observations are more meaningful for determining the current performance. This behavior is motivated by the fact that, if runtime conditions change, we want our model to converge quickly to the new estimation. After the estimation for the already monitored points is determined, the function creates the final model (model.createModel(...)) by linearly interpolating the just computed points over a specified interval (from 1 to the maximum number of threads that can be used). If only one value is present in the tempThreadToPerformance map, the point (0, 0) is used as the second point to build the linear interpolation function.
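The bodies of exponentialMovingAverage and of the interpolation behind model.createModel are not shown in the paper; the following sketch gives one plausible realization of the two steps just described, with assumed signatures.

    #include <iterator>
    #include <list>
    #include <map>

    // Assumed helper: EMA over a history ordered from oldest to newest; an
    // alpha (the forgettingFactor) close to 1 makes old samples fade quickly.
    static double exponentialMovingAverage(const std::list<double> &history,
                                           double alpha) {
        double ema = history.front(); // assumes a non-empty history
        for (auto it = std::next(history.begin()); it != history.end(); ++it)
            ema = alpha * (*it) + (1.0 - alpha) * ema;
        return ema;
    }

    // Assumed piecewise-linear model query: predict the throughput of an
    // untested thread count from the bracketing measured points, falling
    // back to (0, 0) below the first measurement, as the text describes.
    static double predictThroughput(const std::map<int, double> &points,
                                    int threads) {
        if (points.empty()) return 0.0;
        auto hi = points.lower_bound(threads);
        if (hi != points.end() && hi->first == threads) return hi->second;
        if (hi == points.end()) --hi;  // above the highest point: extrapolate
        double x0 = 0.0, y0 = 0.0;     // (0, 0) fallback
        if (hi != points.begin()) {
            auto lo = std::prev(hi);
            x0 = lo->first; y0 = lo->second;
        }
        double x1 = hi->first, y1 = hi->second;
        if (x1 == x0) return y1;
        return y0 + (y1 - y0) * (threads - x0) / (x1 - x0);
    }

For example, with measured points {(2, 10), (4, 18)} the query for 3 threads returns 14, the query for 5 threads extrapolates to 22, and the query for 1 thread interpolates toward (0, 0) and returns 5.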
3.3 API
The library exports two APIs that the programmer has to use to instrument an application. All the monitoring and model estimation complexity is hidden behind these APIs, minimizing the impact of the library on the target application. The APIs are the following: (1) setMinQoS sets a minimum QoS requirement (measured as a throughput) that the library tries to meet; (2) sendHeartbeat communicates to the library that a certain amount of data has been processed. Listing 2 illustrates the instrumentation process of the application. Only 3 lines of code have been added, meaning that the usage of the proposed library does not heavily impact the application code and programmability.

    LibThreadScale libQoS;
    libQoS.setMinQoS(MIN_QOS);
    for (i ...) {
        #pragma omp parallel for
        for (j ...) {
            // Kernel code here
        }
        libQoS.sendHeartbeat(NUM_HEARTBEATS);
    }

    Listing 2: Instrumented application.

Upon the execution of the sendHeartbeat function, the number of threads to use for the next execution is set automatically if OpenMP is used; this number is also returned to the caller in case a manual implementation of thread dispatching is needed, for instance when pthreads are used.
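For a pthread based application, the value returned by sendHeartbeat would drive the manual dispatching. A hedged sketch follows, where the worker body, the initial team size and the loop bounds are illustrative, and LibThreadScale and NUM_HEARTBEATS are those of Listing 2.

    #include <pthread.h>
    #include <vector>

    static void *worker(void *) {
        // process one chunk of the current iteration here (illustrative)
        return nullptr;
    }

    static void runLoop(LibThreadScale &libQoS, int iterations) {
        int nThreads = 1; // before the first heartbeat the model has no data
        for (int i = 0; i < iterations; ++i) {
            std::vector<pthread_t> team(nThreads);
            for (auto &t : team) pthread_create(&t, nullptr, worker, nullptr);
            for (auto &t : team) pthread_join(t, nullptr);
            // Emit the heartbeat and collect the suggested team size
            nThreads = libQoS.sendHeartbeat(NUM_HEARTBEATS);
        }
    }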
4. RESULTS
This section illustrates the experimental campaign carried out to test the functionalities and performance of the autonomic library proposed in this work. The tests have been performed on the Odroid XU3 development board, which features an ARM big.LITTLE processor mounting 4 Cortex-A15 and 4 Cortex-A7 cores. For the sake of clarity we limit the discussion to one case study; an interested reader can download the source code [6] and replicate the tests shown here on the other benchmarks we instrumented.

4.1 Case Study
The case study presented here is rather simple, but it is quite effective in illustrating the potential of the proposed solution. Consider the case where an embedded device, such as the one used here, is employed on cameras for surveillance purposes. Considering the hardware available on the device, it is quite natural to perform as much computation as possible on board instead of streaming the data to external workstations. The system has to perform the task of monitoring the environment and detecting movements; once movement is detected, the device performs a deeper analysis of the frames of interest, while continuing with the motion detection task. Since motion detection is critical to initiate the second phase and to trigger possible alarms, this task needs to maintain a constant performance of at least 30 frames per second (fps). The subsequent image analyses instead have lower priority, and it is acceptable for them to run at a lower performance, say 15 fps.

[Figure 2: Colocated applications managed by the Odroid scheduler — frames/s over time for Image Analysis 0, Image Analysis 1, Motion Detection, and the QoS requirement.]

[Figure 3: Colocated applications managed by our solution — same applications and axes as Figure 2.]

Figure 2 illustrates what happens in the system when the running applications rely on the Odroid native scheduler. As we can see, motion detection works at a higher performance than the one requested, but when the image analysis is requested its performance drops and stays below the desired limit for the whole time the image analysis task is running. On the contrary, Figure 3 illustrates how the proposed solution not only keeps the motion detection task in an acceptable working condition range when colocated with the image analysis tasks, but also keeps its performance near to the one requested while using fewer threads to run the task. This decision to scale the number of threads is taken automatically, and helps for instance to reconfigure this particular application when a new generation of the surveillance camera has to be put on the market. Furthermore, in this case, lowering the number of threads implies a reduction in the number of active cores in the system, and subsequently the power consumption and the overall temperature are reduced.

4.2 Adapting to Different System Loads
In this experiment we used an instance of the Black-Scholes benchmark, which runs for about one minute, and we forced three different loads on the CPU using the Linux stress and cpulimit utilities in combination. This experiment has the double goal of illustrating how the library reacts to varying system load and showing that the devised solution can be used alongside legacy applications without the need for them to be instrumented. We injected the load in the system by binding 8 stress workers to the 8 CPUs and then limiting the CPU time of these workers to the desired load. Each load was kept for 10 s, after which we left the system at rest (without any external load except for our benchmark) for 5 s; the loads used to stress the system were 25%, 50%, and 75%. The goal set for the application was 5 MOptions/s (i.e. the number of stock options to process in one second).

[Figure 4: Adaptation of the devised library to external system load — application throughput over time across the load phases, with the QoS requirement marked.]

Figure 4 illustrates the performance perceived by the application. Looking at the throughput perceived by the application, we see that the global throughput is always maintained above the required QoS threshold regardless of the injected system load. To achieve this result, a different number of threads is provisioned to the application at each iteration, so that under higher loads the library consistently uses a higher number of threads.

4.3 Library Overhead
The last part of this section shows the overhead introduced by the proposed library. To measure the overhead we instrumented our library to collect the execution times of the most interesting functions. We want to determine the impact of the library on the application, and for this reason we collected the execution times of the getThreadsToUse, sendHeartbeat and updateModel functions, which are executed when a heartbeat is emitted. Table 1 reports the average and standard deviation of the execution time measured over 4500 invocations of the library.

Table 1: Execution time for the most interesting function calls.
  Function         | Average [µs] | Std Deviation [µs]
  updateModel      |        64.86 |              45.01
  getThreadsToUse  |        81.82 |              51.36
  sendHeartbeat    |       140.14 |             105.02

As can be noticed from the table, the overhead is relatively small: about 140.14 µs on average for each iteration¹. Note that, since the actuation varies the number of threads, the iterations need to be quite long so as not to incur excessive overhead while spawning the threads. In our experiment, with an ideal kernel duration of 33.33 ms (corresponding to a requirement of 30 frames/s), the library introduced an overhead of 0.42% on average, which is negligible.

¹The three function calls in Table 1 are nested; only the last one has been taken into account to compute the overhead.
5. RELATED WORKS
Autonomic resource management has been a hot research topic over the last few years. The capability of a system to adapt to runtime-varying conditions has been explored in different contexts, ranging from embedded systems [10] to workstations [8] and cloud infrastructures using virtual machines [7]. Many works proposed resource management controllers with the goal of managing colocated applications, trying to optimize their performance [3] or other figures of merit such as performance per watt [11]. Different control techniques have been explored, and a fairly comprehensive summary can be found in [12]. To attain a good control, multiple types of actions have been investigated, such as CPU core affinity management [9], DVFS control [10], task mapping [13], CPU time allocation [14], and cache coloring [15]. Some of the related works rely on offline application profiling to determine the applications' characteristics (e.g. I/O or CPU bound) or how colocated workloads interfere with each other [7]. After the offline profiling phase, the online algorithm ensures a given QoS or maximizes a given metric (e.g. performance). However, this approach does not scale with the number of possible running applications; in general, online profiling solutions are better suited when the set of applications is not known in advance, as done by [9, 14, 12] and by the proposed approach.

Considering the resource controlled to attain a given QoS, the works most similar to ours control at runtime either the number of CPU cores or the CPU time assigned to applications in order to grant a given QoS [3, 9, 14]. All these solutions, as well as our work, take into account the interference caused by colocation among different applications when they estimate the resources to allocate. However, in their evaluations these works used a fixed number of threads per application and then constrained these threads to execute on a variable number of cores depending on the resource manager decision. This solution causes slowdown due to useless context switch overhead, which degrades performance even more when the context switch delays thread synchronization; in fact, a delay in executing one instruction from one thread might block another thread that is waiting on a synchronization point. In the same way, [3, 9] possibly cause slowdown at synchronization points because they change the thread mapping, which might be suboptimal with respect to the memory hierarchy architecture. Usually the operating system uses heuristics to optimize thread placement among cores, and changing the affinity manually rarely leads to improvements. The approach proposed in this paper changes the number of threads used at runtime by an application, avoiding context switches among threads of the same application and relying on the operating system heuristics for thread placement.
Finally, the last aspect to consider when comparing our approach to the state of the art is that the control policy of our solution is implemented inside each application running in the system. Most state of the art approaches [14, 3, 7] rely instead on a central entity that controls the resource assignment for all the applications running in the system. A central entity has the advantage that it can better manage the resources when the system is overloaded, allowing applications with higher priorities to get more resources; but in this situation the system is already not respecting the QoS, so this merely mitigates the fact that the system is failing its goal rather than being a real advantage. Embedding the control inside the single application has instead multiple advantages, while maintaining the same effectiveness when the system is not overloaded. A first advantage is that the overhead the application incurs is caused only by the application itself and not by external entities. Another important aspect is that a central entity can seldom be used with legacy applications running in the system, of which the controller is not aware; our approach instead does not have any issue with legacy applications and can be deployed in any system. Connected to this, we can argue that implementing the control inside the application allows a faster adoption of the autonomic control, since our approach does not need any change in the system (which might not belong to the final user), and the action taken by the autonomic library can be customized for each different application.

6. CONCLUSION
In this paper we presented an autonomic thread scaling library aimed at guaranteeing to an application a certain QoS expressed as a throughput. The library is implemented in userspace and compiled against the application, which has to be instrumented with a few function calls, and it seamlessly integrates with the OpenMP library. The proposed solution adapts at runtime the number of threads used to compute an iteration of a hot loop, in order to guarantee the desired QoS, exploiting a model of the scalability of the application with respect to the number of threads that is built online. A testing campaign illustrated how the library behaves under different working conditions and highlighted that the overhead introduced on the application is in the order of 140 µs, which is negligible; furthermore, the impact on programmability is very low, since the library requires the insertion of only 3 lines of code in the application. The library devised in this work is available at [6].

Acknowledgment
This work has been partially funded by the European Commission in the context of the FP7 SAVE project (#610996-SAVE).

7. REFERENCES
[1] N. Rajovic et al., "Supercomputing with commodity CPUs: are mobile SoCs ready for HPC?" in Proc. of the Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC). IEEE, 2013.
[2] L. Dagum et al., "OpenMP: an industry standard API for shared-memory programming," Computational Science & Engineering, IEEE, vol. 5, no. 1, 1998.
[3] A. Sharifi et al., "METE: meeting end-to-end QoS in multicores through system-wide resource management," in Proc. of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems. ACM, 2011.
[4] D. B. Bartolini et al., "Towards a performance-as-a-service cloud," in Proc. of the 4th annual Symposium on Cloud Computing. ACM, 2013.
[5] H. Hoffmann et al., "Application heartbeats: a generic interface for specifying program performance and goals in autonomous computing environments," in Proc. of the 7th international conference on Autonomic computing. ACM, 2010.
[6] G. Durelli, "Source code." [Online]. Available: https://bitbucket.org/durellinux/libthreadscale-code
[7] R. Nathuji et al., "Q-Clouds: managing performance interference effects for QoS-aware clouds," in Proc. of the 5th European conference on Computer systems. ACM, 2010.
[8] D. B. Bartolini et al., "The autonomic operating system research project: achievements and future directions," in Proc. of the 50th Annual Design Automation Conference. ACM, 2013.
[9] F. Sironi et al., "Metronome: operating system level performance management via self-adaptive computing," in Design Automation Conference (DAC), 2012 49th. IEEE, 2012.
[10] F. X. Lin et al., "K2: a mobile operating system for heterogeneous coherence domains," in Proc. of the 19th int. conf. on Architectural support for programming languages and operating systems. ACM, 2014.
[11] H. Hoffmann and M. Maggio, "PCP: A generalized approach to optimizing performance under power constraints through resource management," in 11th Int. Conf. on Autonomic Computing, 2014.
[12] H. Hoffmann, "SEEC: A general and extensible framework for self-aware computing," Massachusetts Institute of Technology, Tech. Rep., 2011.
[13] J. R. Wernsing and G. Stitt, "Elastic computing: a framework for transparent, portable, and adaptive multi-core heterogeneous computing," in ACM SIGPLAN Notices, vol. 45. ACM, 2010.
[14] D. B. Bartolini et al., "Automated fine-grained CPU provisioning for virtual machines," ACM Trans. on Architecture and Code Optimization, vol. 11, 2014.
[15] A. Sharifi et al., "Courteous cache sharing: Being nice to others in capacity management," in Design Automation Conf. (DAC), 2012 49th. IEEE, 2012.