=Paper= {{Paper |id=Vol-1643/paper-02 |storemode=property |title=A Scalable Black-Box Optimization System for Auto-Tuning VLSI Synthesis Programs |pdfUrl=https://ceur-ws.org/Vol-1643/paper-02.pdf |volume=Vol-1643 |authors=Matthew M. Ziegler,Hung-Yi Liu,George Gristede,Bruce Owens,Ricardo Nigaglioni,Luca Carloni |dblpUrl=https://dblp.org/rec/conf/date/ZieglerLGONC16a }} ==A Scalable Black-Box Optimization System for Auto-Tuning VLSI Synthesis Programs== https://ceur-ws.org/Vol-1643/paper-02.pdf
Proceedings of 1st Workshop on Resource Awareness and Application Autotuning in Adaptive and Heterogeneous Computing (RES4ANT) 2016




                                  A Scalable Black-Box Optimization System for
                                     Auto-Tuning VLSI Synthesis Programs

                                 Matthew M. Ziegler1, Hung-Yi Liu2*, George Gristede1, Bruce Owens3,
                                             Ricardo Nigaglioni4, and Luca P. Carloni2

                                       1 IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
                                                {zieglerm, gristede}@us.ibm.com
                                       2 Department of Computer Science, Columbia University, NY, USA
                                                 {hungyi,luca}@cs.columbia.edu}
                                           3 IBM Systems & Technology Group, Rochester, MN, USA
                                                           browens@us.ibm.com
                                             4 IBM Systems & Technology Group, Austin, TX, USA
                                                          nricardo@us.ibm.com

                                 Abstract. Modern logic and physical synthesis tools provide numerous options
                                 and parameters that can drastically impact design quality; however the large
                                 number of options leads to a complex design space difficult for human circuit
                                 designers to navigate. We tackle this parameter tuning problem with a novel
                                 system employing intelligent search strategies and parallel computing, thus au-
                                 tomating one of the key design tasks conventionally performed by a human de-
                                 signer. We provide an overview of this system, called SynTunSys, as well as
                                 results from employing it during the design of the IBM z13 22nm high-
                                 performance server chip, currently in production. During this major processor
                                 design, SynTunSys provided significant savings in human design effort and
                                 achieved a quality of results beyond what human designers alone could achieve,
                                 yielding on average a 36% improvement in total negative slack and a 7% power
                                 reduction.1

                                 Keywords: synthesis · design space exploration · parameter tuning · optimiza-
                                 tion · VLSI design · design methodology

                      1           Introduction
                      The design of modern high-performance processors is a quest to optimally tune and
                      balance multiple objectives, such as performance, power, and reliability. This multi-
                      objective design space is further complicated by the need for more complex VLSI
                      (very-large-scale integration) chips to fuel the ever increasing desire for more com-
                      pute power. To cope with this design complexity, the VLSI design community has
                      leveraged CAD (computer-aided design) tools for many decades now; however, the
                      high flexibility and sophistication of advanced synthesis tools increases their com-
                      plexity and makes navigating the design space difficult and sometimes non-intuitive
                      for their users.

                      *
                          Hung-Yi Liu is now with the Intel Design Technology & Solutions Group, Hillsboro, OR, USA.
Proceedings of 1st Workshop on Resource Awareness and Application Autotuning in Adaptive and Heterogeneous Computing (RES4ANT) 2016




                            Fig. 1. An example of the available design space by modifying synthesis parameters.

                         The industrial synthesis tool-flow we employ has over 1000 parameters [1]. These
                      parameters span the logic and physical synthesis space and the control settings for
                      modifying the synthesis steps, such as: logic decomposition, technology mapping,
                      placement, estimated wire optimization, power recovery, area recovery, and/or higher
                      effort timing improvement. The parameters also vary in data type (Boolean, integer,
                      floating point, and string). Considering that an exhaustive search of only 20 Boolean-
                      type parameters leads to over one million combinations, and synthesis runs may take
                      several hours or even days, it is clear that intelligent search strategies are required.
                         As an example of the wide design space available from modifying synthesis pa-
                      rameters, Fig. 1 shows the scatter plot of achievable design points for a portion of a
                      synthesized floating-point multiplier macro. A macro may span from 1K to 1M gates
                      in our context. Each point denotes the timing and power values achieved simply by
                      tuning the input parameters of the synthesis program. The ultimate goal of this pro-
                      cess is to reach timing closure at the lowest achievable power. Quite often the default
                      values for the parameters are not ideal for a specific macro, which would benefit in-
                      stead from parameter customization. Fig. 1 also highlights three scenarios (A, B, and
                      C) along the Pareto frontier. These scenarios show the available tradeoffs between
                      timing closure and power reduction, e.g., point A closes timing with a 9% power re-
                      duction, whereas point C improves timing by 55% with a 29% power reduction.
                      These points along the Pareto set provide a number of potential steps towards the
                      ultimate goal, depending on the additional techniques at the designer’s disposal be-
                      yond parameter tuning. This example of a relatively simple macro underscores how
                      significantly the parameters settings can affect a design.

                      2      Related Work
                      The synthesis parameter tuning problem we address can be classified as a black-box
                      optimization problem, i.e., we treat the synthesis program as black-box software by
                      supplying input conditions (input data and parameter settings) and measuring the
Proceedings of 1st Workshop on Resource Awareness and Application Autotuning in Adaptive and Heterogeneous Computing (RES4ANT) 2016




                          Fig. 2. Architecture of the SynTunSys process, which employs a parallel and iterative
                          tuning process to optimize macros.

                      output response in terms of synthesis quality of results (QoR). Black-box problems
                      are often approached using techniques from the field of simulation optimization [2],
                      which is an umbrella term for optimization techniques that operate in the absence in
                      of an algebraic model of the system. Considering each macro exhibits a unique input-
                      output response to synthesis parameter settings and digital logic can take on an intrac-
                      table number of functionalities, the synthesis tool-flow of our focus is far too complex
                      to be modelled algebraically. Furthermore, our synthesis program often exhibits non-
                      deterministic behavior, which is also a characteristic of many simulation optimization
                      problems.
                                Black-box optimization techniques can also be employed for design-space
                      exploration (DSE) purposes, however, unlike convention DSE, the goal of black-box
                      optimization is often to find one or more optimal or near-optimal design points with-
                      out necessarily requiring a complete exploration of the design space to determine the
                      whole Pareto frontier of design tradeoff points.
                         Black-box optimization is a common problem seen across a number of fields, e.g.,
                      compiler tuning [3] and software engineering [4]. With respect to VLSI design, DSE
                      is becoming a more attractive solution for complex problems across various levels of
                      abstraction. At the architectural level, many DSE studies based on models or simula-
                      tors have been used to explore multi-objective design spaces, e.g., [5]. Architectural-
                      level studies, however, typically do not result in implemented designs. DSE ap-
                      proaches have also been applied in combination with high-level synthesis by a num-
                      ber of researchers, e.g., [6,7]. FPGA synthesis parameter tuning has been reported in
                      [8] using genetic algorithms and in [9] using Bayesian optimization. However, prior
                      to our work [10], we know of no publications targeting automated parameter tuning
                      for logic and physical synthesis.

                      3        System Architecture
                      The framework of our parameter tuning system is shown in Fig. 2. SynTunSys con-
                      sists of a main tuning loop that constructs synthesis scenarios consisting of synthesis
                      parameter settings (Step (1)), submits and monitors synthesis jobs (2-3), analyzes the
Proceedings of 1st Workshop on Resource Awareness and Application Autotuning in Adaptive and Heterogeneous Computing (RES4ANT) 2016




                      Table 1.   Average parameter tuning improvements over best known prior solution for a
                      22nm processor in production.

                                     Pre / Post            Latch-     Total
                                    SynTunSys           to-Latch     Negative       Total Power
                                    Comparison            Slack       Slack

                                 Improvement %              60%          36%               7%
                                Sum of 200 macros           (ps)          (ps)         (arb. units)
                                   pre-tuning              -1929       -2150385           17770
                                   post-tuning              -765       -1370731           16508

                      results (4), and iteratively refines the solutions (5). A second background loop ar-
                      chives the results of all runs from all macros, users, and projects, which can be mined
                      for historical trends across multiple macros and to provide feedback in terms of the
                      performance of synthesis parameters.
                         The SysTunSys cost function conveys the optimization goals. It converts multiple
                      design metrics into a single cost number, allowing cost ranking of scenarios. Exam-
                      ples of available metrics include: multiple timing metrics, power consumption, con-
                      gestion metrics, area utilization, electrical violations, runtime, etc. The selected met-
                      rics are assigned weights to signify their relative importance. The overall cost func-
                      tion is then a “normalized weighted sum” of the m selected metrics, expressed by
                      Equation (1) where Norm(Mi) is the normalized Mi across all the scenario results in a
                      SynTunSys run.


                                                                                (1)
                         The SynTunSys decision engine algorithms are key components of the system that
                      determine which scenarios should be run at each iteration, i.e., the decision engine
                      tackles the parameter tuning black-box optimization problem discussed previously.
                      The decision engine can also be upgraded independently and we constantly look to
                      improve these algorithms in terms of QoR prediction accuracy and compute efficien-
                      cy. Similar black-box tuning problems have been approached using a number of tech-
                      niques, e.g., machine learning [3], Markov decision processes [11] and Bayesian op-
                      timization [9,12]. During the development of SynTunSys we have explored algo-
                      rithms ranging from pseudo-genetic algorithms to on-line adaptive learning [13].

                      4      SynTunSys Results
                      SynTunSys was used during the design of the IBM z13 22nm server processor [14].
                      The processor underwent two chip releases (tapeouts) over a multi-year design cycle,
                      during which SynTunSys was applied to macros over both releases. The chip consists
                      of a few hundred macros that average around 30K gates in size, with larger macros in
                      the 300K gate range. Prior IBM server processors have also used systematic parame-
                      ter tuning on a smaller scale. The IBM POWER7+ [15] and POWER8 [16] processors
                      employed an earlier version of SynTunSys during the second chip releases, but main-
                      ly for power reduction purposes. In the case of the z13 processor, however, Syn-
Proceedings of 1st Workshop on Resource Awareness and Application Autotuning in Adaptive and Heterogeneous Computing (RES4ANT) 2016




                      TunSys was employed during the entire design cycle for timing closure, power reduc-
                      tion, and improving macro routability.
                         Based on the efforts of a dedicated tuning team we were able to track SynTunSys
                      results on approximately 200 macros from the processor core. Table 1 shows the av-
                      erage improvements achieved by SynTunSys over the best solution previously
                      achieved by the macro owners.
                         Note that these results are based on the routed macro timing and power analysis; in
                      most cases the best known prior solutions included manual parameter tuning by the
                      macro owner. SynTunSys resulted in a 36% improvement in total negative slack, a
                      60% improvement in worst latch-to-latch slack (macro internal slack), and a 7% pow-
                      er reduction. The actual values of the metrics, summed across all the macros, under-
                      scores that the changes in the absolute numbers were significant, e.g., ~780,000 pico-
                      seconds of total negative slack was saved across ~200 macros.

                      5      References
                      [1] L. Trevillyan, et al., “An Integrated Environment for Technology Closure of Deep-
                           Submicron IC Designs,” IEEE Design & Test of Computers, vol. 21:1, pp. 14-22, 2004.
                      [2] S. Amaran, et al., “Simulation Optimization: A Review of Algorithms and Applications,”
                           4OR - A Quarterly Journal of Operations Research, Dec. 2014.
                      [3] G. Fursin, et al., "Milepost GCC: Machine Learning Enabled Self-tuning Compiler," In-
                           ternational Journal Parallel Programming, 39:296-327, 2011.
                      [4] A. Arcuri, G. Fraser, "Parameter Tuning or Default Values? An Empirical Investigation in
                           Search-Based Software Engineering," Empirical Software Engineering, June 2013, Vol-
                           ume 18, Issue 3.
                      [5] O. Azizi, et al., "An Integrated Framework for Joint Design Space Exploration of Microar-
                           chitecture and Circuits," DATE 2010.
                      [6] S. Xydis, et al., "A Meta-Model Assisted Coprocessor Synthesis Framework for Compil-
                           er/Architecture Parameters Customization," DATE 2013.
                      [7] H.-Y. Liu and L. P. Carloni, “On Learning-Based Methods for Design-Space Exploration
                           with High-Level Synthesis,” DAC 2013.
                      [8] M. K. Papamichael, P. Milder, J. C. Hoe, "Nautilus: Fast Automated IP Design Space
                           Search Using Guided Genetic Algorithms," DAC 2015.
                      [9] N. Kapre, et al., “Driving Timing Convergence of FPGA Designs through Machine Learn-
                           ing and Cloud Computing,” FCCM 2015.
                      [10] M. M. Ziegler, H.-Y. Liu, G. Gristede, B. Owens, R. Nigaglioni, L. P. Carloni, “A Synthe-
                           sis-Parameter Tuning System for Autonomous Design-Space Exploration,” DATE 2016.
                      [11] G. Beltrame et al., “Decision-Theoretic Design Space Exploration of Multiprocessor Plat-
                           forms,” IEEE TCAD, 29(7):1083–1095, July 2010.
                      [12] Z. Wang et al., “Bayesian Optimization in High Dimensions via Random Embeddings,”
                           Int’l Joint Conf. on Artificial Intelligence (IJCAI-13), 2013.
                      [13] M. M. Ziegler, H.-Y. Liu, L. P. Carloni, “Scalable Auto-Tuning of Synthesis Parameters
                           for Optimizing High-Performance Processors,” ISLPED 2016.
                      [14] J. D. Warnock, et al., “22nm Next-Generation IBM System z Microprocessor,” ISSCC
                           2015.
                      [15] M. M. Ziegler, G. D. Gristede, V. V. Zyuban, “Power Reduction by Aggressive Synthesis
                           Design Space Exploration,” ISLPED 2013.
                      [16] M. M. Ziegler, et al., “POWER8 Design Methodology Innovations for Improving Produc-
                           tivity and Reducing Power,” CICC 2014.