Optimal Implementation of Power Saving
           Techniques in CGR Systems

                                   Tiziana Fanni

                      Università degli Studi di Cagliari, DIEE
                          tiziana.fanni@diee.unica.it


      Abstract. Coarse-Grained Reconfigurable (CGR) architectures com-
      bine high performance with flexibility, allowing the execution of a large
      set of applications over the same substrate. However, they are also re-
      quired to be energy efficient. This work focuses on a methodology to
      identify which parts of a CGR architecture may benefit from the ap-
      plication of power saving techniques, guiding the designers towards an
      optimal implementation of clock gated and power gated designs.

      Keywords: Reconfigurable Systems · Power Saving · Power Gating ·
      Clock Gating · Power Modeling · Datapath Merging.


1   Motivation and Background

Nowadays, small portable devices are required to execute multiple sophisticated
functions efficiently. Reconfigurable systems [1] are a suitable solution for
these conflicting requirements. In particular, Coarse-Grained Reconfigurable
(CGR) systems offer high performance with a certain degree of flexibility. In
CGR systems all the resources belonging to the possible configurations are
instantiated in the substrate and multiplexed in time. CGR systems therefore
offer fast reconfiguration, at the cost of the power consumed by the resources
that are present in the substrate but not involved in the active configuration.
    Power consumption in digital devices is computed as in Equation 1, where:
1) $P_{lkg}$ is the static power dissipation caused by leakage currents, consumed
even while no circuit activity is present; 2) $P_{int}$ is the dynamic power
consumption, mainly due to the cell switching activity; 3) $P_{net}$ is again a
dynamic term, related to the interconnection. With technologies below 90 nm,
designers are required to minimize both the static ($P_{lkg}$) and the dynamic
($P_{int} + P_{net}$) terms.

$$
P = P_{lkg} + P_{int} + P_{net} \tag{1}
$$

   A popular technique to minimize dynamic power consumption is clock gating
(CG). It consists in switching off the clock of the resources not involved in
the computation, and it is applied automatically, at gate level, by commercial
synthesizers. Power gating (PG), on the other hand, shuts off the power supply
of unused logic, thus acting also on static consumption. PG requires the
instantiation of several additional resources (i.e., sleep transistors to switch
the power supply on and off, isolation cells to avoid the transmission of
spurious signals, and state retention logic to maintain the internal state of
the gated region), and the rules to manage them are quite complicated; for this
reason commercial tools do not apply it automatically.
    In a CGR architecture, considering the minimum set of disjoint Logic Regions
(LRs), each composed of processing elements that are always active or inactive
together, CG or PG can be applied to all of them. However, while CG requires
only one AND gate per region to switch off the clock tree, PG has a much higher
cost due to the additional logic, and the power overhead may easily exceed the
power saved by switching off the unused resources. The research presented in
this paper studied an automatic system-level analysis and implementation flow
that estimates in advance the cost of applying CG and PG in a CGR system,
leading to the identification of those LRs that can benefit from these power
saving techniques.


2   Methods and Algorithm
The proposed model estimates $P_{lkg}$ and $P_{int}$ of Equation 1 when PG
(Equation 2) or CG (Equation 3) is applied at LR level. Both equations are
composed of two terms: 1) $P_{lkg/int_{ON}}(LR_i)$, the consumption within the
considered LR; 2) $Ext\_Over_{lkg/int}(LR_i)$, the consumption due to the logic
inserted outside the LR.

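$$
\begin{aligned}
P_{lkg/int}(LR_i) &= P_{lkg/int_{ON}}(LR_i) + Ext\_Over_{lkg/int}(LR_i) = \\
&= \sum_{actors \in LR_i} \Big[ P_{lkg/int}(cmb) + P_{lkg/int}(RC) \cdot \#rtn
   + P_{lkg/int}(reg) \cdot (\#reg - \#rtn)/\#reg \Big] \cdot T_{i_{ON}} + \\
&+ \big[ P_{lkg/int}(ISO_{ON}) \cdot T_{i_{ON}} + P_{lkg/int}(ISO_{OFF}) \cdot T_{i_{OFF}} \big] \cdot \#iso + \\
&+ \big[ P_{lkg/int}(Contr_{ON}) \cdot T_{i_{ON}} + P_{lkg/int}(Contr_{OFF}) \cdot T_{i_{OFF}} \big] + \\
&+ \big[ P_{lkg/int}(CG_{ON}) \cdot T_{i_{ON}} + P_{lkg/int}(CG_{OFF}) \cdot T_{i_{OFF}} \big]
\end{aligned}
\tag{2}
$$

$$
\begin{aligned}
P_{lkg/int}(LR_i) &= P_{lkg/int_{ON}}(LR_i) + Ext\_Over_{lkg/int}(LR_i) = \\
&= \sum_{actors \in LR_i} \big[ P_{lkg/int}(cmb) + P_{lkg/int}(reg) \cdot T_{i_{ON}} \big] + \\
&+ \big[ P_{lkg/int}(CG_{ON}) \cdot T_{i_{ON}} + P_{lkg/int}(CG_{OFF}) \cdot T_{i_{OFF}} \big]
\end{aligned}
\tag{3}
$$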

In particular, the equations include the following contributions:

– Combinational Logic [$P_{lkg/int}(cmb) \cdot T_{i_{ON}}$ for PG and
  $P_{lkg/int}(cmb)$ for CG]: sum of the contributions of the combinational
  cells, weighted by the activation time $T_{i_{ON}}$ in the PG case. CG
  switches off only the clock tree; therefore, combinational logic is always
  active when CG is applied.
– Sequential Logic [$P_{lkg/int}(reg) \cdot (\#reg - \#rtn)/\#reg \cdot T_{i_{ON}}$
  for PG and $P_{lkg/int}(reg) \cdot T_{i_{ON}}$ for CG]: in the PG case this
  term considers only those registers inside the LR (#reg in total) that are
  not replaced by retention cells. CG has no effect on the static power; thus,
  when it is applied, only the internal contribution is multiplied by
  $T_{i_{ON}}$ ($P_{lkg}(reg)$ and $P_{int}(reg) \cdot T_{i_{ON}}$).
– Retention Cells [$P_{lkg/int}(RC) \cdot \#rtn \cdot T_{i_{ON}}$]: this term
  considers the retention cells (#rtn) inserted to preserve the state of some
  registers when PG is applied. $P_{lkg/int}(RC)$ is the consumption of a
  single retention cell.
– Isolation Cells [$[P_{lkg/int}(ISO_{ON}) \cdot T_{i_{ON}} +
  P_{lkg/int}(ISO_{OFF}) \cdot T_{i_{OFF}}] \cdot \#iso$]: in the PG case,
  isolation cells are inserted, and their dissipation is proportional to their
  overall number.
– Clock Gating Cells [$P_{lkg/int}(CG_{ON}) \cdot T_{i_{ON}} +
  P_{lkg/int}(CG_{OFF}) \cdot T_{i_{OFF}}$]: clock gating cells are used in
  both the PG and CG cases. In the PG case they are required for the proper
  operation of the retention cells.
– Power Controller [$P_{lkg/int}(Contr_{ON}) \cdot T_{i_{ON}} +
  P_{lkg/int}(Contr_{OFF}) \cdot T_{i_{OFF}}$]: inserted to properly drive the
  enable signals to the power saving logic.
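
The contributions listed above can be sketched in a few lines of Python. This is
only an illustrative sketch, not the MDC implementation: the names `Actor`,
`Overhead`, `pg_power` and `cg_power` are hypothetical, #reg and #rtn are
assumed to be available per actor, and all power figures are assumed to be
expressed in the same unit for a single term (either leakage or internal).

```python
from dataclasses import dataclass


@dataclass
class Actor:
    """Power figures of one actor for a single term (lkg or int). Hypothetical type."""
    p_cmb: float    # combinational cells of the actor
    p_reg: float    # all the registers of the actor
    n_reg: int      # number of registers (#reg), assumed per actor
    n_rtn: int = 0  # registers replaced by retention cells (#rtn)


@dataclass
class Overhead:
    """ON/OFF consumption of a block inserted outside the LR. Hypothetical type."""
    on: float
    off: float

    def weighted(self, t_on: float, t_off: float) -> float:
        # P(X_ON) * T_ON + P(X_OFF) * T_OFF
        return self.on * t_on + self.off * t_off


def pg_power(actors, t_on, t_off, p_rc, iso, n_iso, contr, cg):
    """Equation 2: estimated consumption of a power gated LR."""
    inside = 0.0
    for a in actors:
        # Registers that are not replaced by retention cells
        regs = a.p_reg * (a.n_reg - a.n_rtn) / a.n_reg if a.n_reg else 0.0
        inside += (a.p_cmb + p_rc * a.n_rtn + regs) * t_on
    outside = (iso.weighted(t_on, t_off) * n_iso
               + contr.weighted(t_on, t_off)
               + cg.weighted(t_on, t_off))
    return inside + outside


def cg_power(actors, t_on, t_off, cg, leakage=False):
    """Equation 3: estimated consumption of a clock gated LR.

    CG does not affect static power, so for the leakage term the register
    contribution is not weighted by t_on.
    """
    reg_weight = 1.0 if leakage else t_on
    inside = sum(a.p_cmb + a.p_reg * reg_weight for a in actors)
    return inside + cg.weighted(t_on, t_off)
```

With made-up figures (one actor with `p_cmb=2.0`, `p_reg=4.0`, four registers,
one of which is a retention cell, and an LR active half of the time), the two
functions make it easy to see how quickly the external PG logic (isolation
cells, controller) can erode the saving obtained inside the region.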

    This estimation model has been embedded in the automatic design flow for
CGR systems offered by the Multi-Dataflow Composer (MDC) tool [2]. MDC handles
the automatic composition and deployment of CGR systems, starting from the
high-level specification of the kernels to be executed, represented as networks
[3]. MDC also offers other functionalities: 1) a structural profiler that
performs the design space exploration of the implementable multi-functional
systems to determine the optimal CGR substrate [4]; 2) a power manager that
partitions the multi-functional network, identifying the minimum set of
disjoint LRs, and applies to all of them either CG [5] or PG [6] (generating an
ad-hoc Common Power Format (CPF) file to specify the power intent early in the
design [7]); 3) a rapid prototyper that embeds the CGR system into a
Xilinx-compliant IP [8, 9]. The MDC power manager has been extended to analyze
the design, exploiting the estimation flow, and to identify which regions may
benefit most from the application of PG and CG [10, 11].

[Figure: block diagram of the flow. Recoverable block labels: NETs, Datapath
Merging (Merging Process), CGR NET, Baseline HDL Generation, HDL, Scripts,
Synthesizer, Reports, Power Analysis, Logic Regions Identification, CG/PG HDL
Generation, Power Saving Application.]

                                  Fig. 1. Proposed analysis and implementation flow.

    Fig. 1 shows the complete analysis and implementation flow. MDC derives the
HDL code of the baseline CGR system and provides the scripts to perform the
synthesis and all the simulations needed for the back-annotation of the system.
Commercial tools are used to produce the power reports of the baseline system,
which are fed back to MDC. MDC identifies the LRs and estimates the number of
isolation cells (#iso) at dataflow level by analyzing the connections between
the different LRs. These data, together with the information gathered from the
reports of a single synthesis run of the baseline CGR system netlist, are used
to estimate the power consumption of the identified LRs (the consumption of the
additional power saving logic, i.e., isolation and retention cells, is
estimated by characterizing the adopted technology libraries).
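
As an illustration of how #iso could be derived at dataflow level, the sketch
below counts the bits of every connection that leaves an LR towards a different
region. The text only states that the connections between LRs are analyzed, so
the counting rule (one isolation cell per outgoing bit), the function name
`estimate_iso` and the data layout are assumptions made for this example.

```python
from collections import defaultdict


def estimate_iso(region_of, connections):
    """Estimate #iso per LR from the dataflow connections.

    region_of:   dict mapping an actor name to its LR; actors that are
                 missing are treated as always-on logic.
    connections: iterable of (source_actor, target_actor, width_in_bits).

    Assumption: one isolation cell is needed for each bit that leaves a
    gated LR towards another region or towards always-on logic.
    """
    iso = defaultdict(int)
    for src, dst, width in connections:
        src_lr = region_of.get(src)
        dst_lr = region_of.get(dst)
        if src_lr is not None and src_lr != dst_lr:
            iso[src_lr] += width
    return dict(iso)


# Hypothetical network: actors a and b belong to LR1, actor c to LR2.
region_of = {"a": "LR1", "b": "LR1", "c": "LR2"}
connections = [
    ("a", "b", 8),     # internal to LR1: no isolation needed
    ("b", "c", 16),    # LR1 -> LR2
    ("c", "out", 4),   # LR2 -> always-on logic
    ("in", "a", 8),    # always-on logic -> LR1: the source is never gated
]
print(estimate_iso(region_of, connections))
```

An output that fans out to several targets is counted once per connection here;
a real flow would probably count each driven signal only once.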
The power estimation flow analyzes the LRs according to Algorithm 1:

– Area evaluation: the PG overhead may affect the power consumption of the
  smaller LRs. Thus, LRs that are too small are evaluated only for CG
  application. The threshold value is set by the user.
– PG evaluation: the power variations due to PG and CG application are
  estimated. If PG is more convenient than CG, the LR is selected as a
  candidate for PG. Otherwise, CG is evaluated.
– CG evaluation: the power variation due to CG application is estimated. If CG
  does not save any power, the LR is discarded and its logic is included in the
  always-on domain.


    Algorithm 1: Power saving strategy selection for CGR systems.
     PG_set is empty;
     CG_set is empty;
     foreach LRi in set LRs do
         evaluate_area(LRi, area_th);
     end

     function: evaluate_area(LRi, area_th):
     calculate LRi area;
     if area_LR > area_th then
         evaluate_PG(LRi);
     else
         evaluate_CG(LRi);
     end

     PG evaluation: evaluate_PG(LRi):
     estimate PG_total_overhead;
     if PG_total_overhead < 0 then
         estimate CG_total_overhead;
         if PG_total_overhead < CG_total_overhead then
             add LRi to PG_set;
         else
             add LRi to CG_set;
         end
     else
         evaluate_CG(LRi);
     end

     CG evaluation: evaluate_CG(LRi):
     estimate CG_total_overhead;
     if CG_total_overhead < 0 then
         add LRi to CG_set;
     end
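
A minimal Python sketch of the selection strategy of Algorithm 1 follows. It is
not the MDC code: the function name `select_strategy` and its callback
parameters are hypothetical, and the example feeds it the estimated FFT values
of Table 1 with an arbitrary area threshold of 1%. The entries marked N.A. in
the table are modelled here as a CG overhead of 0.0 (no variation), which is an
assumption.

```python
def select_strategy(areas, area_th, pg_overhead, cg_overhead):
    """Split the LRs into a PG set and a CG set.

    areas:       dict mapping each LR to its area.
    area_th:     user-defined area threshold.
    pg_overhead: callable returning the estimated power variation with PG.
    cg_overhead: callable returning the estimated power variation with CG.
    A negative variation means that power is saved.
    """
    pg_set, cg_set = [], []

    def evaluate_cg(lr):
        # LRs that do not save power with CG stay in the always-on domain
        if cg_overhead(lr) < 0:
            cg_set.append(lr)

    def evaluate_pg(lr):
        pg = pg_overhead(lr)
        if pg < 0:
            if pg < cg_overhead(lr):
                pg_set.append(lr)
            else:
                cg_set.append(lr)
        else:
            evaluate_cg(lr)

    for lr, area in areas.items():
        if area > area_th:
            evaluate_pg(lr)
        else:
            evaluate_cg(lr)  # too small: only CG is considered
    return pg_set, cg_set


# Estimated values of the FFT test case (Table 1); N.A. is modelled as 0.0.
area = {"LR1": 62.48, "LR2": 0.65, "LR3": 15.8, "LR4": 0.44,
        "LR5": 0.45, "LR6": 0.43, "LR7": 7.92, "LR8": 1.36}
cg_est = {"LR1": -5.65, "LR2": -1.42, "LR3": -1.29, "LR4": 0.0,
          "LR5": 0.0, "LR6": 0.0, "LR7": -0.39, "LR8": -0.12}
pg_est = {"LR1": -25.22, "LR2": -0.54, "LR3": -6.28, "LR4": 0.01,
          "LR5": 0.11, "LR6": 0.14, "LR7": -1.92, "LR8": 1.89}

pg_set, cg_set = select_strategy(area, 1.0, pg_est.get, cg_est.get)
print("PG:", pg_set)  # PG: ['LR1', 'LR3', 'LR7']
print("CG:", cg_set)  # CG: ['LR2', 'LR8']
```

With this threshold LR4, LR5 and LR6 are left in the always-on domain, since
they are below the area threshold and purely combinational.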

Table 1. FFT and Zoom test cases. Area rows report the percentage area of each LR
with respect to the total area. The other rows report the percentage power variation
(both estimated and real) of the design when CG or PG is applied to the considered
LR. N.A. refers to purely combinational LRs, where CG is not applied.

            LR1    LR2    LR3    LR4    LR5    LR6    LR7    LR8    LR9   LR10   LR11   LR12   LR13
FFT targeting 90 nm
area      62.48   0.65   15.8   0.44   0.45   0.43   7.92   1.36      –      –      –      –      –
CG real   -5.64  -1.42  -1.29   N.A.   N.A.   N.A.  -0.39  -0.12      –      –      –      –      –
CG est    -5.65  -1.42  -1.29   N.A.   N.A.   N.A.  -0.39  -0.12      –      –      –      –      –
PG real  -25.37  -0.59  -6.22  -0.67   0.14   0.18  -1.92   1.78      –      –      –      –      –
PG est   -25.22  -0.54  -6.28   0.01   0.11   0.14  -1.92   1.89      –      –      –      –      –
Zoom targeting 90 nm
area       6.39  34.39   6.44   2.00   0.65   5.01   1.86  13.13   3.40   5.18   3.95   5.58   6.92
CG real   -6.31 -23.37  -6.51  -0.96   N.A.  -0.87  -1.03  -6.98  -2.44  -5.48  -3.28  -4.27   0.65
CG est    -6.30 -23.36  -6.50  -0.95   N.A.  -0.87  -1.03  -6.98  -2.43  -5.47  -3.32  -4.26  -0.64
PG real   -6.33 -23.46  -6.54  -0.95   0.16  -0.82  -1.03  -7.05  -2.45  -5.50  -3.29  -4.28  -0.61
PG est    -6.32 -23.46  -6.53  -0.95   0.15  -0.81  -1.02  -7.06  -2.44  -5.49  -3.34  -4.27  -0.62
Zoom targeting 45 nm
area       6.46  34.00   6.40   2.07   0.71   5.19   1.85  13.00   3.36   5.04   3.87   5.53   6.95
CG real   -5.50 -22.02  -6.26  -0.91   N.A.  -0.83  -0.93  -6.71  -2.34  -5.26  -3.20  -4.16  -0.62
CG est    -5.50 -22.06  -6.27  -0.91   N.A.  -0.83  -0.93  -6.72  -2.35  -5.27  -3.25  -4.17  -0.62
PG real   -5.69 -22.98  -6.56  -0.94   0.21  -0.74  -0.95  -7.46  -2.48  -5.47  -3.36  -4.33  -0.58
PG est    -5.68 -22.99  -6.56  -0.94   0.19  -0.74  -0.95  -7.46  -2.47  -5.47  -3.42  -4.32  -0.58


3   Methodology Evaluation

To assess the proposed methodology, two different applications are considered:
an FFT application and a computing core accelerating a zoom application. The
Fast Fourier Transform (FFT) [12] is an optimised algorithm for the calculation
of the Discrete Fourier Transform (DFT); a DFT of size 2 (radix-2) is called a
butterfly. The adopted use case involves a CGR radix-2 FFT of size 8, where 4
different configurations are available, using respectively: 1) 12 butterflies;
2) 4 butterflies; 3) 2 butterflies; 4) 1 butterfly. In this design MDC
identified 8 LRs. The Zoom coprocessor is composed of seven computational
kernels: 1) absolute value calculation; 2) bilevel/grayscale block checking;
3) linear combination calculation; 4) cubic filter convolution; 5) median
calculation; 6) maximum/minimum finding; 7) edge block checking. These kernels
have been modelled as networks and implemented over a CGR hardware accelerator.
In this CGR coprocessor MDC identified 13 LRs. To validate the
cross-application and cross-technology effectiveness of the proposed estimation
model, this section presents the assessment of three implementations: FFT
targeting a 90 nm CMOS technology; Zoom targeting a 90 nm CMOS technology; Zoom
targeting a 45 nm CMOS technology.
    These designs have been synthesized using Cadence RTL Compiler; the
generated post-synthesis simulation reports have then been fed back to the MDC
tool to estimate the power consumption of the gated LRs by applying Equations 2
and 3. In order to assess the accuracy of the proposed estimation models, one
power gated and one clock gated design have been implemented for each
identified LR. Tab. 1 reports the real (extracted from the post-synthesis
reports) and estimated percentage power variation for each LR when the CG and
PG strategies are applied. Comparing the real power variations with the
estimated ones, the worst estimation is related to LR4 of FFT in the PG case,
whose percentage power variation is so low (-0.67%) that it is impossible to
estimate it accurately. For this reason, the algorithm that evaluates the LRs
also takes their area into account.
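
The comparison can be reproduced directly from Table 1. The short Python sketch
below (the helper name `abs_errors` is only illustrative) computes the absolute
difference between real and estimated PG power variation for the FFT test case
and confirms that LR4 is the region with the largest deviation.

```python
# PG rows of the FFT test case, copied from Table 1 (percentage power variation).
fft_pg_real = {"LR1": -25.37, "LR2": -0.59, "LR3": -6.22, "LR4": -0.67,
               "LR5": 0.14, "LR6": 0.18, "LR7": -1.92, "LR8": 1.78}
fft_pg_est = {"LR1": -25.22, "LR2": -0.54, "LR3": -6.28, "LR4": 0.01,
              "LR5": 0.11, "LR6": 0.14, "LR7": -1.92, "LR8": 1.89}


def abs_errors(real, est):
    """Absolute estimation error per LR, in percentage points."""
    return {lr: abs(real[lr] - est[lr]) for lr in real}


errors = abs_errors(fft_pg_real, fft_pg_est)
worst = max(errors, key=errors.get)
print(worst, round(errors[worst], 2))  # LR4 0.68
```

For all the other FFT regions the deviation stays within 0.15 percentage
points.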
    Tab. 1 also reports the percentage area of each LR with respect to the
design area. As expected, the biggest regions (LR1 in FFT and LR2 in Zoom) are
the ones with the highest power saving, regardless of the considered strategy
(PG or CG). The composition of the considered LR also has an impact on the
effectiveness of the applied power saving technique. LR1 of FFT is 99%
combinational, thus CG can save only a small amount of power in this region.
However, simply considering the LR size may not be sufficient to identify the
best candidate for power saving; indeed, LR3 of FFT is 15.8% of the total area,
but switching it off saves only about 6% of the total power.
    The presented methodology speeds up the evaluation of power saving
techniques by estimating in advance the effectiveness of PG or CG when applied
to the LRs of a CGR system. For a CGR system composed of N input networks, only
one synthesis of the baseline design without any power management, and N
simulations (one for each kernel) to retrieve the real switching activity of
the system, are required. Otherwise, it would be necessary to implement one
design for each identified LR for both CG and PG, plus the baseline one (to
evaluate the cost and benefit of the chosen power saving technique).
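
The saving in design effort can be quantified with a small worked example. The
two helper functions below are hypothetical and simply encode the counts stated
above; the numbers used are those of the Zoom coprocessor (13 LRs, seven
kernels).

```python
def designs_without_model(n_lrs):
    """One CG and one PG design per LR, plus the baseline design."""
    return 2 * n_lrs + 1


def runs_with_model(n_networks):
    """One baseline synthesis plus one simulation per input network."""
    return {"syntheses": 1, "simulations": n_networks}


# Zoom coprocessor: 13 LRs identified by MDC, seven computational kernels.
print(designs_without_model(13))  # 27
print(runs_with_model(7))         # {'syntheses': 1, 'simulations': 7}
```

For the FFT test case, with 8 LRs, the exhaustive approach would require 17
designs.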
    As a follow-up of this research, the model needs to be improved by also
considering the contribution of the interconnection, which is currently not
addressed. Furthermore, the overhead of the power switches is not considered
yet; these sleep transistors are inserted only during the place and route flow,
and a way to estimate in advance how many switches are going to be inserted for
each LR has not been explored yet. The number of switches affects the rush
currents during power-up transitions and the power-down/up timing; these issues
are also going to be addressed in future work.


4   Related Works

To the best of our knowledge, the literature does not treat the problem of
modeling PG and CG costs in CGR designs. Shafique et al. [13] focus on
low-power techniques and power modeling for FPGAs. In [14], only CG is taken
into account: different power states are defined and their consumption is
characterized by low-level power analysis results. The work in [15] focuses on
estimating the leakage reduction for PG and reverse body bias.
    Other approaches perform an estimation that considers different components.
For instance, Stokke et al. [16] propose a power modeling method for the Tegra
K1 CPU that, taking into account measured rail voltages and fine-grained
hardware activity predictors, exposes components such as rail and core leakage
currents. The FALPEM framework [17] provides power estimations at the
pre-register transfer level (RTL) stage, specifically targeting the power
consumed by the clock network and the interconnection, but PG and CG costs are
not defined. Finally, Li et al. [18] propose an architecture-level integrated
power, area, and timing modeling framework for multi-core systems, which
evaluates system building blocks (i.e., CPU, buses, etc.) for different
technology nodes, also providing PG support.

References
 1. R. Hartenstein, “A Decade of Reconfigurable Computing: a Visionary Retrospec-
    tive,” in Proc. of Design, Automation and Test in Europe (DATE’01), 2001.
 2. F. Palumbo et al., “Power-Awarness in Coarse-Grained Reconfigurable Multi-
    Functional Architectures: a Dataflow Based Strategy,” Journal of Signal Processing
    Systems, vol. 87, pp. 81–106, Apr 2017. doi: 10.1007/s11265-016-1106-9.
 3. C. Sau et al., “Automated design flow for multi-functional dataflow-based plat-
    forms,” Journal of Signal Processing Systems, vol. 85, pp. 143–165, Oct 2016.
 4. F. Palumbo et al., “Power-awarness in coarse-grained reconfigurable designs: A
    dataflow based strategy,” in Workshop on Signal Processing Systems, 2014.
 5. F. Palumbo et al., “Coarse-grained reconfiguration: dataflow-based power manage-
    ment,” IET Computers Digital Techniques, vol. 9, no. 1, pp. 36–48, 2015.
 6. T. Fanni et al., “Automated Power Gating Methodology for Dataflow-based Re-
    configurable Systems,” in Conference on Computing Frontiers, 2015.
 7. Si2, Si2 Common Power Format Specification, Version 2.1, Dec. 2014.
 8. C. Sau et al., “Automatic generation of dataflow-based reconfigurable co-processing
    units,” in Conf. on Design and Architectures for Signal and Image Processing, 2014.
 9. C. Sau et al., “Reconfigurable coprocessors synthesis in the MPEG-RVC domain,”
    in Conference on ReConFigurable Computing and FPGAs, 2015.
10. T. Fanni et al., “Power and Clock Gating Modelling in Coarse Grained Reconfig-
    urable Systems,” in Conference on Computing Frontiers, 2016.
11. F. Palumbo et al., “Modelling and automated implementation of optimal power
    saving strategies in coarse-grained reconfigurable architectures,” Journal of Elec-
    trical and Computer Engineering, p. 27, 2016. doi: 10.1155/2016/4237350.
12. J. Cooley and J. Tukey, “An Algorithm for the Machine Computation of Complex
    Fourier Series,” Mathematics of Computation, vol. 19, 1965.
13. M. Shafique et al., “Adaptive Energy Management for Dynamically Reconfigurable
    Processors,” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 33,
    no. 1, pp. 50–63, 2014.
14. J. Yi and J. Kim, “Power modeling for digital circuits with clock gating,” IEICE
    Electronics Express, vol. 12, no. 24, pp. 20150817–20150817, 2015.
15. H. Xu et al., “Run-time Active Leakage Reduction by Power Gating and Reverse
    Body Biasing: An Energy View,” in Conference on Computer Design, 2008.
16. K. R. Stokke et al., “High-Precision Power Modelling of the Tegra K1 Variable
    SMP Processor Architecture,” in Symposium on Embedded Multicore/Many-core
    Systems-on-Chip, 2016.
17. A. Chhabra et al., “FALPEM: framework for architectural-level power estimation
    and optimization for large memory sub-systems,” IEEE Trans. on CAD of Inte-
    grated Circuits and Systems, vol. 34, no. 7, pp. 1138–1142, 2015.
18. S. Li et al., “The McPAT Framework for Multicore and Manycore Architectures:
    Simultaneously Modeling Power, Area, and Timing,” TACO, vol. 10, no. 1, p. 5,
    2013.