A Heuristic Scheduling Algorithm with Variable-Cycle Approximate Functional Units in High-Level Synthesis Koyu Ohata1 Hiroki Nishikawa2 Xiangbo Kong1 Hiroyuki Tomiyama1 1 Graduate School of Science and Engineering, Ritsumeikan University, Shiga, Japan 2 Graduate School of Information Science and Technology, Osaka University, Osaka, Japan Abstract Computational approximation is a promising paradigm to exploit hardware capabilities or mitigate computational demands, taking advantage of inherent tolerance to errors in applications. For several decades, a variety of works for approximate computing have been extensively studied. In particular, accuracy-controllable approximate functional units have been developed to be configurable to control the accuracy at runtime. To satisfy the requirement of performance, hardware cost and accuracy, the techniques for design space exploration are important since the excessive use of approximate circuits results in loss of accuracy, while that of accurate circuits can hardly meet time constraints. This paper proposes a list scheduling algorithm in the high-level synthesis of accuracy-controllable approximate functional units with variable latency. The experiments demonstrate that our proposed scheduling method can efficiently explore the trade-offs between performance, hardware cost, and accuracy in a practical time, compared to the state-of-the-art methods. Keywords 1 High-level synthesis, approximate computing, scheduling, approximate multiplication, list scheduling 1. Introduction Approximate computing is attracting attention to rapidly designing high-performance, low-cost and low-power circuits for error-tolerant applications such as image processing and machine learning. Design techniques for approximate arithmetic circuits have been studied at various design levels, from the transistor to the architecture level [1-2]. The quality requirements of an error-tolerant application using approximate computing may vary significantly at runtime. There have appeared several works [3-5] on approximate functional units that can dynamically change accuracy at runtime. A carry- maskable adder [3] was proposed to enable dynamically selecting the length of the carry propagation to satisfy the quality requirements. Also, an accuracy-controllable multiplier [4] was proposed where the last stage is generated by a carry-maskable adder. The approximate multiplier has an additional input, and it is utilized to control the error at the output. If the error is large, the latency and power of the circuit will be small. On the other hand, if the error is reduced, the latency and power of the circuit will be large. Thus, the trade-off between error, power, and latency can be optimized at runtime by using a functional unit with variable accuracy. Further on, the work in [5] proposed an approximate multiplier that can dynamically change its accuracy by using a digital signal processor (DSP) with an embedded field-programmable gate array (FPGA). These studies show that the more inaccurate the computation, the faster it becomes. There is a large amount of literature on high-level synthesis (HLS) of approximate computing circuits such as [6-8]. Unfortunately, most of the works previously mentioned have paid little attention to allocation, scheduling, and binding algorithms, which are the crucial techniques in HLS. Regarding HLS techniques aware of accuracy-controllable approximate functional units, there is the work [9]. The The 4th International Symposium on Advanced Technologies and Applications in the Internet of Things (ATAIT 2022), August 24-26, 2022, Ibaraki, Japan EMAIL: koyu.ohata@tomiyama-lab.org (K. Ohata); nishikawa.hiroki@ist.osaka-u.ac.jp (H. Nishikawa); kong@fc.ritsumei.ac.jp (X. Kong); ht@fc.ritsumei.ac.jp (H. Tomiyama) 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) work in [9] focused on accuracy-controllable functional units that can take variable cycles and proposed a scheduling scheme that takes into account the difference in delays between accurate and approximate computations in order to efficiently use accuracy-controllable approximate functional units. Their experiments show that it realized the tradeoffs between the errors, the resource, and the time constraints. However, the scheduling technique was based on integer linear programming (ILP) and the runtime for scheduling becomes exponentially longer as increasing the problem sizes. In order to solve the issue above, this paper presents a heuristic scheduling algorithm in HLS. Our proposal is based on a resource- constrained list scheduling algorithm and aims to minimize the output error by switching approximate functional units that cause large errors in the output values to accurately with satisfying the resource and time constraints. The remainder of the paper is organized as follows: Section II shows scheduling with variable-cycle functional units in HLS, and we propose a heuristic scheduling algorithm to solve this scheduling problem in polynomial time. Section III evaluates our proposed algorithm by comparing it to existing work. Finally, we conclude this paper in Section IV. 2. Variable-Cycle Scheduling in High-Level Synthesis 2.1. Variable-Cycle Scheduling This section presents algorithm [9] for scheduling variable-cycle approximate functional units. In the following section, for simplicity, we have an assumption that there are two types of functional units in the scheduling problem: accuracy-controllable multiplier and arithmetic logic unit (ALU). An accuracy-controllable multiplier, which has been developed in [5], is capable of dual operation modes. One is called accurate mode that the multiplication takes two cycles and does not produce the error. The other is called approximate mode that takes a single cycle and allows the error. A function other than multiplication is assumed to use an ALU that takes a cycle. Variable-cycle scheduling determines not only a schedule of operations in control states but also the accuracy mode for each multiplication simultaneously, with the goal of error minimization at the output under a given resource and time constraint. It should be noted that our proposed algorithm can be easily extended to consider another specific functional unit, not ALU, and to enable its functional unit to take several execution modes as well as the accuracy-controllable multiplier. The number of cycles for each function is not limited to the assumption above. Although this work is targeted at multipliers [5] that can dynamically change their computational accuracy, it can extend to other approximate adders [10, 11] and multipliers [12, 13]. An example of variable cycle scheduling is shown in Figure 1. We are given a data flow graph (DFG) as a directed acyclic graph. Each node represents an operation and each edge represents precedence dependency between the operations. The DFG in this example consists of two accuracy- controllable multipliers and two ALUs. The notation labeled as means a control state. For instance, 1 is the first state and is the second state. Figure 1 (a) shows that the multiplications are approximated (determined to take one cycle), and it results in the shortest cycles (three cycles) but the Figure 1: An example of variable-cycle scheduling largest error due to the approximation at the Algorithm 1. Proposed list scheduling algorithm same time. In contrast, the multiplications in 1 ListScheduling (G(V,E)) begin Figure 1 (b) are determined to execute in the 2 for do accurate mode so that the operations can 3 ALAP_schedule achieve the smallest error at the expense of the 4 endfor performance (four cycles). Figure 1 (c) shows 5 for do 6 variable-cycle scheduling. Unlike Figure 1 (b), 7 endfor the multiplication in the middle is 8 for n in 1..|M|+1 do approximated, and the generated circuit 9 for do achieves shorter cycles than that of Figure 1 10 if do (b). In addition, multicycling the rightmost 11 multiplication realizes a circuit with an 12 else endif accuracy higher than that of Figure 1 (a). 13 endfor 14 = 0, 15 while do 2.2. Proposed List Scheduling 16 Algorithm 17 18 for for do do 19 if do We propose a heuristic algorithm for the 20 variable-cycle scheduling problem and show a 21 if do pseudo code in Algorithm 1. Our scheduling 22 , algorithm is based on a typical resource- 23 24 endif constrained list scheduling. Given a DFG 25 endif , our algorithm determines a schedule 26 if do of operations in control states and the accuracy 27 mode for each multiplication simultaneously. 28 if do Let denote a set of operations and E denote 29 , , precedence dependency between the 30 operations. Recall that the operations are 31 endif 32 endif classified into two types: the multiplication 33 endfor with accurate and approximate modes and the 34 if do ALU-based operations. Let M denote a set of 35 , multiplications and A denote a set of ALU- 36 if do based operations. In the set M, a subset called 37 , is denoted, and indicates a set of 38 approximate multiplications. Therefore, the 39 elif do relationships are always held as follows; 40 , V=M A and Apx M. The number of cycles 41 else , endif for each operation is given in advance. The 42 endif multiplication takes either cycles for 43 if do 44 , accurate mode and cycles for 45 if do approximate mode and the ALU-based 46 , operations take cycles. In addition, we 47 denote and as a resource 48 else , endif constraint and denote as a time 49 endif constraint. 50 endfor The first process in the algorithm conducts 51 endwhile 52 if do as-late-as-possible (ALAP) scheduling for the , 53 DFG where all the multiplications assume to 54 be approximate mode and obtains the schedule 55 else for each operation (Lines 2-4). Then, 56 for each operation in the schedule, the 57 endif produced error is derived from the accurate 58 and approximate outputs with considering the 59 endfor 60 end error propagation from predecessor operations due to approximation (Lines 5-7). In this step, it is not possible to use error models such as [14, 15] for post-design analysis. Therefore, we adopt the error propagation model presented in [16] and [17]. Consider a typical multiplication and let the errors from the inputs and denote and respectively. Let the errors of the multiplication and its product be denoted as and . The error is derived as Formula (1): (1) Accuracy is evaluated by the magnitude of the error. In other words, the smaller the error, the higher the accuracy of the circuit. At the same time, the priority for each operation, is created, based on (Lines 9-13). Next, the algorithm relaxes the schedule by multicycling the multiplication, where the multiplication with the highest priority is selected, to reduce the error at the output. The algorithm iteratively conducts its process while the time constraint is satisfied (Lines 15-50). Here, we introduce three sets of states as , , and . Let denote a set of operations that are ready to execute, denote a set of operations during execution, and denote a set of operations that are already finished. Initially, and are set to empty. Let and denote the number of available multipliers and ALUs, respectively. The resources and are released after the execution of multiplication and an operation by ALU, respectively. represents the remaining cycles for each operation i. and represent the control states of start and finish or each operation i, respectively. If an operation i takes a single cycle, and indicate the same state. In Lines 18-34, functional units are ready to be executed and come to a set of ready operations under the resource constraint. If an operation in is finished, the operation is removed from and the resource is released. Then the operation becomes a member of . In order of priority, the operations are started the execution (Lines 34-50). If the time constraint is missed, the schedule with multicycling the selected multiplication is infeasible and the multiplication is returned to approximate mode. Otherwise, the schedule is feasible and determined to be executed in accurate mode (Lines 52- 57). The processes are repeatedly conducted until all the multiplication has tried to switch accurate mode from approximate mode. In the end, we obtain a feasible schedule such that the error is reduced under resource and time constraints. The computational complexity of our algorithm is O(N2 log N) for the number of multiplications N at a worst case. 2.3. Proposed Algorithm Example Our proposed method aims to minimize the error by accurately performing only those multiplications that cause large errors in the output. For the operation from step 8 to step 59 of Algorithm 1, an example problem in DFG with three multiplications and two additions is shown in Figure 2. We assume one cycle for approximate multiplications, two cycles for exact multiplications, and one cycle for additions. The resource constraint is list scheduling as two accuracy-controllable approximate multipliers and one adder. The time constraint is four cycles. Figure 2 (a) is the DFG given, and the number at the bottom left of the multiplication is assumed to be the magnitude of the influence of the error on the output value obtained in step 6 of Algorithm 1. First, as shown in Figure 2 (b), all the multiplications are approximate multiplications, and resource constraint-based list scheduling is performed. Next, the list scheduling is performed again with the multiplications with the largest errors made exactly. At this time, since all the operations can be performed within the time constraint of four cycles, the multiplication that was made exact in Figure 2 (c) is kept exact for the next scheduling. Next, as shown in Figure 2 (d), the multiplication with the second-largest error is made exact and list scheduling is performed. However, in Figure 2 (d) the time constraint of four cycles has been exceeded, so the upper left multiplication is returned to approximation. Next, the multiplication with the smallest error is made exact and list scheduling is performed. The result of Figure 2 (e) satisfies the time constraint, so we keep the multiplication as exact. Finally, since all the multiplications have been done correctly, the result of Figure 2 (e), which will have the smallest error so far, is output as the final solution as shown in Figure 2 (f). In this example, we have shown 1-2 cycles for simplicity, but our proposed method can be applied to operations with more than 3 cycles. a b c d e f + 2 1 3 + x (a) The given DFG (b) Approximate all multiplications (c) Exact multiplication with the highest error (d) Exact multiplication with the (e) Exact multiplication with the (f) Final output that satisfies resource and second highest error smallest error time constraints Figure 2: An example of proposed list scheduling algorithm 3. Experiments 3.1. Setup In the experiments, we have compared our heuristic algorithm to an existing ILP-based technique [9], which uses CPLEX 12.10 as a solver on a PC with an AMD Ryzen 7 PRO 4750G CPU and 64GB of main memory. We utilize MediaBench [18] for the benchmark programs. In the experiments, we are given resource and time constraints exhaustively to observe trade-offs. Note that the number of accuracy-controllable multipliers is restricted as a resource constraint but that of ALUs is ignored since sharing of ALUs largely incurs an overhead. The runtime for evaluation is limited to up to an hour in wall-clock time. If an optimal solution is not found within an hour, the best solutions at that time is employed to compare. 3.2. Results The experimental results are shown in Table 1 and Figure 3. Table 1 shows the comparison between our proposed algorithm and the ILP-based technique [9]. In the table, Nodes represent the total number of operations for each benchmark. Mult means the number of multiplications among them. Designs indicate the number of design problems for a variety of resource and time constraints. Wins, Losses, and Draws are the number of designs that our proposed algorithm that outperforms the ILP-based technique in accuracy even for just a bit, the ILP-based technique outperforms our proposed algorithm, and they find the same solutions, respectively. Figure 3 shows, on the other hand, the relationship between the output error, resource constraint, and time constraint in the cases of Motion Vectors Detector and Matrix Inversion. The abscissa axis represents the resource constraint, the ordinate axis represents the time constraint, and the applicate axis represents the error at the output. Note that the output error is expressed as the relative value of the error contained in the output to the accurate computation result, and the applicate axis is scaled in logarithm in Figure 3 (b). The results in Table 1 indicate that our proposed algorithm finds almost the same errors as the ILP- based technique in the cases where the number of operations is less than a hundred. In addition, the cases have enabled the ILP-based technique to find optimal solutions; therefore, our algorithm explicitly obtains an optimal solution in many cases. In Motion Vectors Detector, our algorithm produces larger errors than the ILP-based technique in 15 designs. According to the results in Figure 3 (a), however, the output errors look slightly larger but almost the same since the output errors by each technique are negligibly small. Over a hundred of the nodes, the ILP-based technique can hardly find an optimal solution within the runtime. In especially Matrix Inversion, Figure 3 (b) implies the ILP-based technique fails to find good solutions in many designs. In contrast, we demonstrate the effectiveness of our proposed algorithm since it can obtain flexible solutions under a variety of resource and time constraints. Table 2 shows the results of runtime evaluation between the ILP-based technique and our proposed algorithm. Max, Min, and Mean indicate the maximum, minimum, and average runtime of scheduling among the various constraints, respectively. In the cases where the number of operations is less than a hundred, the runtime to find optimal solutions is almost the same. However, in a large benchmark with hundreds of operations, our proposed algorithm is successfully finding a feasible solution in a shorter time than the ILP-based technique. Table 1. COMPARISON BETWEEN OUR PROPOSED ALGORITHM AND ILP [9] Benchmarks [18] Nodes (Mult) Designs Wins Losses Draws ILP [9] exceeds an hour HAL 11 (6) 14 0 0 14 0 FIR filter 21 (11) 19 0 0 19 0 Auto Regression Filter 28 (16) 36 0 0 36 0 Motion Vectors Decoder 32 (14) 37 0 15 22 0 Elliptic Wave Filter 34 (8) 16 0 3 13 0 Cosine 42 (14) 52 0 1 51 0 Feedback Points 53 (17) 43 0 10 33 1 Matrix Multiplication 109 (40) 129 1 20 108 4 Smooth Triangle 197 (69) 257 35 63 159 55 Matrix Inversion 333 (140) 516 235 104 177 257 (a) Motion Vectors Decoder (b) Matrix Inversion Figure 3: The output error in Motion Vectors Decoder and Matrix Inversion with exhaustive constraints of resource and time Table 2. RUNTIME EVALUATION BETWEEN ILP [9] AND OUR PROPOSED ALGORITHM (SEC.) Benchmarks [18] ILP [9] Our algorithm Max Min Mean Max Min Mean HAL 0.130 0.010 0.083 0.006 0.005 0.006 FIR filter 0.860 0.080 0.214 0.020 0.013 0.017 Auto Regression Filter 8.730 0.080 0.899 0.049 0.023 0.037 Motion Vectors Decoder 35.380 0.080 1.269 0.050 0.024 0.036 Elliptic Wave Filter 0.220 0.050 0.127 0.033 0.029 0.031 Cosine 2387 0.050 46.490 0.087 0.039 0.058 Feedback Points >3600 0.080 85.617 0.138 0.066 0.100 Matrix Multiplication >3600 0.200 149.748 2.044 0.527 1.092 Smooth Triangle >3600 0.300 1155 15.804 2.865 7.204 Matrix Inversion >3600 1.020 2116 170.288 16.217 70.913 4. Conclusions We proposed a heuristic scheduling algorithm with variable-cycle approximate functional units in HLS. The proposed algorithm is polynomial, and we have demonstrated that it finds better solutions than existing techniques in a short time. Therefore, our algorithm is expected to be adopted in deep learning-based and image processing applications. In the future, we plan to extend our algorithm to combine with other optimization techniques such as chaining, pipelining, and bit-width reduction. In addition, we are going to conduct a case study by the implementation of our algorithm into general HLS tools. Acknowledgments This work is supported partly by KAKENHI 20H00590, 20H04160, and 21K19776. References [1] ACM Computing Surveys, vol. 48, no. 4, 2016. [2] IEEE Design & Test, vol. 33, no. 1, 2016. [3] A low- International Symposium on Quality Electronic Design, 2018. [4] -power high-speed accuracy-controllable approximate Asia and South Pacific Design Automation Conference, 2018. [5] -bit accuracy- International SoC Design Conference, 2021. [6] K. Nepal, Y. Li, R. I. Bahar, and S. R Design, Automation & Test in Europe Conference & Exhibition, 2014. [7] -level synthesis of approximate hardware under joint Design, Automation & Test in Europe Conference & Exhibition, 2017. [8] - level synthesis of approximate accelerators using app International Conference On Computer Aided Design, 2020 [9] -cycle approximate functional units in high- International SoC Design Conference, 2021. [10] -oriented approximate adder International Conference on Computer-Aided Design, 2013. [11] V. Camus, J. Schlachter, and C. Enz -power carry cut-back approximate adder with fixed-point implementation and floating- Design Automation Conference, 2016. [12] -power, high-performance approximate multiplier with configurable part Design, Automation & Test in Europe Conference & Exhibition, 2014. [13] International Conference on Computer Design, 2013. [14] R. Venkatesan, A. Agarwal, K. International Conference on Computer Aided Design, 2011. [15] International Conference on Computer Design, 2013. [16] -driven error analysis for Design, Automation & Test in Europe Conference & Exhibition, 2018. [17] -level synthesis for Design Automation Conference, 2015. [18] C. Lee, M. Potkonjak, and W. H. M. Smi International Symposium on Microarchitecture, 1997.