1. Introduction

A Heuristic Scheduling Algorithm with Variable-Cycle Approximate Functional Units in High-Level Synthesis

Koyu Ohata

0 2

Hiroki Nishikawa

nishikawa.hiroki@ist.osaka-u.ac.jp 0 1

Xiangbo Kong

kong@fc.ritsumei.ac.jp 0 2

Hiroyuki Tomiyama

0 2 0 A low- International Symposium on Quality Electronic Design, 2018. -power high-speed accuracy-controllable approximate Asia and South Pacific Design Automation Conference , 2018 1 Graduate School of Information Science and Technology, Osaka University , Osaka , Japan 2 Graduate School of Science and Engineering, Ritsumeikan University , Shiga , Japan

2021

Computational approximation is a promising paradigm to exploit hardware capabilities or mitigate computational demands, taking advantage of inherent tolerance to errors in applications. For several decades, a variety of works for approximate computing have been extensively studied. In particular, accuracy-controllable approximate functional units have been developed to be configurable to control the accuracy at runtime. To satisfy the requirement of performance, hardware cost and accuracy, the techniques for design space exploration are important since the excessive use of approximate circuits results in loss of accuracy, while that of accurate circuits can hardly meet time constraints. This paper proposes a list scheduling algorithm in the high-level synthesis of accuracy-controllable approximate functional units with variable latency. The experiments demonstrate that our proposed scheduling method can efficiently explore the trade-offs between performance, hardware cost, and accuracy in a practical time, compared to the state-of-the-art methods.

1 High-level synthesis approximate computing scheduling approximate multiplication list scheduling

1. Introduction

Approximate computing is attracting attention to rapidly designing high-performance, low-cost and low-power circuits for error-tolerant applications such as image processing and machine learning. Design techniques for approximate arithmetic circuits have been studied at various design levels, from the transistor to the architecture level [1-2]. The quality requirements of an error-tolerant application using approximate computing may vary significantly at runtime. There have appeared several works [3-5] on approximate functional units that can dynamically change accuracy at runtime. A carrymaskable adder [3] was proposed to enable dynamically selecting the length of the carry propagation to satisfy the quality requirements. Also, an accuracy-controllable multiplier [4] was proposed where the last stage is generated by a carry-maskable adder. The approximate multiplier has an additional input, and it is utilized to control the error at the output. If the error is large, the latency and power of the circuit will be small. On the other hand, if the error is reduced, the latency and power of the circuit will be large. Thus, the trade-off between error, power, and latency can be optimized at runtime by using a functional unit with variable accuracy. Further on, the work in [5] proposed an approximate multiplier that can dynamically change its accuracy by using a digital signal processor (DSP) with an embedded field-programmable gate array (FPGA). These studies show that the more inaccurate the computation, the faster it becomes.

There is a large amount of literature on high-level synthesis (HLS) of approximate computing circuits such as [ 6-8 ]. Unfortunately, most of the works previously mentioned have paid little attention to allocation, scheduling, and binding algorithms, which are the crucial techniques in HLS. Regarding HLS techniques aware of accuracy-controllable approximate functional units, there is the work [9]. The work in [9] focused on accuracy-controllable functional units that can take variable cycles and proposed a scheduling scheme that takes into account the difference in delays between accurate and approximate computations in order to efficiently use accuracy-controllable approximate functional units. Their experiments show that it realized the tradeoffs between the errors, the resource, and the time constraints. However, the scheduling technique was based on integer linear programming (ILP) and the runtime for scheduling becomes exponentially longer as increasing the problem sizes. In order to solve the issue above, this paper presents a heuristic scheduling algorithm in HLS. Our proposal is based on a resourceconstrained list scheduling algorithm and aims to minimize the output error by switching approximate functional units that cause large errors in the output values to accurately with satisfying the resource and time constraints.

The remainder of the paper is organized as follows: Section II shows scheduling with variable-cycle functional units in HLS, and we propose a heuristic scheduling algorithm to solve this scheduling problem in polynomial time. Section III evaluates our proposed algorithm by comparing it to existing work. Finally, we conclude this paper in Section IV.

2. Variable-Cycle Scheduling in High-Level Synthesis 2.1. Variable-Cycle Scheduling

This section presents algorithm [9] for scheduling variable-cycle approximate functional units. In the following section, for simplicity, we have an assumption that there are two types of functional units in the scheduling problem: accuracy-controllable multiplier and arithmetic logic unit (ALU). An accuracy-controllable multiplier, which has been developed in [5], is capable of dual operation modes. One is called accurate mode that the multiplication takes two cycles and does not produce the error. The other is called approximate mode that takes a single cycle and allows the error. A function other than multiplication is assumed to use an ALU that takes a cycle. Variable-cycle scheduling determines not only a schedule of operations in control states but also the accuracy mode for each multiplication simultaneously, with the goal of error minimization at the output under a given resource and time constraint. It should be noted that our proposed algorithm can be easily extended to consider another specific functional unit, not ALU, and to enable its functional unit to take several execution modes as well as the accuracy-controllable multiplier. The number of cycles for each function is not limited to the assumption above. Although this work is targeted at multipliers [5] that can dynamically change their computational accuracy, it can extend to other approximate adders [10, 11] and multipliers [12, 13].

An example of variable cycle scheduling is shown in Figure 1. We are given a data flow graph (DFG) as a directed acyclic graph. Each node represents an operation and each edge represents precedence dependency between the operations. The DFG in this example consists of two accuracycontrollable multipliers and two ALUs. The notation labeled as means a control state. For instance, 1 is the first state and is the second state. Figure 1 (a) shows that the multiplications are approximated (determined to take one cycle), and it results in the shortest cycles (three cycles) but the largest error due to the approximation at the same time. In contrast, the multiplications in Figure 1 (b) are determined to execute in the accurate mode so that the operations can achieve the smallest error at the expense of the performance (four cycles). Figure 1 (c) shows variable-cycle scheduling. Unlike Figure 1 (b), the multiplication in the middle is approximated, and the generated circuit achieves shorter cycles than that of Figure 1 (b). In addition, multicycling the rightmost multiplication realizes a circuit with an accuracy higher than that of Figure 1 (a).

2.2. Proposed Algorithm List Scheduling

We propose a heuristic algorithm for the variable-cycle scheduling problem and show a pseudo code in Algorithm 1. Our scheduling algorithm is based on a typical resourceconstrained list scheduling. Given a DFG

, our algorithm determines a schedule of operations in control states and the accuracy mode for each multiplication simultaneously.

Let denote a set of operations and E denote precedence dependency between the operations. Recall that the operations are classified into two types: the multiplication with accurate and approximate modes and the ALU-based operations. Let M denote a set of multiplications and A denote a set of ALUbased operations. In the set M, a subset called

is denoted, and indicates a set of approximate multiplications. Therefore, the relationships are always held as follows; V=M A and Apx M. The number of cycles for each operation is given in advance. The multiplication takes either cycles for accurate mode and cycles for approximate mode and the ALU-based operations take cycles. In addition, we denote and as a resource constraint and denote as a time constraint.

The first process in the algorithm conducts as-late-as-possible (ALAP) scheduling for the DFG where all the multiplications assume to be approximate mode and obtains the schedule

for each operation (Lines 2-4). Then, for each operation in the schedule, the produced error is derived from the accurate and approximate outputs with considering the Algorithm 1. Proposed list scheduling algorithm

, do do endif do do endif endif error propagation from predecessor operations due to approximation (Lines 5-7). In this step, it is not possible to use error models such as [14, 15] for post-design analysis. Therefore, we adopt the error propagation model presented in [16] and [17]. Consider a typical multiplication and let the errors from the inputs and denote and respectively. Let the errors of the multiplication and its product be denoted as and . The error is derived as Formula (1): (1) Accuracy is evaluated by the magnitude of the error. In other words, the smaller the error, the higher the accuracy of the circuit.

At the same time, the priority for each operation, is created, based on (Lines 9-13). Next, the algorithm relaxes the schedule by multicycling the multiplication, where the multiplication with the highest priority is selected, to reduce the error at the output. The algorithm iteratively conducts its process while the time constraint is satisfied (Lines 15-50). Here, we introduce three sets of states as , , and . Let denote a set of operations that are ready to execute, denote a set of operations during execution, and denote a set of operations that are already finished. Initially, and are set to empty. Let and denote the number of available multipliers and ALUs, respectively. The resources and are released after the execution of multiplication and an operation by ALU, respectively. represents the remaining cycles for each operation i. and represent the control states of start and finish or each operation i, respectively. If an operation i takes a single cycle, and indicate the same state. In Lines 18-34, functional units are ready to be executed and come to a set of ready operations under the resource constraint. If an operation in is finished, the operation is removed from and the resource is released. Then the operation becomes a member of . In order of priority, the operations are started the execution (Lines 34-50). If the time constraint is missed, the schedule with multicycling the selected multiplication is infeasible and the multiplication is returned to approximate mode. Otherwise, the schedule is feasible and determined to be executed in accurate mode (Lines 5257). The processes are repeatedly conducted until all the multiplication has tried to switch accurate mode from approximate mode. In the end, we obtain a feasible schedule such that the error is reduced under resource and time constraints. The computational complexity of our algorithm is O(N2 log N) for the number of multiplications N at a worst case. 2.3.

Proposed Algorithm Example

Our proposed method aims to minimize the error by accurately performing only those multiplications that cause large errors in the output. For the operation from step 8 to step 59 of Algorithm 1, an example problem in DFG with three multiplications and two additions is shown in Figure 2. We assume one cycle for approximate multiplications, two cycles for exact multiplications, and one cycle for additions. The resource constraint is list scheduling as two accuracy-controllable approximate multipliers and one adder. The time constraint is four cycles. Figure 2 (a) is the DFG given, and the number at the bottom left of the multiplication is assumed to be the magnitude of the influence of the error on the output value obtained in step 6 of Algorithm 1. First, as shown in Figure 2 (b), all the multiplications are approximate multiplications, and resource constraint-based list scheduling is performed. Next, the list scheduling is performed again with the multiplications with the largest errors made exactly. At this time, since all the operations can be performed within the time constraint of four cycles, the multiplication that was made exact in Figure 2 (c) is kept exact for the next scheduling. Next, as shown in Figure 2 (d), the multiplication with the second-largest error is made exact and list scheduling is performed. However, in Figure 2 (d) the time constraint of four cycles has been exceeded, so the upper left multiplication is returned to approximation. Next, the multiplication with the smallest error is made exact and list scheduling is performed. The result of Figure 2 (e) satisfies the time constraint, so we keep the multiplication as exact. Finally, since all the multiplications have been done correctly, the result of Figure 2 (e), which will have the smallest error so far, is output as the final solution as shown in Figure 2 (f). In this example, we have shown 1-2 cycles for simplicity, but our proposed method can be applied to operations with more than 3 cycles. 2 x e 1 (a) The given DFG (b) Approximate all multiplications (c) Exact multiplication with the highest error

3. Experiments 3.1. Setup

In the experiments, we have compared our heuristic algorithm to an existing ILP-based technique [9], which uses CPLEX 12.10 as a solver on a PC with an AMD Ryzen 7 PRO 4750G CPU and 64GB of main memory. We utilize MediaBench [18] for the benchmark programs. In the experiments, we are given resource and time constraints exhaustively to observe trade-offs. Note that the number of accuracy-controllable multipliers is restricted as a resource constraint but that of ALUs is ignored since sharing of ALUs largely incurs an overhead. The runtime for evaluation is limited to up to an hour in wall-clock time. If an optimal solution is not found within an hour, the best solutions at that time is employed to compare. 3.2.

Results

The experimental results are shown in Table 1 and Figure 3. Table 1 shows the comparison between our proposed algorithm and the ILP-based technique [9]. In the table, Nodes represent the total number of operations for each benchmark. Mult means the number of multiplications among them. Designs indicate the number of design problems for a variety of resource and time constraints. Wins, Losses, and Draws are the number of designs that our proposed algorithm that outperforms the ILP-based technique in accuracy even for just a bit, the ILP-based technique outperforms our proposed algorithm, and they find the same solutions, respectively. Figure 3 shows, on the other hand, the relationship between the output error, resource constraint, and time constraint in the cases of Motion Vectors Detector and Matrix Inversion. The abscissa axis represents the resource constraint, the ordinate axis represents the time constraint, and the applicate axis represents the error at the output. Note that the output error is expressed as the relative value of the error contained in the output to the accurate computation result, and the applicate axis is scaled in logarithm in Figure 3 (b).

The results in Table 1 indicate that our proposed algorithm finds almost the same errors as the ILPbased technique in the cases where the number of operations is less than a hundred. In addition, the cases have enabled the ILP-based technique to find optimal solutions; therefore, our algorithm explicitly obtains an optimal solution in many cases. In Motion Vectors Detector, our algorithm produces larger errors than the ILP-based technique in 15 designs. According to the results in Figure 3 (a), however, the output errors look slightly larger but almost the same since the output errors by each technique are negligibly small.

Over a hundred of the nodes, the ILP-based technique can hardly find an optimal solution within the runtime. In especially Matrix Inversion, Figure 3 (b) implies the ILP-based technique fails to find good solutions in many designs. In contrast, we demonstrate the effectiveness of our proposed algorithm since it can obtain flexible solutions under a variety of resource and time constraints. Table 2 shows the results of runtime evaluation between the ILP-based technique and our proposed algorithm. Max, Min, and Mean indicate the maximum, minimum, and average runtime of scheduling among the various constraints, respectively. In the cases where the number of operations is less than a hundred, the runtime to find optimal solutions is almost the same. However, in a large benchmark with hundreds of operations, our proposed algorithm is successfully finding a feasible solution in a shorter time than the ILP-based technique.

4. Conclusions

We proposed a heuristic scheduling algorithm with variable-cycle approximate functional units in HLS. The proposed algorithm is polynomial, and we have demonstrated that it finds better solutions than existing techniques in a short time. Therefore, our algorithm is expected to be adopted in deep learning-based and image processing applications.

In the future, we plan to extend our algorithm to combine with other optimization techniques such as chaining, pipelining, and bit-width reduction. In addition, we are going to conduct a case study by the implementation of our algorithm into general HLS tools.

This work is supported partly by KAKENHI 20H00590, 20H04160, and 21K19776.

ACM Computing Surveys, vol. 48, no. IEEE Design & Test, vol. Acknowledgments References

4, 2016.

33, no. 1, 2016.

Design Automation Conference, 2015.

[18] C. Lee, M. Potkonjak, and W. H. M. Smi level synthesis of approximate accelerators using app

International Conference On Computer Aided Design, 2020

configurable part 2014.

International

[6]

Nepal ,

Li ,

R. I.

Bahar , and S. R