<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Heuristic Scheduling Algorithm with Variable-Cycle Approximate Functional Units in High-Level Synthesis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Koyu Ohata</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hiroki Nishikawa</string-name>
          <email>nishikawa.hiroki@ist.osaka-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangbo Kong</string-name>
          <email>kong@fc.ritsumei.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hiroyuki Tomiyama</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>A low- International Symposium on Quality Electronic Design, 2018. -power high-speed accuracy-controllable approximate Asia and South Pacific Design Automation Conference</institution>
          ,
          <addr-line>2018</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Graduate School of Information Science and Technology, Osaka University</institution>
          ,
          <addr-line>Osaka</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Graduate School of Science and Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>Shiga</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Computational approximation is a promising paradigm to exploit hardware capabilities or mitigate computational demands, taking advantage of inherent tolerance to errors in applications. For several decades, a variety of works for approximate computing have been extensively studied. In particular, accuracy-controllable approximate functional units have been developed to be configurable to control the accuracy at runtime. To satisfy the requirement of performance, hardware cost and accuracy, the techniques for design space exploration are important since the excessive use of approximate circuits results in loss of accuracy, while that of accurate circuits can hardly meet time constraints. This paper proposes a list scheduling algorithm in the high-level synthesis of accuracy-controllable approximate functional units with variable latency. The experiments demonstrate that our proposed scheduling method can efficiently explore the trade-offs between performance, hardware cost, and accuracy in a practical time, compared to the state-of-the-art methods.</p>
      </abstract>
      <kwd-group>
        <kwd>1 High-level synthesis</kwd>
        <kwd>approximate computing</kwd>
        <kwd>scheduling</kwd>
        <kwd>approximate multiplication</kwd>
        <kwd>list scheduling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Approximate computing is attracting attention to rapidly designing high-performance, low-cost and
low-power circuits for error-tolerant applications such as image processing and machine learning.
Design techniques for approximate arithmetic circuits have been studied at various design levels, from
the transistor to the architecture level [1-2]. The quality requirements of an error-tolerant application
using approximate computing may vary significantly at runtime. There have appeared several works
[3-5] on approximate functional units that can dynamically change accuracy at runtime. A
carrymaskable adder [3] was proposed to enable dynamically selecting the length of the carry propagation
to satisfy the quality requirements. Also, an accuracy-controllable multiplier [4] was proposed where
the last stage is generated by a carry-maskable adder. The approximate multiplier has an additional
input, and it is utilized to control the error at the output. If the error is large, the latency and power of
the circuit will be small. On the other hand, if the error is reduced, the latency and power of the circuit
will be large. Thus, the trade-off between error, power, and latency can be optimized at runtime by
using a functional unit with variable accuracy. Further on, the work in [5] proposed an approximate
multiplier that can dynamically change its accuracy by using a digital signal processor (DSP) with an
embedded field-programmable gate array (FPGA). These studies show that the more inaccurate the
computation, the faster it becomes.</p>
      <p>
        There is a large amount of literature on high-level synthesis (HLS) of approximate computing
circuits such as [
        <xref ref-type="bibr" rid="ref1">6-8</xref>
        ]. Unfortunately, most of the works previously mentioned have paid little attention
to allocation, scheduling, and binding algorithms, which are the crucial techniques in HLS. Regarding
HLS techniques aware of accuracy-controllable approximate functional units, there is the work [9]. The
work in [9] focused on accuracy-controllable functional units that can take variable cycles and proposed
a scheduling scheme that takes into account the difference in delays between accurate and approximate
computations in order to efficiently use accuracy-controllable approximate functional units. Their
experiments show that it realized the tradeoffs between the errors, the resource, and the time constraints.
However, the scheduling technique was based on integer linear programming (ILP) and the runtime for
scheduling becomes exponentially longer as increasing the problem sizes. In order to solve the issue
above, this paper presents a heuristic scheduling algorithm in HLS. Our proposal is based on a
resourceconstrained list scheduling algorithm and aims to minimize the output error by switching approximate
functional units that cause large errors in the output values to accurately with satisfying the resource
and time constraints.
      </p>
      <p>The remainder of the paper is organized as follows: Section II shows scheduling with variable-cycle
functional units in HLS, and we propose a heuristic scheduling algorithm to solve this scheduling
problem in polynomial time. Section III evaluates our proposed algorithm by comparing it to existing
work. Finally, we conclude this paper in Section IV.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Variable-Cycle Scheduling in High-Level Synthesis 2.1.</title>
    </sec>
    <sec id="sec-3">
      <title>Variable-Cycle Scheduling</title>
      <p>This section presents algorithm [9] for scheduling variable-cycle approximate functional units. In
the following section, for simplicity, we have an assumption that there are two types of functional units
in the scheduling problem: accuracy-controllable multiplier and arithmetic logic unit (ALU). An
accuracy-controllable multiplier, which has been developed in [5], is capable of dual operation modes.
One is called accurate mode that the multiplication takes two cycles and does not produce the error.
The other is called approximate mode that takes a single cycle and allows the error. A function other
than multiplication is assumed to use an ALU that takes a cycle. Variable-cycle scheduling determines
not only a schedule of operations in control states but also the accuracy mode for each multiplication
simultaneously, with the goal of error minimization at the output under a given resource and time
constraint. It should be noted that our proposed algorithm can be easily extended to consider another
specific functional unit, not ALU, and to enable its functional unit to take several execution modes as
well as the accuracy-controllable multiplier. The number of cycles for each function is not limited to
the assumption above. Although this work is targeted at multipliers [5] that can dynamically change
their computational accuracy, it can extend to other approximate adders [10, 11] and multipliers [12,
13].</p>
      <p>An example of variable cycle scheduling is shown in Figure 1. We are given a data flow graph
(DFG) as a directed acyclic graph. Each node represents an operation and each edge represents
precedence dependency between the operations. The DFG in this example consists of two
accuracycontrollable multipliers and two ALUs. The notation labeled as means a control state. For instance,
1 is the first state and is the second state. Figure 1 (a) shows that the multiplications are
approximated (determined to take one cycle), and it results in the shortest cycles (three cycles) but the
largest error due to the approximation at the
same time. In contrast, the multiplications in
Figure 1 (b) are determined to execute in the
accurate mode so that the operations can
achieve the smallest error at the expense of the
performance (four cycles). Figure 1 (c) shows
variable-cycle scheduling. Unlike Figure 1 (b),
the multiplication in the middle is
approximated, and the generated circuit
achieves shorter cycles than that of Figure 1
(b). In addition, multicycling the rightmost
multiplication realizes a circuit with an
accuracy higher than that of Figure 1 (a).</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Proposed</title>
    </sec>
    <sec id="sec-5">
      <title>Algorithm</title>
    </sec>
    <sec id="sec-6">
      <title>List</title>
    </sec>
    <sec id="sec-7">
      <title>Scheduling</title>
      <p>We propose a heuristic algorithm for the
variable-cycle scheduling problem and show a
pseudo code in Algorithm 1. Our scheduling
algorithm is based on a typical
resourceconstrained list scheduling. Given a DFG</p>
      <p>, our algorithm determines a schedule
of operations in control states and the accuracy
mode for each multiplication simultaneously.</p>
      <p>Let denote a set of operations and E denote
precedence dependency between the
operations. Recall that the operations are
classified into two types: the multiplication
with accurate and approximate modes and the
ALU-based operations. Let M denote a set of
multiplications and A denote a set of
ALUbased operations. In the set M, a subset called</p>
      <p>is denoted, and indicates a set of
approximate multiplications. Therefore, the
relationships are always held as follows;
V=M A and Apx M. The number of cycles
for each operation is given in advance. The
multiplication takes either cycles for
accurate mode and cycles for
approximate mode and the ALU-based
operations take cycles. In addition, we
denote and as a resource
constraint and denote as a time
constraint.</p>
      <p>The first process in the algorithm conducts
as-late-as-possible (ALAP) scheduling for the
DFG where all the multiplications assume to
be approximate mode and obtains the schedule</p>
      <p>for each operation (Lines 2-4). Then,
for each operation in the schedule, the
produced error is derived from the accurate
and approximate outputs with considering the
Algorithm 1. Proposed list scheduling algorithm</p>
      <p>,
do
do
endif
do
do
endif
endif
error propagation from predecessor operations due to approximation (Lines 5-7). In this step, it is not
possible to use error models such as [14, 15] for post-design analysis. Therefore, we adopt the error
propagation model presented in [16] and [17]. Consider a typical multiplication and let the
errors from the inputs and denote and respectively. Let the errors of the multiplication and its
product be denoted as and . The error is derived as Formula (1):
(1)
Accuracy is evaluated by the magnitude of the error. In other words, the smaller the error, the higher
the accuracy of the circuit.</p>
      <p>At the same time, the priority for each operation, is created, based on (Lines 9-13).
Next, the algorithm relaxes the schedule by multicycling the multiplication, where the multiplication
with the highest priority is selected, to reduce the error at the output. The algorithm iteratively conducts
its process while the time constraint is satisfied (Lines 15-50). Here, we introduce three sets of states as
, , and . Let denote a set of operations that are ready to execute, denote a set of operations during
execution, and denote a set of operations that are already finished. Initially, and are set to empty.
Let and denote the number of available multipliers and ALUs, respectively. The resources
and are released after the execution of multiplication and an operation by ALU, respectively.
represents the remaining cycles for each operation i. and represent the control states of start
and finish or each operation i, respectively. If an operation i takes a single cycle, and indicate
the same state. In Lines 18-34, functional units are ready to be executed and come to a set of ready
operations under the resource constraint. If an operation in is finished, the operation is removed
from and the resource is released. Then the operation becomes a member of . In order of priority, the
operations are started the execution (Lines 34-50). If the time constraint is missed, the schedule with
multicycling the selected multiplication is infeasible and the multiplication is returned to approximate
mode. Otherwise, the schedule is feasible and determined to be executed in accurate mode (Lines
5257). The processes are repeatedly conducted until all the multiplication has tried to switch accurate
mode from approximate mode. In the end, we obtain a feasible schedule such that the error is reduced
under resource and time constraints. The computational complexity of our algorithm is O(N2 log N) for
the number of multiplications N at a worst case.
2.3.</p>
    </sec>
    <sec id="sec-8">
      <title>Proposed Algorithm Example</title>
      <p>Our proposed method aims to minimize the error by accurately performing only those
multiplications that cause large errors in the output. For the operation from step 8 to step 59 of
Algorithm 1, an example problem in DFG with three multiplications and two additions is shown in
Figure 2. We assume one cycle for approximate multiplications, two cycles for exact multiplications,
and one cycle for additions. The resource constraint is list scheduling as two accuracy-controllable
approximate multipliers and one adder. The time constraint is four cycles. Figure 2 (a) is the DFG given,
and the number at the bottom left of the multiplication is assumed to be the magnitude of the influence
of the error on the output value obtained in step 6 of Algorithm 1. First, as shown in Figure 2 (b), all
the multiplications are approximate multiplications, and resource constraint-based list scheduling is
performed. Next, the list scheduling is performed again with the multiplications with the largest errors
made exactly. At this time, since all the operations can be performed within the time constraint of four
cycles, the multiplication that was made exact in Figure 2 (c) is kept exact for the next scheduling. Next,
as shown in Figure 2 (d), the multiplication with the second-largest error is made exact and list
scheduling is performed. However, in Figure 2 (d) the time constraint of four cycles has been exceeded,
so the upper left multiplication is returned to approximation. Next, the multiplication with the smallest
error is made exact and list scheduling is performed. The result of Figure 2 (e) satisfies the time
constraint, so we keep the multiplication as exact. Finally, since all the multiplications have been done
correctly, the result of Figure 2 (e), which will have the smallest error so far, is output as the final
solution as shown in Figure 2 (f). In this example, we have shown 1-2 cycles for simplicity, but our
proposed method can be applied to operations with more than 3 cycles.
2
x
e
1
(a) The given DFG
(b) Approximate all multiplications
(c) Exact multiplication with the highest error</p>
    </sec>
    <sec id="sec-9">
      <title>3. Experiments 3.1.</title>
    </sec>
    <sec id="sec-10">
      <title>Setup</title>
      <p>In the experiments, we have compared our heuristic algorithm to an existing ILP-based technique
[9], which uses CPLEX 12.10 as a solver on a PC with an AMD Ryzen 7 PRO 4750G CPU and 64GB
of main memory. We utilize MediaBench [18] for the benchmark programs. In the experiments, we are
given resource and time constraints exhaustively to observe trade-offs. Note that the number of
accuracy-controllable multipliers is restricted as a resource constraint but that of ALUs is ignored since
sharing of ALUs largely incurs an overhead. The runtime for evaluation is limited to up to an hour in
wall-clock time. If an optimal solution is not found within an hour, the best solutions at that time is
employed to compare.
3.2.</p>
    </sec>
    <sec id="sec-11">
      <title>Results</title>
      <p>The experimental results are shown in Table 1 and Figure 3. Table 1 shows the comparison between
our proposed algorithm and the ILP-based technique [9]. In the table, Nodes represent the total number
of operations for each benchmark. Mult means the number of multiplications among them. Designs
indicate the number of design problems for a variety of resource and time constraints. Wins, Losses,
and Draws are the number of designs that our proposed algorithm that outperforms the ILP-based
technique in accuracy even for just a bit, the ILP-based technique outperforms our proposed algorithm,
and they find the same solutions, respectively. Figure 3 shows, on the other hand, the relationship
between the output error, resource constraint, and time constraint in the cases of Motion Vectors
Detector and Matrix Inversion. The abscissa axis represents the resource constraint, the ordinate axis
represents the time constraint, and the applicate axis represents the error at the output. Note that the
output error is expressed as the relative value of the error contained in the output to the accurate
computation result, and the applicate axis is scaled in logarithm in Figure 3 (b).</p>
      <p>The results in Table 1 indicate that our proposed algorithm finds almost the same errors as the
ILPbased technique in the cases where the number of operations is less than a hundred. In addition, the
cases have enabled the ILP-based technique to find optimal solutions; therefore, our algorithm explicitly
obtains an optimal solution in many cases. In Motion Vectors Detector, our algorithm produces larger
errors than the ILP-based technique in 15 designs. According to the results in Figure 3 (a), however,
the output errors look slightly larger but almost the same since the output errors by each technique are
negligibly small.</p>
      <p>Over a hundred of the nodes, the ILP-based technique can hardly find an optimal solution within the
runtime. In especially Matrix Inversion, Figure 3 (b) implies the ILP-based technique fails to find good
solutions in many designs. In contrast, we demonstrate the effectiveness of our proposed algorithm
since it can obtain flexible solutions under a variety of resource and time constraints. Table 2 shows the
results of runtime evaluation between the ILP-based technique and our proposed algorithm. Max, Min,
and Mean indicate the maximum, minimum, and average runtime of scheduling among the various
constraints, respectively. In the cases where the number of operations is less than a hundred, the runtime
to find optimal solutions is almost the same. However, in a large benchmark with hundreds of operations,
our proposed algorithm is successfully finding a feasible solution in a shorter time than the ILP-based
technique.</p>
    </sec>
    <sec id="sec-12">
      <title>4. Conclusions</title>
      <p>We proposed a heuristic scheduling algorithm with variable-cycle approximate functional units in
HLS. The proposed algorithm is polynomial, and we have demonstrated that it finds better solutions
than existing techniques in a short time. Therefore, our algorithm is expected to be adopted in deep
learning-based and image processing applications.</p>
      <p>In the future, we plan to extend our algorithm to combine with other optimization techniques such
as chaining, pipelining, and bit-width reduction. In addition, we are going to conduct a case study by
the implementation of our algorithm into general HLS tools.</p>
      <p>This work is supported partly by KAKENHI 20H00590, 20H04160, and 21K19776.</p>
      <sec id="sec-12-1">
        <title>ACM Computing Surveys, vol. 48, no.</title>
      </sec>
      <sec id="sec-12-2">
        <title>IEEE Design &amp; Test, vol.</title>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>Acknowledgments References</title>
      <p>4, 2016.</p>
      <p>33, no. 1, 2016.</p>
      <sec id="sec-13-1">
        <title>Design Automation Conference, 2015.</title>
        <p>[18] C. Lee, M. Potkonjak, and W. H. M. Smi
level synthesis of approximate accelerators using app</p>
      </sec>
      <sec id="sec-13-2">
        <title>International Conference On Computer Aided Design, 2020</title>
        <p>configurable part
2014.</p>
      </sec>
      <sec id="sec-13-3">
        <title>International</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nepal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. I.</given-names>
            <surname>Bahar</surname>
          </string-name>
          , and S. R
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>