<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Minimization of Bitsliced Representation of 4×4 S-Boxes based on Ternary Logic Instruction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yaroslav Sovyn</string-name>
          <email>yaroslav.r.sovyn@lpnu.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Volodymyr Khoma</string-name>
          <email>volodymyr.v.khoma@lpnu.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Opirskyy</string-name>
          <email>ivan.r.opirskyi@lpnu.ua</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerii Kozachok</string-name>
          <email>v.kozachok@kubg.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Borys Grinchenko Kyiv University</institution>
          ,
          <addr-line>18/2, Bulvarno-Kudryavska str., Kyiv, 04053</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>12 Stepan Bandera str., Lviv, 79000</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>12</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>The article is devoted to methods and tools for generating software-oriented bit-sliced descriptions of bijective 4×4 S-Boxes with a reduced number of instructions based on a ternary logical instruction. Bitsliced descriptions generated by the proposed method make it possible to improve the performance and security of software implementations of crypto-algorithms using 4×4 S-Boxes on various processor architectures. The paper develops a heuristic minimization method that uses a ternary logical instruction, which is available in 86-64 processors with AVX-512 support and some GPU processors. Thanks to the combination of various heuristic techniques (preliminary calculations, exhaustive search to a certain depth, refinement search) in the method, it was possible to reduce the number of gates in bit-sliced descriptions of S-Boxes compared to other known methods. The corresponding software in the form of a utility in the Python language was developed and its operation was tested on 225 SBoxes of various crypto-algorithms. It was established that the developed method generates a bit-sliced description with a smaller number of ternary instructions in 90.2% of cases, compared to the best-known method implemented in the sboxgates utility.</p>
      </abstract>
      <kwd-group>
        <kwd>1 Bit-slicing</kwd>
        <kwd>ternary logic instruction</kwd>
        <kwd>4×4 S-Box</kwd>
        <kwd>CPU</kwd>
        <kwd>logic minimization</kwd>
        <kwd>software implementation</kwd>
        <kwd>sboxgates</kwd>
        <kwd>speed</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Given the ever-increasing volumes and speeds
of data processing, a very important requirement
for Cryptographic Algorithms (CA) is to provide
sufficiently high performance for a wide class of
microprocessor architectures [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The no less
important aspect for software implementation of
cryptographic algorithms is the increased
resistance to side-channel attacks: for low-end
CPUs (8/16/32-bit microcontrollers) these are
primarily energy consumption analysis attacks,
and for high-end CPUs (86, ARM Cortex-A) are
primarily time and cache attacks. To ensure the
high performance of crypto-algorithms, various
approaches to their software implementation are
used. This includes the creation of precomputed
tables (Lookup Tables, LUT) for certain
operations, the integration of hardware
cryptoaccelerators into the processor (e.g. AES-NI in
86 processors), the use of SIMD technology for
parallelization of the encryption process (e.g.
SSE/AVX2/AVX-512 vector instructions in 86–
64 CPU), the use of GPU computing power, etc.
However, all these approaches have several
limitations and cannot always be implemented in
a specific processor.
      </p>
      <p>
        Bit-slicing [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is one of the promising
approaches that provide a high-performance
constant-time implementation of a CA with
immunity to time and cache attacks [4]. It makes
the most of the capabilities of modern high-end
microprocessors to increase performance due to
the parallelization of both code execution and data
processing and also allows adaptation for low-end
CPU and hardware implementation on FPGA and
ASIC. For many CAs, it is the bit-sliced approach
that provides the highest speed in software
implementation (if hardware crypto accelerators
are not used) for various types of processor
architectures [
        <xref ref-type="bibr" rid="ref3 ref4">3–9</xref>
        ]. The absence of references to
precomputed tables in memory and cache, as well
as data-dependent conditional transitions, makes
bit-sliced implementations invulnerable to time
and cache attacks, at the same time complicating
attacks through third-party channels.
      </p>
      <p>
        The basic idea of bit-slicing is to convert a
cryptographic algorithm into a sequence of bit
logical operations of the type AND, XOR, OR,
NOT. In processors, each such logical operation
can be represented by a corresponding instruction,
and in hardware—by a corresponding gate (we
will use the concepts of gate and instruction as
synonyms in this work). The high speed of
software bit-slicing is achieved because the CPU
processes many cipher elements (bytes, blocks) in
parallel, using fast logical instructions and easier
execution of some operations (for example, bit
permutations, shifts, etc.). For software-oriented
bit-sliced implementations, in addition to classic
ones, it is possible to use more complex logical
instructions supported by a certain processor, and
thus reduce their total number. So, for example,
many processors support the AND-NOT
instruction (86–64, ARM), some NOR and
NAND (ARM), etc. [
        <xref ref-type="bibr" rid="ref5">10</xref>
        ].
      </p>
      <p>In order to get the maximum speed, you need
to minimize the number of logical operations
included in the bit-sliced description of the crypto
algorithm. Most cryptographic operations
generate a single-valued description when going
to a bit-sliced description or do not give much
room for minimization except for nonlinear
transformations. In CA, nonlinear replacement
operations are specified in the form of n×m
LUTtables, so-called S-Boxes, which are mostly 4×4
(n = 4) or 8×8 (n = 8) bits in size. Tables of 4×4
bits are typical for both lightweight crypto
algorithms specially designed for efficient
implementation on resource-constrained
processors (e.g. block ciphers PRINCE, LED,
Piccolo, hash functions PHOTON, Spongent) and
general purpose crypto algorithms (e.g. block
symmetric ciphers Serpent, Twofish, Magma,
hash functions BLAKE, Whirlpool).</p>
      <p>Thus, the main resource for increasing
performance in the bit-sliced implementation of
the CA is the representation of S-Boxes with the
minimum possible number of logic
gates/instructions. This problem is NP-complete
and allows an exact solution only for very simple
cases (n ≤ 3 and some n = 4), so most modern
methods and utilities for generating bit-sliced
S-Boxes use heuristic approaches that do not
guarantee that the obtained solution is optimal but
provide a much better result compared to
universal methods of minimizing logical
functions (for example, the method of Carnot
maps as well as the method of simple
QuineMcCluskey implicants).</p>
      <p>In addition to traditional dual-operand logic
instructions, some processors support the ternary
logic instruction ternary logic (a, b, c, imm8),
which allows calculating an arbitrary Boolean
function from the three operands a, b, c specified
by the truth table in the 8-bit variable imm8
(Fig. 1).</p>
      <p>The Ternary Instruction (TI) is present in the
following processors:
• 86–64 with support for 512-bit SIMD
instructions from the AVX-512F extension.
Since the vpternlogq zmm_a, zmm_b, zmm_c,
imm8 instruction is included in the AVX-512F
(Foundation) basic extension, it is supported
by all processors with AVX-512 technology.
• some GPU processors. For example, on
an Nvidia GPU, this instruction has the form
lop3.b32 d, a, b, c immLut, which calculates
over the 32-bit operands lop3.b32 d, a, b, c
immLut, the logical function given by the truth
table in immLut and stores the result in d.</p>
      <p>One ternary instruction can replace several
dual-operand logical instructions, so its use allows
generating a bit-sliced description with a
significantly smaller number of operations, and
therefore increasing the speed of program
implementation. However, in this case, even for
4-bit S-Boxes, finding a guaranteed optimal
representation is mostly impossible and it is
necessary to use heuristic minimization methods.
However, the existing heuristic minimization
methods are mostly focused on the use of
dualoperand instructions, so they cannot be directly
used to generate a bit-sliced description based
on TI.</p>
      <p>Therefore, the problem of finding the optimal
bit-sliced representation based on the ternary
instruction even for 4×4 S-Boxes is far from being
solved, which requires the search for new
heuristic approaches, one of which is presented in
our work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Review of Literary Sources</title>
      <p>Let us analyze the methods and means for
finding the bit-sliced description of 4×4 S-Boxes
according to the Bitslice Gate Complexity (BGC)
criterion, which denotes the optimal solution with
the minimum number of operations.</p>
      <p>
        The bit-sliced approach to cryptographic
representation was first proposed by E. Biham in
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to speed up the software implementation of the
DES cipher. In the same work, E. Biham
described the algorithm of bit-sliced
representation of DES S-Boxes (6×4) by logic
gates XOR, AND, OR, NOT. In the algorithm,
from six input variables, two input variables have
been selected that form all possible combinations
with the help of given logic gates, from which four
output variables are then constructed. On average,
with a bit-sliced description using this method,
one DES S-Box requires 100 gates.
      </p>
      <p>In [8], M. Kwan proposed a much more
efficient approach to finding a bit-sliced
representation using DES S-Boxes as an example.
It treats each S-Box output bit as a function of the
six input bits, represented by a Carnot map, and
placed in a 64-bit variable. All input and
intermediate variables can also be considered as
6-bit Carnot maps described by 64-bit numbers.
Then the task is formulated as follows: it is
necessary to combine the existing input and
intermediate maps in such a way as to obtain the
desired output variable. One input variable acts as
a selector combining the functions of five
variables. To find the representation of functions
of five variables with the minimum number of
gates, an exhaustive search (brute force) is used,
and the gates are found in the previous steps.
Depending on the order in which the search is
carried out, 6! options are available for input
variables, and 4! Options—for output variables.
This gives a total of 17,280 search options, among
which the option with the minimum number of
gates is selected. As a result, the average number
of gates for a bit-sliced description of one DES
SBox decreased from 100 to 56.</p>
      <p>To minimize S-Boxes, the SAT-Solvers
programs can be used, designed to effectively
solve the feasibility problem of Boolean formulas
(SATfeasibility problem, SAT). The object of the
SAT problem is a Boolean formula consisting
only of constants (0/1), variables, and AND, OR,
and NOT operations. The problem is as follows:
can all variables be assigned the values False and
True so that the formula becomes true?
Specialized SAT-Solvers programs, built on
efficient solution algorithms, accept a set of
equations as input and produce the result in the
form of SAT if a solution is found and UNSAT if
no solution is found. To find a logic circuit with a
given number of gates, you can form an equation
where the variables would specify all possible
connections between gates and the operation and
try to solve them with the help of SAT-Solvers.
The advantage of this approach is that if a solution
with n gates (SAT) is found and UNSAT is
obtained for n—1 gates, then we are guaranteed to
have found the minimum possible bit-sliced
description.</p>
      <p>
        SAT-Solvers were used in the works [
        <xref ref-type="bibr" rid="ref4 ref5">9, 10</xref>
        ] to
find the bit-sliced representation of some 4-bit
S-Boxes. In general, the problem with
SATSolvers is that they do not always find solutions
for “heavy” S-Boxes that require more than 12–13
gates. For relatively simple S-Boxes with 11–13
gates, SAT-Solvers cannot always prove that the
found representation is minimal.
      </p>
      <p>
        The work [
        <xref ref-type="bibr" rid="ref6">11</xref>
        ] describes the open-source
utility LIGHTER, which is currently the most
effective utility for finding the bit-sliced
description of 4×4-bit S-Boxes. LIGHTER can
flexibly specify a set of two and three-inlet gates
and their weighting factors, which are taken into
account during minimization. This allows more
realistic optimization in the case of hardware
implementation, when different logic gates differ
in crystal area, power consumption, delay, etc.,
due to the consideration of these parameters in the
weighting factors. For a software implementation,
when logical instructions are equivalent, it is
enough to set the same weighting coefficients for
all gates.
      </p>
      <p>The LIGHTER search algorithm itself
combines two approaches: searching using the
Breath-First-Search (BFS) algorithm and the
Meet-In-The-Middle (MITM) strategy. That is,
two graphs are built: one starts from the base
vectors and performs a forward search, and the
other starts from the searched vectors and
performs a backward search. Both graphs move
toward each other using the given logical
operations until they meet. Next, a path is selected
that connects these two graphs with the minimum
weight that takes into account the weighting
factors for each gate. The utility demonstrates
high time efficiency compared to SAT methods,
and its results, although they cannot be considered
optimal, are quite close to the results obtained by
SAT utilities.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">12</xref>
        ], the open-source utility Peigen
(Platform for Evaluation, Implementation, and
Generation of S-boxes) is described, which makes
it possible to find a bit-sliced description of
S-Boxes in various logical bases, applying the
specified minimization criteria for hardware and
software implementations. Peigen’s bit-sliced
description search algorithms are based on
algorithms from the LIGHTER utility, but their
time efficiency has been improved, in particular,
recalculation and several additional techniques
have been used. However, even with the
improvements made, the utility only works
effectively with 4-bit S-Boxes.
      </p>
      <p>
        Generating an optimized bit-sliced
implementation of the CA requires considerable
time spent on writing and debugging the code and
requires a good knowledge of processor
architecture, low-level tools, and optimization
techniques at the hardware and software levels.
Therefore, in [
        <xref ref-type="bibr" rid="ref8">13</xref>
        ], the high-level Usuba language
is presented, which makes it possible to describe
a symmetric cryptographic primitive, and the
Usuba compilator itself will generate a highly
optimized, parallelized, and vectorized bit-sliced
code. However, to generate the bit-sliced
description of the S-Box, either a simple
minimization algorithm is used, which gives a far
from the optimal result, or a ready-made
optimized description is taken from the database
included in Usuba if the S-Box is present in it.
Thus, description generation for S-Box is a weak
point of the Usuba bit-sliced compiler.
      </p>
      <p>
        The considered methods and utilities form a
bit-sliced description using mainly two-input
logic elements and do not support a ternary logic
instruction. The only utility known to us today for
generating bit-sliced descriptions of S-Boxes
based on the ternary instruction is sboxgates [
        <xref ref-type="bibr" rid="ref9">14</xref>
        ].
This open-source utility implements the M. Kwan
algorithm with some improvements and is able to
generate a bit-sliced description for arbitrary
S-Boxes up to and including 8×8. Many
optimizations of M. Kwan’s algorithm in
sboxgates are borrowed from the
SBOXDiscovery project, which was intended
exclusively for generating bit-sliced descriptions
of DES S-Boxes. The utility allows you to specify
an arbitrary set of two-input gates, use a ternary
logic instruction, specify the number of iterations
of the search algorithm, parallelize the search
between processor cores, etc. [
        <xref ref-type="bibr" rid="ref10 ref11">15,16</xref>
        ]. In the case
of 4×4 S-Boxes, sboxgates produce results that, as
the article will show, can be greatly improved,
which is a price for versatility.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Setting Objectives</title>
      <p>The purpose of our article is to present a
method and a utility for generating a bit-sliced
description of bijective 4×4 S-Boxes based on a
ternary logical instruction, which provides better
results compared to existing ones, and this will
make it possible to increase the speed and security
of hardware and software implementations of a
wide range of cryptographic algorithms, which
use S-Boxes of a given type.</p>
      <p>S-Boxes representation format for bit-sliced
implementation</p>
      <p>In the specifications of cryptographic
algorithms, S-Boxes are mostly specified in the
form of LUT tables. For example, the 4×4 S-Box
of the PRESENT cipher has the form shown in
Table 1
x
S(x)</p>
      <p>In the bit-sliced representation, LUT tables are
treated as logical functions defined by truth tables.
For example, the S-Box of the PRESENT cipher
will have the form shown in Table 2.</p>
      <p>So, a compact representation of this S-Box in
the form of a truth table will have the following
form: S(x) = y, where x = {x0, x1, x2, x3} =
{0xff00, 0xf0f0, 0xcccc, 0xaaaa}— input
bitsliced variables, y = {y0, y1, y2, y3} = {0x0ed9,
0x3687, 0xa74c, 0x659a}—the output bit-sliced
variables defining a specific substitution table,
and we will call the 16-bit numbers that specify x
and y as vectors. The task of finding a bit-sliced
S-Box representation according to the BGC
criterion can be formulated as follows: given four
base vectors base = {x0, x1, x2, x3}, you need to
find the vectors y = {y0, y1, y2, y3} using the
minimum number of ternary logical instructions
ternary logic (a, b, c, imm8).</p>
    </sec>
    <sec id="sec-4">
      <title>4. Preliminary Calculations</title>
      <p>At the pre-computation stage, certain data are
found and stored once, which are then repeatedly
used in our bit-sliced description search
algorithm. This data is of two types:</p>
      <p>1. For each 16-bit vector v there is a
BGC(v)—the minimum number of ternary
instructions required for its representation, the
socalled “complexity” of the vector.</p>
      <p>Since vectors are represented as 16-bit
numbers, there are 65536 vectors in total, four of
which are based vectors base = {x0-x3} and two
are logical constants const = {0x0000, 0xffff} to
denote 0 and 1 for which BGC is 0, so there
remain 65530 vectors whose complexity needs to
be estimated. Table 3 presents the found
distribution of vectors according to their BGC
value.</p>
      <p>As can be seen from Table 3, a maximum
complexity is 3, meaning that any 16-bit vector
can be represented using no more than 3 ternary
instructions. This gives an upper estimate of the
bit-sliced complexity of an arbitrary S-Box
described by four vectors y0-y3 equal to 12th TI.</p>
      <p>For an indirect preliminary assessment of the
“complexity” of the S-Box, such an indicator as
the total value of the complexity of the vectors
y0-y3 can be used: the closer the total value is to
12, the more TI should be expected in the
bitsliced description and vice versa. For example,
SBox UDCIKMP11 (bit-sliced description requires
4 instructions) has the minimum total value of 6
of all S-Boxes considered in the article, and S-Box
mCrypton_S0 has the maximum total value of 12
(bit-sliced description requires 8 instructions).</p>
      <p>2. Construction of a LUT table for
representing all graphs to a depth of ge = 2
instructions.</p>
      <p>The table is built step by step. In the first step,
the table q0 is built, which contains all possible
values that can be obtained from the base vectors
х0-х3 with the help of one ternary instruction and
which are not included in base and const. For this,
the VECT_SQUARE algorithm is used, which
forms all possible combinations from the input
vectors using TI.</p>
      <p>There are a total of 936 such vectors. The
generated values are entered into the table, and the
х0-х3 values are not stored in the table to save
memory, although they are implicitly present for
each row. Therefore, the table
q0 = VECT_SQUARE ({x0, x1, x2, x3}) has a
dimension of 936×1 (Fig. 2).</p>
      <p>Each ith line of the table q0 can be considered
as a new basis {x0-x3, q0[i]}, which is used to
generate all possible vectors in the next step, and
the LUT lines of the tables themselves will be
called graphs.</p>
      <p>So, in particular, in the second step, the table
q1 = GEN_TABLE(q0) is generated, for which the
VECT_SQUARE algorithm again generates from
each row of the table q0 all possible values that can
be obtained from the base vectors х0-х3 and q0[i]
using one ternary instruction, lines are formed
from two vectors and added to the general table.
After that, logically equivalent graphs are filtered:
if the rows of table q1 contain the same values in
any order, then only one row remains (Fig. 3).
minimum TI and returns the constructed
complement matrices of graphs gr0-gr3. From the
obtained results, the graph with the minimum
BGC value is selected (Fig. 4).</p>
      <p>Further construction of the tables is
impractical, as they require too much memory due
to the large value of the branching coefficient.
Table q1 will be used to form all possible unique
graphs for the representation of vectors у.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Search Algorithm</title>
    </sec>
    <sec id="sec-6">
      <title>Representation of</title>
    </sec>
    <sec id="sec-7">
      <title>Bitsliced</title>
      <p>At the top level of the search algorithm, all
values у0-у3 are sorted, the matrix of candidate
graphs gri = STEP_0(yi) is generated for each of
them from the precomputed LUT-table q1 and
transferred to the depth-first search algorithm
FIND_BS(gri).</p>
      <p>The depth-first search algorithm FIND_BS
finds the remaining values у trying to use the</p>
      <p>STEP_0
STEP_0
STEP_0
STEP_0
gr0
gr1
gr2
gr3</p>
      <p>FIND_BS</p>
      <p>So, the search algorithm performs four
iterations, starting with different values of у. Let
us denote this initial value by уstart. At the stage
gri = STEP_0(уstart), using the LUT table q1, the
matrix of graphs gri, is generated, containing all
possible graphs with the vector уstart at a certain
depth dstart of the gates. Depending on which BGC
group the уstart vector belongs to, heuristically
selected dstart values are presented in Table 5 to
ensure acceptable calculation time and amount of
required memory.</p>
      <p>So, if, for example, BGC(y0) = 2, then the
matrix of graphs gr0 after STEP_0 will contain all
possible graphs with a length of 3 instructions
(dstart = 3) in which the vector у0 occurs.</p>
      <p>Next, the candidate graphs in gri are sorted into
three groups: gr_1y, gr_2y, gr_3 with the same
number of vectors y in each graph of the group—
1, 2, and 3, respectively. We denote this number
by y _find. Next, the search is conducted for each
non-empty group separately according to Fig. 5.</p>
      <p>The FIND_NEXT algorithm makes alternate
searches уі until all four values у0-у3 are found.
The matrix of graphs gr in the form of an n×m
table, each row of which contains y_find values
from the set {у0-у3}, is given to the input. Each
row of the table stores m vectors explicitly and
vectors х0-х3 implicitly.</p>
      <p>First, for group gr, the minimum distance dmin
is estimated, at which the closest value of уx is
located among all graphs—ESTIMATE_DEPTH.
For this purpose, the FAST_FIND function of
comprehensive forward search to a given depth of
1/2/3 steps has been developed. The search and
selection of options are carried out using the
algorithm of depth-first search with iterative
deepening—IDDFS (Iterative Deepening
DepthFirst Search).</p>
      <p>After the dmin estimate is found using the
GEN_DEPTH algorithm, a transition is made
from the set of graphs with y_find = n_y to the set
of graphs with y_find = n_y + 1.</p>
      <p>For this, the graphs with the found value dmin
are selected from the group gr and a step forward
gr = GEN_TABLE(grmin) is made. For the
generated set gr, graphs with the found value
d = dmin—1 are selected again, a step forward is
made for them, and so on, until d becomes equal
to 0. After that, only graphs containing n_y + 1
values of y are selected for the group. Then these
steps are repeated until all values of y are found.</p>
      <p>The most computationally intensive
procedures ESTIMATE_DEPTH and
GEN_DEPTH are implemented by using GPU
and OpenCL technology, which makes it possible
to significantly parallelize calculations and reduce
the algorithm’s operating time.</p>
      <p>At each step, the FIND_BS algorithm
evaluates the minimum distance dmin, at which the
nearest value of уx is located, and generates the
corresponding graphs. As shown in Fig. 6, this
route starts with the graphs containing ya,
generated using STEP_0, from which the nearest
value yb is located at the distance dab of the gates,
then we go to yс located at the minimum distance
dbc from yb and at the distance dcd we find the last
vector yd.</p>
      <p>However, not always moving in minimum
steps along the trajectory from vector ya to yd gives
the optimal result in general (although it is so in
most cases). There may be a situation when
choosing the minimum value of d in the first steps
leads to larger values of d in the following steps
and, as a result, to a non-optimal logical
representation.</p>
      <p>For example, let’s assume that at the first step,
we got dab = 1, at the second dbc = 2, and at the
third—dcd = 2, that is, the route will be a total of 5
TIs (Fig. 6). But it is possible that if at the first
step, we followed a different route and graphs
with dab = 2 would be selected, then in the second
step it would be possible to find the value of ус
with dbc = 1 and in the third step yd with dcd = 1,
and we would get a shorter total route with 4 TIs.
Hence, the second route resulted in a bit-sliced
representation with a lower BGC value.</p>
      <p>To take into account different possible routes
in the search algorithm, refinement searches are
carried out according to the scheme presented in
Fig. 7. If we have a set of graphs containing 3 out
of 4 possible values of у, then the search for the
fourth value is always carried out at the minimum
possible depth dmin (SEARCH_3Y). For graphs
with two values in у (y_find = 2), the third value
is searched for by two routes: dmin і dmin+1, after
which the SEARCH_3Y search is applied to the
found graphs with y_find = 3. For graphs with one
value in у (y_find = 1) , the search for the second
value takes place along three routes: dmin, dmin+1 і
dmin+2 , after which the SEARCH_2Y search is
applied to the found graphs with y_find = 2.
SEARCH_3Y
gr_3y
y_find = 3 dmin yd
(ya,yb,yc)</p>
      <p>SEARCH_2Y
gr_2y dmin yc
y_(fyina,dyb=) 2 dmin+1 yc
-SEARCH_3Y
SEARCH_3Y</p>
      <p>SEARCH_1Y
dmin yb
y_gfirn_d1y= 1 dmin+1 yb
(ya) dmin+2 yb</p>
      <p>SEARCH_2Y
SEARCH_2Y
SEARCH_2Y</p>
    </sec>
    <sec id="sec-8">
      <title>6. Results</title>
      <p>The method proposed in the work was
implemented in the Python language, and to
ensure speed, the main data processing functions
are implemented based on the numpy and
pyopencl libraries.</p>
      <p>
        To evaluate our algorithm, 225 4×4 S-Boxes of
various cryptographic algorithms were taken. The
open-source sboxgates project was used to obtain
a BGC estimate for selected S-Boxes and to be
able to compare with our results. Bitsliced
descriptions of S-Boxes obtained by our method
are available at the link [
        <xref ref-type="bibr" rid="ref10">15</xref>
        ].
      </p>
      <p>The results are presented in Table 6. Column
data in Table 6 should be interpreted as follows:</p>
      <p>LUT is a representation of S-Box in the tabular
form, where the line ‘0123456789abcdef’ should
Таble 6
BGC comparison for different S-Boxes
be understood as S(x) = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15.</p>
      <p>BS—representation of S-Box in
bit-slicedformat. The line ‘0ed9_3687_a74c_659a’ should
be understood as: y0 = 0x0ed9, y1 = 0x3687,
y2 = 0xa74c, y3 = 0x659a.</p>
      <p>CY is BGC of vectors у0-y3. The line ‘2133’
should be interpreted as follows: BGC(y0) = 2,
BGC(y1) = 1, BGC(y2) = 3, BGC(y3) = 3.</p>
      <p>OURS contains BGC values obtained using the
method described in the article.</p>
      <p>
        SG contains the BGC value obtained using the
sboxgates utility; the number of iterations for the
search was set to 1000 [
        <xref ref-type="bibr" rid="ref4">9</xref>
        ]. The red color in the SG
column indicates the S-Boxes that have a higher
BGC value compared to the one obtained by our
algorithm. The yellow ones have the same BGC
value as our algorithm.
      </p>
      <p>S-Box
Piccolo
Piccolo-1
LAC
Prost
Rectangle
Rectangle-1
Minalpher
Skinny
TWINE
PRINCE
Lucifer_S0
Lucifer_S1
Present
Present-1
JH_S0
JH_S1
Iceberg_S0
Iceberg_S1</p>
      <p>In general, as evidenced by the results in
Table 7, the method we developed showed
significantly better results compared to the
competitor represented by the sboxgates utility.
For 203 S-Boxes out of 225 (90.2%), our
method provides a bit-sliced description with
fewer TIs. The total number of ternary
instructions to represent all 225 S-Boxes in our
method is 1582, which is 15.3% less compared to
1867 instructions for the sboxgates utility. The
sboxgates utility did not generate a bit-sliced
description with fewer instructions than
obtained by our method for any S-Box, and for
only 22 S-Boxes (9.8%) it was able to generate a
bit-sliced description with the same BGC value as
our algorithm.</p>
    </sec>
    <sec id="sec-9">
      <title>7. Conclusions</title>
      <p>The paper presents a method for generating a
bit-sliced description of arbitrary 4×4 bijective
S-Boxes with a reduced number of ternary logic
instructions. The obtained descriptions make it
possible to generally increase the speed of
software implementations of the corresponding
crypto-algorithms on any processors that support
the 3-operand ternary logic instruction
(CPU/GPU). To date, the method proposed in the
article is the most effective method known to us
according to the BGC criterion, which confirms
the research results presented in the work. The
method combines heuristic techniques at various
stages of searching a bit-sliced representation, in
particular: recalculation, exhaustive search to a
depth of up to 3 gates using GPU, IDDFS
algorithm for searching and cutting options,
refinement search, which by this set of measures
ensure its efficiency and acceptable speed of
action.</p>
    </sec>
    <sec id="sec-10">
      <title>8. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bessalov</surname>
          </string-name>
          , et al.,
          <source>Computing of Odd Degree Isogenies on Supersingular Twisted Edwards Curves, in: Workshop on Cybersecurity Providing in Information and Telecommunication Systems</source>
          , vol.
          <volume>2923</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bessalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sokolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Skladannyi</surname>
          </string-name>
          ,
          <article-title>Modeling of 3- and 5-Isogenies of Supersingular Edwards Curves</article-title>
          ,
          <source>in: 2nd International Workshop on Modern Machine Learning Technologies and Data Science, no. I</source>
          , vol.
          <volume>2631</volume>
          (
          <year>2020</year>
          )
          <fpage>30</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Biham</surname>
          </string-name>
          , A Fast New DES Implementation in Software,
          <source>International Workshop on Fast Software Encryption</source>
          ,
          <year>1997</year>
          ,
          <fpage>260</fpage>
          -
          <lpage>272</lpage>
          . E. Kasper,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwabe</surname>
          </string-name>
          ,
          <article-title>Faster and TimingAttack Resistant AES-GCM</article-title>
          ,
          <source>in Proc. 11th Int. Workshop Cryptographic Hardware and Embedded Systems</source>
          (
          <year>2009</year>
          )
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          . A.
          <string-name>
            <surname>Adomnicai</surname>
          </string-name>
          , T. Peyrin, Fixslicing AESLike Ciphers:
          <article-title>New bitsliced AES Speed Records on ARM-Cortex M and RISC-V, IACR Transactions on Cryptographic Hardware</article-title>
          and
          <source>Embedded Systems</source>
          ,
          <volume>1</volume>
          (
          <year>2021</year>
          )
          <fpage>402</fpage>
          -
          <lpage>425</lpage>
          . P.
          <article-title>Schwabe K Stoffelen, All the AES You Need on Cortex-M3 and M4</article-title>
          , International Conference on Selected Areas in Cryptography,
          <year>2016</year>
          ,
          <fpage>180</fpage>
          -
          <lpage>194</lpage>
          . J. Zhang, M. Ma, P. Wang,
          <article-title>Fast Implementation for SM4 Cipher Algorithm Based On Bit-Slice Technology</article-title>
          ,
          <source>International Conference on Smart Computing and Communication</source>
          ,
          <year>2018</year>
          ,
          <fpage>104</fpage>
          -
          <lpage>113</lpage>
          . N.
          <string-name>
            <surname>Nishikawa</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Amano</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Iwai</surname>
          </string-name>
          , Implementation of Bitsliced AES Encryption on
          <string-name>
            <surname>CUDA-enabled</surname>
            <given-names>GPU</given-names>
          </string-name>
          ,
          <source>International Conference on Network and System Security</source>
          ,
          <year>2017</year>
          ,
          <fpage>273</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Matsuda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moriai</surname>
          </string-name>
          ,
          <article-title>Lightweight Cryptography for the Cloud: Exploit the Power of Bitslice Implementation</article-title>
          ,
          <source>International Workshop on Cryptographic Hardware and Embedded Systems</source>
          ,
          <year>2012</year>
          ,
          <fpage>408</fpage>
          -
          <lpage>425</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sokolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Skladannyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hulak</surname>
          </string-name>
          ,
          <article-title>Stability Verification of Self-Organized Wireless Networks with Block Encryption</article-title>
          ,
          <source>in: 5th International Workshop on Computer Modeling and Intelligent Systems</source>
          , vol.
          <volume>3137</volume>
          (
          <year>2022</year>
          )
          <fpage>227</fpage>
          -
          <lpage>237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kwan</surname>
          </string-name>
          ,
          <article-title>Reducing the Gate Count of Bitslice DES</article-title>
          ,
          <source>IACR Cryptology ePrint Archive</source>
          ,
          <year>2000</year>
          (
          <volume>51</volume>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Stoffelen</surname>
          </string-name>
          ,
          <article-title>Optimizing S-Box Implementations for Several Criteria Using SAT Solvers</article-title>
          ,
          <source>Proc. 23rd International Conf. on Fast Software Encryption</source>
          ,
          <year>2016</year>
          ,
          <fpage>140</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Courtois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mourouzis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hulme</surname>
          </string-name>
          ,
          <article-title>Exact Logic Minimization and Multiplicative Complexity of Concrete Algebraic</article-title>
          and Cryptographic Circuits,
          <source>International Journal On Advances in Intelligent Systems</source>
          .
          <volume>6</volume>
          (
          <issue>3</issue>
          )(4) (
          <year>2013</year>
          )
          <fpage>165</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jean</surname>
          </string-name>
          , et. al.,
          <source>Optimizing Implementations of Lightweight Building Blocks, IACR Transactions on Symmetric Cryptology</source>
          ,
          <volume>4</volume>
          (
          <year>2017</year>
          )
          <fpage>130</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bao</surname>
          </string-name>
          , et. al.,
          <article-title>nPeigen-a Platform for Evaluation, Implementation, and Generation of S-boxes</article-title>
          ,
          <source>IACR Transactions on Symmetric Cryptology</source>
          ,
          <year>2019</year>
          ,
          <fpage>330</fpage>
          -
          <lpage>394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Mercadier</surname>
          </string-name>
          , Usuba, Optimizing Bitslicing Compiler,
          <source>PhD Thesis</source>
          , Sorbonne University, France,
          <year>2020</year>
          ,
          <volume>195</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dansarie</surname>
          </string-name>
          ,
          <article-title>Sboxgates: A Program for Finding Low Gate Count Implementations of S-boxes”</article-title>
          ,
          <source>Journal of Open Source Software</source>
          ,
          <volume>6</volume>
          (
          <issue>62</issue>
          )
          <year>2021</year>
          1-
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>I.</given-names>
            <surname>Opirskyy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sovyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Mykhailova</surname>
          </string-name>
          ,
          <source>Heuristic Method of Finding BitslicedDescription of Derivative Cryptographic Sbox</source>
          ,
          <source>2022 IEEE 16th International Conference on Advanced Trends in Radioelectronics</source>
          , Telecommunications and Computer Engineering (TCSET),
          <year>2022</year>
          ,
          <fpage>104</fpage>
          -
          <lpage>109</lpage>
          , doi:10.1109/TCSET55632.
          <year>2022</year>
          .9766883
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [19]
          <string-name>
            <surname>AQ. Sahun</surname>
          </string-name>
          , et. al.,
          <article-title>Devising a Method for Improving Crypto Resistance of the Symmetric Block Cryptosystem RC5 Using Nonlinear Shift Functions</article-title>
          .
          <source>EasternEuropean J. of Enterprise Technologies</source>
          ,
          <volume>5</volume>
          (
          <issue>9</issue>
          ) (113) (
          <year>2021</year>
          )
          <fpage>17</fpage>
          -
          <lpage>29</lpage>
          . doi:
          <volume>10</volume>
          .15587/
          <fpage>1729</fpage>
          -
          <lpage>4061</lpage>
          .
          <year>2021</year>
          .240344
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>