1. Introduction

Minimization of Bitsliced Representation of 4×4 S-Boxes based on Ternary Logic Instruction

Yaroslav Sovyn

yaroslav.r.sovyn@lpnu.ua 1

Volodymyr Khoma

volodymyr.v.khoma@lpnu.ua 1

Ivan Opirskyy

ivan.r.opirskyi@lpnu.ua 1

Valerii Kozachok

v.kozachok@kubg.edu.ua 0 0 Borys Grinchenko Kyiv University , 18/2, Bulvarno-Kudryavska str., Kyiv, 04053 , Ukraine 1 Lviv Polytechnic National University , 12 Stepan Bandera str., Lviv, 79000 , Ukraine

12 24

The article is devoted to methods and tools for generating software-oriented bit-sliced descriptions of bijective 4×4 S-Boxes with a reduced number of instructions based on a ternary logical instruction. Bitsliced descriptions generated by the proposed method make it possible to improve the performance and security of software implementations of crypto-algorithms using 4×4 S-Boxes on various processor architectures. The paper develops a heuristic minimization method that uses a ternary logical instruction, which is available in 86-64 processors with AVX-512 support and some GPU processors. Thanks to the combination of various heuristic techniques (preliminary calculations, exhaustive search to a certain depth, refinement search) in the method, it was possible to reduce the number of gates in bit-sliced descriptions of S-Boxes compared to other known methods. The corresponding software in the form of a utility in the Python language was developed and its operation was tested on 225 SBoxes of various crypto-algorithms. It was established that the developed method generates a bit-sliced description with a smaller number of ternary instructions in 90.2% of cases, compared to the best-known method implemented in the sboxgates utility.

1 Bit-slicing ternary logic instruction 4×4 S-Box CPU logic minimization software implementation sboxgates speed

1. Introduction

Given the ever-increasing volumes and speeds of data processing, a very important requirement for Cryptographic Algorithms (CA) is to provide sufficiently high performance for a wide class of microprocessor architectures [ 1, 2 ]. The no less important aspect for software implementation of cryptographic algorithms is the increased resistance to side-channel attacks: for low-end CPUs (8/16/32-bit microcontrollers) these are primarily energy consumption analysis attacks, and for high-end CPUs (86, ARM Cortex-A) are primarily time and cache attacks. To ensure the high performance of crypto-algorithms, various approaches to their software implementation are used. This includes the creation of precomputed tables (Lookup Tables, LUT) for certain operations, the integration of hardware cryptoaccelerators into the processor (e.g. AES-NI in 86 processors), the use of SIMD technology for parallelization of the encryption process (e.g. SSE/AVX2/AVX-512 vector instructions in 86– 64 CPU), the use of GPU computing power, etc. However, all these approaches have several limitations and cannot always be implemented in a specific processor.

Bit-slicing [ 3 ] is one of the promising approaches that provide a high-performance constant-time implementation of a CA with immunity to time and cache attacks [4]. It makes the most of the capabilities of modern high-end microprocessors to increase performance due to the parallelization of both code execution and data processing and also allows adaptation for low-end CPU and hardware implementation on FPGA and ASIC. For many CAs, it is the bit-sliced approach that provides the highest speed in software implementation (if hardware crypto accelerators are not used) for various types of processor architectures [ 3–9 ]. The absence of references to precomputed tables in memory and cache, as well as data-dependent conditional transitions, makes bit-sliced implementations invulnerable to time and cache attacks, at the same time complicating attacks through third-party channels.

The basic idea of bit-slicing is to convert a cryptographic algorithm into a sequence of bit logical operations of the type AND, XOR, OR, NOT. In processors, each such logical operation can be represented by a corresponding instruction, and in hardware—by a corresponding gate (we will use the concepts of gate and instruction as synonyms in this work). The high speed of software bit-slicing is achieved because the CPU processes many cipher elements (bytes, blocks) in parallel, using fast logical instructions and easier execution of some operations (for example, bit permutations, shifts, etc.). For software-oriented bit-sliced implementations, in addition to classic ones, it is possible to use more complex logical instructions supported by a certain processor, and thus reduce their total number. So, for example, many processors support the AND-NOT instruction (86–64, ARM), some NOR and NAND (ARM), etc. [ 10 ].

In order to get the maximum speed, you need to minimize the number of logical operations included in the bit-sliced description of the crypto algorithm. Most cryptographic operations generate a single-valued description when going to a bit-sliced description or do not give much room for minimization except for nonlinear transformations. In CA, nonlinear replacement operations are specified in the form of n×m LUTtables, so-called S-Boxes, which are mostly 4×4 (n = 4) or 8×8 (n = 8) bits in size. Tables of 4×4 bits are typical for both lightweight crypto algorithms specially designed for efficient implementation on resource-constrained processors (e.g. block ciphers PRINCE, LED, Piccolo, hash functions PHOTON, Spongent) and general purpose crypto algorithms (e.g. block symmetric ciphers Serpent, Twofish, Magma, hash functions BLAKE, Whirlpool).

Thus, the main resource for increasing performance in the bit-sliced implementation of the CA is the representation of S-Boxes with the minimum possible number of logic gates/instructions. This problem is NP-complete and allows an exact solution only for very simple cases (n ≤ 3 and some n = 4), so most modern methods and utilities for generating bit-sliced S-Boxes use heuristic approaches that do not guarantee that the obtained solution is optimal but provide a much better result compared to universal methods of minimizing logical functions (for example, the method of Carnot maps as well as the method of simple QuineMcCluskey implicants).

In addition to traditional dual-operand logic instructions, some processors support the ternary logic instruction ternary logic (a, b, c, imm8), which allows calculating an arbitrary Boolean function from the three operands a, b, c specified by the truth table in the 8-bit variable imm8 (Fig. 1).

The Ternary Instruction (TI) is present in the following processors: • 86–64 with support for 512-bit SIMD instructions from the AVX-512F extension. Since the vpternlogq zmm_a, zmm_b, zmm_c, imm8 instruction is included in the AVX-512F (Foundation) basic extension, it is supported by all processors with AVX-512 technology. • some GPU processors. For example, on an Nvidia GPU, this instruction has the form lop3.b32 d, a, b, c immLut, which calculates over the 32-bit operands lop3.b32 d, a, b, c immLut, the logical function given by the truth table in immLut and stores the result in d.

One ternary instruction can replace several dual-operand logical instructions, so its use allows generating a bit-sliced description with a significantly smaller number of operations, and therefore increasing the speed of program implementation. However, in this case, even for 4-bit S-Boxes, finding a guaranteed optimal representation is mostly impossible and it is necessary to use heuristic minimization methods. However, the existing heuristic minimization methods are mostly focused on the use of dualoperand instructions, so they cannot be directly used to generate a bit-sliced description based on TI.

Therefore, the problem of finding the optimal bit-sliced representation based on the ternary instruction even for 4×4 S-Boxes is far from being solved, which requires the search for new heuristic approaches, one of which is presented in our work.

2. Review of Literary Sources

Let us analyze the methods and means for finding the bit-sliced description of 4×4 S-Boxes according to the Bitslice Gate Complexity (BGC) criterion, which denotes the optimal solution with the minimum number of operations.

The bit-sliced approach to cryptographic representation was first proposed by E. Biham in [ 1 ] to speed up the software implementation of the DES cipher. In the same work, E. Biham described the algorithm of bit-sliced representation of DES S-Boxes (6×4) by logic gates XOR, AND, OR, NOT. In the algorithm, from six input variables, two input variables have been selected that form all possible combinations with the help of given logic gates, from which four output variables are then constructed. On average, with a bit-sliced description using this method, one DES S-Box requires 100 gates.

In [8], M. Kwan proposed a much more efficient approach to finding a bit-sliced representation using DES S-Boxes as an example. It treats each S-Box output bit as a function of the six input bits, represented by a Carnot map, and placed in a 64-bit variable. All input and intermediate variables can also be considered as 6-bit Carnot maps described by 64-bit numbers. Then the task is formulated as follows: it is necessary to combine the existing input and intermediate maps in such a way as to obtain the desired output variable. One input variable acts as a selector combining the functions of five variables. To find the representation of functions of five variables with the minimum number of gates, an exhaustive search (brute force) is used, and the gates are found in the previous steps. Depending on the order in which the search is carried out, 6! options are available for input variables, and 4! Options—for output variables. This gives a total of 17,280 search options, among which the option with the minimum number of gates is selected. As a result, the average number of gates for a bit-sliced description of one DES SBox decreased from 100 to 56.

To minimize S-Boxes, the SAT-Solvers programs can be used, designed to effectively solve the feasibility problem of Boolean formulas (SATfeasibility problem, SAT). The object of the SAT problem is a Boolean formula consisting only of constants (0/1), variables, and AND, OR, and NOT operations. The problem is as follows: can all variables be assigned the values False and True so that the formula becomes true? Specialized SAT-Solvers programs, built on efficient solution algorithms, accept a set of equations as input and produce the result in the form of SAT if a solution is found and UNSAT if no solution is found. To find a logic circuit with a given number of gates, you can form an equation where the variables would specify all possible connections between gates and the operation and try to solve them with the help of SAT-Solvers. The advantage of this approach is that if a solution with n gates (SAT) is found and UNSAT is obtained for n—1 gates, then we are guaranteed to have found the minimum possible bit-sliced description.

SAT-Solvers were used in the works [ 9, 10 ] to find the bit-sliced representation of some 4-bit S-Boxes. In general, the problem with SATSolvers is that they do not always find solutions for “heavy” S-Boxes that require more than 12–13 gates. For relatively simple S-Boxes with 11–13 gates, SAT-Solvers cannot always prove that the found representation is minimal.

The work [ 11 ] describes the open-source utility LIGHTER, which is currently the most effective utility for finding the bit-sliced description of 4×4-bit S-Boxes. LIGHTER can flexibly specify a set of two and three-inlet gates and their weighting factors, which are taken into account during minimization. This allows more realistic optimization in the case of hardware implementation, when different logic gates differ in crystal area, power consumption, delay, etc., due to the consideration of these parameters in the weighting factors. For a software implementation, when logical instructions are equivalent, it is enough to set the same weighting coefficients for all gates.

The LIGHTER search algorithm itself combines two approaches: searching using the Breath-First-Search (BFS) algorithm and the Meet-In-The-Middle (MITM) strategy. That is, two graphs are built: one starts from the base vectors and performs a forward search, and the other starts from the searched vectors and performs a backward search. Both graphs move toward each other using the given logical operations until they meet. Next, a path is selected that connects these two graphs with the minimum weight that takes into account the weighting factors for each gate. The utility demonstrates high time efficiency compared to SAT methods, and its results, although they cannot be considered optimal, are quite close to the results obtained by SAT utilities.

In [ 12 ], the open-source utility Peigen (Platform for Evaluation, Implementation, and Generation of S-boxes) is described, which makes it possible to find a bit-sliced description of S-Boxes in various logical bases, applying the specified minimization criteria for hardware and software implementations. Peigen’s bit-sliced description search algorithms are based on algorithms from the LIGHTER utility, but their time efficiency has been improved, in particular, recalculation and several additional techniques have been used. However, even with the improvements made, the utility only works effectively with 4-bit S-Boxes.

Generating an optimized bit-sliced implementation of the CA requires considerable time spent on writing and debugging the code and requires a good knowledge of processor architecture, low-level tools, and optimization techniques at the hardware and software levels. Therefore, in [ 13 ], the high-level Usuba language is presented, which makes it possible to describe a symmetric cryptographic primitive, and the Usuba compilator itself will generate a highly optimized, parallelized, and vectorized bit-sliced code. However, to generate the bit-sliced description of the S-Box, either a simple minimization algorithm is used, which gives a far from the optimal result, or a ready-made optimized description is taken from the database included in Usuba if the S-Box is present in it. Thus, description generation for S-Box is a weak point of the Usuba bit-sliced compiler.

The considered methods and utilities form a bit-sliced description using mainly two-input logic elements and do not support a ternary logic instruction. The only utility known to us today for generating bit-sliced descriptions of S-Boxes based on the ternary instruction is sboxgates [ 14 ]. This open-source utility implements the M. Kwan algorithm with some improvements and is able to generate a bit-sliced description for arbitrary S-Boxes up to and including 8×8. Many optimizations of M. Kwan’s algorithm in sboxgates are borrowed from the SBOXDiscovery project, which was intended exclusively for generating bit-sliced descriptions of DES S-Boxes. The utility allows you to specify an arbitrary set of two-input gates, use a ternary logic instruction, specify the number of iterations of the search algorithm, parallelize the search between processor cores, etc. [ 15,16 ]. In the case of 4×4 S-Boxes, sboxgates produce results that, as the article will show, can be greatly improved, which is a price for versatility.

3. Setting Objectives

The purpose of our article is to present a method and a utility for generating a bit-sliced description of bijective 4×4 S-Boxes based on a ternary logical instruction, which provides better results compared to existing ones, and this will make it possible to increase the speed and security of hardware and software implementations of a wide range of cryptographic algorithms, which use S-Boxes of a given type.

S-Boxes representation format for bit-sliced implementation

In the specifications of cryptographic algorithms, S-Boxes are mostly specified in the form of LUT tables. For example, the 4×4 S-Box of the PRESENT cipher has the form shown in Table 1 x S(x)

In the bit-sliced representation, LUT tables are treated as logical functions defined by truth tables. For example, the S-Box of the PRESENT cipher will have the form shown in Table 2.

So, a compact representation of this S-Box in the form of a truth table will have the following form: S(x) = y, where x = {x0, x1, x2, x3} = {0xff00, 0xf0f0, 0xcccc, 0xaaaa}— input bitsliced variables, y = {y0, y1, y2, y3} = {0x0ed9, 0x3687, 0xa74c, 0x659a}—the output bit-sliced variables defining a specific substitution table, and we will call the 16-bit numbers that specify x and y as vectors. The task of finding a bit-sliced S-Box representation according to the BGC criterion can be formulated as follows: given four base vectors base = {x0, x1, x2, x3}, you need to find the vectors y = {y0, y1, y2, y3} using the minimum number of ternary logical instructions ternary logic (a, b, c, imm8).

4. Preliminary Calculations

At the pre-computation stage, certain data are found and stored once, which are then repeatedly used in our bit-sliced description search algorithm. This data is of two types:

1. For each 16-bit vector v there is a BGC(v)—the minimum number of ternary instructions required for its representation, the socalled “complexity” of the vector.

Since vectors are represented as 16-bit numbers, there are 65536 vectors in total, four of which are based vectors base = {x0-x3} and two are logical constants const = {0x0000, 0xffff} to denote 0 and 1 for which BGC is 0, so there remain 65530 vectors whose complexity needs to be estimated. Table 3 presents the found distribution of vectors according to their BGC value.

As can be seen from Table 3, a maximum complexity is 3, meaning that any 16-bit vector can be represented using no more than 3 ternary instructions. This gives an upper estimate of the bit-sliced complexity of an arbitrary S-Box described by four vectors y0-y3 equal to 12th TI.

For an indirect preliminary assessment of the “complexity” of the S-Box, such an indicator as the total value of the complexity of the vectors y0-y3 can be used: the closer the total value is to 12, the more TI should be expected in the bitsliced description and vice versa. For example, SBox UDCIKMP11 (bit-sliced description requires 4 instructions) has the minimum total value of 6 of all S-Boxes considered in the article, and S-Box mCrypton_S0 has the maximum total value of 12 (bit-sliced description requires 8 instructions).

2. Construction of a LUT table for representing all graphs to a depth of ge = 2 instructions.

The table is built step by step. In the first step, the table q0 is built, which contains all possible values that can be obtained from the base vectors х0-х3 with the help of one ternary instruction and which are not included in base and const. For this, the VECT_SQUARE algorithm is used, which forms all possible combinations from the input vectors using TI.

There are a total of 936 such vectors. The generated values are entered into the table, and the х0-х3 values are not stored in the table to save memory, although they are implicitly present for each row. Therefore, the table q0 = VECT_SQUARE ({x0, x1, x2, x3}) has a dimension of 936×1 (Fig. 2).

Each ith line of the table q0 can be considered as a new basis {x0-x3, q0[i]}, which is used to generate all possible vectors in the next step, and the LUT lines of the tables themselves will be called graphs.

So, in particular, in the second step, the table q1 = GEN_TABLE(q0) is generated, for which the VECT_SQUARE algorithm again generates from each row of the table q0 all possible values that can be obtained from the base vectors х0-х3 and q0[i] using one ternary instruction, lines are formed from two vectors and added to the general table. After that, logically equivalent graphs are filtered: if the rows of table q1 contain the same values in any order, then only one row remains (Fig. 3). minimum TI and returns the constructed complement matrices of graphs gr0-gr3. From the obtained results, the graph with the minimum BGC value is selected (Fig. 4).

Further construction of the tables is impractical, as they require too much memory due to the large value of the branching coefficient. Table q1 will be used to form all possible unique graphs for the representation of vectors у.

5. Search Algorithm Representation of Bitsliced

At the top level of the search algorithm, all values у0-у3 are sorted, the matrix of candidate graphs gri = STEP_0(yi) is generated for each of them from the precomputed LUT-table q1 and transferred to the depth-first search algorithm FIND_BS(gri).

The depth-first search algorithm FIND_BS finds the remaining values у trying to use the

STEP_0 STEP_0 STEP_0 STEP_0 gr0 gr1 gr2 gr3

FIND_BS

So, the search algorithm performs four iterations, starting with different values of у. Let us denote this initial value by уstart. At the stage gri = STEP_0(уstart), using the LUT table q1, the matrix of graphs gri, is generated, containing all possible graphs with the vector уstart at a certain depth dstart of the gates. Depending on which BGC group the уstart vector belongs to, heuristically selected dstart values are presented in Table 5 to ensure acceptable calculation time and amount of required memory.

So, if, for example, BGC(y0) = 2, then the matrix of graphs gr0 after STEP_0 will contain all possible graphs with a length of 3 instructions (dstart = 3) in which the vector у0 occurs.

Next, the candidate graphs in gri are sorted into three groups: gr_1y, gr_2y, gr_3 with the same number of vectors y in each graph of the group— 1, 2, and 3, respectively. We denote this number by y _find. Next, the search is conducted for each non-empty group separately according to Fig. 5.

The FIND_NEXT algorithm makes alternate searches уі until all four values у0-у3 are found. The matrix of graphs gr in the form of an n×m table, each row of which contains y_find values from the set {у0-у3}, is given to the input. Each row of the table stores m vectors explicitly and vectors х0-х3 implicitly.

First, for group gr, the minimum distance dmin is estimated, at which the closest value of уx is located among all graphs—ESTIMATE_DEPTH. For this purpose, the FAST_FIND function of comprehensive forward search to a given depth of 1/2/3 steps has been developed. The search and selection of options are carried out using the algorithm of depth-first search with iterative deepening—IDDFS (Iterative Deepening DepthFirst Search).

After the dmin estimate is found using the GEN_DEPTH algorithm, a transition is made from the set of graphs with y_find = n_y to the set of graphs with y_find = n_y + 1.

For this, the graphs with the found value dmin are selected from the group gr and a step forward gr = GEN_TABLE(grmin) is made. For the generated set gr, graphs with the found value d = dmin—1 are selected again, a step forward is made for them, and so on, until d becomes equal to 0. After that, only graphs containing n_y + 1 values of y are selected for the group. Then these steps are repeated until all values of y are found.

The most computationally intensive procedures ESTIMATE_DEPTH and GEN_DEPTH are implemented by using GPU and OpenCL technology, which makes it possible to significantly parallelize calculations and reduce the algorithm’s operating time.

At each step, the FIND_BS algorithm evaluates the minimum distance dmin, at which the nearest value of уx is located, and generates the corresponding graphs. As shown in Fig. 6, this route starts with the graphs containing ya, generated using STEP_0, from which the nearest value yb is located at the distance dab of the gates, then we go to yс located at the minimum distance dbc from yb and at the distance dcd we find the last vector yd.

However, not always moving in minimum steps along the trajectory from vector ya to yd gives the optimal result in general (although it is so in most cases). There may be a situation when choosing the minimum value of d in the first steps leads to larger values of d in the following steps and, as a result, to a non-optimal logical representation.

For example, let’s assume that at the first step, we got dab = 1, at the second dbc = 2, and at the third—dcd = 2, that is, the route will be a total of 5 TIs (Fig. 6). But it is possible that if at the first step, we followed a different route and graphs with dab = 2 would be selected, then in the second step it would be possible to find the value of ус with dbc = 1 and in the third step yd with dcd = 1, and we would get a shorter total route with 4 TIs. Hence, the second route resulted in a bit-sliced representation with a lower BGC value.

To take into account different possible routes in the search algorithm, refinement searches are carried out according to the scheme presented in Fig. 7. If we have a set of graphs containing 3 out of 4 possible values of у, then the search for the fourth value is always carried out at the minimum possible depth dmin (SEARCH_3Y). For graphs with two values in у (y_find = 2), the third value is searched for by two routes: dmin і dmin+1, after which the SEARCH_3Y search is applied to the found graphs with y_find = 3. For graphs with one value in у (y_find = 1) , the search for the second value takes place along three routes: dmin, dmin+1 і dmin+2 , after which the SEARCH_2Y search is applied to the found graphs with y_find = 2. SEARCH_3Y gr_3y y_find = 3 dmin yd (ya,yb,yc)

SEARCH_2Y gr_2y dmin yc y_(fyina,dyb=) 2 dmin+1 yc -SEARCH_3Y SEARCH_3Y

SEARCH_1Y dmin yb y_gfirn_d1y= 1 dmin+1 yb (ya) dmin+2 yb

SEARCH_2Y SEARCH_2Y SEARCH_2Y

6. Results

The method proposed in the work was implemented in the Python language, and to ensure speed, the main data processing functions are implemented based on the numpy and pyopencl libraries.

To evaluate our algorithm, 225 4×4 S-Boxes of various cryptographic algorithms were taken. The open-source sboxgates project was used to obtain a BGC estimate for selected S-Boxes and to be able to compare with our results. Bitsliced descriptions of S-Boxes obtained by our method are available at the link [ 15 ].

The results are presented in Table 6. Column data in Table 6 should be interpreted as follows:

LUT is a representation of S-Box in the tabular form, where the line ‘0123456789abcdef’ should Таble 6 BGC comparison for different S-Boxes be understood as S(x) = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15.

BS—representation of S-Box in bit-slicedformat. The line ‘0ed9_3687_a74c_659a’ should be understood as: y0 = 0x0ed9, y1 = 0x3687, y2 = 0xa74c, y3 = 0x659a.

CY is BGC of vectors у0-y3. The line ‘2133’ should be interpreted as follows: BGC(y0) = 2, BGC(y1) = 1, BGC(y2) = 3, BGC(y3) = 3.

OURS contains BGC values obtained using the method described in the article.

SG contains the BGC value obtained using the sboxgates utility; the number of iterations for the search was set to 1000 [ 9 ]. The red color in the SG column indicates the S-Boxes that have a higher BGC value compared to the one obtained by our algorithm. The yellow ones have the same BGC value as our algorithm.

S-Box Piccolo Piccolo-1 LAC Prost Rectangle Rectangle-1 Minalpher Skinny TWINE PRINCE Lucifer_S0 Lucifer_S1 Present Present-1 JH_S0 JH_S1 Iceberg_S0 Iceberg_S1

In general, as evidenced by the results in Table 7, the method we developed showed significantly better results compared to the competitor represented by the sboxgates utility. For 203 S-Boxes out of 225 (90.2%), our method provides a bit-sliced description with fewer TIs. The total number of ternary instructions to represent all 225 S-Boxes in our method is 1582, which is 15.3% less compared to 1867 instructions for the sboxgates utility. The sboxgates utility did not generate a bit-sliced description with fewer instructions than obtained by our method for any S-Box, and for only 22 S-Boxes (9.8%) it was able to generate a bit-sliced description with the same BGC value as our algorithm.

7. Conclusions

The paper presents a method for generating a bit-sliced description of arbitrary 4×4 bijective S-Boxes with a reduced number of ternary logic instructions. The obtained descriptions make it possible to generally increase the speed of software implementations of the corresponding crypto-algorithms on any processors that support the 3-operand ternary logic instruction (CPU/GPU). To date, the method proposed in the article is the most effective method known to us according to the BGC criterion, which confirms the research results presented in the work. The method combines heuristic techniques at various stages of searching a bit-sliced representation, in particular: recalculation, exhaustive search to a depth of up to 3 gates using GPU, IDDFS algorithm for searching and cutting options, refinement search, which by this set of measures ensure its efficiency and acceptable speed of action.

8. References

[1]

Bessalov , et al., Computing of Odd Degree Isogenies on Supersingular Twisted Edwards Curves, in: Workshop on Cybersecurity Providing in Information and Telecommunication Systems , vol. 2923 ( 2021 ) 1 - 11 .

[2]

Bessalov ,

Sokolov ,

Skladannyi , Modeling of 3- and 5-Isogenies of Supersingular Edwards Curves , in: 2nd International Workshop on Modern Machine Learning Technologies and Data Science, no. I , vol. 2631 ( 2020 ) 30 - 39 .

[3]

Biham , A Fast New DES Implementation in Software, International Workshop on Fast Software Encryption , 1997 , 260 - 272 . E. Kasper,

Schwabe , Faster and TimingAttack Resistant AES-GCM , in Proc. 11th Int. Workshop Cryptographic Hardware and Embedded Systems ( 2009 ) 1 - 17 . A. Adomnicai , T. Peyrin, Fixslicing AESLike Ciphers: New bitsliced AES Speed Records on ARM-Cortex M and RISC-V, IACR Transactions on Cryptographic Hardware and Embedded Systems , 1 ( 2021 ) 402 - 425 . P. Schwabe K Stoffelen, All the AES You Need on Cortex-M3 and M4 , International Conference on Selected Areas in Cryptography, 2016 , 180 - 194 . J. Zhang, M. Ma, P. Wang, Fast Implementation for SM4 Cipher Algorithm Based On Bit-Slice Technology , International Conference on Smart Computing and Communication , 2018 , 104 - 113 . N. Nishikawa , H.

Amano , K.

Iwai , Implementation of Bitsliced AES Encryption on CUDA-enabled

GPU

, International Conference on Network and System Security , 2017 , 273 - 287 .

[9]

Matsuda ,

Moriai , Lightweight Cryptography for the Cloud: Exploit the Power of Bitslice Implementation , International Workshop on Cryptographic Hardware and Embedded Systems , 2012 , 408 - 425 .

[10]

Sokolov ,

Skladannyi ,

Hulak , Stability Verification of Self-Organized Wireless Networks with Block Encryption , in: 5th International Workshop on Computer Modeling and Intelligent Systems , vol. 3137 ( 2022 ) 227 - 237 .

[11]

Kwan , Reducing the Gate Count of Bitslice DES , IACR Cryptology ePrint Archive , 2000 ( 51 ).

[12]

Stoffelen , Optimizing S-Box Implementations for Several Criteria Using SAT Solvers , Proc. 23rd International Conf. on Fast Software Encryption , 2016 , 140 - 160 .

[13]

Courtois ,

Mourouzis ,

Hulme , Exact Logic Minimization and Multiplicative Complexity of Concrete Algebraic and Cryptographic Circuits, International Journal On Advances in Intelligent Systems . 6 ( 3 )(4) ( 2013 ) 165 - 176 .

[14]

Jean , et. al., Optimizing Implementations of Lightweight Building Blocks, IACR Transactions on Symmetric Cryptology , 4 ( 2017 ) 130 - 168 .

[15]

Bao , et. al., nPeigen-a Platform for Evaluation, Implementation, and Generation of S-boxes , IACR Transactions on Symmetric Cryptology , 2019 , 330 - 394 .

[16]

Mercadier , Usuba, Optimizing Bitslicing Compiler, PhD Thesis , Sorbonne University, France, 2020 , 195 .

[17]

Dansarie , Sboxgates: A Program for Finding Low Gate Count Implementations of S-boxes” , Journal of Open Source Software , 6 ( 62 ) 2021 1- 3 .

[18]

Opirskyy ,

Sovyn ,

Mykhailova , Heuristic Method of Finding BitslicedDescription of Derivative Cryptographic Sbox , 2022 IEEE 16th International Conference on Advanced Trends in Radioelectronics , Telecommunications and Computer Engineering (TCSET), 2022 , 104 - 109 , doi:10.1109/TCSET55632. 2022 .9766883

[19] AQ. Sahun , et. al., Devising a Method for Improving Crypto Resistance of the Symmetric Block Cryptosystem RC5 Using Nonlinear Shift Functions . EasternEuropean J. of Enterprise Technologies , 5 ( 9 ) (113) ( 2021 ) 17 - 29 . doi: 10 .15587/ 1729 - 4061 . 2021 .240344