A Hardware Implementation for Code-based Post-quantum Asymmetric Cryptography

Kristjane Koleci (0), kristjane.koleci@polito.it

Marco Baldi (1), m.baldi@univpm.it

Maurizio Martina (0), maurizio.martina@polito.it

Guido Masera (0), guido.masera@polito.it

(0) Politecnico di Torino, Italy
(1) Università Politecnica delle Marche, Italy

This paper presents a dedicated hardware implementation of the LEDAcrypt cryptosystem, which uses Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) codes and a decoding algorithm known as Q-decoder for the decryption function. The designed architecture is synthesized for both FPGA and ASIC technologies and is intrinsically scalable over a wide range of parallelism degrees, which makes it possible to target multiple application scenarios with different trade-offs between decryption latency and implementation complexity. The proposed system achieves a large speed-up over both software execution and a previous hardware implementation, with a decryption latency as low as 3.16 ms for the FPGA version, and 1.2 ms when synthesized for a 65 nm CMOS technology.

1 Introduction

The efficiency of post-quantum cryptographic algorithms when implemented in hardware is considered among the requirements of candidates to the NIST post-quantum cryptography standardization process [22]. This motivates research in this area, and in the design of efficient hardware solutions for the implementation of these new cryptographic primitives.

1.1 Related work

Several works have already appeared in the literature concerning the hardware implementation of post-quantum cryptographic primitives. Among them, isogeny-based cryptographic primitives have been considered in [17, 16] and lattice-based primitives have been considered in [13, 12, 8], while primitives based on the lattice-based problem variant known as ring-learning with errors (Ring-LWE) have been considered in [1].

Concerning the implementation of code-based post-quantum primitives, the classic McEliece scheme based on binary Goppa codes has been considered in [24], while variants based on Quasi-Cyclic Moderate-Density Parity-Check (QC-MDPC) codes have been considered in [18, 15]. The LEDAcrypt post-quantum cryptographic primitives based on QC-LDPC codes have also been recently considered for hardware implementation in [14].

Concerning post-quantum digital signatures, a hardware-oriented analysis of NIST post-quantum cryptography standardization candidates is reported in [23].

1.2 Contribution

This work addresses the implementation of a hardware accelerator for the LEDAcrypt primitives. The proposed architecture is synthesizable for both ASIC and FPGA technologies. Moreover, it is scalable in terms of processing parallelism, thus achieving different trade-offs between performance and implementation complexity.

The paper is organized as follows: the algorithm is described in Section 2, then an overview of the architecture and additional details on one key processing unit are provided in Sections 3 and 4. Finally, the synthesis results are summarized in Section 5 and the conclusions are given in Section 6.

2 Algorithm

From the complexity standpoint, the crucial algorithm for the LEDAcrypt primitives is decoding: it estimates a sparse error vector e starting from a syndrome vector s, by exploiting the knowledge of two secret matrices, the secret QC-LDPC code parity-check matrix H and the secret transformation matrix Q. In LEDAcrypt, this is performed through an iterative algorithm derived from the classic bit-flipping decoding algorithm and known as Q-decoder.

The decoding algorithm starts from an initial syndrome s that is computed from the encrypted message m and the two secret matrices H and Q as follows:

Initial Syndrome: $s^{(0)T} = (HQ)\, m^{T}$ (1)

Then, at every iteration $l = 1, \dots, It_{max}$, the algorithm generates an updated syndrome $s^{(l)}$ and a refined estimate of the error vector $e^{(l)}$, by computing the following quantities:

Sigma: $\sigma^{(l)} = s^{(l-1)} H$ (2)

Correlation: $\rho^{(l)} = [\rho_1^{(l)}, \rho_2^{(l)}, \dots, \rho_n^{(l)}] = \sigma^{(l)} Q$ (3)

Thresholds: $b^{(l)} = \max_{j=1,\dots,n} \left( \rho_j^{(l)} \right)$ (4)

Positions: $P_l = \{ v \in [1, n] \mid \rho_v^{(l)} > b^{(l)} \}$ (5)

Errors: $e^{(l)} = e^{(l-1)} + \sum_{v \in P_l} q_v$ (6)

Syndrome: $s^{(l)} = s^{(0)} + e^{(l)} H^{T}$ (7)

where $q_v$ is the $v$th row of $Q^{T}$. The products in (1) and (7) are performed in GF(2), while the results of equations (2) and (3) are integers. The stop condition is reached when SW = 0 (SW being the syndrome weight, i.e. the sum of the entries of $s^{(l)}$) or $l = It_{max}$. When the decoding process terminates recovering the correct error vector, such a vector can be straightforwardly used to retrieve the cleartext message m.
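As an illustration of how equations (2), (3), (5), (6) and (7) chain together within one iteration, the following Python sketch runs a single pass on a toy example. It is only a sketch under stated assumptions: the matrices are tiny hypothetical dense lists (not LEDAcrypt parameters), positions are 0-based, and the threshold `b` is passed in by the caller rather than derived inside the function.

```python
def vec_mat(v, M):
    """Row vector times matrix, computed over the integers."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def transpose(M):
    return [list(col) for col in zip(*M)]


def q_decoder_iteration(s_prev, s0, e_prev, H, Q, b):
    """One pass through (2), (3), (5), (6), (7); b is supplied by the caller (assumption)."""
    sigma = vec_mat(s_prev, H)                           # (2) integer values
    rho = vec_mat(sigma, Q)                              # (3) integer correlations
    positions = [v for v, r in enumerate(rho) if r > b]  # (5) 0-based positions
    QT = transpose(Q)
    e = list(e_prev)
    for v in positions:                                  # (6) add rows of Q^T in GF(2)
        e = [a ^ q for a, q in zip(e, QT[v])]
    e_ht = vec_mat(e, transpose(H))                      # e * H^T over the integers
    s = [(a + c) % 2 for a, c in zip(s0, e_ht)]          # (7) reduced to GF(2)
    return s, e, rho


# Toy data (hypothetical, far smaller than any real parameter set).
H = [[1, 1, 0],
     [0, 1, 1]]
Q = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
s0 = [1, 1]
s1, e1, rho1 = q_decoder_iteration(s0, s0, [0, 0, 0], H, Q, b=1)
```

In this toy case the only correlation value above the threshold is the middle one, so a single position is flipped and the updated syndrome becomes all-zero, which is the stop condition.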

LEDAcrypt provides various parameters related to the involved matrices and vectors. Table 1 gives the main algorithm parameters for different levels of security (Category): n0 is the number of circulant blocks forming the parity-check matrix H, p is the size of the circulant blocks, dv and m are the numbers of asserted bits in each column of the matrices H and Q respectively, t is the number of intentional errors used for encryption and Itmax is the maximum number of iterations required to successfully recover the encrypted message.

Identifying the most time-consuming units in the decoder is important to achieve an efficient implementation. Therefore, we used a Matlab implementation with the relevant profiling capabilities to derive the processing time required for each main task. Table 2 provides the collected results for a code with n0 = 2, p = 27,779 and Itmax = 3. Two versions of the algorithm software implementation have been simulated: the first version is the original model [3], while the second version has been obtained by exploiting specific Matlab options to accelerate the execution. The Matlab code has been run on a laptop with 16 GB of RAM and an Intel Core i7-6700 HQ CPU with 2.60 GHz clock frequency.

It is clear from the profiling results that the most time-consuming functions in the Q-decoder are the Initial Syndrome calculation and the Syndrome update. Therefore, a parallel formulation of these tasks is desirable to map them onto a dedicated hardware architecture and to accelerate the whole algorithm. An additional advantage that is expected from the hardware implementation of the Q-decoder is the lower energy dissipation.

The complete architecture is divided into three main blocks (Figure 1): the Memory, the Control Unit (CU) and the Data Path (DP). The Memory unit contains several memory components that store the data structures used by the decoding process. The CU is implemented as a Finite State Machine (FSM) that drives the execution of the algorithm. Finally, the DP includes the resources necessary to process all the algorithm variables and it is structured into four main units: (i) the Syndrome Computation Unit (SCU), which includes the evaluation of the Initial Syndrome s(0) as in (1), (ii) the Correlation Computation Unit (CCU), which evaluates the Correlation ρ, namely (3), (iii) the Message Update Unit (MUU), which derives the errors e in the message and its correction (6), and (iv) the Syndrome Update Unit (SUU), which iteratively updates the syndrome s(l) based on the obtained errors (7).

[Figure 1: block diagram of the complete architecture. Memory: Message, Syndrome, Sigma, Correlation, Error, H, Q, L. Control Unit (FSM): Idle; Initial Syndrome; Syndrome, Correlation and Message Evaluation, repeated while SW = 0 is false and It < ItMax is true; outcome Message Corrected or Message Not Corrected. Data Path: Syndrome Computation Unit, Correlation Computation Unit, Message Update Unit, Syndrome Update Unit.]

The Control Unit FSM basically follows the sequence of processing tasks given in Section 2. The iterative nature of the algorithm and the data dependencies within a single iteration impose a sequential execution of the key steps. However, several operations inside the syndrome and correlation computations allow for a parallel execution. Moreover, the similarities among these operations suggest the reuse of some hardware resources in both SCU and CCU.

3.1 Memory Organization

The vector and matrix sizes are derived from Table 1. There are three kinds of vectors handled by the algorithm: vectors containing positions, binary vectors and integer vectors. Vectors containing positions in the range [0, p] require np = ⌈log2(p)⌉ bits. Binary vectors are stored in a matrix format, as Nb-bit words, where Nb is the chosen degree of processing parallelism. Integer vectors are stored in a matrix format, as nc-bit and ns-bit words for Correlation and Sigma values, respectively.
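To make the sizing rules concrete, the short Python sketch below evaluates the position width np = ⌈log2(p)⌉ and the number of Nb-bit words needed for one p-bit block, for the two block sizes used later in the paper. The helper names are illustrative, and rounding the word count up to the next integer is an assumption made here for block sizes that are not multiples of Nb.

```python
def position_bits(p):
    """Bits for a position: ceil(log2(p)), computed exactly with integers."""
    return (p - 1).bit_length()


def words_per_block(p, nb):
    """Number of nb-bit words holding one p-bit block (ceiling division)."""
    return -(-p // nb)


for p in (27779, 15013):
    widths = [words_per_block(p, nb) for nb in (8, 16, 32, 64)]
    print(p, position_bits(p), widths)
```

For p = 27,779 this gives 15-bit positions, and for p = 15,013 it gives 14-bit positions.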

Based on the dynamic range of each data item, the expected size for the required memories can be evaluated as in Table 3. The reported size values are related to the case of two circulant blocks (n0 = 2). Therefore, every array (except the syndrome s and the matrix Q) is represented as a two-block component. As an example, the message m is expressed as m = [m0 m1] and L = HQ = [L0 L1]^T. Similarly, the matrix Q is divided into four components (Q00 to Q11).

The SCU calculates

$s = [m_0 \;\; m_1] \begin{bmatrix} L_0 \\ L_1 \end{bmatrix} = m_0 L_0 \oplus m_1 L_1 = s_0 \oplus s_1$ (8)

where ⊕ indicates the bitwise xor operation, while the CCU calculates

$\sigma = s \, [H_0 \;\; H_1] = [\sigma_0 \;\; \sigma_1]$ (9)

$\rho = [\sigma_0 \;\; \sigma_1] \begin{bmatrix} Q_{00} & Q_{01} \\ Q_{10} & Q_{11} \end{bmatrix} = [\sigma_0 Q_{00} + \sigma_1 Q_{10} \;\;\; \sigma_0 Q_{01} + \sigma_1 Q_{11}] = [\rho_0 \;\; \rho_1]$ (10)

In both units, the key requirement is the calculation of the product between a vector and a sparse cyclic matrix, that is a VectorByCirculant operation. In the SCU a binary vector m is multiplied, while in the CCU the product is needed twice, involving a binary vector in (9) and an integer vector in (10). However, a unified architecture can be conceived to cover both types of product, as detailed in Section 4.

3.3 Message Update Unit

Updating the message requires finding the positions of the errors intentionally inserted at the encoding side. This estimation is based on Syndrome and Correlation. The circuit in Figure 2(a) shows the evaluation of the syndrome weight. Initially, the syndrome weight is set to 0, then the Syndrome memory is read row by row and the number of ones per row is accumulated in the Syndrome Weight register.
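A behavioural view of this accumulation can be sketched in Python as follows. Each memory row is modelled as an integer holding Nb bits; the function name and the sample rows are hypothetical.

```python
def syndrome_weight(syndrome_rows):
    """Accumulate the number of ones of each row, one row per read."""
    weight = 0                          # Syndrome Weight register starts at 0
    for row in syndrome_rows:           # one Syndrome memory read per step
        weight += bin(row).count("1")   # ones counter applied to the row register
    return weight


rows = [0b1010, 0b0000, 0b0111]         # hypothetical 4-bit syndrome rows
sw = syndrome_weight(rows)
```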

The circuit for the evaluation of the error is in Figure 2(b). A row is read from the Correlation memory and its elements are compared with the threshold b: if at least one value in the row is higher than b (Cmp_or = 1), the error position is saved into the Error memory, in terms of row address and location within the row (index). The process ends when the last row of the Correlation memory is checked. The threshold is derived from a Look-Up-Table that returns a value of b given the input range of Syndrome Weights (SWs) [14].
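The row scan can be modelled with a few lines of Python. This is a sketch only: the Correlation memory is a hypothetical list of rows, and the threshold `b` is passed as a plain argument instead of being read from the Look-Up-Table.

```python
def find_error_positions(correlation_rows, b):
    """Scan the Correlation memory and collect (row address, index) pairs above b."""
    errors = []
    for address, row in enumerate(correlation_rows):
        cmp = [value > b for value in row]   # one comparator per row element
        if any(cmp):                         # corresponds to Cmp_or = 1
            errors.extend((address, index) for index, hit in enumerate(cmp) if hit)
    return errors


corr = [[3, 9, 1, 0], [2, 2, 2, 2], [0, 11, 0, 12]]   # hypothetical values
found = find_error_positions(corr, b=8)
```

Rows with no element above the threshold, such as the second one here, contribute nothing to the Error memory.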

Finally, Figure 2(c) shows the update of the message vector. The circuit receives a message word (Nb elements) from the Message memory and the corresponding error positions from the Error memory. A set of Nb xor gates applies the correction and the updated message is stored back into the Message memory.
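The correction step amounts to flipping the selected bits of one message word. A minimal Python sketch, assuming the word is a list of Nb bits and the error positions are given as indexes within that word (both representations are illustrative choices):

```python
def update_message_word(word_bits, error_indexes):
    """Flip the bits of one message word at the decoded error indexes."""
    flip = [0] * len(word_bits)
    for index in error_indexes:           # decoded index selects one bit
        flip[index] ^= 1
    return [m ^ f for m, f in zip(word_bits, flip)]   # one xor per bit


corrected = update_message_word([1, 0, 1, 1], [1, 3])
```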

[Figure 2: (a) Syndrome Weight evaluation logic. (b) Error Position evaluation. (c) Message Update.]

3.4 Syndrome Update Unit

The syndrome update equations (6) and (7) are modified into a different form that simplifies the hardware implementation:

$e_{syn}^{(l)} = e^{(l)} L$

$s^{(l)} = s^{(l-1)} \oplus e_{syn}^{(l)}$

where L = HQ and $e_{syn}$ is the error location vector for the syndrome. The latter equation can be implemented by means of the same circuit given in Figure 2(c), while a sparse vector by circulant product is needed for the calculation of $e_{syn}^{(l)}$.

This operation is implemented as described in Algorithm 1.
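The position arithmetic behind this sparse product can be sketched in Python. The sketch assumes 0-based positions, a single circulant block, and the shift convention used for the Vector By Circulant product in Section 4 (element i of the result collects v at positions i + k modulo p); the function names are hypothetical. A dense reference is included only to check the sparse version.

```python
def sparse_by_circulant(err_positions, circ_positions, p):
    """Positions (with multiplicity) toggled by a sparse vector times a sparse circulant."""
    return [(e - k) % p for k in circ_positions for e in err_positions]


def update_syndrome(s_prev, toggles):
    """Apply the toggles with xor; a position listed twice cancels out in GF(2)."""
    s = list(s_prev)
    for pos in toggles:
        s[pos] ^= 1
    return s


def dense_reference(err_positions, circ_positions, p):
    """Reference: r[i] = sum over k of v[(i + k) mod p], reduced modulo 2."""
    v = [1 if i in err_positions else 0 for i in range(p)]
    return [sum(v[(i + k) % p] for k in circ_positions) % 2 for i in range(p)]


p = 15
toggles = sparse_by_circulant([3, 6], [2, 5], p)
s_new = update_syndrome([0] * p, toggles)
```

In this example position 1 is produced twice and therefore cancels, which is why the toggles are applied with xor rather than simply stored as a set of ones.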

Algorithm 1 SparseVectorByCirculant

    Input: el, L; Output: elSyn
    indexSyn ⇐ 0; indexPos ⇐ 0;
    while indexPos < dv do
        indexErr ⇐ 0;
        while indexErr < MaxPos do
            elSyn(indexSyn) = mod(el(indexErr) − L(indexPos), p);
            indexSyn = indexSyn + 1;
            indexErr = indexErr + 1;
        end while
        indexPos = indexPos + 1;
    end while

Algorithm 2 VectorByCirculant

    Input: v(1, n), Pos(1, d); Output: r(1, n)
    r ⇐ 0; indexPos = 1;
    while indexPos ≤ d do
        k = Pos(indexPos);
        vshifted = [v(k : n), v(1 : k − 1)];
        r ⇐ r + vshifted;
        indexPos = indexPos + 1;
    end while

4 Vector By Circulant product architecture

The Vector By Circulant unit, included in both the SCU and the CCU, evaluates the product r = Av of a cyclic and sparse binary matrix A by a vector v (integer or binary). In a cyclic matrix, all the rows are cyclic shifts of the first one. An example of the Vector By Circulant product is given in (11) and (12) for the case of size p = 15, with dv = 2 (a2 and a5 are the asserted values in the first row of A). The real values of p are given in Table 1. The strategy described is the same employed in the QcBits algorithm [10].

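$r_0 = a_0 v_0 + a_1 v_1 + \cdots + a_{14} v_{14} = v_2 + v_5$

$\vdots$

$r_{14} = a_1 v_0 + a_2 v_1 + \cdots + a_0 v_{14} = v_1 + v_4$ (12)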

The straightforward implementation of the product, i.e. the direct mapping of these equations into a hardware architecture, is not efficient because of the size of v and A.

However, it can be seen from ( 12 ) that the elements of r are obtained by combining shifted versions of the v vector. For example, in the given example for p = 15, r can be calculated by taking two circular shifts of v, respectively starting at positions 2 and 5 (the asserted values in the first row), and adding them element by element. In general, r can be calculated by adding (modulo-2 addition for the binary case) dv shifted versions of v (indicated as vshifted). Their initial positions are determined by the elements in the first row of A, stored in the vector Pos. The product calculation is detailed in Algorithm 2.
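The shift-and-add formulation can be checked with a short Python sketch. It uses 0-based start positions and plain lists; `circulant_reference` is a hypothetical helper that builds the explicit product, with row i of A taken as the first row cyclically shifted by i, and is there only for comparison.

```python
def vector_by_circulant(v, pos, binary=True):
    """Add the cyclic shifts of v that start at each (0-based) position in pos."""
    r = [0] * len(v)
    for k in pos:
        v_shifted = v[k:] + v[:k]                      # circular shift starting at k
        if binary:
            r = [a ^ b for a, b in zip(r, v_shifted)]  # modulo-2 addition
        else:
            r = [a + b for a, b in zip(r, v_shifted)]  # integer addition
    return r


def circulant_reference(v, pos):
    """Explicit product with A, whose row i is the first row shifted by i."""
    n = len(v)
    first_row = [1 if j in pos else 0 for j in range(n)]
    return [sum(first_row[(j - i) % n] * v[j] for j in range(n)) for i in range(n)]


v = list(range(100, 115))          # hypothetical integer vector, p = 15
r = vector_by_circulant(v, [2, 5], binary=False)
```

With start positions 2 and 5, the last element of `r` equals `v[1] + v[4]`, in agreement with the last row of (12).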

To efficiently implement Algorithm 2, we proceed in a partially parallel way and update Nb elements of r at a time. Let us assume that the A matrix is stored in sparse format, i.e. the positions of all non-zero elements in the first row are available in a linear array (Pos). Moreover, we assume that v is stored in the memory Mv, organized as ⌈p/Nb⌉ words of Nb elements. At every read from the memory, we obtain Nb elements of v in parallel. Let x be the initial position of a shifted version of v (that is, an element of the Pos vector): we find the first element of vshifted by calculating the address i = ⌊x/Nb⌋ and the index j = x mod Nb. If Nb is a power of two, j is simply given by the log2 Nb least significant bits of x and i by the remaining most significant bits. Therefore, the desired shifted version of v is obtained by selecting in the read word all elements with index ≥ j; the vector is then completed by means of additional reads from Mv, at addresses > i. Overall, up to ⌈p/Nb⌉ reads are needed to obtain the complete vshifted. For example, with p = 15, Nb = 4 and x = 3, the first vector element, v3, is read at the first cycle, together with elements v0 to v2, which are not used at this time. In the two following cycles, elements v4 to v7 and v8 to v11 are taken from Mv. Finally, elements v12, v13 and v14 are read at the fourth cycle. The complete architecture for the Vector By Circulant product is shown in Figure 3.
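The address and index computation, together with the word-by-word assembly of vshifted, can be modelled in Python as below. This is a behavioural sketch, not a description of the actual circuit: the memory Mv is a list of words, the last word may be shorter when p is not a multiple of Nb, and the function names are hypothetical.

```python
def split_position(x, nb):
    """Return (address i, index j) of element x; nb must be a power of two."""
    shift = nb.bit_length() - 1          # log2(nb)
    return x >> shift, x & (nb - 1)      # upper bits: address, lower bits: index


def read_shifted(v, nb, x):
    """Assemble v_shifted word by word; returns (v_shifted, number of reads)."""
    words = [v[a:a + nb] for a in range(0, len(v), nb)]   # model of Mv
    i, j = split_position(x, nb)
    first = words[i]                      # first read: word containing element x
    out, reads = first[j:], 1             # keep only elements with index >= j
    for step in range(1, len(words)):     # remaining words, wrapping around
        out += words[(i + step) % len(words)]
        reads += 1
    out += first[:j]                      # skipped head reused to close the tail
    return out, reads


v = list(range(15))                       # stands for v0 ... v14
shifted, reads = read_shifted(v, 4, 3)
```

For the example in the text (p = 15, Nb = 4, x = 3) the split gives address 0 and index 3, and the shifted vector is assembled with four reads.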

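[Figure 3: architecture of the Vector By Circulant unit: Pos memory with index and initial address, row counter and address generation, two row registers, collapse unit, Add unit (binary or integer), Result memory, and Control Unit providing counters and register enables.]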

In order to process the data read from the Mv memory, we need two Nb-element registers (Row_n and Row_n+1) that store two consecutive rows of Mv. At every cycle, a new row is loaded into the Row_n+1 register and, at the same time, the previous row is moved to the Row_n register. Moreover, a collapse unit merges the elements in the two registers into a sequence of Nb ordered elements starting from the position pointed to by index. At the first cycle, the circuit skips the elements with index lower than the initial shift position and reuses them at the last cycle, to complete the tail of the sequence.

The collapse unit provides Nb consecutive elements to an accumulator that calculates the final product as the sum of several shifted versions of v. The temporary accumulated values are stored in the result memory, Mr, which is set to zero at the beginning. Then, at every iteration, the Add unit receives a new portion of a shifted vector from the collapse unit and combines it with the old accumulated values read from Mr. The Add unit actually contains plain xor gates in the case of a binary vector and a set of complete adders for an integer vector.
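A behavioural Python sketch of the register pair, the collapse unit and the accumulation into Mr is given below. It is a simplified model under a stated assumption: the vector length is a multiple of Nb, which avoids the handling of a partially filled last word (the real block sizes p do not satisfy this). All names are hypothetical.

```python
def collapse(row_n, row_n1, index):
    """Select Nb consecutive elements starting at `index` in row_n, continuing into row_n1."""
    return (row_n + row_n1)[index:index + len(row_n)]


def accumulate_product(v, pos, nb, binary=True):
    """Word-serial shift-and-add; assumes len(v) is a multiple of nb (simplification)."""
    n_words = len(v) // nb
    mv = [v[a * nb:(a + 1) * nb] for a in range(n_words)]
    mr = [[0] * nb for _ in range(n_words)]        # result memory cleared at start
    for x in pos:
        i, j = divmod(x, nb)                       # address and index of the start
        row_n = mv[i]
        for c in range(n_words):
            row_n1 = mv[(i + c + 1) % n_words]     # next row, wrapping around
            part = collapse(row_n, row_n1, j)
            if binary:
                mr[c] = [a ^ b for a, b in zip(mr[c], part)]   # xor gates
            else:
                mr[c] = [a + b for a, b in zip(mr[c], part)]   # adders
            row_n = row_n1                         # shift the register pair
    return [elem for word in mr for elem in word]


v = list(range(16))                                # hypothetical vector, length 16
r = accumulate_product(v, [2, 5], nb=4, binary=False)
```

Each inner step produces one Nb-element slice of a shifted vector and combines it with the partial result already stored at the same word of Mr, mirroring the read-modify-write behaviour described above.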

The complete result is available in the Mr memory after the loading of the last vshifted.

This solution is scalable and provides a speed-up by a factor Nb over the sequential implementation.

5 Simulations and Synthesis results

The proposed architecture has been described in a parameterized form, for multiple degrees of parallelism, ranging from Nb = 8 up to Nb = 64. The synthesis has been completed using Synopsys Design Compiler and a CMOS 65 nm technology library. Functional and post-synthesis simulations have been run to determine the exact number of required cycles and to derive the power dissipation.

The results provided by the Modelsim simulations are given in Table 4 for two codes with n0 = 2 and p = 27,779 or p = 15,013.

The FPGA synthesis has been carried out with Xilinx Vivado, targeting the Artix-7 xc7a50tcpg236-3 device and setting the clock frequency at 100 MHz. The occupied resources are detailed in Table 6.

Although the hardware implementation of QC-LDPC code decoders for wireless communications has been deeply investigated [11], only a few dedicated architectures for code-based post-quantum cryptography are available in the open literature. The FPGA results reported in Table 6 can be compared against the LEDAcrypt implementation proposed in [14], which supports the n0 = 2, p = 15,013 code with a bit parallelism of 32 bits: the reported resource usage for a Xilinx Virtex-6 device is 650 FFs and 2222 LUTs, the achievable clock frequency is 140 MHz and the required number of cycles is 2.62 · 10^6. Therefore, the architecture described in [14] has a decryption latency of 2,620,000 cycles at 140 MHz, corresponding to 18.7 ms. On the other hand, the solution described in this paper achieves a decryption latency, for the same code and degree of parallelism, equal to 507,000 cycles at 100 MHz, equal to 5.05 ms.

6 Conclusions and Future Work

This paper presents a hardware implementation of the LEDAcrypt post-quantum cryptographic primitives based on QC-LDPC codes. A key advantage of the present design is the scalability of the obtained decoder, which allows for different trade-offs between computational time and implementation cost. The decryption latency ranges from 3.16 ms to 16 ms, for an internal degree of parallelism between 8 and 64.

As future work, the effect of an extended parallelism in the critical processing tasks could be explored in more detail. Given the large impact of processing parallelism on the total amount of occupied resources or silicon area, several architecture-level choices have to be explored in order to balance the reduction of the decoding time against the increase of the implementation cost. A second line of research can aim at improving the implementation efficiency by means of joint algorithm and architecture optimizations. As an example, possible algorithm simplifications can be investigated to evaluate both their effects on the cryptographic primitives and the advantages they provide in terms of hardware implementation.

References

[1] Rashmi Agrawal, Lake Bu, Alan Ehret, and Michel Kinsy. Open-source FPGA implementation of post-quantum cryptographic hardware primitives. In Field Programmable Logic and Applications (FPL), 2019 International Conference on, Sep. 2019.

[2] Frank Arute, Kunal Arya, Ryan Babbush, et al. Quantum supremacy using a programmable superconducting processor. Nature, 574:505-510, 2019.

[3] Baldi, Barenghi, Chiaraluce, G. Pelosi, and Santini. LEDAcrypt. https://github.com/LEDAcrypt/LEDAcrypt/tree/master, last viewed February 2019, 2019.

[4] Baldi, Bodrato, and Chiaraluce. A new analysis of the McEliece cryptosystem based on QC-LDPC codes. In Proceedings of the 6th International Conference on Security and Cryptography for Networks (SCN 2008), pages 246-262, Berlin, Heidelberg, 2008. Springer-Verlag.

[5] Marco Baldi, Alessandro Barenghi, Franco Chiaraluce, Gerardo Pelosi, and Paolo Santini. LEDAcrypt: QC-LDPC code-based cryptosystems with bounded decryption failure rate. In Marco Baldi, Edoardo Persichetti, and Paolo Santini, editors, Code-Based Cryptography, pages 11-43, Cham, 2019. Springer International Publishing.

[6] Berlekamp, McEliece, and H. van Tilborg. On the inherent intractability of certain coding problems (corresp.). Information Theory, IEEE Transactions on, 24(3):384-386, May 1978.

[7] Daniel Bernstein. Grover vs. McEliece. In Proceedings Post-Quantum Cryptography: Third International Workshop (PQCrypto 2010), pages 73-80, Darmstadt, Germany, May 2010. Springer Berlin Heidelberg.

[8] Braun, Fritzmann, G. Maringer, Schamberger, and Sepúlveda. Secure and compact full NTRU hardware implementation. In 2018 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), pages 89-94, Oct. 2018.

[9] Lily Chen, Stephen Jordan, Yi-Kai Liu, Dustin Moody, Rene Peralta, Ray Perlner, and Daniel Smith-Tone. Report on post-quantum cryptography. National Institute of Standards and Technology Internal Report, 8105, 2016.

[10] Tung Chou. QcBits: Constant-time small-key code-based cryptography. https://www.win.tue.nl/~tchou/papers/qcbits.pdf, 2016.

[11] Condo, Martina, and Masera. A network-on-chip-based turbo/LDPC decoder architecture. In 2012 Design, Automation Test in Europe Conference Exhibition (DATE), pages 1525-1530, 2012.

[12] Farahmand, M. U. Sharif, Briggs, and Gaj. A high-speed constant-time hardware implementation of NTRUEncrypt SVES. In 2018 International Conference on Field-Programmable Technology (FPT), pages 190-197, Dec. 2018.

[13] Howe, Moore, M. O'Neill, F. Regazzoni, T. Güneysu, and K. Beeden. Lattice-based encryption over standard lattices in hardware. In 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1-6, June 2016.

[14] Hu, Baldi, Santini, Zeng, Ling, and Wang. Lightweight key encapsulation using LDPC codes on FPGAs. IEEE Transactions on Computers, pages 1-1, 2019.

[15] Hu and R. C. C. Cheung. Area-time efficient computation of Niederreiter encryption on QC-MDPC codes for embedded hardware. IEEE Transactions on Computers, 66(8):1313-1325, Aug. 2017.

[16] Koziel, Azarderakhsh, and M. M. Kermani. A high-performance and scalable hardware architecture for isogeny-based cryptography. IEEE Transactions on Computers, 67(11):1594-1609, Nov. 2018.

[17] Koziel, Azarderakhsh, M. Mozaffari Kermani, and Jao. Post-quantum cryptography on FPGA based on isogenies on elliptic curves. IEEE Transactions on Circuits and Systems I: Regular Papers, 64(1):86-99, Jan. 2017.

[18] Ingo Von Maurich, Tobias Oder, and Tim Güneysu. Implementing QC-MDPC McEliece encryption. ACM Trans. Embed. Comput. Syst., 14(3), April 2015.

[19] R. J. McEliece. A public-key cryptosystem based on algebraic coding theory. Deep Space Network Progress Report, 44:114-116, January 1978.

[20] National Institute of Standards and Technology. Post-quantum cryptography project.

[21] National Institute of Standards and Technology. Post-quantum cryptography project - round 2 submissions.

[22] National Institute of Standards and Technology. Submission requirements and evaluation criteria for the post-quantum cryptography standardization process.

[23] Deepraj Soni. A hardware evaluation study of NIST post-quantum cryptographic signature schemes. In Second PQC Standardization Conference, Santa Barbara, CA, August 2019.

[24] Wen Wang, Jakub Szefer, and Ruben Niederhagen. FPGA-based Niederreiter cryptosystem using binary Goppa codes. In Tanja Lange and Rainer Steinwandt, editors, Post-Quantum Cryptography, pages 77-98, Cham, 2018. Springer International Publishing.