Reconfigurable modular arithmetic logic unit supporting high-performance RSA and ECC over GF(p)

This paper presents a reconfigurable hardware architecture for public-key cryptosystems. By changing the connections of coarse grain carry-save adders (CSAs), the datapath provides high performance modular operations that can be used for both RSA and elliptic curve cryptography (ECC). In addition, we introduce reconfigurable flip-flops in order to make an optimal choice of hardware resources. The proposed datapath is implemented with a 0.25-µm complementary metal oxide semiconductor (CMOS) technology and on a field programmable gate array (FPGA). We compare the performance of modular exponentiation for RSA and scalar multiplication for ECC based on the prototype implementation. The results show that higher performance is obtained for ECC on the same hardware platform.


Introduction
The idea of public-key cryptography was introduced in the mid-1970s (Diffie and Hellman 1976). They showed that one can eliminate the need for prior agreement on a key in order to exchange confidential data. Public-key cryptosystems also enable digital signatures. The best-known and most commonly used public-key cryptosystems are RSA and elliptic curve cryptography (ECC). The RSA public-key cryptosystem is named after its inventors, Rivest, Shamir and Adleman (Rivest et al. 1978). ECC is based on a different algebraic structure (Miller 1985, Koblitz 1987): the group used is the group of points on an elliptic curve. It is important to point out that ECC offers the same level of security as RSA for much smaller key sizes.
The contribution of this paper is an architecture for a reconfigurable datapath that serves both RSA and ECC over a field of prime characteristic. The design is evaluated with ASIC and FPGA implementations. The proposed reconfigurable datapath can achieve arbitrary precision up to 2048 bits, hence bridging the gap between the 160-bit operands of ECC and the 2048-bit moduli of RSA. Modular multiplications are based on Montgomery's method without any final modular reduction, which is also beneficial for resistance against side-channel attacks.
The results show that the proposed reconfigurable datapath is indeed a suitable solution for high-performance public-key cryptosystems such as RSA and ECC. Comparing the two with the same hardware resources and with bit-lengths that provide similar security, we found that ECC-256p allows higher performance than RSA-2048. This research is of interest because constant progress in cryptography and security applications calls for alternative solutions for public-key services such as signatures and key distribution.
This paper is organized as follows. Section 2 lists some relevant previous work. Some mathematical background is explained briefly in Section 3. In Section 4, the details of our architecture are given. The main contribution of our work, i.e., the reconfigurable datapath, is explained in Section 5. The implementation results are given in Section 6. Section 7 concludes the paper.

Related work
This section reviews some of the most relevant previous work on hardware implementations of RSA and ECC. Considering both RSA and ECC on the same platform has only recently become more popular, since ECC has proven to be a mature technology. Some of the work is done on FPGAs, and only very few publications present an ASIC implementation of ECC over a field of prime characteristic.
More recent work on hardware implementation of RSA includes that of McIvor et al. (2003). They use carry-save adders (CSAs) to perform the large word-length additions required for Montgomery modular multiplication (MMM). The reported performance for one 1024-bit RSA decryption on the Xilinx Virtex-2 board was 2.63 ms. Satoh and Takano (2003) present a dual-field multiplier with the best performance so far in both binary and prime fields. The throughput of an elliptic curve scalar multiplication is maximized by use of the MMM and an on-the-fly redundant binary converter. The biggest advantage of their design is its scalability in operand size and its flexibility in trading speed against hardware area. The work by Tenca and Koç (2003) also introduces a scalable architecture for the computation of modular multiplication, based on the MMM. Their proposed multiplier works with any precision of the input operands, limited only by memory or control constraints. Andres et al. (2005) improve the Tenca-Koç Montgomery multiplier and achieve half the latency and half the queue memory requirement.

Mathematical background
In this section, we give some mathematical background for the RSA cryptosystem and ECC over a prime field.

RSA
In a public-key cryptosystem, as introduced by Diffie and Hellman (1976), each user owns a pair of keys, the private and the public key. The private key of a user in this case consists of two large primes p and q and an exponent d. The public key consists of a pair (N, e), where $N = p \cdot q$ is the modulus (at least 1024 bits) and the exponent e is such that

$$d = e^{-1} \bmod (p - 1)(q - 1).$$

The corresponding p, q and d are kept secret. To encrypt a message M, the user computes $C = M^e \bmod N$, and decryption is described by

$$M = C^d \bmod N.$$

The previous equality follows from Fermat's theorem and the fact that $\lambda(N) = \mathrm{lcm}(p - 1, q - 1)$ is a divisor of $\varphi(N) = (p - 1)(q - 1)$. RSA encryption is a modular exponentiation with the public exponent e; the private exponent d is used for decryption.
Here, we consider the CRT implementation of RSA, which is an efficient way to reduce the work factor of the decryption process. The use of the Chinese remainder theorem (CRT) for RSA was proposed by Quisquater and Couvreur (1982). By means of the CRT, RSA decryption can be sped up by a factor of up to 4 (Koblitz 1994). This is very attractive for practical applications. However, it introduces some security pitfalls, so it has to be implemented carefully.
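As a minimal sketch of the CRT speed-up, the following Python code decrypts with two half-size exponentiations and recombines the results. The toy primes, the helper names (`make_key`, `decrypt_crt`) and the recombination by Garner's formula are illustrative assumptions, not details of the implementation described in this paper.

```python
# Toy RSA with CRT decryption (illustrative parameters, far too small to be secure).

def make_key(p, q, e):
    """Return the public pair (N, e), the private exponent d and CRT parameters."""
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))   # d = e^-1 mod (p-1)(q-1), Python 3.8+
    crt = {
        "p": p,
        "q": q,
        "dp": d % (p - 1),              # exponent used modulo p
        "dq": d % (q - 1),              # exponent used modulo q
        "q_inv": pow(q, -1, p),         # q^-1 mod p, used for recombination
    }
    return (n, e), d, crt


def decrypt_plain(c, d, n):
    """One full-size exponentiation: M = C^d mod N."""
    return pow(c, d, n)


def decrypt_crt(c, crt):
    """Two half-size exponentiations followed by Garner recombination."""
    p, q = crt["p"], crt["q"]
    m_p = pow(c % p, crt["dp"], p)
    m_q = pow(c % q, crt["dq"], q)
    h = (crt["q_inv"] * (m_p - m_q)) % p
    return m_q + h * q


(n, e), d, crt = make_key(61, 53, 17)
message = 65
cipher = pow(message, e, n)
assert decrypt_plain(cipher, d, n) == message
assert decrypt_crt(cipher, crt) == message
```

Each of the two exponentiations uses a modulus and an exponent of half the length, which is where the speed-up of up to a factor of 4 comes from when the cost of a modular multiplication grows quadratically with the operand length.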
The two most straightforward algorithms to implement modular exponentiation are given in Algorithm 1, where G is a finite abelian group and e is a positive integer.
Algorithm 1: Algorithms for left-to-right and right-to-left binary exponentiation.
Require: $g \in G$, $e = (e_{n-1} e_{n-2} \cdots e_1 e_0)_2$
Ensure: $g^e$
1: $A \leftarrow 1$
2: for i from n − 1 downto 0 do
3: …

The basic operations in both algorithms are multiplications and squarings. To be able to use the same datapath for both operations, and also because of side-channel issues (Kocher et al. 1998), the squarings are not performed on a dedicated squarer but on the multiplier. Taking into account an expected value of n/2 ones in e, the total number of multiplications in both algorithms is 3n/2 on average. In the left-to-right algorithm, the multiplications have to be performed consecutively, requiring one memory location for intermediate values. In the right-to-left algorithm, the multiplication and the squaring of one iteration can be performed in parallel, which doubles the speed. However, the right-to-left algorithm uses two memory locations for intermediate values.
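The two binary exponentiation methods can be sketched as follows, taking the group G to be the integers modulo a modulus. This is a textbook formulation rather than a transcription of Algorithm 1; the function names and the multiplication counter are assumptions added for illustration, and squarings are counted as multiplications because they run on the same multiplier.

```python
def exp_left_to_right(g, e, modulus):
    """Scan the exponent from the most significant bit; one accumulator A."""
    a = 1
    mults = 0
    for i in range(e.bit_length() - 1, -1, -1):
        a = (a * a) % modulus          # squaring, done on the multiplier
        mults += 1
        if (e >> i) & 1:
            a = (a * g) % modulus      # multiply by the base when e_i = 1
            mults += 1
    return a, mults


def exp_right_to_left(g, e, modulus):
    """Scan the exponent from the least significant bit; two variables A and B."""
    a = 1
    b = g % modulus
    mults = 0
    for i in range(e.bit_length()):
        if (e >> i) & 1:
            a = (a * b) % modulus      # independent of the squaring below
            mults += 1
        b = (b * b) % modulus          # can run in parallel with the line above
        mults += 1
    return a, mults


result_lr, count_lr = exp_left_to_right(7, 0b1011, 1009)
result_rl, count_rl = exp_right_to_left(7, 0b1011, 1009)
assert result_lr == result_rl == pow(7, 0b1011, 1009)
assert count_lr == count_rl == 4 + 3   # n squarings plus one multiplication per 1-bit
```

In the right-to-left variant the update of A and the squaring of B within one iteration do not depend on each other, which is what allows the two operations to be issued in parallel at the price of the second storage location.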

ECC over GF(p)
The main operation in any ECC-based primitive is scalar multiplication. This operation is often performed by some sort of double-and-add algorithm that uses the point group operation (Blake et al. 1999). In turn, finite field operations in GF(p), such as addition, subtraction, multiplication and inversion, are required to perform the group operations.
When E is a curve defined over GF(p), the typical equation is

$$y^2 = x^3 + ax + b,$$

where $a, b \in GF(p)$ and $4a^3 + 27b^2 \neq 0$. For an arbitrary point $P = (x_1, y_1)$ on a curve E, the inverse of the point is $-P = (x_1, -y_1)$. The sum $P + Q$ of the points $P = (x_1, y_1)$ and $Q = (x_2, y_2)$, with $P \neq \pm Q$, is the point $R = (x_3, y_3)$, where

$$x_3 = \lambda^2 - x_1 - x_2, \qquad y_3 = \lambda (x_1 - x_3) - y_1, \qquad \lambda = \frac{y_2 - y_1}{x_2 - x_1}.$$

For $P = Q$, we get the following "doubling" formulae:

$$x_3 = \lambda^2 - 2x_1, \qquad y_3 = \lambda (x_1 - x_3) - y_1, \qquad \lambda = \frac{3x_1^2 + a}{2y_1}.$$

The point at infinity O plays a role analogous to that of the number 0 in ordinary addition. Thus, $P + O = P$ and $P + (-P) = O$ for all points P.

There are many types of coordinates in which an elliptic curve may be represented. In the equations above affine coordinates are used, but so-called projective coordinates have some implementation advantages. More precisely, in this case point addition can be done using field multiplications only, with no inversion required except for a single one at the end of a point multiplication. Many types of projective coordinates have been proposed in the literature. In particular, a weighted projective representation (also referred to as Jacobian representation) is preferred because it gives faster arithmetic on elliptic curves. In this representation, a triplet $(X, Y, Z)$ corresponds to the affine coordinates $(X/Z^2, Y/Z^3)$ for $Z \neq 0$. In this case, we have a weighted projective curve equation of the form

$$Y^2 = X^3 + aXZ^4 + bZ^6.$$

Weighted projective coordinates provide faster arithmetic than the "normal" projective coordinates. Conversion from projective to affine coordinates costs 1 inversion and 4 multiplications, while the reverse conversion is trivial. If one implements addition and doubling as specified in the IEEE standard (IEEE P1363 2000), the total cost for general addition is 1I + 3M in affine coordinates and 16M in projective coordinates (11M if $Z_1 = 1$, i.e., one point is given in affine coordinates and the other one in projective coordinates). Here, I and M denote modular inversion and multiplication, respectively. In the case of doubling (with $a = p - 3$), the cost is 1I + 4M in affine coordinates against 8M in projective coordinates. Thus, the choice of coordinates is determined by the ratio I : M.
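The affine group law can be illustrated with a short Python sketch using the standard chord-and-tangent formulae for a curve $y^2 = x^3 + ax + b$ over GF(p). The function names, the use of `None` for the point at infinity O, and the small textbook curve $y^2 = x^3 + 2x + 2$ over GF(17) are assumptions chosen for illustration only.

```python
# Affine point arithmetic on y^2 = x^3 + a*x + b over GF(p).
# None stands for the point at infinity O.

def on_curve(point, a, b, p):
    if point is None:
        return True
    x, y = point
    return (y * y - (x * x * x + a * x + b)) % p == 0


def negate(point, p):
    if point is None:
        return None
    x, y = point
    return (x, (-y) % p)


def add(P, Q, a, p):
    """Return P + Q; each call costs one field inversion (the pow(..., -1, p))."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                   # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p    # doubling
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p           # addition
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


a, b, p = 2, 2, 17
G = (5, 1)
assert on_curve(G, a, b, p)
G2 = add(G, G, a, p)
G3 = add(G, G2, a, p)
assert G2 == (6, 3) and G3 == (10, 6)
assert on_curve(G2, a, b, p) and on_curve(G3, a, b, p)
assert add(G, negate(G, p), a, p) is None
```

Every call to `add` performs one modular inversion, which is the cost that projective coordinates avoid by postponing the inversion to the end of the point multiplication.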
Therefore, multiplication in a finite field is the most important operation to focus on when working with projective coordinates. Affine coordinates, on the other hand, require an extra inverter, because one inversion has to be performed for every point operation.
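The dependence on the ratio I : M can be made concrete with a small calculation based on the operation counts quoted above. The dictionary keys and function names are hypothetical; the sketch only evaluates those counts for a given ratio and reports the ratio above which the inversion-free variant becomes cheaper.

```python
# (inversions, multiplications) per point operation, as quoted in the text.
COSTS = {
    "add_affine": (1, 3),
    "add_projective": (0, 16),
    "add_mixed": (0, 11),          # Z1 = 1
    "double_affine": (1, 4),
    "double_projective": (0, 8),   # a = p - 3
}


def cost_in_mults(op, i_over_m):
    """Cost of one operation in units of M when one inversion costs i_over_m * M."""
    inversions, mults = COSTS[op]
    return inversions * i_over_m + mults


def break_even(affine_op, projective_op):
    """I : M ratio at which both variants cost the same."""
    return COSTS[projective_op][1] - COSTS[affine_op][1]


assert break_even("add_affine", "add_projective") == 13
assert break_even("add_affine", "add_mixed") == 8
assert break_even("double_affine", "double_projective") == 4

# Example: with I = 10M, doubling is cheaper in projective coordinates (8M vs 14M),
# while a general addition is still cheaper in affine coordinates (13M vs 16M).
assert cost_in_mults("double_affine", 10) == 14
assert cost_in_mults("add_affine", 10) == 13
```

With these counts, projective coordinates pay off for doubling as soon as an inversion costs more than 4 multiplications, and for general addition as soon as it costs more than 13 multiplications (8 for mixed addition).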
Similar to the left-to-right and right-to-left binary algorithms for modular exponentiation, a point multiplication can be performed using Algorithm 2 (Menezes 1993), where P is a point on the elliptic curve and k is a positive integer. The point at infinity O is the identity element for elliptic curve operations.
Algorithm 2: Algorithm for left-to-right and right-to-left binary point multiplication.

As with the modular exponentiation algorithms, the left-to-right algorithm is used when the storage of intermediate values is the bottleneck, while the right-to-left algorithm is used for higher speed when the datapath allows parallelism.
The point operations in Algorithm 2 are point additions and point doublings. In our case, a point doubling and a point addition consist of 14 and 21 multiply/add operations by the MALU in the underlying finite field, respectively. Therefore, for an l-bit scalar k with an expected l/2 non-zero bits, the total number of MALU operations for a point multiplication is estimated as 14l + 21l/2 = (49l)/2. However, as we allocate two MALUs for the ECC case, the number becomes 21l by processing point additions and doublings in parallel.
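The operation counts can be reproduced with a few lines of Python. The sketch assumes 14 MALU operations per point doubling and 21 per point addition, which is the assignment consistent with the (49l)/2 estimate, and it models the two-MALU case as an upper bound in which each scalar bit costs the longer of the two point operations. All names are hypothetical.

```python
DOUBLE_OPS = 14   # MALU operations per point doubling (assumed assignment)
ADD_OPS = 21      # MALU operations per point addition (assumed assignment)


def ops_one_malu(k):
    """Sequential count for a concrete scalar k: one doubling per bit, one addition per 1-bit."""
    length = k.bit_length()
    ones = bin(k).count("1")
    return DOUBLE_OPS * length + ADD_OPS * ones


def expected_ops_one_malu(length):
    """Average count for a random scalar with length/2 ones: 49 * length / 2."""
    return (2 * DOUBLE_OPS + ADD_OPS) * length / 2


def ops_two_malus(length):
    """Additions and doublings overlap, so each bit costs at most max(ADD, DOUBLE)."""
    return max(ADD_OPS, DOUBLE_OPS) * length


assert expected_ops_one_malu(256) == 49 * 256 / 2 == 6272
assert ops_two_malus(256) == 21 * 256 == 5376
assert ops_one_malu(0b1011) == 14 * 4 + 21 * 3
```

For a 256-bit scalar this gives about 6272 MALU operations on one MALU and 5376 operation slots when two MALUs work in parallel.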

Datapath of the MALU
The proposed architecture is an MMM with digit-serial multiplications (Algorithm 3). Four-to-two (4-2) CSAs (figure 1a) are used in the hardware implementation because they are considered to be among the most efficient solutions for multi-operand addition.
The cell, a column of the datapath of the MALU, uses d sets of 4-2 CSAs (figure 1b), i.e., the inputs and outputs of the cell are represented in 2-bit CS-form during the operation. Therefore, the cell needs 2d sets of full adders (FAs). The critical path of the datapath is estimated from the critical path delay of the cell as follows:

$$T_{4\text{-}2\,\mathrm{CSAs}} = 2d \cdot T_{\mathrm{FA}}.$$

Here, we assumed that the delays of the sum and carry calculations are the same. The propagation of $s_{i,j}$ goes through the d sets of 4-2 CSAs in the cell and uses two FAs in each of them.
The right-most cell, cell(i, 0), provides the $m_i$ vector for the rest of the cells.
As expressed in equation (7), the path for generating a bit of $m_i$ consists only of a 3-input XOR in the right-most cell.
The delay of the worst-case combination of paths through the logic generating $m_i$ is assumed to be equal to or shorter than $T_{4\text{-}2\,\mathrm{CSAs}}$. As can be seen from the hardware configuration and the delay calculations, the area and delay of the MALU datapath scale with d. In this way, the propagation delay can be tuned to the speed of the system: the proposed array is flexible regarding the size of d, so a trade-off can be made between performance and cost.

Functionality of the MALU
Before explaining the general case, the main functionality of the MALU is explained for the case d = 1. In this configuration, each cell is composed of one 4-2 CSA (figure 1a). The 4-2 CSA sums the four one-bit inputs xy, mn, s and c and outputs two bits in the redundant CS-form whose value is $2c_{next} + s_{next}$, where s and c are the virtual sum and carry. The bit products xy and mn are the main inputs for computing the bit-level step of the Montgomery multiplication in Algorithm 3, i.e., $(T + xy + mn)/2$.
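The carry-save principle behind the 4-2 CSA can be sketched in Python at the level of whole bit-vectors, with Python integers standing in for rows of cells. This is a behavioural model under stated assumptions, not the wiring of figure 1: the function names are hypothetical, and the carry word is returned already shifted to its bit weight.

```python
def csa_3to2(a, b, c):
    """One row of full adders: three operands in, (sum, carry) out, no carry propagation."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def csa_4to2(a, b, c, d):
    """A 4-2 compressor built from two full-adder levels."""
    s1, c1 = csa_3to2(a, b, c)
    return csa_3to2(s1, c1, d)


# The redundant pair (sum, carry) always represents the true total.
xy, mn, vs, vc = 0b1011, 0b0110, 0b1110, 0b0101
s_next, c_next = csa_4to2(xy, mn, vs, vc)
assert s_next + c_next == xy + mn + vs + vc
```

Each bit position passes through two full adders regardless of the operand length, which is why the delay of the carry-save stage does not grow with the number of cells.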
Algorithm 3: Algorithm for d-digit serial Montgomery modular multiplication over GF(p) without final subtraction.
Require: …
…
… m N)/2 // addition stage
3+2d: end for
4+2d: Return T

Put simply, a multiplication can be computed with $(k + \alpha)^2$ 4-2 CSA operations if the multiplicand and multiplier have $(k + \alpha)$ bits. However, since there is no carry propagation in the j-direction shown in figure 1, it is natural to allocate $(k + \alpha)$ sets of cells in the j-direction to gain speed. This CSA array is defined as the minimal configuration of our proposed MALU. The connections of the CSA arrays in the i-direction are determined by the bit weights of the CSA outputs (numbers in parentheses in figure 1a, b) and the division in the bit-level Montgomery algorithm (1-bit right-shift). The connection is latched with $(2k + 2\alpha - 1)$ sets of F/Fs for virtual carries.
The MALU for a general d works as follows. As illustrated in figure 1(c), the MALU with 4-2 CSAs has four types of input vectors: X, Y, N and S. Here, X is the multiplier, Y is the multiplicand and N is the modulus. The augend vector S is provided to the MALU d bits per cycle and is eventually added to the result of the modular multiplication of X and Y (modulo N). The intermediate results are stored in $VS = (vs_{i,k+\alpha-1} \cdots vs_{i,1}\, vs_{i,0})_2$ and $VC = (vc_{i,k+\alpha-1} \cdots vc_{i,1}\, vc_{i,0})_2$. They are reset to zero when a modular multiplication starts to execute (i = 0). After finishing a Montgomery multiplication, the result T is output from the right-most cell d bits per cycle.

The MALU has two independent stages for GF(p) operation. One is the carry-save (CS) stage, which implements the Montgomery algorithm in CS-form. The other, the carry-propagate (CP) stage, converts the CS-form integer into a normal integer by propagating carries. Moreover, the CP-stage is capable of adding S to, or subtracting S from, the result of the CS-stage. When subtracting S from XY, we use the 2's complement of S. More precisely, each bit of S is inverted when setting the register for S, and $(2N + 1)$ is provided from the inputs of mn at the first cycle of the CP-stage. To reduce the hardware cost and the critical path delay, the CP calculations are executed in the same datapath of the MALU as the CS calculations. The operation of the MALU is expressed in equation (9):

$$T \equiv X Y R^{-1} \pm S \pmod{N}. \qquad (9)$$

Here, R is selected as $R = 2^{k+\alpha}$, where k is the bit-length of the secret key (i.e., $N < 2^k$) and $\alpha$ is a value determined so that the final reductions can be avoided. In our case, we chose $\alpha = 4$. The details are explained by the following Lemma, based on previous work by Batina et al. (2004).
Lemma 1: If the Montgomery parameter R satisfies the inequality $R > 16N$, then for inputs $X, Y < 4N$ and $S < 2N$ the result T will satisfy $T < 4N$ (as required).
Proof: The Montgomery multiplication as implemented in the MALU calculates the following:

$$T = \frac{XY + mN}{R} + S, \qquad 0 \le m < R.$$

With $X, Y < 4N$, $S < 2N$ and $R > 16N$, it follows that

$$T < \frac{16N^2}{R} + N + 2N < 4N.$$

While a reduction step is needed in the original formulation of Montgomery's algorithm, we use a method that does not require it. For convenient repeated use of equation (9), operands are kept in the so-called Montgomery form, because the output is in Montgomery form as well. The latency of one MALU operation modulo N is $2 \cdot \lceil (k + \alpha)/d \rceil$ cycles in total.
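The bound of Lemma 1 can be checked numerically with a bit-serial (d = 1) software model of the multiply-add operation. The sketch below is an arithmetic model only: it does not represent the CS and CP stages, it covers the addition of S but not the subtraction, and the function name `malu_mul_add` and the choice of test modulus are assumptions for illustration.

```python
ALPHA = 4  # extra iterations, so that R = 2^(k + ALPHA) > 16N whenever N < 2^k


def malu_mul_add(x, y, s, n, k):
    """Return T = (x*y + m*n) / R + s with R = 2^(k + ALPHA), without final subtraction."""
    assert n % 2 == 1 and n < (1 << k)
    t = 0
    for i in range(k + ALPHA):
        t += ((x >> i) & 1) * y     # add x_i * Y
        m_i = t & 1                 # N is odd, so adding m_i * N clears the lowest bit
        t = (t + m_i * n) >> 1      # exact division by 2
    return t + s                    # addition of the augend S


n = 2**255 - 19                     # an odd 255-bit modulus used as test data
k = n.bit_length()
r = 1 << (k + ALPHA)

# Worst-case inputs allowed by the lemma.
x, y, s = 4 * n - 1, 4 * n - 1, 2 * n - 1
t = malu_mul_add(x, y, s, n, k)
assert t < 4 * n                                     # output range of Lemma 1
assert (t - (x * y * pow(r, -1, n) + s)) % n == 0    # T = X*Y*R^-1 + S (mod N)
```

Because the output again lies below 4N, it can be fed directly into the next operation, which is what makes chains of multiplications possible without intermediate reductions.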

Reconfigurable datapath
In order to perform a high-performance modular operation (equation (9)), we need to allocate $(k + \alpha - 1)$ cells for the datapath of the MALU. For instance, ECC-256p and RSA-2048 need 260 and 2,052 cells, respectively. For the general case ($K \times D$ sets of $\mathrm{MALU}_{k \times d}$), the block diagram of the reconfigurable datapath is illustrated in figure 2. Since we target a platform that supports both ECC-256p and RSA-2048 (with or without CRT), we introduce a coarse-grain datapath, $\mathrm{MALU}_{260 \times 1}$ (k = 256 and d = 1), and allocate its clones. Figure 3 shows the interconnection for ECC and RSA in this case.
The datapath can be configured by changing the interconnection of the $\mathrm{MALU}_{260 \times 1}$ clones, which is determined by three multiplexors (figure 4) and two- or three-bit registers for selecting them. The signals sel1 and sel2 are used for configuring the datapath, and sel3 is used for configuring the flip-flops. These multiplexors and flip-flops for the configuration are considered as the area overhead (denoted as AO) introduced by the reconfigurable feature. In the base estimate, $AO_{base}$, we ignored the area increase caused by the complexity of the wiring. In addition to $AO_{base}$, some flip-flops remain unused depending on the configuration. As an example, we consider the case of using eight clones of $\mathrm{MALU}_{260 \times 1}$ (K = 2 and D = 4). When supporting RSA-2048, the datapath is configured as $\mathrm{MALU}_{2080 \times 1}$. In this configuration, the horizontal connections of the MALUs are not re-timed by the flip-flops for S and R ($\mathrm{REG}_S$ and $\mathrm{REG}_R$), so these flip-flops add to the total area overhead. Likewise, for RSA-2048 with CRT (Quisquater and Couvreur 1982), the datapath is configured as two sets of $\mathrm{MALU}_{1040 \times 1}$, and the area overhead is estimated in the same way. For ECC-256p, we configure the datapath so that two sets of $\mathrm{MALU}_{260 \times 4}$ can be used in parallel. This configuration uses vertical series of $\mathrm{MALU}_{260 \times 1}$. The intermediate values, the virtual carry and sum, are stored in the flip-flops for VS and VC only at the bottom of figure 2.
In this configuration, only one fourth of $\mathrm{REG}_X$, $\mathrm{REG}_Y$ and $\mathrm{REG}_N$ is used in each $\mathrm{MALU}_{260 \times 1}$, and the remaining flip-flops add to the area overhead. In the RSA configurations, we can utilize the flip-flops with almost no waste, while the ECC configuration cannot use them effectively. In order to exploit the unused flip-flops, we introduce another kind of reconfigurability. Unlike RSA, ECC needs to store intermediate variables during point operations. For this purpose, two sets of 14 words of 260-bit RAM (28 × 260-bit RAM) can be configured from the unused flip-flops, so that these flip-flops no longer count towards the area overhead. Thus, we can make the best use of the hardware resources for the ECC configuration as well. The critical path delay differs between the configurations (equation (16)). We assumed that the wiring delay in the critical path of ECC is four times longer than that of RSA. As can be seen from equation (16), the critical path delay of ECC is about four times longer than that of RSA. Therefore, we need to assume a setting in which two different clock frequencies can be used, e.g., by providing a divided clock for the ECC configuration, in order to facilitate high performance for both configurations.

Performance comparison

Performance estimation for RSA and ECC
We implemented the proposed datapath on a Xilinx FPGA (Spartan 3). The place-and-route result of Xilinx ISE is shown in table 1. The same design is also synthesized with a 0.25-µm CMOS technology by using Synopsys Design Vision, and the pre-layout netlist is used for the cost and performance estimation. As for the constraints of the synthesis, the design is set to the ECC configuration, and the critical path delay for the RSA configuration is estimated from the result of static timing analysis (STA).
Based on the required number of multiplications for RSA and ECC, we estimate the performance of both and compare them. The result is summarized in table 2. For the ECC case, the latency of one point multiplication is used as the performance measure, and for the RSA case the latency of one modular exponentiation is estimated. As can be seen from the result, the performance of modular exponentiation for RSA-2048 is lower than that of ECC by a factor of about 8 for the FPGA implementation and about 12 for the synthesis with the 0.25-µm CMOS technology. Even when applying CRT acceleration, ECC-256p outperforms RSA-2048. Moreover, considering that ECC-256p offers stronger security than RSA-2048 (Lenstra and Verheul 2000), ECC-256p shows significantly higher performance than RSA-2048 in our design case.

Comparison with previous work
In this part, we compare our design with the state of the art in Montgomery multipliers. For comparison, we focus on multipliers for RSA implemented on FPGAs. Table 3 lists the previous work and compares it with our results. As for the flexibility of the design, in most of the previous work the operand size is fixed to facilitate high-speed modular multiplication. In contrast, our design can support 256-bit ECC and 2048-bit RSA effectively by changing the settings of the datapath configuration. The performance of our design is nevertheless competitive with previously designed high-performance modular multipliers that are dedicated to either ECC or RSA.

Conclusions
We presented a new reconfigurable datapath that enables modular operations for different bit-widths. In addition, the flip-flops are also reconfigured depending on the configuration in order to use hardware resources effectively. The estimated performance based on an FPGA implementation is shown as a case study of RSA-2048 (with and without CRT) and ECC-256p. The results show that our proposed datapath is suitable for a high-performance cryptosystem supporting both RSA and ECC over GF(p). In particular, ECC-256p shows higher performance than RSA-2048 using the same amount of hardware resources.

Table 1. Implementation result of the proposed datapath.

Table 3. Performance comparison of Montgomery multipliers over GF(p).