Performance Evaluation of Scalar Multiplication in Elliptic Curve Cryptography Implementation using Different Multipliers Over Binary Field

This paper presents a point multiplication processor over the binary field $GF(2^{233})$ with internal registers integrated within the point-addition architecture to enhance the Performance Index (PI) of scalar multiplication. The proposed design uses one of two types of finite field multipliers, either the Montgomery multiplier or the interleaved multiplier, supported by an additional layer of internal registers. Lopez-Dahab coordinates are used for the computation of point multiplication on the Koblitz curve (K-233). The Performance Index is the metric used to compare implementations of the design on different FPGA platforms. The first approach attains a PI of approximately 0.217610202 when realized on Virtex-6 (6vlx130tff1156-3). It uses an interleaved multiplier with 3077 register slices, 4064 lookup tables (LUTs), and 2837 flip-flops (FFs) at a maximum frequency of 221.6 MHz, which makes it more suitable for high-frequency applications. The second approach, which uses the Montgomery multiplier, produces a PI of approximately 0.2228157 when implemented on Virtex-4 (6vlx130tff1156-3). This approach utilizes 3543 slices, 2985 LUTs, and 3691 FFs at a maximum frequency of 190.47 MHz. The implementation of the second approach on Virtex-4 is found to be more suitable for applications with a low frequency of about 86.4 MHz and a total of about 12305 slices.


INTRODUCTION
Elliptic Curve Cryptography (ECC) is a type of asymmetric key cryptography (Jwad, Abdulaah, and Effing, 2012) that provides higher security than Rivest-Shamir-Adleman (RSA) for a smaller key size. A short key is a proper choice for hardware implementations of ECC, especially in devices with restricted resources, as it requires less area and processing time (Kilts, 2006), (Kawther E. Abdullah, 2018). Hardware implementations of cryptosystems produce systems with higher speeds and better security than software implementations. Point Multiplication (PM) is the core operation of ECC. Different projective coordinates can be used for point representation, but this work uses the Lopez-Dahab coordinate system to skip the inversion process, which consumes many resources (Bilal and Rajaram, 2010). The efficiency of a high-performance hardware implementation of scalar multiplication depends on the polynomial representation. Both performance metrics, time and area, should be considered during the design, yet they are conflicting goals: some designs deliver high speed within a compact area, while others attain a lower area at a lower speed. Consequently, hardware implementations require the consideration of both speed and area parameters (Strukov, 2006). Different architectures are adopted to design and realize a multiplier unit, such as Montgomery, Karatsuba, Mastrovito, bit-parallel, and digit-serial. This work considers two types of multipliers over the binary field $GF(2^{233})$: interleaved and Montgomery. This paper aims to enhance the performance index of PM by adding internal registers within the data path of point multiplication, and the proposed designs of PM are implemented on different FPGA platforms. FPGAs are appropriate for intensive computations (Hassan, 2010).
The rest of this paper presents previous work in this field, the mathematical background of finite fields and elliptic curves, the simulation and implementation of the proposed design, the results and discussion, and finally the conclusion.

2. RELATED WORK
The design proposed by (Urbano-Molano, Trujillo-Olaya, and Velasco-Medina, 2013) presented parallel and bit-serial multipliers and obtained execution times of 0.025 μs and 1.62 μs, respectively, with a value of k equal to 9. (Fournaris, Dimopoulos, and Koufopavlou, 2017) presented a strategy for a digit-serial multiplier based on binary Edwards curve scalar multiplier architectures. It relied on $GF(2^k)$ digit-serial multiplication, balancing speed against resource consumption, and on parallelism for distributing $GF(2^k)$ operations while keeping a high level of utilization of the units in each layer.
The design of point multiplication over the binary field $GF(2^{233})$ is presented by (Kadu and Adane, 2018) as a secured curve based on the recommendations of NIST. The performance obtained from this design was assessed by comparing it with a Karatsuba-based point multiplier in terms of area and delay. The results show that the Vedic multiplier occupied 22% less area on FPGA and caused 12% less delay than the Karatsuba-based scalar multiplier. The proposed design was coded using Verilog HDL and simulated and synthesized on Virtex-6.
The architecture of the design proposed by (Rashidi, 2018) was built on Virtex-5 XC5VLX110 and Virtex-4 XC4VLX100 FPGAs to support two fields, $F_{2^{163}}$ and $F_{2^{233}}$. The results show an enhancement in execution time and area when compared to previous work.
Finally, the coprocessor design proposed by (Parrilla et al., 2019) allowed the acceleration of secure services that can be applied in the next generations of FPGAs, thus permitting a secure web or database server and the cryptographic processor to be hosted on the same chip. This coprocessor provided an improvement over other hardware implementations in terms of area, performance, and scalability.
The purpose of this paper is to enhance the PI of scalar multiplication by adding a layer of registers and then to compare the resulting PI among different FPGAs.

MATHEMATICAL BACKGROUND
3.1 Finite Field
A field with a finite number of elements is called a finite field $F_q$. It is used in cryptography, where $q = 2^m$, to implement software or hardware with fast performance. The elements of a binary field can be represented as binary polynomials of degree less than m, where $A(x) = \sum_{i=0}^{m-1} a_i x^i$; the results of arithmetic operations in a binary field are reduced using an irreducible polynomial of degree m. Written out term by term, such a polynomial has the following form:

$A(x) = a_{m-1} x^{m-1} + \dots + a_1 x + a_0$

where $x^i$ is called the ith term of the polynomial, $a_i$ represents its coefficient, and m represents the key length.
Journal of Engineering Volume 26 September 2020 Number 9
For example, an 8-bit word is represented by a polynomial as shown in Fig. 1. It is clear from Fig. 1 that terms with a zero coefficient are omitted; moreover, $x^0$ is 1.
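As a small illustration of this representation, the following Python sketch converts a bit vector (stored as an integer) into its polynomial form. The helper name `poly_str` and the sample byte are illustrative assumptions; the byte is not necessarily the word shown in Fig. 1.

```python
def poly_str(bits: int) -> str:
    """Return the polynomial over GF(2) encoded by the integer `bits`.

    Bit i of `bits` is the coefficient a_i of x^i; zero terms are omitted.
    """
    if bits == 0:
        return "0"
    terms = []
    for i in range(bits.bit_length() - 1, -1, -1):
        if (bits >> i) & 1:
            if i == 0:
                terms.append("1")        # x^0 is written as 1
            elif i == 1:
                terms.append("x")
            else:
                terms.append(f"x^{i}")
    return " + ".join(terms)


# Hypothetical 8-bit word 10100111 (bits 7, 5, 2, 1, 0 are set)
print(poly_str(0b10100111))  # x^7 + x^5 + x^2 + x + 1
```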

Multiplication
This refers to multiplying two polynomials C(x) and D(x) using ordinary polynomial multiplication followed by reduction modulo the polynomial f(x), whose specific value depends on the curve type.
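The two steps, multiplication followed by reduction, can be sketched in Python as follows. This is a minimal software model, assuming polynomials are stored as integers with bit i holding the coefficient of x^i; the function names are illustrative, and the modulus used in the example is the K-233 reduction polynomial $f(x) = x^{233} + x^{74} + 1$.

```python
def clmul(c: int, d: int) -> int:
    """Carry-less product of two polynomials over GF(2)."""
    result = 0
    while d:
        if d & 1:
            result ^= c      # addition over GF(2) is XOR
        c <<= 1              # multiply c(x) by x
        d >>= 1
    return result


def poly_mod(a: int, f: int) -> int:
    """Remainder of a(x) divided by f(x) over GF(2)."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        # cancel the leading term of a(x) with a shifted copy of f(x)
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def field_mul(c: int, d: int, f: int) -> int:
    """C(x) * D(x) mod f(x): multiply first, then reduce."""
    return poly_mod(clmul(c, d), f)


F233 = (1 << 233) | (1 << 74) | 1          # x^233 + x^74 + 1

# x * x^232 = x^233, which reduces to x^74 + 1
print(field_mul(0b10, 1 << 232, F233) == ((1 << 74) | 1))  # True
```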

Squaring
Squaring a polynomial, $C(x)^2$, is very cheap, as it can be accomplished by inserting a zero between consecutive bits of the bit vector (Hankerson, 2004). The division of two polynomials can be accomplished by dividing the polynomial modulo f(x) and keeping the remainder; for example, a polynomial of degree 12 can be divided by a modulus of degree 8.
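To make the division step concrete, the sketch below divides a degree-12 polynomial by a degree-8 modulus over GF(2) and keeps both quotient and remainder. The operands are assumptions chosen for illustration (the dividend $x^{12} + x^7 + x^2$ and the modulus $x^8 + x^4 + x^3 + x + 1$); they are not necessarily the ones used in the original example.

```python
def poly_divmod(a: int, f: int):
    """Long division of a(x) by f(x) over GF(2); returns (quotient, remainder)."""
    if f == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    quotient = 0
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        shift = a.bit_length() - 1 - deg_f
        quotient ^= 1 << shift   # record the term x^shift of the quotient
        a ^= f << shift          # subtract (XOR) the shifted modulus
    return quotient, a


a = (1 << 12) | (1 << 7) | (1 << 2)   # x^12 + x^7 + x^2 (assumed dividend)
f = 0b100011011                        # x^8 + x^4 + x^3 + x + 1 (assumed modulus)

q, r = poly_divmod(a, f)
print(bin(q), bin(r))  # 0b10001 0b101111, i.e. x^4 + 1 and x^5 + x^3 + x^2 + x + 1
```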

Elliptic Curve over $GF(2^m)$
An elliptic curve, from a mathematical aspect, is a cubic equation in a standard form. Eq. (2) defines the elliptic curve over the binary field $GF(2^m)$:

$y^2 + xy = x^3 + ax^2 + b$ (2)

The curve is a set of points, and each point located on the elliptic curve is represented by its x and y coordinates when affine coordinates are used. The values of a and b in Eq. (2) specify the shape of the curve, where $b \neq 0$, and f(x) represents an irreducible polynomial.
Operations on the elliptic curve follow a hierarchical model with four layers. Layer 1 represents the finite field arithmetic operations such as multiplication, addition, division, and inversion. Layer 2 consists of two main components: point addition and point doubling. Point multiplication (scalar multiplication) is layer 3, and in layer 4 lie security schemes such as the Elliptic Curve Digital Signature Algorithm (ECDSA) and Elliptic Curve Diffie-Hellman (ECDH). Different kinds of elliptic curves are available. This work is based on the Koblitz curve over the 233-bit field with the specifications given by NIST. Therefore, Eq. (3) is used instead of Eq. (2):

$y^2 + xy = x^3 + 1$ (3)

where b = 1, a = 0, and $f(x) = x^{233} + x^{74} + 1$. The Koblitz curve is attractive because of its computational advantages. These advantages lie in using the Frobenius endomorphism (φ), under which the point P(x, y) is mapped such that:

$\varphi(x, y) = (x^2, y^2)$ (4)

Clearly, the Frobenius endomorphism is very cheap: two or three squaring operations are required, depending on the projective coordinates. Using the Frobenius endomorphism instead of point doubling is not straightforward, however, because it requires a manipulation of the value of k. The scalar is given in binary form, $k = \sum_{i=0}^{l-1} k_i 2^i$, where l is the bit length of k. Then, to apply the fast Frobenius endomorphism, k must be converted to an expansion in powers of the endomorphism, $k = \sum_{i} u_i \varphi^i$.
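The claim that the Frobenius map sends curve points to curve points using only squarings can be checked on a toy example. The sketch below is an assumption-laden illustration, not the K-233 setting: it uses the small field $GF(2^5)$ with $f(x) = x^5 + x^2 + 1$ and the same curve shape $y^2 + xy = x^3 + 1$, enumerates the affine points, and applies $(x, y) \mapsto (x^2, y^2)$ to each.

```python
M = 5                 # toy field GF(2^5), chosen only to keep enumeration small
F = 0b100101          # f(x) = x^5 + x^2 + 1


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in GF(2^M) with reduction at every step."""
    r = 0
    for i in range(M - 1, -1, -1):
        r <<= 1
        if r >> M:            # degree reached M: reduce by f(x)
            r ^= F
        if (b >> i) & 1:
            r ^= a
    return r


def on_curve(x: int, y: int) -> bool:
    """Check y^2 + xy = x^3 + 1 over GF(2^M)."""
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_mul(x, x), x) ^ 1
    return lhs == rhs


def frobenius(p):
    """Frobenius map (x, y) -> (x^2, y^2): two field squarings."""
    x, y = p
    return gf_mul(x, x), gf_mul(y, y)


points = [(x, y) for x in range(1 << M) for y in range(1 << M) if on_curve(x, y)]

# Every image of a curve point is again a curve point.
print(all(on_curve(*frobenius(p)) for p in points))  # True
```

The check succeeds because the curve coefficients lie in GF(2) and are therefore unchanged by squaring.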

Interleaved Multiplier
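A simple model of multiplication over the finite field is the interleaved multiplier. This multiplier is based on the shift-and-add algorithm, with the reduction interleaved between the partial-product additions, and the product of the two polynomials c(x) and d(x) is $c(x)\,d(x) \bmod f(x)$.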
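A bit-serial software model of the shift-and-add principle is sketched below. It is a minimal illustration that assumes the operands are already reduced (degree less than m) and are stored as integers; it does not model the registers or data path of the hardware design, and the function name is hypothetical.

```python
def interleaved_mul(c: int, d: int, f: int, m: int) -> int:
    """Compute c(x) * d(x) mod f(x) over GF(2^m), scanning d(x) from the top bit.

    Each iteration shifts the accumulator, reduces it if its degree reached m,
    and conditionally adds c(x), so the intermediate value never exceeds m bits.
    """
    e = 0
    for i in range(m - 1, -1, -1):
        e <<= 1                  # e(x) <- x * e(x)
        if (e >> m) & 1:         # degree m reached: subtract (XOR) f(x)
            e ^= f
        if (d >> i) & 1:         # add c(x) when coefficient d_i is 1
            e ^= c
    return e


M233 = 233
F233 = (1 << 233) | (1 << 74) | 1     # x^233 + x^74 + 1

# x * x^232 = x^233 = x^74 + 1 (mod f)
print(interleaved_mul(0b10, 1 << 232, F233, M233) == ((1 << 74) | 1))  # True
```

Interleaving the reduction with the additions keeps the accumulator at m bits, which is the property that makes this scheme attractive for a register-based data path.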

Montgomery Multiplier
The Montgomery multiplier is a sequential multiplier model, and the product of the two polynomials c(x) and d(x) is defined in Eq. (6):

$e(x) = c(x)\,d(x)\,M(x)^{-1} \bmod f(x)$ (6)

where M(x) is a constant element in the field and gcd(M(x), f(x)) = 1; one can find two polynomials $M(x)^{-1}$ and $f(x)^{-1}$ so that $M(x)\,M(x)^{-1} + f(x)\,f(x)^{-1} = 1$, where $M(x)^{-1}$ is the inverse of M(x) modulo f(x). The two polynomials can be computed with the Extended Euclidean Algorithm (EEA). Therefore, the Montgomery multiplication over $GF(2^m)$ can be computed using Algorithm 1 and the data path shown in Fig. 3.
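The following sketch shows one common bit-serial form of Montgomery multiplication over $GF(2^m)$. It assumes the constant $M(x) = x^m$, which is the usual choice but is not stated in the text, and it is not claimed to reproduce Algorithm 1. The helper `poly_mulmod` is a plain reference multiplier used only to check the result.

```python
def poly_mulmod(c: int, d: int, f: int) -> int:
    """Reference product c(x) * d(x) mod f(x) (shift-and-add with reduction)."""
    m = f.bit_length() - 1
    e = 0
    for i in range(m - 1, -1, -1):
        e <<= 1
        if (e >> m) & 1:
            e ^= f
        if (d >> i) & 1:
            e ^= c
    return e


def montgomery_mul(c: int, d: int, f: int) -> int:
    """Return c(x) * d(x) * x^(-m) mod f(x), assuming M(x) = x^m.

    Requires deg c, deg d < m and f(x) with constant term 1.
    """
    m = f.bit_length() - 1
    e = 0
    for i in range(m):
        if (c >> i) & 1:     # add d(x) when coefficient c_i is 1
            e ^= d
        if e & 1:            # make e(x) divisible by x by adding f(x)
            e ^= f
        e >>= 1              # exact division by x
    return e


F233 = (1 << 233) | (1 << 74) | 1
c, d = 0x1234567, 0x89ABCDEF                 # arbitrary sample operands
x_m = poly_mulmod(1 << 232, 0b10, F233)      # x^233 mod f(x)

# Multiplying the Montgomery product by x^m recovers the ordinary product.
print(poly_mulmod(montgomery_mul(c, d, F233), x_m, F233) == poly_mulmod(c, d, F233))
```

The division by x in each step is a simple right shift, so no trial subtraction of the modulus is needed; the price is the extra factor $x^{-m}$ in the result.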

Classic squaring
In the classic squaring of a polynomial E(x), inserting a zero between consecutive bits of the bit vector is all that is required to obtain $E(x)^2$ before reduction. Another method for squaring a polynomial is to apply the classic multiplication $E(x)^2 = E(x) \cdot E(x) \bmod f(x)$.
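The zero-insertion rule follows from the fact that cross terms vanish over GF(2), so squaring moves the coefficient of $x^i$ to $x^{2i}$. A minimal sketch, with illustrative function names and a toy modulus $x^3 + x + 1$ assumed for the example:

```python
def spread_bits(e: int) -> int:
    """Insert a zero between consecutive bits: bit i of e moves to bit 2*i."""
    out = 0
    i = 0
    while e:
        if e & 1:
            out |= 1 << (2 * i)
        e >>= 1
        i += 1
    return out


def poly_mod(a: int, f: int) -> int:
    """Remainder of a(x) divided by f(x) over GF(2)."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


def classic_square(e: int, f: int) -> int:
    """E(x)^2 mod f(x): zero insertion followed by reduction."""
    return poly_mod(spread_bits(e), f)


print(bin(spread_bits(0b1011)))            # 0b1000101
print(bin(classic_square(0b110, 0b1011)))  # 0b10, since (x^2 + x)^2 = x mod x^3 + x + 1
```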

Koblitz Point addition
This refers to adding two points P(x1, y1) and Q(x2, y2) on the Koblitz curve ($E_0: y^2 + xy = x^3 + 1$). Three arithmetic units are used for point addition: a multiplier, a division unit, and a squaring unit. The inversion component is not necessary for point operations when using the Koblitz curve. From a computational aspect, the third point R(x3, y3) can be calculated using Eq. (8) and Eq. (9), where $\lambda = (y_1 + y_2)/(x_1 + x_2)$:

$x_3 = \lambda^2 + \lambda + x_2 + x_1 + a$ (8)
$y_3 = \lambda(x_1 + x_3) + x_3 + y_1$ (9)

Koblitz point addition consists of two squaring components, one interleaved multiplier, a reduction component, and binary division.
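The addition of two distinct points can be sketched with the standard affine formulas for binary curves, which are assumed here to correspond to Eq. (8) and Eq. (9). To keep the example small, it uses the toy field $GF(2^5)$ with $f(x) = x^5 + x^2 + 1$ instead of the 233-bit field, and the division is modelled as multiplication by an inverse; all names are illustrative.

```python
M = 5                 # toy field GF(2^5), an assumption for illustration
F = 0b100101          # f(x) = x^5 + x^2 + 1
A = 0                 # curve coefficient a of y^2 + xy = x^3 + a*x^2 + 1


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in GF(2^M)."""
    r = 0
    for i in range(M - 1, -1, -1):
        r <<= 1
        if r >> M:
            r ^= F
        if (b >> i) & 1:
            r ^= a
    return r


def gf_inv(a: int) -> int:
    """Inverse as a^(2^M - 2); adequate for a toy field only."""
    r = 1
    for _ in range((1 << M) - 2):
        r = gf_mul(r, a)
    return r


def on_curve(x: int, y: int) -> bool:
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(gf_mul(x, x), x) ^ 1


def point_add(p, q):
    """Add two affine points with different x coordinates."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("doubling and P + (-P) are not handled here")
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))        # slope: the division step
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A     # one squaring plus additions
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1           # one multiplication plus additions
    return x3, y3


print(point_add((0, 1), (1, 0)))  # (1, 1)
```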

Koblitz point multiplication
All the points on the elliptic curve, including the point at infinity, form a finite commutative group under point addition and point doubling. If there is a generator point on the curve called P, and there is a positive number k, then Q can be calculated as follows:

$Q = kP = P + P + \dots + P$ (k times) (10)

This equation is called scalar multiplication or point multiplication. Point multiplication is computed by repeatedly adding and doubling points, which depends entirely on the components of the finite field arithmetic, such as polynomial multiplication, addition, inversion, and division.
The inversion module consumes more resources than the other modules; therefore, most designs use projective coordinates for the representation of the point. Several coordinate systems exist, such as affine, Lopez-Dahab (LD), and Jacobian. The work in this paper is based on LD projective coordinates, as shown in Algorithm 3. With this type of coordinates, no inversion is used during the operations over the finite field, which speeds up the point operations and reduces resource consumption, such as power and area, as well as latency.
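The repeated adding and doubling can be sketched as a left-to-right double-and-add loop. The example below is a simplified model under stated assumptions: it uses affine coordinates (so it needs a field inversion, unlike the LD-based design), the toy field $GF(2^5)$ with $f(x) = x^5 + x^2 + 1$, and the curve $y^2 + xy = x^3 + 1$; `None` stands for the point at infinity, and all names are illustrative.

```python
M = 5                 # toy field GF(2^5), an assumption for illustration
F = 0b100101          # f(x) = x^5 + x^2 + 1
A = 0                 # curve y^2 + xy = x^3 + 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^M)."""
    r = 0
    for i in range(M - 1, -1, -1):
        r <<= 1
        if r >> M:
            r ^= F
        if (b >> i) & 1:
            r ^= a
    return r


def gf_inv(a):
    """Inverse as a^(2^M - 2); adequate for a toy field only."""
    r = 1
    for _ in range((1 << M) - 2):
        r = gf_mul(r, a)
    return r


def add(p, q):
    """Affine group law; None is the point at infinity."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        if y1 != y2 or x1 == 0:
            return None                        # P + (-P), or doubling a point of order 2
        lam = x1 ^ gf_mul(y1, gf_inv(x1))      # doubling slope x1 + y1/x1
        x3 = gf_mul(lam, lam) ^ lam ^ A
        y3 = gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
        return x3, y3
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))     # addition slope
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return x3, y3


def scalar_mul(k, p):
    """Q = kP by scanning the bits of k from the most significant one."""
    q = None
    for bit in bin(k)[2:]:
        q = add(q, q)          # always double
        if bit == "1":
            q = add(q, p)      # add P when the bit is set
    return q


P = (1, 0)
print(scalar_mul(2, P), scalar_mul(3, P), scalar_mul(4, P))  # (0, 1) (1, 1) None
```

For a t-bit scalar this loop needs t doublings and at most t additions, instead of the k - 1 additions required by adding P to itself k times.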

Scalar Multiplication
Point multiplication is based on the Koblitz point addition, which consists of two squaring components, the interleaved multiplier, a reduction component, and the binary division algorithm. The proposed system is defined over different scenarios to build the proposed elliptic curve point multiplication booster with internal registers on different platforms (Virtex-4, Virtex-5, Virtex-6, and Virtex-7), as shown below:
1. Building ECP with interleaved multiplier and classic squaring on different platforms.
2. Building ECP with Montgomery multiplier and classic squaring on different platforms.
Fig. 5 illustrates the data path of the Koblitz point addition over the 233-bit field and the proposed design using internal registers. These registers are used to store data after each operation.

RESULTS
The proposed design of scalar multiplication is based on the point addition component. So, to obtain kP, point P is added k times. A strategy for architectural timing enhancement is to add intermediate layers of registers to the critical path. This technique is used in pipelined designs when the latency caused by a few additional clock cycles does not affect the specifications of the design. The throughput of the circuit is obtained using Eq. (11).
To compare with previous work, PI is used instead of the throughput indicator, as PI is fairer to use for comparison among different platforms. Fig. 8 shows the RTL schematic of scalar multiplication using the interleaved multiplier over the finite field $GF(2^{233})$. The structure consists of three components: a Koblitz point addition and two classic squaring units. The point addition in this scenario is composed of three main components, namely binary division, the interleaved multiplier, and classic squaring, along with some other components such as XOR gates for bitwise addition. The RTL schematic of point multiplication on Virtex-6 is shown in Fig. 12. K233_point_multiplication-1 represents the top component of the scalar multiplication.

Simulation of ECP for scenario 1
The numbers of slice registers, slice LUTs, and FFs on this platform are 3077, 4064, and 2837, respectively, as shown in Fig. 13. As shown in Table 1, the proposed design on Virtex-4 utilizes 3193 slices, 340 LUTs, and 3772 flip-flops and requires 21.805 μs for point multiplication. Thus, the PI is approximately 0.38751206. On Virtex-5, the results show that 3768 slices, 4572 LUTs, and 5335 flip-flops are required. The number of clock cycles also increases, leading to a maximum frequency of 115.9 MHz and an estimated PI of 0.308518945. By changing the platform to Virtex-6, the maximum frequency increases to 190 MHz. Thus, the number of clock cycles is approximately 4142, and the PI is 0.217610202. On Virtex-7, the results show that 3780 slices, 3646 LUTs, and 2682 flip-flops are used, while the maximum frequency is 221.6 MHz. The proposed design on Virtex-7 thus works at a high frequency of 221.6 MHz with a PI of 0.2599156, which is higher than the 0.217610202 obtained on Virtex-6. However, the design on Virtex-6 is appropriate for limited-area applications. The highest Performance Index is obtained on Virtex-4, with a value of 0.38751206, a low frequency of 86.6 MHz, and a low number of clock cycles (1884) compared to other platforms. From the above, it can be seen that this proposal is appropriate for low-frequency applications.
Fig. 16 shows the performance index for scalar multiplication among different FPGA technologies using an interleaved multiplier. The results show that ECP on Virtex-6 achieves the best (lowest) performance index, an estimated 0.217610202, when compared to the same design applied on the other Xilinx platforms.
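The reported cycle counts, frequencies, and execution times appear to be linked by the usual relation, execution time = clock cycles / maximum frequency. This relation is an assumption inferred from the reported numbers rather than a formula stated in the text, and the sketch below only checks it against two of the reported operating points. For Virtex-4 it uses the 86.4 MHz figure quoted in the abstract and conclusions.

```python
def exec_time_us(clock_cycles: int, fmax_mhz: float) -> float:
    """Execution time in microseconds, assuming time = cycles / frequency.

    With the frequency in MHz, the quotient is directly in microseconds.
    """
    return clock_cycles / fmax_mhz


# Reported operating points (cycles, MHz); the relation itself is assumed.
virtex4 = exec_time_us(1884, 86.4)
virtex6 = exec_time_us(4142, 190.0)

print(f"Virtex-4: {virtex4:.3f} us, Virtex-6: {virtex6:.3f} us")
```

Both values come out close to the reported 21.805 μs, which suggests that the higher clock frequency on Virtex-6 is offset by a larger number of clock cycles.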

Simulation of ECP for scenario 2
In this approach, the Montgomery multiplier with internal registers is applied in the Koblitz point addition instead of an interleaved multiplier, and the proposed design is implemented on different FPGA devices. Fig. 17 shows the time required for point multiplication, which is approximately 21.625 μs. The maximum operating frequency is illustrated in Fig. 18.
As shown in Table 2 and Fig. 19, the second proposed design, in which the Montgomery multiplier is the main component of ECP, was implemented on different FPGA platforms. On Virtex-4, the design utilizes over 3340 slices, 3772 LUTs, and 5575 flip-flops and requires 21.805 μs for scalar multiplication. Thus, the PI approaches 0.276640035. On Virtex-5, the results show that 3768 slices, 4573 LUTs, and 3005 flip-flops are used. The number of clock cycles also increases; the maximum frequency is 115.9 MHz, and the estimated PI is 0.24739953. By changing the platform to Virtex-6, the maximum frequency increases to 190 MHz. Thus, the number of clock cycles is approximately 4142, and the PI is 0.222815076. On Virtex-7, the results show that with 3077 slices, 4065 LUTs, and 4305 flip-flops, the maximum frequency is 196 MHz. The proposed design on Virtex-7 thus works at a high frequency of 196 MHz with a PI of 0.249601835, which is higher than the 0.217610202 on Virtex-6. This makes it a better choice for high-frequency applications. However, the design implemented on Virtex-6 is appropriate for low-area applications. Because the higher Performance Index of 0.276640035 is obtained on Virtex-4 together with a low frequency of 86.6 MHz and a low number of clock cycles (1883) compared to other platforms, this design is more appropriate for low-frequency applications. The numbers of slice registers, LUTs, and FFs for this approach on Virtex-4, Virtex-5, Virtex-6, and Virtex-7 are shown in Fig. 20. The proposed design on Virtex-6 attains the smallest performance index, estimated at 0.222815076, compared to the same design on other platforms. Fig. 21 shows the comparison of PI for the two approaches over different FPGAs. The total number of slices utilized in this approach is 12687, 11346, 10219, and 11447 on Virtex-4, Virtex-5, Virtex-6, and Virtex-7, respectively.
It can be seen that the proposed design on Virtex-6 consumes fewer slices (10219) than the same design on other platforms.

To achieve a fair comparison between the results obtained in this paper and those of the most closely related work by (Li and Li, 2016), which was implemented on Virtex-4, design (1) of both approaches in this work is chosen. The authors of the previous work chose k=6, while in the proposed design, k=52. This explains the difference between the time consumed in their work, 9.9 μs, and the time consumed in design (1), which was approximately 21.805 μs for both multipliers, as shown in Tables 1 and 2. The PI of the previous work is 0.3297294, while the PI of design (1) is 0.38751206 using the first approach and 0.276640035 using the second approach. So, their design is better than design (1) in Table 1, which uses an interleaved multiplier, but design (1) in Table 2 provides a better PI than that obtained in the previous work.

CONCLUSIONS
This paper presents the implementation of scalar multiplication based on the Lopez-Dahab algorithm. This algorithm uses point addition and squaring units as the cornerstone of point multiplication. The proposed design relies on the Koblitz curve over the binary field $GF(2^{233})$. The performance index equation was used as an analysis tool to compare the proposed design among different Xilinx platforms using two different types of multipliers, either Montgomery or interleaved. It is shown that the proposed design on Virtex-6 with the interleaved multiplier outperforms the other designs of the two approaches, with a performance index of approximately 0.217610202 and a low total number of slices, 9978. However, in general, ECP with the Montgomery multiplier achieves a good performance index across all the different technologies compared to ECP with the interleaved multiplier. The proposed design implemented on Virtex-7 in the first approach is appropriate for high-frequency applications, since its maximum operational frequency is approximately 221.6 MHz. In contrast, the proposed design on Virtex-4 is more suitable for low-frequency applications, since its maximum frequency is approximately 86.4 MHz. Design (1) in Table 2, which uses the Montgomery multiplier, provides a better performance index than the previous design of Li and Li in 2016.