Superscalar Coprocessor for High-speed Curve-based Cryptography
K. Sakiyama, L. Batina, B. Preneel, I. Verbauwhede
Katholieke Universiteit Leuven / IBBT, Department Electrical Engineering - ESAT/COSIC
13/10/2006, Workshop on Cryptographic Hardware and Embedded Systems (CHES 2006)

Overview
- Introduction
- Curve-based Cryptography
- HW/SW Partitioning
- Superscalar Coprocessor
- Results
- Conclusions

Introduction: Motivation
- High-speed curve-based cryptography in HW/SW co-design
  - How much instruction-level parallelism can we obtain from coprocessor instructions?
- Performance improvement for different operation forms in the datapath
  - AB+C mod P vs. A(B+D)+C mod P, where A, B, C, D, P are polynomials
- Performance comparison of three different curve-based cryptosystems
  - Which is fastest: ECC, HECC, or ECC over a composite field?
- Programmability and scalability
  - Programmable in order to support different cryptosystems?
  - Scalable in field sizes?

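The two operation forms named on this slide can be sketched in software as follows. This is only an illustrative model, not the coprocessor's datapath: polynomials over GF(2) are stored as Python integers (bit i is the coefficient of x^i), the helper names are hypothetical, and the NIST reduction polynomial for GF(2^163) is assumed as an example of P (the slides allow an arbitrary irreducible polynomial).

```python
# Illustrative model only: GF(2)[x] polynomials as Python ints (bit i = coeff of x^i).
# P163 is the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1, assumed here
# as an example modulus; the helper names are hypothetical.
P163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def clmul(a: int, b: int) -> int:
    """Carry-less (XOR) product of two polynomials over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(a: int, p: int) -> int:
    """Remainder of a divided by p over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def ab_plus_c(a: int, b: int, c: int, p: int) -> int:
    """First operation form: A*B + C mod P."""
    return poly_mod(clmul(a, b) ^ c, p)


def a_bd_plus_c(a: int, b: int, d: int, c: int, p: int) -> int:
    """Second operation form: A*(B + D) + C mod P (addition is XOR)."""
    return poly_mod(clmul(a, b ^ d) ^ c, p)
```

Because addition in a binary field is XOR, A(B+D)+C equals AB+AD+C, so one instruction of the second form does the work of two chained AB+C instructions.
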
Introduction: Target Architecture
- Curve-based cryptography over binary fields
  - Hardware can be smaller and faster than for a prime field
- ECC over a binary field, e.g. GF(2^163)
- HECC of genus 2
  - Field length can be shorter by a factor of 2, e.g. GF(2^83)
- ECC over a composite field
  - Field length can be shorter by a factor of 2, e.g. GF((2^83)^2)
- The datapath can be shared
- Programmable coprocessor supporting three curve-based cryptosystems by defining coprocessor instruction(s)
- (Coprocessor) instruction-level parallelism by superscalar execution

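One way to see why the datapath can be shared with the composite-field case: a multiplication in GF((2^m)^2) breaks down into a few base-field AB+C operations. The sketch below assumes the extension is defined by y^2 + y + 1, which is irreducible over GF(2^m) for odd m such as 83; the slides do not state which extension polynomial was used. It is tested on the toy field GF(2^3) with P = x^3 + x + 1, and all helper names are hypothetical.

```python
# Hedged sketch: GF((2^m)^2) multiplication built from base-field A*B + C operations.
# Assumption: the extension is defined by y^2 = y + 1 (y^2 + y + 1 is irreducible
# over GF(2^m) when m is odd). Elements are pairs (u0, u1) meaning u0 + u1*y.


def clmul(a: int, b: int) -> int:
    """Carry-less (XOR) product of two polynomials over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(a: int, p: int) -> int:
    """Remainder of a divided by p over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def fmul_add(a: int, b: int, c: int, p: int) -> int:
    """Base-field operation A*B + C mod P."""
    return poly_mod(clmul(a, b) ^ c, p)


def ext_mul(u, v, p):
    """Multiply (u0 + u1*y) by (v0 + v1*y) using four base-field operations."""
    u0, u1 = u
    v0, v1 = v
    t = fmul_add(u1, v1, 0, p)    # u1*v1
    lo = fmul_add(u0, v0, t, p)   # u0*v0 + u1*v1
    hi = fmul_add(u0, v1, t, p)   # u0*v1 + u1*v1
    hi = fmul_add(u1, v0, hi, p)  # ... + u1*v0
    return (lo, hi)
```

In this toy field, `ext_mul((0, 1), (0, 1), 0b1011)` gives `(1, 1)`, which is y^2 = y + 1 as expected.
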
Curve-based Cryptography: HW/SW Partitioning (1)
- General hierarchy in a coprocessor for curve-based cryptography
[Figure: hierarchy of operations. Point/Divisor Multiplication (SW or HW controller); Point/Divisor Addition and Point/Divisor Doubling (SW or HW controller); Finite Field Addition, Finite Field Multiplication and Finite Field Inversion (HW datapath).]

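The top level of this hierarchy can be illustrated with a generic left-to-right double-and-add loop, which shows how point/divisor multiplication reduces to repeated doubling and addition. The slides do not say which scalar multiplication algorithm the controller runs, so this is an assumed textbook method; the group operations are passed in as callbacks and the function name is hypothetical.

```python
# Generic left-to-right binary (double-and-add) method. Assumed textbook algorithm,
# not necessarily the one used by the controller described in the slides.
def scalar_multiply(k, point, add, double, identity):
    """Compute k * point using caller-supplied group operations."""
    result = identity
    for bit in bin(k)[2:]:            # scan the scalar from the most significant bit
        result = double(result)       # point/divisor doubling
        if bit == "1":
            result = add(result, point)  # point/divisor addition
    return result
```

As a sanity check, plugging in the integers under addition (`add=lambda a, b: a + b`, `double=lambda a: 2 * a`, `identity=0`) turns `scalar_multiply(13, 5, ...)` into ordinary multiplication and returns 65.
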
Curve-based Cryptography: Proposed Hierarchy (1)
- Single instruction for all finite field operations
- Fixed-cycle execution enables efficient implementation
[Figure: conventional hierarchy (Point/Divisor Multiplication; Point/Divisor Addition and Doubling; Finite Field Addition, Multiplication and Inversion) compared with the single-instruction (datapath) hierarchy, in which the finite field level is one Finite Field Operation, e.g. AB+C mod P.]

Curve-based Cryptography: Modular Arithmetic Logic Unit (MALU)
- (a) Building block: regular XOR chains
- (b) Scalable in digit size (d) and field size (k) by interconnecting several building blocks
- We use MALU83 (n=83, d=12) as the building block
- 2 x MALU83 can be configured as 1 x MALU163

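The role of the digit size d can be illustrated with a behavioural model of digit-serial multiplication: each step consumes d bits of B, most significant digit first, so a field of n bits needs ceil(n/d) steps. This is a sketch under assumptions, not the MALU's XOR-chain hardware; the function names are hypothetical and the step count is idealized (it ignores any operand load or write-back overhead).

```python
# Behavioural sketch of digit-serial A*B + C mod P (most significant digit first).
# Assumes deg(B) < n, where n = deg(P). Not a model of the actual XOR-chain hardware.


def clmul(a: int, b: int) -> int:
    """Carry-less (XOR) product of two polynomials over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(a: int, p: int) -> int:
    """Remainder of a divided by p over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def cycles_for(n: int, d: int) -> int:
    """Idealized number of digit steps for an n-bit field with digit size d."""
    return -(-n // d)  # ceiling division


def malu_mul_add(a: int, b: int, c: int, p: int, d: int):
    """Return (A*B + C mod P, number of digit steps)."""
    n = p.bit_length() - 1
    steps = cycles_for(n, d)
    mask = (1 << d) - 1
    acc = 0
    for i in reversed(range(steps)):
        digit = (b >> (i * d)) & mask                      # next d bits of B
        acc = poly_mod((acc << d) ^ clmul(a, digit), p)    # shift, accumulate, reduce
    return poly_mod(acc ^ c, p), steps
```

Under this model, `cycles_for(83, 12)` is 7 and `cycles_for(163, 12)` is 14; a larger d trades more XOR logic per step for fewer steps.
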
HW/SW Partitioning: TYPE I, smallest implementation (baseline)
[Figure: block diagram. Main CPU, SRAM, Program ROM, memory-mapped I/O; 32-bit data on the data bus, 32-bit instructions on the instruction bus; coprocessor with DBC (data bus), IBC (instruction bus) and MALU83.]

HW/SW Partitioning: TYPE II = TYPE I + μ-code RAM
[Figure: block diagram. Main CPU, SRAM, Program ROM, memory-mapped I/O; 32-bit data on the data bus, 32-bit instructions on the instruction bus; coprocessor with DBC (data bus), IBC (instruction bus), μ-code RAM, FSM and MALU83.]

HW/SW Partitioning: TYPE III = TYPE I + coprocessor memory
[Figure: block diagram. Main CPU, SRAM, Program ROM, memory-mapped I/O; 32-bit data on the data bus, 32-bit instructions on the instruction bus; coprocessor with DBC (data bus), IBC (instruction bus), MALU83 and coprocessor memory.]

HW/SW Partitioning: TYPE IV = TYPE I + coprocessor memory and μ-code RAM
[Figure: block diagram. Main CPU, SRAM, Program ROM, memory-mapped I/O; 32-bit data on the data bus, 32-bit instructions on the instruction bus; coprocessor with DBC (data bus), IBC (instruction bus), μ-code RAM, FSM, MALU83 and coprocessor memory.]

HW/SW Partitioning: Co-design flow with GEZEL
[Figure: design flow]
- C/C++ codes for PKCs
- Partitioning of functions
- C/C++ codes & H/W behavior blocks w/ interface
- ARM (SW): C/C++ codes w/ physical memory map, cross compile, program codes
- Co-processor (HW): GEZEL FDL codes, synthesis, VHDL codes
- Cycle-true sim. (GEZEL)

HW/SW Partitioning: Result, vertical exploration of the system
- HECC performance for different HW/SW partitionings (performance: point/divisor multiplication)
[Figure: performance chart]

Superscalar Coprocessor: Proposed Hierarchy (2)
- Multiple Modular Arithmetic Logic Units (MALUs) in the coprocessor
[Figure: hierarchy with a single MALU compared with multiple MALUs. In both, Point/Divisor Multiplication sits above Point/Divisor Addition, Point/Divisor Doubling and Finite Field Inversion; with a single MALU these map to one Finite Field Operation (e.g. AB+C mod P), with multiple MALUs to several Finite Field Operations (e.g. AB+C mod P) side by side.]

Superscalar Coprocessor: Parallel Processing Architecture (TYPE IV-based)
[Figure: block diagram. Main CPU, SRAM, Program ROM, memory-mapped I/O; 32-bit data on the data bus, 32-bit instructions on the instruction bus; coprocessor with DBC (data bus), IBC (instruction bus), IQB with a Buffer Full signal, μ-code RAM, FSM, MALU83 and coprocessor memory.]

Superscalar Coprocessor: Horizontal exploration of the system
- Performance of ECC and HECC
[Figure: performance chart]

Results: Performance for ECC over GF(2^83)
- Fastest of the three
- x1.8 speed-up by 2-way superscaling (ILPDP=6) with A(B+D)+C
- Still more improvement is possible by adding MALUs
[Figure: performance chart with series AB+C and A(B+D)+C]

Results: Performance of HECC over GF(2^83)
- Faster than ECC over a composite field
- x2.7 speed-up by 4-way superscaling (ILPDP=5) with A(B+D)+C
- Diminishing improvement as the number of MALUs increases
[Figure: performance chart with series AB+C and A(B+D)+C]

Results: Performance for ECC over GF((2^83)^2)
- Slowest of the three
- x2.5 speed-up by 4-way superscaling (ILPDP=6) with A(B+D)+C
- Diminishing improvement as the number of MALUs increases
[Figure: performance chart with series AB+C and A(B+D)+C]

Results: Comparison of ECC/HECC implementations on FPGAs
[Table: comparison with the implementations in [11], [13], [14] and [29]]
[11] T. Wollinger, Ph.D. thesis, 2004.
[13] G. Orlando and C. Paar, CHES 2000.
[14] N. Gura et al., CHES 2002.
[29] Nazar A. Saqib et al., International Journal of Embedded Systems, 2005.

Conclusions
- Performance improvement / comparison
  - ECC was improved by a factor of 1.8 (2-way)
  - HECC (genus 2) was improved by a factor of 2.7 (4-way)
  - ECC over a composite field was improved by a factor of 2.5 (4-way)
  - A(B+D)+C offers better performance than AB+C
  - ECC is the fastest in this case study
- Programmability & flexibility
  - Supports three different curve-based cryptosystems over a binary field
  - Arbitrary irreducible polynomial
  - Field size up to 332 bits by using 4 x MALU83

Thank you!

Parallel issue of instructions: Case of using 4 MALUs
- IF/D: Instruction Fetch & Decode
- R_: Read operands (dependent on the type of operation)
- EX: Execution (dependent on MALU configuration, k & d)
- W_: Write (dependent on # of instructions issued in parallel)
[Figure: pipeline timing diagram]

Parallel issue of instructions: Out-of-order Execution
- Check RAW (Read After Write) dependencies for in-order/out-of-order execution

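A minimal sketch of the RAW check, applied to in-order issue: consecutive instructions are packed into one issue group until the group is as wide as the number of MALUs or the next instruction reads a register written earlier in the same group. The instruction format, register names and grouping policy are assumptions for illustration; the slides do not describe the actual issue logic, and out-of-order issue would need further checks not modelled here.

```python
# Hedged sketch: RAW-based grouping for in-order parallel issue.
# Instr, has_raw and issue_in_order are hypothetical names; the register names in
# the usage example are made up.
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction


def has_raw(earlier: Instr, later: Instr) -> bool:
    """True if `later` reads the register that `earlier` writes."""
    return earlier.dest in later.srcs


def issue_in_order(program, n_malu):
    """Split a program into issue groups of at most n_malu independent instructions."""
    groups = []
    current = []
    for ins in program:
        # Close the group when it is full or the next instruction depends on it.
        if len(current) == n_malu or any(has_raw(e, ins) for e in current):
            groups.append(current)
            current = []
        current.append(ins)
    if current:
        groups.append(current)
    return groups


program = [
    Instr("t1", ("a", "b", "c")),
    Instr("t2", ("a", "d", "c")),
    Instr("t3", ("t1", "t2", "c")),  # RAW on t1 and t2
    Instr("t4", ("a", "a", "b")),
]
```

With four MALUs this program issues in two groups, (t1, t2) and (t3, t4); with one MALU it takes four groups. An out-of-order scheme could move t4 into the first group, since it depends on neither t1 nor t2.
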