AES Microcode Implementation in IXP 2400 and a Study of Reconfigurable Crypto Unit

Piyush Ranjan Satapathy
CS 203B Class Project Presentation

Road Map
• AES Algorithm Overview
• IXP 2400 Platform: A Quick Look
• Microcode: Overview
• Implementation of AES
• Experimental Results
• Reconfigurable Crypto Unit of Intel IXP 2850

Algorithm Overview
• Designed by Daemen and Rijmen for NIST
• Originally called Rijndael
• Symmetric-key block substitution cipher
• Replacement for DES
• Successful field testing since inception
• Three bit modes (128, 192, 256)
• State defined as a 4 x 4 array of 16 bytes
• Key size is either 16, 24, or 32 bytes
• Each byte is treated as a polynomial in the Galois field GF(2^8)

Bit Mode   Key Length (Nk words)   State Size (Nb words)   Number of Rounds (Nr)
128        4                       4                       10
192        6                       4                       12
256        8                       4                       14
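
The rows of the table follow the Rijndael relation Nr = max(Nk, Nb) + 6. A minimal Python sketch of that arithmetic; the function name `aes_params` is illustrative and does not come from the slides:

```python
NB_WORDS = 4  # AES fixes the state at 4 words (16 bytes)

def aes_params(key_bits):
    """Return (Nk, Nb, Nr) for a key length given in bits."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key length must be 128, 192, or 256 bits")
    nk = key_bits // 32           # key length in 32-bit words
    nr = max(nk, NB_WORDS) + 6    # number of rounds
    return nk, NB_WORDS, nr

for bits in (128, 192, 256):
    print(bits, aes_params(bits))
```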

Stages of AES Algorithm
[Figure: detailed view of round n. Result from round n-1 -> ByteSub -> ShiftRow -> MixColumn -> AddRoundKey (round key Kn) -> pass to round n+1]
• Each round performs the following operations:
  – Non-linear layer: no linear relationship between the input and output of a round
  – Linear mixing layer: guarantees high diffusion over multiple rounds, with very small correlation between the bytes of the round input and the bytes of the output
  – Key addition layer: bytes of the input are simply XORed with the round key

1. SubBytes Function
• Multiplicative inverse in GF(2^8) followed by an affine transformation
  – Direct implementation is complex
• Easily performed by a 16 x 16 LUT ROM
  – Simple byte substitution
  – Combinational logic
• Each byte at the input of a round undergoes a non-linear byte substitution according to the substitution ("S") box
[Figure: S-box transform]
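
To make the contrast between direct computation and table lookup concrete, the sketch below builds the 256-entry S-box from the field inverse and affine transform, then applies it as a plain lookup. This is a Python illustration, not the microcode from the project; all function names are assumptions.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:       # reduce when the product overflows 8 bits
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):    # a^254 is the inverse because a^255 = 1
        result = gf_mul(result, a)
    return result

def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF

def build_sbox():
    """Direct (slow) construction: field inverse, then affine transform."""
    sbox = []
    for x in range(256):
        b = gf_inv(x)
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox

SBOX = build_sbox()

def sub_bytes(state):
    """Table-driven SubBytes: one lookup per state byte."""
    return [SBOX[b] for b in state]

print(hex(SBOX[0x53]))  # 0xed, the SubBytes example value in FIPS-197
```

Once the table exists, each substitution costs a single indexed read, which is why a LUT is preferred over evaluating the field arithmetic per byte.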

2. ShiftRow
[Table: row shift offsets per block length]
• Depending on the block length, each "row" of the block is cyclically shifted according to the above table
• Shifting is done only on the bottom three rows of the State
• Left rotate for encryption
• Right rotate for decryption
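
A short Python sketch of the rotation for the AES case (4 x 4 state), where row r is rotated by r positions. The state is assumed to be a list of four rows; the function name is illustrative.

```python
def shift_rows(state, decrypt=False):
    """Rotate row r of a 4x4 state by r bytes: left to encrypt, right to decrypt."""
    out = []
    for r, row in enumerate(state):
        k = (-r if decrypt else r) % 4   # row 0 is never shifted
        out.append(row[k:] + row[:k])
    return out

state = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
shifted = shift_rows(state)
print(shifted)
print(shift_rows(shifted, decrypt=True) == state)
```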

3. MixColumns Function
• Each column is multiplied by a fixed polynomial c(x) = '03'·x^3 + '01'·x^2 + '01'·x + '02'
• Matrix multiplication in GF(2^8)
  – This corresponds to the matrix multiplication b(x) = c(x) ⊗ a(x)
  [Figure: matrix form of b(x) = c(x) ⊗ a(x)]
• MixColumns functionality resides primarily in the controller and instruction memory
  – A series of conditional XOR and left shift operations
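
The "conditional XOR and left shift" formulation can be sketched as follows. Multiplying by '02' is a left shift with a conditional reduction, and '03' is that result XORed with the original byte. This is a Python illustration with assumed function names, not the project's microcode.

```python
def xtime(b):
    """Multiply a byte by '02' in GF(2^8): left shift, then conditional XOR."""
    b <<= 1
    if b & 0x100:       # high bit shifted out: reduce by the AES polynomial
        b ^= 0x11B      # clears bit 8 and applies 0x1B
    return b

def mix_column(col):
    """Apply the circulant matrix with first row (02, 03, 01, 01) to one column."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]

print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```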

4. Key Expansion and Addition
• Key expansion is performed before both the encrypt and decrypt process
  – Byte values from the Key are read and manipulated into the RoundKey
  – A series of SubBytes and XOR operations with RCON ROM values and the Key
• Key addition performs an XOR operation between the State and the RoundKey
  – This is the only function that needs no separate inverse (XOR is its own inverse)
  – Each word is simply XORed with the expanded round key
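
A compact Python sketch of both steps, following the FIPS-197 key schedule. The S-box is generated inline so the block is self-contained; names such as `expand_key` and `add_round_key` are illustrative, and words are represented as lists of 4 bytes.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def _sbox_entry(x):
    inv = 0
    if x:
        inv = 1
        for _ in range(254):            # x^254 is the inverse of x
            inv = gf_mul(inv, x)
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63

SBOX = [_sbox_entry(x) for x in range(256)]

def expand_key(key):
    """Expand a 16-, 24-, or 32-byte key into 4 * (Nr + 1) round-key words."""
    nk = len(key) // 4
    nr = nk + 6
    words = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        temp = list(words[i - 1])
        if i % nk == 0:
            temp = temp[1:] + temp[:1]          # RotWord
            temp = [SBOX[b] for b in temp]      # SubWord
            temp[0] ^= rcon                     # XOR with round constant
            rcon = gf_mul(rcon, 2)
        elif nk > 6 and i % nk == 4:
            temp = [SBOX[b] for b in temp]      # extra SubWord for 256-bit keys
        words.append([w ^ t for w, t in zip(words[i - nk], temp)])
    return words

def add_round_key(state_words, round_key_words):
    """XOR each state word with the matching round-key word."""
    return [[s ^ k for s, k in zip(sw, kw)]
            for sw, kw in zip(state_words, round_key_words)]

key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")  # FIPS-197 example key
w = expand_key(key)
print(bytes(w[4]).hex(), bytes(w[43]).hex())
```

Applying `add_round_key` twice with the same round key returns the original state, which is why decryption reuses the same operation.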

IXP 2400 Platform: A Quick Look

Name      Size (bytes)   Transfer size (bytes)   Reference latency (cycles)
GPR/ME    256*4          4                       1
TR/ME     512*4          4                       1
NNR/ME    128*4          4                       1
LM/ME     640*4          4                       3
Scratch   16 K           4                       60
SRAM      64 M           4                       90
DRAM      1 G            16                      120

• Achieves high processing performance
• Programming flexibility
• Cheaper than ASIC
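
The latency column suggests why table placement matters for a lookup-heavy cipher. The sketch below is a back-of-the-envelope Python estimate only. It assumes one memory reference per S-box lookup, 16 lookups per round, 10 rounds, and no overlap between references or latency hiding by other threads. The slides do not state where the tables were actually placed.

```python
# Reference latencies in cycles, taken from the table above.
LATENCY_CYCLES = {"LM": 3, "Scratch": 60, "SRAM": 90, "DRAM": 120}

def lookup_cycles(memory, lookups_per_round=16, rounds=10):
    """Serial latency of all table lookups for one block (hypothetical model)."""
    return LATENCY_CYCLES[memory] * lookups_per_round * rounds

for name in LATENCY_CYCLES:
    print(name, lookup_cycles(name))
```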

Microcode Overview
• alu[dest1, a, +, b]: ALU addition of a and b, storing the result in dest1
• alu[dest2, dest1, -, c]: ALU subtraction
• Move(reg1, reg2): move from reg1 to reg2; both are GPRs
• Immed[reg, 0x0020]: immediate value assignment to a register
• local_csr_wr[ACTIVE_LM_ADDR_0, 0x0]: local memory indexing with index 0
• begin … endm: macro begin and end
• if … endif: conditional block
• xbuf_alloc($$state, 4, read): buffer allocation in DRAM transfer registers
• reg gen_register $sram_reg $$dram_reg: register declaration
• sig sram_sig dram_sig: signal declaration
• while … endw: while loop
• #for round[1, 2, 3, 4, 5, 6, 7, 8, 9, 10] … #endloop: for loop
• alu_shf[index, --, B, s0, >>24]: ALU shift of the B operand (s0 shifted right by 24 bits)
• scratch[read, $T, index, 0, 1], ctx_swap[sram_sig]: scratch read instruction
• ld_field_w_clr[t1, 1000, $T]: performs a write to the t1 register
• dram[write, $$out[0], dst_addr, 0, 2], sig_done[dram_sig]: DRAM write
• ctx_arb[dram_sig], ctx_arb[kill]: signaling
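
The shift and field-load instructions are the ones that pick a byte out of a 32-bit state word and place a looked-up byte back. The Python model below is only a sketch of that behavior as understood here, not Intel's definition. It assumes the leftmost digit of the byte-enable mask selects the most significant byte, and the function names are hypothetical stand-ins for the instructions.

```python
MASK32 = 0xFFFFFFFF

def alu_shf_right(src, shift):
    """Model of alu_shf[dest, --, B, src, >>shift]: B operand shifted right."""
    return (src & MASK32) >> shift

def ld_field_w_clr(byte_enables, src):
    """Model of ld_field_w_clr[dest, byte_enables, src].

    Copies the enabled bytes of src and clears the others.
    Assumption: byte_enables[0] selects byte 3 (most significant).
    """
    result = 0
    for i, flag in enumerate(byte_enables):
        if flag == "1":
            byte_index = 3 - i
            result |= src & (0xFF << (8 * byte_index))
    return result

s0 = 0xAABBCCDD
index = alu_shf_right(s0, 24)              # top byte, usable as a table index
print(hex(index))
print(hex(ld_field_w_clr("1000", 0x11223344)))
```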

Implementation Setup
• Environmental setup:
  – Intel IXP 4.1
  – 600-MHz ME configurations
  – 200-MHz SRAMs
  – 150-MHz RDRAMs
• Executed in multiple threads
• Executed in different microengines

Experimental Results (1)
[Figures: SRAM utilization; ME utilization (%)]

Experimental Results (2)
[Figure: throughput performance across threads in 1 ME]

Crypto Unit of IXP 2850

Intel IXP 2850 Encryption Data Flow

Crypto Unit Overview

Simple Encrypt Example

Simple Encrypt and Hash Example

3DES Core
• 2 cores per crypto unit
• Takes a 192-bit key
  – (56-bit + 8-bit parity) x 3 keys
• Operates on 8-byte blocks
• Result is written to ME transfer registers or a TBUF element
• Result can be passed to the SHA-1 unit for hashing
[Figure: security processing, pipelining, and interleaving using three wires and one core; multiple keys and IVs]
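
The 192-bit key layout can be illustrated in Python: three 8-byte DES keys, each byte carrying 7 key bits plus one odd-parity bit in its least significant position. The helper names below are illustrative and are not crypto-unit commands.

```python
def has_odd_parity(byte):
    """True when the byte contains an odd number of 1 bits."""
    return bin(byte).count("1") % 2 == 1

def set_odd_parity(byte):
    """Keep the 7 key bits and set the low bit so that parity is odd."""
    high7 = byte & 0xFE
    return high7 | (0 if has_odd_parity(high7) else 1)

def split_3des_key(key):
    """Split a 24-byte (192-bit) key into three 8-byte DES keys."""
    if len(key) != 24:
        raise ValueError("3DES key must be 24 bytes (192 bits)")
    return [key[i:i + 8] for i in range(0, 24, 8)]

key = bytes.fromhex("0123456789abcdef" * 3)
k1, k2, k3 = split_3des_key(key)
effective_bits = 7 * len(key)              # 56 key bits per DES key, 168 total
print(effective_bits, all(has_odd_parity(b) for b in key))
```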

AES Core
• All AES key sizes are supported
  – 128, 192, or 256 bits
• Both encryption and decryption are supported
• Operates on 16-byte blocks
[Figure: AES key scheduler]

SHA-1 Core
• 2 SHA-1 cores per crypto unit
• Operates on 64-byte blocks
• Data is loaded from Input RAM or crypto cores into the SHA-1 buffer
• Can operate on unmodified packet data or on the ciphered packet data
• Operates on a 512-bit block size and has a data buffer to accumulate the ciphered data
  – This gives flexibility to run SHA and AES, 3DES at different rates
[Figure: SHA-1 critical path analysis]
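
Because the core consumes 64-byte (512-bit) blocks, a message is padded to a multiple of 64 bytes before hashing. The Python sketch below shows the standard SHA-1 padding rule and the block split. It does not model the crypto unit's data buffer, and the function names are illustrative.

```python
BLOCK_BYTES = 64  # 512-bit SHA-1 block

def sha1_pad(message):
    """Append 0x80, zero bytes, then the 64-bit big-endian message bit length."""
    bit_len = len(message) * 8
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % BLOCK_BYTES)
    padded += bit_len.to_bytes(8, "big")
    return padded

def split_blocks(data):
    """Split padded data into 64-byte blocks."""
    return [data[i:i + BLOCK_BYTES] for i in range(0, len(data), BLOCK_BYTES)]

for msg in (b"abc", b"a" * 55, b"a" * 56):
    print(len(msg), len(split_blocks(sha1_pad(msg))))
```

A 56-byte message already needs a second block, since the 0x80 marker and the 8-byte length no longer fit in the first one.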

Some of the Crypto Commands
• crypto_write_ram($$orig_plain_text[0], DATA_RAM_ADDR, 8, ENCRYPT_UNIT, ram_sig): perform and wait for the write
• crypto_load_iv($$iv[0], 1, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, iv_sig): load IV data
• crypto_load_key($$key[0], 3, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, key_sig): load key
• crypto_cipher($$encrypt_data[0], DATA_RAM_ADDR, 8, CRYPTO_CIPHER_ENCRYPT, CRYPTO_CIPHER_NO_CBC, CRYPTO_CIPHER_3DES, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, cipher_sig)

Acknowledgement
• Yan Luo
• Chris Baron
• http://cnscenter.future.co.kr/resource/rsccenter/presentation/intel/spring2003/S03USCPTS92_OS.pdf (for some slides)
• Mel Tsai, UC Berkeley (for some slides)
• Thomas Sodon et al., EE, College of New Jersey
• Zhangxi Tan et al., Tsinghua University