Macro instruction synthesis for embedded processors
Pinhong Chen, Yunjian Jiang (william)
CS 252 project presentation

Motivation
- Start from a simple processor core
- Find new macro instructions to enhance performance and reduce code size
- Application-specific
- Using dedicated hardware to speed up
[Figure: block diagram. Labels: Application, Mem., Reg, Bus unit, Control, control, I/D, ALU, Macro Instr., Reg/Mem, Ext. Access]

RISC 8 Architecture
Why RISC 8?
- Simple
- Complete ISA, including load/store, arithmetic, logical, branch, multiplication, division, stack operation, subroutine call, interrupt operations, etc.
- Small
  - 8-bit ISA with 43 instructions
  - Addressable space: 64 K bytes
  - Verilog core size is 3.5K gates in 0.25 um
  - Clock speed of 300 MHz is reported (our result is about 200 MHz)
- Synthesizable RTL core
- Free assembler

Methodology
[Figure: tool flow chart. Boxes: Application (*.c), Front end, IR (exp. tree), Code Gen., RTL exp. tree, Asm. code, Assembler, mach. code, simulation, performance, Instr. Profiling, and three Instr. Syn. stages]

Different levels of expression trees
Example statement: sum += c & 5
[Figure: the statement drawn as three expression trees: SUIF IR (ASSIGN, ADD, AND, VAR, CON nodes), RTL IR after code gen (nodes typed acc, reg, byte, con08, addr16), and the tree reconstructed from mach. code]

Expression trees
SUIF IR
- Data type carried
- Inaccurate cost
- No profiling
- Simple: fewer tree nodes
- Machine independent
Register level
- Data type carried
- One-to-one mapping to macro instructions
- Profiling data can be back-annotated
- Machine dependent
Machine code
- Data type lost
- One-to-one mapping to machine instructions
- Profiling data accurate
- Large expression trees
- Machine dependent

Instruction Enumeration
- Traverse tree structure in post-order
- Normalize sub-tree orders
- Combine patterns from sub-trees
- Hash new instruction patterns
- Collect register usage and memory access for evaluation
- Annotate profiling information
[Figure: example expression tree with nodes ADD, AND, acc, reg, byte, con08]
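
The enumeration steps above can be sketched in a few lines of Python. This is only an illustration, not the C++ enumerator used in the project: the tuple tree encoding, the COMMUTATIVE set, the example trees, the profile weights, and the choice to label a cut-off sub-tree as "acc" are all assumptions. Register-usage and memory-access collection is left out.

```python
from collections import Counter
from itertools import product

# Assumption: these operators may have their operands reordered.
COMMUTATIVE = {"ADD", "AND", "IOR", "MUL"}

def normalize(node):
    """Sort operands of commutative operators so equal patterns compare equal."""
    if isinstance(node, str):
        return node
    op, *kids = node
    kids = [normalize(k) for k in kids]
    if op in COMMUTATIVE:
        kids.sort(key=repr)
    return (op, *kids)

def patterns(node, table, weight):
    """Post-order walk: return every pattern rooted at node and record each
    one in table (a hash table of patterns), weighted by the profile count."""
    if isinstance(node, str):
        return [node]
    op, *kids = node
    options = []
    for kid in kids:
        sub = patterns(kid, table, weight)      # children are visited first
        if not isinstance(kid, str):
            sub = sub + ["acc"]                 # assumption: a cut sub-tree arrives in acc
        options.append(sub)
    # Combine the children's patterns, dropping duplicates after normalization.
    found = list(dict.fromkeys(normalize((op, *combo)) for combo in product(*options)))
    for pat in found:
        table[pat] += weight
    return found

# Two hypothetical basic blocks that differ only in operand order.
table = Counter()
patterns(("ADD", ("AND", "byte", "con08"), "byte"), table, 10)
patterns(("ADD", "byte", ("AND", "con08", "byte")), table, 5)
for pat, count in table.most_common():
    print(count, pat)
```

Because both trees normalize to the same shapes, the table ends up with three candidate patterns, each carrying the combined weight of the two blocks.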

Machine Code Level Tree Reconstruction
- Build IR tree from machine codes
- Recover data dependencies from assembly code
- Clear definition by ISA
  - e.g. AND r2 ==> acc = acc & r2
- Limited to a basic block
- Eliminate intermediate nodes
[Figure: reconstructed expression tree with nodes ADD, AND, acc, reg, byte, storage, con08]
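
A minimal sketch of this reconstruction, assuming an accumulator machine in the spirit of the AND r2 example: the basic block is executed symbolically, and every storage location maps to the expression tree that currently defines it. The LD and ST mnemonics, the "#5" immediate notation, and the instruction list are hypothetical and are not taken from the RISC 8 ISA.

```python
# Assumption: these mnemonics all mean acc = acc <op> operand.
BINARY = {"ADD", "AND", "IOR"}

def rebuild(block):
    """Symbolically execute one basic block.

    Returns {location: expression tree} for every location written in the
    block. Locations that are read before being written stay as leaves.
    """
    value = {}

    def read(loc):
        return value.get(loc, loc)

    for op, arg in block:
        if op == "LD":                     # hypothetical load: acc = arg
            value["acc"] = read(arg)
        elif op == "ST":                   # hypothetical store: arg = acc
            value[arg] = read("acc")
        elif op in BINARY:                 # e.g. AND r2 ==> acc = acc & r2
            value["acc"] = (op, read("acc"), read(arg))
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return value

# sum += c & 5, written as hypothetical accumulator code.
block = [("LD", "c"), ("AND", "#5"), ("ADD", "sum"), ("ST", "sum")]
trees = rebuild(block)
print(trees["sum"])
```

The tree stored for sum contains no acc node: substituting the accumulator's symbolic value at each step is what eliminates the intermediate nodes.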

Table-Driven Assembly Development Tools
[Figure: tool flow chart. Boxes: New Instruction Candidates, New Instr. Select, Instr. Syn., Instr. Profile, Special Instr., Special Disassembler, Asm. code, Instr. Table, Assembler, Simulator, mach. code, performance]

Table-driven back-end tool automation
    @new_ins = (
      'mac' => {
        otree   => ['r0', 'n.ADD', 'r0', ['n.MUL', 'Rn', 'addr16']],
        pattern => 'Rn addr16',
        code    => ['00000011', '00000$Rn', '$addr16[0]', '$addr16[1]'],
        sim     => '$R[0]+=$R[$Rn]*$memory[$addr16]',
        cycles  => 13,
        decode  => '$Rn=$memory[$pc++] & 0x7; ' .
                   '$addr16[0]=$memory[$pc++]; ' .
                   '$addr16[1]=$memory[$pc++]; ' .
                   '$addr16=$addr16[0]|($addr16[1]<<8); ',
      },
    );
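
To show how a single table entry can drive both the assembler and the simulator, here is a small Python sketch modeled on the mac entry above. It is not the project's Perl tooling: the names TABLE, assemble, and step, the memory size, and the test values are assumptions. The encoding (opcode 00000011, a 3-bit register field, a little-endian 16-bit address) and the 13-cycle cost follow the entry.

```python
def mac_sim(R, memory, rn, addr16):
    """Semantics from the table entry: R[0] += R[Rn] * memory[addr16]."""
    R[0] += R[rn] * memory[addr16]

# One row per new instruction; both tools below read only this table.
TABLE = {
    "mac": {"opcode": 0b00000011, "cycles": 13, "sim": mac_sim},
}

def assemble(name, rn, addr16):
    """Emit opcode, register field, then the address low byte and high byte."""
    entry = TABLE[name]
    return [entry["opcode"], rn & 0x7, addr16 & 0xFF, (addr16 >> 8) & 0xFF]

def step(R, memory, pc):
    """Decode and execute the instruction at pc; return (next pc, cycles)."""
    opcode = memory[pc]
    name = next(n for n, e in TABLE.items() if e["opcode"] == opcode)
    rn = memory[pc + 1] & 0x7
    addr16 = memory[pc + 2] | (memory[pc + 3] << 8)
    TABLE[name]["sim"](R, memory, rn, addr16)
    return pc + 4, TABLE[name]["cycles"]

# Hypothetical test: r0 = 1, r2 = 7, memory[0x0100] = 6.
memory = [0] * 0x200
memory[0:4] = assemble("mac", 2, 0x0100)
memory[0x100] = 6
R = [1, 0, 7, 0, 0, 0, 0, 0]
pc, cycles = step(R, memory, 0)
print(R[0], pc, cycles)
```

Adding another macro instruction would then mean adding one more row to TABLE, which is the point of the table-driven approach.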

Op-Code Reuse
- Op-codes may not be fully used in a specific application
  - Remove unused instruction op-codes
  - Typical applications use far fewer than 256 op-codes
    Application   Opcodes
    FIR           28
    ADPCM         49
    GSM           32
    max7219       39
    LCD 4x20      40
    PRN-IO        30
- Cost of op-code reuse
  - Decoding logic
  - Less flexibility
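
The arithmetic behind op-code reuse is simple enough to sketch. The per-application counts come from the table above and the 256-entry op-code space from the slide; the helper free_opcodes and its sample input are hypothetical.

```python
OPCODE_SPACE = 256  # op-code encodings available, per the slide

# Op-codes used per application, as listed in the table above.
used = {
    "FIR": 28, "ADPCM": 49, "GSM": 32,
    "max7219": 39, "LCD 4x20": 40, "PRN-IO": 30,
}

# Encodings left over for application-specific macro instructions.
free = {app: OPCODE_SPACE - count for app, count in used.items()}

def free_opcodes(program_opcodes):
    """Return the op-code values that never occur in a program."""
    return sorted(set(range(OPCODE_SPACE)) - set(program_opcodes))

for app, count in free.items():
    print(f"{app}: {count} op-codes available for reuse")
```

Even ADPCM, the heaviest user in the table, leaves more than 200 encodings free, which is why reuse is attractive despite the extra decoding logic.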

Implementation
- Compiler front-end: SUIF
- Code generator: SPAM-olive
  - Retargeted to RISC 8
- RTL pattern enumeration: C++
- RISC 8 assembler: Perl
- RISC 8 simulator: Perl
- Machine level pattern enumeration: Perl
- Macro-driven instruction implementation automation: Perl

Benchmarks
Benchmark: adpcm, GSM-encoder, PRN-IO, LCD_4X20, max7219
    Instructions                                          #
    null: n.ASSIGN(word, n.AND(areg, const16))            40
    null: n.ASSIGN(word, n.ADD(areg, word))               40
    bool: n.BOOL(areg, const16)                           86
    bool: n.BOOL(n.AND(areg, const16), areg)              36
    areg: n.IOR(n.AND(areg, const16), word)               24
    ------------------------------------------------------------
    acc: n.AND(acc, const08)                              796
    acc: n.AND(n.ASR(acc, const08), reg)                  492
    acc: n.IOR(n.AND(acc, const08), reg)                  414
    acc: n.ASR(acc, const08)                              330
    acc: n.IOR(n.AND(n.ASR(acc, const08), reg)            621
    ------------------------------------------------------------
    acc: n.IOR(acc, const08)                              240
    null: n.ASSIGN(byte, n.IOR(acc, reg))                 96
    null: n.ASSIGN(byte, n.IOR(acc, const08))             96
    bool: n.BOOL(n.AND(areg, const16)                     60
    ------------------------------------------------------------
    bool: n.BOOL(acc, const08)                            99
    null: n.ASSIGN(byte, n.ADD(reg, one))                 30
    acc: n.IOR(acc, const08)                              140
    bool: n.BOOL(n.AND(acc, reg), zero)                   48

GSM encoder
Hardware/software tradeoff
- Software gain: execution speed, code size
- Hardware cost: functional unit, decoding logic, data path configuration

Conclusions
- RTL level pattern enumeration
  - Key to automating instruction identification, code generation, assembly and simulation
  - No need to change algorithm source code
- Hardware/software trade-off
  - Good estimation of performance gain and hardware cost at register-transfer level
- Op-code reuse