Macro instruction synthesis for embedded processors
Pinhong Chen, Yunjian Jiang (William)
CS 252 project presentation
Motivation
- Start from a simple processor core
- Find new macro instructions to enhance performance and reduce code size
- Application-specific
- Using dedicated hardware to speed up
[Figure: block diagram. Labels: Application, Mem., Reg, Bus unit, Control, I/D, ALU, Macro Instr., Reg/Mem Ext. Access]
RISC8 Architecture
Why RISC8?
- Simple
- Complete ISA, including load/store, arithmetic, logical, branch, multiplication, division, stack operation, subroutine call, interrupt operations, etc.
- Small
  - 8-bit ISA with 43 instructions
  - Addressable space: 64 KB
  - Verilog core size is 3.5K gates in 0.25 um
  - A clock speed of 300 MHz is reported (our result is about 200 MHz)
- Synthesizable RTL core
- Free assembler
Methodology
[Figure: tool flow. Labels: Application (*.c), Front end, IR (exp. tree), Code Gen., RTL exp. tree, Asm. code, Assembler, mach. code, simulation, Instr. Profiling, performance. Istr. Syn is attached at three points in the flow.]
Different Levels of expression trees
Example statement: sum += c & 5
[Figure: three expression trees for the statement. Panels: SUIF IR (ASSIGN, ADD, AND, VAR, CON nodes); RTL IR after code gen (ASSIGN, ADD, AND over byte, acc, reg, con08, addr16); Reconstructed from mach. code (MOV, ADD, AND over acc, reg, byte, con08, addr16)]
Expression trees
SUIF IR
- Data type carried
- Inaccurate cost
- No profiling
- Simple: fewer tree nodes
- Machine independent
Register level
- Data type carried
- One-to-one between macro instructions
- Profiling data can be back-annotated
- Machine dependent
Machine code
- Data type lost
- One-to-one between machine instructions
- Profiling data accurate
- Large expression trees
- Machine dependent
Instruction Enumeration
- Traverse tree structure in post-order
- Normalize sub-tree orders
- Combine patterns from sub-trees
- Hash new instruction patterns
- Collect register usage and memory access for evaluation
- Annotate profiling information
[Figure: example expression tree with ADD and AND nodes over acc, reg, byte and con08 operands]
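The enumeration steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C++ implementation: the tuple tree encoding, the `COMMUTATIVE` set, the `RESULT` placeholder (it assumes every operator result lands in the accumulator) and the `weight` argument standing in for a profiled execution count are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

COMMUTATIVE = {"ADD", "AND", "IOR"}  # assumed commutative operators
RESULT = "acc"                       # assumed class of every operator result


def render(op, operands):
    """Normalize operand order for commutative operators, then build a key."""
    if op in COMMUTATIVE:
        operands = sorted(operands)
    return f"{op}({', '.join(operands)})"


def enumerate_patterns(tree, table, weight=1):
    """Post-order walk; returns the patterns rooted at `tree`.

    A tree is a leaf string such as "acc", or a tuple (op, child, ...).
    Every pattern found is hashed into `table` and weighted by `weight`.
    """
    if isinstance(tree, str):
        return [tree]
    op, *children = tree
    options = []
    for child in children:
        sub = enumerate_patterns(child, table, weight)  # children first
        if isinstance(child, str):
            options.append(sub)
        else:
            # Either cut the sub-tree off at its result, or merge it in.
            options.append([RESULT] + sub)
    rooted = sorted({render(op, combo) for combo in product(*options)})
    for pattern in rooted:
        table[pattern] += weight  # annotate with profiling weight
    return rooted


table = Counter()
enumerate_patterns(("ADD", "acc", ("AND", "acc", "con08")), table, weight=10)
enumerate_patterns(("ADD", ("AND", "con08", "acc"), "acc"), table, weight=5)
for pattern, count in table.most_common():
    print(count, pattern)
```

Both input trees map to the same keys because operand order is normalized before hashing. Single-operator patterns are counted too; a real flow would filter those out as existing instructions.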
Machine Code Level Tree Reconstruction
- Build IR tree from machine codes
- Recover data dependencies from assembly code
- Clear definition by ISA
  - e.g. AND r2 ==> acc = acc & r2
- Limited to a basic block
- Eliminate intermediate nodes
[Figure: reconstructed tree with ADD and AND nodes over acc, reg, byte, storage and con08; after intermediate nodes are eliminated, the AND node operates directly on byte storage and con08]
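A minimal sketch of this reconstruction, assuming a hypothetical accumulator-style mini ISA (`LDA`, `STA`, `ADD`, `AND`, `IOR`) rather than the real RISC8 instruction set. It symbolically executes one basic block and keeps, for each location, the expression tree currently stored there, so temporaries disappear from the final trees.

```python
BINARY = {"ADD", "AND", "IOR"}  # hypothetical ops: acc = acc <op> operand


def rebuild_trees(block):
    """Symbolically execute one basic block.

    `block` is a list of (opcode, operand) pairs.
    Returns {stored location: expression tree}.
    """
    value = {}   # location -> tree currently held there
    stores = {}  # locations written by STA in this block

    def read(loc):
        # A location not yet written in this block is a leaf (live-in value).
        return value.get(loc, loc)

    for op, operand in block:
        if op == "LDA":
            value["acc"] = read(operand)
        elif op in BINARY:
            value["acc"] = (op, read("acc"), read(operand))
        elif op == "STA":
            value[operand] = read("acc")
            stores[operand] = value[operand]
        else:
            raise ValueError(f"unknown opcode {op}")
    return stores


# sum += c & 5 on the hypothetical accumulator machine
block = [("LDA", "c"), ("AND", "#5"), ("ADD", "sum"), ("STA", "sum")]
print(rebuild_trees(block))
```

Because a read of a location returns the tree last written to it, a store to a temporary followed by a later use is folded into the consumer's tree, which corresponds to eliminating intermediate nodes. The walk is only valid inside a single basic block, since it ignores control flow.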
Table-Driven Assembly Development Tools
[Figure: tool flow. Labels: New Instruction Candidates, New Instr. Select, Istr. Syn, Instr. Profile, Special. Instr., Special Disassembler, Asm. code, Instr. Table, Assembler, Simulator, mach. code, performance]
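Table-driven back-end tool automation
@new_ins=(
  'mac'=>{
    # expression tree of the new instruction: r0 = r0 + Rn * mem[addr16]
    otree=>['r0', 'n.ADD', 'r0', ['n.MUL', 'Rn', 'addr16']],
    # operand pattern in assembly source
    pattern=>'Rn addr16',
    # encoding: opcode byte, register byte, address low byte, address high byte
    code=>['00000011', '00000$Rn', '$addr16[0]', '$addr16[1]'],
    # simulator semantics
    sim=>'$R[0]+=$R[$Rn]*$memory[$addr16]',
    cycles=>13,
    # operand decoding from the instruction stream
    decode=>'$Rn=$memory[$pc++] & 0x7; $addr16[0]=$memory[$pc++]; $addr16[1]=$memory[$pc++]; $addr16=$addr16[0]|($addr16[1]<<8);'
  });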
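To show how one table entry can drive both the assembler and the simulator, here is a small Python sketch modeled on the `mac` entry. It is not the authors' Perl tooling: the function names, the state dictionary and the 8-bit wrap-around of `R[0]` are assumptions made for the example.

```python
def encode_rn_addr16(rn, addr16):
    """Operand bytes: register number, then address low byte, high byte."""
    return [rn & 0x7, addr16 & 0xFF, (addr16 >> 8) & 0xFF]


def decode_rn_addr16(memory, pc):
    """Read the operands starting at pc; return (operands, next pc)."""
    rn = memory[pc] & 0x7
    addr16 = memory[pc + 1] | (memory[pc + 2] << 8)
    return (rn, addr16), pc + 3


def sim_mac(state, rn, addr16):
    """R[0] += R[Rn] * memory[addr16]; the 8-bit wrap is an assumption."""
    regs, memory = state["R"], state["memory"]
    regs[0] = (regs[0] + regs[rn] * memory[addr16]) & 0xFF


# One entry per new instruction; assembler and simulator both read it.
INSTR_TABLE = {
    "mac": {"opcode": 0b00000011, "encode": encode_rn_addr16,
            "decode": decode_rn_addr16, "sim": sim_mac, "cycles": 13},
}
BY_OPCODE = {entry["opcode"]: entry for entry in INSTR_TABLE.values()}


def assemble(name, *operands):
    entry = INSTR_TABLE[name]
    return [entry["opcode"]] + entry["encode"](*operands)


def step(state):
    """Execute one instruction and charge its cycle count."""
    memory = state["memory"]
    entry = BY_OPCODE[memory[state["pc"]]]
    operands, state["pc"] = entry["decode"](memory, state["pc"] + 1)
    entry["sim"](state, *operands)
    state["cycles"] += entry["cycles"]


memory = [0] * 65536
memory[0:4] = assemble("mac", 2, 0x1234)
memory[0x1234] = 5
state = {"R": [1, 0, 3, 0, 0, 0, 0, 0], "memory": memory, "pc": 0, "cycles": 0}
step(state)
print(state["R"][0], state["pc"], state["cycles"])
```

Adding another instruction in this scheme means adding one more table entry; `assemble` and `step` stay unchanged.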
Op-Code Reuse
Op-codes may not be fully used in a specific application
- Remove unused instruction op-codes
- Typical applications use far fewer than 256 op-codes

Application   Opcodes
FIR           28
ADPCM         49
GSM           32
max7219       39
LCD4x20       40
PRN-IO        30

Cost of op-code reuse
- Decoding logic
- Less flexibility
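The reuse idea can be sketched as follows: find the base-ISA encodings an application never uses and hand them to the new macro instructions. The mnemonics and encodings below are made up for illustration and are not the RISC8 op-code map.

```python
def reassign(isa_encoding, program, new_instructions):
    """Map new macro instructions onto encodings the program never uses.

    isa_encoding: {mnemonic: opcode} for the base ISA (hypothetical values)
    program: iterable of mnemonics appearing in the application's code
    new_instructions: names of the macro instructions to place
    """
    used = set(program)
    free = [m for m in isa_encoding if m not in used]
    if len(new_instructions) > len(free):
        raise ValueError("not enough unused op-codes")
    # Each new instruction takes over the encoding of one unused instruction.
    return {new: isa_encoding[old] for new, old in zip(new_instructions, free)}


isa = {"lda": 0x01, "sta": 0x02, "add": 0x03, "mul": 0x04, "div": 0x05}
program = ["lda", "add", "sta", "add"]
print(reassign(isa, program, ["mac"]))
```

The cost named on the slide shows up here as well: the decoder now depends on the application, and a program that needs a reassigned instruction can no longer run on the specialized core.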
Implementation
- Compiler front-end: SUIF
- Code generator: SPAM-olive
  - Retargeted to RISC8
- RTL pattern enumeration: C++
- RISC8 assembler: PERL
- RISC8 simulator: PERL
- Machine level pattern enumeration: PERL
- Macro driven instruction implementation automation: PERL
Benchmarks
Benchmark: adpcm, GSM-encoder, PRN-IO, LCD_4X20, max7219

Instructions                                        #
null: n.ASSIGN(word, n.AND(areg, const16))          40
null: n.ASSIGN(word, n.ADD(areg, word))             40
bool: n.BOOL(areg, const16)                         86
bool: n.BOOL(n.AND(areg, const16), areg)            36
areg: n.IOR(n.AND(areg, const16), word)             24

acc: n.AND(acc, const08)                            796
acc: n.AND(n.ASR(acc, const08), reg)                492
acc: n.IOR(n.AND(acc, const08), reg)                414
acc: n.ASR(acc, const08)                            330
acc: n.IOR(n.AND(n.ASR(acc, const08), reg)          621

acc: n.IOR(acc, const08)                            240
null: n.ASSIGN(byte, n.IOR(acc, reg))               96
null: n.ASSIGN(byte, n.IOR(acc, const08))           96
bool: n.BOOL(n.AND(areg, const16)                   60

bool: n.BOOL(acc, const08)                          99
null: n.ASSIGN(byte, n.ADD(reg, one))               30
acc: n.IOR(acc, const08)                            140
bool: n.BOOL(n.AND(acc, reg), zero)                 48
GSM encoder
Hardware/software tradeoff
- Software gain: execution speed, code size
- Hardware cost: functional unit, decoding logic, data path configuration
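One simple way to reason about this trade-off is to rank candidate instructions by estimated cycles saved per unit of hardware cost. The sketch below assumes a cost model that is not in the slides: the field names (`count`, `old_cycles`, `new_cycles`, `hw_cost`) and all numbers are hypothetical.

```python
def estimate_gain(candidate):
    """Cycles saved = profiled use count * cycles saved per use."""
    saved_per_use = candidate["old_cycles"] - candidate["new_cycles"]
    return candidate["count"] * saved_per_use


def rank(candidates):
    """Order candidates by cycles saved per unit of hardware cost."""
    return sorted(candidates,
                  key=lambda c: estimate_gain(c) / c["hw_cost"],
                  reverse=True)


# Hypothetical candidates; none of these numbers come from the benchmarks.
candidates = [
    {"name": "and_shift", "count": 1000, "old_cycles": 6, "new_cycles": 5, "hw_cost": 20},
    {"name": "mac", "count": 100, "old_cycles": 20, "new_cycles": 13, "hw_cost": 7},
]
for c in rank(candidates):
    print(c["name"], estimate_gain(c), c["hw_cost"])
```

A frequently executed pattern with a small per-use saving can still lose to a rarer pattern once hardware cost is taken into account, which is why both the profile count and the cost estimate are needed.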
Conclusions
- RTL-level pattern enumeration
  - Key to automating instruction identification, code generation, assembly and simulation
  - No need to change algorithm source code
- Hardware/software trade-off
  - Good estimation of performance gain and hardware cost at register-transfer level
- Op-code reuse