Architecture and Compilation for Reconfigurable Processors
Jason Cong, Yiping Fan, Guoling Han, Zhiru Zhang
Computer Science Department, UCLA
Nov 22, 2004

Outline
- Motivation
- Application-specific instruction set compilation
- Register file data bandwidth problem
- Architecture extension: shadow registers
- Shadow register binding
- Conclusions

Reconfigurable Processor Platform
- Reconfigurable processor (RP) core + programmable fabric
  - RP core supports: basic instruction set + customized instructions
- Programmable fabric implements the customized instructions
- Either runtime reconfigurable or pre-synthesized
- Example: Nios / Nios II from Altera
  - Stratix version supported by Nios 3.0 system
  - 5 extended instruction formats
  - Up to 2048 instructions for each format
[Figure: block diagram showing the reconfigurable processor core, CPU, and bus]

Motivational Example

Original code:
    t1 = a * b;
    t2 = b * 2;
    t3 = c * 5;
    t4 = t1 + t2;
    t5 = t2 + t3;
    t6 = t5 + t4;

With extended instructions:
    t1 = extop1(a, b, 2);
    t2 = extop2(b, c, 2, 5);
    t3 = t1 + t2;

- *: 2 clock cycles
- +: 1 clock cycle
- Execution time: 9 clock cycles (original code) vs. 5 clock cycles (with extended instructions)
- Speedup: 1.8
- Extended instruction set: I ∪ {extop1, extop2}

[Figure: data flow graph of the example with the clusters implemented by extop1 and extop2 highlighted]

Problem Statement
Given:
  I. Application program in CDFG G(V, E)
  II. A processor with basic instruction set I
  III. Pattern constraints: number of inputs less than Nin; 1 output; total area no more than A
Objective:
  - Generate a pattern library P
  - Map G to the extended instruction set I ∪ P, so that the total execution time is minimized

Proposed ASIP Compilation Flow
- Extended instruction candidate generation
  - Satisfying I/O constraints
- Extended instruction selection
  - Select a subset to maximize the potential speedup while satisfying the resource constraint
- Code generation
  - Graph covering
  - Minimize the total execution time
[Figure: flow diagram with the stages C, Compilation, CDFG, Pattern Generation / Pattern Selection, ASIP constraints, Pattern Library, Application Mapping, Mapped CDFG, Code Generation, Simulation, Instruction Implementation / ASIP Synthesis, Implementation]

Step 1. Pattern Enumeration
- Each pattern is an Nin-feasible cone
- Cut enumeration is used to enumerate all the Nin-feasible cones [Cong et al., FPGA'99]
- Basic idea: in topological order, merge the cuts of the fan-ins and discard those cuts that are not Nin-feasible
[Figure: example DFG with inputs a, b, c, constants 2 and 5, multiply nodes n1, n2, n3, and add nodes n4, n5, n6]
3-feasible cones:
  n1: {a, b}
  n2: {b, 2}
  n3: {c, 5}
  n4: {n1, n2}, {n1, b, 2}, {n2, a, b}, {a, b, 2}
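
The cut-merging idea can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `enumerate_cuts`, the dictionary encoding of the DFG, and the treatment of constants as ordinary leaf inputs are assumptions made for the example. The graph is the one from the motivational example (t1..t6 correspond to n1..n6).

```python
from itertools import product

def enumerate_cuts(fanins, order, n_in):
    """Enumerate the n_in-feasible cuts of every internal node.

    fanins: dict mapping each internal node to its list of fan-in names
            (names absent from the dict are primary inputs or constants).
    order:  internal nodes in topological order.
    """
    cuts = {}
    for node in order:
        # A fan-in contributes either itself or any cut already found for it.
        options = [{frozenset([f])} | cuts.get(f, set()) for f in fanins[node]]
        merged = set()
        for combo in product(*options):
            cut = frozenset().union(*combo)
            if len(cut) <= n_in:  # discard cuts that are not n_in-feasible
                merged.add(cut)
        cuts[node] = merged
    return cuts

# Hypothetical encoding of the example DFG.
FANINS = {
    "n1": ["a", "b"], "n2": ["b", "2"], "n3": ["c", "5"],
    "n4": ["n1", "n2"], "n5": ["n2", "n3"], "n6": ["n5", "n4"],
}
CUTS = enumerate_cuts(FANINS, ["n1", "n2", "n3", "n4", "n5", "n6"], n_in=3)
```

With `n_in=3`, `CUTS["n4"]` holds the four cones listed for n4 above.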

Step 2. Pattern Selection
- Basic idea: simultaneously consider speedup, occurrence frequency, and area
- Speedup
  - Tsw(p) = total execution time with basic instructions
  - Thw(p) = length of the critical path of scheduled p
  - Speedup(p) = Tsw(p) / Thw(p)
- Occurrence
  - Some pattern instances may be isomorphic
  - Graph isomorphism test [Nauty package]
  - The subgraphs are small, so the isomorphism test is very fast
- Gain(p) = Speedup(p) × Occurrence(p)
- Selection under the area constraint can be formulated as a 0-1 knapsack problem
[Figure: example DFG (nodes n1 to n6) with the pattern *+ highlighted: Tsw = 3, Thw = 2, Speedup = 1.5]
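
A 0-1 knapsack selection over pattern gains could look like the sketch below. It is only an illustration under stated assumptions: the function names, the pattern names, the occurrence counts, and the area numbers are all hypothetical; only the *+ figures (Tsw = 3, Thw = 2) come from the slide. Areas are assumed to be integers so that a simple dynamic program over the area budget applies.

```python
def gain(t_sw, t_hw, occurrence):
    """Gain(p) = Speedup(p) x Occurrence(p), with Speedup(p) = Tsw(p) / Thw(p)."""
    return (t_sw / t_hw) * occurrence

def select_patterns(patterns, area_budget):
    """0-1 knapsack: patterns is a list of (name, gain, area) with integer area.

    Returns (best total gain, list of selected pattern names).
    """
    best = [(0.0, [])] * (area_budget + 1)  # best[c]: optimum using area <= c
    for name, g, area in patterns:
        # Iterate capacities downward so each pattern is used at most once.
        for cap in range(area_budget, area - 1, -1):
            candidate = best[cap - area][0] + g
            if candidate > best[cap][0]:
                best[cap] = (candidate, best[cap - area][1] + [name])
    return best[area_budget]

# Hypothetical candidates: (name, gain, area).
PATTERNS = [
    ("mul_add", gain(3, 2, 2), 4),      # the *+ pattern, assumed to occur twice
    ("mul_mul_add", gain(5, 2, 1), 6),  # assumed numbers
    ("add_add", gain(2, 1, 1), 3),      # assumed numbers
]
BEST_GAIN, CHOSEN = select_patterns(PATTERNS, area_budget=7)
```

Under these made-up numbers the selection picks `mul_add` and `add_add`, which together fill the budget of 7 area units.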

Step 3. Application Mapping
- Assume execution on an in-order, single-issue processor
- Cover each node in G(V, E) with the extended instruction set to minimize the execution time
  - Trivial pattern: software execution time
  - Nontrivial pattern: hardware execution time
  - Total execution time = sum of the execution times of the pattern instances after application mapping
- Theorem: The application mapping problem is equivalent to the library-based minimum-area technology mapping problem.
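
The total-execution-time rule can be checked against the motivational example with a few lines of Python. This is a sketch under assumptions: the data layout and names are hypothetical, and the 2-cycle hardware latency of extop1 and extop2 is inferred from the 5-cycle total quoted in the motivational example rather than stated explicitly there.

```python
SW_LATENCY = {"*": 2, "+": 1}  # cycles per basic operation, from the slides

def total_execution_time(instances, hw_latency):
    """Sum the latencies of all pattern instances of a mapping.

    instances:  list of (name, [operations covered by the instance]).
    hw_latency: cycles of each nontrivial (multi-operation) pattern.
    """
    total = 0
    for name, ops in instances:
        if len(ops) == 1:
            total += SW_LATENCY[ops[0]]  # trivial pattern: software time
        else:
            total += hw_latency[name]    # nontrivial pattern: hardware time
    return total

# Unmapped code: every operation is a trivial pattern.
BASELINE = [("t1", ["*"]), ("t2", ["*"]), ("t3", ["*"]),
            ("t4", ["+"]), ("t5", ["+"]), ("t6", ["+"])]
# Mapped code: two extended instructions plus one remaining add.
MAPPED = [("extop1", ["*", "*", "+"]), ("extop2", ["*", "*", "+"]),
          ("t3", ["+"])]
HW_LATENCY = {"extop1": 2, "extop2": 2}  # assumed, see the note above

T_BASE = total_execution_time(BASELINE, HW_LATENCY)
T_MAPPED = total_execution_time(MAPPED, HW_LATENCY)
SPEEDUP = T_BASE / T_MAPPED
```

Note that the operation b * 2 is covered by both extended instructions, so the mapping duplicates that node.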

Speedup and Resource Overhead on Nios

Benchmark  # Extended     Speedup       Speedup  Resource overhead
           instructions   (estimation)  (Nios)   LE            Memory           DSP blocks
fft_br     9              3.28          2.65     408 (6.06%)   65,536 (9.79%)   16
iir        7              3.18          3.73     255 (3.79%)   4,736 (0.71%)    40
fir        2              2.40          2.14     51 (0.76%)    1,024 (0.15%)    8
pr         2              1.57          1.75     71 (1.05%)    0 (0.00%)        14
dir        2              3.28          3.02     54 (0.80%)    0 (0.00%)        16
mcm        4              4.75          3.22     186 (2.76%)   0 (0.00%)        56
Average    -              3.08          2.75     - (2.54%)     - (1.77%)        -

Simulation Environment
- SimpleScalar v3.0
- Benchmarks
  - From the MediaBench suite
- Machine configuration
  - Single-issue in-order processor (ARM-like)
  - DL1: 8 KB, 4-way, 1 cycle
  - IL1: 8 KB, direct mapped, 1 cycle
  - Unified L2: 256 KB, 4-way, 8 cycles
  - Functional units: 2 IntAdd, 1 IntMult, 1 FPAdd, 1 FPMult
  - Reconfigurable units
    - Critical path latency of the collapsed instructions

Pattern Distribution
Most of the patterns have fewer than 7 nodes.

Ideal Speedup under Different Input Size Constraints

Register File Bandwidth Problem
- Most of the speedup comes from clusters with more than two inputs
- Embedded processors have a 2-port register file
- Extra cycles are needed to transfer data for extended instructions with more than 2 inputs
- Speedup drops due to the communication overhead

Speedup Drop with Different Input Constraints
- Move operation takes one cycle
- 46% speedup drop on average
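
A back-of-the-envelope model of this overhead is sketched below. It assumes one extra move per operand beyond the two register-file read ports and one cycle per move; the function names and the idea of simply adding move cycles to the hardware latency are simplifications made for illustration, not the authors' simulation model.

```python
def extra_moves(num_inputs, read_ports=2):
    """Operands that cannot be read directly from the register file."""
    return max(0, num_inputs - read_ports)

def effective_speedup(t_sw, t_hw, num_inputs, move_cycles=1, read_ports=2):
    """Speedup of one pattern once the move overhead is charged to it."""
    overhead = move_cycles * extra_moves(num_inputs, read_ports)
    return t_sw / (t_hw + overhead)

# The *+ pattern from the pattern-selection slide: Tsw = 3, Thw = 2.
IDEAL = effective_speedup(3, 2, num_inputs=2)       # fits the 2 read ports
WITH_MOVE = effective_speedup(3, 2, num_inputs=3)   # needs one extra move
```

In this toy model a 3-input *+ instruction loses its entire advantage (speedup 1.5 falls to 1.0), which shows why operand transfer matters so much for small patterns.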

Architecture Extensions
- Existing solutions
  - Dedicated data link
    - Avoids potential resource contention on the bus
    - Needs extra cycles for communication
    - Employed in MicroBlaze from Xilinx
  - Multiport register file
    - Low utilization when executing basic instructions
    - Area and power grow cubically
  - Register file replication
    - Predetermined one-to-one correspondence
    - Wastes resources in terms of area and power
    - Limits compiler optimization

Our Approach: Shadow Registers
- Core registers are augmented by an extra set of shadow registers
  - Conditionally written
  - Used only by the custom logic

Shadow Registers
- Controlling the shadow register

  Instruction subword   Shadow register ID   Operation
  00                    0                    Forward the result
  01                    1                    Forward the result
  10                    2                    Forward the result
  11                    -                    Skip

- Advantages and limitations
  - Cost-efficient for a small number of shadow registers
  - Only a few control signals need to be added
  - Opportunity for compiler optimization
  - Requires extra control bits
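
The control scheme can be mimicked with a small Python model. This is a sketch under assumptions: it reads the slide as a 2-bit instruction subword where 00, 01, and 10 forward the core instruction's result to shadow register 0, 1, or 2 and 11 means skip; the function name and the list-based register model are hypothetical.

```python
SKIP = 0b11  # assumed encoding: subword 11 leaves the shadow registers untouched

def apply_subword(shadow_regs, subword, result):
    """Return the shadow register contents after one core instruction."""
    if not 0 <= subword <= 0b11:
        raise ValueError("subword must fit in 2 bits")
    updated = list(shadow_regs)
    if subword != SKIP:
        updated[subword] = result  # forward the result to that shadow register
    return updated

# A core instruction producing 42 with subword 01 also updates shadow register 1.
REGS = apply_subword([0, 0, 0], 0b01, 42)
```

Because the write is conditional on the subword, the compiler decides per instruction whether a result is worth keeping in a shadow register.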

Internal Representation
2-level CDFG representation
- 1st level: control flow graph
- 2nd level: data flow graph
- Computation node: latency and scheduled time slot
- Data edge: lifetime
- Variable: lifetime

Example code:
    i1 = …;
    i2 = ext1(…, i1, …);
    i3 = …;
    i4 = ext2(…, i1, …);
    i5 = ext3(…, i3, …);
    i6 = ext4(…, i3, …);

[Figure: DFG with nodes 1 to 6, one per statement, and data edges e1, e2, e3, e4]
- Lifetime of e1 = [2, 2]
- Lifetime of e2 = [2, 4]
- Lifetime of i1 = [2, 4]
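
The lifetimes quoted for e1, e2, and i1 are consistent with a simple rule, sketched below in Python. The rule is an assumption inferred from those three values: each statement occupies the time slot given by its node number, a value becomes live in the slot after it is produced, and it stays live through the slot of its consumer. The function names are hypothetical.

```python
def edge_lifetime(def_slot, use_slot):
    """Lifetime of one data edge: from the slot after the producer to the consumer."""
    return (def_slot + 1, use_slot)

def variable_lifetime(def_slot, use_slots):
    """Lifetime of a variable: it must survive until its last consumer."""
    return (def_slot + 1, max(use_slots))

# i1 is produced in slot 1 and consumed by ext1 (slot 2) and ext2 (slot 4).
E1 = edge_lifetime(1, 2)
E2 = edge_lifetime(1, 4)
I1 = variable_lifetime(1, [2, 4])
```

The variable lifetime is simply the union of the lifetimes of its outgoing data edges.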

Observation
- 2-port register file
- 3-input extended instructions
- Without a shadow register: 4 additional moves
- Binding for 1 shadow register
  - Binding 1: either i1 or i3 in the shadow register saves 2 moves
  - Binding 2: saves 3 moves

Example code:
    i1 = …;
    i2 = ext1(…, i1, …);
    i3 = …;
    i4 = ext2(…, i1, …);
    i5 = ext3(…, i3, …);
    i6 = ext4(…, i3, …);

[Figure: DFG with nodes 1 to 6 and data edges e1, e2, e3, e4]

Register Binding
- Which operands should be bound?
  - Each input could be a candidate
  - Binding different candidates leads to different savings
  - Trying all the combinations is unaffordable

One Shadow Register Binding Problem
- Problem formulation:
  - Given: a scheduled DFG and one shadow register
  - Objective: bind variables to the shadow register so as to minimize the number of moves

Algorithm for Binding One Shadow Register
- Weighted compatibility graph
  - Vertex <-> data edge in the DFG
  - Weight <-> number of moves saved if the value is kept in the register
  - Edge <-> lifetimes do not overlap
- Theorem:
  - The binding problem is equivalent to finding a maximum weighted chain in the compatibility graph
  - It can be solved optimally in O(|V'| + |E'|) time
- Extension to K shadow registers
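
A maximum weighted chain can be found with a longest-path pass over the compatibility graph, as in the Python sketch below. Several things here are assumptions for illustration: the function name, the tuple encoding of candidates, and especially the three candidates with their lifetimes and weights, which were picked so that the result agrees with the 2-move and 3-move bindings on the Observation slide. For brevity the sketch tests compatibility pairwise instead of walking a prebuilt edge list, so it runs in quadratic time rather than the linear bound of the theorem.

```python
def max_weight_chain(candidates):
    """candidates: list of (name, start, end, weight) with positive weights.

    Two candidates are compatible when their lifetimes do not overlap.
    Returns (total weight, names in chain order) of the best chain.
    """
    # Sorting by start slot is a topological order of the compatibility graph.
    order = sorted(candidates, key=lambda c: c[1])
    best = {}  # name -> (weight of best chain ending here, chain)
    for name, start, end, weight in order:
        best_prev = (0, [])
        for p_name, _p_start, p_end, _p_weight in order:
            # Any predecessor ends before this start, so it is already in best.
            if p_end < start and best[p_name][0] > best_prev[0]:
                best_prev = best[p_name]
        best[name] = (best_prev[0] + weight, best_prev[1] + [name])
    return max(best.values(), key=lambda b: b[0], default=(0, []))

# Hypothetical candidates: keep i1 only for ext1, keep i1 for both uses,
# or keep i3 for both of its uses.
CANDIDATES = [
    ("i1 for ext1", 2, 2, 1),
    ("i1 for ext1 and ext2", 2, 4, 2),
    ("i3 for ext3 and ext4", 4, 6, 2),
]
SAVED, CHAIN = max_weight_chain(CANDIDATES)
```

Here the best chain keeps i1 only for its first use and then switches the shadow register to i3, saving 3 moves, while holding either variable alone saves 2.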

Experimental Results (1)
Speedup under different numbers of shadow registers for 3-input extended instructions

Experimental Results (2)
Speedup under different numbers of shadow registers for 4-input extended instructions

Conclusions
- Proposed and developed a complete compilation flow
- Observed and quantitatively analyzed the data bandwidth problem
- Proposed a novel architecture extension and an efficient register binding algorithm

Thank You