Hardware-Software Co-Design of Embedded Reconfigurable Architecture, Tim Callahan

Hardware-Software Co-Design of Embedded Reconfigurable Architecture
Tim Callahan, Yanbing Li, Randolph Harr, Uday Kurkure, Jon Stockwood, Ervan Darnell
DAC 2000

Outline
• Related Work
• Problem formulation
• Nimble Compilation flow
• Experiments & Results
• Conclusion

Co-design with ASIC
[Diagram: Embedded CPU, Shared Memory, ASIC]

Co-design with heterogeneous microprocessor
[Diagram: µP, DSP, Shared Memory, ASIC, Coprocessor]

Target Architecture
[Diagram: Embedded CPU, On-Chip SRAM / Cache, Reconfigurable Datapath (FPGA)]

Compare to Conventional Approach
• Spatial [4, 5, 7] & temporal domain partitioning
• Task [2, 7, 10] / instruction-level parallelism
• Single- / multi-configuration of a task
[Diagram: CPU, M, FPGA, ASIC]

Partitioning Problem Formulation
    for (i = 0; i < n; i++) {
        switch (a) {
        case 1:
            ret = c + d;
            break;
        case 2:
            c++;
            if (c == d)
                ret = A();
            break;
        }
        ret = ret * d;
    }
[Figure annotations on the loop: config, En (entry), Ex (exit)]

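Cost Function
\[
T(L_i, K_j) =
\begin{cases}
T_{sw}(L_i) \cdot Iter(L_i), & \text{if loop } L_i \text{ in SW} \\
T_{hw}(L_i, K_j) \cdot Iter(L_i, K_j) + T_{sw2hw}(L_i, K_j) \cdot En(L_i, K_j) + T_{hw2sw}(L_i, K_j) \cdot Ex(L_i, K_j) + T_{config}(L_i, K_j), & \text{if kernel } K_j \text{ in HW}
\end{cases}
\]
\[
T_{config}(L_i, K_j) = N_{miss}(L_i) \cdot T_{miss}(L_i, K_j) + N_{hit}(L_i) \cdot T_{hit}(L_i, K_j)
\]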
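As a rough illustration, the cost function can be written as a small C routine. This is a minimal sketch, not the Nimble implementation: the `LoopProfile` struct and its field names are hypothetical and simply mirror the slide's symbols, and a single `iter` field stands in for both Iter(Li) and Iter(Li, Kj).

```c
/* Hypothetical per-loop profile; field names mirror the slide's symbols. */
typedef struct {
    double t_sw;     /* Tsw: software time per iteration */
    double t_hw;     /* Thw: hardware time per iteration */
    double iter;     /* Iter: iteration count (assumed equal in SW and HW) */
    double t_sw2hw;  /* Tsw2hw: cost of one SW-to-HW transition */
    double t_hw2sw;  /* Thw2sw: cost of one HW-to-SW transition */
    double en;       /* En: number of entries */
    double ex;       /* Ex: number of exits */
    double n_miss, t_miss, n_hit, t_hit; /* terms of Tconfig */
} LoopProfile;

/* Tconfig = Nmiss * Tmiss + Nhit * Thit */
double config_time(const LoopProfile *p) {
    return p->n_miss * p->t_miss + p->n_hit * p->t_hit;
}

/* T(Li, Kj): software branch when in_hw == 0, hardware branch otherwise. */
double loop_time(const LoopProfile *p, int in_hw) {
    if (!in_hw) {
        return p->t_sw * p->iter;
    }
    return p->t_hw * p->iter
         + p->t_sw2hw * p->en
         + p->t_hw2sw * p->ex
         + config_time(p);
}
```

Calling `loop_time(&p, 0)` and `loop_time(&p, 1)` on the same profile gives the two numbers a partitioner would compare for that loop.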

Nimble Compiler Flow
[Flow diagram: C source code, Preprocessing, HW / SW Partitioning, Datapath synthesis, FPGA bit stream, Binary Code, ADL]
Preprocessing:
• Kernel (loop) extraction
• Compiler transformation
• Performance profiling

Algorithm
• (LEP) Loop Entry Trace Profiling
  – Prepare data for cost evaluation
• (ILD) Interesting Loop Detection
  – Reduce problem size
• Intra-Loop Selection
  – Apply compiler transformation technique
• Inter-Loop Selection
  – SW/HW partitioning

Loop Entry Trace Profiling
• Purpose: to know the runtime sequence of all hardware candidate loops.
• MPEG 2 generates ~200 Mbytes of trace data.
• Encoded, the trace is reduced to ~Kbytes.
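The slides do not say how the trace is encoded. Purely as an illustration of how a long loop entry sequence can shrink, the sketch below applies run-length encoding, which is an assumption and not necessarily the scheme Nimble uses. The `Run` type and `rle_encode` function are hypothetical names.

```c
#include <stddef.h>

/* One run: the same loop was entered `count` times in a row. */
typedef struct {
    int loop_id;
    unsigned long count;
} Run;

/* Compresses trace[0..n) into runs stored in out[0..cap).
   Returns the number of runs written; stops early if out is full. */
size_t rle_encode(const int *trace, size_t n, Run *out, size_t cap) {
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        if (used > 0 && out[used - 1].loop_id == trace[i]) {
            out[used - 1].count++;      /* extend the current run */
        } else {
            if (used == cap) {
                break;                  /* no room for a new run */
            }
            out[used].loop_id = trace[i];
            out[used].count = 1;
            used++;
        }
    }
    return used;
}
```

A trace dominated by a few hot loops entered repeatedly collapses to a handful of runs, which is the kind of reduction the slide reports.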

Interesting Loop Detection
Benchmark        # loops (total)   # loops (> 1%)   % (> 1%)
WIC              25                13               99%
EPIC encode      132               13               92%
UNEPIC decode    62                15               99%
ADPCM            3                 3                98%
MPEG 2           165               14               85%
Skipjack         6                 2                99%
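A filter of this kind can be sketched in a few lines of C. The sketch assumes the 1% threshold is measured as a share of total execution time, which the slide does not state explicitly. `detect_interesting` and its parameters are hypothetical names.

```c
#include <stddef.h>

/* Marks loops whose time exceeds threshold * total_time (e.g. threshold = 0.01).
   interesting[i] is set to 1 or 0 for each loop.
   Returns the number of marked loops; *coverage receives their combined
   share of total_time. */
size_t detect_interesting(const double *times, size_t n, double total_time,
                          double threshold, int *interesting, double *coverage) {
    size_t count = 0;
    double covered = 0.0;
    for (size_t i = 0; i < n; i++) {
        interesting[i] = (times[i] > threshold * total_time);
        if (interesting[i]) {
            count++;
            covered += times[i];
        }
    }
    *coverage = (total_time > 0.0) ? covered / total_time : 0.0;
    return count;
}
```

The returned count and `*coverage` correspond to the last two columns of the table: how many loops pass the threshold and how much of the total they account for.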

Intra-Loop Selection
• Compiler transformation techniques:
  – Loop unrolling, fusion, pipelining, procedure inlining, branch trimming.
[Figure: Delay vs. Area plot, with labels "Selected" and "FPGA Limit"]
    while {
        blah, blah
        if ( condition ) { blah, blah }
        blah, blah
    }
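One reading of the Delay vs. Area figure is that each transformation yields a candidate version of the loop, and the fastest version that still fits the FPGA is selected. The sketch below assumes that reading. `KernelVersion` and `select_version` are hypothetical names.

```c
#include <stddef.h>

/* One candidate implementation of a loop after some transformation. */
typedef struct {
    double area;   /* FPGA area the version needs */
    double delay;  /* execution delay of the version */
} KernelVersion;

/* Returns the index of the lowest-delay version with area <= area_limit,
   or -1 if no version fits. */
int select_version(const KernelVersion *v, size_t n, double area_limit) {
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (v[i].area > area_limit) {
            continue;  /* does not fit on the FPGA */
        }
        if (best < 0 || v[i].delay < v[best].delay) {
            best = (int)i;
        }
    }
    return best;
}
```

With more aggressive unrolling, area typically grows while delay falls, so the selected version tends to sit just below the area limit.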

Inter-Loop Selection
• Problem size (2^n)
• Hierarchical Loop Clustering
• Loop-procedure hierarchy graph
• Exhaustive search
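To see where the 2^n problem size comes from, the sketch below enumerates every SW/HW assignment of n loops as a bitmask and keeps the cheapest one. It is a minimal illustration only: `exhaustive_search` and `toy_cost` are hypothetical, and the toy cost (fixed SW/HW times plus a penalty for each extra loop placed in HW) uses made-up numbers, not the cost function or data from the slides.

```c
/* Cost callback: bit i of mask set means loop i runs in HW. */
typedef double (*CostFn)(unsigned mask, void *ctx);

/* Tries all 2^n assignments (n must be smaller than the bit width of
   unsigned) and returns the cheapest mask; its cost goes to *best_cost. */
unsigned exhaustive_search(int n, CostFn cost, void *ctx, double *best_cost) {
    unsigned best_mask = 0u;
    double best = cost(0u, ctx);
    for (unsigned mask = 1u; mask < (1u << n); mask++) {
        double c = cost(mask, ctx);
        if (c < best) {
            best = c;
            best_mask = mask;
        }
    }
    if (best_cost) {
        *best_cost = best;
    }
    return best_mask;
}

/* Toy example with invented numbers for three loops. */
static const double SW_TIME[3] = {100.0, 80.0, 30.0};
static const double HW_TIME[3] = {20.0, 30.0, 25.0};

/* ctx points to a double: penalty per HW loop beyond the first. */
double toy_cost(unsigned mask, void *ctx) {
    double penalty = *(double *)ctx;
    double total = 0.0;
    int hw_count = 0;
    for (int i = 0; i < 3; i++) {
        if (mask & (1u << i)) {
            total += HW_TIME[i];
            hw_count++;
        } else {
            total += SW_TIME[i];
        }
    }
    if (hw_count > 1) {
        total += penalty * (hw_count - 1);
    }
    return total;
}
```

The loop body runs 2^n times, so each additional candidate loop doubles the work; this is the growth that clustering is meant to contain.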

Loop-procedure hierarchy graph
[Figure: hierarchy graph spanning Level 1 to Level 3. Nodes: Main, Init, Q, FW, I1, Q1, BQ, F1, F2, F3, F4, F5, F6. Legend: I1 = loop, Q = procedure. Edge labels: within, call, nested]

Hierarchical Loop Clustering
• Top-down
• Pre-defined limit: 3
[Figure: hierarchy graph with nodes Main, Init, I1, Q, FW, Q1, F1, BQ, F2, F4, F3, F5, F6, with Cluster 1 and Cluster 2 marked]

Optimal Selection
• Select only 1 kernel
• Number of possibilities:
  – Un-clustered: 2^6 = 64
  – Clustered: 2 x 3 x 4 = 24
[Figure: hierarchy graph with nodes Main, Init, I1, Q, FW, Q1, F1, BQ, F2, F4, F3, F5, F6]
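The two counts can be reproduced with a short calculation. The sketch assumes one reading that is consistent with the slide's numbers: six candidate loops, grouped into clusters of sizes 1, 2 and 3, where each cluster contributes either no kernel or exactly one of its loops. Both function names are hypothetical.

```c
#include <stddef.h>

/* Every loop independently in SW or HW: 2^n assignments. */
unsigned long unclustered_choices(unsigned n) {
    return 1UL << n;
}

/* At most one kernel per cluster: (size + 1) options per cluster,
   multiplied across clusters. */
unsigned long clustered_choices(const unsigned *sizes, size_t k) {
    unsigned long total = 1UL;
    for (size_t i = 0; i < k; i++) {
        total *= sizes[i] + 1UL;
    }
    return total;
}
```

With sizes {1, 2, 3} the product is (1+1) x (2+1) x (3+1), matching the slide's 2 x 3 x 4.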

Experiment & Result
