Application-Specific Customization of Parameterized FPGA Soft-Core Processors

- Slides: 22
Application-Specific Customization of Parameterized FPGA Soft-Core Processors

David Sheldon (a), Rakesh Kumar (b), Roman Lysecky (c), Frank Vahid (a, *), Dean Tullsen (b)

a. Department of Computer Science and Engineering, University of California, Riverside
*  Also with the Center for Embedded Computer Systems at UC Irvine
b. Department of Computer Science and Engineering, University of California, San Diego
c. Department of Electrical and Computer Engineering, University of Arizona

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and by hardware and software donations from Xilinx.

FPGA Soft Core Processors
- Soft-core processor
  - HDL description
  - Flexible implementation
    - FPGA or ASIC
    - Technology independent
[Figure: one HDL description targeting an FPGA (Spartan 3, Virtex 2, Virtex 4) or an ASIC]
David Sheldon, UC Riverside

FPGA Soft Core Processors
- Soft-core processors can have configurable options
  - Datapath units
  - Cache
  - Bus architecture
- Current commercial FPGA soft-core processors
  - Xilinx MicroBlaze
  - Altera Nios
[Figure: FPGA containing a μP with optional FPU, MAC, and cache]

Goal
- Goal: tune an FPGA soft-core microprocessor for a given application
[Figure: the application determines parameter values, which drive μP synthesis to produce a configured μP on the FPGA; configurations are compared by time and size]

MicroBlaze – Xilinx FPGA Soft-Core
- Base MicroBlaze
- Instantiatable units: multiplier, barrel shifter, divider, FPU, cache
- Significant tradeoffs
- The configuration with all units is not necessarily the fastest, due to critical path lengthening

Problem
- Need fast exploration
  - Synthesis runs can take an hour
- This talk
  - Two approaches
    - Approach 1: Using traditional CAD techniques
    - Approach 2: Synthesis-in-the-loop
  - Results
[Figure: exploration chooses parameter values, which feed μP synthesis (~20-60 mins) to produce a configured μP]

Constraints on Configurations
- Size constraints may prevent use of all possible units
[Figure: MicroBlaze with multiplier, barrel shifter, divider, FPU, and cache; only some units fit within the max area]

Approach 1: Traditional CAD Techniques
- Create a model of the problem
- Solve the model with extensive search heuristics
- We will model this problem as a 0-1 knapsack problem
[Figure: the "create model" step is slow and includes synthesis; the "model exploration" step is fast and considers 1000s of configurations under the max area]

Approach 1: Traditional CAD Techniques
Creating the model

                 BS     FPU    MUL    DIV    CACHE
Perf increment   1.1    0.9    1.2    1.0    1.3
Size increment   1.4    2.7    1.8    1.1    1.6
Perf/Size        0.96   0.34   0.63   0.93   0.80

[Figure: the application is run on the synthesized base MicroBlaze and on the MicroBlaze with each unit (multiplier, FPU, divider, barrel shifter, cache) to obtain size and perf]

Approach 1: Traditional CAD Techniques
- 0-1 knapsack model
  - Object’s benefit = unit’s performance increment / size increment
  - Object’s weight = unit’s size
  - Knapsack’s size constraint = FPGA size constraint

                 BS     FPU    MUL    DIV    CACHE
Perf increment   1.1    0.9    1.2    1.0    1.3
Size increment   1.4    2.7    1.8    1.1    1.6
Perf/Size        0.96   0.34   0.63   0.93   0.80

Approach 1: Traditional CAD Techniques
- Solved the 0-1 knapsack problem using established methods
  - Toth, P., Dynamic Programming Algorithms for the Zero-One Knapsack Problem. Computing, 1980
- Running time
  - 6 MicroBlaze configuration synthesis runs to create the model
  - O(n*p) to solve the model
    - n is the number of factors
    - p is the available area
    - Negligible (seconds) compared to synthesis runtimes (~hour)

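The dynamic-programming solution named on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tool: the benefits are the Perf/Size values from the table, while the integer area weights and the capacity of 15 are hypothetical, since the slides give only relative size increments.

```python
def knapsack_01(items, capacity):
    """Textbook 0-1 knapsack by dynamic programming, O(n * p) time.

    items: list of (name, weight, benefit), weight being an integer area.
    capacity: integer area budget (the "p" on the slide).
    Returns (best_benefit, sorted list of chosen names).
    """
    n = len(items)
    # best[i][c] = highest benefit using only the first i items within area c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, weight, benefit) in enumerate(items, start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: leave item i out
            if weight <= c:
                take = best[i - 1][c - weight] + benefit  # option 2: include it
                if take > best[i][c]:
                    best[i][c] = take
    # Walk the table backwards to recover which units were included.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, weight, _ = items[i - 1]
            chosen.append(name)
            c -= weight
    return best[n][capacity], sorted(chosen)


# Benefits come from the slide's Perf/Size row; weights are HYPOTHETICAL areas.
UNITS = [
    ("BS", 4, 0.96),
    ("FPU", 17, 0.34),
    ("MUL", 8, 0.63),
    ("DIV", 1, 0.93),
    ("CACHE", 6, 0.80),
]

total_benefit, chosen_units = knapsack_01(UNITS, capacity=15)
print(chosen_units, round(total_benefit, 2))
```

The table has (n + 1) * (p + 1) entries, which is where the O(n*p) bound comes from; with five units this takes a negligible time next to a single synthesis run.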
Approach 1: Traditional CAD Techniques
- Problems
  - 100s of target FPGAs
    - Different hard-core resources (multiplier, block RAM)
  - Model approach estimates size and performance for two or more units
    - MUL speedup 1.3, DIV speedup 1.6
    - Estimated MUL+DIV speedup: 1.9
    - May really be 1.7
  - Model inaccuracies may be large

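The MUL+DIV numbers on this slide are consistent with a model that adds up each unit's individual gain over the base processor. The sketch below assumes that additive reading (the slides do not spell out the formula) and uses a hypothetical function name.

```python
def additive_speedup_estimate(speedups):
    """Combine per-unit speedups by summing their individual gains over 1.0.

    ASSUMPTION: an additive model, chosen because it reproduces the slide's
    1.3 and 1.6 -> 1.9 example; the slides do not state the formula.
    """
    return 1.0 + sum(s - 1.0 for s in speedups)


estimated = additive_speedup_estimate([1.3, 1.6])  # MUL and DIV from the slide
measured = 1.7  # the slide's "may really be" value
overestimate = estimated - measured
print(round(estimated, 2), round(overestimate, 2))
```

The gap between the estimate and the measured value is the multi-unit inaccuracy that the synthesis-in-the-loop approach avoids by measuring combined configurations directly.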
Approach 2: Synthesis-in-the-Loop
- Problem with traditional CAD approach
  - 100s of target FPGAs
  - Model approach estimates size and performance for two or more units
  - Model inaccuracies may be large
- Solution: synthesis in the loop
  - No abstract model
  - Guided by actual size and performance data
  - But slow: can only explore a few configurations
[Figure: the "create model, then model exploration" flow contrasted with synthesis-in-the-loop exploration, in which each step runs synthesis (10s of minutes) and execution and feeds size and perf back to exploration]

Approach 2: Synthesis-in-the-Loop
- First pre-analyze units to guide the heuristic
  - Same calculations as when creating the model for knapsack

                 BS     FPU    MUL    DIV    CACHE
Perf increment   1.1    0.9    1.2    1.0    1.3
Size increment   1.4    2.7    1.8    1.1    1.6
Perf/Size        0.96   0.34   0.63   0.93   0.80

[Figure: size and perf measured for the multiplier, cache, divider, floating point unit, and barrel shifter]

Approach 2: Synthesis-in-the-Loop
- Build “impact-ordered tree” structure
  - Tree is specific to the given application

Before sorting:
             BS     FPU    MUL    DIV    CACHE
Perf/Size    0.96   0.34   0.63   0.93   0.80

After sorting (application-specific impact ordering):
             BS     DIV    CACHE  MUL    FPU
Perf/Size    0.96   0.93   0.80   0.63   0.34

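The impact ordering on this slide is a descending sort on Perf/Size. A minimal sketch using the slide's values (the function and variable names are hypothetical):

```python
# Perf/Size values for one application, copied from the slide.
PERF_PER_SIZE = {"BS": 0.96, "FPU": 0.34, "MUL": 0.63, "DIV": 0.93, "CACHE": 0.80}


def impact_order(perf_per_size):
    """Return unit names sorted from highest to lowest Perf/Size."""
    return sorted(perf_per_size, key=perf_per_size.get, reverse=True)


order = impact_order(PERF_PER_SIZE)
print(order)
```

Because the Perf/Size values come from running a particular application, a different application would generally produce a different ordering.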
Approach 2: Synthesis-in-the-Loop
- Run tree-based search heuristic

Unit     Perf/Size   Useful
BS       0.96        Yes
DIV      0.93        No
CACHE    0.80        No
MUL      0.63        Yes
FPU      0.34        No

[Figure: impact-ordered tree with "Include" and "Not include" branches at each unit; exploration drives synthesis and execution, which return size and perf]

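One way to picture the tree-based search is a single walk down the impact-ordered list, with one synthesis-and-execute run per unit. The sketch below is an interpretation, not the authors' heuristic: the acceptance rule (keep a unit only if the configuration fits and runs faster) is an assumption, and `mock_synthesize_and_run` with all of its sizes and time factors is a hypothetical stand-in for real synthesis, tuned so that the outcome matches the slide's "Useful" column.

```python
# HYPOTHETICAL numbers standing in for real synthesis and execution results.
BASE_SIZE, BASE_TIME = 100, 100.0
UNIT_SIZE = {"BS": 40, "DIV": 10, "CACHE": 60, "MUL": 80, "FPU": 170}
TIME_FACTOR = {"BS": 0.8, "DIV": 1.0, "CACHE": 1.05, "MUL": 0.7, "FPU": 1.1}


def mock_synthesize_and_run(config):
    """Pretend to synthesize `config` and run the application on it.

    Returns (size, execution_time). A real flow would call the vendor
    synthesis tools here and take tens of minutes per call.
    """
    size = BASE_SIZE + sum(UNIT_SIZE[u] for u in config)
    time = BASE_TIME
    for unit in config:
        time *= TIME_FACTOR[unit]
    return size, time


def impact_ordered_search(order, max_size, evaluate, base_time):
    """Visit units in impact order, deciding include / not include for each.

    ASSUMED acceptance rule: keep the unit only if the resulting
    configuration fits in max_size and is strictly faster than the best so far.
    """
    chosen, best_time, runs = [], base_time, 0
    for unit in order:
        size, time = evaluate(chosen + [unit])  # one synthesis run per unit
        runs += 1
        if size <= max_size and time < best_time:
            chosen.append(unit)
            best_time = time
    return chosen, best_time, runs


chosen, best_time, runs = impact_ordered_search(
    ["BS", "DIV", "CACHE", "MUL", "FPU"],
    max_size=250,
    evaluate=mock_synthesize_and_run,
    base_time=BASE_TIME,
)
print(chosen, runs)
```

Each decision is based on a measured size and time for the whole configuration, so interactions between units are captured rather than estimated.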
Comparison of Approaches
- Approach 1 – Traditional CAD
  - 6 synthesis runs to build the model
  - O(np) knapsack solution
  - Examines thousands of configurations during exploration
- Approach 2 – Synthesis in the loop
  - 11 synthesis runs (6 pre-analysis, 5 exploration)
  - Examines (at most) 5 configurations during exploration

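The run counts on this slide follow from the number of optional units. The sketch below reproduces them for five units; the exhaustive count assumes each unit is a simple include/exclude choice, which the slides do not state explicitly.

```python
def synthesis_runs(num_units):
    """Count synthesis runs for each approach, given the number of optional units."""
    model_runs = num_units + 1               # base processor plus one run per unit
    in_loop_runs = model_runs + num_units    # pre-analysis plus one run per tree level
    exhaustive_runs = 2 ** num_units         # ASSUMPTION: every include/exclude subset
    return model_runs, in_loop_runs, exhaustive_runs


knapsack_count, in_loop_count, exhaustive_count = synthesis_runs(5)
print(knapsack_count, in_loop_count, exhaustive_count)
```

The first two counts grow linearly with the number of units, while the exhaustive count doubles with each added unit.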
Results
- 10 EEMBC and Powerstone benchmarks
  - aifir, BaseFP01, bitmnp, brev, canrdr, g3fax, g721_ps, idct, matmul, tblook, ttsprk
- Average results shown, on Virtex 2 Pro, for a particular size constraint
- Knapsack sub-optimality due to multi-unit estimation inaccuracy
- Application-specific impact-ordered tree approach yields near-optimal results in acceptable tool runtime
[Figure: tool run time (min, 0 to 800) versus speedup (1 to 2.5) for Exhaustive, App-Spec, and Knapsack]

Results
- Obtained results for six different size constraints
  - Results shown for a second size constraint
  - Similar findings for all six constraints
[Figure: tool run time (min, 0 to 800) versus speedup (1 to 2.5) for Exhaustive, App-Spec, and Knapsack]

Results
- Also ran for a different FPGA
  - Xilinx Spartan 2
  - Similar findings
[Figure: tool run time (min, 0 to 300) versus speedup (1 to 1.6) for Exhaustive, App-Spec, and Knapsack]

Conclusions
- Synthesis-in-the-loop approach outperformed traditional CAD approach
  - Better results
  - Slightly longer runtime
- Application-specific impact-ordered tree heuristic served well for synthesis-in-the-loop approach
- Future
  - Extend for highly-configurable soft-core processors, and for multiple processors competing for and/or sharing resources

Questions?