Measure Twice and Cut Once: Robust Dynamic Voltage Scaling for FPGAs
Ibrahim Ahmed, Shuze Zhao, Olivier Trescases and Vaughn Betz
Email: ibrahim@ece.utoronto.ca
FPGA Power Consumption Challenge
[Figure: VDD (V, 0 to 2) versus technology node (150, 130, 90, 65, 40, 28, 14 nm). Annotation: VDD not scaling.]
FPGA Power Consumption Challenge
• Obstacle against entering the emerging low-power/mobile market (IoT)
• Must show superior perf/W to compete in data centers
• Need innovation to bring power down
"The future of continued scaling is dependent on adaptive power management and voltage scaling", IEEE Fellow Kevin Zhang, VP of Intel's Technology and Manufacturing Group
Worst-case Modelling is Wasteful
• Devices have different delays -> variation
• Delay is temperature dependent (worst case: high temperature)
• Delay is affected by VDD (worst case: lower VDD)
• Aging also affects delay (worst case: end-of-life)
• Static timing analysis (STA) accommodates the tail
• Timing models add margins for:
  - Slow device
  - Worst temperature
  - Worst voltage droop
  - End-of-life effects
  - Guard-bands for noise, etc.
How significant are the added margins?
[Figure: FIR filter Fmax on a 60-nm Cyclone IV (1.2 V nominal VDD). Measured Fmax (MHz, 0 to 250) versus supply voltage (800 to 1400 mV), compared with the CAD-reported Fmax.]
• More than 20% reduction in VDD without reducing Fmax
• Dynamic Voltage Scaling (DVS)
Dynamic Voltage Scaling
• Find the minimum VDD that guarantees operation at the required speed
• Lowering VDD reduces both dynamic and static power
  - P_dynamic ∝ VDD^2
  - Static power drops even faster
• DVS has been commercially adopted by CPUs, but not FPGAs
  - FPGA's programmability -> unknown critical path at fabrication time
• This work: exploit programmability to perform design- and chip-specific calibration
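A quick worked example of the quadratic relation above, as a minimal sketch. It assumes the clock frequency and switched capacitance stay fixed, so only the VDD^2 term changes; the function name and the 0.96 V operating point are illustrative and not taken from the slides.

```python
def dynamic_power_ratio(vdd_new, vdd_nominal):
    """Ratio of dynamic power after scaling VDD, assuming P_dynamic ∝ VDD^2
    with frequency and switched capacitance unchanged (an assumption)."""
    return (vdd_new / vdd_nominal) ** 2


# Illustrative point: a 20% reduction from the 1.2 V nominal supply.
ratio = dynamic_power_ratio(0.96, 1.2)
print(f"dynamic power scales to {ratio:.2f} of nominal")
```

Under these assumptions a 20% drop in VDD leaves roughly 64% of the nominal dynamic power; static power is not modelled here.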
Outline
• DVS proposal
• Testing Procedure
• FRoC
• Results
• Summary & Future work
Conventional Design Cycle
[Diagram: Application HDL -> one measurement by STA (passes timing) -> application bit-stream -> FPGA]
• Program & run application with nominal VDD
DVS Proposal Overview
[Diagram: the CAD system takes the application HDL and produces two bit-streams for the FPGA: the application bit-stream (critical path) and the calibration bit-stream (replicated critical path, heaters). A power stage supplies VDD to the FPGA.]
• 1st measurement by conventional STA (once per application)
• 2nd measurement by on-chip calibration (repeated for each FPGA)
• Program the calibration bit-stream & generate the calibration table (CT)
• Program & run the application with DVS: the CT sets VDD through the power stage
• Today's talk: the CAD system that generates the calibration bit-stream
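How a calibration table might be consulted at run time, as a minimal sketch. The slides do not specify the CT format; here it is assumed to map supply voltage (mV) to the Fmax (MHz) measured during calibration. The function name and all table values are hypothetical.

```python
def min_safe_vdd(calibration_table, required_mhz):
    """Return the lowest VDD (mV) whose calibrated Fmax meets the target.

    calibration_table: dict {vdd_mv: fmax_mhz}; this layout is an assumption.
    """
    candidates = [vdd for vdd, fmax in calibration_table.items()
                  if fmax >= required_mhz]
    if not candidates:
        raise ValueError("no calibrated VDD reaches the required frequency")
    return min(candidates)


# Hypothetical table for one chip and one application.
ct = {900: 95.0, 1000: 125.0, 1100: 150.0, 1200: 170.0}
print(min_safe_vdd(ct, 121.0))
```

With this made-up table, a 121 MHz target selects the 1000 mV entry rather than the 1200 mV nominal supply.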
Generating the Calibration Bit-stream
• Performed on each FPGA at least once
  - For aging effects, calibration with every power-up
• Capture all speed-limiting paths
• Invisible to FPGA users
• Fast Robust Automated Calibration: the FRoC CAD tool
How to measure Fmax
• Stimulate with random inputs and check the output?
  - Does not guarantee exercising the critical path (CP)
• To robustly measure the delay of a path:
  - Off-path inputs must have a steady non-controlling value
  - Control over the edge transition from input to output
[Diagram: tested path through a LUT, with steady 1/0 off-path inputs and a controlled edge.]
Measuring the Delay of a Single Path
[Diagram: the application's critical path (CP), running FF -> LUT -> LUT -> FF, and its replica in the calibration circuit.]
• Replicate the critical path (CP)
• Change the LUT masks (XOR)
• Control the edge transition (Edge 1, Edge 2)
• Apply an input stimulus
• Detect timing faults: error detection (XNOR) sets an Error FF
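The measurement idea can be illustrated with a toy software model, as a sketch only. It assumes an ideal sink FF that captures the launched value whenever the path delay fits in the clock period (no setup time, jitter or metastability), and it sweeps the clock upward until the first mismatch. All names and the 5 ns delay are hypothetical; the real circuit does this in hardware.

```python
def captured_value(launched, previous, path_delay_ns, clock_period_ns):
    """Ideal sink FF: captures the new value only if the edge arrives in time."""
    return launched if path_delay_ns <= clock_period_ns else previous


def measure_fmax_mhz(path_delay_ns, start_mhz=50.0, stop_mhz=400.0, step_mhz=1.0):
    """Raise the clock frequency until a timing fault appears; return the last good one."""
    fmax = None
    freq = start_mhz
    while freq <= stop_mhz:
        period_ns = 1000.0 / freq
        previous, error = 0, False
        for launched in (1, 0, 1, 0):            # rising and falling edges
            got = captured_value(launched, previous, path_delay_ns, period_ns)
            error |= (got != launched)           # mismatch flags a timing fault
            previous = got
        if error:
            break
        fmax = freq
        freq += step_mhz
    return fmax


print(measure_fmax_mhz(5.0))   # toy path with a 5 ns delay
```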
A Single Path Delay is Not Robust
• Many paths have a delay close to that of the CP
• Within-die variation may cause some other paths to become more critical
• Varying VDD affects the delay of different FPGA elements differently
• Robust: measure the delay of many near-critical paths
• Fast: use 1 calibration bit-stream
Testing Disjoint Paths
• Testing many disjoint paths is mostly easy
• Repeat the same procedure used for single-path testing
[Diagram: FF-to-FF paths in the application and their replicas in the calibration circuit, feeding an Error FF.]
...but What to Do with Overlapping Paths?
[Diagram: source FFs S1 and S2, LUT A and LUT B; Path 1 and Path 2 converge on LUT C, which feeds a sink FF. Control signals Fix A and Fix B are added at LUT A and LUT B.]
• Paths sharing a LUT through different inputs
• To test Path 1, fix the off-path input at LUT C
• Path 1 & Path 2 can't be tested together
• Need 2 separate test phases
  - Add Fix control signals to keep a LUT output constant
  - Test controller cycles through the test phases sequentially
LUT Masks for Testing
[Diagram: a K-LUT with added control inputs.]
• Fix off-path inputs
• Break re-convergent fan-outs
  - Only added when required
• Control edge transition
• Developed more LUT masks to test Cyclone IV carry-chains with the same controllability
Can't Test Everything with 1 Bit-stream
[Diagram: a LUT with candidate paths P1 to P4 on its inputs; Edge and Fix control signals take up inputs. Path 1 and Path 2 through LUT A, LUT B and LUT C.]
• One or two LUT inputs are used as control signals
• Fixing a LUT output does not break all re-convergent fan-outs
• LAB input constraints
• Carry-chain constraints
CAD System with FRoC
[Diagram: proposed CAD system. Application HDL -> Quartus P&R -> application bit-stream. Quartus STA -> FRoC -> calibration HDL plus location & routing constraints -> Quartus -> calibration bit-stream.]
FRoC steps:
1) Path selection
2) Path replication
3) Grouping replicated paths
4) Test controller generation
1) Path selection
[Diagram: application circuit with paths P1 to P5 running from source FFs through 4-LUTs to a sink FF.]
• Extract near-critical paths from STA
  - {P1, P2, P3, P4, P5}
• Select which paths to test
  - Can't test {P2, P3, P4} in 1 bit-stream: two inputs of the 4-LUT are reserved for control signals (Fix, Edge)
• Select the more critical paths
  - {P1, P2, P3, P5}
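One way to picture the selection rule is a greedy pass over paths sorted by delay, as a sketch only. The slides do not describe FRoC's actual algorithm; the data layout, function name, delays and LUT names below are all hypothetical. The only constraint modelled is that each K-LUT keeps K minus 2 data inputs once Fix and Edge are reserved.

```python
from collections import defaultdict


def select_paths(paths, k=4, reserved=2):
    """Greedy sketch: keep the most critical paths that fit each LUT's input budget.

    paths: list of (name, delay_ns, [(lut, input_pin), ...]); layout is an assumption.
    """
    budget = k - reserved                  # data inputs left after Fix and Edge
    used_pins = defaultdict(set)           # LUT name -> input pins already claimed
    selected = []
    for name, _delay, hops in sorted(paths, key=lambda p: -p[1]):
        fits = all(len(used_pins[lut] | {pin}) <= budget for lut, pin in hops)
        if fits:
            for lut, pin in hops:
                used_pins[lut].add(pin)
            selected.append(name)
    return selected


# Hypothetical delays: P2, P3 and P4 enter the same 4-LUT on different pins.
example = [
    ("P1", 8.0, [("L1", 0)]),
    ("P2", 7.9, [("L2", 0)]),
    ("P3", 7.8, [("L2", 1)]),
    ("P4", 7.5, [("L2", 2)]),
    ("P5", 7.7, [("L3", 0)]),
]
print(select_paths(example))
```

With these made-up delays P4 is the least critical of the three paths sharing the LUT, so it is the one left out.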
2) Path replication
[Diagram: the application circuit (paths P1, P2, P3, P5 through 4-LUTs between FFs) next to the replicated paths, where the 4-LUTs gain Fix and Edge control inputs.]
• Replication + control signals
3) Grouping replicated paths
[Diagram: replicated paths P1, P2, P3 and P5 with their Fix and Edge control signals.]
• Minimising test phases -> minimises calibration time
• Graph coloring problem
• Tested more than 5000 paths using only 17 phases
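The grouping step can be sketched as greedy graph coloring: paths are vertices, an edge joins two paths that cannot be tested together, and each color is a test phase. The slides name the problem but not the heuristic FRoC uses, so the largest-degree-first greedy below is a stand-in, and the conflict list is hypothetical.

```python
def group_into_phases(paths, conflicts):
    """Assign each path a test phase so that conflicting paths never share one.

    paths: list of path names; conflicts: list of (path_a, path_b) pairs.
    Greedy coloring, highest-degree vertex first (a stand-in heuristic).
    """
    neighbours = {p: set() for p in paths}
    for a, b in conflicts:
        neighbours[a].add(b)
        neighbours[b].add(a)

    phase = {}
    for p in sorted(paths, key=lambda name: -len(neighbours[name])):
        taken = {phase[q] for q in neighbours[p] if q in phase}
        candidate = 0
        while candidate in taken:          # smallest phase not used by a neighbour
            candidate += 1
        phase[p] = candidate
    return phase


# Hypothetical conflict: P2 and P3 share a LUT through different inputs.
phases = group_into_phases(["P1", "P2", "P3", "P5"], [("P2", "P3")])
print(phases, "phases needed:", max(phases.values()) + 1)
```

Greedy coloring does not guarantee the minimum number of phases; it only guarantees that no two conflicting paths land in the same phase.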
4) Test controller generation
• For each test phase, the test controller:
  - Sets the appropriate control signals
  - Generates the input stimulus
  - Detects timing faults
[Diagram: the test controller drives the input stimulus and control signals of the replicated paths, reads their sink registers, and outputs Error.]
Benchmarks & Target Chip
• Dual-channel 51-tap low-pass FIR filter
• Full crossbar (Xbar) with 16 100-bit-wide ports
Application    LE utilization    Reported Fmax
FIR filter     67,505 (59%)      121 MHz
Crossbar       26,579 (23%)      115 MHz
• Targeting Cyclone IV EP4CE115F29C7 (TSMC 60-nm technology)
• Nominal VDD: 1.2 V
How Many Edges Are We Covering?
• A timing edge is a connection between:
  - the input and output of a cell (cell delay), or
  - the output of one cell and the input of another cell (connection delay)
• Timing edge criticality = (delay of the longest path using this edge) / (CP delay)
[Figure: timing edge coverage versus criticality (%), for Xbar and FIR, each with 10000 candidate paths.]
• Coverage exceeds 90% in the more critical bins
• FRoC favours testing the more critical edges
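The criticality definition above can be computed on a small timing graph, as a sketch under assumptions: the graph is a DAG given as (source, destination, delay) triples with at most one edge per node pair, and the node names and delays are invented. It uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter


def edge_criticality(edges):
    """edges: list of (src, dst, delay). Returns {(src, dst): criticality}."""
    nodes, preds = set(), {}
    for u, v, _ in edges:
        nodes |= {u, v}
        preds.setdefault(v, set()).add(u)
    graph = {n: preds.get(n, set()) for n in nodes}
    order = list(TopologicalSorter(graph).static_order())

    arrive = {n: 0.0 for n in nodes}       # longest delay from any source to n
    for n in order:
        for u, v, d in edges:
            if u == n:
                arrive[v] = max(arrive[v], arrive[u] + d)

    depart = {n: 0.0 for n in nodes}       # longest delay from n to any sink
    for n in reversed(order):
        for u, v, d in edges:
            if v == n:
                depart[u] = max(depart[u], d + depart[v])

    cp_delay = max(arrive.values())        # critical path delay
    return {(u, v): (arrive[u] + d + depart[v]) / cp_delay for u, v, d in edges}


# Invented timing graph: two paths converging on LUT_C.
timing_edges = [
    ("FF1", "LUT_A", 1.0), ("LUT_A", "LUT_C", 2.0),
    ("FF2", "LUT_B", 1.0), ("LUT_B", "LUT_C", 1.0),
    ("LUT_C", "FF3", 1.0),
]
for edge, crit in edge_criticality(timing_edges).items():
    print(edge, crit)
```

Edges on the critical path get criticality 1.0; edges that only appear on shorter paths get proportionally less.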
First, a Sanity Check
• Need to validate the CT values
• Selected benchmarks are feed-forward applications with no buried states
[Diagram: a BIST controller wraps the application with an LFSR at its inputs and a MISR at its outputs; the tested signature is compared with a reference.]
[Figure: measured Fmax (MHz, 0 to 250) versus supply voltage (800 to 1400 mV) for FIR and Xbar, with the CAD-reported Fmax for FIR (121.18 MHz) and Xbar (115.19 MHz).]
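The LFSR/MISR check can be mimicked in software, as a sketch only. The slides do not give register widths or polynomials; the 16-bit LFSR taps (16, 14, 13, 11), the MISR feedback value 0x1021, the stand-in circuit and the fault model are all assumptions chosen for illustration.

```python
def lfsr16(seed=0xACE1):
    """16-bit Fibonacci LFSR (taps 16, 14, 13, 11) yielding pseudo-random words."""
    state = seed
    while True:
        yield state
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)


def misr16(words, poly=0x1021):
    """Compact a stream of 16-bit output words into one signature."""
    sig = 0
    for w in words:
        msb = sig >> 15
        sig = ((sig << 1) & 0xFFFF) ^ (poly if msb else 0) ^ (w & 0xFFFF)
    return sig


def run_bist(circuit, n_vectors=1000):
    """Drive the circuit with LFSR vectors and return the MISR signature."""
    gen = lfsr16()
    return misr16(circuit(next(gen)) for _ in range(n_vectors))


def with_single_fault(circuit, bad_call):
    """Wrap a circuit so that one output word has a flipped bit (toy timing fault)."""
    count = 0

    def faulty(x):
        nonlocal count
        y = circuit(x)
        if count == bad_call:
            y ^= 1
        count += 1
        return y

    return faulty


def stand_in(x):
    """Hypothetical feed-forward circuit with no internal state."""
    return (3 * x + 1) & 0xFFFF


reference = run_bist(stand_in)
tested = run_bist(with_single_fault(stand_in, bad_call=10))
print("pass" if tested == reference else "timing fault detected")
```

Because the benchmarks are feed-forward with no buried state, a matching signature at a given VDD and clock indicates that every captured output word was correct.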
How Many Paths to Measure?
[Figure: two panels, FIR and Xbar, plotting Fmax (MHz) versus VDD (V) for calibration with 1 path, 2000 paths and 10000 paths, against the benchmark's actual Fmax.]
• 1 path is not robust
• Fan-out loading effects
Fan-out Correction & Guard-banding
• Correcting for fan-out through the difference in reported delay (by Quartus STA) between the calibration and the application bit-streams
  - 1% for FIR & 5% for Xbar
• Guard-banding for IR-drop and crosstalk effects
  - 5% for both benchmarks (experimental values)
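How the two percentages might be applied to a calibrated Fmax, as a sketch. The slides give the percentages but not how they are combined; here both are assumed to inflate the measured path delay multiplicatively, and the 200 MHz input is a made-up value.

```python
def derated_fmax_mhz(measured_fmax_mhz, fanout_correction, guard_band):
    """Apply fan-out correction and guard-band as delay increases (an assumption)."""
    period_ns = 1000.0 / measured_fmax_mhz
    period_ns *= (1.0 + fanout_correction)   # e.g. 0.01 for FIR, 0.05 for Xbar
    period_ns *= (1.0 + guard_band)          # e.g. 0.05 for both benchmarks
    return 1000.0 / period_ns


# Hypothetical calibration reading of 200 MHz at some VDD.
print(round(derated_fmax_mhz(200.0, 0.05, 0.05), 1))   # Xbar-style margins
print(round(derated_fmax_mhz(200.0, 0.01, 0.05), 1))   # FIR-style margins
```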
Generated CT & Power Savings
[Figure: two panels, Xbar and FIR, plotting Fmax (MHz) versus VDD (V): the benchmark's actual Fmax and the guard-banded CT.]
Summary
• Presented a DVS approach tailored for FPGAs (off-line calibration)
• Created the FRoC tool to automate the calibration procedure
• Achieved more than 33% total power reduction
Future Work
• Reducing guard-bands to enable more power savings
• Complete fan-out modelling for tested paths
• Account for IR-drop during calibration
• Number of calibration bit-streams required for full coverage
• Testing hard blocks to find the safest minimum VDD