Ultra Fine-Grained Run-Time Power Gating of On-Chip Routers

  • Slides: 48

Ultra Fine-Grained Run-Time Power Gating of On-Chip Routers for CMPs
Hiroki Matsutani (Univ Tokyo, Japan)
Michihiro Koibuchi (NII, Japan)
Daisuke Ikebuchi (Keio Univ, Japan)
Kimiyoshi Usami (Shibaura IT, Japan)
Hiroshi Nakamura (Univ Tokyo, Japan)
Hideharu Amano (Keio Univ, Japan)

Multi-Core & Many-Core
[Figure: number of PEs (caches are not included), from 2 to 256, versus year, from 2002 to 2010?. Chips plotted: picoChip PC102, picoChip PC205, ClearSpeed CSX600, ClearSpeed CSX700, Intel 80-core, TILERA TILE64, MIT RAW, UT TRIPS (OPN), STI Cell BE, Sun T1, Sun T2, Intel Core, IBM Power7, AMD Opteron. Annotations: "Hundreds of simple PEs", "Chip multiprocessor (CMP)", "Target".]

Our target: NoC for future CMPs
• 8-CPU CMP example [Beckmann, MICRO'04]
  – 8 CPUs (each has a private L1 cache)
  – Shared L2 cache (divided into 64 banks)
  – CPUs and cache banks are interconnected via NoC
• On-chip network is an infrastructure of CMPs
  – It must always be ready for packet transfers
  – It consumes leakage power at any time!
• Run-time power gating that stops the power supply to the routers whenever possible is required
[Figure: CMP floorplan; legend: UltraSPARC, L1 cache (I & D) (16 kB), L2 cache bank (256 kB, 4-way), on-chip router]

Outline: Fine-grain power gating router
• Fine-grained power gating router (35 power domains in each router)
  – Input VC buffers
  – Crossbar MUXes, VC MUXes
  – Output latches
• Power domain implementation @ 65 nm
  – Design flow
  – Wakeup latency estimation and its impact
• Three early wakeup methods
• Evaluation results
  – Application performance w/ early wakeup
  – Leakage power reduction

Power gating: Coarse- vs. fine-grain
• Coarse-grain approach
  – IP core (module) level
  – Surrounded by VGND
  – Power switch between VGND and GND
• Fine-grain approach [Usami, ICCD'06]
  – Standard cell level
  – Each cell has VGND port
  – All cells in a domain share the same VGND line
[Figure, coarse-grain: IP core with Virtual GND (VGND), GND ring, and an On/Off power switch. Figure, fine-grain: standard cells (OR, AND, INV, DFF) between VDD and VGND, with an On/Off power switch between VGND and GND.]

Which is better? Coarse or fine
• Each router component (e.g., input port) works independently of the others
  – Fine-grain approach has more opportunities to sleep
[Figure: router with ARBITER, 5x5 CROSSBAR, and ports X+, X-, Y+, Y-, CORE; Packet#1 and Packet#2 shown in transit]

Fine-grain run-time PG router
• Router is divided into many micro-power-domains (35 power domains in a 5-port router)
  – Input VC buffers, Output latches
  – Crossbar MUXes, VC MUXes
• Each power domain is activated only when it is "really" used
[Figure: router with ARBITER, 5x5 CROSSBAR, and ports X+, X-, Y+, Y-, CORE; animation of a packet passing through the router]
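The count of 35 power domains can be checked with a short calculation. The sketch below is only one breakdown that adds up to 35: the port count (5) and the four virtual channels (VC0 to VC3) appear in the slides, while the per-component counts are an assumption made for illustration.

```python
# Hypothetical breakdown of the 35 micro-power-domains in one router.
# Only the total (35), the port count (5), and the VC count (4) come from
# the slides; the per-component counts below are assumptions.
NUM_PORTS = 5  # X+, X-, Y+, Y-, CORE
NUM_VCS = 4    # VC0..VC3


def count_power_domains(ports: int = NUM_PORTS, vcs: int = NUM_VCS) -> dict:
    """Return the assumed number of power domains per component type."""
    return {
        "input_vc_buffers": ports * vcs,  # one per VC per input port
        "vc_muxes": ports,                # assumed one per input port
        "crossbar_muxes": ports,          # assumed one per output port
        "output_latches": ports,          # assumed one per output port
    }


domains = count_power_domains()
total_domains = sum(domains.values())
print(domains, total_domains)
```

Under these assumptions the input VC buffers account for 20 of the domains, which is why they are gated first in the evaluation.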


Power domain design: Design flow
• Verilog netlist (Synopsys Design Compiler)
• HOLD cell insertion not to propagate "X" (by hand)
• Place-and-route (Synopsys Astro)
• Power switch insertion (Sequence Design CoolPower)
• Place-and-route again (Synopsys Astro)
• RC extraction (Cadence Assura (QRC))
• SPICE simulation (Synopsys HSIM)

Netlist excerpt with HOLD cells inserted by hand:
    module FIFO (in, out);
      input  [127:0] in;
      output [127:0] out;
      DFF reg0 (in0, out0, clk);
      DFF reg1 (in1, out1, clk);
      HOLD (out0);
      HOLD (out1);
    endmodule

Power domain design: Design flow (power switch insertion)
[Figure: standard cells (OR, AND, INV, DFF, NOR) grouped into Domain#0 and Domain#1 between VDD and GND; after power switch insertion, the cells connect to VGND and a power switch sits between VGND and GND]
• Area overhead
  – Power switch and hold cells: 4.3%
  – But larger cells are needed: 15.9% in total

Power domain design: On/Off control (input VC buffer)
• Power on when a packet comes
• Power off when the packet leaves
[Figure: input VC buffer (FIFO) between VDD and VGND; a power switch driven by the Wakeup signal sits between VGND and GND]

Power domain design: Wakeup latency (input VC buffer)
[Figure: SPICE waveforms of CLOCK, WAKEUP, FIFO OUT[0], and FIFO OUT[1]; after power-on, wakeup and initialization take 2.8 nsec before the output is correct]
• Fujitsu 65 nm CMOS (1.20 V, 75 C)

Power domain design: On/Off control (crossbar multiplexer)
• Power on when a packet comes
• Power off when the packet leaves
[Figure: crossbar multiplexer (MUX) between VDD and VGND; a power switch driven by the Wakeup signal sits between VGND and GND]

Power domain design: Wakeup latency (crossbar multiplexer)
[Figure: SPICE waveforms of CLOCK, WAKEUP, MUX OUT[0], and MUX OUT[1]; after power-on, wakeup takes 1.3 nsec before the output is correct]
• Fujitsu 65 nm CMOS (1.20 V, 75 C)
• All power domains in this router can be activated within 3 nsec
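The on/off control described in these slides (power on when a packet comes, power off when it leaves, with a wakeup delay in between) can be pictured as a three-state machine. The following is a minimal sketch, not the authors' controller: the class name, method names, and the cycle-based timing are all assumptions for illustration.

```python
from enum import Enum


class State(Enum):
    OFF = 0     # power switch open, domain leaks (almost) nothing
    WAKING = 1  # power switch closed, VGND still settling
    ON = 2      # domain usable


class PowerDomain:
    """Hypothetical run-time power gating controller for one domain."""

    def __init__(self, wakeup_cycles: int):
        self.wakeup_cycles = wakeup_cycles
        self.state = State.OFF
        self._remaining = 0

    def request(self) -> None:
        """A packet needs this domain: start waking it if it is off."""
        if self.state is State.OFF:
            self.state = State.WAKING
            self._remaining = self.wakeup_cycles

    def release(self) -> None:
        """The packet has left: cut the power again."""
        self.state = State.OFF
        self._remaining = 0

    def tick(self) -> None:
        """Advance one clock cycle."""
        if self.state is State.WAKING:
            self._remaining -= 1
            if self._remaining <= 0:
                self.state = State.ON


def cycles_until_on(domain: PowerDomain) -> int:
    """Request the domain and count the cycles until it is usable."""
    domain.request()
    waited = 0
    while domain.state is not State.ON:
        domain.tick()
        waited += 1
    return waited


buffer_domain = PowerDomain(wakeup_cycles=3)  # 3 nsec at 1 GHz
print(cycles_until_on(buffer_domain))
```

Without any early wakeup, every packet that finds a domain off pays this wait at each hop, which is the cost the early wakeup methods try to hide.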

Wakeup latency impact on CMPs
• Full system CMP simulator: GEMS/Simics [Martin, CAN'05]
  – 3-cycle router: [RC] [VSA] [ST]
  – Wakeup latencies: 2, 3, 4 cycles
  – SPLASH-2 benchmark (8 threads): radix, lu, fft, barnes, ocean, raytrace, volrend, water-ns, water-sp, fmm (10 applications)
• Token coherence protocol [Martin, ISCA'03]
  – VC0: Request msg (L1 to L2)
  – VC1: Request msg (L2 to Mem)
  – VC2: Reply msg (All to All)
  – VC3: Persistent request msg
[Figure: CMP floorplan; legend: UltraSPARC, L1 cache (I & D) (16 kB), L2 cache bank (256 kB, 4-way), on-chip router]

Wakeup latency impact: Results
• Execution times of SPLASH-2 (10 applications)
[Figure: bar chart per application (Radix, Lu, Fft, Barnes, Ocean, Raytrace, Volrend, Water-NS, Water-SP, Fmm, Ave); series: 2-cycle wait (@667 MHz), 3-cycle wait (@1000 MHz), 4-cycle wait (@1333 MHz); 1.00 = execution time w/o power gating]
• Execution time increases 23.2-46.3% (unacceptable!)
• Longer execution consumes more power; early wakeup required
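The pairing of wait cycles with clock frequencies follows from the 3 nsec wakeup bound stated for the power domains. The sketch below assumes the wait is exactly 3 nsec expressed in clock cycles; the function name is hypothetical.

```python
WAKEUP_LATENCY_NS = 3.0  # upper bound reported for all power domains


def clock_mhz_for(wait_cycles: int, latency_ns: float = WAKEUP_LATENCY_NS) -> int:
    """Clock frequency (MHz, rounded) at which latency_ns spans wait_cycles cycles."""
    return round(wait_cycles / latency_ns * 1000)


# Wait cycles evaluated in the slides: 2, 3, and 4.
frequencies = {cycles: clock_mhz_for(cycles) for cycles in (2, 3, 4)}
print(frequencies)
```

A faster clock therefore makes the same physical wakeup latency cost more cycles per hop.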


Early wakeup: Look-ahead method
• Router modules 2 hops away are activated in advance [Matsutani, ASPDAC'08]
  – Look-ahead routing is used
  – Wakeup procedure starts 4 cycles in advance
• Problems
  – Wakeup signals span 2 hops
  – Cannot wake up the 1-hop router in advance (*)
(*) Average hop count is 2.67 @ 4x4 mesh
• Suffers first-hop wakeup latency; small benefit of look-ahead
[Figure: path from SRC to DST; labels: SRC, 1-hop, 2-hop, 3-hop, 4-hop, DST, Wakeup]
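The average hop count of 2.67 quoted for a 4x4 mesh can be reproduced with a short script. This sketch assumes minimal (Manhattan-distance) routing and uniform traffic between all distinct source and destination pairs; neither assumption is stated in the slides.

```python
from itertools import product


def average_hops(width: int, height: int) -> float:
    """Mean Manhattan distance over all ordered pairs of distinct mesh nodes."""
    nodes = list(product(range(width), range(height)))
    total = 0
    pairs = 0
    for (sx, sy), (dx, dy) in product(nodes, nodes):
        if (sx, sy) == (dx, dy):
            continue  # a node does not send to itself
        total += abs(sx - dx) + abs(sy - dy)
        pairs += 1
    return total / pairs


avg_4x4 = average_hops(4, 4)
print(f"{avg_4x4:.2f}")
```

With fewer than three hops on average, a wait at the first hop affects a large share of each path, which is why plain look-ahead brings only a small benefit.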

Early wakeup: Look-ahead method + CPU ever-on
• Router modules 2 hops away are activated in advance
  – Look-ahead routing is used
  – Wakeup procedure starts 4 cycles in advance
• Ever-on domain
  – VC buffers connected from CPU are always active: VC0 and VC2 (*)
  – No wait for the first hop
  – But, they consume leakage
• VC assignment
  – VC0: Request msg (L1 to L2)
  – VC1: Request msg (L2 to Mem)
  – VC2: Reply msg (All to All)
  – VC3: Persistent request msg
(*) VC0 and VC2 are heavily loaded
• Ever-on domain is only 4.7%; minimum leakage power overhead
[Figure: path from SRC to DST; labels: SRC, 1-hop, 2-hop, 3-hop, 4-hop, DST, Wakeup, Ever-on]
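The ever-on rule combines the VC assignment with the input port a buffer belongs to. The sketch below writes it as a lookup; the enum, the port name "CORE" for the CPU-side input, and the function name are assumptions for illustration.

```python
from enum import Enum


class Msg(Enum):
    REQUEST_L1_TO_L2 = "request (L1 to L2)"
    REQUEST_L2_TO_MEM = "request (L2 to Mem)"
    REPLY = "reply (All to All)"
    PERSISTENT_REQUEST = "persistent request"


# VC assignment as listed in the slides.
VC_OF = {
    Msg.REQUEST_L1_TO_L2: 0,
    Msg.REQUEST_L2_TO_MEM: 1,
    Msg.REPLY: 2,
    Msg.PERSISTENT_REQUEST: 3,
}

EVER_ON_VCS = {0, 2}   # the heavily loaded VCs, per the slides
CPU_PORT = "CORE"      # assumed name of the input port fed by the CPU


def is_ever_on(input_port: str, vc: int) -> bool:
    """True if this input VC buffer is never power gated."""
    return input_port == CPU_PORT and vc in EVER_ON_VCS


print(is_ever_on("CORE", VC_OF[Msg.REPLY]))  # CPU-side reply buffer
print(is_ever_on("X+", VC_OF[Msg.REPLY]))    # same VC on a mesh port
```

Every other buffer still relies on the look-ahead wakeup signal, so only the first hop is exempt from the wait.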

Early wakeup: Active buffer window
• Active buffer window in each VC buffer
  – A part of the buffer is always activated [Chen, ISLPED'03]
  – Active buffer window shifts when it receives/sends a flit
  – Short packets (less than window size): no wait
• No wait, but the window consumes leakage; small PG benefit
[Figure: router with ARBITER, 5x5 CROSSBAR, and ports X+, X-, Y+, Y-, CORE; a VC buffer with Read and Write pointers and window size = 3; animation of the window shifting]
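One way to picture the shifting window is a circular buffer in which the occupied entries, plus a fixed number of free entries ahead of the write pointer, stay powered. This is a simplified model, not the scheme of [Chen, ISLPED'03] or of the slides: the class, the buffer depth of 8, and the rule for which entries count as active are all assumptions.

```python
class WindowedBuffer:
    """Hypothetical VC buffer with an always-active window of free entries."""

    def __init__(self, depth: int, window: int):
        self.depth = depth    # total entries (assumed value, not from the slides)
        self.window = window  # free entries kept powered ahead of the write pointer
        self.read = 0         # index of the oldest stored flit
        self.count = 0        # number of stored flits

    def active_entries(self) -> set:
        """Occupied entries plus `window` free entries after them."""
        n = min(self.depth, self.count + self.window)
        return {(self.read + i) % self.depth for i in range(n)}

    def write(self) -> None:
        """Store one flit; the window extends by one entry."""
        if self.count == self.depth:
            raise OverflowError("buffer full")
        self.count += 1

    def read_flit(self) -> None:
        """Remove the oldest flit; its entry can be powered off."""
        if self.count == 0:
            raise IndexError("buffer empty")
        self.read = (self.read + 1) % self.depth
        self.count -= 1


buf = WindowedBuffer(depth=8, window=3)
print(sorted(buf.active_entries()))  # idle: only the window is powered
buf.write()
buf.read_flit()
print(sorted(buf.active_entries()))  # window has shifted by one entry
```

In this model an idle buffer always keeps `window` entries powered, which illustrates the trade-off named in the slide: no wait for short packets, at the cost of leakage in the window.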


CMP simulator: GEMS/Simics
• Full system CMP simulation: GEMS/Simics [Martin, CAN'05]
  – 8 CPUs, 64 L2 banks, 4x4 mesh
  – Sun Solaris 9; Sun Studio 12
  – SPLASH-2 benchmark (8 threads): radix, lu, fft, barnes, ocean, raytrace, volrend, water-ns, water-sp, fmm (10 applications)
• Token coherence protocol [Martin, ISCA'03]
  – VC0: Request msg (L1 to L2)
  – VC1: Request msg (L2 to Mem)
  – VC2: Reply msg (All to All)
  – VC3: Persistent request msg
• Three early wakeup methods are compared
  – Look-ahead
  – Look-ahead + CPU ever-on
  – Buffer window (window size = 2)
• Wakeup latency: 3 nsec (3-cycle wakeup @ 1 GHz)
[Figure: CMP floorplan; legend: UltraSPARC, L1 cache (I & D) (16 kB), L2 cache bank (256 kB, 4-way), on-chip router]

Evaluations: Application performance
• Execution times of SPLASH-2 (3-cycle wakeup @ 1 GHz)
[Figure: bar chart per application (Radix, Lu, Fft, Barnes, Ocean, Raytrace, Volrend, Water-NS, Water-SP, Fmm, Ave); series: Look-ahead, Look-ahead with CPU ever-on, Buffer window; 1.00 = execution time with no wakeup latency; reference: execution time without early wakeup (+35.3%)]
• Performance overhead is only 4.0% (CPU ever-on)
• Early wakeup significantly mitigates the performance overhead

Evaluations: Leakage power reduction
• Methods compared: Look-ahead, Look-ahead + CPU ever-on, Buffer window (window size = 2)
• Power gating is applied to the router in 3 steps
  – Level-1 power gating: Input buffer only
  – Level-2 power gating: Input buffer + Xbar
  – Level-3 power gating: In/output buffer + Xbar

Evaluations: Leakage power reduction
• Level-1 PG: Input buffer only (3-cycle wakeup)
[Figure: bar chart per application (Radix, Lu, Fft, Barnes, Ocean, Raytrace, Volrend, Water-NS, Water-SP, Fmm, Ave); series: Look-ahead, Look-ahead with CPU ever-on, Buffer window; 100% = leakage power without power gating]
• Input buffer PG reduces the leakage power by 64.6% when applications are running @ 1 GHz

Evaluations: Leakage power reduction
• Level-2 PG: Input buffer + Crossbar (3-cycle wakeup)
[Figure: bar chart per application (Radix, Lu, Fft, Barnes, Ocean, Raytrace, Volrend, Water-NS, Water-SP, Fmm, Ave); series: Look-ahead, Look-ahead with CPU ever-on, Buffer window; 100% = leakage power without power gating]
• Input buffer PG + Crossbar PG reduce the leakage power by 72.7% @ 1 GHz

Evaluations: Leakage power reduction
• Level-3 PG: Input buffer + Crossbar + Output latch
[Figure: bar chart per application (Radix, Lu, Fft, Barnes, Ocean, Raytrace, Volrend, Water-NS, Water-SP, Fmm, Ave); series: Look-ahead, Look-ahead with CPU ever-on, Buffer window; 100% = leakage power without power gating]
• Input buffer PG + Crossbar PG + Output latch PG reduce the leakage power by 77.7% @ 1 GHz
• Fine-grain PG with early wakeup reduces the leakage by 77.7%
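The three reported reductions can be compared directly to see how much each added component contributes. This is plain arithmetic on the figures quoted in the slides; the function names and dictionary keys are made up for the example.

```python
# Leakage power reduction reported for each power gating level (percent).
REDUCTION_PCT = {
    "Level-1 (input buffer)": 64.6,
    "Level-2 (+ crossbar)": 72.7,
    "Level-3 (+ output latch)": 77.7,
}


def remaining_leakage(reduction_pct: float) -> float:
    """Leakage still consumed, as a percentage of the ungated router."""
    return round(100.0 - reduction_pct, 1)


def incremental_gains(reductions: dict) -> list:
    """Extra percentage points gained by each level over the previous one."""
    values = list(reductions.values())
    return [round(b - a, 1) for a, b in zip(values, values[1:])]


for level, pct in REDUCTION_PCT.items():
    print(level, "remaining:", remaining_leakage(pct))
print("gains:", incremental_gains(REDUCTION_PCT))
```

Most of the saving comes from the input buffers; the crossbar and the output latches add smaller increments on top.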

Summary: Run-time PG router for CMPs
• Power gating for router components (35 power domains in each router)
  – Input VC buffers
  – Crossbar MUXes, VC MUXes
  – Output latches
  – Wakeup latency is at most 3 nsec
• Three early wakeup methods
  – Look-ahead with CPU ever-on
  – Look-ahead with active buffer window
• Evaluation results
  – Performance overhead is less than 4.0% @ 1 GHz
  – Leakage power is reduced by 77.7%

Thank you for your attention
