Understanding Power Consumption and Reliability of High Bandwidth

Understanding Power Consumption and Reliability of High-Bandwidth Memory with Voltage Underscaling
Saber Nabavi, Behzad Salami, Osman Unsal, Adrian Cristal, Hamid Sarbazi-Azad, Onur Mutlu

Executive Summary
• DRAM has several limitations, and one of them is bandwidth.
  ✓ HBM puts DRAM chips inside the same package as the GPU, FPGA, etc.
• HBM uses the package's power budget
  ✓ Undervolting reduces power WITHOUT losing bandwidth.
• Pushing undervolting too far results in unwanted bit flips
  ✓ This work: power, bit flips, and the trade-offs between them
• Evaluation Setup
  ✓ Xilinx FPGA with 2 HBM stacks
  ✓ HBM voltage rail
• Main Results
  ✓ 19% voltage guardband
  ✓ 2.3X power savings
  ✓ Fault map to help users take advantage of undervolting

Outline
• Why HBM?
• What is HBM?
• Undervolting
• Methodology
• Results
• Power
• Reliability
• The Trade-off

Why HBM?
• DRAM limitations: power, latency, bandwidth, etc.
  ✓ Especially in data-intensive applications
• Replace DRAM: PCM, MRAM, etc.
  ✓ These have their own limitations
• Improve DRAM:
  ✓ Reduced Latency DRAM (RLDRAM)
  ✓ Graphics DDR (GDDR)
  ✓ Low-Power DDR (LPDDR)
  ✓ High Bandwidth Memory (HBM) -> Bandwidth
• HBM use cases
  ✓ NVIDIA A100, Xilinx Virtex UltraScale+ HBM, AMD Radeon Pro
  ✓ The Summit supercomputer

What is HBM?
• IDEA: Integrate stacks of DRAM chips into the computing package
  ✓ Use TSVs, µBumps, and a silicon interposer to connect everything
  ✓ Eight 128-bit-wide channels per stack
• Benefits:
  ✓ An order of magnitude higher bandwidth
  ✓ Smaller form factor
  ✓ Lower energy per bit (7 pJ vs. 25 pJ in DDRx)
• Challenge:
  ✓ Uses the package's power budget
  ✓ Save power but do NOT lose bandwidth: undervolting
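To put the energy-per-bit figures in context, the following minimal sketch converts them into data-transfer power at a given bandwidth (power = energy per bit × bits per second). Only the 7 pJ and 25 pJ values come from the slide; the function name and the 100 GB/s bandwidth are hypothetical and chosen purely for illustration.

```python
PJ = 1e-12  # joules per picojoule


def transfer_power_watts(energy_pj_per_bit: float, bandwidth_gbytes_per_s: float) -> float:
    """Power spent moving data: energy per bit times bits per second."""
    bits_per_second = bandwidth_gbytes_per_s * 1e9 * 8
    return energy_pj_per_bit * PJ * bits_per_second


# Hypothetical bandwidth; not a figure from the slides.
BANDWIDTH_GBYTES_PER_S = 100.0

for name, pj_per_bit in (("HBM", 7.0), ("DDRx", 25.0)):
    watts = transfer_power_watts(pj_per_bit, BANDWIDTH_GBYTES_PER_S)
    print(f"{name}: {watts:.1f} W at {BANDWIDTH_GBYTES_PER_S:.0f} GB/s")
```

Even with the lower energy per bit, this power is drawn inside the package, which is why the slide lists the shared power budget as the challenge.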

Xilinx VCU128
[Figure: block diagram of the board. Labels: SLR = Super Logic Region (reconfigurable fabric), HBM stacks, AXI ports, switches, memory controllers (MC), pseudo channels (PC), memory array. Counts shown in the figure: 2, 32, 32.]

Undervolting
• IDEA: Reduce the supply voltage but keep the frequency
  ✓ We can do this because of the voltage guardband
  ✓ Save power WITHOUT losing bandwidth
• Catch:
  ✓ Pushed beyond the guardband, bit flips appear!
  ✓ But we can save even more power at the cost of these faults!
• Our Work:
  ✓ Undervolt HBM
  ✓ Then push it too far!
  ✓ Report power savings and bit flips
  ✓ Trade-off among memory capacity, power, and fault rate
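As a rough first-order illustration of why lowering the voltage at a fixed frequency saves power, the sketch below uses the textbook CMOS dynamic-power model P = α·C·V²·f. This model, the 1.2 V nominal supply, the sample voltages, and the function name are assumptions made for illustration; they are not measurements or results from this work.

```python
def dynamic_power_ratio(v_nominal: float, v_scaled: float) -> float:
    """Return P(v_nominal) / P(v_scaled) under P = alpha * C * V^2 * f.

    alpha, C, and f are assumed unchanged, so only the V^2 term remains.
    """
    if v_nominal <= 0 or v_scaled <= 0:
        raise ValueError("voltages must be positive")
    return (v_nominal / v_scaled) ** 2


V_NOMINAL = 1.20  # assumption: nominal supply voltage in volts

# Arbitrary sample points, not voltages reported in the slides.
for v in (1.10, 1.00, 0.90):
    ratio = dynamic_power_ratio(V_NOMINAL, v)
    print(f"{v:.2f} V -> {ratio:.2f}x lower dynamic power (model estimate)")
```

The model ignores static power and any voltage dependence of α or C, so measured savings can differ from these estimates.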

Methodology
• Undervolting Mechanism
  ✓ HBM supply voltage is driven by an on-board regulator
  ✓ We control it from the host
  ✓ 10 mV voltage steps
• Power Measurements:
  ✓ Change bandwidth utilization by enabling/disabling AXI ports
  ✓ Measure power at all voltage steps
  ✓ Measure idle power by disabling all AXI ports
• Reliability Test
  ✓ Test the entire memory vs. individual pseudo-channels
  ✓ Write all 1s (to detect 1-to-0 bit flips)
  ✓ Write all 0s (to detect 0-to-1 bit flips)
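The reliability test above can be sketched in software as follows. This is a minimal illustration, not the authors' test harness: `FaultyMemory` is a hypothetical toy model that flips each written bit with a fixed probability, standing in for an undervolted memory region, and the 64-bit word width is an arbitrary choice.

```python
import random

WORD_BITS = 64  # hypothetical word width for this sketch
ALL_ONES = (1 << WORD_BITS) - 1


class FaultyMemory:
    """Toy memory: each bit of a written word flips with probability flip_prob."""

    def __init__(self, n_words: int, flip_prob: float, seed: int = 0):
        self.words = [0] * n_words
        self.flip_prob = flip_prob
        self.rng = random.Random(seed)

    def write(self, addr: int, value: int) -> None:
        mask = 0
        for bit in range(WORD_BITS):
            if self.rng.random() < self.flip_prob:
                mask |= 1 << bit
        self.words[addr] = value ^ mask

    def read(self, addr: int) -> int:
        return self.words[addr]


def count_bit_flips(mem: FaultyMemory, n_words: int, pattern: int) -> int:
    """Write the pattern to every word, read it back, and count mismatching bits."""
    for addr in range(n_words):
        mem.write(addr, pattern)
    flips = 0
    for addr in range(n_words):
        flips += bin(mem.read(addr) ^ pattern).count("1")
    return flips


def reliability_test(mem: FaultyMemory, n_words: int) -> dict:
    """All-1s pass exposes 1-to-0 flips; all-0s pass exposes 0-to-1 flips."""
    return {
        "1-to-0": count_bit_flips(mem, n_words, ALL_ONES),
        "0-to-1": count_bit_flips(mem, n_words, 0),
    }


if __name__ == "__main__":
    mem = FaultyMemory(n_words=1024, flip_prob=1e-3)
    print(reliability_test(mem, 1024))
```

In a real setup the same two passes would be repeated at every 10 mV step and, for the per-pseudo-channel test, restricted to the address range of one pseudo-channel at a time.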

Power: Active and Idle
[Figure: plot not preserved in this transcript.]

Power: α
[Figure: plot not preserved in this transcript. Surviving labels: "Bit Flips", "<3%", "14".]

Reliability: the Regions
[Figure: plot not preserved in this transcript. Surviving labels: "0.81", "Exponential", "No Bit Flips", "0.98".]

Reliability: the Variations

The Trade-Off
[Figure: plot not preserved in this transcript. Surviving labels: "2.3X", "1.6X".]
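One way a fault map could support the capacity/power/fault-rate trade-off is sketched below: given a fault rate per memory region at one undervolted setting, keep only the regions an application can tolerate. The per-pseudo-channel granularity, the function name, and every number in `fault_map` are hypothetical and invented for illustration; none of them are measurements from this work.

```python
def usable_pseudo_channels(fault_map: dict, max_fault_rate: float) -> list:
    """Return ids of pseudo-channels whose fault rate is within the tolerance."""
    return sorted(pc for pc, rate in fault_map.items() if rate <= max_fault_rate)


# Hypothetical fault map at one voltage: pseudo-channel id -> fraction of faulty bits.
fault_map = {0: 0.0, 1: 0.0, 2: 1e-6, 3: 4e-4, 4: 0.0, 5: 2e-2, 6: 1e-7, 7: 0.0}

for tolerance in (0.0, 1e-6, 1e-3):
    usable = usable_pseudo_channels(fault_map, tolerance)
    fraction = len(usable) / len(fault_map)
    print(f"tolerance {tolerance:g}: {len(usable)} channels usable ({fraction:.0%} of capacity)")
```

A higher fault tolerance leaves more capacity usable at the same voltage, which is the three-way trade-off the slide title refers to.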
