MICRO43 Elastic Refresh Techniques to Mitigate Refresh Penalties

MICRO-43
Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory
Jeffrey Stuecheli (1, 2), Dimitris Kaseridis (1), Hillery C. Hunter (3) & Lizy K. John (1)
(1) ECE Department, The University of Texas at Austin
(2) IBM Corp., Austin
(3) IBM Thomas J. Watson Research Center
Laboratory for Computer Architecture, 12/7/2010

Overview/Summary
§ Refresh overhead is increasing with device density
§ Due to the nature of this increase, performance is suffering
§ Current refresh scheduling methods are ineffective at hiding these delays
§ We propose more sophisticated mitigation methods
  – Elastic Refresh Scheduling

Background: Basic DRAM/Refresh Info
§ Each bit is stored on a capacitor
§ A single transistor per cell holds the charge
§ Leakage: the cell loses charge over time
§ Refresh: rewrite each cell on a periodic basis
§ DDR3
  – Refresh requirement depends on temperature: 64 ms @ 85 °C, 32 ms @ 95 °C
  – DRAM device contains an internal address counter
  – JEDEC simply specifies the time interval (tREFI, time REFresh Interval)
tREFI = 64 ms / 8192 = 7.8 us (3.9 us for 95 °C)
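The tREFI arithmetic on this slide can be checked with a few lines of Python. This is only a worked calculation, assuming 8192 (8k) refresh commands per retention window; the function name `trefi_us` is made up for illustration.

```python
# Refresh commands per retention window (8k). Assumed here to match the
# "8k rows" basis used in the slides.
REFRESH_COMMANDS_PER_WINDOW = 8192


def trefi_us(retention_ms: float,
             commands: int = REFRESH_COMMANDS_PER_WINDOW) -> float:
    """Average refresh command interval (tREFI) in microseconds."""
    return retention_ms * 1000.0 / commands


print(f"{trefi_us(64):.4f} us")  # 64 ms retention window (85 C)
print(f"{trefi_us(32):.4f} us")  # 32 ms retention window (95 C)
```

The two results round to the 7.8 us and 3.9 us figures quoted on the slide.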

Background: Transition to Denser Devices
§ 7.8 us is based on 8k rows per bank
§ DRAM device density doubles roughly every 2 years
§ With one refresh per row, tREFI would halve each generation
§ Instead, multiple rows are refreshed with each command
§ Current delivery constraints force an increase in tRFC with denser devices
[Figure labels: 95 nm 512 Mbit; 42 nm 2 Gbit]

Background: “Stacked” Refresh Operations in a Single Command
[Figure: example of stacked refresh operations]
Source: TN-47-16, “Designing for High-Density DDR2 Memory”, Introduction, by Micron

Background: tRFC Growth with DRAM Density
§ In the most basic terms, tRFC should scale linearly with density
  – Based strictly on the current needed to charge the capacitance
§ ~Fixed charge per bit
§ This has been reflected in the DDR3 spec, with the exception of 8 Gbit
§ Net: even if DRAM vendors can slow the growth, the delay is large today

DRAM type    Refresh completion time
512 Mbit     90 ns
1 Gbit       110 ns
2 Gbit       160 ns
4 Gbit       300 ns
8 Gbit       350 ns

Motivation: Slowdown Effects Observed in Simulation
§ Simics/GEMS
§ 4 cores, two 1333 MHz channels, 2 DDR3 ranks/channel
[Figure: IPC degradation over no-refresh (y-axis 0% to 30%) for 2 Gbit, 4 Gbit, and 8 Gbit devices, shown per benchmark for the integer and floating-point workloads, each group with its geometric mean]

Motivation: Why It Is So Bad

DRAM capacity   tRFC     Bandwidth overhead (95 °C, per rank)   Latency overhead (95 °C)
512 Mb          90 ns    2.7%                                   1.4 ns
1 Gb            110 ns   3.3%                                   2.1 ns
2 Gb            160 ns   5.0%                                   4.9 ns
4 Gb            300 ns   7.7%                                   11.5 ns
8 Gb            350 ns   9.0%                                   15.7 ns
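One way to read the overhead columns is with a simple per-rank model. The sketch below is an assumption, not the authors' stated method: bandwidth overhead is taken as tRFC / tREFI, and average latency overhead as the chance that a read arrives during a refresh multiplied by the mean remaining wait (half of tRFC). With tREFI = 3.9 us it reproduces the 4 Gb and 8 Gb rows; the table lists somewhat higher values for the smaller devices, so the original calculation may include terms this model leaves out.

```python
def refresh_overheads(trfc_ns: float, trefi_ns: float = 3900.0):
    """Return (bandwidth_overhead_fraction, avg_latency_overhead_ns).

    Simplified model (assumption): the rank is unavailable for tRFC out of
    every tREFI, and a read that lands inside a refresh waits tRFC/2 on
    average.
    """
    busy_fraction = trfc_ns / trefi_ns
    avg_latency_ns = busy_fraction * trfc_ns / 2.0
    return busy_fraction, avg_latency_ns


for capacity, trfc in [("4 Gb", 300.0), ("8 Gb", 350.0)]:
    bandwidth, latency = refresh_overheads(trfc)
    print(f"{capacity}: {bandwidth:.1%} bandwidth, {latency:.1f} ns latency")
```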

Motivation: Postponing Refresh Operations
§ Each cell needs to be refreshed every 64 ms
§ Refresh command spacing is based on an average rate
§ As such, a cell will not fail if no refresh is sent at the moment tREFI expires
§ The current DDR3 spec allows the controller to fall eight tREFI intervals behind (backlog count)
  – The cell refresh period is lengthened by only 0.1% (8 in 8k)
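The backlog mechanism described above can be sketched as a small counter. This is an illustrative model only; the class and method names are hypothetical, and the limit of eight postponed refreshes and the 8k commands per window come from the slide.

```python
MAX_BACKLOG = 8             # tREFI intervals the controller may fall behind
COMMANDS_PER_WINDOW = 8192  # refresh commands per retention window (8k)


class RefreshBacklog:
    """Tracks how many refresh commands are currently owed (hypothetical)."""

    def __init__(self) -> None:
        self.count = 0

    def on_trefi_expired(self) -> None:
        # A new refresh becomes due every tREFI.
        if self.count >= MAX_BACKLOG:
            raise RuntimeError("backlog limit exceeded: refresh was not issued in time")
        self.count += 1

    def on_refresh_issued(self) -> None:
        # Issuing a refresh pays off one owed command.
        if self.count > 0:
            self.count -= 1

    def must_refresh_now(self) -> bool:
        # At the limit, the next refresh can no longer be postponed.
        return self.count >= MAX_BACKLOG


def worst_case_stretch() -> float:
    """Fraction by which a cell's refresh period can grow (8 in 8k)."""
    return MAX_BACKLOG / COMMANDS_PER_WINDOW


print(f"{worst_case_stretch():.2%}")
```

The printed fraction is the "8 in 8k" figure from the slide, roughly 0.1%.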

Motivation: Current Approaches
§ Demand Refresh (DR)
  – Most basic policy; sends refresh operations as high-priority operations every tREFI period
§ Defer Until Empty (DUE)
  – Policy uses the DRAM's ability to postpone refreshes
  – Refresh operations are postponed until no reads are queued, or the max backlog count has been reached
§ Why these policies are ineffective
  – DR: Does nothing to hide refreshes
  – DUE: Too aggressive in sending refresh operations; does not take advantage of the backlog in many cases
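The two baseline policies can be written as simple decision functions. This is a minimal sketch based only on the descriptions above; the function names and the representation of controller state as two integers are assumptions for illustration.

```python
MAX_BACKLOG = 8  # maximum postponed refreshes, per the DDR3 limit cited in the slides


def demand_refresh(backlog: int) -> bool:
    """DR: issue a refresh at high priority as soon as one is owed."""
    return backlog > 0


def defer_until_empty(backlog: int, queued_reads: int) -> bool:
    """DUE: postpone an owed refresh until no reads are queued,
    or until the backlog limit forces it."""
    if backlog == 0:
        return False
    return queued_reads == 0 or backlog >= MAX_BACKLOG


# DUE refreshes the instant the read queue empties, even with a backlog of 1,
# which is why the slide calls it too aggressive.
print(defer_until_empty(backlog=1, queued_reads=0))
print(defer_until_empty(backlog=1, queued_reads=3))
```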

Elastic Refresh
§ Exploit
  – Non-uniform request distribution
  – Refresh overhead just has to fit in free cycles
§ Initially not aggressive; converges with DUE as refresh backlog grows
§ Latency-sensitive workloads are often lower bandwidth
§ Decrease the probability of reads conflicting with refreshes

Elastic Refresh: Idle Delay Function
§ Introduce a refresh-backlog-dependent idle threshold
§ With a low backlog, there is no reason to send a refresh command right away
§ With a bursty request stream, the probability of a future request decreases with time
§ As the backlog grows, decrease this delay threshold
[Figure: idle delay threshold as a function of refresh backlog]

Elastic Refresh: Tuning the Idle Delay Function
§ The optimal shape of the IDF is workload dependent
§ The IDF can be controlled with the listed parameters
§ Our system contains hardware to determine “good” parameters
  – Max Delay and Proportional Slope

Parameter             Units                             Description
Max Delay             Memory Clocks                     Sets the delay in the constant region
Proportional Slope    Memory Clocks per Postponed Step  Sets slope of the proportional region
High Priority Pivot   Postponed Step                    Point where the idle delay goes to zero
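The three parameters suggest a piecewise idle delay function: constant at Max Delay for small backlogs, falling linearly in the proportional region, and zero from the High Priority Pivot onward. The sketch below is one plausible form under that reading, not the authors' exact circuit. The function names and the example parameter values (64 clocks, 16 clocks per step, pivot at 7) are hypothetical.

```python
def idle_delay_threshold(backlog: int, max_delay: int,
                         slope: int, pivot: int) -> int:
    """Idle clocks a rank must stay idle before a refresh is sent.

    Assumed shape: min(max_delay, slope * (pivot - backlog)), floored at 0.
    """
    proportional = slope * (pivot - backlog)
    return max(0, min(max_delay, proportional))


def elastic_should_refresh(backlog: int, idle_clocks: int, max_delay: int,
                           slope: int, pivot: int) -> bool:
    """Send an owed refresh once the rank has been idle long enough."""
    if backlog == 0:
        return False
    return idle_clocks >= idle_delay_threshold(backlog, max_delay, slope, pivot)


# Hypothetical parameter values, for illustration only.
for postponed in range(9):
    print(postponed, idle_delay_threshold(postponed, max_delay=64, slope=16, pivot=7))
```

With these example values the threshold stays at 64 clocks up to a backlog of 3, then drops by 16 clocks per postponed step until it reaches zero at the pivot, at which point the policy behaves like DUE.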

Elastic Refresh: Max Delay Circuit
§ Circuit used to collect the average rank idle period
§ Conceptually, given an exponential-type distribution, the average can be used to find the tail
§ Calculated average is used as Max Delay
§ Circuit function:
  – Accumulate idle delay over 1024 events
  – Average calculated with concatenation of accumulator
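A software model of the averaging step might look like the following. It is a sketch under an assumption: because the window is 1024 events, the average is taken by dropping the low 10 bits of the accumulator (a right shift), which avoids a divider. The slide does not spell out the bit-level details, and the class name is hypothetical.

```python
class MaxDelayEstimator:
    """Averages rank idle periods over fixed windows of 1024 events."""

    EVENTS_PER_WINDOW = 1024
    SHIFT = 10  # log2(1024): dividing by 1024 is a 10-bit right shift

    def __init__(self, initial_max_delay: int = 0) -> None:
        self.accumulator = 0
        self.events = 0
        self.max_delay = initial_max_delay

    def record_idle_period(self, idle_clocks: int) -> None:
        """Add one observed idle period, in memory clocks."""
        self.accumulator += idle_clocks
        self.events += 1
        if self.events == self.EVENTS_PER_WINDOW:
            # Window complete: publish the average and start over.
            self.max_delay = self.accumulator >> self.SHIFT
            self.accumulator = 0
            self.events = 0


estimator = MaxDelayEstimator()
for i in range(1024):
    estimator.record_idle_period(10 if i % 2 == 0 else 30)
print(estimator.max_delay)
```

Max Delay only changes once per 1024 idle events, so the update path is infrequent and does not need to be fast.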

Elastic Refresh: Proportional Slope Circuit
§ Conceptually, the proportional region acts to gracefully transition to high priority, while utilizing the full postponed range
§ Circuit works to balance the utilization across the postponed range (High/Low counts)
§ PI-type controller adjusts the slope to balance High/Low counts

Elastic Refresh: Hardware Cost
§ Trivial integration into DUE-based policies
  – Structure replaces the “empty” indication of DUE
§ Logic size
  – ~100 latch bits for static policy
  – ~80 additional latch bits for dynamic policy
§ Logic cycle time
  – Low frequency compared to ALU functions in the processor core
  – Infrequent updates could enable pipelined control

Simulation Methodology
§ Simics extended with GEMS model
  – 1, 4 & 8 core CMP
  – First-Ready, First-Come-First-Served memory controller policy
  – DDR3 1333 MHz 8-8-8 memory, 2 MC, 2 ranks/MC
  – tRFC = 550 ns, tREFI = 3.9 μs @ 95 °C (estimate for 16 Gbit)
  – Refresh policies:
    • Demand Refresh (DR)
    • Defer Until Empty (DUE)
    • Elastic Refresh policies
§ SPEC CPU2006 workloads

Results: Integer, 8 Cores
[Figure: results for the integer workloads on 8 cores]

Related Work
§ B. Bhat and F. Mueller, “Making DRAM refresh predictable,” Real-Time Systems, Euromicro Conference, 2010
§ M. Ghosh and H. S. Lee, “Smart Refresh: An enhanced memory controller design for reducing energy in conventional and 3D die-stacked DRAMs,” in MICRO 40
§ K. Toshiaki, P. Paul, H. David, K. Hoki, J. Golz, F. Gregory, R. Raj, G. John, R. Norman, C. Alberto, W. Matt, and I. Subramanian, “An 800 MHz embedded DRAM with a concurrent refresh mode,” in IEEE ISSCC Digest of Technical Papers, Feb. 2004

Conclusions
§ The significant performance degradation caused by refresh can be mitigated with low-overhead mechanisms
§ Commodity DRAM is cost driven
  – Elastic refresh requires no DRAM changes
§ Future work:
  – Coordinate refresh with other structures on the CMP
  – Investigate refresh for future DRAM devices (DDR4)
    • For example, dynamically select how many rows to refresh

Thank You, Questions?
Laboratory for Computer Architecture, University of Texas Austin
IBM T. J. Watson Lab