Case Study - SRAM & Caches
By Nikhil Suryanarayanan
Outline
• Introduction
• SRAM Cell
• Peripheral Circuitry
• Practical Cache Designs
• Technology Scaling & Power Reduction
• Summary
Cache Sizes in Recent Processors

Processor                Cache Size
Core 2 Duo               6 MB
Atom                     512 KB
Core i7 Extreme          8 MB
Core i7 Extreme Mobile   8 MB
Core 2 Quad              12 MB
Itanium 2                24 MB
Xeon                     16 MB
SRAM Cell
• Basic building block of the on-chip cache
• Various designs: 12T, 10T, 8T, 6T, 4T
• 6T is the most widely used by industry giants in contemporary microprocessors
• The data bit and its complement are stored in a pair of cross-coupled inverters
• Simple in design
A 6T SRAM Cell
• Cross-coupled inverters
• Access transistors
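To make the storage mechanism concrete, here is a minimal Python behavioral sketch of a 6T cell (a logical model, not a circuit simulation); the class name, node names, and the read/write interface are illustrative assumptions, not anything specified in these slides.

```python
# Minimal behavioral sketch of a 6T SRAM cell (not a circuit-level model).
# Names (q, q_bar, wordline, bl, blb) are illustrative assumptions.

class Sram6TCell:
    """Two cross-coupled inverters hold the bit and its complement; two
    access NMOS transistors connect the storage nodes to BL/BLB when the
    wordline (WL) is high."""

    def __init__(self, value=0):
        self.q = value          # storage node Q
        self.q_bar = 1 - value  # complementary node Q_bar

    def write(self, wordline, bl, blb):
        # Write drivers force the bitlines full-rail; with WL asserted the
        # access transistors let them overpower the weak PMOS pull-ups.
        if wordline and bl != blb:
            self.q, self.q_bar = bl, blb

    def read(self, wordline):
        # During a read both bitlines are precharged high; the cell pulls
        # one of them slightly low.  Here we just report which side dips.
        if not wordline:
            return None
        return "BL dips" if self.q == 0 else "BLB dips"

cell = Sram6TCell()
cell.write(wordline=1, bl=1, blb=0)               # store a '1'
print(cell.q, cell.q_bar, cell.read(wordline=1))  # -> 1 0 BLB dips
```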
Column Circuitry
• Bitline conditioning
• Sense amplifiers
• Multiplexer (column decoder)
Bitline Conditioning
• Precharge both bitlines high before reads
• Equalize the bitlines to minimize any voltage difference between them…. Why?
Sense Amplifier
• Bitlines have many cells attached, so they present enormous capacitive loads
• Bitlines therefore swing slowly
• Voltages are equalized during precharge
• The SA detects the small swing as the bitlines are driven apart and restores it to full logic levels
• Delay is reduced by not waiting for a full-rail swing
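The delay saving can be illustrated with a back-of-the-envelope calculation. Every numeric value below (bitline capacitance, cell read current, supply voltage, sense threshold) is an assumed, illustrative figure, not taken from these slides.

```python
# Why sensing a small differential swing is faster than waiting for a
# full-rail swing.  All values are assumed for illustration.

C_bitline = 500e-15   # 500 fF of bitline capacitance from many attached cells
I_cell    = 50e-6     # 50 uA of read current from one accessed cell
VDD       = 1.2       # supply voltage
dV_sense  = 0.1       # differential swing the sense amp can resolve

# The cell discharges the bitline roughly linearly: dV = I * t / C
t_sense = C_bitline * dV_sense / I_cell   # time to develop 100 mV
t_full  = C_bitline * VDD / I_cell        # time to discharge the full rail

print(f"time to sensable swing: {t_sense*1e9:.2f} ns")   # ~1 ns
print(f"time to full swing:     {t_full*1e9:.2f} ns")    # ~12 ns
print(f"speedup from sensing:   {t_full/t_sense:.1f}x")
```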
Types of SA
• Differential pair: requires no clock but always dissipates static power
• Clocked sense amp: saves power and also isolates the large bitline capacitance
• Cross-coupled amplifier: offers improved gain
Read Circuit
Write Circuit
Drive Strengths
• Read stability
• Writeability
• NMOS pull-down – strongest
• Access NMOS – intermediate
• PMOS pull-up – weakest
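The ordering above maps onto the usual read-stability and writeability ratio checks. The sketch below uses common textbook rules of thumb (cell ratio above roughly 1.2, pull-up ratio below roughly 1.8); those thresholds are my assumptions, not values given in the slides.

```python
# Rough sizing-check sketch tying the drive-strength ordering to read
# stability and writeability.  Ratio thresholds are assumed rules of thumb.

def check_6t_sizing(w_pulldown, w_access, w_pullup):
    cell_ratio = w_pulldown / w_access   # read stability: pull-down must win
    pullup_ratio = w_pullup / w_access   # writeability: access must overpower pull-up
    read_stable = cell_ratio >= 1.2
    writeable = pullup_ratio <= 1.8
    return cell_ratio, pullup_ratio, read_stable, writeable

# Example with the ordering from the slide: pull-down strongest,
# access intermediate, pull-up weakest (widths in arbitrary units).
print(check_6t_sizing(w_pulldown=2.0, w_access=1.2, w_pullup=0.9))
# -> cell ratio ~1.67, pull-up ratio 0.75, read-stable and writeable
```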
Industry Designs
• Novel design techniques
• Embedded dual 512 KB L2$ in UltraSPARC (2004)
• L1$, L2$ & L3$ designs in two Itanium generations (2002-2003)
• Effects of technology scaling
• Overview of a current design in 45 nm
Design Considerations
• A larger on-chip cache running at processor frequency improves performance
• Latency also increases as size grows
• Memory cell performance does not improve rapidly with each new generation
• The design focus changes with the cache level
Itanium 2 L1 Cache
• 16 KB, 4-way I & D caches
• Feature – eliminates read stalls during cache updates
• Focus – circuit technique
L1$ Circuit Technique
UltraSPARC (2004) Dual 512 KB L2$
• 4-way set-associative design
• 64-byte line size, 128-bit data bus, 4 cycles to fill a line
• 8-bit ECC for a 64-bit datum
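A quick worked check of the geometry on this slide: a 64-byte line over a 128-bit bus does take 4 beats to fill, and 8 ECC bits per 64-bit datum is a 12.5% storage overhead. The index/offset split at the end assumes a straightforward address breakdown, which the slides do not spell out.

```python
# Worked arithmetic for the dual 512 KB, 4-way L2$ described on the slide.
import math

line_bytes  = 64
bus_bits    = 128
ecc_bits    = 8
datum_bits  = 64
cache_bytes = 512 * 1024
ways        = 4

fill_cycles  = line_bytes * 8 // bus_bits          # 64 B * 8 / 128 b = 4
ecc_overhead = ecc_bits / datum_bits               # 8 / 64 = 12.5 %
sets         = cache_bytes // (line_bytes * ways)  # 2048 sets
offset_bits  = int(math.log2(line_bytes))          # 6
index_bits   = int(math.log2(sets))                # 11 (assumed simple split)

print(fill_cycles, f"{ecc_overhead:.1%}", sets, offset_bits, index_bits)
# -> 4 12.5% 2048 6 11
```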
UltraSPARC L2$
• Since data occupies most of the silicon area, the goal was process tolerance & area efficiency
• The tag array was designed under tight timing constraints
• Conversion from an off-chip direct-mapped cache to an on-chip cache
• Feature: read the tags for the next access during the current access
• Focus: use of the different access times of the tag and data arrays, & layout
Read Access Pipeline
Tag array I/O circuit diagram
Data organization
Itanium 3 MB L3$
• 180 nm process
• 12-way set-associative
• Fits in an irregular structure
• Feature: size of the cache
• Focus: regular & efficient partition of an irregular space
Itanium 3 MB L3$
Sensing Unit
Itanium 2 6 MB L3$
• 24-way set-associative
• 130 nm process
• 64-bit EPIC architecture processor
• Double the number of transistors compared to the previous generation
• Same power dissipation as the previous generation… hmmm
• 1.5x frequency, 2x L3 cache, 3.5x leakage
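The "same power… hmmm" remark can be made concrete with rough budget arithmetic. The 1.5x frequency, 2x cache, and 3.5x leakage factors come from the slide; the dynamic/leakage split, capacitance scaling, and supply-voltage scaling below are assumptions for illustration only, and the point is simply that naive scaling would not hold power flat.

```python
# Rough power-budget sketch: previous-generation total power normalized to 1.0.
P_prev_dynamic = 0.85   # assumed dynamic share of previous generation's power
P_prev_leakage = 0.15   # assumed leakage share

freq_scale = 1.5        # from the slide
cap_scale  = 1.4        # assumed: more transistors, each smaller
vdd_scale  = 0.85       # assumed supply-voltage reduction at 130 nm
leak_scale = 3.5        # from the slide

P_new_dynamic = P_prev_dynamic * cap_scale * vdd_scale**2 * freq_scale
P_new_leakage = P_prev_leakage * leak_scale
print(f"new power vs old: {P_new_dynamic + P_new_leakage:.2f}x")
# ~1.81x with these assumptions -- hence the explicit L3 power reduction
# scheme on the following slides.
```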
Technology Scaling (figure: previous model vs. current model)
L3 Power Reduction Scheme
Itanium
Core 2 Duo
Power Reduction Techniques Used in Silverthorne
• Control registers disable a pair of ways (out of 8) during low-performance operation
• A way is entirely flushed before disablement
• During Deep Power Down, the entire voltage plane to the L2 is cut off
• The general cache architecture has remained the same
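A minimal sketch of the way-disabling idea described above: ways are flushed (dirty lines written back) before a pair is turned off. The class and method names and the dirty-line bookkeeping are illustrative assumptions, not Silverthorne's actual control logic.

```python
# Illustrative model of flush-then-disable for cache ways (assumed names).

class WayManagedCache:
    def __init__(self, num_ways=8):
        self.num_ways = num_ways
        self.enabled = set(range(num_ways))
        # dirty_lines[way] holds the addresses of dirty lines in that way
        self.dirty_lines = {w: set() for w in range(num_ways)}

    def write_back(self, addr):
        print(f"write back line {addr:#x}")

    def flush_way(self, way):
        # Write back every dirty line before the way loses power/state.
        for addr in self.dirty_lines[way]:
            self.write_back(addr)
        self.dirty_lines[way].clear()

    def disable_way_pair(self, way_a, way_b):
        # Low-performance mode: flush first, then drop the pair of ways.
        for way in (way_a, way_b):
            if way in self.enabled:
                self.flush_way(way)
                self.enabled.discard(way)

cache = WayManagedCache()
cache.dirty_lines[6].add(0x1f80)
cache.disable_way_pair(6, 7)
print(sorted(cache.enabled))   # -> [0, 1, 2, 3, 4, 5]
```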
Questions?
References
• CMOS VLSI Design, 3rd ed., Weste & Harris
• "On-Chip 3 MB Subarray-Based 3rd-Level Cache on the Itanium Microprocessor", Don Weiss, John J. Wuu, Victor Chin
• "Design and Implementation of an Embedded 512 KB Level-2 Cache Subsystem", Shin, Petrick, Singh, Leon
• "A 1.5 GHz 130 nm Itanium 2 Processor with 6-MB On-Die L3 Cache"
Need for Caches
• Large access delays of off-chip memory
• Fast on-chip memory is expensive
• A hierarchical memory system is a solution
• Bring a limited amount of data on-chip and reduce latencies
• Increase processor performance
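The latency argument can be quantified with a standard average-memory-access-time calculation. The hit rates and cycle counts below are illustrative assumptions, not measurements from any of the processors in these slides.

```python
# Average memory access time (AMAT) with and without on-chip caches.
hit_time_l1  = 2      # cycles, assumed on-chip L1
hit_time_l2  = 12     # cycles, assumed on-chip L2
miss_penalty = 200    # cycles to reach off-chip DRAM (assumed)
hit_rate_l1  = 0.90
hit_rate_l2  = 0.80   # of the accesses that miss in L1

amat_no_cache = miss_penalty
amat_l1_only  = hit_time_l1 + (1 - hit_rate_l1) * miss_penalty
amat_l1_l2    = hit_time_l1 + (1 - hit_rate_l1) * (
                    hit_time_l2 + (1 - hit_rate_l2) * miss_penalty)

print(f"no cache:  {amat_no_cache} cycles")
print(f"L1 only:   {amat_l1_only:.1f} cycles")   # 2 + 0.1*200 = 22
print(f"L1 + L2:   {amat_l1_l2:.1f} cycles")     # 2 + 0.1*(12 + 0.2*200) = 7.2
```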
Data Array read & write critical signals
Write Circuit