
Enabling Ultra Low Voltage System Operation by Tolerating On-Chip Cache Failures
Amin Ansari, Shuguang Feng, Shantanu Gupta, and Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan, Ann Arbor
August 20, 2009
University of Michigan Electrical Engineering and Computer Science

Motivation
§ Extreme technology integration in sub-micron regime
  o Heat dissipation ↑ and power density ↑
    Ø Cost of thermal packaging, cooling, and electricity ↑
    Ø Device lifetime ↓
§ If high performance is not needed: DVS
  o Improvement in battery life of medical devices, laptops, etc.
§ Large SRAM structures limit the min achievable Vdd
  o Because SRAM delay increases at a higher rate than CMOS logic delay as Vdd is decreased

Bit-Error-Rate for an SRAM Cell
§ Extremely fast growth in failure rate with decreasing Vdd
§ Due to systematic and random process variation
  o Min sustainable Vdd of entire cache is determined by the one SRAM bit-cell with the highest required operational voltage
§ Min achievable Vdd for 64 KB and 2 MB caches
  o In 90 nm while targeting 99% yield
§ Write-margin of L2 cache determines the min Vdd
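
The link between per-cell failure rate and whole-cache yield can be illustrated with a short calculation. This is a minimal sketch, assuming independent cell failures and counting data bits only (no tag or ECC bits); the function names are hypothetical and no bit-error-rate values from the slide's plot are reproduced here.

```python
import math

def cache_yield(bit_error_rate: float, num_bits: int) -> float:
    """Probability that every one of num_bits cells works,
    assuming cells fail independently with the same probability."""
    # log1p keeps precision when bit_error_rate is very small.
    return math.exp(num_bits * math.log1p(-bit_error_rate))

def max_tolerable_ber(target_yield: float, num_bits: int) -> float:
    """Largest per-cell failure probability that still meets target_yield
    when no fault tolerance is available."""
    return -math.expm1(math.log(target_yield) / num_bits)

L1_BITS = 64 * 1024 * 8          # 64 KB cache, data bits only (assumption)
L2_BITS = 2 * 1024 * 1024 * 8    # 2 MB cache, data bits only (assumption)

for name, bits in (("64 KB", L1_BITS), ("2 MB", L2_BITS)):
    print(name, max_tolerable_ber(0.99, bits))
```

Under these assumptions the larger cache tolerates a much smaller per-cell failure rate at the same yield target, which is consistent with the L2 setting the minimum Vdd.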

Our Goal
§ Enabling DVS to push core's Vdd down to
  o Ultra low voltage region (< 600 mV)
  o While preserving correct functionality of on-chip caches
§ Proposing a highly flexible, fault-tolerant (FT) cache architecture that can efficiently tolerate these SRAM failures
§ No gain in high power mode
  o Minimizing our overheads in this mode
  o Single power supply, because dual-Vdd designs have
    Ø Area and design complexity ↑
    Ø Necessity of voltage converters
    Ø Large noise from the high voltage island

Our Fault-Tolerant Cache
§ Interweaving a set of n+1 partially functional cache wordlines to give the appearance of n functional lines
§ Partitioning the set of all lines into large groups
  o One line per group serves as redundancy for the other lines
  o Each line is divided into multiple chunks (smaller redundancy units)
  o Two lines have a collision if they have at least one faulty chunk in the same position (lines 10 and 15 are collision free)
§ We form groups such that there is no collision between any two lines within a group
  o Group 3 (G3) contains lines 4, 10, and 15
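
The collision rule can be made concrete with a small check. This is a minimal sketch: the per-chunk fault patterns below are hypothetical (the slide does not list which chunks of lines 4, 10, and 15 are faulty), and the four-chunk line is chosen only for readability.

```python
from typing import Sequence

def collide(a: Sequence[bool], b: Sequence[bool]) -> bool:
    """True if both lines have a faulty chunk at the same position."""
    return any(fa and fb for fa, fb in zip(a, b))

def collision_free(lines: Sequence[Sequence[bool]]) -> bool:
    """True if no chunk position is faulty in more than one line,
    which is equivalent to no pair of lines colliding."""
    return all(sum(column) <= 1 for column in zip(*lines))

# Hypothetical fault patterns (True = faulty chunk), 4 chunks per line.
line_4 = [False, True, False, False]
line_10 = [True, False, False, False]
line_15 = [False, False, False, True]

print(collide(line_10, line_15))
print(collision_free([line_4, line_10, line_15]))
```

Because every chunk position has at most one faulty copy across the group, the n+1 lines always hold enough healthy chunks at each position to present n fully functional lines.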

Architecture
§ Added modules
  o Memory map
  o Fault map
  o MUXing layer
§ Two types of lines
  o Data line
  o Sacrificial line
[Figure: block diagram of a two-bank cache with lines 1-8 in the first bank and lines 9-16 in the second bank. Labels: Input Address, Memory Map, Group address of data line, Fault map address, Fault Map, MUXing layer, Sacrificial line, Data line, Functional Block. The example highlights group G3 with its sacrificial line G3(S) and data lines G3(1) and G3(2).]
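
One way to picture how the memory map, fault map, and MUXing layer cooperate on an access is sketched below. The slide does not give the encoding of either map, so the dictionary layouts, the function names, and the fault positions are assumptions made for illustration rather than the authors' design.

```python
from typing import Dict, List, Optional

Chunk = Optional[int]  # None stands for "never written / unusable"

def write_line(addr: int, data: List[int], banks: Dict[int, List[Chunk]],
               memory_map: Dict[int, int],
               fault_map: Dict[int, List[bool]]) -> None:
    """Store each chunk in the data line, or in the group's sacrificial
    line when the data line's chunk at that position is faulty."""
    spare = memory_map.get(addr)  # sacrificial line of this line's group, if any
    for i, chunk in enumerate(data):
        if spare is not None and fault_map[addr][i]:
            banks[spare][i] = chunk
        else:
            banks[addr][i] = chunk

def read_line(addr: int, banks: Dict[int, List[Chunk]],
              memory_map: Dict[int, int],
              fault_map: Dict[int, List[bool]]) -> List[Chunk]:
    """Per-chunk MUX: pick the sacrificial line's chunk where the fault
    map marks the data line's chunk as faulty."""
    spare = memory_map.get(addr)
    if spare is None:
        return list(banks[addr])
    return [banks[spare][i] if faulty else banks[addr][i]
            for i, faulty in enumerate(fault_map[addr])]

# Hypothetical group G3: sacrificial line 4, data lines 10 and 15.
banks = {n: [None] * 4 for n in (4, 10, 15)}
memory_map = {10: 4, 15: 4}
fault_map = {10: [True, False, False, False],
             15: [False, False, False, True]}

write_line(10, [1, 2, 3, 4], banks, memory_map, fault_map)
write_line(15, [5, 6, 7, 8], banks, memory_map, fault_map)
print(read_line(10, banks, memory_map, fault_map))
print(read_line(15, banks, memory_map, fault_map))
```

Since the two data lines are collision free, they borrow different chunk positions of the shared sacrificial line and never overwrite each other.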

Group Formation
[Figure: cache fault pattern and the resulting assignment of lines to Groups 1-5; (S) marks a group's sacrificial line, (k) its k-th data line.]

First bank             Second bank
Line  Assignment       Line  Assignment
1     G1(S)            9     G1(1)
2     G2(1)            10    G2(S)
3     G2(2)            11    G1(2)
4     G3(S)            12    G4(S)
5     G4(1)            13    G3(1)
6     G4(2)            14    G3(2)
7     G4(3)            15    D
8     G5(S)            16    G5(1)
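
The slides state the constraint that groups must satisfy (no collisions) but not the procedure used to build them. The sketch below uses a simple first-fit heuristic as a stand-in; it is not presented as the authors' algorithm. It ignores bank placement, treats the first line of each group as the sacrificial line, and uses a hypothetical `max_group_size` parameter and hypothetical fault masks.

```python
from typing import Dict, List, Tuple

def form_groups(faults: Dict[int, int],
                max_group_size: int = 4) -> Tuple[List[List[int]], List[int]]:
    """Greedy first-fit grouping of faulty lines.

    faults maps a line number to a bitmask of its faulty chunks
    (bit i set = chunk i faulty); pass only lines that have faults.
    Returns (groups, disabled): each group is collision free and lists
    its assumed sacrificial line first; disabled lines fit nowhere.
    """
    def popcount(mask: int) -> int:
        return bin(mask).count("1")

    # Place the most damaged lines first; ties are broken by line number.
    order = sorted(faults, key=lambda line: (-popcount(faults[line]), line))
    groups: List[dict] = []
    for line in order:
        mask = faults[line]
        for group in groups:
            # A line fits if the group has room and shares no faulty position.
            if len(group["lines"]) < max_group_size and group["mask"] & mask == 0:
                group["lines"].append(line)
                group["mask"] |= mask
                break
        else:
            groups.append({"lines": [line], "mask": mask})
    formed = [g["lines"] for g in groups if len(g["lines"]) > 1]
    disabled = [g["lines"][0] for g in groups if len(g["lines"]) == 1]
    return formed, disabled

# Hypothetical fault masks, 4 chunks per line.
example = {4: 0b0100, 7: 0b1111, 10: 0b0001, 15: 0b1000}
print(form_groups(example))
```

In this example line 7 is faulty in every chunk, so it collides with every other line and ends up disabled, while lines 4, 10, and 15 form one collision-free group.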

Operation Modes
§ Low power mode (Vdd < 651 mV)
  o First time processor switches to this mode
    Ø BIST scans cache for potential faulty cells
    Ø Processor switches back to high power mode
    Ø Forms groups and fills the memory and fault maps
§ High power mode (Vdd ≥ 651 mV)
  o Our scheme is turned off to minimize overheads
    Ø There are no sacrificial lines in this case
    Ø Clock gating to reduce dynamic power of SRAM structures
  o Bypass MUXes still burn dynamic power
  o No power gating is used for leakage mitigation

Evaluation Methodology
§ Performance
  o SimAlpha, which is based on SimpleScalar OoO
  o Processor is modeled after DEC EV-7
§ Delay, power and area
  o CACTI for caches and other SRAM structures
  o Synopsys standard tool-chain for
    Ø Miscellaneous logic (e.g. bypass MUXes and comparators)
§ Given set of cache parameters (e.g. Vdd)
  o Monte Carlo (with 1000 iterations) using described algorithm
  o Determining disabled portion of caches (for 99% yield)
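
The Monte Carlo step can be pictured roughly as follows. This is a sketch under several assumptions that the slide does not confirm: chunk faults are drawn independently with a hypothetical probability `p_chunk_fault`, grouping uses a first-fit heuristic as a stand-in for the described algorithm, lost capacity counts both sacrificial and ungroupable lines, and "for 99% yield" is read as the 99th percentile of lost capacity across trials.

```python
import math
import random

def disabled_fraction(num_lines: int, chunks_per_line: int,
                      p_chunk_fault: float, rng: random.Random,
                      max_group_size: int = 4) -> float:
    """One trial: draw a random fault pattern, group the faulty lines
    first-fit, and return the fraction of lines lost."""
    masks = []
    for _ in range(num_lines):
        mask = 0
        for chunk in range(chunks_per_line):
            if rng.random() < p_chunk_fault:
                mask |= 1 << chunk
        masks.append(mask)
    faulty = sorted((m for m in masks if m), key=lambda m: -bin(m).count("1"))
    groups = []  # each entry is [combined fault mask, number of lines]
    for mask in faulty:
        for group in groups:
            if group[1] < max_group_size and group[0] & mask == 0:
                group[0] |= mask
                group[1] += 1
                break
        else:
            groups.append([mask, 1])
    sacrificed = sum(1 for g in groups if g[1] > 1)  # one line per real group
    disabled = sum(1 for g in groups if g[1] == 1)   # lines that fit nowhere
    return (sacrificed + disabled) / num_lines

def overhead_at_yield(trials: int = 1000, target_yield: float = 0.99,
                      seed: int = 0, **cache_params) -> float:
    """Lost-capacity level that target_yield of the simulated caches
    stay at or below."""
    rng = random.Random(seed)
    results = sorted(disabled_fraction(rng=rng, **cache_params)
                     for _ in range(trials))
    index = min(trials - 1, math.ceil(target_yield * trials) - 1)
    return results[index]

print(overhead_at_yield(num_lines=256, chunks_per_line=8, p_chunk_fault=0.02))
```

Sweeping `p_chunk_fault` (which in practice would be derived from the bit-error rate at a given Vdd) and comparing the result against a capacity limit gives a minimum-Vdd estimate in the spirit of the evaluation flow.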

Minimum Achievable Vdd
§ Protecting L2 is harder than L1
  o Due to longer lines and larger size
  o Chunk size = 8 b for L2 and 4 b for L1
  o Achieving 420 mV by enforcing the following 10% limits

Overheads
§ Overheads for L1 and L2 caches
  o 10T cells used to protect fault map, tag array, and memory map
§ Using SPEC2K benchmark suite
  o INT: gzip, vpr, gcc, mcf, crafty, parser, vortex, bzip2, twolf
  o FP: swim, mgrid, applu, art, equake, ammp, sixtrack
  o 4.7% performance penalty for EV-7 (SimAlpha)

Conclusion
§ DVS is widely used to deal with high power dissipation
  o Minimum achievable voltage is bounded by SRAM structures
§ We proposed a flexible FT cache architecture
  o To tolerate these SRAM failures efficiently when operating in low power mode
§ Using our approach
  o Operational voltage of processor can be reduced to 420 mV
  o 80% dynamic power saving and 73% leakage power saving
  o 4.7% performance overhead for microprocessor
  o < 15% overhead for on-chip caches