
Data Partitioning Techniques for Partially Protected Caches to Reduce Soft Error Induced Failures
Kyoungwoo Lee (1), Aviral Shrivastava (2), Nikil Dutt (1), and Nalini Venkatasubramanian (1)
(1) Department of Computer Science, University of California at Irvine
(2) Department of Computer Science and Engineering, Arizona State University
Copyright © 2008 UCI ACES Laboratory, http://www.cecs.uci.edu/~aces

Outline
• Motivation and Problem Statement
• Our Solution
• Experiments
• Conclusion
DIPES 08

Motivation
• Soft errors threaten the reliability of the system
  - Soft errors are expected to increase by several orders of magnitude as technology scales beyond sub-micron
    - The soft error rate increases exponentially as technology scales [Hazucha, 00]
  - Redundancy techniques incur high power and performance overheads
    - TMR (Triple Modular Redundancy) exceeds 200% overhead without optimization [Nieuwland, 06]
    - ECC (Error Correction Codes) incurs overheads of up to 95% in performance [Li, 05] and 22% in power in caches [ARM, 03]
• PPC (Partially Protected Caches) [Lee, 06] is promising for multimedia applications
  - There is no obvious way to partition data into a PPC for general applications

Soft Errors on the Increase
• SER (soft error rate) increases exponentially as technology scales
• Contributing factors: integration, voltage scaling, altitude, latitude [Baumann, 05]
• MTTF: Mean Time To Failure
[Figure: bit flip (1, 0) in a transistor; labels: 5 hours MTTF, 1 month MTTF]

Most Vulnerable Caches
• Caches are hit most often because of:
  - Their large portion of the processor (more than 50%)
  - No masking effect (e.g., no logical masking)
[Figure: Intel Itanium II Processor]

Unequal Data Protection
• Not all pages are equally failure critical
  - e.g., multimedia data is failure non-critical
  - e.g., program variables are failure critical
  - Failures: system crash, infinite loop, segmentation fault, etc.
• Only 9 pages out of 83 are failure critical

PPC – Partially Protected Caches
• PPC architectures provide unequal protection for mobile multimedia systems [Lee, 06]
• The Unprotected Cache and the Protected Cache sit at the same level of the memory hierarchy
• The Protected Cache is typically smaller, to keep its power and delay the same as or less than those of the Unprotected Cache
• Very efficient in terms of power and performance
[Figure: Processor Pipeline; PPC (Unprotected Cache, Protected Cache); Memory]

Data Partitioning in a PPC
• Multimedia Applications
  - Multimedia data is failure non-critical: map it into the unprotected cache of a PPC
  - All other data is failure critical: map it into the protected cache of a PPC
• General Applications
  - No obvious partitioning exists
  - This limits the applicability of the PPC
• Problem Statement
  - Find data partitions for a PPC that minimize the power and performance overheads while maximizing reliability
[Figure: PPC (Unprotected Cache, Protected Cache); Memory]

Outline
• Motivation and Problem Statement
• Our Solution
  - Exploitation of Vulnerability to Partition Data
  - Data Partitioning Heuristics
• Experiments
• Conclusion

Our Solution
• Data Partitioning Techniques – DPExplore
  - Design space exploration using a vulnerability metric rather than failure rates
    - Just one evaluation (vulnerability) vs. hundreds of simulations (failure rate)
    - More efficient exploration than Exhaustive Search or Genetic Algorithm
  - Data partitioning for general applications
    - PPC is now effective not only for multimedia applications but also for general applications

Vulnerable Time
[Figure: timeline of a cached datum: incoming data at t0, read at t1, write at t2, eviction at t3; vulnerable from t0 to t1, invulnerable from t1 to t2, vulnerable from t2 to t3]
• Vulnerable time
  - Data is vulnerable during any period that ends with the data being read by the CPU or written back to memory
  - Soft errors between t0 and t1 (and between t2 and t3) can cause application failures: data is vulnerable in these intervals
  - Soft errors between t1 and t2 do not cause application failures, since the data will be updated by the CPU: data is invulnerable between t1 and t2
• Vulnerability of a page
  - Sum of the vulnerable times of the data in the page
  - A page holds 1 KB of data in our study
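The interval rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' estimator: the event names (`fill`, `read`, `write`, `evict`), the trace layout, and both function names are hypothetical, and it assumes that evicting clean data causes no write-back and therefore adds no vulnerable time.

```python
from collections import defaultdict

PAGE_SIZE = 1024  # bytes; the slides use 1 KB pages


def vulnerable_time(events):
    """Sum the vulnerable intervals of one cached datum.

    events: time-ordered list of (time, kind) pairs, where kind is one of
    "fill" (incoming data), "read", "write", or "evict" (hypothetical names).
    """
    total = 0
    prev_time = None  # time of the previous event while the datum is cached
    dirty = False     # True once the CPU has written the datum
    for time, kind in events:
        if kind == "fill":
            prev_time, dirty = time, False
            continue
        if prev_time is None:
            raise ValueError("access to data that is not in the cache")
        interval = time - prev_time
        if kind == "read":
            total += interval      # an error before a read reaches the CPU
            prev_time = time
        elif kind == "write":
            dirty = True           # an error before a write is overwritten
            prev_time = time
        elif kind == "evict":
            if dirty:
                total += interval  # dirty data is written back to memory
            prev_time, dirty = None, False
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return total


def page_vulnerability(trace):
    """trace: dict mapping byte address -> event list. Returns page -> vulnerability."""
    result = defaultdict(int)
    for address, events in trace.items():
        result[address // PAGE_SIZE] += vulnerable_time(events)
    return dict(result)
```

For the timeline on this slide, with t0 = 0, t1 = 10, t2 = 25, and t3 = 40, `vulnerable_time` counts the intervals t0 to t1 and t2 to t3 and skips t1 to t2.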

Vulnerability and Failure Rate
• Vulnerable time closely estimates failure rate

Data Partitions using Vulnerability
• Pages with high vulnerable time are failure critical (FC)
  - They are mapped into the Protected Cache in a PPC
  - The others are failure non-critical (FNC) and are mapped into the Unprotected Cache
[Figure: Processor Pipeline; PPC with FNC pages in the Unprotected Cache and FC pages in the Protected Cache; Memory]

Goal of Data Partitioning
• Pages must be partitioned carefully
  - Mapping too many pages onto the (smaller) protected cache incurs many misses, causing high overheads
• Goal of data partitioning
  - Discover interesting pages to be mapped into a PPC
  - Find the best partition in terms of vulnerability under the performance constraint
[Figure: Processor Pipeline; PPC with FNC pages in the Unprotected Cache and FC pages in the Protected Cache; Memory]

DPExplore – Data Partitioning Heuristics
• DPExplore
  1. Estimate page vulnerability
  2. Add a page from the pool into the protected cache
  3. Evaluate the current page partition
  4. Find a page mapping with minimal vulnerability under the runtime constraint
  5. Repeat 2 to 4 until no more partitions can be found
• Notation
  - PV: Page Vulnerability
  - Vn: Vulnerability of the unprotected cache for page partitions
  - R: Runtime Constraint
  - Rn: Runtime when the n-th page is mapped into the protected cache
[Figure: PPC (Protected Cache, Unprotected Cache) and Memory with pages P1 (PV1 = 9), P2 (PV2 = 6), P3 (PV3 = 2), P4 (PV4 = 1); annotations: R1 > R; R2 < R; V2 < ...; R3 < R; V3 > V2; R4 > R]
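The five steps above can be read as a greedy loop, sketched below in Python. This is one plausible reading of the slide, not the authors' implementation: the visiting order (most vulnerable page first), the acceptance rule (keep a page only if runtime stays within R and vulnerability drops), and the names `dp_explore`, `evaluate`, and `make_toy_evaluator` are all assumptions. In the real framework `evaluate` would be a simulator run; here it is replaced by a made-up cost model.

```python
def dp_explore(page_vulnerability, evaluate, runtime_limit):
    """Greedy sketch of steps 1-5.

    page_vulnerability: dict page -> estimated vulnerability (step 1).
    evaluate: callable taking the frozenset of protected pages and returning
              (runtime, vulnerability of the unprotected cache).
    runtime_limit: the runtime constraint R.
    """
    protected = frozenset()
    _, best_vulnerability = evaluate(protected)
    # Assumption: the pool is visited from most to least vulnerable page.
    pool = sorted(page_vulnerability, key=page_vulnerability.get, reverse=True)
    for page in pool:
        trial = protected | {page}                  # step 2
        runtime, vulnerability = evaluate(trial)    # step 3
        # Step 4: keep the page only if the constraint holds and it helps.
        if runtime <= runtime_limit and vulnerability < best_vulnerability:
            protected, best_vulnerability = trial, vulnerability
    return set(protected), best_vulnerability       # step 5: pool exhausted


def make_toy_evaluator(page_vulnerability, extra_runtime, base_runtime=100):
    """Hypothetical stand-in for a simulator, for illustration only."""
    def evaluate(protected):
        runtime = base_runtime + sum(extra_runtime[p] for p in protected)
        vulnerability = sum(
            v for p, v in page_vulnerability.items() if p not in protected
        )
        return runtime, vulnerability
    return evaluate


# Page vulnerabilities from the slide; the runtime costs are invented.
pv = {"P1": 9, "P2": 6, "P3": 2, "P4": 1}
cost = {"P1": 10, "P2": 2, "P3": 2, "P4": 5}
partition, remaining = dp_explore(pv, make_toy_evaluator(pv, cost), runtime_limit=105)
```

With these invented costs, P1 and P4 are rejected because they break the runtime limit, and P2 and P3 end up in the protected cache. The sketch calls `evaluate` once per page plus once for the baseline, so the number of evaluations grows linearly with the pool size.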


Experimental Setup
[Figure: Data Partitioning Framework; blocks: Application, Compiler, Executable, Platform, Page Vulnerability Estimator, Page Vulnerabilities, DPExplore, Page Mapping; outputs: Runtime, Energy, Vulnerability]

Evaluation
• Data Caches
  - PPC data caches: 2 KB Unprotected Cache and 256-byte Protected Cache
  - Conventional data cache: 2 KB Unprotected Unified Cache
• Simulator
  - SimpleScalar sim-outorder simulator [Burger, 97]
• Benchmarks
  - Several benchmarks from MiBench [Guthaus, 01]
• Evaluation
  - Runtime for performance
  - Energy consumption of the memory subsystem for power
  - Vulnerability for reliability

Experimental Results
• Effectiveness of DPExplore
  - Find data partitions with minimal vulnerability under a 5% runtime penalty
• Comparison of DPExplore to Monte Carlo Exploration and Genetic Algorithm Exploration
  - Number of simulations needed to find interesting data partitions

Significant Reduction of Vulnerability
• On average, DPExplore finds page partitions that reduce the vulnerability by 66% compared to the unprotected cache

Minimal Overheads of Energy and Runtime
• Under a 5% runtime penalty, DPExplore causes less than 1% runtime and 15% energy consumption overheads
• PSNR: Peak Signal to Noise Ratio


DPExplore vs. MC and GA
• MC: Monte Carlo Simulation
• GA: Genetic Algorithm Exploration
• DPExplore is aware of runtime and vulnerability

DPExplore vs. MC and GA
• MC: Monte Carlo Simulation
• GA: Genetic Algorithm Exploration
• DPExplore explores interesting data partitions more effectively than MC and GA


Conclusion
• PPC (Partially Protected Caches) is promising for achieving low-cost reliability through unequal data protection
• We propose data partitioning heuristics (DPExplore)
  - The vulnerability metric closely estimates the failure rate for cache reliability
  - DPExplore explores data partitions with minimal vulnerability under a runtime constraint
  - DPExplore is more effective than random exploration
• Future Work
  - Partitioning techniques for instruction caches
  - Intelligent schemes to improve costs and vulnerability

Thanks! Any Questions?
kyoungwl@ics.uci.edu

Backup Slides

Soft Errors Increase Exponentially Due to Technology Scaling
• 0.18 µm: 1,000 FIT per Mbit of SRAM
• 0.13 µm: 10,000 to 100,000 FIT per Mbit of SRAM
Voltage Scaling
• Voltage scaling increases SER significantly
$SER \propto N_{flux} \times CS \times \exp\left(-\frac{Q_{critical}}{Q_s}\right)$, where $Q_{critical} = C \times V$
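A small Python sketch can make the exponential dependence on voltage concrete. It only evaluates the proportionality above in arbitrary units, so it yields ratios rather than absolute rates. The function names and the sample values of C, V, and Qs are hypothetical. The FIT helper uses the standard definition of FIT as failures per 10^9 device-hours.

```python
import math


def relative_ser(n_flux, cross_section, capacitance, voltage, q_s):
    """Right-hand side of SER ∝ N_flux × CS × exp(-Q_critical / Q_s), arbitrary units."""
    q_critical = capacitance * voltage  # Q_critical = C × V
    return n_flux * cross_section * math.exp(-q_critical / q_s)


def ser_ratio(v_old, v_new, capacitance, q_s):
    """Factor by which SER changes when the supply voltage goes from v_old to v_new."""
    return math.exp(capacitance * (v_old - v_new) / q_s)


def mttf_hours(fit_per_mbit, mbits):
    """Mean time to failure in hours for a memory of the given size."""
    return 1e9 / (fit_per_mbit * mbits)


# Hypothetical numbers: lowering V from 1.2 to 1.0 with C = 1.0 and Q_s = 0.1
growth = ser_ratio(1.2, 1.0, capacitance=1.0, q_s=0.1)  # about e^2
```

Because N_flux and CS cancel in the ratio, `ser_ratio` depends only on C, Qs, and the voltage change, which is why a modest drop in voltage can multiply the SER.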

Related Work in Combating Soft Errors
• Process Technology Solutions
  - Hardening: [Baze et al., IEEE Trans. on Nuclear Science ’00]
  - SOI: [O. Musseau, IEEE Trans. on Nuclear Science ’96]
  - Process complexity, yield loss, and substrate cost
• Microarchitectural Solutions for Caches
  - Cache Scrubbing: [Mukherjee et al., PRDC ’04]
  - Low Power Cache: [Li et al., ISLPED ’04]
  - Area Efficient Protection: [Kim et al., DATE ’06]
  - Multiple Bit Correction: [Neuberger et al., TODAES ’03]
  - Cache Size Selection: [Cai et al., ASP-DAC ’06]
  - High overheads in terms of power, performance, and area
• PPC
  - Compiler-based microarchitectural technique
  - Provides protection from soft errors while minimizing the power, performance, and area overheads

ECC Protection
• ECC (Error Correcting Codes) is a popular technique to protect memory from soft errors
  - e.g., SEC-DED Hamming Code (32, 6)
• But it has high overheads in terms of area, performance, and power
  - Performance by up to 95% [Li et al., MTDT ’05]
  - Energy by up to 22% [Phelan, ARM ’03]
  - Area by more than 18% [Phelan, ARM ’03]
• ECC protection for caches is expensive!
[Figure: Protected Cache storing Data and ECC, with Coding and Decoding stages, next to an Unprotected Cache]
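To show where the "(32, 6)" figure can come from, the sketch below computes how many check bits a Hamming code needs for a given data width, using the standard condition 2^r ≥ m + r + 1 for m data bits and r check bits. Reading "(32, 6)" as 32 data bits plus 6 check bits is an assumption about the slide's notation, and the function names are hypothetical.

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """SEC-DED adds one overall parity bit to detect double errors."""
    return hamming_check_bits(data_bits) + 1


def storage_overhead(data_bits):
    """Fraction of extra storage spent on SEC-DED check bits."""
    return secded_check_bits(data_bits) / data_bits
```

For a 32-bit word this gives 6 check bits for single error correction and 7 for SEC-DED, so the check bits alone add roughly a fifth of the data storage, before counting the coding and decoding logic.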

Experimental Setup for Page Failures

Impact of Page Partitions on a PPC
• Failure rate reduction by moving pages from the unprotected cache to the protected cache in a PPC

Vulnerability under No Runtime Penalty

Energy and Runtime under No Penalty