Silent Stores for Free (or, Silent Stores Darn Cheap)
Kevin M. Lepak, Mikko H. Lipasti
University of Wisconsin-Madison
http://www.ece.wisc.edu/~pharm

Introduction
  • Recent work shows that many memory writes do not update the system state
  • “Silent stores” are memory writes that write the same value that already exists at that memory location
  • Intuitively, we might be able to exploit this observation for performance benefit

December 11, 2000 Kevin Lepak and Mikko Lipasti--PHARM Team, University of Wisconsin MICRO-33
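The definition of a silent store can be illustrated with a minimal sketch. The memory model (a dictionary in which unwritten locations read as 0), the function name, and the trace are all assumptions made for illustration; none of them come from the slides.

```python
# Hypothetical model: memory is a dict from address to value, and
# unwritten locations read as 0. Names here are illustrative only.
def count_silent_stores(stores, memory=None):
    """Return (silent, total) for a trace of (address, value) stores."""
    memory = {} if memory is None else dict(memory)
    silent = 0
    for address, value in stores:
        if memory.get(address, 0) == value:
            silent += 1              # value already present: state unchanged
        else:
            memory[address] = value  # non-silent: update the state
    return silent, len(stores)


trace = [(0x10, 5), (0x10, 5), (0x14, 0), (0x10, 7), (0x10, 7)]
silent, total = count_silent_stores(trace)
print(f"{silent} of {total} stores are silent")
```

In this toy trace the second, third, and fifth stores rewrite a value that is already in place, so they leave the memory state untouched.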

Silent Stores: Is This for Real?
Percentage of silent stores is non-trivial in all cases, 20%-68%

Motivation
  • Silent stores are real and non-trivial
      • 20-60% of dynamic stores are silent
  • Multiprocessor benefits:
      • Reduced address and data bus traffic
  • Uniprocessor benefits:
      • Reduced writebacks, pressure on write buffers
      • Write port utilization, etc.

Standard Store Verifies
  • Issue a store verify (SV) for every store
  [Pipeline diagram: Fetch, Decode/Dispatch, Rename, EX/Agen, WB, Commit; the store verify (load) accesses the D-cache and, on a hit, a comparison (=) answers "Store silent?"]
  • “Standard” store verifies are expensive
      • Load, compare, (store) overhead for every store
      • Increase cache port utilization
      • Can block loads that may be on the critical path
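The load, compare, (store) sequence of a standard store verify can be sketched as follows. The `DCache` class and its access counters are a hypothetical toy model, assumed here only to make the extra cache port traffic visible.

```python
class DCache:
    """Toy data cache that counts port accesses (hypothetical model)."""

    def __init__(self):
        self.data = {}
        self.reads = 0
        self.writes = 0

    def load(self, address):
        self.reads += 1
        return self.data.get(address, 0)

    def store(self, address, value):
        self.writes += 1
        self.data[address] = value


def store_with_verify(cache, address, value):
    """Load, compare, and write only if the store is not silent."""
    if cache.load(address) == value:   # the store verify
        return True                    # silent: squash the store
    cache.store(address, value)
    return False


cache = DCache()
results = [store_with_verify(cache, a, v) for a, v in [(8, 1), (8, 1), (8, 2)]]
print(results, cache.reads, cache.writes)
```

Three stores cost three reads plus two writes in this model, versus three writes without verification, which is the port pressure the slide refers to.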

Is There a Better Way?
  • Predict which stores are likely to be silent and only store verify those
      • Subject of ongoing research
  • Find lower cost mechanisms for verifying stores
      • Exploit OoO core µ-arch features
      • Exploit core reliability features for deep-submicron technology trends

Silent Stores for Free: Reducing the cost of store verification

Outline
  • OoO core enabled Free Silent Store Squashing (FSSS) mechanisms
      • Read port stealing
      • Temporal & spatial locality in the load/store queue (LSQ)
  • FSSS in ECC cache architectures
      • Data cache protection methods
      • ECC-L1-D$ FSSS
  • Trading FSSS for physical bandwidth
  • Conclusions

Performance: Machine Model
  • SimpleScalar PISA w/realistic memory system
  • 8 issue; 64 entry RUU; 64K entry Gshare
  • 64KB each L1 I/D cache; 512KB unified L2
  • 32 entry load/store queue
  • Two fully-pipelined memory access ports
  • 32B L1-L2 interface, single cycle occupancy
  • Write-through-allocate L1, write-back L2
      • 2 write buffers, 32B write-combining

Read Port Stealing
  • Only issue a store verify if a cache port is available (schedule ready loads/stores first)
  • If a store reaches the head of the ROB before it can be verified, assume it is non-silent
  [Pipeline diagram: Fetch, Decode/Dispatch, Rename, EX/Agen, WB, Commit; a "Read port available?" check gates the D-cache access, followed on a hit by the comparison (=) "Store silent?"]
  • Similar to standard SV, but does not delay ready loads/stores and does not capture all silent stores
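A rough sketch of the read port stealing policy follows. The cycle-by-cycle model, the function name, and the representation of each store as an (arrival cycle, commit cycle) pair are assumptions for illustration; a real scheduler would be considerably more involved.

```python
def read_port_stealing(ports_per_cycle, demand_per_cycle, stores):
    """Decide which stores get verified using only idle cache ports.

    ports_per_cycle: number of cache ports (assumed constant).
    demand_per_cycle: ready loads/stores wanting a port in each cycle.
    stores: list of (arrival_cycle, commit_cycle) in program order, where
        commit_cycle stands for the cycle the store reaches the ROB head.
    Returns a list of booleans: True if the store was verified in time,
    False if it must be assumed non-silent.
    """
    verified = [False] * len(stores)
    for cycle, demand in enumerate(demand_per_cycle):
        free = max(0, ports_per_cycle - demand)   # demand accesses go first
        for i, (arrival, commit) in enumerate(stores):
            if free == 0:
                break
            if not verified[i] and arrival <= cycle < commit:
                verified[i] = True                # steal an idle port
                free -= 1
    return verified


outcome = read_port_stealing(2, [2, 2, 1, 0], [(0, 2), (1, 4), (2, 4)])
print(outcome)
```

In this example the first store reaches the head of the ROB while both ports are still busy, so it is treated as non-silent; the other two are verified on ports that would otherwise sit idle.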

Read Port Stealing: Opportunities
Captures minimum of 84% of store verify opportunities

LSQs
  • OoO cores implement LSQs to track in-memory dependences for improved performance
      • Store forwarding
      • Consistency model violations
  • LSQs provide temporal and spatial context for a memory operation
      • Surrounds an operation with other references local to it in dynamic program order

LSQ Temporal Locality
  • We can exploit temporal locality (same address aliases) in the LSQ to verify stores
      • WAW: Can forward store to load, why not store to store?
      • WAR: Load allocates data from the cache, use it to squash a subsequent store
      • RAW: In many µ-arch, cache port scheduled before aliasing to an entry in the LSQ is known, use the port to store verify
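The WAW and WAR cases can be sketched as a search of older LSQ entries for an address alias. The `Entry` record and `verify_in_lsq` function are hypothetical names, and the RAW case (reusing an already scheduled cache port) is not modeled here.

```python
from collections import namedtuple

# kind is "load" or "store"; value is the data loaded or to be stored.
Entry = namedtuple("Entry", "kind address value")


def verify_in_lsq(lsq, store):
    """Try to verify `store` against older LSQ entries (oldest first in `lsq`).

    Returns True (silent), False (not silent), or None when no older entry
    aliases the address and a cache access would still be needed.
    """
    for older in reversed(lsq):            # youngest older entry wins
        if older.address == store.address:
            # WAW: compare against an older store's data.
            # WAR: compare against data an older load brought in.
            return older.value == store.value
    return None


lsq = [Entry("load", 0x20, 3), Entry("store", 0x24, 9)]
print(verify_in_lsq(lsq, Entry("store", 0x20, 3)))
print(verify_in_lsq(lsq, Entry("store", 0x24, 1)))
print(verify_in_lsq(lsq, Entry("store", 0x28, 0)))
```

The three calls cover a store squashed via an older load (WAR), a store found non-silent against an older store (WAW), and a store with no alias in the queue.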

LSQ Spatial Locality
  • Obtaining a wide datapath to L1-D$ is possible due to on-chip caches
  • Assume a memory reference can provide an entire cacheline of data
  • Exploit spatial locality to issued memory references
      • WAR: Load allocates an entire line
      • WAW: Use read port stealing to allocate
      • RAW: Load allocates an entire line
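If each memory reference brings in a whole cache line, a later store to any word of that line can be checked without another cache access. The sketch below assumes a 32-byte line, 4-byte aligned words, and a plain dictionary of allocated lines; these parameters and names are illustrative and are not taken from the slides.

```python
LINE_BYTES = 32   # assumed line size for this sketch
WORD_BYTES = 4    # assumed word size; addresses must be word aligned


def line_of(address):
    return address // LINE_BYTES


def word_index(address):
    return (address % LINE_BYTES) // WORD_BYTES


def verify_with_line(lines, address, value):
    """`lines` maps a line number to the list of words that a prior
    reference brought in. Returns True or False when the store can be
    verified from an allocated line, else None."""
    words = lines.get(line_of(address))
    if words is None:
        return None
    return words[word_index(address)] == value


# A load to 0x40 is assumed to have allocated its whole line.
lines = {line_of(0x40): [0, 1, 2, 3, 4, 5, 6, 7]}
print(verify_with_line(lines, 0x48, 2))
print(verify_with_line(lines, 0x4C, 9))
print(verify_with_line(lines, 0x80, 0))
```

The stores to 0x48 and 0x4C hit the line allocated by the load to 0x40 even though no reference to those exact addresses is in the queue.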

LSQ Squashing: Silent Stores Captured
Over 90% of silent stores captured; greater than 40% in most cases using locality in LSQ

Memory Storage Soft Errors
  • Detecting and correcting soft errors is becoming more important
      • Deep-submicron manufacturing
      • Uptime & system reliability concerns
  • Many methods exist for ECC:
      • Rely on redundancy for detection/correction
          • Coding: Keep extra bits that allow both detection & correction
          • Explicit copies: Keep multiple copies with extra bits for detection, correct by loading the copy

Redundant Data ECC for L1-D$
Duplication of parity protected L1-D$
  [Diagram: load datapath (address into two copies of the L1-D$ w/parity, each followed by a parity check with Ok / !Ok outcomes) and store datapath (address and data into both copies of the L1-D$ w/parity)]
  • High overhead: 100% over L1-D$ with parity
  • 2x read bandwidth vs. write bandwidth
  • Leads to configurations with higher load throughput
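The "explicit copies" idea, detection by parity and correction by reading the other copy, can be sketched roughly as below. The single even-parity bit per word and the function names are assumptions; the slides do not specify the parity granularity.

```python
def parity(word):
    """Even parity bit of a non-negative integer."""
    return bin(word).count("1") & 1


def protected_load(copy_a, copy_b):
    """Each copy is (data, stored_parity). Return the data from the first
    copy whose parity checks; raise if both copies fail."""
    for data, stored in (copy_a, copy_b):
        if parity(data) == stored:
            return data
    raise ValueError("uncorrectable: both copies fail parity")


good = (0b1011, parity(0b1011))
corrupted = (0b1010, good[1])       # one bit flipped, parity bit unchanged
print(protected_load(corrupted, good))
```

Stores must update both copies while a load needs only one good copy, which is consistent with the slide's note about 2x read bandwidth relative to write bandwidth.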

Redundant Data ECC for L1-D$
Write-through parity protected L1-D$ with inclusive (ECC code protected) L2
  [Diagram: load datapath (address into L1-D$ w/parity, parity check with Ok / !Ok outcomes, L2 w/ECC) and store datapath (address and data into L1-D$ w/parity and L2 w/ECC)]
  • Write-through creates high demand on the L1-L2 interface
  • Can use previous FSSS techniques to reduce stores (and hence write-throughs)

Coding ECC for L1-D$
  • Protect the L1-D$ with ECC directly
  • ECC-data words relatively large to reduce overhead (ex: 64-bit in 21264, RS64-III)
  • Sub-ECC-word stores consist of four operations: Read original ECC-word, Merge, ECC-gen, Write
  [Diagram: sub-ECC-word store datapath with address, data register, D-cache (data bits, check bits), ECC logic (correction), ECC logic (generate)]

ECC L1 Free Silent Store Squash
  • If sub-ECC-word stores are read-modify-writes, why not squash?
  [Diagram: sub-ECC-word store datapath with ECC-FSSS; address, data register, D-cache (data bits, check bits), ECC logic (correction), ECC logic (generate), and a comparison (=) producing the write condition (!Silent || ECC Error)]
  • Store verify in parallel with correction & check bit generation gives ECC-FSSS
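The read, merge, compare step of a sub-ECC-word store can be sketched as below. The function name, the little-endian byte numbering, and the `ecc_error` flag passed in from outside are assumptions; check bit generation is omitted, and `ecc_word` stands for the word after any correction.

```python
def sub_word_store(ecc_word, offset_bytes, size_bytes, value, ecc_error=False):
    """Merge a `size_bytes` store into a 64-bit ECC data word.

    Returns (merged_word, write_enable). The write is skipped only when
    the merged word equals the word already read and no ECC error was
    seen, mirroring the slide's condition (!Silent || ECC Error).
    """
    shift = 8 * offset_bytes                       # byte 0 is the low byte
    mask = ((1 << (8 * size_bytes)) - 1) << shift
    merged = (ecc_word & ~mask) | ((value << shift) & mask)
    silent = merged == ecc_word
    write_enable = (not silent) or ecc_error
    return merged, write_enable


word = 0x1122334455667788
print(sub_word_store(word, 1, 1, 0x77))                  # silent byte store
print(sub_word_store(word, 1, 1, 0x00))                  # non-silent byte store
print(sub_word_store(word, 1, 1, 0x77, ecc_error=True))  # silent, but corrected
```

Because the original word has to be read for the merge anyway, the comparison adds no extra cache access in this model, which is the sense in which the squash is free.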

Effectiveness of ECC-L1 FSSS
  • Can detect 100% of sub-ECC-word L1-D$ hits
      • store-byte (8b), store-half (16b), store-word (32b) in 64b-ECC-data-word µ-arches
  • Can also capture many more which might not be so obvious
      • IBM RS64-III (Pulsar) has maximal 32b integer stores in 32b mode (common for user programs)
      • All of these can be captured with ECC-FSSS

Increasing Write-Through Bandwidth via FSSS
  • We expect squashing silent stores to reduce pressure on the L1-L2 interface
  • Can we implement a narrower/slower L1-L2 physical interface and exploit FSSS for greater effective interface bandwidth?
      • Potentially reduce power consumption
      • Ease circuit & physical design

Increasing Write-Through BW: Write-Through Reduction
15% average write-through traffic reduction
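As a back-of-the-envelope illustration of "effective" bandwidth, the sketch below assumes the interface carries only write-through traffic and that a fixed fraction of it is squashed. Both assumptions are simplifications introduced here (a real L1-L2 interface also carries fills), and the function name is hypothetical.

```python
def effective_bandwidth(physical_bytes_per_cycle, squashed_fraction):
    """Bandwidth seen by the original store stream when a fraction of its
    write-throughs never reaches the interface (assumes write-only traffic)."""
    if not 0.0 <= squashed_fraction < 1.0:
        raise ValueError("squashed_fraction must be in [0, 1)")
    return physical_bytes_per_cycle / (1.0 - squashed_fraction)


# 32B interface from the machine model, 15% of write-throughs squashed.
print(round(effective_bandwidth(32, 0.15), 1))
```

Under these assumptions a 15% traffic reduction makes the interface look about 18% wider to the store stream.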

Increasing Write-Through BW: IPC
75% lower physical BW+FSSS yields 9% IPC improvement over fast physical interface without FSSS

Conclusions
  • Standard store verifies are expensive
  • Three methods of squashing silent stores for reduced cost
      • Using read port stealing
      • Exploiting temporal and spatial locality in the LSQ
      • Using ECC logic in the L1 data cache
  • These methods verify a large fraction of silent stores for non-trivial speedups
  • Trade implementation of silent store squashing for higher physical BW between L1 & L2

Current and Future Work
  • Silent stores in MPs, as well as program structure and message passing store value locality [Lepak & Lipasti, ISCA-2k]
  • Characterizing & critical silent stores [Bell et al., PACT-2k]
  • Silence confidence mechanism(s)
  • Exploiting predictable stores in MP systems
  • Applying all types of store value locality in different system paradigms...

Backup Slides

Read Port Stealing: IPC
HM improvement of 10%, 0-56% range across benchmarks

LSQ Squashing: IPC
HM improvement of 11%, 0-56% range across benchmarks

LSQ Temporal: Silent Stores Captured
Captures an average of 30% of silent stores across benchmarks

FSSS Method Comparison
  • Read port stealing and LSQ squashing provide similar performance results
      • However, LSQ squashing reduces the percent of store verifies issued to the memory system by 50%
  • Temporal LSQ squashing is not effective in isolation for this machine
      • May be useful to reduce sharing
  • ECC squashing is truly free

LSQ Cache Design
  • Assume FIFO LSQ cache operated in lock-step with LSQ
      • Avoids explicit tags, replacement policy considerations
      • MPs: Flush on memory barriers (WC)
      • MPs: Use existing LSQ logic for SC to invalidate (e.g. R10K)

Terminology
  • Program Structure Store Value Locality (PSSVL): The value locality exhibited by a given static store (can write to many addresses)
  • Message Passing Store Value Locality (MPSVL): The value locality exhibited for a specific memory location (can be written by many PCs)
  • Stochastically Silent Store: A store value which is trivially predictable by any well known method
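The difference between the two kinds of locality is only the key under which store values are tracked. The sketch below uses a last-value check as a stand-in for "any well known method"; that choice, the function name, and the trace are assumptions for illustration and not the authors' measurement method.

```python
def last_value_hits(trace, key_index):
    """Count stores whose value repeats the last value seen for the same key.

    trace: list of (pc, address, value) records.
    key_index 0 keys by static store (PC), as in PSSVL;
    key_index 1 keys by memory location, as in MPSVL.
    """
    last = {}
    hits = 0
    for record in trace:
        key, value = record[key_index], record[2]
        if last.get(key) == value:
            hits += 1
        last[key] = value
    return hits


store_trace = [
    (0x400, 0x10, 1),
    (0x400, 0x14, 1),
    (0x404, 0x10, 1),
    (0x404, 0x14, 2),
    (0x400, 0x18, 1),
]
print(last_value_hits(store_trace, 0))   # keyed by PC
print(last_value_hits(store_trace, 1))   # keyed by address
```

The store at PC 0x400 keeps writing 1 to different addresses, so it is predictable per static store even when the target location has never been written before.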

MPSVL and PSSVL
Percentage of stochastically silent (PSSVL, MPSVL) stores is non-trivial: 27%-72% for PSSVL, 39%-70% for MPSVL

Multiprocessor Sharing
  • Measurable reduction in true/false sharing for simple update silent squashing (UFS)
  • Substantial reductions by squashing update silent store hits and misses (UFS-P) and stochastically silent stores (SFS)
  • Squashing store misses (UFS-P) can be substantially better than simple UFS
      • Motivates silence confidence mechanism for store misses

Multiprocessor Traffic
  • Measurable reduction in invalidate traffic for simple update silent store squashing (UFS), more effective than Exclusive state
      • Substantial reduction for UFS-P and Stochastic False Sharing (SFS)
  • Writeback data traffic reduction by squashing update silent store hits and misses (UFS-P)
      • 5%-82% in oltp
      • 16%-17% in ocean
      • 5%-16% in barnes

Multiprocessor Sharing