Access Region Locality for High-Bandwidth Processor Memory System Design

32nd Annual International Symposium on Microarchitecture

Sangyeun Cho (Samsung / U of Minnesota)
Pen-Chung Yew (U of Minnesota)
Gyungho Lee (Iowa State U)

Big Picture

MICRO-32, November 17, 1999. Cho, Yew, and Lee.

On-Chip D-Cache Bandwidth Problem

Wide-Issue Superscalar Processors
• Current Generation
  – Alpha 21264
  – Intel’s Merced
• Future Generation (IEEE Computer, Sept. ’97)
  – Superspeculative Processors
  – Trace Processors

Multi-Ported Data Cache
• Replicated Cache
  – Alpha 21164
• Time-Division Multiplexed Cache
  – Alpha 21264
• Interleaved Cache
  – MIPS R10K

Window Logic Complexity
• Pointed out as the major hardware complexity (Palacharla et al., ISCA ’97)
• More severe for the memory window
  – Difficult to partition
  – Thick network needed to connect RSs and LSUs

Data Decoupling

Data Decoupling: What is it?
• A divide-and-conquer approach
  – Instruction stream partitioned before entering RS
  – Narrower networks
  – Fewer ports to each cache
  – Needs a mechanism for proper partitioning

Data Decoupling: Operating Issues
• Memory Stream Partitioning
  – Hardware classification
  – Compiler classification
• Load Balancing
  – Enough instructions in different groups?
  – Are they well interleaved?

Access Region Locality & Access Region Prediction

Access Region: Overview
• Access Region R
  – R = (L, U)
  – L: lower bound on address
  – U: upper bound on address
• If (D < A) or (B < C), with R = (A, B) and Q = (C, D),
  – Regions R and Q are said to be exclusive or nonoverlapping.
• Locations in exclusive regions are independent.
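
The exclusivity test above can be sketched in a few lines of Python. This is only an illustration: the names `AccessRegion` and `are_exclusive` and the example bounds are assumptions, and the bounds are treated as inclusive, which the slide does not state.

```python
from typing import NamedTuple


class AccessRegion(NamedTuple):
    lower: int  # L: lower bound on address (assumed inclusive)
    upper: int  # U: upper bound on address (assumed inclusive)


def are_exclusive(r: AccessRegion, q: AccessRegion) -> bool:
    """Return True when the two regions share no address.

    With R = (A, B) and Q = (C, D) this is (D < A) or (B < C).
    """
    return q.upper < r.lower or r.upper < q.lower


# Hypothetical bounds, chosen only for illustration.
stack = AccessRegion(0x7FFF0000, 0x7FFFFFFF)
data = AccessRegion(0x10000000, 0x1000FFFF)
print(are_exclusive(stack, data))
```

Because the test is symmetric, the order of the two arguments does not matter.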

Access Region and Mem. Instructions

Partitioning Memory Space
• One way of partitioning memory space into regions:
  – Data Region / Heap Region / Stack Region
• This work assumes this partitioning.
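
As a rough sketch of this three-way partitioning, an address can be mapped to a region by comparing it against two boundaries. The boundary values and the names `MemRegion`, `DATA_END`, `STACK_LIMIT`, and `classify_address` are hypothetical; the slides do not give an address layout.

```python
from enum import Enum


class MemRegion(Enum):
    DATA = "data"
    HEAP = "heap"
    STACK = "stack"


# Hypothetical layout: both boundaries are illustrative assumptions.
DATA_END = 0x1000_4000     # first address past the statically allocated data
STACK_LIMIT = 0x7000_0000  # lowest address the stack is allowed to reach


def classify_address(addr: int) -> MemRegion:
    """Map an address to Data, Heap, or Stack under the assumed layout."""
    if addr < DATA_END:
        return MemRegion.DATA   # everything below DATA_END counts as Data here
    if addr >= STACK_LIMIT:
        return MemRegion.STACK
    return MemRegion.HEAP       # whatever lies between the two boundaries
```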

Partitioning Memory Space, Cont’d
[Chart omitted; values in %]
• Many accesses are toward Data and Stack regions.
• Some programs don’t access the Heap region at all.

Partitioning Memory Space, Cont’d
• Window Size = 32
• Accesses to Data region are less bursty than others.
• Programs such as ijpeg have clustered region accesses.

Partitioning Memory Space, Cont’d
• Window Size = 64
• With a large window, Stack accesses become less bursty.
• Data and Stack regions have quite stable, constant demand.

Partitioning Memory Space, Cont’d
[Chart omitted; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int. Avg, FP. Avg; data labels shown: 1.8%, 1.9%, 50.4%, 51.1%, 1.6%, 16.2%, 45.4%, 31.6%]
• Many instructions access a single region (~98%).
• Multi-region-accessing instructions account for 0 to 9.6% of dynamic memory references.

Access Region Locality
• “A memory reference instruction typically accesses a single region at run time”
  – Only about 2% of all static memory instructions access more than a single region.
• “(Thus) the region it accesses is highly predictable”
  – Simple predictors with a small look-up table achieve high prediction accuracy.
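
One way to measure the first claim on a memory trace is to count how many static instructions (distinct PCs) ever touch more than one region. The sketch below assumes a trace of `(pc, region)` pairs; the function name and the tiny trace are made up for illustration and are not the authors' measurement setup.

```python
from collections import defaultdict


def multi_region_fraction(trace):
    """Fraction of static memory instructions that touched more than one region.

    `trace` is an iterable of (pc, region) pairs, one per dynamic access.
    """
    regions_by_pc = defaultdict(set)
    for pc, region in trace:
        regions_by_pc[pc].add(region)
    if not regions_by_pc:
        return 0.0
    multi = sum(1 for regions in regions_by_pc.values() if len(regions) > 1)
    return multi / len(regions_by_pc)


# Made-up trace: only the instruction at 0x18 touches two regions.
example_trace = [
    (0x10, "stack"), (0x10, "stack"),
    (0x14, "data"),
    (0x18, "heap"), (0x18, "data"),
    (0x1C, "data"),
]
print(multi_region_fraction(example_trace))
```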

Predicting Regions: Unlimited Case
• One predictor per memory instruction
• Predictor types:
  – 1-bit history saver (0: Data, 1: Stack)
  – 2-bit saturating counter
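
The two predictor types can be sketched as follows, with one predictor kept per instruction address to model the unlimited case. The class names, the initial state of 0, and the 2-bit threshold (predict Stack when the counter is 2 or 3) are assumptions; the slide only names the predictor types.

```python
from collections import defaultdict


class OneBitPredictor:
    """Predicts the region seen last time: 0 = Data, 1 = Stack."""

    def __init__(self) -> None:
        self.last = 0  # assumed initial state

    def predict(self) -> int:
        return self.last

    def update(self, actual: int) -> None:
        self.last = actual


class TwoBitPredictor:
    """Saturating counter in 0..3; predicts Stack when the counter is 2 or 3."""

    def __init__(self) -> None:
        self.counter = 0  # assumed initial state

    def predict(self) -> int:
        return 1 if self.counter >= 2 else 0

    def update(self, actual: int) -> None:
        if actual == 1:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def accuracy(predictor, outcomes) -> float:
    """Run one instruction's outcome sequence through a predictor."""
    hits = 0
    for actual in outcomes:
        hits += predictor.predict() == actual
        predictor.update(actual)
    return hits / len(outcomes)


# Unlimited case: one predictor per memory instruction, keyed by PC.
predictors = defaultdict(OneBitPredictor)
```

On a short run of identical outcomes the 1-bit scheme adapts after a single miss, while the 2-bit counter needs two updates to cross its threshold.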

Predicting Regions: Adding Context
• Run-time context
  – Caller’s ID (CID): in Link Register
  – Global Branch History (GBH)
  – Hybrid of above

Predicting Regions: Utilizing Static Info.
• Some instructions’ access regions are revealed through architecture and compiler conventions:
  – Use of Stack Pointer ($SP) or Frame Pointer ($FP) suggests that the region is Stack.
  – Use of Global Pointer ($GP) suggests that the region is non-Stack.
  – For others, assume non-Stack.
• Directly exporting some high-level region information from compiler to processor may improve prediction accuracy.
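
The register conventions above amount to a lookup on the base register of a load or store. A minimal sketch, assuming the base register is available as a string such as "$sp"; the function names and return values are illustrative only.

```python
STACK_BASE_REGS = {"$sp", "$fp"}   # stack pointer, frame pointer
NON_STACK_BASE_REGS = {"$gp"}      # global pointer


def revealed_by_decode(base_reg: str) -> bool:
    """True when the base register alone reveals the access region."""
    reg = base_reg.lower()
    return reg in STACK_BASE_REGS or reg in NON_STACK_BASE_REGS


def static_predict(base_reg: str) -> str:
    """Static region guess from the base register of a memory instruction."""
    if base_reg.lower() in STACK_BASE_REGS:
        return "stack"
    # $gp suggests non-Stack, and every other register defaults to non-Stack.
    return "non-stack"
```

Instructions for which `revealed_by_decode` is true need no dynamic predictor state, which is one plausible reading of how decoding information can save table space.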

Region Pred. Result: Unlimited Case
[Chart omitted; series: Static, Simple 1-bit, w/ GBH, w/ CID, w/ Hybrid; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int. Avg, FP. Avg]
• 1-bit predictors do better than 2-bit predictors (not shown).
• Hybrid context bits achieve the best prediction rate on average.

Predicting Regions: Limited-Size ARPT
• The low n bits of the PC, XOR’ed with hybrid context bits, are used to index into the Access Region Prediction Table (ARPT):
  – Table entries are initialized to 0’s.
  – 1 denotes a stack access.
  – Decoding information is exploited to save ARPT space.
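
The indexing scheme can be sketched as a small table of 1-bit entries. How the hybrid context bits are formed is not given on the slide, so the sketch takes them as an integer supplied by the caller; the class name `ARPT`, its method names, and the example PC and context values are assumptions.

```python
class ARPT:
    """Sketch of an Access Region Prediction Table with 1-bit entries."""

    def __init__(self, index_bits: int) -> None:
        self.mask = (1 << index_bits) - 1
        # One byte per entry for simplicity; all entries start at 0 (non-stack).
        self.table = bytearray(1 << index_bits)

    def _index(self, pc: int, context: int) -> int:
        # Low n bits of the PC XOR'ed with the context bits.
        # A real design might first drop the instruction-alignment bits.
        return (pc ^ context) & self.mask

    def predict(self, pc: int, context: int) -> int:
        """Return 1 for a predicted stack access, 0 otherwise."""
        return self.table[self._index(pc, context)]

    def update(self, pc: int, context: int, is_stack: bool) -> None:
        """Record the region actually accessed (1-bit history)."""
        self.table[self._index(pc, context)] = 1 if is_stack else 0


arpt = ARPT(index_bits=10)            # 1024 entries, size chosen arbitrarily
arpt.update(0x4001A4, 0x2B, is_stack=True)
print(arpt.predict(0x4001A4, 0x2B))   # same PC and context hit the same entry
```

The same PC under a different context maps to a different entry, which is how context lets one static instruction hold more than one prediction.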

Region Prediction Result: ARPT
[Chart omitted; series: Unlimited, 8 KB, 4 KB, 2 KB, 1 KB; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int. Avg, FP. Avg]
• Over 99.9% accuracy with a 4 KB or larger ARPT, without compiler hints.
• Compiler hints relieve pressure due to smaller sizes.

Dynamic Data Decoupling

Dynamic Data Decoupling, Cont’d
• Dynamically predicting access regions to classify memory instructions:
  – Utilize the Access Region Prediction Table (ARPT).
  – Utilize any region information revealed through instruction decoding.
• Dispatching partitioned memory instructions into separate memory pipelines, connected to separate caches.
• Dynamically verifying region prediction:
  – Let the TLB (i.e., page table) contain verification information such that a memory access is reissued on mispredictions.
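
The dispatch-and-verify flow can be modelled at a very high level as below. This is a behavioural sketch, not the hardware described in the talk: the TLB is reduced to a dictionary from page number to an "is stack page" bit, a 4 KB page size is assumed, and the names `MemPipeline`, `page_is_stack`, and `execute` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MemPipeline:
    name: str
    issued: list = field(default_factory=list)  # (pc, addr) pairs sent here


def page_is_stack(tlb: dict, addr: int, page_bits: int = 12) -> bool:
    """Look up the verification bit for the page holding `addr`."""
    return tlb[addr >> page_bits]


def execute(pc, addr, predicted_stack, tlb, stack_pipe, data_pipe) -> bool:
    """Dispatch to the predicted pipeline; reissue to the other on a misprediction.

    Returns True when the region prediction was correct.
    """
    first = stack_pipe if predicted_stack else data_pipe
    first.issued.append((pc, addr))
    if page_is_stack(tlb, addr) != predicted_stack:
        other = data_pipe if predicted_stack else stack_pipe
        other.issued.append((pc, addr))  # reissue after failed verification
        return False
    return True


# Hypothetical two-page TLB: one stack page, one data page.
tlb = {0x7FFF0: True, 0x10000: False}
stack_pipe, data_pipe = MemPipeline("stack"), MemPipeline("data")
mispredicted_ok = execute(0x400, 0x7FFF0123, False, tlb, stack_pipe, data_pipe)
predicted_ok = execute(0x404, 0x10000010, False, tlb, stack_pipe, data_pipe)
```

In this model a misprediction costs one extra issue into the other pipeline, so the benefit of decoupling depends on the prediction accuracy reported earlier.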

Base Machine Model

Overall Performance
[Chart omitted; axis label: Over (2+0) conf.; benchmarks: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, tomcatv, swim, su2cor, mgrid, Int. Avg, FP. Avg]

Conclusions
• Access Region Locality says
  – Memory instructions access few regions at run time.
  – Accessed regions are accurately predictable.
• Access Region Locality leads to Access Region Prediction techniques.
• Access Region Prediction allows Dynamic Data Decoupling, shown to achieve comparable performance to very wide data caches.

Any Questions?

Impact of LVC Size
[Chart omitted; LVC sizes: 0.5 K, 1 K, 2 K, 4 K]
• 2 KB and 4 KB LVCs achieve high hit rates (~99.9%).
• Set associativity is less important if the LVC is 2 KB or more.
• A small, simple LVC works well.