Design Exploration of an Instruction-Based Shared Markov Table on CMPs
Karthik Ramachandran & Lixin Su
Outline
• Motivation
  - Multiple cores on a single chip
  - Commercial workloads
• Our study
  - Start from instruction sharing pattern analysis
    - Our experiments
  - Move on to instruction cache miss pattern analysis
    - Our experiments
• Conclusions
Motivation
• Technology push: CMPs
  - Lower access latency to other processors
• Application pull: commercial workloads
  - OS behavior
  - Database applications
• Opportunities for shared structures
  - Markov-based sharing structure
  - Address large instruction footprint vs. small, fast I-caches
Instruction Sharing Analysis
• How might instruction sharing occur?
  - OS: multiple processes, scheduling
  - DB: concurrent transactions, repeated queries, multiple threads
• How can CMPs benefit from instruction sharing?
  - Snoop/grab instructions from other cores
  - Shared structures
• Let's investigate.
Methodology
• Two-step approach
  - Experiment I
    - Targets instruction trace analysis
    - How much sharing occurs?
  - Experiment II
    - Targets I-cache miss stream analysis
    - Examines the potential of a shared Markov structure
Experiment I
• Add instrumentation code to analyze committed instructions
• Focus on repeated sequences of 2, 3, 4, and 5 instructions across 16 processors
• Histogram-based approach
• How do we count? (example: sequence {A, B} observed on P1-P4)
  - P1: 3 times
  - P2: 1 time
  - P3: 0 times
  - P4: 2 times
  - Total: 10 times
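The histogram-based counting can be sketched roughly as follows. This is a minimal illustration, not the authors' instrumentation code: the names `sequence_histogram` and `shared_sequences`, the trace format (one list of instruction addresses per processor), and the sliding-window counting are all assumptions. It reports per-processor counts only and does not attempt to reproduce the slide's total, whose exact tallying rule is not preserved in the text.

```python
from collections import Counter, defaultdict

def sequence_histogram(traces, n):
    """Count every length-n instruction sequence per processor.

    traces: dict mapping a processor id to its list of committed
    instruction addresses (hypothetical input format).
    Returns: dict mapping each sequence (a tuple) to a Counter of
    {processor id: number of occurrences}.
    """
    hist = defaultdict(Counter)
    for proc, trace in traces.items():
        # Slide a window of n instructions over the trace (assumed rule).
        for i in range(len(trace) - n + 1):
            hist[tuple(trace[i:i + n])][proc] += 1
    return hist

def shared_sequences(hist):
    """Keep only sequences that were seen on more than one processor."""
    return {seq: counts for seq, counts in hist.items() if len(counts) > 1}

# Made-up traces, for illustration only.
traces = {
    "P1": ["A", "B", "A", "B", "A", "B"],
    "P2": ["A", "B", "C"],
    "P3": ["C", "D"],
    "P4": ["A", "B", "X", "A", "B"],
}
hist = sequence_histogram(traces, 2)
shared = shared_sequences(hist)
```

Running the same function with n = 2, 3, 4, and 5 would give one histogram per sequence length, which is the comparison the experiment focuses on.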
Results - Experiment I
Q) Is there any instruction sharing?
A) Maybe. Observe the number of times the sequences of 2-5 instructions repeat (~13,000-17,000).
Q) But why do the numbers for 5-instruction sequences not differ much from those for 2-instruction sequences?
A) Spin loops!
• Non-warm-up case: 50%
• Warm-up case: 30%
Experiment II
• Focus on instruction cache misses
  - Is there sharing involved here too?
  - What is the upper-bound performance benefit of a shared Markov table?
• Experiment setup
  - 16K-entry fully associative shared Markov table
  - Each entry holds two consecutive misses from the same processor
  - Atomic lookup and hit/miss counter update whenever a processor has two consecutive I$ misses
  - On a miss, insert a new entry at the LRU head
  - On a hit, record the distance from the LRU head and move the hit entry to the LRU head
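The experiment setup can be modeled with a small simulator sketch. This is an illustration under stated assumptions, not the authors' simulator: the class name `SharedMarkovTable`, its methods, the sliding-window pairing of consecutive misses, and eviction from the LRU tail when the table is full are all assumed. Distance 0 means the entry was already at the LRU head.

```python
from collections import OrderedDict

class SharedMarkovTable:
    """Fully associative table of miss pairs shared by all processors."""

    def __init__(self, capacity=16 * 1024):
        self.capacity = capacity
        # Keys are (first_miss, second_miss) pairs. The last key is the
        # LRU head (most recently used); the first key is the LRU tail.
        self.entries = OrderedDict()
        self.last_miss = {}      # previous I$ miss address per processor
        self.hits = 0
        self.misses = 0
        self.hit_distances = []  # distance from the LRU head on each hit

    def record_miss(self, proc, addr):
        """Feed one I$ miss; look up the pair formed with the previous miss."""
        prev = self.last_miss.get(proc)
        self.last_miss[proc] = addr
        if prev is None:
            return None          # first miss of this processor: no pair yet
        return self.lookup((prev, addr))

    def lookup(self, pair):
        """Return the hit distance from the LRU head, or None on a miss."""
        if pair in self.entries:
            keys = list(self.entries)
            distance = len(keys) - 1 - keys.index(pair)
            self.hits += 1
            self.hit_distances.append(distance)
            self.entries.move_to_end(pair)    # move hit entry to LRU head
            return distance
        self.misses += 1
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # assumed: evict the LRU tail
        self.entries[pair] = True             # insert new entry at LRU head
        return None

# Made-up miss stream: three processors touching the same pair (A, B).
table = SharedMarkovTable(capacity=4)
for proc, addr in [(0, "A"), (0, "B"), (1, "A"), (1, "B"),
                   (0, "C"), (2, "A"), (2, "B")]:
    table.record_miss(proc, addr)
```

In this toy stream, processor 0 installs the pair (A, B) and processors 1 and 2 later hit on it, which is the kind of constructive sharing the experiment tries to measure. The linear scan used to compute the distance is acceptable for a sketch but would be slow for a 16K-entry table fed by a long trace.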
Design Block Diagram
• Small, fast shared Markov table
• Prefetch when an I$ miss occurs
[Figure: block diagram showing two processors (P), their I$s, the Markov table, and the L2$]
Table Lookup Hit Ratio
Q1) Is there a lot of miss sharing?
Q2) Does a constructive interference pattern exist to help a CMP?
Q3) Do equal opportunities exist for all processors?
Let’s Answer the Questions
A1) Yes, of course.
A2) A constructive interference pattern definitely exists, as the figure shows.
A3) Yes. The hit/miss ratio remains fairly stable across processors despite the variance in the number of I-cache misses.
How Big Should the Table Be?
• About 60% of hits are within 4K entries of the LRU head.
• A shared Markov table can exploit I-cache miss sharing fairly well.
• What about snooping and grabbing instructions from other I-caches?
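A table-sizing figure like the one above can be derived from recorded hit distances. The helper below is a hypothetical sketch (the name `hit_fraction_within` and the convention that distance 0 is the LRU head are assumptions), and the sample distances are made up, not measured data.

```python
def hit_fraction_within(distances, limit):
    """Fraction of hits whose distance from the LRU head is below `limit`.

    distances: hit distances from the LRU head (0 = at the head).
    limit: candidate table size in entries, e.g. 4 * 1024.
    """
    if not distances:
        return 0.0
    return sum(1 for d in distances if d < limit) / len(distances)

# Illustrative distances only: two of the four hits fall within 2 entries.
sample = [0, 1, 2, 3]
fraction = hit_fraction_within(sample, 2)
```

Sweeping `limit` over candidate sizes shows how many of the observed hits a smaller table with the same LRU policy would still capture.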
Real Design Issues
• Associativity and size of the table
• Choosing the right path when multiple paths exist
• Separating the address directory from the table's data entries, with multiple address directories
• What if a sequential prefetcher exists?
Conclusions
• Instruction sharing on CMPs exists. Spin loops occur frequently with current workloads.
• A Markov-based structure for storing I-cache misses may be helpful on CMPs.
Questions?
Comparison with Real Markov Prefetching
• Misses to A & C, then a lookup in the table
• Update hit/miss counters and change/record LRU position
• Miss to A and prefetch along A, B & C
[Figure: table of miss-address pairs ordered from LRU head to LRU tail, with Hit Cnt 2 and Miss Cnt 3, shown alongside a Markov prefetch table with per-entry counts (Cnt)]
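For contrast with the counting-only experiment, a real Markov prefetcher acts on each miss. The sketch below is a generic, simplified illustration of that idea and not a design taken from the slides: the class name `MarkovPrefetcher`, the `width` parameter, and the unbounded per-address successor counters are all assumptions.

```python
from collections import Counter, defaultdict

class MarkovPrefetcher:
    """On each miss, suggest the most frequent past successors to prefetch."""

    def __init__(self, width=2):
        self.width = width                      # max prefetches per miss
        self.successors = defaultdict(Counter)  # miss addr -> successor counts
        self.prev = None                        # previous miss address

    def on_miss(self, addr):
        # Learn the transition from the previous miss to this one.
        if self.prev is not None:
            self.successors[self.prev][addr] += 1
        self.prev = addr
        # Predict: the most frequently seen successors of this miss.
        return [a for a, _ in self.successors[addr].most_common(self.width)]

# Made-up miss stream in which A is followed by B twice and by C once.
prefetcher = MarkovPrefetcher(width=2)
for miss in ["A", "B", "A", "C", "A", "B"]:
    prefetcher.on_miss(miss)
candidates = prefetcher.on_miss("A")
```

Here the final miss to A yields B and C as prefetch candidates, mirroring the slide's "miss to A and prefetch along A, B & C", whereas the experiment's table only records whether a pair of consecutive misses had been seen before.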
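Lookup Example I
• The processor misses on A and then C, and the pair (A, C) is looked up in the table.
• The lookup hits: Hit Cnt goes from 2 to 3 and Miss Cnt stays at 3.
[Figure: table contents from LRU head to LRU tail, before and after the lookup]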
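Lookup Example II
• The processor misses on C and then D, and the pair (C, D) is looked up in the table.
• The lookup misses: Hit Cnt stays at 2 and Miss Cnt goes from 3 to 4.
[Figure: table contents from LRU head to LRU tail, before and after the lookup]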