Trace Caches J. Nelson Amaral

Difficulties in Instruction Fetching
• Where to fetch the next instruction from?
  – Use branch prediction
    • Sometimes there is a misprediction
• Likely can only fetch from one I-cache line
  – m instructions may spread over two lines
• I-cache misses are even worse
• Taken branches
  – target address may be in the middle of a cache line
    • instructions before the target must be discarded
  – the remainder of the m instructions fetched needs to be discarded
Baer p. 159

Getting more from I-cache
• How can we increase the probability that more of the m instructions needed are in a cache line?
  – Increase cache line size
    • Increasing it too much increases cache misses
  – Fetch “next” line
    • What is “next” in a set-associative cache?
    • replacements:
      – next line does not contain the right instructions
      – address checking and repair needed even with no branches
Baer p. 159

Line-and-way Predictor
• Instead of predicting a Branch Target Address, predict the next line and set in the I-cache.
  – Called a Next Cache Line and Set (NLS) Predictor by Calder and Grunwald (ISCA 95)
• NLS-cache: associate predictor bits with a cache line
• NLS-table: store the predictor bits in a separate direct-mapped tag-less buffer
• Effective for programs containing many branches
Baer p. 160

NLS: tag-less table of pointers into the instruction cache, each pointing to the next instruction to be executed
[Figure labels: NLS-Cache; three predicted addresses]
NLS also predicts indirect branches and provides the branch type.
Needs an early distinction between branch and non-branch instructions.
Calder, Brad and Grunwald, Dirk, Next Cache Line and Set Prediction, International Symposium on Computer Architecture (ISCA), 1995, 287-296.
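
The NLS-table idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the organization from Calder and Grunwald: the class names, the 4-byte instruction size, the table size, and the branch-type strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NLSEntry:
    line: int         # predicted I-cache line
    way: int          # predicted way (set-associative position) of that line
    branch_type: str  # hypothetical labels, e.g. "cond", "return", "other"


class NLSTable:
    """Direct-mapped, tag-less table of next-line-and-way predictions."""

    def __init__(self, num_entries: int = 1024):
        self.num_entries = num_entries
        self.entries: List[Optional[NLSEntry]] = [None] * num_entries

    def _index(self, pc: int) -> int:
        # Assumption: fixed 4-byte instructions, so drop the two low PC bits.
        return (pc >> 2) % self.num_entries

    def predict(self, pc: int) -> Optional[NLSEntry]:
        # No tag check: two PCs that share an index share a prediction.
        return self.entries[self._index(pc)]

    def update(self, pc: int, line: int, way: int, branch_type: str) -> None:
        self.entries[self._index(pc)] = NLSEntry(line, way, branch_type)
```

Because the table has no tags, a lookup always returns whatever was last written at that index, so aliasing PCs silently receive each other's predictions. That is the price paid for a small, fast structure.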

Trace Caches
• A cache that records instructions in the sequence in which they were fetched (or committed)
  – PC indexes into the trace cache
  – If predictions are correct:
    • whole trace fetched in one cycle
    • all instructions in the trace are executed
Baer p. 161

Trace Caches Design Issues
• How to design the trace cache?
• How to build a trace?
• When to build a trace?
• How to fetch a trace?
• When to replace a trace?
Baer p. 161

Instruction Fetch with I-cache Baer p. 161

Fetch with I-cache and Trace Cache Baer p. 161

Trace Selection Criteria
• Number of conditional branches in a trace
  – number of consecutive correct predictions is limited
• Merging next block may exceed trace line
  – no partial blocks in a trace
• Indirect jumps and call-return instructions terminate traces
Baer p. 161
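
A minimal sketch of how a fill unit might apply the three criteria above. The `BasicBlock` type, the `ends_with` labels, and the limit of 3 conditional branches are assumptions made for illustration; the 16-instruction limit is the one assumed on the trace-tag slide.

```python
from dataclasses import dataclass
from typing import List

MAX_INSTRS = 16        # trace line capacity (from the trace-tag slide)
MAX_COND_BRANCHES = 3  # assumption: the slides give no specific limit


@dataclass
class BasicBlock:
    start_pc: int
    num_instrs: int
    ends_with: str  # hypothetical labels: "cond", "indirect", "call", "return", "none"


def build_trace(blocks: List[BasicBlock]) -> List[BasicBlock]:
    """Greedily merge consecutive basic blocks into one trace."""
    trace: List[BasicBlock] = []
    instrs = 0
    cond_branches = 0
    for bb in blocks:
        # No partial blocks: stop if the whole block does not fit.
        if instrs + bb.num_instrs > MAX_INSTRS:
            break
        # Limit the number of conditional branches predicted per trace.
        if bb.ends_with == "cond" and cond_branches == MAX_COND_BRANCHES:
            break
        trace.append(bb)
        instrs += bb.num_instrs
        if bb.ends_with == "cond":
            cond_branches += 1
        # Indirect jumps and call/return terminate the trace.
        if bb.ends_with in ("indirect", "call", "return"):
            break
    return trace
```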

Trace Tags
• What should be the tag of a trace?
• Is it sufficient to use the address of the first instruction as tag?
Baer p. 161

Tags for Trace Cache Entries
Assume a trace may contain up to 16 instructions.
There are two possible traces:
T1: B1-B2-B4
T2: B1-B3-B4
T1 and T2 start at the same address.
Possible solution: add the predicted branch outcomes to the trace tag.
Baer p. 162
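
The tagging idea can be illustrated with a toy lookup structure keyed by both the start address and the predicted branch outcomes. This is a sketch only: the `TraceCache` class is hypothetical, it ignores capacity and replacement, and which outcome of B1's branch leads to B2 or B3 is an assumption (not taken goes to B2 here).

```python
from typing import Dict, Optional, Sequence, Tuple


class TraceCache:
    """Toy trace cache: tag = (start address, predicted branch outcomes)."""

    def __init__(self) -> None:
        self.lines: Dict[Tuple[int, Tuple[bool, ...]], Tuple[str, ...]] = {}

    def fill(self, start_pc: int, outcomes: Sequence[bool], blocks: Sequence[str]) -> None:
        self.lines[(start_pc, tuple(outcomes))] = tuple(blocks)

    def lookup(self, start_pc: int, predicted: Sequence[bool]) -> Optional[Tuple[str, ...]]:
        # A hit requires the address AND the predicted outcomes to match.
        return self.lines.get((start_pc, tuple(predicted)))


tc = TraceCache()
# Assumption: B1 ends in a branch; not taken leads to B2, taken leads to B3.
tc.fill(0x1000, [False], ["B1", "B2", "B4"])  # T1
tc.fill(0x1000, [True], ["B1", "B3", "B4"])   # T2
```

With the start address alone as the tag, T1 and T2 would collide; with the outcome bits included, both can be resident and the branch predictor selects between them.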

Fetch with I-cache and Trace Cache
[Figure labels: Register Renaming; Bypass decode stage]
Instructions in a trace may not need to be decoded: trace of μops.
Big advantage on CISC ISAs (Intel IA-32) where decoding is expensive.
Baer p. 162

Where to build a trace from?
[Figure labels: Trace Cache; Fill Unit; Decoder]
Building from the decoder: traces from mispredicted paths are added to the trace cache.
Baer p. 163

Where to build a trace from?
[Figure labels: Trace Cache; Fill Unit; Reorder Buffer]
Building from the reorder buffer: long delay to build a trace.
Not much performance difference between decoder and ROB.
Baer p. 163

Next Trace Predictor
• To predict the next trace
  – Need to predict the outcome of several branches at the same time.
• An expanded BTB can be used
  – Can base the prediction on a path history of past traces
    • Use bits from tags of previous traces to index a trace predictor
Baer p. 163
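
One way to picture the path-history indexing described above is the following sketch. The class, the table organization, and all sizes (`index_bits`, `history_len`, `bits_per_tag`) are assumptions chosen for illustration; a real predictor would also hash the history and track confidence.

```python
from typing import Dict, List, Optional


class NextTracePredictor:
    """Toy next-trace predictor indexed by bits of previous trace tags."""

    def __init__(self, index_bits: int = 10, history_len: int = 3, bits_per_tag: int = 4):
        self.index_bits = index_bits
        self.history_len = history_len
        self.bits_per_tag = bits_per_tag
        self.history: List[int] = []     # low bits of the most recent trace tags
        self.table: Dict[int, int] = {}  # index -> predicted next trace tag

    def _index(self) -> int:
        # Concatenate the saved tag bits, oldest first, then truncate.
        idx = 0
        for bits in self.history:
            idx = (idx << self.bits_per_tag) | bits
        return idx & ((1 << self.index_bits) - 1)

    def predict(self) -> Optional[int]:
        return self.table.get(self._index())

    def update(self, actual_tag: int) -> None:
        # Train the entry selected by the current path, then extend the path.
        self.table[self._index()] = actual_tag
        self.history.append(actual_tag & ((1 << self.bits_per_tag) - 1))
        if len(self.history) > self.history_len:
            self.history.pop(0)
```

A single prediction here names a whole trace, which implicitly predicts every branch inside it at once.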

Intel Pentium 4 Trace Cache
• Trace cache contains up to 6 μops per line
  – Can store 12K μops (2K lines)
• Claim that it delivers a hit rate equal to that of an 8 to 16 KB I-cache
• There is no I-cache
  – On a trace cache miss, fetches from L2
• Trace cache has its own BTB (512 entries)
  – Another independent 4K-entry BTB for fetches from L2
• Advantages over using an I-cache:
  – Increased fetch bandwidth
  – Bypass of the decoder
Baer p. 163

A 2-to-4 bit Decoder

An 8-to-256 bit Decoder
[Figure: inputs A0 to A7; decoding logic replicated 256 times; 256 output lines]
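
As a behavioral illustration of the decoders pictured above, an n-to-2^n decoder raises exactly one of its output lines. The function name and the list-of-bits representation are illustrative choices, not from the slides.

```python
from typing import List


def decode(value: int, n_bits: int = 8) -> List[int]:
    """Behavioral n-to-2**n decoder: output line `value` is 1, all others 0."""
    if not 0 <= value < (1 << n_bits):
        raise ValueError("input does not fit in n_bits")
    return [1 if line == value else 0 for line in range(1 << n_bits)]
```

In hardware each output line has its own gate on the n inputs, which is why the cost grows with 2^n and why replicating an 8-to-256 decoder m times is expensive.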

Decoding Complexities
• Need m 8-to-256 decoders
• To speculatively compute branch targets for each of m instructions
  – Need m adders in the decode stage.
  – Solution:
    • limit the number of branches decoded per cycle
    • if a branch is resolved in c cycles
      – there are still c branches in flight
      – c = 10 is typical
Baer p. 164

Alleviating Decoding Complexities
• Use predecoded bits appended to instructions
  – predecode on transfers from L2 to I-cache
• CISC: limit the number of complex instructions decoded in a single cycle
  – Intel P6 has 3 decoders: 2 for 1-μop instructions, 1 for complex instructions
Baer p. 164

Alleviating Decoding Complexities
• Use an extra decoding stage to steer instructions towards instruction queues.
Baer p. 164

Pre-decoded Bits: MIPS R10000
• Append 4 bits to each instruction
  – designate class (integer, floating point, branch, load-store) and execution unit queue
• Partial decode on transfer from L2 to I-cache.
Baer p. 165

AMD K7 Pre-decoded Bits
• 3 bits appended to each byte:
  – indicate how many bytes away the start of the next instruction is
  – stored in a predecode cache
  – predecode cache is accessed in parallel with the I-cache.
  – Advantage: detection of instruction boundaries is done only once (not at every execution)
    • saves power dissipation
  – Disadvantage: size of I-cache is almost double
Baer p. 165

Intel P6 Instruction Buffer (Queue)
• A pipeline stage brings instructions into a buffer.
  – Boundaries are determined in the buffer
  – Instructions are steered to either:
    • A simple decoder
    • A complex decoder
  – Can detect opportunities for instruction fusion (e.g., a compare-and-test followed by a branch)
    • may send the fused instruction to a simple decoder
Baer p. 165

Impact of Decoding on Superscalar
• The complexity of the decoder is one of the major limitations to increasing m.
Baer p. 165

Three approaches to Register Renaming
• Reorder Buffer
• Monolithic Physical Register File
• Architectural Register File with a Physical Extension to the Register File
Baer p. 165

Implementation of Register Renaming
• Where and when to allocate/release physical registers?
• What to do on a branch misprediction?
• What to do on an exception?
Baer p. 166

Example
i1: R1 ← R2/R3   # division takes a long time
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9
After renaming:
i1: R32 ← R2/R3   # division takes a long time
i2: R33 ← R32 + R5
i3: R34 ← R6 + R7
i4: R35 ← R8 + R9
R35 will receive a value before instruction i2 is issued.
When/how can R32 be released?
• As soon as i2 is issued.
• How does the hardware know that i2 is the last use of R32?
  – Use a counter? Renaming a use in an instruction → count up; instruction issued → count down
  – Too expensive!
Baer p. 166

Example
i1: R1 ← R2/R3   # division takes a long time
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9
After renaming:
i1: R32 ← R2/R3   # division takes a long time
i2: R33 ← R32 + R5
i3: R34 ← R6 + R7
i4: R35 ← R8 + R9
R35 will receive a value before instruction i2 is issued.
When R1 is renamed again, all uses of the first renaming must be issued!
Release R32 when i4 commits!
Baer p. 166

Physical register states and transitions
• Free → Allocated, at the renaming stage: allocated as a result register
• Allocated → Executed, at end of execution: result generated
• Executed → Assigned, at commit: physical register becomes an architectural register
• Assigned → Free, at release: the next instruction that renames the same register commits
Baer p. 167

Snapshot 1 (rename pointer at i1)
Free: R32 R33 R34 R35
Assigned: R1 R2 R3 R4 R5 R6 R7 R8 R9
i1: R1 ← R2/R3   # division takes a long time
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9

Snapshot 2 (all four instructions renamed; execute pointer at i1)
Free: (none)
Allocated: R32 R33 R34 R35
Assigned: R1 R2 R3 R4 R5 R6 R7 R8 R9
i1: R32 ← R2/R3   # division takes a long time
i2: R33 ← R32 + R5
i3: R34 ← R6 + R7
i4: R35 ← R8 + R9

Snapshot 3 (commit and execute pointers shown)
Free: (none)
Allocated: R33
Executed: R32 R34 R35
Assigned: R1 R2 R3 R4 R5 R6 R7 R8 R9

Snapshot 4 (commit pointer shown)
Free: (none)
Executed: R35
Assigned: R1 R2 R3 R4 R5 R6 R7 R8 R9 R32 R33 R34

Register Releasing
• Need to know the previous renaming
  – maintain a previous vector for each architectural register.
  – when renaming R1 to R35 in i4, record that the previous renaming of R1 was R32
  – when i4 commits, R32 is released and R35 becomes the previous renaming of R1
i1: R32 ← R2/R3   # division takes a long time
i2: R33 ← R32 + R5
i3: R34 ← R6 + R7
i4: R35 ← R8 + R9
Baer p. 167
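
The release rule above can be traced with a small renaming sketch. It is a simplification under stated assumptions: the `Renamer` class is hypothetical, architectural register Ri initially maps to physical register i, and the previous mapping is returned to the caller (as if carried with the instruction) rather than kept in a per-register vector.

```python
from typing import List, Tuple


class Renamer:
    """Toy renamer with a free list and release-at-commit of the previous mapping."""

    def __init__(self, num_arch: int = 32, num_phys: int = 64):
        # Assumption: Ri initially maps to physical register i.
        self.map: List[int] = list(range(num_arch))
        self.free: List[int] = list(range(num_arch, num_phys))

    def rename(self, dest: int, srcs: List[int]) -> Tuple[int, List[int], int]:
        phys_srcs = [self.map[s] for s in srcs]  # read sources before remapping dest
        new_phys = self.free.pop(0)              # raises IndexError if none is free
        prev_phys = self.map[dest]               # remember the previous renaming
        self.map[dest] = new_phys
        return new_phys, phys_srcs, prev_phys

    def commit(self, prev_phys: int) -> None:
        # The previous renaming of the destination can now be released.
        self.free.append(prev_phys)


r = Renamer()
i1 = r.rename(1, [2, 3])  # R1 <- R2/R3
i2 = r.rename(4, [1, 5])  # R4 <- R1 + R5
i3 = r.rename(5, [6, 7])  # R5 <- R6 + R7
i4 = r.rename(1, [8, 9])  # R1 <- R8 + R9
```

In this trace i4 records R32 as the previous renaming of R1, so calling `r.commit(i4[2])` when i4 commits puts R32 back on the free list. A real pipeline would stall renaming when the free list is empty instead of raising an error.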

Extended Register File
• Only physical registers can be in the free state.
• At commit, the result must be stored into the architectural register mapped to the physical register.
  – Needs to associatively search the mapping
  OR
  – ROB has a field with the name of the architectural register
[Figure labels: Architectural Register File; Physical Extension to Register File]
Baer p. 168

Repair Mechanism - Misprediction
• How to repair the register renaming after a branch misprediction?
  – ROB
    • discarding ROB entries for misspeculated instructions invalidates all mappings from architectural to physical registers.
  – Monolithic
    • make a copy of the mapping table at each branch prediction
    • save copies in a circular queue
    • in case of misprediction, use the saved copy to restore the mapping
Baer p. 169
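
The monolithic repair scheme above can be sketched as checkpointing the map at each predicted branch. The class and method names are hypothetical, the queue is unbounded here (real hardware has a small fixed number of checkpoints and stalls when they run out), and restoring the free list is not modeled.

```python
from collections import deque
from typing import Deque, List, Tuple


class CheckpointedMap:
    """Toy rename map with per-branch checkpoints."""

    def __init__(self, num_arch: int = 32):
        self.map: List[int] = list(range(num_arch))
        self.checkpoints: Deque[Tuple[str, List[int]]] = deque()

    def on_branch(self, branch_id: str) -> None:
        # Copy the whole mapping table when a branch is predicted.
        self.checkpoints.append((branch_id, list(self.map)))

    def on_correct(self, branch_id: str) -> None:
        # The branch resolved as predicted: its checkpoint is no longer needed.
        self.checkpoints = deque(c for c in self.checkpoints if c[0] != branch_id)

    def on_mispredict(self, branch_id: str) -> None:
        # Discard checkpoints of younger branches, then restore the saved copy.
        while self.checkpoints:
            bid, saved = self.checkpoints.pop()
            if bid == branch_id:
                self.map = saved
                return
        raise KeyError(branch_id)
```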

Repair Mechanism - Exceptions
• Monolithic
  – Mappings are not saved at every instruction
    • There may be no correct map to restore
  – Have to undo the mappings
    • from the last renamed instruction
    • up to the one that caused the exception
  – Undoing the map is costly but only occurs when an exception is being handled.
Baer p. 169
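
Undoing the mappings can be pictured as walking a log of renamings backwards. This is a sketch under assumptions: the log format `(dest_arch, new_phys, prev_phys)` and the function name are hypothetical, and the faulting instruction's own renaming is undone as well.

```python
from typing import List, Tuple

RenameRecord = Tuple[int, int, int]  # (dest_arch, new_phys, prev_phys)


def undo_renames(mapping: List[int], rename_log: List[RenameRecord],
                 faulting_index: int) -> List[int]:
    """Undo renamings from the youngest instruction back to the faulting one.

    Returns the physical registers that become free again.
    """
    freed: List[int] = []
    # Walk youngest-first so each restore sees the mapping it replaced.
    for dest, new_phys, prev_phys in reversed(rename_log[faulting_index:]):
        mapping[dest] = prev_phys
        freed.append(new_phys)
    del rename_log[faulting_index:]
    return freed
```

For the four-instruction example, an exception in i2 would undo i4, i3, and i2 in that order, leaving R1 mapped to R32 and freeing R35, R34, and R33.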

Comparison: ROB-based
• Space wasting
  – ROB fields for renaming are present for instructions that do not use registers.
• Time wasting
  – Two cycles to store result into register file.
• Easy repair
• No map saving
• Compelling simplicity
• Retiring and writing result can be pipelined
• Power and space wasting
  – ROB needs too many read and write ports
Baer p. 169

ROB-based: Intel P6
  – 40-μop ROB in Pentium III and Pentium M
  – > 80-μop ROB in Intel Core
Monolithic: Alpha 21264, MIPS R10000
Extended: IBM PowerPC
Baer p. 170