Inside the Pentium 4 Processor Microarchitecture Fall 2000

  • Slides: 46

Inside the Pentium® 4 Processor Micro-architecture
Next Generation IA-32 Micro-architecture
Doug Carmean, Principal Architect, Intel Architecture Group
August 24, 2000
Fall 2000
Copyright © 2000 Intel Corporation. Intel Labs

Agenda
  • IA-32 Processor Roadmap
  • Design Goals
  • Frequency
  • Instructions Per Cycle
  • Summary

Intel® Pentium® 4 Processor
[Figure: performance versus time for successive micro-architectures: 486, P5, P6, and, today, the Intel® NetBurst™ Micro-Architecture.]

Intel® Pentium® 4 Processor Design Goals
  • Deliver world-class performance across both existing and emerging applications
  • Deliver performance headroom and scalability for the future
A micro-architecture that will drive performance leadership for the next several years

Intel® NetBurst™ Micro-architecture
  • 400 MHz System Bus
  • Advanced Transfer Cache
  • Advanced Dynamic Execution
  • Hyper Pipelined Technology
  • Rapid Execution Engine
  • Streaming SIMD Extensions 2
  • Execution Trace Cache
  • Enhanced Floating Point / Multi-Media

Pentium® 4 Processor Block Diagram
[Figure: block diagram. Labeled blocks: 3.2 GB/s System Interface; L2 Cache and Control; BTB & I-TLB; Decoder; Trace Cache; BTB; uCode ROM; Rename/Alloc; uop Queues; Schedulers; Integer RF; FP RF; Store AGU; Load AGU; ALU, ALU; FP move; FP store; FMul; FAdd; MMX; SSE; L1 D-Cache and D-TLB.]

CPU Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle
First factor: Frequency

Frequency
  • What limits frequency?
      – Process technology
      – Microarchitecture
  • On a given process technology
      – Fewer gates per pipeline stage will deliver higher frequency
Frequency is driven by microarchitecture

NetBurst™ Micro-architecture Pipeline vs P6
Basic P6 Pipeline (10 stages; intro at 733 MHz on .18µ)
  1-2 Fetch, 3-5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec
Basic Pentium® 4 Processor Pipeline (20 stages; intro at ≥ 1.4 GHz on .18µ)
  1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 Br Ck, 20 Drive
Hyper Pipelined Technology enables industry-leading performance and clock rate

Hyper Pipelined Technology
[Figure: frequency versus time, from introduction to today, with curves labeled 5 (P5 Micro-Architecture), 10 (P6 Micro-Architecture), and 20 (NetBurst Micro-Architecture). Frequency labels on the chart: 60 MHz, 166 MHz, 233 MHz, 1.13 GHz, ≥ 1.4 GHz.]

Hyper Pipelined Technology
TC Nxt IP: Trace cache next instruction pointer
  • Pointer from the BTB, indicating location of next instruction.

Hyper Pipelined Technology
TC Fetch: Trace cache fetch
  • Read the decoded instructions (uOPs) out of the Execution Trace Cache.

Hyper Pipelined Technology
Drive: Wire delay
  • Drive the uOPs to the allocator.

Hyper Pipelined Technology
Alloc: Allocate
  • Allocate resources required for execution. The resources include load buffers, store buffers, etc.

Hyper Pipelined Technology
Rename: Register renaming
  • Rename the logical registers (EAX) to the physical register space (128 are implemented).

Hyper Pipelined Technology
Que: Write into the uOP Queue
  • uOPs are placed into the queues, where they are held until there is room in the schedulers.

Hyper Pipelined Technology
Sch: Schedule
  • Write into the schedulers and compute dependencies. Watch for dependency to resolve.

Hyper Pipelined Technology
Disp: Dispatch
  • Send the uOPs to the appropriate execution unit.

Hyper Pipelined Technology
RF: Register File
  • Read the register file. These are the source(s) for the pending operation (ALU or other).

Hyper Pipelined Technology
Ex: Execute
  • Execute the uOPs on the appropriate execution port.

Hyper Pipelined Technology
Flgs: Flags
  • Compute flags (zero, negative, etc.). These are typically the input to a branch instruction.

Hyper Pipelined Technology
Br Ck: Branch Check
  • The branch operation compares the result of the actual branch direction with the prediction.

Hyper Pipelined Technology
Drive: Wire delay
  • Drive the result of the branch check to the front end of the machine.
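The stage-by-stage walkthrough can be collected into a small lookup table. The Python sketch below is illustrative only: the stage names and descriptions come from the slides, while the clock ranges (for example, Sch covering clocks 10 to 12) are read off the pipeline diagram and should be treated as an assumption. The helper names `stage_at` and `depth` are hypothetical.

```python
# Pentium 4 basic pipeline as (first_clock, last_clock, name, description).
# Clock ranges are an assumption read off the slide's pipeline diagram.
PIPELINE = [
    (1, 2, "TC Nxt IP", "Trace cache next instruction pointer (from the BTB)"),
    (3, 4, "TC Fetch", "Read decoded uOPs out of the Execution Trace Cache"),
    (5, 5, "Drive", "Wire delay: drive the uOPs to the allocator"),
    (6, 6, "Alloc", "Allocate resources such as load and store buffers"),
    (7, 8, "Rename", "Map logical registers to the physical register space"),
    (9, 9, "Que", "Write into the uOP queue"),
    (10, 12, "Sch", "Write into the schedulers and track dependencies"),
    (13, 14, "Disp", "Send uOPs to the appropriate execution unit"),
    (15, 16, "RF", "Read source operands from the register file"),
    (17, 17, "Ex", "Execute on the appropriate execution port"),
    (18, 18, "Flgs", "Compute flags (zero, negative, ...)"),
    (19, 19, "Br Ck", "Compare actual branch direction with the prediction"),
    (20, 20, "Drive", "Wire delay: drive the branch check result to the front end"),
]


def stage_at(clock):
    """Return the name of the stage that occupies the given pipeline clock."""
    for first, last, name, _ in PIPELINE:
        if first <= clock <= last:
            return name
    raise ValueError(f"clock {clock} is outside the pipeline")


def depth():
    """Total number of clocks covered by the table."""
    return sum(last - first + 1 for first, last, _, _ in PIPELINE)


if __name__ == "__main__":
    for first, last, name, text in PIPELINE:
        clocks = str(first) if first == last else f"{first}-{last}"
        print(f"{clocks:>5}  {name:<10} {text}")
```

Two of the twenty clocks are labeled Drive and do no logical work; they exist only to move signals across the die.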

CPU Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle

Improving Instructions Per Cycle
  • Improve efficiency
      – Branch prediction
      – Do more things in a clock
  • Reduce time it takes to do something
      – Reducing latency

Branch Prediction
  • Accurate branch prediction is key to enabling longer pipelines
  • Dramatic improvement over P6 branch predictor:
      – 8x the size (4K)
      – Eliminated 1/3 of the mispredictions
  • Proven to be better than all other publicly disclosed predictors
      – (g-share, hybrid, etc.)
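For readers unfamiliar with the baseline the slide names, here is a minimal Python sketch of a g-share predictor. It is not the Pentium 4 predictor, whose design the slides do not disclose. The table size, the 2-bit counters, and the names `GShare` and `mispredictions` are assumptions made for illustration.

```python
class GShare:
    """Toy g-share predictor: 2-bit counters indexed by PC XOR global history."""

    def __init__(self, index_bits=12):
        # 2**12 = 4096 counters; the size is an arbitrary choice for this sketch.
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # start weakly not-taken
        self.history = 0  # global history of recent branch outcomes

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        """Predict taken when the selected counter is in one of its upper two states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the selected counter, then shift the outcome into the history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def mispredictions(predictor, trace):
    """Count wrong predictions over a list of (pc, taken) pairs."""
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    loop_branch = [(0x40, True)] * 100  # hypothetical always-taken loop branch
    print(mispredictions(GShare(), loop_branch))
```

On a deep pipeline every misprediction discards the work behind the branch, which is why the slide ties prediction accuracy to pipeline length.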

The Execution Trace Cache
[Figure: Pentium® 4 Processor block diagram, repeated to introduce the Execution Trace Cache.]

Execution Trace Cache
  • Advanced L1 instruction cache
      – Caches "decoded" IA-32 instructions (uops)
  • Removes decoder pipeline latency
  • Capacity is ~12K uOps
  • Integrates branches into single line
      – Follows predicted path of program execution
Execution Trace Cache feeds fast engine

Execution Trace Cache
Code as laid out in memory:
       1  cmp
       2  br -> T1
          ... (unused code)
  T1:  3  sub
       4  br -> T2
          ... (unused code)
  T2:  5  mov
       6  sub
       7  br -> T3
          ... (unused code)
  T3:  8  add
       9  sub
      10  mul
      11  cmp
      12  br -> T4
Trace Cache Delivery:
  1 cmp, 2 br T1, 3 T1: sub, 4 br T2, 5 mov, 6 sub, 7 br T3, 8 T3: add, 9 sub, 10 mul, 11 cmp, 12 br T4
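The delivery order in this example can be reproduced with a short Python sketch that follows the predicted path and skips the unused code. It rests on several assumptions: every branch is predicted taken, each instruction is treated as a single uOP, and the `PROGRAM` encoding and `build_trace` helper are hypothetical names, not anything from the slides.

```python
# Each entry: (label or None, operation, predicted-taken branch target or None).
PROGRAM = [
    (None, "cmp", None),
    (None, "br", "T1"),
    (None, "(unused code)", None),
    ("T1", "sub", None),
    (None, "br", "T2"),
    (None, "(unused code)", None),
    ("T2", "mov", None),
    (None, "sub", None),
    (None, "br", "T3"),
    (None, "(unused code)", None),
    ("T3", "add", None),
    (None, "sub", None),
    (None, "mul", None),
    (None, "cmp", None),
    (None, "br", "T4"),
]


def build_trace(program, max_uops=12):
    """Follow predicted-taken branches and return the operations in trace order."""
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    trace, pc = [], 0
    while pc < len(program) and len(trace) < max_uops:
        _, op, target = program[pc]
        trace.append(op if target is None else f"{op} {target}")
        if target is None:
            pc += 1               # fall through to the next instruction
        elif target in labels:
            pc = labels[target]   # jump straight to the predicted target
        else:
            break                 # target lies outside this snippet
    return trace


if __name__ == "__main__":
    for number, op in enumerate(build_trace(PROGRAM), start=1):
        print(number, op)
```

The unused code between the branches never appears in the trace, so the twelve operations are delivered back to back even though they are scattered in memory.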

Advanced Dynamic Execution
  • Extends basic features found in P6 core
  • Very deep speculative execution
      – 126 instructions in flight (3x P6)
      – 48 loads (3x P6) and 24 stores (2x P6)
  • Provides larger window of visibility
      – Better use of execution resources
Deep speculation improves parallelism

Improving Instructions Per Cycle
  • Improve efficiency
      – Branch prediction
      – Do more things in a clock
  • Reduce time it takes to do something
      – Reducing latency

Rapid Execution Engine
  • Dramatically lower ALU latency
  • P6:
      – 1 clock @ 1 GHz: 1 ns
  • P4P:
      – ½ clock @ >1.4 GHz: <0.36 ns
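The nanosecond figures follow from dividing the latency in clocks by the clock frequency. A small Python sketch of that arithmetic, using only the numbers on the slide (the function name `latency_ns` is hypothetical):

```python
def latency_ns(clocks, freq_ghz):
    """Latency in nanoseconds for an operation taking `clocks` cycles at `freq_ghz`."""
    return clocks / freq_ghz


p6_alu = latency_ns(1, 1.0)     # P6: 1 clock at 1 GHz
p4_alu = latency_ns(0.5, 1.4)   # Pentium 4: half a clock at 1.4 GHz

if __name__ == "__main__":
    print(f"P6 ALU latency:        {p6_alu:.2f} ns")
    print(f"Pentium 4 ALU latency: {p4_alu:.3f} ns")
    print(f"Ratio:                 {p6_alu / p4_alu:.1f}x")
```

At exactly 1.4 GHz the half-clock ALU comes out just under 0.36 ns, and any frequency above 1.4 GHz lowers it further, which matches the slide's "<0.36 ns".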

L1 Data Cache
[Figure: Pentium® 4 Processor block diagram, repeated to introduce the L1 Data Cache.]

High Performance L1 Data Cache
  • 8 KB, 4-way set associative, 64-byte lines
  • Very high bandwidth
      – 1 Ld and 1 St per clock
  • New access algorithms
  • Very low latency
      – 2 clock read
New algorithm enables faster cache
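The organization on this slide fixes how an address is divided into tag, set index, and line offset. The Python sketch below works that out for an 8 KB, 4-way cache with 64-byte lines. It assumes power-of-two sizes and conventional set-associative indexing; the slides do not describe the actual indexing scheme, and `cache_geometry` and `split_address` are hypothetical helper names.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Derive line count, set count, and address field widths for a cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2 for powers of two
        "index_bits": sets.bit_length() - 1,
    }


def split_address(addr, geom):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << geom["offset_bits"]) - 1)
    index = (addr >> geom["offset_bits"]) & ((1 << geom["index_bits"]) - 1)
    tag = addr >> (geom["offset_bits"] + geom["index_bits"])
    return tag, index, offset


L1D = cache_geometry(8 * 1024, ways=4, line_bytes=64)

if __name__ == "__main__":
    print(L1D)
    print(split_address(0x12345, L1D))  # 0x12345 is an arbitrary example address
```

With only 32 sets, just five address bits select the set, one ingredient that helps keep a small cache fast.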

Data Speculation
  • Observation: Almost all memory accesses hit in the cache
  • Optimize for the common case
      – Assume that the access will hit the cache
      – Use a low cost mechanism to fix the rare cases that miss
  • Benefit:
      – Reduces latency
      – Significantly higher performance

Replay
  • Repairs incorrect speculation
      – Re-execute until correct
  • Replay is uOP specific
      – Replay the uOP that mis-speculated
      – Replay dependent uOPs
      – Independent uOPs are not replayed
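The selection rule on this slide (replay the mis-speculated uOP and its dependents, leave independent uOPs alone) amounts to a reachability walk over the dependency graph. The Python sketch below illustrates that rule only. It does not model the hardware mechanism, and the uOP names, the `consumers` map, and `uops_to_replay` are hypothetical.

```python
from collections import deque


def uops_to_replay(consumers, misspeculated):
    """Return the mis-speculated uOP plus every uOP that transitively depends on it.

    `consumers` maps each uOP to the uOPs that read its result.
    """
    replay = {misspeculated}
    work = deque([misspeculated])
    while work:
        producer = work.popleft()
        for consumer in consumers.get(producer, ()):
            if consumer not in replay:
                replay.add(consumer)
                work.append(consumer)
    return replay


if __name__ == "__main__":
    # Hypothetical window: ld1 feeds add1, which feeds add2; ld2 feeds mul1.
    consumers = {"ld1": ["add1"], "add1": ["add2"], "ld2": ["mul1"]}
    # Suppose ld1 was assumed to hit the cache but actually missed.
    print(sorted(uops_to_replay(consumers, "ld1")))
```

In this example `ld2` and `mul1` are untouched, which is the point of making replay uOP specific rather than flushing the whole window.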

L1 Cache is >2x Faster
  • P6:
      – 3 clocks @ 1 GHz: 3 ns
  • P4P:
      – 2 clocks @ ≥ 1.4 GHz: <1.4 ns
Lower latency is higher performance

Example with Higher IPC and Faster Clock!
Code sequence: Ld, Add, Add
  P6 @ 1 GHz:                       10 clocks, 10 ns, IPC = 0.6
  Pentium® 4 Processor @ 1.4 GHz:    6 clocks, 4.3 ns, IPC = 1.0
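The numbers in this example tie back to Delivered Performance = Frequency * Instructions Per Cycle. The Python sketch below recomputes them. The instruction count of 6 is an assumption, inferred from IPC times clocks on both sides of the slide; the transcript shows only part of the code sequence. The name `summarize` is hypothetical.

```python
def summarize(instructions, clocks, freq_ghz):
    """Return (IPC, elapsed time in ns, instructions per ns)."""
    ipc = instructions / clocks
    time_ns = clocks / freq_ghz
    perf = freq_ghz * ipc  # Delivered Performance = Frequency * IPC
    return ipc, time_ns, perf


INSTRUCTIONS = 6  # assumption: 0.6 IPC * 10 clocks = 1.0 IPC * 6 clocks = 6

p6_ipc, p6_ns, p6_perf = summarize(INSTRUCTIONS, clocks=10, freq_ghz=1.0)
p4_ipc, p4_ns, p4_perf = summarize(INSTRUCTIONS, clocks=6, freq_ghz=1.4)

if __name__ == "__main__":
    print(f"P6:        IPC {p6_ipc:.1f}, {p6_ns:.1f} ns")
    print(f"Pentium 4: IPC {p4_ipc:.1f}, {p4_ns:.1f} ns")
    print(f"Speedup:   {p6_ns / p4_ns:.2f}x")
```

The speedup is the product of the two gains: 1.4x from frequency and about 1.67x from IPC.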

L2 Advanced Transfer Cache
[Figure: Pentium® 4 Processor block diagram, repeated to introduce the L2 Advanced Transfer Cache.]

L2 ATC Organization
  • 256 KB, 8-way set associative
      – 128-byte lines
      – Two 64-byte pieces per line
  • Holds both data and instructions
  • High bandwidth: 45 GB/sec @ 1.4 GHz
      – 2.8x P6 @ 1 GHz
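The bandwidth figure can be reproduced with simple arithmetic if one assumes a 32-byte (256-bit) transfer on every core clock. That width is not stated on the slide and is an assumption here, as is the P6 case of one 32-byte transfer every second clock, which is chosen to match the 16 GB/sec figure for the Pentium III in the Recap slide. The function name `bandwidth_gb_per_s` is hypothetical.

```python
def bandwidth_gb_per_s(bytes_per_transfer, freq_ghz, clocks_per_transfer=1):
    """Peak bandwidth in GB/s for a fixed-size transfer every N core clocks."""
    return bytes_per_transfer * freq_ghz / clocks_per_transfer


BYTES_PER_TRANSFER = 32  # assumption: 256-bit L2 interface

p4_l2 = bandwidth_gb_per_s(BYTES_PER_TRANSFER, 1.4)                         # every clock
p6_l2 = bandwidth_gb_per_s(BYTES_PER_TRANSFER, 1.0, clocks_per_transfer=2)  # every 2nd clock

if __name__ == "__main__":
    print(f"Pentium 4 L2: {p4_l2:.1f} GB/s")
    print(f"P6 L2:        {p6_l2:.1f} GB/s")
    print(f"Ratio:        {p4_l2 / p6_l2:.1f}x")
```

Under these assumptions the result is 44.8 GB/sec, which the slide rounds to 45 GB/sec, and the ratio to the P6 figure is the 2.8x quoted.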

Aggregate Cache Latency
  • Function of all caches in a processor
  • Overall Effective Latency = L1 latency + L1 Miss Rate * L2 latency + L2 Miss Rate * DRAM Latency
Average cache speed is >1.8x better than the Pentium® III Processor!
(Average on desktop applications, Pentium® III Processor @ 1 GHz, Pentium® 4 Processor @ 1.4 GHz)
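A worked instance of the effective latency formula, as a Python sketch. Only the 2-clock L1 latency comes from the slides; the miss rates and the L2 and DRAM latencies below are made-up placeholder values, and both miss rates are taken per memory access, which is how the slide's formula reads. The function name `effective_latency` is hypothetical.

```python
def effective_latency(l1_latency, l1_miss_rate, l2_latency, l2_miss_rate, dram_latency):
    """Overall effective latency, following the formula on the slide.

    Both miss rates are fractions of all memory accesses.
    """
    return l1_latency + l1_miss_rate * l2_latency + l2_miss_rate * dram_latency


if __name__ == "__main__":
    clocks = effective_latency(
        l1_latency=2,       # from the L1 Data Cache slide (2 clock read)
        l1_miss_rate=0.05,  # hypothetical
        l2_latency=10,      # hypothetical, in clocks
        l2_miss_rate=0.01,  # hypothetical
        dram_latency=200,   # hypothetical, in clocks
    )
    print(f"Effective latency: {clocks:.2f} clocks")
```

With these placeholder values the rare DRAM accesses contribute as much as the L1 latency itself, which is why the formula counts every level rather than the L1 alone.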

Recap
                          Pentium® III Processor   Pentium® 4 Processor   Relative Improvement
  Frequency               1 GHz                    > 1.4 GHz              > 1.4
  Adder Speed             1 ns                     < 0.36 ns              > 2.8
  L1 Cache Speed          3 ns                     < 1.42 ns              > 2.1
  L1 Cache Size           16 KB                    8 KB                   0.5
  L1 Cache Bandwidth      16 GB/sec                > 44.8 GB/sec          > 2.8
  L2 Cache Bandwidth      16 GB/sec                > 44.8 GB/sec          > 2.8
  Uop Fetch Bandwidth     3 billion/sec            > 4.2 billion/sec      > 1.4
  Adder Bandwidth         2 billion/sec            > 5.6 billion/sec      > 2.8
  Branch Targets          512                      4096                   8
  Instructions in Flight  40                       126                    3.15
  Loads in Flight         16                       48                     3
  Stores in Flight        12                       24                     2

Intel® Pentium® 4 Processor Summary
  • Revolutionary, new microarchitecture from Intel designed for the evolving Internet
  • Design features for balanced, high performance platform scalability and headroom
  • The world's highest performance desktop processor
