Inside the Pentium 4 Processor Microarchitecture Fall 2000
- Slides: 46
Inside the Pentium® 4 Processor Micro-architecture
Next Generation IA-32 Micro-architecture
Doug Carmean, Principal Architect, Intel Architecture Group
August 24, 2000
Fall 2000. Intel Labs. Copyright © 2000 Intel Corporation.
Agenda
- IA-32 Processor Roadmap
- Design Goals
- Frequency
- Instructions Per Cycle
- Summary
Intel® Pentium® 4 Processor
[Figure: roadmap plotting performance against time for the 486, P5, P6, and Intel® NetBurst™ micro-architectures, with a "Today" marker.]
Intel® Pentium® 4 Processor Design Goals
- Deliver world class performance across both existing and emerging applications
- Deliver performance headroom and scalability for the future
Micro-architecture that will Drive Performance Leadership for the Next Several Years
Intel® NetBurst™ Micro-architecture
- 400 MHz System Bus
- Advanced Transfer Cache
- Advanced Dynamic Execution
- Hyper Pipelined Technology
- Rapid Execution Engine
- Streaming SIMD Extensions 2
- Execution Trace Cache
- Enhanced Floating Point / Multi-Media
Pentium® 4 Processor Block Diagram
[Figure: block diagram. Blocks shown: System Interface (3.2 GB/s), L2 Cache and Control, BTB & I-TLB, Decoder, Trace Cache, BTB, µCode ROM, Rename/Alloc, µop Queues, Schedulers, Integer RF, FP RF, Store AGU, Load AGU, ALUs, FP move, FP store, FMul, FAdd, MMX, SSE, L1 D-Cache and D-TLB.]
CPU Architecture 101
Delivered Performance = Frequency * Instructions Per Cycle
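The relationship on this slide can be sketched as a one-line function. The two clock rates are the ones quoted in this deck; the IPC value of 1.0 used for both processors is purely an illustrative assumption, not a measured figure, and the function name is hypothetical.

```python
def delivered_performance(frequency_hz: float, ipc: float) -> float:
    """Instructions completed per second = clock frequency * instructions per cycle."""
    return frequency_hz * ipc


# Hypothetical comparison: identical (assumed) IPC, different clock rates.
p6 = delivered_performance(1.0e9, 1.0)
p4 = delivered_performance(1.4e9, 1.0)
print(f"speedup from frequency alone: {p4 / p6:.2f}x")
```

With equal IPC, the speedup is simply the frequency ratio; the rest of the deck is about raising both factors at once.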
Frequency
- What limits frequency?
  - Process technology
  - Microarchitecture
- On a given process technology
  - Fewer gates per pipeline stage will deliver higher frequency
Frequency is driven by Microarchitecture
NetBurst™ Micro-architecture Pipeline vs P6
Basic P6 Pipeline (10 stages; intro at 733 MHz on 0.18µ):
  1 Fetch, 2 Fetch, 3 Decode, 4 Decode, 5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec
Basic Pentium® 4 Processor Pipeline (20 stages; intro at ≥ 1.4 GHz on 0.18µ):
  1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 Br Ck, 20 Drive
Hyper Pipelined Technology enables industry leading performance and clock rate
Hyper Pipelined Technology
[Figure: pipeline depth and frequency by micro-architecture, from introduction to today. P5: 5 stages, 60 MHz at introduction, 233 MHz today. P6: 10 stages, 166 MHz at introduction, 1.13 GHz today. NetBurst: 20 stages, ≥ 1.4 GHz at introduction.]
Hyper Pipelined Technology: pipeline stages
[Figure: the 20-stage pipeline drawn above the processor block diagram, repeated on each of the following slides.]
- TC Nxt IP (trace cache next instruction pointer): pointer from the BTB, indicating the location of the next instruction.
- TC Fetch (trace cache fetch): read the decoded instructions (µOPs) out of the Execution Trace Cache.
- Drive (wire delay): drive the µOPs to the allocator.
- Alloc (allocate): allocate resources required for execution. The resources include load buffers, store buffers, etc.
- Rename (register renaming): rename the logical registers (e.g. EAX) to the physical register space (128 are implemented).
- Que (write into the µOP queue): µOPs are placed into the queues, where they are held until there is room in the schedulers.
- Sch (schedule): write into the schedulers and compute dependencies. Watch for dependencies to resolve.
- Disp (dispatch): send the µOPs to the appropriate execution unit.
- RF (register file): read the register file. These are the source(s) for the pending operation (ALU or other).
- Ex (execute): execute the µOPs on the appropriate execution port.
- Flgs (flags): compute flags (zero, negative, etc.). These are typically the input to a branch instruction.
- Br Ck (branch check): the branch operation compares the actual branch direction with the prediction.
- Drive (wire delay): drive the result of the branch check to the front end of the machine.
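The stage walk-through above can be captured as a small lookup table. This is a minimal sketch: the stage numbers are read off the pipeline figure and should be treated as a reconstruction, and the `Stage` type and `stage_at` helper are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    name: str
    first: int  # first pipeline stage number occupied
    last: int   # last pipeline stage number occupied
    role: str


# Stage numbers are a reconstruction from the pipeline figure.
PIPELINE = [
    Stage("TC Nxt IP", 1, 2, "trace cache next instruction pointer"),
    Stage("TC Fetch", 3, 4, "trace cache fetch"),
    Stage("Drive", 5, 5, "wire delay: drive uOPs to the allocator"),
    Stage("Alloc", 6, 6, "allocate load/store buffers and other resources"),
    Stage("Rename", 7, 8, "map logical registers to physical registers"),
    Stage("Que", 9, 9, "write into the uOP queue"),
    Stage("Sch", 10, 12, "write into schedulers, track dependencies"),
    Stage("Disp", 13, 14, "send uOPs to an execution unit"),
    Stage("RF", 15, 16, "read the register file"),
    Stage("Ex", 17, 17, "execute"),
    Stage("Flgs", 18, 18, "compute flags"),
    Stage("Br Ck", 19, 19, "compare branch outcome with prediction"),
    Stage("Drive", 20, 20, "wire delay: drive branch result to the front end"),
]


def stage_at(number: int) -> Stage:
    """Return the pipeline step that occupies a given stage number (1-20)."""
    for stage in PIPELINE:
        if stage.first <= number <= stage.last:
            return stage
    raise ValueError(f"no such pipeline stage: {number}")


depth = sum(s.last - s.first + 1 for s in PIPELINE)
print(depth, stage_at(11).name)
```

Summing the stage widths is a quick consistency check that the table really describes a 20-stage pipeline.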
Improving Instructions Per Cycle
- Improve efficiency
  - Branch prediction
  - Do more things in a clock
- Reduce time it takes to do something
  - Reducing latency
Branch Prediction
- Accurate branch prediction is key to enabling longer pipelines
- Dramatic improvement over P6 branch predictor:
  - 8x the size (4 K)
  - Eliminated 1/3 of the mispredictions
- Proven to be better than all other publicly disclosed predictors (g-share, hybrid, etc.)
The Execution Trace Cache
[Figure: Pentium® 4 Processor Block Diagram]
Execution Trace Cache
- Advanced L1 instruction cache
  - Caches "decoded" IA-32 instructions (µops)
- Removes decoder pipeline latency
- Capacity is ~12 K µops
- Integrates branches into single line
  - Follows predicted path of program execution
Execution Trace Cache feeds fast engine
Execution Trace Cache
Program code:
       1 cmp
       2 br -> T1
       ... (unused code)
  T1:  3 sub
       4 br -> T2
       ... (unused code)
  T2:  5 mov
       6 sub
       7 br -> T3
       ... (unused code)
  T3:  8 add
       9 sub
       10 mul
       11 cmp
       12 br -> T4
Trace Cache Delivery:
  1 cmp, 2 br T1, 3 T1: sub, 4 br T2, 5 mov, 6 sub, 7 br T3, 8 T3: add, 9 sub, 10 mul, 11 cmp, 12 br T4
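The example above can be mimicked with a toy trace builder. This is a sketch under simplifying assumptions: each instruction is treated as a single µop, every branch is predicted taken, and the `PROGRAM` layout and `build_trace` function are hypothetical, not a description of the hardware's fill logic.

```python
# Toy program: each basic block ends in a branch that is predicted taken.
# Block contents mirror the slide; the data layout itself is hypothetical.
PROGRAM = {
    "start": (["cmp", "br T1"], "T1"),
    "T1": (["sub", "br T2"], "T2"),
    "T2": (["mov", "sub", "br T3"], "T3"),
    "T3": (["add", "sub", "mul", "cmp", "br T4"], "T4"),
}


def build_trace(program, entry, max_uops):
    """Follow the predicted path from `entry`, packing instructions contiguously."""
    trace = []
    label = entry
    while label in program and len(trace) < max_uops:
        instructions, predicted_target = program[label]
        for ins in instructions:
            if len(trace) == max_uops:
                break
            trace.append(ins)
        # Skip the "unused code" after the branch and continue at its target.
        label = predicted_target
    return trace


trace = build_trace(PROGRAM, "start", max_uops=12)
for number, ins in enumerate(trace, start=1):
    print(number, ins)
```

The resulting list holds the twelve instructions back to back, with the unused code between the branches and their targets left out, which is the point of the slide.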
Advanced Dynamic Execution
- Extends basic features found in P6 core
- Very deep speculative execution
  - 126 instructions in flight (3x P6)
  - 48 loads (3x P6) and 24 stores (2x P6)
- Provides larger window of visibility
  - Better use of execution resources
Deep Speculation Improves Parallelism
Rapid Execution Engine
- Dramatically lower ALU latency
- P6: 1 clock @ 1 GHz = 1 ns
- P4P (Pentium® 4 Processor): ½ clock @ > 1.4 GHz = < 0.36 ns
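The latency figures on this slide follow from dividing clocks by frequency. A minimal sketch, with `latency_ns` as a hypothetical helper and the Pentium 4 evaluated at exactly 1.4 GHz (the slide quotes "greater than 1.4 GHz", so the real latency would be at or below this value):

```python
def latency_ns(clocks: float, frequency_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds."""
    return clocks / frequency_ghz


p6_alu = latency_ns(1, 1.0)    # P6: 1 clock at 1 GHz
p4_alu = latency_ns(0.5, 1.4)  # Pentium 4: half a clock at 1.4 GHz
print(f"P6 {p6_alu:.2f} ns, Pentium 4 {p4_alu:.3f} ns, ratio {p6_alu / p4_alu:.1f}x")
```

At 1.4 GHz the half-clock ALU works out to about 0.357 ns, consistent with the "< 0.36 ns" quoted above.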
L1 Data Cache
[Figure: Pentium® 4 Processor Block Diagram]
High Performance L1 Data Cache
- 8 KB, 4-way set associative, 64-byte lines
- Very high bandwidth
  - 1 Ld and 1 St per clock
- New access algorithms
- Very low latency
  - 2 clock read
New algorithm enables faster cache
Data Speculation
- Observation: Almost all memory accesses hit in the cache
- Optimize for the common case
  - Assume that the access will hit the cache
  - Use a low cost mechanism to fix the rare cases that miss
- Benefit:
  - Reduces latency
  - Significantly higher performance
Replay
- Repairs incorrect speculation
  - Re-execute until correct
- Replay is µOP specific
  - Replay the µOP that mis-speculated
  - Replay dependent µOPs
  - Independent µOPs are not replayed
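The selection rule above (replay the mis-speculated µOP and its dependents, leave independent µOPs alone) can be sketched as a walk over a dependency graph. This is only an illustration of the rule; the `uops_to_replay` function, the dictionary representation, and the sample window are assumptions, not the hardware mechanism.

```python
def uops_to_replay(dependencies, mis_speculated):
    """Return the mis-speculated uOP plus every uOP that transitively reads its result.

    `dependencies` maps each uOP name to the list of uOPs whose results it reads.
    """
    # Invert the map: producer -> list of direct consumers.
    consumers = {}
    for uop, sources in dependencies.items():
        for src in sources:
            consumers.setdefault(src, []).append(uop)

    replay = set()
    worklist = [mis_speculated]
    while worklist:
        uop = worklist.pop()
        if uop in replay:
            continue
        replay.add(uop)
        worklist.extend(consumers.get(uop, []))
    return replay


# Hypothetical window: a load that missed the cache, two dependent adds,
# and one independent multiply.
window = {
    "load": [],
    "add1": ["load"],
    "add2": ["add1"],
    "mul": [],
}
print(sorted(uops_to_replay(window, "load")))
```

In this sample window the multiply does not read the load's result, so it is not in the replay set.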
L1 Cache is >2x Faster
- P6: 3 clocks @ 1 GHz = 3 ns
- P4P: 2 clocks @ ≥ 1.4 GHz = < 1.4 ns
Lower Latency is Higher Performance
Example with Higher IPC and Faster Clock!
Code sequence: Ld, Add, Add
Processor | Clocks | Time | IPC
P6 @ 1 GHz | 10 clocks | 10 ns | 0.6
Pentium® 4 Processor @ 1.4 GHz | 6 clocks | 4.3 ns | 1.0
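The numbers in this example can be cross-checked with a few lines of arithmetic. One assumption is needed: the quoted IPC values (0.6 over 10 clocks, 1.0 over 6 clocks) imply a sequence of six instructions, so that count is used here; the helper names are hypothetical.

```python
def execution_time_ns(clocks: int, frequency_ghz: float) -> float:
    """Wall-clock time for a given number of cycles."""
    return clocks / frequency_ghz


def ipc(instructions: int, clocks: int) -> float:
    """Instructions per cycle."""
    return instructions / clocks


# Assumption: six instructions, inferred from IPC = 0.6 at 10 clocks
# and IPC = 1.0 at 6 clocks.
INSTRUCTIONS = 6

p6_time = execution_time_ns(10, 1.0)
p4_time = execution_time_ns(6, 1.4)
print(f"P6: {p6_time:.1f} ns, IPC {ipc(INSTRUCTIONS, 10):.1f}")
print(f"Pentium 4: {p4_time:.1f} ns, IPC {ipc(INSTRUCTIONS, 6):.1f}")
print(f"speedup: {p6_time / p4_time:.2f}x")
```

The speedup is the product of the IPC gain and the frequency gain, which ties the example back to Delivered Performance = Frequency * Instructions Per Cycle.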
L2 Advanced Transfer Cache
[Figure: Pentium® 4 Processor Block Diagram]
L2 ATC Organization
- 256 KB, 8-way set associative
  - 128-byte lines
  - Two 64-byte pieces per line
- Holds both data and instructions
- High bandwidth: 45 GB/sec @ 1.4 GHz
  - 2.8x P6 @ 1 GHz
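The bandwidth claim can be broken down into bytes moved per clock. This sketch uses the 44.8 GB/sec and 16 GB/sec figures from the Recap slide (the slide above rounds the former to 45 GB/sec); the per-clock widths are derived from those figures rather than stated in the deck, and `bytes_per_clock` is a hypothetical helper.

```python
def bytes_per_clock(bandwidth_gb_per_s: float, frequency_ghz: float) -> float:
    """Bytes transferred per clock cycle implied by a bandwidth and a clock rate."""
    return bandwidth_gb_per_s / frequency_ghz


p4_width = bytes_per_clock(44.8, 1.4)  # Pentium 4 L2 at 1.4 GHz
p6_width = bytes_per_clock(16.0, 1.0)  # P6 L2 at 1 GHz
print(f"Pentium 4: {p4_width:.0f} bytes/clock, P6: {p6_width:.0f} bytes/clock")
print(f"bandwidth ratio: {44.8 / 16.0:.1f}x")
```

Seen this way, the 2.8x figure is a 2x wider transfer per clock multiplied by a 1.4x faster clock.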
Aggregate Cache Latency
- Function of all caches in a processor
- Overall Effective Latency = L1 latency + L1 Miss Rate * L2 latency + L2 Miss Rate * DRAM Latency
Average cache speed is >1.8x better than the Pentium® III Processor!
Average on desktop applications, Pentium® III Processor @ 1 GHz, Pentium® 4 Processor @ 1.4 GHz
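The effective-latency formula can be evaluated directly. In this sketch both miss rates are interpreted as fractions of all memory accesses, which is what the formula as written requires; every input value below is hypothetical except the L1 latency of about 1.43 ns (2 clocks at 1.4 GHz), and the deck does not give the miss rates behind its 1.8x figure.

```python
def effective_latency(l1_latency, l1_miss_rate, l2_latency, l2_miss_rate, dram_latency):
    """Overall effective latency as written on the slide.

    Both miss rates are fractions of all memory accesses; latencies share one unit.
    """
    return l1_latency + l1_miss_rate * l2_latency + l2_miss_rate * dram_latency


# Hypothetical inputs in nanoseconds (only the L1 latency comes from the deck).
example = effective_latency(
    l1_latency=1.43,
    l1_miss_rate=0.05,
    l2_latency=10.0,
    l2_miss_rate=0.01,
    dram_latency=100.0,
)
print(f"{example:.2f} ns")
```

Even with small miss rates, the L2 and DRAM terms contribute as much as the L1 term in this made-up case, which is why the slide treats latency as a function of all caches.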
Recap
Metric | Pentium® III Processor | Pentium® 4 Processor | Relative Improvement
Frequency | 1 GHz | > 1.4 GHz | > 1.4
Adder Speed | 1 ns | < 0.36 ns | > 2.8
L1 Cache Speed | 3 ns | < 1.42 ns | > 2.1
L1 Cache Size | 16 KB | 8 KB | 0.5
L1 Cache Bandwidth | 16 GB/sec | > 44.8 GB/sec | > 2.8
L2 Cache Bandwidth | 16 GB/sec | > 44.8 GB/sec | > 2.8
Uop Fetch Bandwidth | 3 billion/sec | > 4.2 billion/sec | > 1.4
Adder Bandwidth | 2 billion/sec | > 5.6 billion/sec | > 2.8
Branch Targets | 512 | 4096 | 8
Instructions in Flight | 40 | 126 | 3.15
Loads in Flight | 16 | 48 | 3
Stores in Flight | 12 | 24 | 2
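The Relative Improvement column can be recomputed from the other two. This is a sketch: the Pentium 4 values are taken at exactly 1.4 GHz (the deck quotes them as bounds), branch targets are taken as 4096 (4 K, i.e. 8 x 512, per the Branch Prediction slide), and the `RECAP` layout and `improvement` helper are hypothetical.

```python
# (metric, Pentium III value, Pentium 4 value, True if a smaller value is better)
RECAP = [
    ("Frequency (GHz)", 1.0, 1.4, False),
    ("Adder speed (ns)", 1.0, 0.5 / 1.4, True),
    ("L1 cache speed (ns)", 3.0, 2 / 1.4, True),
    ("L1 cache size (KB)", 16, 8, False),
    ("L1 cache bandwidth (GB/s)", 16.0, 44.8, False),
    ("L2 cache bandwidth (GB/s)", 16.0, 44.8, False),
    ("Uop fetch bandwidth (billion/s)", 3.0, 4.2, False),
    ("Adder bandwidth (billion/s)", 2.0, 5.6, False),
    ("Branch targets", 512, 4096, False),  # 4 K entries assumed
    ("Instructions in flight", 40, 126, False),
    ("Loads in flight", 16, 48, False),
    ("Stores in flight", 12, 24, False),
]


def improvement(old, new, smaller_is_better):
    """Ratio greater than 1 means the Pentium 4 value is better."""
    return old / new if smaller_is_better else new / old


ratios = {name: improvement(old, new, smaller) for name, old, new, smaller in RECAP}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

The only ratio below 1 is the L1 cache size, which the deck presents as the deliberate trade for lower L1 latency.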
Intel® Pentium® 4 Processor Summary
- Revolutionary, new microarchitecture from Intel designed for the evolving Internet
- Design features for balanced, high performance platform scalability and headroom
- The world's highest performance desktop processor