Intel Pentium 4 Processor
Presented by Steve Kelley and Zhijian Lu

Outline
- Introduction (Zhijian)
- Instruction Set Architecture (Zhijian)
- Instruction Stream (Steve)
- Data Stream (Zhijian)
- What went wrong (Steve)

Introduction
- The Intel Pentium 4 processor is the latest IA-32 processor, equipped with the full set of IA-32 SIMD operations
- It is the first implementation of a new microarchitecture, which Intel calls “NetBurst”

Comparison Between Pentium 3 and Pentium 4

Execution on MPEG 4 Benchmarks @ 1 GHz

Instruction Set Architecture
- Pentium 4 ISA = Pentium 3 ISA + SSE2 (Streaming SIMD Extensions 2)
- SSE2 is an architectural enhancement to the IA-32 architecture

SSE2 extends the Intel MMX technology and the SSE extensions with 144 new instructions:
- 128-bit SIMD integer arithmetic operations
- 128-bit SIMD double-precision floating-point operations
- Enhanced cache and memory management operations

Comparison Between SSE and SSE2
- Both support operations on 128-bit XMM registers
- SSE only supports 4 packed single-precision floating-point values
- SSE2 supports more:
  - 2 packed double-precision floating-point values
  - 16 packed byte integers
  - 8 packed word integers
  - 4 packed doubleword integers
  - 2 packed quadword integers
  - Double quadword
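
The lane counts in this comparison follow directly from dividing the 128-bit register width by the element width. The sketch below is only an illustration in Python (it does not execute any SSE2 instruction); the names `XMM_BITS`, `PACKED_TYPES`, and `lanes` are made up for this example. It also reinterprets the same 16 bytes under several of the packed layouts.

```python
import struct

XMM_BITS = 128  # width of one XMM register

# Element widths (in bits) of the packed types listed on the slide.
PACKED_TYPES = {
    "single-precision float": 32,
    "double-precision float": 64,
    "byte integer": 8,
    "word integer": 16,
    "doubleword integer": 32,
    "quadword integer": 64,
    "double quadword": 128,
}


def lanes(element_bits):
    """Number of elements of the given width that fit in one XMM register."""
    return XMM_BITS // element_bits


# The same 16 bytes viewed under different packed layouts (little-endian).
raw = struct.pack("<2d", 1.5, -2.0)        # 2 packed doubles
as_bytes = struct.unpack("<16B", raw)      # 16 packed bytes
as_words = struct.unpack("<8H", raw)       # 8 packed words
as_dwords = struct.unpack("<4I", raw)      # 4 packed doublewords
as_qwords = struct.unpack("<2Q", raw)      # 2 packed quadwords

for name, bits in PACKED_TYPES.items():
    print(f"{lanes(bits):2d} x {name}")
```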

Hardware Support for SSE2
- Adder and multiplier units in the SSE2 engine are 128 bits wide, twice the width of those in the Pentium 3
- Increased load/store bandwidth for floating-point values
  - Loads and stores are 128 bits wide
  - One load plus one store can be completed between an XMM register and the L1 cache in one clock cycle

SSE2 Instructions (1)
- Data movements
  - Move data between XMM registers and memory
- Double-precision floating-point operations
  - Arithmetic instructions on both scalar and packed values
- Logical instructions
  - Perform logical operations on packed double-precision floating-point values

SSE2 Instructions (2)
- Compare instructions
  - Compare packed and scalar double-precision floating-point values
- Shuffle and unpack instructions
  - Shuffle or interleave double-precision floating-point values in packed double-precision floating-point operands
- Conversion instructions
  - Convert between doubleword integers and double-precision floating-point values, or between single-precision and double-precision floating-point values

SSE2 Instructions (3)
- Packed single-precision floating-point instructions
  - Convert between single-precision floating-point and doubleword integer operands
- 128-bit SIMD integer instructions
  - Operations on integers contained in XMM registers
- Cacheability control and instruction ordering
  - More operations for caching of data when storing from XMM registers to memory, and additional control of instruction ordering on store operations

Conclusion
- The Pentium 4 is equipped with the full set of IA-32 SIMD technology. All existing software can run correctly on it.
- AMD has decided to embrace and implement SSE and SSE2 in its future CPUs

Instruction Stream

Instruction Stream
- What’s new?
  - Added Trace Cache
  - Improved branch predictor

Front End
- Prefetches instructions that are likely to be executed
- Fetches instructions that haven’t been prefetched
- Decodes instructions into μops
- Generates μops for complex instructions or special-purpose code
- Predicts branches

Prefetch
- Three methods of prefetching:
  - Instructions only: hardware
  - Data only: software
  - Code or data: hardware

Decoder
- Single decoder that can operate at a maximum of 1 instruction per cycle
- Receives instructions from the L2 cache 64 bits at a time
- Some complex instructions must enlist the help of the microcode ROM

Trace Cache
- Primary instruction cache in the NetBurst architecture
- Stores decoded μops
- Capacity of ~12K μops
- On a Trace Cache miss, instructions are fetched and decoded from the L2 cache

Trace Cache
- Has its own branch predictor that directs where instruction fetching needs to go next in the Trace Cache
- Removes
  - Decoding costs on frequently decoded instructions
  - Extra latency to decode instructions upon branch mispredictions
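
The saving on frequently decoded instructions can be illustrated with a toy model. This is a sketch under strong assumptions: a Python dictionary stands in for the Trace Cache, there is no capacity limit or trace building, and `TraceCache`, `toy_decode`, and the μop names are all hypothetical.

```python
class TraceCache:
    """Toy model: caches decoded uops keyed by instruction address."""

    def __init__(self, decode):
        self.decode = decode      # hypothetical decoder: address -> list of uops
        self.lines = {}           # address -> cached uops
        self.decode_calls = 0     # how many times the slow decoder ran

    def fetch(self, address):
        if address not in self.lines:          # miss: decode once and keep the result
            self.decode_calls += 1
            self.lines[address] = self.decode(address)
        return self.lines[address]             # hit: no decode cost


def toy_decode(address):
    # Stand-in for the IA-32 decoder; the uop names are invented.
    return [f"uop{address:#x}.{i}" for i in range(2)]


tc = TraceCache(toy_decode)
loop_body = [0x100, 0x104, 0x108]
for _ in range(1000):            # a hot loop executed 1000 times
    for addr in loop_body:
        tc.fetch(addr)
print(tc.decode_calls)           # 3
```

Each of the three instructions is decoded once, and the remaining 2,997 fetches are served from the cached μops.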

Microcode ROM
- Used for complex IA-32 instructions (> 4 μops), such as string move, and for fault and interrupt handling
- When a complex instruction is encountered, the Trace Cache jumps into the microcode ROM, which then issues the μops
- After the microcode ROM finishes, the front end of the machine resumes fetching μops from the Trace Cache

Branch Prediction
- Predicts ALL near branches
  - Includes conditional branches, unconditional calls and returns, and indirect branches
- Does not predict far transfers
  - Includes far calls, irets, and software interrupts

Branch Prediction
- Dynamically predicts the direction and target of branches based on the PC, using the BTB
- If no dynamic prediction is available, predicts statically
  - Taken for backward (looping) branches
  - Not taken for forward branches
- Traces are built across predicted branches to avoid branch penalties
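
The fallback from dynamic to static prediction can be sketched as follows. This is a minimal illustration, not the hardware algorithm: `predict_taken` is a hypothetical name, and `history` is an assumed dictionary standing in for whatever the BTB has learned about a branch.

```python
def predict_taken(pc, target, history):
    """Predict whether the conditional branch at `pc` is taken.

    `history` is a hypothetical dict mapping a branch address to its
    last known direction (True = taken), standing in for the BTB.
    """
    if pc in history:          # dynamic prediction is available
        return history[pc]
    return target < pc         # static rule: backward => taken, forward => not taken


print(predict_taken(0x200, 0x180, {}))               # True  (backward loop branch)
print(predict_taken(0x200, 0x240, {}))               # False (forward branch)
print(predict_taken(0x200, 0x180, {0x200: False}))   # False (dynamic history wins)
```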

Branch Target Buffer
- Uses a branch history table and a branch target buffer to predict
- Updating occurs when a branch is retired

Return Address Stack
- 16 entries
- Predicts return addresses for procedure calls
- Allows branches and their targets to coexist in a single cache line
  - Increases parallelism since decode bandwidth is not wasted
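
A 16-entry return address stack can be sketched as a bounded LIFO. The class and method names are hypothetical, and the overflow policy (silently dropping the oldest entry) is an assumption made for the example; the slide does not say what the hardware does when the stack is full.

```python
from collections import deque


class ReturnAddressStack:
    def __init__(self, entries=16):
        # Assumption: on overflow the oldest entry is discarded.
        self.stack = deque(maxlen=entries)

    def on_call(self, return_address):
        """Record the address of the instruction following a call."""
        self.stack.append(return_address)

    def on_return(self):
        """Predicted return target, or None if no prediction is available."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x1004)              # outer call
ras.on_call(0x2008)              # nested call
print(hex(ras.on_return()))      # 0x2008
print(hex(ras.on_return()))      # 0x1004
print(ras.on_return())           # None
```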

Branch Hints
- The P4 permits software to provide hints to the branch prediction and trace formation hardware to enhance performance
- Hints take the form of prefixes to conditional branch instructions
- Hints are used only at trace build time and have no effect on already built traces

Out-of-Order Execution
- Designed to optimize performance by handling the most common operations in the most common context as fast as possible
- 126 μops can be in flight at once
  - Up to 48 loads / 24 stores

Issue
- Instructions are fetched and decoded by the translation engine
- The translation engine builds instructions into sequences of μops
- Stores μops to the trace cache
- The trace cache can issue 3 μops per cycle

Execution
- Can dispatch up to 6 μops per cycle
- Exceeds trace cache and retirement μop bandwidth
  - Allows for greater flexibility in issuing μops to different execution units

Execution Units

Retirement
- Can retire 3 μops per cycle
- Precise exceptions
- Reorder buffer to organize completed μops
- Also keeps track of branches and sends updated branch information to the BTB

Execution Pipeline

Data Stream of Pentium 4 Processor

Register Renaming

Register Renaming (2)
- 8-entry architectural register file
- 128-entry physical register file
- 2 RATs: Frontend RAT and Retirement RAT
- Data does not need to be copied between register files when an instruction retires
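
The point about not copying data at retirement can be sketched with two mapping tables over one physical register file. This is a simplified illustration: the `Renamer` class, its free-list policy, and the moment at which an old physical register is recycled are assumptions, not details taken from the slides.

```python
class Renamer:
    def __init__(self, arch_regs=8, phys_regs=128):
        # Both RATs map an architectural register to a physical register index.
        self.frontend_rat = list(range(arch_regs))      # speculative mapping
        self.retirement_rat = list(range(arch_regs))    # committed mapping
        self.free = list(range(arch_regs, phys_regs))   # unallocated physical registers
        self.phys = [0] * phys_regs                     # the single physical register file

    def rename_dest(self, arch):
        """Allocate a fresh physical register for a uop that writes `arch`."""
        new = self.free.pop(0)
        self.frontend_rat[arch] = new
        return new

    def retire(self, arch, phys):
        """Commit by repointing the Retirement RAT; the data itself is not moved."""
        old = self.retirement_rat[arch]
        self.retirement_rat[arch] = phys
        self.free.append(old)        # assumption: the old committed copy is recycled here


r = Renamer()
p = r.rename_dest(0)                  # a uop writes architectural register 0
r.phys[p] = 42                        # the result lands in the physical register
r.retire(0, p)                        # retirement only updates a pointer
print(r.phys[r.retirement_rat[0]])    # 42
```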

On-chip Caches
- L1 instruction cache (Trace Cache)
- L1 data cache
- L2 unified cache
- [Parameters table]
- The caches are not inclusive, and all use a pseudo-LRU replacement algorithm
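
Pseudo-LRU approximates true LRU with far less state per set. The slides do not say which variant or associativity is used, so the sketch below assumes the common tree-based scheme for a single 4-way set; `TreePLRU4` is a hypothetical name.

```python
class TreePLRU4:
    """Tree pseudo-LRU for one 4-way set, using 3 bits instead of a full LRU order."""

    def __init__(self):
        # bits[0]: which pair holds the next victim (0 = ways 0-1, 1 = ways 2-3)
        # bits[1], bits[2]: which way inside the left / right pair is the next victim
        self.bits = [0, 0, 0]

    def touch(self, way):
        """Record an access: point every bit on the path away from `way`."""
        pair, side = divmod(way, 2)
        self.bits[0] = 1 - pair
        self.bits[1 + pair] = 1 - side

    def victim(self):
        """Way to replace next, found by following the bits from the root."""
        pair = self.bits[0]
        return 2 * pair + self.bits[1 + pair]


plru = TreePLRU4()
for way in (0, 1, 2, 3):
    plru.touch(way)
print(plru.victim())   # 0 (matches true LRU here)
plru.touch(0)
print(plru.victim())   # 2 (true LRU would pick way 1)
```

The second result shows why the scheme is only "pseudo" LRU: it is cheap to update but can evict a line that is not the least recently used.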

L1 Instruction Cache
- The Execution Trace Cache stores decoded instructions
- Removes decoder latency from main execution loops
- Integrates the path of program execution flow into a single line

L1 Data Cache
- Nonblocking
  - Supports up to 4 outstanding load misses
- Load latency
  - 2 clocks for integer
  - 6 clocks for floating-point
- 1 load and 1 store per clock
- Speculative load
  - Assumes the access will hit the cache
  - “Replays” the dependent instructions when a miss happens

L2 Cache
- Load latency
  - Net load access latency of 7 cycles
- Nonblocking
- Bandwidth
  - One load and one store in one cycle
  - A new cache operation can begin every 2 cycles
  - 256-bit wide bus between L1 and L2
  - 48 Gbytes per second @ 1.5 GHz
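
The 48 Gbytes per second figure can be checked with a quick calculation. The sketch assumes the quoted peak corresponds to one full 256-bit transfer per core clock; the constant names are made up for the example.

```python
BUS_WIDTH_BITS = 256       # L1 <-> L2 bus width, from the slide
CORE_CLOCK_HZ = 1.5e9      # 1.5 GHz core clock

bytes_per_transfer = BUS_WIDTH_BITS // 8                      # 32 bytes
peak_bytes_per_second = bytes_per_transfer * CORE_CLOCK_HZ    # assumes one transfer per clock

print(peak_bytes_per_second / 1e9)   # 48.0
```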

Data Prefetcher in L2 Cache
- Hardware prefetcher monitors the reference patterns
- Brings in cache lines automatically
- Attempts to stay 256 bytes ahead of the current data access location
- Prefetches for up to 8 simultaneous independent streams

Store and Load
- Out-of-order store and load operations
  - Stores are always in program order
- 48 loads and 24 stores can be in flight
- Store buffers and load buffers are allocated at the allocation stage
  - Total of 24 store buffers and 48 load buffers

Store operations are divided into two parts:
- Store data
- Store address

- Store data is dispatched to the fast ALU, which operates twice per cycle
- Store address is dispatched to the store AGU, once per cycle

Store-to-Load Forwarding
- Forwards data from a pending store buffer to a dependent load
- Load stalls still happen when the bytes of the load operation are not exactly the same as the bytes in the pending store buffer
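
The forwarding rule can be sketched as a lookup over the pending stores. This is a simplification that follows the slide's wording literally (forward only on an exact byte match, stall on any other overlap); `resolve_load` and the tuple layout of `pending_stores` are hypothetical.

```python
def resolve_load(load_addr, load_size, pending_stores):
    """Decide how a load is satisfied.

    `pending_stores` is a list of (addr, size, data) tuples in program
    order, oldest first. Returns a (decision, data) pair.
    """
    load_bytes = set(range(load_addr, load_addr + load_size))
    for addr, size, data in reversed(pending_stores):     # check the youngest store first
        store_bytes = set(range(addr, addr + size))
        if store_bytes == load_bytes:
            return ("forward", data)    # exact match: take data from the store buffer
        if store_bytes & load_bytes:
            return ("stall", None)      # partial overlap: the load has to wait
    return ("cache", None)              # no dependence: read the L1 data cache


stores = [(0x100, 4, 0xDEADBEEF)]
print(resolve_load(0x100, 4, stores))   # ('forward', 3735928559)
print(resolve_load(0x102, 2, stores))   # ('stall', None)
print(resolve_load(0x200, 4, stores))   # ('cache', None)
```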

System Bus
- Delivers data at 3.2 Gbytes/s
- 64-bit wide bus
- Four data phases per clock cycle (quad pumped)
- 100 MHz clocked system bus
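
The 3.2 Gbytes/s figure follows from the other three numbers on the slide. A short check, with constant names chosen only for this example:

```python
BUS_WIDTH_BITS = 64          # data bus width
BUS_CLOCK_HZ = 100e6         # 100 MHz bus clock
TRANSFERS_PER_CLOCK = 4      # quad pumped: four data phases per clock

bus_bytes_per_second = (BUS_WIDTH_BITS // 8) * TRANSFERS_PER_CLOCK * BUS_CLOCK_HZ

print(bus_bytes_per_second / 1e9)   # 3.2
```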

Conclusion
- Reduced cache size vs. increased bandwidth and lower latency

What Went Wrong

No L3 cache
- Original plans called for a 1M cache
- Intel’s idea was to strap a separate memory chip, perhaps an SDRAM, on the back of the processor to act as the L3
- But that added another 100 pads to the processor, and would have also forced Intel to devise an expensive cartridge package to contain the processor and cache memory

Small L1 Cache
- Only 8k!
  - Doubled size of L2 cache to compensate
- Compare with
  - AMD Athlon: 128k
  - Alpha 21264: 64k
  - PIII: 32k
  - Itanium: 16k

Loses consistently to AMD
- In terms of performance, the Pentium 4 is as slow as or slower than existing Pentium III and AMD Athlon processors
- In terms of price, an entry-level Pentium 4 system sells for about double the cost of a similar Pentium III or AMD Athlon based system
- The 1.5 GHz clock rate is more hype than substance