
Intel Pentium 4 ENCM 515 - 2002 Jonathan Bienert Tyson Marchuk

Overview: • Product review • Specialized architectural features (NetBurst) • SIMD instruction capabilities (MMX, SSE2) • SHARC 2106x comparison

Intel Pentium 4 • Reworked micro-architecture for high-bandwidth applications • Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multimedia, and multi-tasking user environments • These are DSP-intensive applications! – What about uses other than in a PC?

Hardware Features: (NetBurst micro-architecture) • Hyper-pipelined technology • Advanced dynamic execution • Cache (data, L1, L2) • Rapid ALU execution engines • 400 MHz bus • Out-of-order execution (OOE) • Microcode ROM

Hyper Pipeline • 20-stage pipeline!!! • Breaks down complex CISC instructions – sub-stages mimic RISC – faster execution

Filling the pipeline... • Review of next 126 instructions to be executed • Branch prediction – on a mispredict, must flush the 20-stage pipeline!!! – branch target buffer (BTB) – 4K branch history table (BHT) – assembly instruction hints

Cache • 8 KB Data Cache • L1 Execution Trace Cache – 12K previously decoded micro-instructions stored – saves having to translate them again • L2 Advanced Transfer Cache – 256 KB for data – 256-bit transfer every cycle • allows ~77 GB/s data transfer at 2.4 GHz
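The 77 GB/s figure follows from the transfer width and the clock rate. A minimal sketch of that arithmetic, assuming one full-width transfer on every core clock (the helper name is made up for illustration and is not an Intel API):

```c
#include <assert.h>

/* Peak bytes per second for an interface that moves `bits_per_cycle`
 * bits on every clock at `clock_hz`. Hypothetical helper for this slide. */
double peak_bandwidth_bytes_per_s(unsigned bits_per_cycle, double clock_hz)
{
    assert(bits_per_cycle % 8 == 0);            /* whole bytes only */
    return (bits_per_cycle / 8.0) * clock_hz;   /* bytes/cycle * cycles/s */
}
```

Calling `peak_bandwidth_bytes_per_s(256, 2.4e9)` gives 32 bytes x 2.4 GHz = 76.8e9 bytes/s, which rounds to the ~77 GB/s quoted on the slide.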

Rapid ALU Execution Engines • 2 ALUs – allow parallel operations • Many arithmetic operations take 1/2 cycle – each 2x ALU can perform 2 operations per cycle

Software Features: • Multimedia Extensions (MMX) – 8 MMX registers • Streaming SIMD Extensions (SSE2) – 8 SSE/SSE2 registers • Standard x86 registers – EAX, EBX, ECX, EDX, ESI, etc. – register renaming to over 100 registers

MMX (Multimedia Extensions) • Accelerated performance through SIMD • Multimedia, communication, internet applications • 64-bit packed INTEGER data – signed/unsigned

SSE2 (Streaming SIMD Extensions) • Accelerates a broad range of applications – video, speech, image and photo processing, encryption, financial, engineering, and scientific applications • 128-bit SIMD instruction formats · 4 single-precision FP values · 2 double-precision FP values · 16 byte values · 8 word values · 4 double-word values · 2 quad-word values · 1 128-bit integer value

SIMD Example (16-tap FIR filter - real numbers) • Applications for real FIR filters – general-purpose filters in image processing, audio, and communication algorithms • Will utilize the SSE2 SIMD instruction set

Thinking about SIMD • SSE2 instruction format is 128 bits • 128-bit SSE2 registers • Many data formats! • What precision do we want? • Let's use 32-bit floating point for coefficients, input, output – 4 data values x 32 bits = 128 bits

Parallelizing • Requires many single multiplications (coefficients x inputs), then adding the results for each output! • Multiplications... • then need to perform additions...
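As a baseline for what is being parallelized, here is a minimal scalar sketch of one FIR output in C. The function name, the `TAPS` constant, and the treatment of samples before `x[0]` as zero are assumptions for illustration, not taken from the slides or from AP-809.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 16   /* matches the 16-tap example */

/* One output sample: y[n] = sum over k of coeff[k] * x[n - k].
 * Samples before x[0] are treated as zero (an assumption). */
float fir_output(const float coeff[TAPS], const float *x, size_t n)
{
    assert(coeff != NULL && x != NULL);
    float acc = 0.0f;
    for (size_t k = 0; k < TAPS; k++) {
        if (k <= n)                       /* skip samples before the start */
            acc += coeff[k] * x[n - k];   /* one multiply, one add per tap */
    }
    return acc;
}
```

Every tap costs one multiplication and one addition, and the 16 multiplications are independent of each other, which is what makes them candidates for SIMD.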

Using SSE2 format • Can hold 4 elements of an array (of 32-bit data) in each 128-bit register • 4 single-precision floating-point ops per cycle (32-bit)

Additions... • In both registers, now have 4 32-bit results – First add the results into an accumulator register • 4 single-precision floating-point ops per cycle (32-bit)
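The multiply-then-accumulate step can be modelled in portable C by treating each 128-bit register as an array of four floats. This is only a sketch of the idea, in the spirit of a packed multiply followed by a packed add (MULPS and ADDPS in the SSE instruction set). The function names are hypothetical, and `window` is assumed to hold the input samples already arranged to line up with the coefficients.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4   /* four 32-bit floats per 128-bit register */

/* Models a packed multiply followed by a packed add:
 * acc[i] += a[i] * b[i] in every lane. */
static void packed_mul_acc(float acc[LANES], const float a[LANES],
                           const float b[LANES])
{
    for (int i = 0; i < LANES; i++)
        acc[i] += a[i] * b[i];
}

/* Walks the taps four at a time and leaves four partial sums in acc. */
void fir_partial_sums(const float *coeff, const float *window,
                      size_t taps, float acc[LANES])
{
    assert(taps % LANES == 0);            /* whole registers only */
    for (int i = 0; i < LANES; i++)
        acc[i] = 0.0f;
    for (size_t k = 0; k < taps; k += LANES)
        packed_mul_acc(acc, coeff + k, window + k);
}
```

For a 16-tap filter the loop body runs four times. The accumulator then holds four partial sums that still have to be added together to give one output value.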

Additions... • In a register, now have 4 32-bit results – however, NO SSE2 instruction to add these 4! – But can use other instructions • Some BIT INTERTWINING... then add – This will give results for several output values!
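One way to read "intertwining, then add" is to interleave four accumulator registers (one per output sample) so that ordinary lane-wise adds finish four outputs at once. The sketch below models that in portable C. The helpers are hypothetical stand-ins for interleave and shuffle style instructions, and the exact instruction sequence used in AP-809 is not shown in the slides, so this is an assumption about the approach.

```c
#include <assert.h>

#define LANES 4

/* out = [a0, b0, a1, b1]: interleave the low halves of a and b. */
static void interleave_lo(float out[LANES], const float a[LANES],
                          const float b[LANES])
{
    out[0] = a[0]; out[1] = b[0]; out[2] = a[1]; out[3] = b[1];
}

/* out = [a2, b2, a3, b3]: interleave the high halves of a and b. */
static void interleave_hi(float out[LANES], const float a[LANES],
                          const float b[LANES])
{
    out[0] = a[2]; out[1] = b[2]; out[2] = a[3]; out[3] = b[3];
}

/* out = a + b, lane by lane (models a packed add). */
static void add4(float out[LANES], const float a[LANES], const float b[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = a[i] + b[i];
}

/* acc[j] holds the 4 partial sums of output j; on return,
 * out[j] = acc[j][0] + acc[j][1] + acc[j][2] + acc[j][3]. */
void reduce_four_accumulators(float acc[4][LANES], float out[LANES])
{
    float t0[LANES], t1[LANES], t2[LANES], t3[LANES];
    float s0[LANES], s1[LANES];
    assert(acc != 0 && out != 0);

    interleave_lo(t0, acc[0], acc[1]);
    interleave_hi(t2, acc[0], acc[1]);
    interleave_lo(t1, acc[2], acc[3]);
    interleave_hi(t3, acc[2], acc[3]);
    add4(s0, t0, t2);   /* lanes 0+2 and 1+3 of acc[0], acc[1] */
    add4(s1, t1, t3);   /* lanes 0+2 and 1+3 of acc[2], acc[3] */

    /* Gather the low pairs and the high pairs, then add once more. */
    float lo[LANES] = { s0[0], s0[1], s1[0], s1[1] };
    float hi[LANES] = { s0[2], s0[3], s1[2], s1[3] };
    add4(out, lo, hi);
}
```

The reduction uses three packed adds for four outputs, with no scalar additions needed.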

ADI SHARC 21k vs. P4: Disadvantages • Slower clock speed (40 MHz vs 2400 MHz) • Fewer opportunities for parallelism (5 vs 11) • Much less memory (cache and system) – Limited algorithm applicability – Limited applications • Older (less support – compiler) – 1994 vs 2001

ADI SHARC 21k vs. P4: Advantages • Hardware loops • Easier to program for optimal speed • Cheaper • Lower power consumption • Runs cooler

FIR Performance • Hard to obtain P4 performance numbers • Can estimate based on 2 FP multiplies per clock, the clock rate, and the assumption that the pipeline can be kept full – 2 * 2.4 GHz ~ 4.8 billion multiplies per second – If ~4 multiplies per element & 44000 samples/s – FIR length > ~25k taps • SHARC => ~200 taps (Lab 4) • Factor of ~125x
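The estimate above is a single division. A minimal sketch, using the slide's own inputs and a hypothetical helper name:

```c
#include <assert.h>

/* Longest FIR filter that can run in real time, given a sustained
 * multiply rate, the multiplies spent per tap, and the sample rate. */
double max_fir_taps(double multiplies_per_s, double multiplies_per_tap,
                    double sample_rate_hz)
{
    assert(multiplies_per_tap > 0.0 && sample_rate_hz > 0.0);
    return multiplies_per_s / (multiplies_per_tap * sample_rate_hz);
}
```

With `max_fir_taps(2 * 2.4e9, 4, 44000)` the result is about 27,000 taps, consistent with the "> ~25k" bound on the slide. Dividing the slide's 25k figure by the ~200 taps measured on the SHARC gives the ~125x factor.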

IIR Performance • Hard to obtain P4 performance numbers • No hardware circular buffers • Does have BTB, BHT, etc. • Prefetches ~256 bytes ahead of current position in code

FFT Performance • Hard to obtain P4 performance numbers • Prime95 uses FFTs to perform the Lucas-Lehmer test for Mersenne primes – Involves FFT, squaring and iFFT, etc. • 256k points on P4 2.3 GHz ~ 10.517 ms • Compare to SHARC 2048-point FFT ~0.37 ms • If SHARC could do 256k, 46.25 ms (But…)
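The 46.25 ms extrapolation corresponds to scaling the SHARC time linearly with the number of points (256000 / 2048 = 125). As a rough cross-check, the sketch below also scales by N log2 N, the usual operation count for a radix-2 FFT. Both helpers are hypothetical, and the N log2 N model is an assumption that was not used on the slide.

```c
#include <assert.h>

/* log2 of a power of two, by counting shifts. */
static unsigned log2_pow2(unsigned long n)
{
    unsigned bits = 0;
    while (n > 1) { n >>= 1; bits++; }
    return bits;
}

/* Scale a reference time linearly with the number of points. */
double scale_linear(double t_ref_ms, unsigned long n_ref, unsigned long n)
{
    assert(n_ref > 0);
    return t_ref_ms * ((double)n / (double)n_ref);
}

/* Scale a reference time by N*log2(N); both sizes must be powers of two. */
double scale_nlogn(double t_ref_ms, unsigned long n_ref, unsigned long n)
{
    assert(n_ref > 1 && n > 1);
    double work_ref = (double)n_ref * log2_pow2(n_ref);
    double work     = (double)n * log2_pow2(n);
    return t_ref_ms * work / work_ref;
}
```

`scale_linear(0.37, 2048, 256000)` reproduces the 46.25 ms on the slide. `scale_nlogn(0.37, 2048, 262144)` comes to roughly 77.5 ms, so under that model the gap to the P4's 10.517 ms would be wider than the linear estimate suggests.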

Optimization Example • Hard to optimize Pentium 4 assembly • Example of multiplying by a constant, 10 • Taken mainly from: www.emulators.com/docs/pentium_1.htm

Multiplying by 10 • Slowest way: – IMUL EAX, 10 • Usually optimal way (Visual C++ 6.0) – LEA EAX, [EAX+EAX*4] (shift, add) – SHL EAX, 1 (shift) – On most x86 processors takes 2 cycles – On Pentium MMX and before takes 3 cycles – On Pentium 4 takes 6 cycles!

Multiplying by 10 • Optimal for Pentium 4 – LEA ECX, [EAX+EAX] – LEA EAX, [ECX+EAX*8] – On most x86 still takes 2 cycles – On Pentium 4 takes ~3 cycles (OOE - Ops) – But on older processors (Pentium MMX and before) this now takes 4 cycles!

Multiplying by 10 • Best generic case – LEA EAX, [EAX+EAX*4] – ADD EAX, EAX – On most x86 still takes 2 cycles – On older processors (Pentium MMX and before) this takes 3 cycles again – On Pentium 4 this takes 4 cycles • Obviously really hard to optimize
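All three instruction sequences compute the same value and differ only in timing. A minimal C sketch that mirrors each sequence with 32-bit unsigned arithmetic, which wraps the way a 32-bit register does. The function names are made up for illustration, and a C compiler is free to emit different instructions for them.

```c
#include <assert.h>
#include <stdint.h>

/* IMUL EAX, 10 */
uint32_t mul10_imul(uint32_t x) { return x * 10u; }

/* LEA EAX, [EAX+EAX*4] ; SHL EAX, 1  ->  (x + 4x) << 1 */
uint32_t mul10_lea_shl(uint32_t x)
{
    uint32_t t = x + x * 4u;
    return t << 1;
}

/* LEA ECX, [EAX+EAX] ; LEA EAX, [ECX+EAX*8]  ->  2x + 8x */
uint32_t mul10_two_lea(uint32_t x)
{
    uint32_t c = x + x;
    return c + x * 8u;
}

/* LEA EAX, [EAX+EAX*4] ; ADD EAX, EAX  ->  5x + 5x */
uint32_t mul10_lea_add(uint32_t x)
{
    uint32_t t = x + x * 4u;
    return t + t;
}

/* Checks that every variant agrees with the plain multiply. */
void check_mul10(uint32_t x)
{
    assert(mul10_lea_shl(x) == mul10_imul(x));
    assert(mul10_two_lea(x) == mul10_imul(x));
    assert(mul10_lea_add(x) == mul10_imul(x));
}
```

Because the results are identical, the choice between the sequences comes down purely to cycle counts on the target processor, which is the point of the slides above.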

REFERENCES • Intel application note: AP-809 - Real and Complex Filter Using Streaming SIMD Extensions • Graphics from: http://www6.tomshardware.com/cpu/00q4/001120/p4-01.html