The Alpha 21364 and 21464 Microprocessors: Continuing the Performance Lead Beyond Y2K

Shubu Mukherjee, Ph.D.
Principal Hardware Engineer
VSSAD Labs, Alpha Development Group
Compaq Computer Corporation
Shrewsbury, Massachusetts

Slides: 1998 Microprocessor Forum (Peter Bannon) and 1999 Microprocessor Forum (Joel Emer)

Alpha Microprocessor Roadmap
  • Higher performance: 21264 EV6 (0.35 µm), 21364 EV7 (0.18 µm), 21464 EV8 (0.125 µm)
  • Lower cost: 21264 EV67 (0.28 µm), 21264 EV68 (0.18 µm), 21364 EV78 (0.125 µm)
  • Time axis: first system ship, 1998 to 2003

Alpha 21264 Microprocessor
  • Architectural features
    - First “out-of-order” Alpha
    - Four-wide superscalar
    - …
  • Performance
    - World’s fastest microprocessor (www.spec.org, 11/17/99)
    - 39 SPECint95, 68 SPECfp95 @ 700 MHz
    - Intel Pentium III @ 733 MHz delivers 36 SPECint95, 30 SPECfp95

Alpha 21364 Goals
  • Leadership single-stream performance
    - Higher operating frequency
    - Integrated memory interface
  • Leadership multiprocessor performance
    - Integrated system / multiprocessor interface

Alpha 21364 Features
  • System-on-a-chip
    - Alpha 21264 core with enhancements
    - Integrated L2 cache
    - Integrated memory controller
    - Integrated network interface
  • Fault tolerance
    - Support for lock-step operation to enable high-availability systems

21364 Chip Block Diagram
[Figure: 21264 core with 64 KB Icache and 64 KB Dcache; 16 L1 miss buffers; 16 L1 victim buffers; L2 cache; 16 L2 victim buffers; memory controller with RAMBUS interface; address in and address out paths; network interface with N, S, E, W, and I/O ports.]

21364 Core
[Figure: pipeline diagram, stages 0 to 6: FETCH, MAP, QUEUE, REG, EXEC, DCACHE.]
  • Fetch: branch predictors, next-line address, L1 instruction cache (64 KB, 2-way set associative), 4 instructions / cycle
  • Integer: int reg map, int issue queue (20 entries), two register files (80 entries each), Exec and Addr units
  • Floating point: FP reg map, FP issue queue (15 entries), register file (72 entries), FP ADD, Div/Sqrt, FP MUL
  • Memory: L1 data cache (64 KB, 2-way set associative), L2 cache (1.5 MB, 6-way set associative), victim buffer, miss address
  • 80 in-flight instructions plus 32 loads and 32 stores

Integrated L2 Cache
  • 1.5 MB
  • 6-way set associative
  • 16 GB/s total read/write bandwidth
  • 16 victim buffers for L1 -> L2
  • 16 victim buffers for L2 -> memory
  • ECC SECDED code
  • 12 ns load-to-use latency
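
Capacity and associativity fix the number of cache sets once a line size is chosen. The following is a minimal sketch of that arithmetic. The slide does not give the L2 line size, so the 64-byte value is an assumption, and `cache_sets` is a hypothetical helper name.

```python
def cache_sets(capacity_bytes: int, ways: int, line_bytes: int) -> int:
    """Return the number of sets in a set-associative cache."""
    sets, remainder = divmod(capacity_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


L2_CAPACITY = 1536 * 1024  # 1.5 MB, from the slide
L2_WAYS = 6                # 6-way set associative, from the slide
LINE_BYTES = 64            # ASSUMPTION: the slide does not state the line size

l2_sets = cache_sets(L2_CAPACITY, L2_WAYS, LINE_BYTES)
print(l2_sets)  # 4096
```

With six ways, each way holds 256 KB, so the set count stays a power of two for any power-of-two line size even though the total capacity is not one.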

Integrated Memory Controller
  • Direct RAMbus
    - High data capacity per pin
    - 800 MHz operation
    - 30 ns CAS latency pin to pin
  • 6 GB/s read or write bandwidth
  • 100s of open pages
  • Directory-based cache coherence
  • ECC SECDED
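
SECDED (single error correct, double error detect) is commonly built as a Hamming code plus one overall parity bit. The sketch below counts the check bits such a code needs for a given data width. The slides do not say which word width the 21364 protects or how its code is constructed, so the widths in the loop and the helper name `secded_check_bits` are illustrative assumptions.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming SEC code plus one overall parity bit."""
    if data_bits < 1:
        raise ValueError("data_bits must be positive")
    r = 1
    # Single-error correction needs 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # the extra parity bit adds double-error detection


# ASSUMPTION: these widths are examples, not taken from the slides.
for width in (8, 16, 32, 64, 128):
    print(width, secded_check_bits(width))
```

For a 64-bit data word this gives 8 check bits, which is why 72-bit-wide ECC memory words are common.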

Integrated Network Interface
  • Direct processor-to-processor interconnect
  • 10 GB/s per processor
  • 15 ns processor-to-processor latency
  • Out-of-order network with adaptive routing
  • Asynchronous clocking between processors
  • 3 GB/s I/O interface per processor

21364 System Block Diagram
[Figure: array of 21364 processors ("364"), each attached to its own memory (M) and I/O (IO).]
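
The chip block diagram lists N, S, E, and W network ports, which suggests processors arranged in a two-dimensional grid. The sketch below computes which grid position each port would reach. The grid dimensions, the coordinate convention, and the wraparound links (`wrap=True`) are assumptions made for illustration; the slides do not specify the topology beyond the four port names.

```python
from typing import Dict, Tuple

# Port name -> (dx, dy) step. ASSUMPTION: north is decreasing y.
STEPS = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}


def neighbors(x: int, y: int, cols: int, rows: int,
              wrap: bool = True) -> Dict[str, Tuple[int, int]]:
    """Map each port to the grid position it connects to.

    With wrap=True, edge links wrap around to the opposite side (an assumed
    topology); with wrap=False, ports that point off the grid are omitted.
    """
    result = {}
    for port, (dx, dy) in STEPS.items():
        nx, ny = x + dx, y + dy
        if wrap:
            nx, ny = nx % cols, ny % rows
        elif not (0 <= nx < cols and 0 <= ny < rows):
            continue
        result[port] = (nx, ny)
    return result


print(neighbors(0, 0, cols=4, rows=2))
print(neighbors(0, 0, cols=4, rows=2, wrap=False))
```

Because every processor carries its own router ports, adding a processor also adds links, so aggregate network bandwidth grows with the system rather than being shared on a bus.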

Alpha 21364 Technology
  • 0.18 µm CMOS
  • 1000+ MHz
  • 100 watts @ 1.5 volts
  • 3.5 cm²
  • 6-layer metal
  • 100 million transistors
    - 8 million logic
    - 92 million RAM

Alpha 21364 Status
  • 70 SPECint95 (estimated)
  • 120 SPECfp95 (estimated)
  • RTL model running
  • Tapeout: Summer 2000

21364 Summary: System on a Chip
  • Integrated L2 cache and memory controller
    - Outstanding single-processor performance
  • Integrated network interface
    - High-performance multiprocessor systems
    - Scales to a large number of processors

Alpha Microprocessor Overview
  • Higher performance: 21264 EV6 (0.35 µm), 21364 EV7 (0.18 µm), 21464 EV8 (0.125 µm)
  • Lower cost: 21264 EV67 (0.28 µm), 21264 EV68 (0.18 µm), 21364 EV78 (0.125 µm)
  • Time axis: first system ship, 1998 to 2003

Alpha 21464 Goals
  • Leadership single-stream performance
    - Higher operating frequency / better technology
    - New microarchitecture
    - Integrated memory interface (like 21364)
  • Leadership multiprocessor performance
    - Simultaneous multithreading (with minimal change/cost)
    - Integrated system / multiprocessor interface (like 21364)

Alpha 21464 Technology Overview
  • Leading-edge process technology: 1.2-2.0 GHz
    - 0.125 µm CMOS
    - SOI-compatible
    - Cu interconnect
    - Low-k dielectrics
  • Chip characteristics
    - ~1.2 V Vdd
    - ~250 million transistors

Alpha 21464 Architecture Overview
  • Enhanced out-of-order execution
  • 8-wide superscalar
  • Large on-chip L2 cache
  • Direct RAMBUS interface
  • On-chip router for system interconnect
  • Glueless, directory-based ccNUMA
    - For up to 512-way multiprocessing
  • 4-way simultaneous multithreading (SMT)

Instruction Issue
[Figure: function-unit issue slots over time.]
Reduced function unit utilization due to dependencies

Superscalar Issue
[Figure: function-unit issue slots over time.]
Superscalar leads to more performance, but lower utilization

Predicated Issue
[Figure: function-unit issue slots over time.]
Adds to function unit utilization, but results are thrown away

Chip Multiprocessor
[Figure: function-unit issue slots over time.]
Limited utilization when only running one thread

Fine-Grained Multithreading
[Figure: function-unit issue slots over time.]
Intra-thread dependencies still limit performance

Simultaneous Multithreading
[Figure: function-unit issue slots over time.]
Maximum utilization of function units by independent operations
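
The utilization argument in the issue-slot slides can be illustrated with a toy model. In the sketch below, `READY[t][c]` is the number of independent instructions thread `t` could issue in cycle `c`. These numbers are made-up illustration data, not measurements, and `WIDTH = 4` is an arbitrary choice for the example. The model is stateless: instructions that do not issue in a cycle are not carried over, which a real pipeline would do.

```python
from typing import List

WIDTH = 4  # issue slots per cycle (illustrative value)

# ASSUMPTION: invented ready-instruction counts, one row per thread.
READY: List[List[int]] = [
    [2, 1, 0, 3],
    [1, 2, 2, 0],
    [0, 1, 3, 1],
    [2, 0, 1, 2],
]


def utilization(issued: List[int], width: int) -> float:
    """Fraction of issue slots filled over the whole run."""
    return sum(issued) / (width * len(issued))


def superscalar(ready: List[List[int]], width: int) -> List[int]:
    """Only thread 0 runs; its dependencies leave slots empty."""
    return [min(width, n) for n in ready[0]]


def fine_grained(ready: List[List[int]], width: int) -> List[int]:
    """One thread owns all slots each cycle, chosen round-robin."""
    threads = len(ready)
    return [min(width, ready[c % threads][c]) for c in range(len(ready[0]))]


def smt(ready: List[List[int]], width: int) -> List[int]:
    """All threads compete for the same slots within a cycle."""
    return [min(width, sum(t[c] for t in ready)) for c in range(len(ready[0]))]


for name, model in (("superscalar", superscalar),
                    ("fine-grained", fine_grained),
                    ("SMT", smt)):
    print(name, utilization(model(READY, WIDTH), WIDTH))
```

On this data the single-thread superscalar fills 37.5% of the slots, fine-grained multithreading fills 56.25%, and SMT fills all of them, because slots left empty by one thread's dependencies are taken by independent instructions from the others.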

Basic Out-of-order Pipeline
[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; structures PC, Icache, Register Map, Regs, Dcache, Regs; part of the pipeline is labeled "thread-blind".]

SMT Pipeline
[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; structures PC, Icache, Register Map, Regs, Dcache, Regs.]

Changes for SMT
  • Basic pipeline: unchanged
  • Replicated resources
    - Program counters
    - Register maps
  • Shared resources
    - Register file (size increased)
    - Instruction queue
    - First- and second-level caches
    - Translation buffers
    - Branch predictor
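
The split between replicated and shared resources can be sketched as a data structure: each hardware thread gets its own program counter and register map, while every thread allocates from one shared physical register file. The class names, the free-list allocation policy, and the 16-register size below are hypothetical choices for illustration, not details from the slides. The sketch also never frees a previous mapping, which a real renamer does at retirement.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ThreadContext:
    """Replicated per-thread state: program counter and register map."""
    pc: int = 0
    register_map: Dict[int, int] = field(default_factory=dict)  # arch -> physical


@dataclass
class SmtCore:
    """Shared physical registers plus one ThreadContext per hardware thread."""
    num_threads: int
    physical_registers: int
    threads: List[ThreadContext] = field(init=False)
    free_list: List[int] = field(init=False)

    def __post_init__(self) -> None:
        self.threads = [ThreadContext() for _ in range(self.num_threads)]
        self.free_list = list(range(self.physical_registers))

    def rename_dest(self, thread_id: int, arch_reg: int) -> int:
        """Give one thread's destination register a shared physical register."""
        if not self.free_list:
            raise RuntimeError("no free physical registers")
        phys = self.free_list.pop(0)
        self.threads[thread_id].register_map[arch_reg] = phys
        return phys


core = SmtCore(num_threads=4, physical_registers=16)  # sizes are illustrative
p0 = core.rename_dest(0, arch_reg=1)
p1 = core.rename_dest(1, arch_reg=1)
print(p0, p1)  # the same architectural register maps to different physical ones
```

Since the per-thread maps already translate names into the shared file, the stages after mapping can operate on physical register numbers without knowing which thread an instruction belongs to.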

Multiprogrammed Workload

Decomposed SPEC95 Applications

Multithreaded Applications

Architectural Abstraction
  • 1 processor with 4 Thread Processing Units (TPUs)
  • Shared hardware resources
[Figure: TPU 0, TPU 1, TPU 2, and TPU 3 sharing the Icache, TLB, Dcache, and Scache.]

21464 System Block Diagram
[Figure: array of EV8 processors, each attached to its own memory (M) and I/O (IO); one processor is labeled "0123".]

Alpha 21464 Summary
  • Leadership single-stream performance
    - Higher operating frequency / better technology
    - New microarchitecture
    - Integrated memory interface (like 21364)
  • Leadership multiprocessor performance
    - Simultaneous multithreading (with minimal changes/cost)
    - Integrated system / multiprocessor interface (like 21364)

Maintain Performance Lead Beyond Y2K
  • Alpha 21364
    - Reuses 21264 microprocessor core
    - System on a chip
  • Alpha 21464
    - New microarchitecture
    - System on a chip
    - Simultaneous multithreading

My Current Research: Beyond 21464?
  • The Truth Project (w/ Joel Emer)
    - Examines different microarchitectural issues
  • The Multinet Project (w/ Rick Kessler)
    - Tightly coupled multiprocessor networks
  • The Reliant Project (w/ Steve Reinhardt)
    - Self-checking microprocessors using SMT, ISCA submission
  • Asim (w/ VSSAD Labs)
    - Performance model for Alphas beyond 21464