The Cray X1 Architecture
Steve Scott, Cray X1 Chief Architect
Cray Proprietary (expired)

Slide 2: Cray’s Computing Vision
Scalable High-Bandwidth Computing
[Roadmap figure. Labels: X1; X1E; Red Storm; RS; ‘Strider 2’; ‘Strider 3’; ‘Strider X’; ‘Black Widow 2’; ‘Cascade’; Product Integration; Sustained Petaflops. Years shown: 2004, 2005, 2006, 2010.]
Cray X1 Overview, Cray Proprietary (expired)

Slide 3: Cray X1
Cray PVP
• Powerful vector processors
• Very high memory bandwidth
• Non-unit stride computation
• Special ISA features
T3E
• Extreme scalability
• Optimized communication
• Memory hierarchy
• Synchronization features
Cray X1
• Modernized the ISA
• Improved via vectors
Extreme scalability with high bandwidth vector processors

Slide 4: Cray X1 Instruction Set Architecture
New ISA
– Much larger register set (32 x 64 vector, 64+64 scalar)
– All operations performed under mask
– 64- and 32-bit memory and IEEE arithmetic
– Integrated synchronization features
Advantages of a vector ISA
– Compiler provides useful dependence information to hardware
– Very high single processor performance with low complexity
  ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr)
– Localized computation on processor chip
  – large register state with very regular access patterns
  – registers and functional units grouped into local clusters (pipes)
  excellent fit with future IC technology
– Latency tolerance and pipelining to memory/network
  very well suited for scalable systems
– Easy extensibility to next generation implementation
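The ops/sec identity on this slide can be tried out with a few lines of Python. This is only an illustrative sketch: the clock rate, issue width, and the choice of 64 operations per vector instruction are hypothetical numbers picked for the example, not figures stated in the slides.

```python
def ops_per_second(cycles_per_sec: float, instrs_per_cycle: float, ops_per_instr: float) -> float:
    """ops/sec = (cycles/sec) * (instrs/cycle) * (ops/instr), as written on the slide."""
    return cycles_per_sec * instrs_per_cycle * ops_per_instr


# Hypothetical scalar design: 1 GHz, 4 instructions per cycle, 1 operation each.
scalar_rate = ops_per_second(1e9, 4, 1)

# Hypothetical vector design: same clock, 1 instruction per cycle,
# but each instruction covers 64 element operations.
vector_rate = ops_per_second(1e9, 1, 64)

print(f"scalar: {scalar_rate:.3g} ops/s, vector: {vector_rate:.3g} ops/s")
```

Under these assumed numbers the vector design reaches a higher rate through the ops/instr term alone, without raising the issue width, which is the "low complexity" point the slide makes.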

Slide 5: Not Your Father’s Vector Machine
• New instruction set
• New system architecture
• New processor microarchitecture
• “Classic” vector machines were programmed differently
– Classic vector: Optimize for loop length with little regard for locality
– Scalable micro: Optimize for locality with little regard for loop length
• The Cray X1 is programmed like a parallel micro-based machine
– Rewards locality: register, cache, local memory, remote memory
– Decoupled microarchitecture performs well on short loops

Slide 6: Cray X1 Node
[Node diagram. Labels: P, $, M, mem, IO]
51 Gflops, 200 GB/s
• Four multistream processors (MSPs), each 12.8 Gflops
• High bandwidth local shared memory (128 Direct Rambus channels)
• 32 network links and four I/O links per node
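The node figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted on the slide; the bytes-per-flop ratio is a derived quantity that the slide itself does not state.

```python
MSPS_PER_NODE = 4            # "Four multistream processors (MSPs)"
GFLOPS_PER_MSP = 12.8        # "each 12.8 Gflops"
NODE_BANDWIDTH_GB_S = 200.0  # "200 GB/s"

# Peak node rate: 4 * 12.8 = 51.2 Gflops, which the slide rounds to 51.
node_peak_gflops = MSPS_PER_NODE * GFLOPS_PER_MSP

# Derived ratio (not on the slide): local memory bytes available per peak flop.
bytes_per_flop = NODE_BANDWIDTH_GB_S / node_peak_gflops

print(f"node peak: {node_peak_gflops:.1f} Gflops")
print(f"memory balance: {bytes_per_flop:.2f} bytes/flop")
```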

Slide 7: NUMA Scalable up to 1024 Nodes
[Diagram label: Interconnection Network]
• 16 parallel networks for bandwidth
• Global shared memory across machine

Slide 8: Designed for Scalability
• Distributed shared memory (DSM) architecture
– Low latency, load/store access to entire machine (tens of TBs)
• Decoupled vector memory architecture for latency tolerance
– Thousands of outstanding references, flexible addressing
• Very high performance network
– High bandwidth, fine-grained transfers
– Same router as Origin 3000, but 16 parallel copies of the network
• Architectural features for scalability
– Remote address translation
– Global coherence protocol optimized for distributed memory
– Fast synchronization
• Parallel I/O scales with system size

Slide 9: Network Topology (16 CPUs)
[Diagram. Labels: node 0 through node 3, each with P chips and M chips M0, M1, ... M15; Section 0, Section 1, ... Section 15]

Slide 10: Network Topology (128 CPUs)
[Diagram. Labels: R]

Slide 11: Network Topology (512 CPUs)
[Diagram. Label: one chassis]

Slide 12: Network Topology (1024 CPUs)
[Diagram. Label: one chassis]

Slide 13: Some Design Challenges
• How do we tolerate ever growing memory latencies?
• How do we support address translations and cache coherence with high bandwidth vector processors?

Slide 14: Decoupled Vector Microarchitecture
• Decoupled access/execute and decoupled scalar/vector
• Scalar unit runs ahead, doing addressing and control
– Scalar and vector loads issued early
– Store addresses computed early, saved for later use
– Operations queued and executed later when data arrives
• Hardware dynamically unrolls loops
– Scalar starts on next loop before current loop has completed
– Memory pipeline stays full of requests
– Special sync operations keep pipeline full, even across barriers
This is key to making the system perform well on short loop nests
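To see why issuing loads early matters, a toy timing model helps. The sketch below is not a model of the X1 hardware: it assumes a fixed memory latency, one request issued per cycle, and no limit on outstanding requests, and the function names and the numbers 64 and 100 are made up for illustration.

```python
def coupled_cycles(n_loads: int, latency: int) -> int:
    """Each load is issued only after the previous load's data has returned."""
    return n_loads * latency


def decoupled_cycles(n_loads: int, latency: int) -> int:
    """Loads are issued back to back, one per cycle; the last data item
    returns `latency` cycles after the last request is issued."""
    if n_loads == 0:
        return 0
    return (n_loads - 1) + latency


# Hypothetical short loop: 64 loads, 100-cycle memory latency.
print(coupled_cycles(64, 100))    # loads wait on each other
print(decoupled_cycles(64, 100))  # latency is overlapped across loads
```

In this simplified model the decoupled case pays the memory latency once rather than once per load, which is the effect the slide attributes to keeping the memory pipeline full of requests.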

Slide 15: Maintaining Decoupling Past Synchronization Points
Msync control and data barrier:
P0 .... St Vx .... Msync .... Ld
P1 .... St Vx .... Msync .... Ld
P2 .... St Vx .... Msync .... Ld
P3 .... St Vx .... Msync .... Ld
Want to protect against hazards, but not drain memory pipeline.
Vector store addresses computed early (before data is available):
– sent out to the shared L2 cache, where they modify cache state
– later loads from other P chips can now be performed
– loads only wait if there is a true conflict
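The "loads only wait if there is a true conflict" rule can be sketched as a small bookkeeping model. This is a hedged illustration only: the class and method names are hypothetical, addresses are treated as opaque integers, and real hardware would track cache lines and ordering in far more detail.

```python
class SharedCacheModel:
    """Toy model: the shared cache learns store addresses before the data arrives."""

    def __init__(self) -> None:
        self.pending_store_addrs: set[int] = set()

    def announce_store(self, addr: int) -> None:
        # Store address computed early and sent ahead of the data.
        self.pending_store_addrs.add(addr)

    def complete_store(self, addr: int) -> None:
        # Store data has arrived; the address is no longer pending.
        self.pending_store_addrs.discard(addr)

    def load_must_wait(self, addr: int) -> bool:
        # A later load stalls only if it targets an address with a pending store.
        return addr in self.pending_store_addrs


cache = SharedCacheModel()
cache.announce_store(0x100)
print(cache.load_must_wait(0x100))  # conflicting load has to wait
print(cache.load_must_wait(0x200))  # unrelated load can proceed
```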

Slide 16: Address Translation
• High translation bandwidth:
– scalar + four vector translations per cycle per P chip
• Remote (hierarchical) translation:
– allows each node to manage its own memory (eases memory mgmt.)
– TLB only needs to hold translations for one node, so it scales
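One way to picture remote (hierarchical) translation is a two-step lookup: the requesting side only extracts the target node from the address, and the target node translates the rest with its own table. The sketch below assumes a made-up address layout (`NODE_SHIFT`, `PAGE_SIZE`) and hypothetical class and function names; the slides do not give the actual bit fields or table format.

```python
NODE_SHIFT = 38       # hypothetical: node number lives in the bits above bit 38
PAGE_SIZE = 1 << 16   # hypothetical page size


class Node:
    """Each node owns a table covering only its own memory."""

    def __init__(self, page_table: dict[int, int]) -> None:
        self.page_table = page_table  # node-local virtual page -> physical page

    def translate_local(self, local_vaddr: int) -> int:
        page, offset = divmod(local_vaddr, PAGE_SIZE)
        return self.page_table[page] * PAGE_SIZE + offset


def translate(vaddr: int, nodes: dict[int, Node]) -> tuple[int, int]:
    """Step 1: pick the home node from the address. Step 2: that node translates."""
    node_id = vaddr >> NODE_SHIFT
    local_vaddr = vaddr & ((1 << NODE_SHIFT) - 1)
    return node_id, nodes[node_id].translate_local(local_vaddr)


nodes = {0: Node({0: 5}), 3: Node({2: 7})}
vaddr = (3 << NODE_SHIFT) | (2 * PAGE_SIZE + 12)
print(translate(vaddr, nodes))
```

In this sketch no node ever needs another node's page table, which mirrors the slide's point that a TLB only has to hold translations for one node.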

Slide 17: Cache Coherence
• Global coherence, but only cache memory from local node
– Supports SMP-style codes up to 4 MSPs
– References outside this domain converted to non-allocate
• Scalable codes use explicit communication anyway
– Keeps directory entry and protocol simple
– Significant reliability benefits for large scale systems
• Explicit cache allocation control
– Per instruction hints
– Use non-allocating refs to avoid cache pollution
• Coherence directory stored on the M chips (rather than in DRAM)
– Low latency and really high bandwidth to support vectors
• Factor of several hundred over typical DRAM-based directory
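The allocation rules on this slide reduce to a small decision function. The following is a sketch under stated assumptions: the function name and parameters are hypothetical, and it captures only the two rules the slide mentions (remote references become non-allocating, and an instruction can carry a non-allocate hint).

```python
def allocates_in_cache(home_node: int, local_node: int, no_allocate_hint: bool = False) -> bool:
    """Decide whether a memory reference may allocate a line in the local cache."""
    if home_node != local_node:
        # Only memory from the local node is cached; references outside
        # that domain are converted to non-allocating references.
        return False
    # Local references allocate unless the instruction hints otherwise.
    return not no_allocate_hint


print(allocates_in_cache(home_node=2, local_node=2))                         # local, cacheable
print(allocates_in_cache(home_node=5, local_node=2))                         # remote, non-allocate
print(allocates_in_cache(home_node=2, local_node=2, no_allocate_hint=True))  # hinted, non-allocate
```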

Mechanical Design

Slide 19: Multi-Chip Module
• 8 - 7S IBM ICs & 80 Decoupling Capacitors
• 72 mm x 8.3 mm
• 3832 LGA Pads on BSM
• 34,000 C4 Pads on TSM
• 173,000 mm of Routing, a Routing Density of 3300 mm per Square cm
• 18 Plane Pairs of X & Y Routing in Ceramic
• 83 layer, Glass Ceramic/Copper Conductor/Mesh Construction

Slide 20: Compliant Interconnect
• Resistance of 0.015 m ohms @ 0.75 amps
• 3832 Contacts on 1-mm Spacing
• Force of 40 grams (153 kg per MCM)
• Alignment is by Socket Spring/Fence Centering
• Inductance of < 1.5 nH @ 500-1000 MHz
• Compliance of 12 mils
• Non-Yielding Re-Mateable Interface (100 times)
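The per-MCM force quoted on this slide follows from the contact count and the per-contact force. A quick check, using only the slide's numbers:

```python
CONTACTS_PER_MCM = 3832       # "3832 Contacts"
FORCE_PER_CONTACT_GRAMS = 40  # "Force of 40 grams"

# Total clamping force per MCM, converted from grams to kilograms.
total_force_kg = CONTACTS_PER_MCM * FORCE_PER_CONTACT_GRAMS / 1000

print(f"{total_force_kg:.2f} kg per MCM")  # slide rounds this to 153 kg
```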

Slide 21: Printed Circuit Board
• 558 mm x 431 mm and 3.5 mm
• 17,000 Nets and 99,000 Buried & Thru Vias
• 1 mm Via Pitch, Two Routes / Channel
• 3,600,000 mm of Routing (1500 mm per square cm)
• 34 layers - 16 Power and Ground, 16 Signal, & 2 Surface
• 3 mil Traces on 4 and 8 mil Pitch
• 100 Ω impedance - Differential
• 45 Ω impedance - Single Ended
• Distributes 2000 Amps DC Current
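The routing density on this slide can be reproduced from the board dimensions and total routing length. The sketch below assumes the density is total routing length divided by the board's surface area, which is how the quoted 1500 mm per square cm appears to be derived.

```python
BOARD_LENGTH_MM = 558
BOARD_WIDTH_MM = 431
TOTAL_ROUTING_MM = 3_600_000

# Board area in square centimetres (1 cm = 10 mm).
board_area_cm2 = (BOARD_LENGTH_MM / 10) * (BOARD_WIDTH_MM / 10)

# Routing length per unit of board area.
routing_density = TOTAL_ROUTING_MM / board_area_cm2

print(f"area: {board_area_cm2:.1f} cm^2, density: {routing_density:.0f} mm per cm^2")
```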

Slide 22: Spray Cap Assembly
Heat Flux of ICs on the MCM:
– P Chip Heat Flux - 45 W/cm²
– E Chip Heat Flux - 15 W/cm²
• IC Junction Temperature of 85 °C @ Heat Flux Density up to 70 W/cm²
• Flow Rate is 1 ml/W/min @ Pressure Differential of 25 psig
• Evaporation Efficiency of ~25%
• Maintain Component Junction Temperatures @ 75 °C +/- 10 °C
• Fluorinert™ FC-72, the liquid coolant, is atomized and sprayed onto the ICs to maintain a continuously wetted surface
[Diagram labels: O-Ring Seal, Mixed Vapor Return, Fluid Inlet]

Slide 23: Power Converter
• Synchronous Rectification
• 1,000 hour MTBF
• Input Voltage - 48 Volt
• Electronic Inrush Control
• Current-Share
• Margins
• Enable Features
• Conduction Cooled
• Power Density of 16 watts/in³
• 190 amp output current @ 1.8 volts (125 amps @ 2.5 volts) with efficiencies of 80% at max. output

Slide 24: Memory Daughter Card
• Forced Convection Cooled Via a Heat Spreader (θja = 16 °C/W @ 1200 fpm)
• 16 RDRAM® (Rambus®) Memory Chips
• 1.25 CFM of Air per Daughter Card
• Supports Multiple Memory Chip Densities
• Connects to the PCB Via a Custom 294-Pin Connector

Slide 25: System Packaging
• LC Cabinets
– All System Heat is Rejected to Facility Water
– 1 to 16 Node Modules per LC Cabinet
– Up to 8 Cabinets, 512 CPUs, standard hypercube
– Up to 64 Cabinets, 4,096 CPUs, mesh topology
• AC Cabinets
– All System Heat is Rejected to Facility Air
– 1 to 4 Node Modules (16 CPUs) per AC Cabinet
– Maximum 4 Cabinets, 64 CPUs
• 50 Hz & 60 Hz Power Supported
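The CPU counts on this slide are consistent with four CPUs per node module, which is implied by "1 to 4 Node Modules (16 CPUs)". A short check, with a hypothetical helper name:

```python
CPUS_PER_NODE_MODULE = 4  # implied by "4 Node Modules (16 CPUs)"


def max_cpus(cabinets: int, node_modules_per_cabinet: int) -> int:
    """Maximum CPU count for a configuration of fully populated cabinets."""
    return cabinets * node_modules_per_cabinet * CPUS_PER_NODE_MODULE


print(max_cpus(8, 16))   # LC cabinets, hypercube limit
print(max_cpus(64, 16))  # LC cabinets, mesh limit
print(max_cpus(4, 4))    # AC cabinets, maximum configuration
```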

Slide 26: System Specifications
             LC Cabinet                  AC Cabinet
• Power      82 kW, 208 V 3Ø             20 kW, 208 V 3Ø
• Cooling    40-45 GPM                   2000 CFM
• Footprint  43.5 in x 84 x 82.25 in     32 in x 48 in x 82 in
• Weight     3500 lbs                    1200 lbs

Slide 27: Liquid Cooled Cabinet
[Cabinet diagram. Labels: Cable Routing; Card Cage & Connectors; Router Modules; Power Distribution Bus; Node Modules; Damper Doors; Incoming Power Box; Power Supplies; Heat Exchanger; Blower Assembly; FC-72 Gear Pumps; FC-72 Filters]

Slide 28: Air Cooled Cabinet
[Cabinet diagram. Labels: Cable Routing; Card Cage & Connectors; Router Modules; Node Modules; Blower Assembly; Heat Exchanger; Incoming Power Box; FC-72 Filters; Power Supplies]

Slide 29: Node board
[Board photo. Labels: Field-Replaceable Memory Daughter Cards; Spray Cooling Caps; CPU MCM 8 chips; Network Interconnect; PCB 17" x 22"; Air Cooling Manifold]

Slide 30: Cray X1 Node Module

Slide 31: Node Module (Power Converter side)

Slide 32: Cray X1 Chassis

Slide 33: 64 Processor Cray X1 System
~820 Gflops

Slide 34: 256 Processor Cray X1 System
~3.3 Tflops
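The system peak figures on the last two slides follow from the per-processor rate of 12.8 Gflops given on the node slide. A minimal check, with a hypothetical helper name:

```python
GFLOPS_PER_PROCESSOR = 12.8  # per multistream processor, from the node slide


def system_peak_gflops(processors: int) -> float:
    """Peak rate of a system, assuming every processor runs at its peak."""
    return processors * GFLOPS_PER_PROCESSOR


print(f"64 processors:  {system_peak_gflops(64):.1f} Gflops")
print(f"256 processors: {system_peak_gflops(256) / 1000:.2f} Tflops")
```

These are peak numbers only; the slides do not state sustained performance for these configurations.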