Architecture of Parallel Computers
CSC / ECE 506, Summer 2006
Scalable Multiprocessors, Lecture 10
6/19/2006, Dr. Steve Hunter

What is a Multiprocessor?
• A collection of communicating processors
  – Goals: balance load, reduce inherent communication and extra work
• A multi-cache, multi-memory system
  – Role of these components essential regardless of programming model
  – Programming model and communication abstraction affect specific performance tradeoffs
(Figure: processors, each with a cache and node controller, connected by an interconnect)

Scalable Multiprocessors
• Study of machines which scale from 100's to 1000's of processors
• Scalability has implications at all levels of system design, and all aspects must scale
• Areas emphasized in text:
  – Memory bandwidth must scale with the number of processors
  – Communication network must provide scalable bandwidth at reasonable latency
  – Protocols used for transferring data and synchronization techniques must scale
• A scalable system attempts to avoid inherent design limits on the extent to which resources can be added to the system. For example:
  – How does the bandwidth/throughput of the system scale when adding processors?
  – How does the latency or time per operation increase?
  – How does the cost of the system increase?
  – How are the systems packaged?

Scalable Multiprocessors
• Basic metrics affecting the scalability of a computer system from an application perspective (Hwang 93):
  – Machine size: the number of processors
  – Clock rate: determines the basic machine cycle
  – Problem size: amount of computational workload or the number of data points
  – CPU time: the actual CPU time in seconds
  – I/O demand: the input/output demand in moving the program, data, and results
  – Memory capacity: the amount of main memory used in a program execution
  – Communication overhead: the amount of time spent for interprocessor communication, synchronization, remote access, etc.
  – Computer cost: the total cost of hardware and software resources required to execute a program
  – Programming overhead: the development overhead associated with an application program
• Power (watts) and cooling are also becoming inhibitors to scalability

Scalable Multiprocessors
• Some other recent trends:
  – Multi-core processors on a single socket
  – Reduced focus on increasing the processor clock rate
  – System-on-Chip (SoC) combining processor cores, integrated interconnect, cache, high-performance I/O, etc.
  – Geographically distributed applications utilizing Grid and HPC technologies
  – Standardization of high-performance interconnects (e.g., InfiniBand, Ethernet) and a focus by the Ethernet community on reducing latency
  – For example, Force10's recently announced 10 Gb Ethernet switch:
    » The S2410 data center switch has set industry benchmarks for 10 Gigabit price and latency
    » Designed for high-performance clusters, 10 Gigabit Ethernet connectivity to the server, and Ethernet-based storage solutions, the S2410 supports 24 line-rate 10 Gigabit Ethernet ports with ultra-low switching latency of 300 nanoseconds at an industry-leading price point
    » The S2410 eliminates the need to integrate InfiniBand or proprietary technologies into the data center and opens the high-performance storage market to 10 Gigabit Ethernet technology. Standardizing on 10 Gigabit Ethernet in the data center core, edge, and storage radically simplifies management and reduces total network cost

Bandwidth Scalability
• What fundamentally limits bandwidth?
  – Number of wires, clock rate
• Must have many independent wires or a high clock rate
• Connectivity through bus or switches
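A small illustration of why shared buses limit bandwidth scaling while switched fabrics do not (a sketch only; the 64-wire width, 400 MHz clock, and processor counts are hypothetical values, not figures from the lecture):

```c
#include <stdio.h>

/* Raw link bandwidth in bytes/s: (data wires / 8) * transfers per second. */
static double link_bw_bytes(int data_wires, double clock_hz) {
    return (data_wires / 8.0) * clock_hz;
}

int main(void) {
    const int    wires    = 64;      /* hypothetical 64-bit-wide data path   */
    const double clock_hz = 400e6;   /* hypothetical 400 MHz transfer clock  */
    double bus_bw = link_bw_bytes(wires, clock_hz);   /* one shared bus      */

    for (int p = 4; p <= 64; p *= 2) {
        double per_proc_bus    = bus_bw / p;   /* shared bus: divided among p */
        double per_proc_switch = bus_bw;       /* switch: one such link each  */
        printf("p=%2d  bus: %8.2f MB/s per proc   switch: %8.2f MB/s per proc\n",
               p, per_proc_bus / 1e6, per_proc_switch / 1e6);
    }
    return 0;
}
```

With one shared bus the per-processor share falls as 1/p, while a switched network adds link bandwidth with every node, which is the motivation for the bus-versus-switch bullet above.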

Some Memory Models
(Figures: three organizations. Shared cache: processors P1..Pn behind a switch sharing an interleaved first-level cache and main memory. Centralized-memory "dance hall" UMA: processors with private caches reaching interleaved main memory through an interconnection network. Distributed memory (NUMA): memory attached to each processor node, with nodes joined by an interconnection network.)

Generic Distributed Memory Organization
• Network bandwidth requirements?
  – independent processes?
  – communicating processes?
• Latency?

Some Examples

AMD Opteron Processor Technology

AMD Opteron Architecture
• AMD Opteron™ Processor Key Architectural Features:
  – Single-Core and Dual-Core AMD Opteron processors
  – Direct Connect Architecture
  – Integrated DDR DRAM Memory Controller
  – HyperTransport™ Technology
  – Low Power

AMD Opteron Architecture
• Direct Connect Architecture
  – Addresses and helps reduce the real challenges and bottlenecks of system architectures
  – Memory is directly connected to the CPU, optimizing memory performance
  – I/O is directly connected to the CPU for more balanced throughput and I/O
  – CPUs are connected directly to CPUs, allowing for more linear symmetric multiprocessing
• Integrated DDR DRAM Memory Controller
  – Changes the way the processor accesses main memory, resulting in increased bandwidth, reduced memory latencies, and increased processor performance
  – Available memory bandwidth scales with the number of processors
  – 128-bit wide integrated DDR DRAM memory controller capable of supporting up to eight (8) registered DDR DIMMs per processor
  – Available memory bandwidth up to 6.4 GB/s (with PC3200) per processor
• HyperTransport™ Technology
  – Provides a scalable bandwidth interconnect between processors, I/O subsystems, and other chipsets
  – Support for up to three (3) coherent HyperTransport links, providing up to 24.0 GB/s peak bandwidth per processor
  – Up to 8.0 GB/s bandwidth per link, providing sufficient bandwidth for supporting new interconnects including PCI-X, DDR, InfiniBand, and 10G Ethernet
  – Offers low power consumption (1.2 volts) to help reduce a system's thermal budget
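The memory and HyperTransport figures above follow from width-times-rate arithmetic (a sketch; it assumes PC3200 denotes DDR-400 memory at 400 MT/s, which is the standard meaning of that label):

```latex
% 128-bit-wide controller with PC3200 (DDR-400, 400 MT/s) memory:
\[ \frac{128\ \text{bits}}{8\ \text{bits/byte}} \times 400\times 10^{6}\ \tfrac{\text{transfers}}{\text{s}} = 6.4\ \text{GB/s per processor} \]
% Three coherent HyperTransport links at up to 8.0 GB/s each:
\[ 3 \times 8.0\ \text{GB/s} = 24.0\ \text{GB/s peak per processor} \]
```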

AMD Processor Architecture
• Low-Power Processors
  – The AMD Opteron processor HE offers industry-leading performance per watt, making it an ideal solution for rack-dense 1U servers or blades in datacenter environments as well as cooler, quieter workstation designs
  – The AMD Opteron processor EE provides the maximum I/O bandwidth currently available in a single-CPU controller, making it a good fit for embedded controllers in markets such as NAS and SAN
• Other features of the AMD Opteron processor include:
  – 64-bit wide key data and address paths that incorporate a 48-bit virtual address space and a 40-bit physical address space
  – ECC (Error Correcting Code) protection for L1 cache data, L2 cache data and tags, and DRAM, with hardware scrubbing of all ECC-protected arrays
  – 90 nm SOI (Silicon on Insulator) process technology for lower thermal output levels and improved frequency scaling
  – Support for all instructions necessary to be fully compatible with SSE2 technology
  – Two (2) additional pipeline stages (compared to AMD's seventh-generation architecture) for increased performance and frequency scalability
  – Higher IPC (Instructions Per Clock) achieved through additional key features, such as larger TLBs (Translation Look-aside Buffers), flush filters, and an enhanced branch prediction algorithm

AMD vs Intel
• Performance
  – SPECint® rate 2000: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8 GHz processor by 28 percent
  – SPECfp® rate 2000: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8 GHz processor by 76 percent
  – SPECjbb® 2005: the Dual-Core AMD Opteron processor Model 280 outperforms the dual-core Xeon 2.8 GHz by 13 percent
• Processor Power (Watts)
  – Dual-Core AMD Opteron™ processors, at 95 watts, consume far less than the competition's dual-core x86 server processors, which according to their published data have a thermal design power of 135 watts and a max power draw of 150 watts
  – Can result in 200 percent better performance-per-watt than the competition
  – Even greater performance-per-watt can be achieved with lower-power (55 watt) processors

IBM POWER Processor Technology

IBM POWER4+ Processor Architecture

IBM POWER4+ Processor Architecture
• Two processor cores on one chip, as shown
• Clock frequency of the POWER4+ is 1.5-1.9 GHz
• The L2 cache modules are connected to the processors by the Core Interface Unit (CIU) switch, a 2×3 crossbar with a bandwidth of 40 B/cycle per port
  – This enables it to ship 32 B to either the L1 instruction cache or the data cache of each of the processors and to store 8 B values at the same time
• For each processor there is also a Non-cacheable Unit that interfaces with the Fabric Controller and takes care of non-cacheable operations
• The Fabric Controller is responsible for communication with the three other chips embedded in the same Multi-Chip Module (MCM), with the L3 cache, and with other MCMs
  – The bandwidths at 1.7 GHz are 13.6, 9.0, and 6.8 GB/s, respectively
• The chip also contains a variety of other devices: the L3 cache directory and the L3 and Memory Controller, which should bring down the off-chip latency considerably
• The GX Controller is responsible for traffic on the GX bus, which transports data to/from the system and in practice is used for I/O
• The maximum size of the L3 cache is 32 MB
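The 13.6 GB/s chip-to-chip figure is consistent with simple width-times-clock arithmetic (a sketch; the 8 B/cycle and 4 B/cycle path widths are inferred from the numbers, not stated on the slide):

```latex
% Assuming an 8 B/cycle path for the chip-to-chip (intra-MCM) links:
\[ 8\ \tfrac{\text{B}}{\text{cycle}} \times 1.7\times 10^{9}\ \tfrac{\text{cycles}}{\text{s}} = 13.6\ \text{GB/s} \]
% A 4 B/cycle path at the same clock gives the 6.8 GB/s figure.
```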

IBM POWER5 Processor Architecture

IBM POWER5 Processor Architecture
• Like the POWER4(+), the POWER5 has two processor cores on a chip
• Clock frequency of the POWER5 is 1.9 GHz
• Because of the higher density on the chip (the POWER5 is built in 130 nm technology instead of the 180 nm used for the POWER4+), more devices could be placed on the chip and they could also be enlarged
• The L2 caches of two neighboring chips are connected, and the L3 caches are directly connected to the L2 caches
  – Both are larger than their respective counterparts on the POWER4: 1.875 MB against 1.5 MB for the L2 cache, and 36 MB against 32 MB for the L3 cache
  – In addition, the L3 cache latency has improved from about 120 cycles to 80 cycles
• The associativity of the caches has also improved: from 2-way to 4-way for the L1 cache, from 8-way to 10-way for the L2 cache, and from 8-way to 12-way for the L3 cache
• A big difference is also the improved bandwidth from memory to the chip: it has increased from 4 GB/s for the POWER4+ to approximately 16 GB/s for the POWER5

Intel (Future) Processor Technology

DP Server Architecture
Platform performance: it's all about bandwidth and latency
(Figure: Intel Bensley platform with the Blackford chipset; highlights noted on the diagram:)
• FSB scaling: 800 MHz, 1067 MHz, 1333 MHz
• Point-to-point interconnect
• Large shared caches
• Up to 64 GB memory capacity with easy expansion, using channels of AMB (Advanced Memory Buffer) devices
• 17 GB/s memory bandwidth
• Consistent local and remote memory latencies
• Central coherency resolution
• Sustained and balanced throughput, balancing energy and performance
"Constantly analyzing the requirements, the technologies, and the tradeoffs"

Energy Efficient Performance – High End
Datacenter "energy label" (computational efficiency):
• NASA Columbia: 2 MWatt, 60 TFlops goal, 10,240 CPUs (Itanium 2), $50M (Source: NASA)
  – 30,720 Flops/Watt, 1,288 Flops/Dollar
• ASC Purple: 6 MWatt, 100 TFlops goal, 12K+ CPUs (POWER5), $230M (Source: LLNL)
  – 17,066 Flops/Watt, 467 Flops/Dollar
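For reference, the efficiency ratios divide delivered Flops by facility power (a sketch; it uses the stated performance goals and assumes the printed figures are effectively in kFlops per Watt, since a raw Flops-per-Watt division gives values about a thousand times larger):

```latex
\[ \frac{60\times 10^{12}\ \text{Flops}}{2\times 10^{6}\ \text{W}} = 3.0\times 10^{7}\ \tfrac{\text{Flops}}{\text{W}} \approx 30{,}000\ \tfrac{\text{kFlops}}{\text{W}} \quad (\text{slide: } 30{,}720) \]
\[ \frac{100\times 10^{12}\ \text{Flops}}{6\times 10^{6}\ \text{W}} \approx 1.67\times 10^{7}\ \tfrac{\text{Flops}}{\text{W}} \approx 16{,}700\ \tfrac{\text{kFlops}}{\text{W}} \quad (\text{slide: } 17{,}066) \]
```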

Core™ Microarchitecture Advances With Quad Core – Energy Efficient Performance
(Chart: relative performance per watt, comparison with SPECint_rate at the platform level; Source: Intel®)
• Server parts shown: Irwindale (H1 '05), Paxville DP (H2 '05), Dempsey MV (H1 '06), Woodcrest (H2 '06), and quad-core Clovertown (H1 '07), spanning roughly 1x to 4x
• Desktop parts shown include the quad-core Kentsfield

Woodcrest for Servers
• Performance: 80% higher
• Power: 35% lower
• …relative to the Intel® Xeon® 2.8 GHz with 2x2 MB cache
• Source: Intel, based on estimated SPECint*_rate_base2000 and thermal design power

Multi-Core Energy-Efficient Performance
(Chart: performance and power relative to a single core at maximum frequency, as single-core frequency and Vcc are varied)
• Over-clocked (+20%): about 1.13x performance at about 1.73x power
• Max frequency (baseline): 1.00x performance at 1.00x power
• Dual-core (-20% frequency and Vcc): about 1.73x performance at about 1.02x power
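These bars follow the usual dynamic-power rule of thumb (a sketch, assuming dynamic power scales roughly as C V^2 f and that supply voltage is scaled in proportion to frequency):

```latex
\[ P \propto C\,V^{2}f, \qquad V \propto f \;\Rightarrow\; P \propto f^{3} \]
% One core over-clocked by 20%:
\[ P \approx 1.2^{3} \approx 1.73\times \quad\text{for about } 1.13\times \text{ performance} \]
% Two cores, each at 80% frequency and voltage:
\[ P \approx 2\times 0.8^{3} \approx 1.02\times \quad\text{for about } 1.73\times \text{ performance on parallel work} \]
```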

Intel Multi-Core Trajectory
(Figure: dual-core parts in 2006 moving to quad-core parts in 2007)

Blade Architectures – General
(Figure: multiple blade servers attached to a common interconnect)
• Blades interconnected by common fabrics
  – InfiniBand, Ethernet, and Fibre Channel are most common
  – Redundant interconnect available for failover
  – Links from the interconnect provide external connectivity
• Each blade contains multiple processors, memory, and network interfaces
  – Some options may exist, such as for memory, network connectivity, etc.
• Power, cooling, and management overhead are optimized within the chassis
  – Multiple chassis are connected together for a greater number of nodes

IBM BladeCenter H Architecture
(Figure: blades 1-14 connected to high-speed switches, I/O bridges / switch modules, and two management modules)
• High-speed switches
  – Ethernet or InfiniBand
  – 4x (16-wire) blade links
  – 4x (16-wire) bridge links
  – 1x (4-wire) management links
  – Uplinks: up to 12x links for IB and at least four 10 Gb links for Ethernet
• Switch modules (e.g., Ethernet, Fibre Channel, Passthru)
  – Dual 4x (16-wire) wiring internally to each high-speed switch module (HSSM)

IBM BladeCenter H Architecture
(Figure: multiple chassis of 14 blades each, joined by an external interconnect)
• External high-performance interconnect(s) for multiple chassis
• Independent scaling of blades and I/O
• Scales for large clusters
• Architecture used for the Barcelona Supercomputing Center (MareNostrum, #8)

Cray (OctigaBay) Blade Architecture
(Figure: two Opterons, each with DDR333 memory at 5.4 GB/s, a 6.4 GB/s HyperTransport link, and a dedicated Rapid Array communications processor (RAP) with 8 GB/s per link; an accelerator FPGA is also attached)
• The Rapid Array Communications Processor (RAP) includes MPI hardware
  – MPI is offloaded in hardware: throughput 2900 MB/s and latency 1.6 us
• The processor and communication interface is HyperTransport
• Dedicated link and communication chip per processor
• FPGA accelerator available for additional application offload
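Throughput and latency numbers like the 2900 MB/s and 1.6 us above are typically obtained with an MPI ping-pong test; the sketch below is a generic version of such a benchmark (not OctigaBay's or Cray's tool) that times round trips between two ranks to estimate one-way latency and bandwidth.

```c
/* Minimal MPI ping-pong sketch: estimates one-way latency (small messages)
 * and bandwidth (large messages) between rank 0 and rank 1.
 * Typical build/run: mpicc pingpong.c -o pingpong && mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    for (long bytes = 8; bytes <= (1L << 22); bytes *= 8) {
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = MPI_Wtime() - t0;
        if (rank == 0) {
            double one_way_us = t / (2.0 * iters) * 1e6;         /* half a round trip    */
            double bw_mbs     = (2.0 * iters * bytes) / t / 1e6; /* bytes moved / time   */
            printf("%8ld B  latency %9.2f us  bandwidth %9.1f MB/s\n",
                   bytes, one_way_us, bw_mbs);
        }
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```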

Cray Blade Architecture
• Blade Characteristics
  – Two 2.2 GHz Opteron processors, with dedicated memory per processor
  – Two Rapid Array Communication Processors, with one dedicated link and one redundant link each
  – Application Accelerator FPGA
  – Local hard drive
• Shelf Characteristics
  – One or two IB 4x switches
  – Twelve or twenty-four external links
  – Additional I/O: three high-speed I/O links, four PCI-X bus slots, 100 Mb Ethernet for management
  – Active management system

Cray Blade Architecture
(Figure: the blade from the previous slide, with 100 Mb Ethernet, high-speed I/O, PCI-X, and an active management system, attached to the RapidArray interconnect, a 24x24 IB 4x switch; the RAPs include MPI offload capabilities)
• Six blades per 3U shelf
• Twelve 4x IB external links for the primary switch
• An additional twelve links are available with the optional redundant switch

Cray Blade Architecture
(Figure: multiple shelves joined by an interconnect)
• With up to 24 external links per OctigaBay 12K shelf, a variety of configurations can be achieved depending on the applications
• OctigaBay suggests interconnecting shelves by meshes, tori, fat trees, and fully connected shelves for systems that fit in one rack
  – Fat-tree configurations require extra switches, which OctigaBay terms "spine switches"
• Mellanox InfiniBand technology is used for the interconnect
• Up to 25 shelves can be directly connected, yielding a 300-Opteron system

IBM BlueGene/L Architecture – Compute Card
• The BlueGene/L is the first in a new generation of systems made by IBM for very massively parallel computing
• The individual speed of the processor has been traded in favor of very dense packaging and a low power consumption per processor
• The basic processor in the system is a modified PowerPC 440 at 700 MHz
  – Two of these processors reside on a chip, together with 4 MB of shared L3 cache and a 2 KB L2 cache for each of the processors
  – The processors have two load ports and one store port from/to the L2 caches at 8 bytes/cycle; this is half of the bandwidth required by the two floating-point units (FPUs) and as such quite high
  – The CPUs have 32 KB of instruction cache and 32 KB of data cache on board
• In favorable circumstances a CPU can deliver a peak speed of 2.8 Gflop/s, because the two FPUs can perform fused multiply-add operations
• Note that the L2 cache is smaller than the L1 cache, which is quite unusual but allows it to be fast
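The 2.8 Gflop/s peak follows from counting a fused multiply-add as two floating-point operations (a sketch, assuming each FPU can issue one FMA per cycle):

```latex
\[ 2\ \text{FPUs} \times 2\ \tfrac{\text{flops}}{\text{FMA}} \times 1\ \tfrac{\text{FMA}}{\text{FPU-cycle}} \times 700\times 10^{6}\ \tfrac{\text{cycles}}{\text{s}} = 2.8\ \text{Gflop/s per CPU} \]
```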

IBM BlueGene/L Architecture

IBM BlueGene/L Overview
• BlueGene/L boasts a peak speed of over 360 teraOPS, a total memory of 32 tebibytes, total power of 1.5 megawatts, and machine floor space of 2,500 square feet
• The full system has 65,536 dual-processor compute nodes
• Multiple communications networks enable extreme application scaling:
  – Nodes are configured as a 32 x 32 x 64 3D torus; each node is connected in six different directions for nearest-neighbor communications
  – A global reduction tree supports fast global operations such as global max/sum in a few microseconds over 65,536 nodes
  – Multiple global barrier and interrupt networks allow fast synchronization of tasks across the entire machine within a few microseconds
  – 1,024 gigabit-per-second links to a global parallel file system support fast input/output to disk
• BlueGene/L possesses no less than 5 networks, 2 of which are of interest for inter-processor communication: a 3-D torus network and a tree network
  – The torus network is used for most general communication patterns
  – The tree network is used for frequently occurring collective communication patterns like broadcasts, reduction operations, etc.
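The headline figures are consistent with the per-chip numbers from the compute-card slide (a sketch; the 32 x 32 x 64 torus dimensions are taken from the bullet above):

```latex
\[ 32 \times 32 \times 64 = 65{,}536\ \text{nodes} \]
\[ 65{,}536\ \text{nodes} \times 2\ \tfrac{\text{CPUs}}{\text{node}} \times 2.8\ \tfrac{\text{Gflop/s}}{\text{CPU}} \approx 367\ \text{Tflop/s} \;\;(\text{"over 360 teraOPS"}) \]
```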

IBM’s X 3 Architecture Arch of Parallel Computers CSC / ECE 506 37

IBM’s X 3 Architecture Arch of Parallel Computers CSC / ECE 506 37

IBM System x X3 Chipset - Scalable Intel MP Server
(Figure: EM64T Xeon MP processors on front-side buses to the memory controller, which connects to memory interfaces with RAM DIMMs, an I/O bridge with PCI-X 2.0 266 MHz, and a scalability controller with scalability ports to other nodes)

IBM System x X3 Chipset – Low Latency
(Figure: the same block diagram, annotated with a 108 ns memory access latency)

IBM System x X3 Chipset – Low Latency
(Figure: the same block diagram, annotated with a 222 ns memory access latency)

IBM System x X3 Chipset – High Bandwidth
(Figure: block diagram annotated with link bandwidths: 15 GB/s at the scalability controller and scalability ports, 6.4 GB/s and 10.6 GB/s on the processor and I/O bridge paths, and 21.3 GB/s on the memory interfaces to the RAM DIMMs)

IBM System x X3 Chipset – Snoop Filter
(Figure: two-panel comparison, "Others" versus X3, of how an internal cache miss is handled; with X3 there is no snoop traffic on the FSB)
• X3:
  – Cache state from EACH processor is mirrored on Hurricane
  – Relieves traffic on the FSB
  – Faster access to main memory
• Others:
  – Cache from EACH processor must be snooped
  – Creates traffic along the FSB

IBM System x X3 Chipset – Snoop Filter
(Figure: continuation of the comparison, showing the miss completing; the same contrast applies)
• X3: cache state from each processor is mirrored on Hurricane, relieving FSB traffic and giving faster access to main memory
• Others: cache from each processor must be snooped, creating traffic along the FSB

IBM System x Multi-node Scalability – Putting It Together
(Figure: a requesting node's Hurricane broadcasts the miss; the node that owns the requested cache line returns the data, the node whose main memory maps the cached address is identified, and the other nodes return null)
• The snoop filter and remote directory work together in multi-node configurations
• A local processor cache miss is broadcast to all memory controllers
• Only the node owning the latest copy of the data responds
• Maximizes system bus bandwidth
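To make the snoop-filter idea concrete, here is a conceptual sketch (the structure, sizes, and names are invented for illustration and do not describe IBM's Hurricane implementation) of a node controller that tracks which remote nodes might hold a cache line, so a miss is forwarded only to those nodes and the other front-side buses stay quiet:

```c
#include <stdint.h>

#define NODES       4
#define FILTER_SETS 4096                 /* hypothetical filter size          */
#define LINE_SHIFT  6                    /* 64-byte cache lines               */

enum state { EMPTY, EXACT, UNKNOWN };    /* EMPTY: node caches no line mapping
                                            to this set; EXACT: it may cache
                                            exactly the recorded line;
                                            UNKNOWN: must snoop to be safe    */
struct entry { enum state st; uint64_t line; };

static struct entry filter[NODES][FILTER_SETS];

static unsigned set_of(uint64_t addr) { return (addr >> LINE_SHIFT) % FILTER_SETS; }

/* Record that 'node' pulled the line containing 'addr' into its caches. */
void filter_insert(int node, uint64_t addr) {
    struct entry *e = &filter[node][set_of(addr)];
    uint64_t line = addr >> LINE_SHIFT;
    if (e->st == EMPTY)        { e->st = EXACT; e->line = line; }
    else if (e->line != line)  { e->st = UNKNOWN; }  /* conflict: stay conservative */
}

/* On a miss from 'requester': bitmask of nodes whose buses must be snooped.
 * Nodes the filter proves cannot hold the line are skipped, so their
 * front-side buses see no snoop traffic for this miss.                     */
unsigned nodes_to_snoop(int requester, uint64_t addr) {
    unsigned mask = 0;
    uint64_t line = addr >> LINE_SHIFT;
    for (int n = 0; n < NODES; n++) {
        if (n == requester) continue;
        struct entry *e = &filter[n][set_of(addr)];
        if (e->st == UNKNOWN || (e->st == EXACT && e->line == line))
            mask |= 1u << n;
    }
    return mask;     /* empty mask: the request can go straight to memory */
}
```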

IBM System x X3 Chipset – Scalability Ports
(Figure: four 4-way nodes joined by cabled scalability ports into a 16-way system with a single OS image)
• MP X3 scales to 32-way; dual-core capable, for up to 64 cores

The End