Tera MTA (Multi-Threaded Architecture)
Thriveni Movva (CMPS 5433)


Presentation Contains
• Evolution of Tera MTA
• Design goals of Tera MTA
• Tera MTA Architecture
• Interconnection Network
• Applications
• Advantages & Drawbacks
• Current MTA Status

Evolution of Tera MTA
• 1987: Tera Computer Company was established by Burton Smith in Washington, USA
• 1988: Software development starts
• 1991: Hardware development starts
• 1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)

Tera MTA: Design Goals
• To solve the two major problems then faced by high-performance parallel computers:
  - Scalability
  - Programmability
• To be suitable for very high-speed implementations
• To be applicable to a wide spectrum of problems
• To ease compiler implementation
• To overcome the von Neumann bottleneck (the limited rate at which data can move between processor and memory)

About Tera MTA
• The Tera MTA is a high-performance system with:
  - scalar multithreaded processors with synchronization among threads
  - uniform-access shared memory, i.e. all data accessible with equal ease (no locality, no cache, no mapping)
  - simple programming
  - zero-cost context switching

About Multi-Threading Architecture (MTA)
• Uses a new technique called multithreading that lets multiprocessors share memory without using caches
• Multithreaded computers can scale to thousands of processors that stay almost constantly busy instead of waiting on slow memory accesses
• Multithreading allows each processor to switch thread contexts between execution cycles, so the processor stays busy
• Whenever a processor starts a slow memory or I/O instruction, rather than waiting tens of cycles for the stalled instruction to complete, it executes its next instruction from a different thread using different registers
• Each processor has many copies of the programming and pipeline control registers, one copy for each execution thread that it can support
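The switching behaviour described above can be illustrated with a toy simulation. This is a minimal sketch, not Tera's hardware logic: the function name `utilization`, the round-robin selection, and the assumption that every instruction is a memory reference with one fixed latency are all illustrative choices.

```python
def utilization(num_threads: int, memory_latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which the processor issues an instruction.

    Toy model (an assumption, not the real MTA pipeline): every instruction
    is a memory reference, and a thread cannot issue again until
    `memory_latency` cycles after its previous issue.
    """
    ready_at = [0] * num_threads  # cycle at which each thread may issue again
    issued = 0
    next_thread = 0               # where the round-robin search starts
    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + memory_latency  # thread now waits on memory
                issued += 1
                next_thread = (t + 1) % num_threads
                break             # at most one instruction issues per cycle
    return issued / cycles


for n in (1, 25, 50, 100):
    print(n, "threads:", utilization(n, memory_latency=50))
```

In this model a single thread keeps the processor busy on only 2% of cycles, while 50 or more threads let it issue on every cycle, because another thread is always ready when one stalls.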

Tera MTA Overview
• Up to 256 processors, each running at 260 MHz
• Up to 128 active threads per processor
• Up to 256 I/O processors
• Peak performance of 256 GFlop/sec
• Processors and memory modules populate a sparse 3-D torus interconnection network
• 4096 interconnection network nodes
• Flat, shared main memory ranging from 16 to 512 GB
• Cost: $5 million to $40 million

A View of the Tera Multiprocessor

Key Architecture Details
• Each MTA processor has 128 "streams", each of which is hardware (including 32 registers and a program counter) devoted to running a single thread of control
• The processor executes instructions from streams that are not blocked, in a fair round-robin fashion
• A stream can issue an instruction every 21 cycles (the length of the instruction pipeline), so at least 21 ready threads are required to keep a processor fully busy
• The processor makes a context switch on each cycle, choosing the next instruction from one of the streams that is ready to execute
• Using a 'rich' interconnection network helps ensure that potential delays caused by references to data in memory are hidden
• Randomized memory mapping and a highly connected network provide near-uniform access time from any processor to any memory location
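The 21-cycle issue rule above implies a simple upper bound on how busy a processor can be for a given number of ready streams. The sketch below only restates that arithmetic; the function name `peak_issue_fraction` is hypothetical, and the model ignores memory stalls entirely.

```python
PIPELINE_DEPTH = 21          # cycles between two issues from the same stream (from the slide)
STREAMS_PER_PROCESSOR = 128  # hardware streams per processor (from the slide)


def peak_issue_fraction(ready_streams: int, pipeline_depth: int = PIPELINE_DEPTH) -> float:
    """Upper bound on the fraction of cycles in which an instruction issues.

    Assumes each ready stream issues at most once every `pipeline_depth`
    cycles and that the processor issues at most one instruction per cycle.
    """
    if not 0 <= ready_streams <= STREAMS_PER_PROCESSOR:
        raise ValueError("ready_streams must be between 0 and 128")
    return min(1.0, ready_streams / pipeline_depth)


for streams in (1, 7, 21, 128):
    print(streams, "ready streams:", round(peak_issue_fraction(streams), 3))
```

With one ready stream the bound is 1/21, about 4.8% of cycles, which is why single-thread performance on this design is low; from 21 ready streams upward the bound reaches 100%.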

Key Architecture Details
• Hardware multithreading is used to tolerate high latencies to memory. This latency is typically on the order of 150 clock cycles
• Expected benefits of the MTA include high processor utilization, near-linear scalability, and reduced programming effort, especially compared to distributed-memory machines using explicit message passing
• The current MTA interconnection network is a 3-D toroidal mesh

Tera MTA's Interconnection Network
• The interconnection network is a three-dimensional, sparsely populated torus of pipelined packet-switching nodes, each of which is linked to some of its neighbors
• Each link can transport a packet containing source and destination addresses, an operation, and 64 data bits in both directions simultaneously on every clock tick
• Some of the nodes are also linked to resources, i.e., processors, data memory units, I/O processors, and I/O cache units
• Instead of locating the processors on one side of the network and the memories on the other, the resources are distributed more or less uniformly throughout the network
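The packet contents and the torus wraparound described above can be sketched as a small data structure. Everything here is an assumption made for illustration: the class `Packet`, its field names, the operation strings, and the helper `torus_neighbors` are hypothetical, and the helper models a fully populated torus, whereas the slide says each node is linked to only some of its neighbors.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[int, int, int]  # (x, y, z) position of a network node


@dataclass(frozen=True)
class Packet:
    source: Coord       # address of the sending node
    destination: Coord  # address of the receiving node
    operation: str      # e.g. "load" or "store" (hypothetical names)
    data: int           # 64-bit payload

    def __post_init__(self) -> None:
        if not 0 <= self.data < 2 ** 64:
            raise ValueError("data must fit in 64 bits")


def torus_neighbors(node: Coord, side: int) -> List[Coord]:
    """Six neighbors of `node` in a fully populated side x side x side torus.

    The modulo gives the wraparound links that make the mesh a torus.
    """
    x, y, z = node
    return [
        ((x + 1) % side, y, z), ((x - 1) % side, y, z),
        (x, (y + 1) % side, z), (x, (y - 1) % side, z),
        (x, y, (z + 1) % side), (x, y, (z - 1) % side),
    ]


packet = Packet(source=(0, 0, 0), destination=(3, 5, 7), operation="load", data=0)
print(torus_neighbors(packet.source, side=16))
```

A corner node such as (0, 0, 0) has (15, 0, 0) among its neighbors, which is the wraparound link a plain mesh would lack.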

Tera MTA's Interconnection Network
• The interconnection network of one 256-processor Tera system contains 4096 nodes arranged in a 16 × 16 × 16 toroidal mesh
• As the Tera architecture scales to larger numbers of processors p, the number of network nodes grows as p^(3/2) rather than as the p log p associated with the more commonly used multistage networks. For example, a 1024-processor system would have 32,768 nodes
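The node counts quoted above follow directly from the p^(3/2) rule, as this short check shows. The function names are hypothetical, and `multistage_scale` gives only the p log p growth term (base-2 logarithm chosen for illustration), not an actual switch count for any real multistage network.

```python
import math


def torus_nodes(processors: int) -> int:
    """Network nodes for a p-processor system under the slide's p^(3/2) rule."""
    side = math.isqrt(processors)  # torus side length, sqrt(p)
    if side * side != processors:
        raise ValueError("this sketch assumes p is a perfect square")
    return side ** 3


def multistage_scale(processors: int) -> float:
    """The p log p growth term, for comparison only."""
    return processors * math.log2(processors)


for p in (256, 1024):
    print(p, "processors:", torus_nodes(p), "torus nodes vs p log p =", multistage_scale(p))
```

For 256 processors this gives a side length of 16 and 4096 nodes, and for 1024 processors a side length of 32 and 32,768 nodes, matching the figures on the slide. The torus grows faster than p log p, which is the price of the richer network.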

Multithreading on one processor
[Figure; label: Unused streams]

Multithreading on multiple processors

Latency Tolerance in Tera MTA
• The latency incurred in memory references is hidden by multithreading
• As there may be up to 128 instruction streams (threads), and each can issue 8 memory references without waiting for the preceding ones, a latency of 128 × 8 = 1024 cycles can be tolerated
• The lookahead allows threads to achieve peak performance
• Three operations (M, A, C: memory, arithmetic, control) can be executed simultaneously per instruction per processor
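The 1024-cycle figure above is the product of the stream count and the lookahead depth. The sketch below restates that arithmetic and inverts it; it assumes a simplified model in which the processor issues one instruction per cycle in round-robin order, so each stream issues once every `streams` cycles. The function names are hypothetical.

```python
import math


def tolerated_latency(streams: int, lookahead: int) -> int:
    """Memory latency (in cycles) that can be hidden in the simplified model."""
    return streams * lookahead


def streams_needed(latency_cycles: int, lookahead: int) -> int:
    """Smallest number of streams that hides `latency_cycles` in the same model."""
    if lookahead < 1:
        raise ValueError("lookahead must be at least 1")
    return math.ceil(latency_cycles / lookahead)


print(tolerated_latency(streams=128, lookahead=8))
print(streams_needed(latency_cycles=150, lookahead=8))
```

The first call reproduces the slide's 1024 cycles. The second asks how many streams this model needs for the roughly 150-cycle memory latency quoted earlier in the deck; the model ignores the separate 21-cycle issue spacing, so it should be read as a rough estimate only.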

The Tera Idea: Higher investment in hardware yields improved utilization and reduces software overhead

Tera MTA Applications
• PULSE3D, used for simulating real-time heartbeats to better treat heart diseases
• MSC Software's NASTRAN, a structural analysis code used extensively by the automobile and aerospace industries
• Livermore Software's LS-DYNA, which can simulate physical occurrences such as car crashes and metal stamping
• GAUSSIAN 98, a computational chemistry application used in molecular modeling
• MPIRE (Massively Parallel Interactive Rendering Environment), a powerful graphics and animation application that visualizes complex phenomena
• Also used in seismic analysis, national security, and weather forecasting

Advantages of Tera MTA
• Tera MTA uses multiple contexts to hide latency
• Tera machines perform a context switch every clock cycle
• Both pipeline latency and memory latency are hidden in the Tera approach
• Thread creation is very cheap
• With 128 contexts per processor, a large number (2K) of registers must be shared finely between threads
• As long as there is plenty of parallelism in user programs to hide latency, and plenty of compiler support, the performance is potentially very high
• The advantages of Tera's architecture are available to users via minimal changes to their application code

Drawbacks of Tera MTA
• Performance is poor when parallelism is limited; in particular, single-context performance is guaranteed to be low
• A large number of contexts demands many registers and other hardware resources, which in turn implies higher cost and complexity
• The limited focus on latency reduction and caching means the machine needs ample slack parallelism to hide latency, as well as high memory bandwidth; both raise the cost of building the machine
• Bandwidth (not latency) limits practical MTA system size, and large MTA systems will have expensive memory networks

Tera MTA: Tools
Tera provides two powerful tools, Traceview and Canal, that allow the programmer to understand:
• how the compiler has multithreaded a program
• how effectively the program actually utilizes the hardware

Customers
• San Diego Supercomputer Center (SDSC)
• Logicon, under a Naval Research Lab
• Tera Computer Company

Tera MTA Macro Architecture

Problems Solved Using Tera MTA
• Irregular memory access patterns
• Synchronization among threads
• Load balancing

Current Industry Status: Cray Inc. (ex-Tera)

Cray Inc. (Nasdaq NM: CRAY)
• Est.: April 1, 2000 (Tera Computer + Cray Research)
• HQ: Seattle, WA, USA
• Products: Supercomputers (Vector, Microprocessor, Multithread)
• Market: Government, Industry, Academic Research

Cray Research
• 1972: Est. by Seymour Cray in Minnesota, USA
• 1976: First Cray-1 shipment to Los Alamos
• 1980s: Ship follow-on products Cray XMP, Cray YMP, Cray-2
• 1990s: More follow-on products Cray C90, Cray J90, Cray T3D, Cray T90, Cray T3E, Cray SV1
• 1996: Merged with Silicon Graphics (SGI)

Tera Computer
• 1987: Est. by Burton Smith in Washington, USA
• 1988: Software development starts
• 1991: Hardware development starts
• 1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)
• 2000: Purchased Cray business unit from SGI

Cray Inc. (2000–present; result of merger between Tera Computer and Cray Research)
• Cray SX-6
• Cray MTA-2
• Cray SV1
• Cray Red Storm
• Cray X1
• Cray XD1

Cray MTA-2, Multi-threaded Architecture
• 128 virtual processors in a CPU module
• Up to 1 TB scalable shared memory
• Zero-overhead thread switching

Cray MTA-2 Overview
[Figure: Multithread system Cray MTA-2]

Unique Capability of Cray MTA
[Figure: Visualization of a nebula using the MPIRE application on a Cray MTA system]
