Concurrent Computers
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu

Acknowledgements
The content of many of the slides in these lecture notes has been adapted from online resources prepared by the people listed below. Many thanks!
- Henri Casanova, Principles of High Performance Computing, http://navet.ics.hawaii.edu/~casanova, henric@hawaii.edu
- Kai Wang, Department of Computer Science, University of South Dakota, http://www.usd.edu/~Kai.Wang
- Andrew Tanenbaum

Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems:
  - Within a CPU (for example, on-chip parallelism)
  - Within a "box" (for example, coprocessors and multiprocessors)
  - Across boxes (for example, multicomputers, clusters, and grids)

Parallel Computer Architectures
(Figure: (a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor. (d) A multicomputer. (e) A grid.)

Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems:
  - Within a CPU
  - Within a "box"
  - Across boxes

Concurrency within a CPU
(Diagram: a CPU with registers, ALUs, and instruction-decode hardware; caches; RAM; busses and adapters; controllers for I/O devices, displays, keyboards, and networks.)

Concurrency within a CPU
- Several techniques allow concurrency within a single CPU:
  - Pipelining (RISC architectures, pipelined functional units)
  - ILP
  - Vector units
  - On-chip multithreading
- Let's look at them briefly

Concurrency within a CPU
- Several techniques allow concurrency within a single CPU:
  - Pipelining (RISC architectures, pipelined functional units)
  - ILP
  - Vector units
  - On-chip multithreading
- Let's look at them briefly

Pipelining
- If one has a sequence of tasks to do,
- if each task consists of the same n steps or stages,
- and if different steps can be done simultaneously,
- then one can have a pipelined execution of the tasks (e.g., an assembly line)
- Goal: higher throughput (i.e., number of tasks per time unit)
- Example, with the first task taking 9 time units and each later task adding 4 (the duration of the slowest stage):
  - Time to do 1 task = 9
  - Time to do 2 tasks = 13
  - Time to do 3 tasks = 17
  - Time to do 4 tasks = 21
  - Time to do 10 tasks = 45
  - Time to do 100 tasks = 405
- Pays off if there are many tasks
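
The timing pattern above follows a simple rule: the first task needs the sum of all stage durations, and each later task adds only the duration of the slowest stage. Below is a minimal C sketch of that calculation; the stage durations are made-up values chosen to reproduce the 9/13/17/21/45/405 sequence above.

    #include <stdio.h>

    /* Time to push n tasks through a pipeline whose stages take d[0..k-1]
     * time units: the first task needs the sum of all stages, every later
     * task adds only the slowest stage. */
    static long pipelined_time(int n, const int *d, int k) {
        long fill = 0, slowest = 0;
        for (int i = 0; i < k; i++) {
            fill += d[i];
            if (d[i] > slowest) slowest = d[i];
        }
        return fill + (long)(n - 1) * slowest;
    }

    int main(void) {
        /* Hypothetical stage durations summing to 9, slowest stage = 4. */
        const int stages[] = {1, 2, 4, 2};
        const int k = sizeof stages / sizeof stages[0];
        const int tasks[] = {1, 2, 3, 4, 10, 100};
        for (int i = 0; i < 6; i++)
            printf("time for %3d tasks = %ld\n", tasks[i],
                   pipelined_time(tasks[i], stages, k));
        return 0;
    }

For large numbers of tasks the cost per task approaches the slowest-stage duration, which is exactly the asymptotic-throughput argument on the next slide.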

Pipelining
- The pipeline advances only as fast as its slowest stage
- Therefore, the asymptotic throughput (i.e., the throughput when the number of tasks tends to infinity) equals 1 / (duration of the slowest stage)
- Therefore, in an ideal pipeline, all stages would be identical (a balanced pipeline)
- Question: can we make all computer instructions consist of the same number of stages, where all stages take the same number of clock cycles?

RISC
- Having all instructions executable in the same number of stages of the same duration is the RISC idea
- Example: the MIPS architecture (see THE architecture book by Patterson and Hennessy)
- 5 stages, each taking one clock cycle:
  - Instruction Fetch (IF)
  - Instruction Decode (ID)
  - Instruction Execute (EX)
  - Memory access (MEM)
  - Register Write Back (WB)
- Concurrent execution of two instructions, e.g., LD R2, 12(R3) and DADD R3, R5, R6: the second instruction flows through IF, ID, EX, MEM, WB one stage behind the first

Pipelined Functional Units
- Although the RISC idea is attractive, some operations are just too expensive to be done in one clock cycle (during the EX stage)
- Common example: floating-point operations
- Solution: implement them as a sequence of stages, so that they too can be pipelined
- (Diagram: IF and ID feed an integer EX unit, a 7-stage FP/integer multiply unit (M1-M7), and a 4-stage FP/integer add unit (A1-A4), all converging on MEM and WB.)

Pipelining Today
- Pipelined functional units are common
- Fallacy: all computers today are RISC
  - RISC was of course one of the most fundamental "new" ideas in computer architecture
  - x86 is the most commonly used instruction set architecture today
  - It is kept around for backward-compatibility reasons, because it's easy to implement (not to program for)
  - BUT: modern x86 processors decode instructions into "micro-ops", which are then executed in a RISC manner
- Bottom line: pipelining is a pervasive (and conveniently hidden) form of concurrency in computers today
  - Take a computer architecture course to learn all about it

Concurrency within a CPU
- Several techniques allow concurrency within a single CPU:
  - Pipelining
  - ILP
  - Vector units
  - On-chip multithreading

Instruction Level Parallelism
- Instruction Level Parallelism (ILP) is the set of techniques by which the performance of a pipelined processor can be pushed even further
- ILP can be done by the hardware:
  - Dynamic instruction scheduling
  - Dynamic branch prediction
  - Multi-issue superscalar processors
- ILP can be done by the compiler:
  - Static instruction scheduling
  - Multi-issue VLIW (Very Long Instruction Word) processors with multiple functional units
- Broad concept: more than one instruction is issued per clock cycle (e.g., an 8-way multi-issue processor)

Concurrency within a CPU
- Several techniques allow concurrency within a single CPU:
  - Pipelining
  - ILP
  - Vector units
  - On-chip multithreading

Vector Units
- A vector unit is a functional unit that can do element-wise operations on entire vectors with a single instruction, called a vector instruction
- These are specified as operations on vector registers
- A "vector processor" comes with some number of such registers (e.g., the MMX extension on x86 architectures)
- (Diagram: two vector registers of #elts elements each are added with #elts adds in parallel.)

Vector Units
- Typically, a vector register holds ~32-64 elements
- But the number of elements is always larger than the amount of parallel hardware, called vector pipes or lanes (say 2-4)
- (Diagram: the #elts elements are fed through the pipes, performing #elts / #pipes adds in parallel per step.)

MMX Extension
- Many techniques that are initially implemented in the "supercomputer" market find their way into the mainstream
- Vector units were pioneered in supercomputers:
  - Supercomputers are mostly used for scientific computing
  - Scientific computing uses tons of arrays (to represent mathematical vectors) and often does regular computation with these arrays
  - Therefore, scientific code is easy to "vectorize", i.e., to generate assembly that uses the vector registers and the vector instructions
- Examples: Intel's MMX or PowerPC's AltiVec
  - MMX vector registers hold eight 8-bit elements, four 16-bit elements, or two 32-bit elements
  - AltiVec: twice those lengths
- Used for "multimedia" applications: image processing, rendering, ...
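
To make the lane idea concrete, here is a small sketch of SIMD integer addition using x86 SSE2 intrinsics (the successor to MMX: the 128-bit registers hold eight 16-bit lanes, so one instruction performs eight adds). This is an illustrative example rather than code from the original slides; it assumes an x86 compiler with SSE2 enabled (e.g., gcc -msse2).

    #include <emmintrin.h>  /* SSE2 integer intrinsics */
    #include <stdint.h>
    #include <stdio.h>

    /* Add two arrays of 16-bit elements, eight lanes per instruction. */
    static void add_epi16(const int16_t *a, const int16_t *b,
                          int16_t *out, int n) {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi16(va, vb));
        }
        for (; i < n; i++)               /* scalar tail for leftovers */
            out[i] = (int16_t)(a[i] + b[i]);
    }

    int main(void) {
        int16_t a[16], b[16], c[16];
        for (int i = 0; i < 16; i++) { a[i] = (int16_t)i; b[i] = (int16_t)(10 * i); }
        add_epi16(a, b, c, 16);
        printf("c[5] = %d\n", c[5]);     /* prints 55 */
        return 0;
    }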

Vectorization Example
- Conversion from RGB to YUV:
  Y = (9798*R + 19235*G + 3736*B) / 32768;
  U = (-4784*R - 9437*G + 4221*B) / 32768 + 128;
  V = (20218*R - 16941*G - 3277*B) / 32768 + 128;
- This kind of code is perfectly parallel, as all pixels can be computed independently
- It can be done easily with MMX vector capabilities (see the sketch below):
  - Load 8 R values into an MMX vector register
  - Load 8 G values into an MMX vector register
  - Load 8 B values into an MMX vector register
  - Do the *, +, and / in parallel
  - Repeat
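
Below is a minimal scalar sketch of the conversion loop the slide describes; the function and array names are illustrative. Because every pixel is independent, a vectorizing compiler, or hand-written MMX/SSE intrinsics, can process 8 pixels per iteration.

    #include <stdint.h>

    /* Per-pixel RGB -> YUV using the fixed-point formulas from the slide.
     * Each iteration is independent, so the loop vectorizes easily. */
    void rgb_to_yuv(const int16_t *R, const int16_t *G, const int16_t *B,
                    int16_t *Y, int16_t *U, int16_t *V, int n)
    {
        for (int i = 0; i < n; i++) {
            Y[i] = (int16_t)(( 9798 * R[i] + 19235 * G[i] +  3736 * B[i]) / 32768);
            U[i] = (int16_t)((-4784 * R[i] -  9437 * G[i] +  4221 * B[i]) / 32768 + 128);
            V[i] = (int16_t)((20218 * R[i] - 16941 * G[i] -  3277 * B[i]) / 32768 + 128);
        }
    }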

Concurrency within a CPU
- Several techniques allow concurrency within a single CPU:
  - Pipelining
  - ILP
  - Vector units
  - On-chip multithreading

Multi-threaded Architectures
- Computer architecture is a difficult field to make innovations in:
  - Who's going to spend money to manufacture your new idea?
  - Who's going to be convinced that a new compiler can/should be written?
  - Who's going to be convinced of a new approach to computing?
- One of the "cool" innovations of the last decade has been the concept of a "multithreaded architecture"

On-Chip Multithreading
- Multithreading has been around for years, so what's new about this?
- Here we're talking about hardware support for threads:
  - Simultaneous Multi-Threading (SMT)
  - Super-threading
  - Hyper-threading
- Let's try to understand what all of these mean before looking at multi-threaded supercomputers

Single-threaded Processor
- CPU:
  - Front-end: fetching/decoding/reordering
  - Execution core: actual execution
- Multiple programs in memory, but only one executes at a time (time-slicing via context switching)
- (Diagrams: a 4-issue CPU with bubbles; a 7-unit CPU with pipeline bubbles.)

Single-threaded SMP?
- Two threads execute at once, so threads spend less time waiting
- But the number of "bubbles" is also doubled
- => Twice as much speed and twice as much waste

Super-threading
- Principle: the processor can execute more than one thread at a time
  - Also called time-slice multithreading; the processor is then called a multithreaded processor
- Requires more hardware cleverness: logic switches threads at each cycle
- Leads to less waste:
  - A thread can run during a cycle while another thread is waiting for memory
  - Just a finer grain of interleaving
- But there is a restriction: each stage of the front end or the execution core only runs instructions from ONE thread!
  - Does not help with poor instruction parallelism within one thread
  - Does not reduce bubbles within a row

Hyper-threading
- Principle: the processor can execute more than one thread at a time, even within a single clock cycle!
  - Requires even more hardware cleverness: logic switches threads within each cycle
  - Finest level of interleaving
- Intel's hyper-threading only adds 5% to the die area
- On the diagram, only two threads execute simultaneously
  - Some people argue that "two" is not "hyper"
- From the OS perspective, there are two "logical" processors
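
Because the OS treats each hardware thread as a logical processor, a program can simply ask how many logical processors are online. A minimal sketch for Linux and most Unix systems (a hyper-threaded dual-core machine would typically report 4):

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Number of logical processors the OS currently exposes; hardware
         * threads from hyper-threading count just like full cores. */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("logical processors online: %ld\n", n);
        return 0;
    }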

Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems:
  - Within a CPU
  - Within a "box"
  - Across boxes

Concurrency within a "Box"
- Two main techniques:
  - SMP
  - Multi-core
- Let's look at both of them

SMPs
- Symmetric Multi-Processors (often mislabeled as "Shared-Memory Processors", which has now become tolerated)
- Processors are all connected to a single memory
- Symmetric: each memory cell is equally close to all processors
- Many dual-proc and quad-proc systems, e.g., for servers
- (Diagram: processors P1, P2, ..., Pn, each with a cache ($), connected by a network/bus to a shared memory.)
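
Because all processors of an SMP see the same memory, the natural programming model is shared-memory threads. A minimal OpenMP sketch (OpenMP comes up again later in these slides) that splits a loop across the processors; the array names are illustrative, and an OpenMP-enabled compiler is assumed (e.g., gcc -fopenmp).

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Each thread (typically one per processor/core of the SMP) gets a
         * chunk of the iteration space; all threads share the same arrays. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[10] = %f (up to %d threads available)\n",
               c[10], omp_get_max_threads());
        return 0;
    }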

Distributed Caches
- The problem with distributed caches is that of memory consistency
- Intuitive memory model: reading an address should return the last value written to that address
  - Easy to do in uniprocessors (although there may be some I/O issues)
  - But difficult in multi-processor / multi-core systems
- Sequential consistency: "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]
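
The classic illustration of sequential consistency is the two-flag litmus test below, written as a pthreads sketch (compile with -pthread). The shared variables are plain ints on purpose, which is a data race in standard C; that hazard is exactly what the example illustrates. On a sequentially consistent machine at least one of r1, r2 must end up 1; on real hardware with store buffers, both can occasionally be 0 unless atomics or locks are used.

    #include <pthread.h>
    #include <stdio.h>

    /* Litmus test for sequential consistency.  Under sequential consistency
     * at least one of r1/r2 is 1.  Real hardware may reorder the stores and
     * loads, which is why portable code uses atomics or locks instead of
     * plain shared variables. */
    int x = 0, y = 0;
    int r1, r2;

    void *thread0(void *arg) { x = 1; r1 = y; return arg; }
    void *thread1(void *arg) { y = 1; r2 = x; return arg; }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, thread0, NULL);
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("r1 = %d, r2 = %d\n", r1, r2);
        return 0;
    }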

Cache Coherency
- Memory consistency is jeopardized by having multiple caches:
  - P1 and P2 both have a cached copy of a data item
  - P1 writes to it, possibly write-through to memory
  - At this point P2 owns a stale copy
- When designing a multi-processor system, one must ensure that this cannot happen, by defining protocols for cache coherence

Snoopy Cache-Coherence
- (Diagram: processors P0 ... Pn, each with a cache holding state/address/data, snoop the shared memory bus; a memory operation from Pn is seen by all other caches.)
- The memory bus is a broadcast medium
- Caches contain information on which addresses they store
- The cache controller "snoops" all transactions on the bus:
  - A transaction is relevant if it involves a cache block currently contained in this cache
  - The controller takes action to ensure coherence: invalidate, update, or supply the value

Limits of Snoopy Coherence
- Assume a 4 GHz processor:
  - 16 GB/s instruction bandwidth per processor (32-bit instructions)
  - 9.6 GB/s data bandwidth at 30% load-store of 8-byte elements
  - i.e., 25.6 GB/s between each processor and its cache
- Suppose a 98% instruction hit rate and a 90% data hit rate:
  - 320 MB/s instruction bandwidth per processor goes to memory
  - 960 MB/s data bandwidth per processor goes to memory
  - 1.28 GB/s combined bandwidth per processor on the shared bus
- Assuming 10 GB/s bus bandwidth, 8 processors will saturate the bus
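
The slide's arithmetic can be reproduced directly; a small sketch using the same assumed rates (all constants are the slide's assumptions, not measurements):

    #include <stdio.h>

    int main(void) {
        /* Assumptions taken from the slide. */
        double clock_ghz   = 4.0;    /* instructions per ns           */
        double inst_bytes  = 4.0;    /* 32-bit instructions           */
        double ls_fraction = 0.30;   /* fraction of loads/stores      */
        double data_bytes  = 8.0;    /* 8-byte elements               */
        double inst_miss   = 0.02;   /* 98% instruction hit rate      */
        double data_miss   = 0.10;   /* 90% data hit rate             */
        double bus_gbs     = 10.0;   /* shared bus bandwidth in GB/s  */

        double inst_bw = clock_ghz * inst_bytes;               /* 16 GB/s   */
        double data_bw = clock_ghz * ls_fraction * data_bytes; /*  9.6 GB/s */
        double to_bus  = inst_bw * inst_miss + data_bw * data_miss; /* 1.28 */

        printf("per-processor traffic on the bus: %.2f GB/s\n", to_bus);
        printf("processors to saturate a %.0f GB/s bus: %.1f\n",
               bus_gbs, bus_gbs / to_bus);
        return 0;
    }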

Sample Machines
- Intel Pentium Pro Quad: coherent, 4 processors
- Sun Enterprise server: coherent, up to 16 processor and/or memory-I/O cards

Concurrency within a "Box"
- Two main techniques:
  - SMP
  - Multi-core

Moore's Law
- Moore's Law describes an important trend in the history of computer hardware:
  - The number of transistors that can be inexpensively placed on an integrated circuit increases exponentially, doubling approximately every two years
  - The observation was first made by Intel co-founder Gordon E. Moore in a 1965 paper
  - The trend has continued for more than half a century and is not expected to stop for another decade at least

Moore's Law!
- Many people interpret Moore's Law as "computers get twice as fast every 18 months"
  - which is not technically true: it's all about transistor density
- But even the "twice as fast" reading is no longer true
  - We should have 10 GHz processors right now, and we don't!

No More Moore?
- We are used to getting faster CPUs all the time
- We are used to them keeping up with more demanding software
  - Known as "Andy giveth, and Bill taketh away" (Andy Grove, Bill Gates)
  - It's a nice way to force people to buy computers often
- But basically, our computers get better, do more things, and it just happens automatically
  - Some people call this the "performance free lunch"
- Conventional wisdom: "Not to worry, tomorrow's processors will have even more throughput, and anyway today's applications are increasingly throttled by factors other than CPU throughput and memory speed (e.g., they're often I/O-bound, network-bound, database-bound)."

Commodity Improvements
- There are three main ways in which commodity processors keep improving:
  - Higher clock rates
  - More aggressive instruction reordering and more concurrent units
  - Bigger/faster caches
- All applications can easily benefit from these improvements, at the cost of perhaps a recompilation
- Unfortunately, the first two are hitting their limits:
  - Higher clock rates lead to high heat and power consumption
  - No more instruction reordering without compromising correctness

Is Moore's Law Not True?
- Ironically, Moore's Law is still true: the density indeed still doubles
- But its wrong interpretation is not: clock rates do not double any more
- We can't let this happen: computers have to get more powerful
- Therefore, the industry has thought of new ways to improve them: multi-core
  - Multiple CPUs on a single chip
  - Multi-core adds another level of concurrency
  - But unlike, say, multiple functional units, it is hard for a compiler to exploit
  - Therefore, applications must be rewritten to benefit from the (nowadays expected) performance increase
- "Concurrency is the next major revolution in how we write software" (Dr. Dobb's Journal, 30(3), March 2005)

Multi-core Processors
- In addition to putting concurrency in the public's eye, multi-core architectures will have a deep impact
- Languages will be forced to deal well with concurrency
  - New language designs? New language extensions? New compilers?
- Efficiency and performance optimization will become more important: write code that is fast on one core with a limited clock rate
- The CPU may very well become a bottleneck (again) for single-core programs
  - Other factors will improve, but not the clock rate
- Prediction: many companies will be hiring people to (re)write concurrent applications

Multi-Core
- Quote from PC World Magazine, Summer 2005: "Don't expect dual-core to be the top performer today for games and other demanding single-threaded applications. But that will change as applications are rewritten. For example, by year's end, Unreal Tournament should have released a new game engine that takes advantage of dual-core processing."

Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems:
  - Within a CPU
  - Within a "box"
  - Across boxes

Multiple Boxes Together
- Example:
  - Take four "boxes" (e.g., four Intel Itaniums bought from Dell)
  - Hook them up to a network (e.g., a switch bought from Cisco, Myricom, etc.)
  - Install software that allows you to write/run applications that can utilize these four boxes concurrently
- This is a simple way to achieve concurrency across computer systems
- Everybody has heard of "clusters" by now
  - They are basically like the above example and can be purchased already built from vendors
- We will talk about this kind of concurrent platform at length during this class
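
The "software that lets the boxes cooperate" is typically a message-passing library such as MPI, which this course uses later. Below is a minimal sketch in which every box (process) reports who it is; build and launch commands vary by installation, but something like "mpicc hello.c -o hello" followed by "mpirun -np 4 ./hello" is typical.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);                  /* start the MPI runtime     */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes total? */
        printf("Hello from process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }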

Multiple Boxes Together
- Why do we use multiple boxes?
  - Every programmer would rather have an SMP/multi-core architecture that provides all the power/memory she/he needs
  - The problem is that single boxes do not scale to meet the needs of many scientific applications:
    - Can't have enough processors or powerful enough cores
    - Can't have enough memory
- But if you can live with a single box, do it!
  - We will see that single-box programming is much easier than multi-box programming

Where Does This Leave Us?
- So far we have seen many ways in which concurrency can be achieved/implemented in computer systems:
  - Within a box
  - Across boxes
- So we could look at a system and just list all the ways in which it does concurrency
- It would be nice to have a great taxonomy of parallel platforms in which we can pigeon-hole all (past and present) systems
  - It would provide simple names that everybody can use and understand quickly

Taxonomy of Parallel Machines?
- It's not going to happen
  - As recently as last year, Gordon Bell and Jim Gray published an article in Comm. of the ACM discussing what the taxonomy should be
  - Dongarra, Sterling, etc. answered, telling them they were wrong, saying what the taxonomy should be, and proposing a new multidimensional scheme!
  - Both papers agree that most terms (e.g., MPP) are conflated, misused, etc.
- Complicated by the fact that concurrency appears at so many levels
  - Example: a 16-node cluster, where each node is a 4-way multiprocessor, where each processor is hyper-threaded, has vector units, and is fully pipelined with multiple, pipelined functional units

Taxonomy of Platforms?
- We'll look at one traditional taxonomy
- We'll look at current categorizations from the Top 500
- We'll look at examples of platforms
- We'll look at interesting/noteworthy architectural features that one should know as part of one's parallel computing culture

The Flynn Taxonomy
- Proposed in 1966!
- A functional taxonomy based on the notion of streams of information: data and instructions
- Platforms are classified according to whether they have a single (S) or multiple (M) stream of each of the above
- Four possibilities:
  - SISD (sequential machine)
  - SIMD
  - MISD (rare, no commercial system... systolic arrays)
  - MIMD

Taxonomy of Parallel Computers
(Figure: Flynn's taxonomy of parallel computers.)

SIMD
- (Diagram: a control unit fetches and decodes a single stream of instructions and broadcasts them to many processing elements (PEs).)
- PEs can be deactivated and activated on the fly
- Vector processing (e.g., vector add) is easy to implement on SIMD
- Debate: is a vector processor an SIMD machine?
  - The two are often confused
  - Strictly speaking, no, according to the taxonomy (it's really SISD with pipelined operations)
  - But it's convenient to think of the two as equivalent

MIMD
- The most general category
- Pretty much every supercomputer in existence today is a MIMD machine at some level
  - This limits the usefulness of the taxonomy
  - But you have to have heard of it at least once, because people keep referring to it, somehow...
- Other taxonomies have been proposed, none very satisfying
- Shared- vs. distributed-memory is a common distinction among machines, but these days many are hybrid anyway

Taxonomy of Parallel Computers
(Figure: a taxonomy of parallel computers.)

A Host of Parallel Machines
- There are (and have been) many kinds of parallel machines
- For the last 12 years their performance has been measured and recorded with the LINPACK benchmark, as part of the Top 500 list
- It is a good source of information about what machines are (and were) and how they have evolved
- Note that it's really about "supercomputers"
- http://www.top500.org

Top 500 chart slides (figures from http://www.top500.org):
- What can we find on the Top 500?
- Pies
- Top Ten Computers
- Top 500 Computers -- Countries
- Top 500 Computers -- Manufacturers, and Manufacturers Trend
- Top 500 Computers -- Operating Systems, and Operating Systems Trend
- Top 500 Computers -- Processors, and Processors Trend
- Top 500 Computers -- Customers, and Customers Trend
- Top 500 Computers -- Applications, and Applications Trend
- Top 500 Computers -- Architecture, and Architecture Trend

SMPs
- "Symmetric Multi-Processors" (often mislabeled as "Shared-Memory Processors", which has now become tolerated)
- Processors are all connected to a (large) memory
- UMA: Uniform Memory Access, which makes it easy to program
- Symmetric: all memory is equally close to all processors
- Difficult to scale to many processors (typically < 32)
- Cache coherence via "snoopy caches" or "directories"
- (Diagram: processors P1, P2, ..., Pn, each with a cache ($), connected by a network/bus to a shared memory.)

Distributed Shared Memory
- Memory is logically shared, but physically distributed in banks
- Any processor can access any address in memory
- Cache lines (or pages) are passed around the machine
- Cache coherence: distributed directories
- NUMA: Non-Uniform Memory Access (some processors may be closer to some banks)
- (Diagram: processors P1, P2, ..., Pn, each with a cache and a local memory bank, connected by a network.)
- The SGI Origin 2000 is a canonical example
  - Scales to 100s of processors
  - Hypercube topology for the memory (later)

Clusters, Constellations, MPPs
- These are the only 3 categories today in the Top 500
- They all belong to the distributed-memory model (MIMD), with many twists
- Each processor/node has its own memory and cache but cannot directly access another processor's memory
  - Nodes may be SMPs
- Each "node" has a network interface (NI) for all communication and synchronization
- (Diagram: nodes P0, P1, ..., Pn, each with its own memory and NI, connected by an interconnect.)
- So what are these 3 categories?

Clusters
- 58.2% of the Top 500 machines are labeled as "clusters"
- Definition: a parallel computer system comprising an integrated collection of independent "nodes", each of which is a system in its own right, capable of independent operation, and derived from products developed and marketed for other standalone purposes
- A commodity cluster is one in which both the network and the compute nodes are available in the market
- In the Top 500, "cluster" means "commodity cluster"
- A well-known type of commodity cluster is the "Beowulf-class PC cluster", or "Beowulf"

What is Beowulf?
- An experiment in parallel computing systems
- Established a vision of low-cost, high-end computing with public-domain software (and led to software development)
- Tutorials and a book on best practice for how to build such platforms
- Today, "Beowulf cluster" means a commodity cluster that runs Linux and GNU-type software
- Project initiated by T. Sterling and D. Becker at NASA in 1994

Constellations???
- Commodity clusters that differ from the previous ones by the dominant level of parallelism
- Clusters consist of nodes, and nodes are typically SMPs
- If there are more processors in a node than nodes in the cluster, then we have a constellation
- Typically, constellations are space-shared among users, with each user running OpenMP on a node, although an app could run on the whole machine using MPI/OpenMP
- To be honest, this term is not very useful and not much used

MPP????
- Probably the most imprecise term for describing a machine (isn't a 256-node cluster of 4-way SMPs massively parallel?)
- May use proprietary networks and vector processors, as opposed to commodity components
- The Cray T3E, Cray X1, and Earth Simulator are distributed-memory machines, but the nodes are SMPs
- Basically, everything that's fast and not commodity is an MPP, in terms of today's Top 500
- Let's look at these "non-commodity" things
  - People's definitions of "commodity" vary

Vector Processors
- Vector architectures were based on a single processor
  - Multiple functional units, all performing the same operation
  - Instructions may specify large amounts of parallelism (e.g., 64-way), but the hardware executes only a subset in parallel
- Historically important
  - Overtaken by MPPs in the 90s, as seen in the Top 500
- Re-emerging in recent years
  - At a large scale in the Earth Simulator (NEC SX-6) and Cray X1
  - At a small scale in SIMD media extensions to microprocessors: SSE, SSE2 (Intel: Pentium/IA-64), AltiVec (IBM/Motorola/Apple: PowerPC), VIS (Sun: SPARC)
- Key idea: the compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to

Vector Processors
- Advantages:
  - Quick fetch and decode of a single instruction for multiple operations
  - The instruction provides the processor with a regular source of data, which can arrive each cycle and be processed in a pipelined fashion
  - The compiler does the work for you
- Memory-to-memory variants:
  - No registers
  - Can process very long vectors, but startup time is large
  - Appeared in the 70s and died in the 80s
- Vendors: Cray, Fujitsu, Hitachi, NEC

Global Address Space
- Cray T3D, T3E, X1, and HP AlphaServer clusters
- The network interface supports "Remote Direct Memory Access":
  - The NI can directly access memory without interrupting the CPU
  - One processor can read/write remote memory with one-sided operations (put/get)
  - Not just a load/store as on a shared-memory machine
  - Remote data is typically not cached locally
- (Diagram: nodes P0, P1, ..., Pn, each with memory and an NI, connected by an interconnect.)
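
One-sided put/get also exists in portable form. Below is a minimal sketch using MPI's one-sided operations, an analogue of the RDMA-style put described above rather than the Cray-specific API; run it with at least two processes.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, value = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Expose one int per process as a window others can access. */
        MPI_Win_create(&value, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0 && size > 1) {
            int payload = 42;
            /* One-sided put: write into rank 1's window without rank 1
             * issuing a matching receive. */
            MPI_Put(&payload, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);   /* completes the put on both sides */

        if (rank == 1)
            printf("rank 1 received %d via one-sided put\n", value);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }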

Blue Gene/L
- 65,536 processors
- Relatively modest clock rates, so that power consumption is low, cooling is easy, and space is small (1,024 nodes in the same rack)
- Besides, processor speed is on par with memory speed, so a faster clock rate would not help
- 2-way SMP nodes (really different from the X1)
- Several networks:
  - 64 x 32 x 32 3-D torus for point-to-point communication
  - A tree for collective operations and for I/O
  - Plus others: Ethernet, etc.

BlueGene
(Figure: the BlueGene/L custom processor chip.)

BlueGene
(Figure: the BlueGene/L. (a) Chip. (b) Card. (c) Board. (d) Cabinet. (e) System.)

If You Like Dead Supercomputers
- Lots of old supercomputers with pictures: http://www.geocities.com/Athens/6270/superp.html
- Dead Supercomputers: http://www.paralogos.com/DeadSuper/Projects.html
- eBay: Cray Y-MP/C90, 1993, sold for $45,100.70
  - From the Pittsburgh Supercomputer Center, who wanted to get rid of it to make space in their machine room
  - Original cost: $35,000
  - Weight: 30 tons
  - Cost $400,000 to make it work at the buyer's ranch in Northern California

Network Topologies
- People have experimented with different topologies for distributed-memory machines, or to arrange memory banks in NUMA shared-memory machines
- Examples include:
  - Ring: KSR (1991)
  - 2-D grid: Intel Paragon (1992)
  - Torus
  - Hypercube: nCube, Intel iPSC/860; used in the SGI Origin 2000 for memory
  - Fat-tree: IBM Colony and Federation interconnects (SP-x)
  - Arrangements of switches, pioneered with "butterfly networks" like the BBN TC2000 in the early 1990s
    - 200 MHz processors in a multi-stage network of switches
    - Virtually shared distributed memory (NUMA)
    - I actually worked with that one!

Hypercube
- Defined by its dimension, d
- (Diagram: 1-D, 2-D, 3-D, and 4-D hypercubes.)

Hypercube
- Properties:
  - Has 2^d nodes
  - The number of hops between two nodes is at most d
  - The diameter of the network grows logarithmically with the number of nodes, which was the key for interest in hypercubes
  - But each node needs d neighbors, which is a problem
- Routing and addressing (diagram: a 4-D hypercube with nodes labeled by 4-bit addresses 0000 through 1111):
  - Each node has a d-bit address
  - Routing from xxxx to yyyy: just keep going to a neighbor that has a smaller Hamming distance to the destination
  - Reminiscent of some P2P schemes
- TONS of hypercube research (even today!)
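
The Hamming-distance routing rule fits in a few lines of C. A minimal sketch (the function name is illustrative) that, at each hop, flips the lowest-order bit in which the current node and the destination differ, so the distance shrinks by one per hop:

    #include <stdio.h>

    /* One routing step on a hypercube: move to the neighbor that differs
     * from the current node in the lowest-order bit where current and
     * destination disagree.  Each step reduces the Hamming distance by one,
     * so a packet needs at most d hops on a d-dimensional hypercube. */
    static unsigned next_hop(unsigned current, unsigned dest) {
        unsigned diff = current ^ dest;
        if (diff == 0) return current;        /* already at the destination */
        return current ^ (diff & -diff);      /* flip lowest differing bit  */
    }

    int main(void) {
        unsigned cur = 0x6 /* 0110 */, dest = 0x9 /* 1001 */;
        printf("route from 6 (0110) to 9 (1001):");
        while (cur != dest) {
            cur = next_hop(cur, dest);
            printf(" -> %u", cur);            /* prints 7, 5, 1, 9 */
        }
        printf("\n");
        return 0;
    }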

Conclusion
- Concurrency appears at all levels
- Both in "commodity systems" and in "supercomputers"
  - The distinction is rather annoying
- When needing performance, one has to exploit concurrency to the best of its capabilities
  - e.g., as a developer of a geophysics application to run on a 10,000-processor heavy-iron supercomputer at the Sandia national lab
  - e.g., as a game developer on an 8-way multi-core, hyper-threaded desktop system sold by Dell
- In this course we'll gain hands-on understanding of how to write concurrent/parallel software
  - Using the GCB and MIND clusters
  - Using the LA Grid and Open Science Grid