Informationsteknologi
Computer Architecture I - Class 14
Saturday, November 17, 2007

Today's class
  • Parallel Computer Architectures

Reasons for Parallelism
  • Physical constraints, such as the speed of light and quantum mechanical effects, are being reached
  • To increase the speed of computation, provide parallel computation rather than faster CPUs

Levels of Parallelism
  • CPU: pipelining and superscalar architecture
  • Very long instruction words with implicit parallelism
  • CPUs with special features to handle multiple threads of control at once
  • Multiple CPUs on the same chip
  • Extra CPU boards with additional processing capacity
  • Replicate entire CPUs: multiprocessors and multicomputers

Parallel Computer Architectures

On-Chip Parallelism

Instruction-Level Parallelism
  • At the lowest level, achieve parallelism by issuing multiple instructions per clock cycle
  • Two ways to do this:
      • Superscalar processors
      • VLIW (Very Long Instruction Word) processors

VLIW Processors

The TriMedia VLIW CPU
  • Designed by Philips (co-inventor of the CD)
  • Used as an embedded processor in CD, DVD, and MP3 players, digital cameras, etc.
  • Each instruction holds up to 5 operations

The TriMedia VLIW CPU Characteristics
  • Byte-oriented memory
  • 32-bit words
  • 128 general-purpose 32-bit registers
      • R0 is always 0, R1 is always 1
  • Four special-purpose registers
      • PC, PSW, two registers for interrupt handling
  • 64-bit register that counts the number of CPU cycles since the CPU was last reset
      • Takes about 2000 years to wrap around at 300 MHz
  • 64 KB instruction cache, 16 KB data cache
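The wrap-around figure for the 64-bit cycle counter can be checked with a little arithmetic. This is a minimal sketch, assuming the counter increments once per clock cycle and a year of 365.25 days; the function name `wrap_years` is made up for illustration.

```python
# Rough check of the cycle-counter wrap-around time.
# Assumptions (not from the slide): one increment per clock cycle,
# and a year of 365.25 days.
COUNTER_BITS = 64
CLOCK_HZ = 300_000_000            # 300 MHz
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def wrap_years(bits, clock_hz):
    """Years until a `bits`-wide counter overflows at `clock_hz`."""
    return (2 ** bits) / clock_hz / SECONDS_PER_YEAR


years = wrap_years(COUNTER_BITS, CLOCK_HZ)
print(f"{years:.0f} years")
```

The result lands a little under 2000 years, which is consistent with the rounded figure on the slide.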

The TriMedia VLIW CPU Functional Units

On-Chip Multithreading
  • When a memory reference misses the level 1 and level 2 caches, there is a long wait until the requested word is loaded into the cache
  • This stalls the pipeline
  • On-chip multithreading deals with this by allowing the CPU to manage multiple threads of control at the same time

Fine-Grained and Coarse-Grained Multithreading

Multithreading with a Dual-Issue Superscalar CPU

Single-Chip Multiprocessors
  • Provide a larger performance gain than multithreading
  • Contain two or more CPUs

Homogeneous Multiprocessors on a Chip

Heterogeneous Multiprocessors on a Chip
  • The logical structure of a simple DVD player is a heterogeneous multiprocessor, with multiple cores for different functions

Coprocessors

Coprocessors
  • Increase the speed of a computer by adding a second, specialized processor

Introduction to Networking

Network Processors
  • Programmable devices that can handle incoming and outgoing network packets at wire speed

Media Processors
  • Handle high-resolution photographic images and audio and video streams
  • Ordinary CPUs are not good at the massive computations needed to process the large amounts of data in these applications

The Nexperia Media Processor

Shared-Memory Multiprocessors

Multiprocessor
  • A parallel computer in which all the CPUs share a common memory
  • Hard to build (because of the shared memory)
  • Easy to program

Multicomputer
  • Each CPU has its own private memory, accessible only to itself and not to any other CPU
  • Also known as a distributed memory system
  • Easy to build
  • Hard to program

Taxonomy of Parallel Computers

Taxonomy of Parallel Computers (continued)

Sequential Memory Consistency
  • In the presence of multiple read and write requests, some interleaving of all the requests is chosen by the hardware (nondeterministically), but all CPUs see the same order
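One way to see what sequential consistency allows is to enumerate every interleaving of a small program and run each against a single shared memory. The sketch below uses a hypothetical two-CPU test that is not taken from the slides (CPU 1 writes x and then reads y; CPU 2 writes y and then reads x), and all names in it are illustrative.

```python
from itertools import combinations

# Hypothetical two-CPU test program (an assumption, not from the slides):
#   CPU 1: x = 1; r1 = y        CPU 2: y = 1; r2 = x
CPU1 = [("write", "x", 1), ("read", "y", "r1")]
CPU2 = [("write", "y", 1), ("read", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def run(order):
    """Execute one global order against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, addr, arg in order:
        if kind == "write":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["r1"], regs["r2"]


outcomes = {run(order) for order in interleavings(CPU1, CPU2)}
print(sorted(outcomes))
```

Whichever interleaving the hardware picks, the result (r1, r2) = (0, 0) never appears in `outcomes`, because in any single global order at least one write precedes the other CPU's read.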

Processor Memory Consistency
  • Writes by any CPU are seen by all CPUs in the order they were issued
  • For every memory word, all CPUs see all writes to it in the same order
  • Does not guarantee that every CPU sees the same overall ordering of writes issued by different CPUs

Weak Memory Consistency
  • Does not guarantee that writes from a single CPU are seen in order by other CPUs

UMA Symmetric Multiprocessor Architectures
  • The simplest multiprocessors are based on a single bus

Cache Coherence
  • Suppose CPU 1 and CPU 2 each have a copy of the same data in their respective caches
  • Now suppose CPU 1 modifies the data in its cache and immediately thereafter CPU 2 reads its copy
  • CPU 2 will get stale data
  • This problem is known as the cache coherence problem
  • Solutions to this problem have the cache controller eavesdropping on the bus

Snooping Caches
  • The write-through cache coherence protocol: note that all writes go to memory
  • The empty boxes indicate that no action is taken
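The write-through idea can be sketched as a toy simulation. This is an illustration under stated assumptions, not the protocol table from the slide: the classes `Bus` and `SnoopingCache` are hypothetical, a write miss updates memory without loading the line into the cache, and a cache that snoops a remote write simply invalidates its own copy.

```python
class Bus:
    """Shared bus with main memory; every cache snoops the writes on it."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def write(self, source, addr, value):
        self.memory[addr] = value                # all writes go to memory
        for cache in self.caches:
            if cache is not source:
                cache.snoop_write(addr)


class SnoopingCache:
    """One CPU's cache in a toy write-through scheme (hypothetical class)."""

    def __init__(self, bus):
        self.lines = {}                          # address -> cached value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:               # read miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]                  # read hit: use the local copy

    def write(self, addr, value):
        if addr in self.lines:                   # write hit: update local copy
            self.lines[addr] = value
        self.bus.write(self, addr, value)        # write through to memory

    def snoop_write(self, addr):
        self.lines.pop(addr, None)               # remote write: invalidate


bus = Bus()
cpu1, cpu2 = SnoopingCache(bus), SnoopingCache(bus)
bus.memory[0x100] = 7
cpu1.read(0x100)
cpu2.read(0x100)                                 # both caches now hold 7
cpu1.write(0x100, 42)                            # cpu2's copy is invalidated
value_seen_by_cpu2 = cpu2.read(0x100)            # misses, refetches from memory
print(value_seen_by_cpu2)
```

Without the `snoop_write` step, CPU 2 would keep returning its stale copy, which is the cache coherence problem in miniature.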

The MESI Cache Coherence Protocol

The MESI Cache Coherence Protocol (continued)

Crossbar Switches
  • Use of a single bus limits the size of a UMA multiprocessor to 16 or 32 CPUs
  • The simplest circuit for connecting n CPUs to k memories is the crossbar switch
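A crossbar connecting n CPUs to k memories needs one crosspoint per (CPU, memory) pair, and two requests only conflict when they target the same memory module. The sketch below illustrates both points; the function names and the rule that the lowest-numbered CPU wins a conflict are assumptions made for the example.

```python
def crosspoints(n_cpus, k_memories):
    """Number of crosspoints in an n x k crossbar."""
    return n_cpus * k_memories


def schedule(requests):
    """Grant requests for one cycle.

    `requests` maps CPU number -> memory module number. A module serves one
    CPU per cycle; ties go to the lowest-numbered CPU (an assumed policy).
    Returns (granted, blocked).
    """
    granted, blocked, busy = {}, [], set()
    for cpu in sorted(requests):
        module = requests[cpu]
        if module in busy:
            blocked.append(cpu)
        else:
            busy.add(module)
            granted[cpu] = module
    return granted, blocked


print(crosspoints(16, 16))
granted, blocked = schedule({0: 2, 1: 2, 2: 0})
print(granted, blocked)
```

The crosspoint count grows with the product n times k, which is why crossbars become expensive for large machines.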

Multistage Switching Networks
  • Larger UMA multiprocessors are based on the humble 2 x 2 switch shown here
  • Messages arriving on either input line (A or B) can be switched to either output line (X or Y)
  • Messages contain four parts:
      • Module: which memory to use
      • Address: specifies an address within the module
      • Opcode: specifies the operation, such as READ or WRITE
      • Value: optional field, may contain an operand

Multistage Switching Networks (continued)

NUMA Multiprocessors
  • Non-Uniform Memory Access
  • Not all memory modules have the same access time
  • Three key characteristics:
      • A single address space visible to all CPUs
      • Access to remote memory is done using LOAD and STORE instructions
      • Access to remote memory is slower than access to local memory
  • NC-NUMA (no cache present)
  • CC-NUMA (coherent caches present)

NUMA Multiprocessors (continued)
  • A NUMA machine based on two levels of buses

Directory-Based Multiprocessor
  • Most popular approach for building cache-coherent NUMA multiprocessors
  • Maintains a database telling where each cache line is and what its status is
  • The database is queried on every instruction that references memory, so it must be kept in extremely fast special-purpose hardware

Directory-Based Multiprocessor: Example
  • Below is an example 256-node multiprocessor
  • Each node has one CPU and 16 MB RAM
  • Total memory is 2^32 bytes, divided into 2^26 cache lines of 64 bytes each

Directory-Based Multiprocessor: Uncached Line
  • Suppose CPU 20 issues a LOAD instruction for memory at physical address 0x24000108
  • This translates to node 36, line 4, offset 8
  • A request is made over the interconnection network; the directory for node 36 shows line 4 is not cached, so it is fetched from local RAM, sent back to node 20, and the directory entry for line 4 is updated to show it cached at node 20
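The translation of the address into node, line, and offset follows from the sizes in the example: 256 nodes need 8 bits, 16 MB per node in 64-byte lines gives 2^18 lines (18 bits), and a 64-byte line needs a 6-bit offset. A minimal sketch, with the function name `decode` chosen for illustration:

```python
NODE_BITS = 8      # 256 nodes
LINE_BITS = 18     # 16 MB per node / 64-byte lines = 2**18 lines per node
OFFSET_BITS = 6    # 64-byte cache line


def decode(addr):
    """Split a 32-bit physical address into (node, line, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    line = (addr >> OFFSET_BITS) & ((1 << LINE_BITS) - 1)
    node = (addr >> (OFFSET_BITS + LINE_BITS)) & ((1 << NODE_BITS) - 1)
    return node, line, offset


print(decode(0x24000108))  # (36, 4, 8)
```

The top byte 0x24 is 36 in decimal, and the low 24 bits 0x000108 split into line 4 and offset 8.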

Directory-Based Multiprocessor: Cached Line
  • Now suppose node 20 wants to load memory referenced by node 36's cache line 2
  • From the directory entry we see it is cached at node 82
  • At this point the hardware updates directory entry 2 to show the line is now at node 20, and then sends a message to node 82 telling it to pass the line to node 20 and invalidate its own copy
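The two cases in the example (line not cached, line cached elsewhere) can be sketched as a toy directory for one home node. The class `Directory` and its message strings are hypothetical, and the sketch assumes that each line is cached by at most one node at a time, as in the example.

```python
class Directory:
    """Toy directory for one home node (hypothetical class).

    Assumes a line is cached by at most one node at a time.
    """

    def __init__(self, home):
        self.home = home
        self.owner = {}        # line number -> node currently caching it

    def load(self, requester, line):
        """Handle a LOAD of `line`; return the message the home node sends."""
        holder = self.owner.get(line)
        self.owner[line] = requester       # directory now points at requester
        if holder is None:
            return (f"node {self.home}: fetch line {line} from RAM "
                    f"and send it to node {requester}")
        return (f"node {self.home}: tell node {holder} to pass line {line} "
                f"to node {requester} and invalidate its copy")


directory = Directory(home=36)
directory.owner[2] = 82                    # line 2 starts out cached at node 82
print(directory.load(requester=20, line=4))
print(directory.load(requester=20, line=2))
print(directory.owner)
```

After both loads the directory records node 20 as the holder of lines 4 and 2.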

Message-Passing Multicomputers

Multicomputers
  • Each CPU has its own private memory, not directly accessible to any other CPU
  • Programs interact by passing messages, since they cannot get at each other's memory via LOAD and STORE

Topology
  • (a) A star
  • (b) A complete interconnect
  • (c) A tree
  • (d) A ring

Topology (continued)
  • (e) A grid
  • (f) A double torus
  • (g) A cube
  • (h) A 4D hypercube
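For the cube and hypercube, the wiring has a compact description: if nodes are numbered in binary, two nodes are linked exactly when their numbers differ in one bit. A short sketch of that rule, with illustrative function names:

```python
def hypercube_neighbors(node, dims):
    """Nodes directly linked to `node` in a `dims`-dimensional hypercube."""
    return [node ^ (1 << d) for d in range(dims)]


def hops(a, b):
    """Minimum number of links between nodes a and b (differing bits)."""
    return bin(a ^ b).count("1")


print(hypercube_neighbors(0b0000, 4))  # [1, 2, 4, 8]
print(hops(0b0000, 0b1111))            # 4
```

In a 4D hypercube each of the 16 nodes has 4 links, and the farthest pair of nodes is 4 hops apart.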

Massively Parallel Processors
  • Use standard CPUs, such as the Intel Pentium, as their processors
  • Very high performance proprietary interconnection network
  • Enormous I/O capacity
  • Fault tolerant
      • Do not want a program that runs for many hours aborted because one CPU crashed

BlueGene/L Custom Processor Chip

BlueGene/L System

Cluster Computing
  • Consists of hundreds or thousands of computers connected by a commercially available network board
  • Centralized cluster: a cluster of workstations mounted in a big rack in a single room
  • Decentralized cluster: workstations spread around a building or campus, connected by a LAN

Google
  • Built the world's largest off-the-shelf cluster
  • It bought cheap, modest-performance PCs
  • Lots of them!
  • A typical Google PC has a 2 GHz Pentium, 512 MB RAM, 80 GB disk

A Typical Google Cluster

Communication Software
  • Special software is required for interprocess communication and synchronization
  • Most message-passing systems provide two primitives: SEND and RECEIVE

Three Main Semantics
  • Synchronous message passing
      • If the sender has executed SEND and the receiver has not yet executed RECEIVE, the sender is blocked until the receiver executes RECEIVE
  • Buffered message passing
      • If a message is sent before the receiver is ready, the message is buffered somewhere until the receiver takes it out
  • Nonblocking message passing
      • The sender is allowed to continue immediately after executing SEND
      • However, it may not reuse the message buffer, as the message may not have been sent yet
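The difference between synchronous and buffered SEND can be sketched with threads standing in for CPUs. This is an illustration only: the `Channel` class is hypothetical, it is built on Python's standard `queue.Queue`, and it does not model the nonblocking case or its buffer-reuse restriction.

```python
import queue
import threading


class Channel:
    """Toy message channel (hypothetical); threads stand in for CPUs."""

    def __init__(self):
        self._q = queue.Queue()

    def send_buffered(self, msg):
        self._q.put(msg)        # returns at once; the message waits in the queue

    def send_synchronous(self, msg):
        self._q.put(msg)
        self._q.join()          # block until the receiver has taken the message

    def receive(self):
        msg = self._q.get()     # block until a message is available
        self._q.task_done()     # releases a sender blocked in join()
        return msg


ch = Channel()
ch.send_buffered("a")           # returns although nobody is receiving yet
first = ch.receive()

sender = threading.Thread(target=ch.send_synchronous, args=("b",))
sender.start()
sender.join(timeout=0.2)        # give the sender time to reach SEND
still_blocked = sender.is_alive()   # sender is waiting inside SEND
second = ch.receive()           # RECEIVE releases the sender
sender.join()
print(first, still_blocked, second)
```

The buffered send returns immediately, while the synchronous sender stays blocked until the main thread executes the matching receive.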