Architecture Classifications: A Taxonomy of Parallel Architectures

Architecture Classifications
A taxonomy of parallel architectures: in 1972, Flynn categorised HPC architectures into four classes, based on how many instruction and data streams can be observed in the architecture. They are:
• SISD - Single Instruction, Single Data
  - Operates sequentially on a single stream of instructions in a single memory.
  - The classic “von Neumann” architecture.
  - Machines may still consist of multiple processors operating on independent data; these can be considered as multiple SISD systems.
• SIMD - Single Instruction, Multiple Data
  - A single instruction stream (broadcast to all PEs*), acting on multiple data.
  - The most common form of this architecture class is the vector processor.
  - These can deliver results several times faster than scalar processors.
* PE = Processing Element

Architecture Classifications
• MISD - Multiple Instruction, Single Data
  - There is a debate about whether an architecture with uniform shared memory but separate caches is MISD or MIMD (MIMD is favoured).
  - There are no other practical implementations of this architecture class.
• MIMD - Multiple Instruction, Multiple Data
  - Independent instruction streams, acting on different (but related) data.
  - Note the difference between multiple SISD and MIMD.

Architecture Classifications
• MIMD: SMP, NUMA, MPP, Cluster
• SISD: machine with a single scalar processor
• SIMD: machine with vector processors

Architecture Classifications
Shared memory (uniform memory access)
• Processors share access to a common memory space.
  - Implemented over a shared memory bus or communication network.
• Memory locks are required, and support for critical sections is required (see the sketch below).
• Local cache is critical:
  - Without it, bus contention (or network traffic) reduces the system's efficiency.
  - For this reason, pure shared memory systems do not scale. (Scalability is the measure of how well system performance improves linearly with the number of processing elements.)
  - Naturally, cache introduces problems of coherency (ensuring that stale cache lines are invalidated when other processors alter shared memory).
[Figure: PE 0 ... PE n connected through an interconnect to a shared memory]
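
The locking and critical-section requirement can be illustrated with a minimal sketch using POSIX threads; the thread count, counter variable, and iteration count are illustrative choices, not part of the slide.

#include <pthread.h>
#include <stdio.h>

/* Shared counter: all threads (PEs) see the same memory location. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        /* Critical section: without the lock, concurrent read-modify-write
         * sequences interleave and some increments are lost. */
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000 with the lock in place */
    return 0;
}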

Architecture Classifications
Shared memory (non-uniform memory access)
• A PE may fetch from local or remote memory, hence the non-uniform access times.
  - NUMA
  - cc-NUMA (cache-coherent Non-Uniform Memory Access)
• Groups of processors are connected together by a fast interconnect (forming an SMP).
• These groups are then connected together by a high-speed interconnect.
• Global address space.
[Figure: m SMP groups joined by an interconnect; group 1 holds PEs 1 to n, group m holds PEs (m-1)n+1 to mn, each group with its own shared memory]

Architecture Classifications
Distributed Memory
• Each processor has its own local memory.
• When processors need to exchange (or share) data, they must do this through explicit communication.
  - Message passing (e.g. MPI; see the sketch below).
• Typically larger latencies between PEs (especially if they communicate over network interconnections).
• Scalability, however, is good if the problems can be sufficiently contained within PEs.
• Typically, coarse-grained work units are distributed.
[Figure: PE 0 ... PE n, each with its own local memory M 0 ... M n, connected by an interconnect]
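
A minimal message-passing sketch using MPI, which the slide names; the ranks, tag, and payload below are illustrative. Run with at least two ranks, e.g. mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                        /* data held in PE 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* PE 1 cannot read PE 0's memory; the data must arrive as a message. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}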

In-processor Parallelism
Pipelines
• Instruction pipelines
  - Reduce the idle time of hardware components.
  - Give good performance with independent instructions (see the sketch below).
• Perform more operations per clock cycle.
• The discrepancy between peak and actual performance is often caused by pipeline effects.
  - It is difficult to keep pipelines full.
  - Branch prediction helps.
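
A small illustration (not taken from the slides) of why independent instructions keep a pipeline full: the first loop is a serial chain where every addition waits for the previous result, while the second exposes four independent accumulators the hardware can overlap.

/* Dependent chain: each iteration needs the s produced by the previous one. */
double sum_dependent(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators: additions can proceed in the pipeline
 * without waiting on each other; partial sums are combined at the end. */
double sum_independent(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* remaining elements, if n is not a multiple of 4 */
        s0 += x[i];
    return s0 + s1 + s2 + s3;
}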

In-processor Parallelism
Vector architectures
• Fast I/O: powerful buses and interconnections.
• Large memory bandwidth and low-latency access.
• No cache, because of the above.
• Perform operations involving large matrices, as commonly encountered in engineering areas.

In-processor Parallelism
Commodity processors increasingly provide performance as good as dedicated vector processors.
• Price/performance is also far better.
• Commodity processors now offer good performance for vectorizable code.
• Explicit support for vectorization with SIMD instructions on COTS processors (see the sketch below):
  - AltiVec on PowerPC
  - SSE (Streaming SIMD Extensions) on x86
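
A brief sketch of explicit SSE use on x86. The intrinsics are the standard SSE ones, but the function itself is illustrative and assumes n is a multiple of 4.

#include <xmmintrin.h>   /* SSE intrinsics */

/* c[i] = a[i] + b[i], processing four single-precision floats per instruction. */
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {          /* assumes n is a multiple of 4 */
        __m128 va = _mm_loadu_ps(&a[i]);      /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(&b[i]);      /* load 4 floats from b */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));  /* add and store 4 results */
    }
}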

Multiprocessor Parallelism
Use multiple processors on the same program:
• Divide the workload up between processors.
  - Often achieved by dividing up a data structure (see the sketch below).
• Each processor works on its own data.
• Typically processors need to communicate.
  - Shared or distributed memory is one approach.
  - Explicit messaging is increasingly common.
• Load balancing is critical for maintaining good performance.
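
A minimal sketch of dividing a data structure between processors, here using OpenMP on shared memory; OpenMP, the array, and the scaling operation are illustrative choices rather than anything prescribed by the slide.

#include <omp.h>

/* With schedule(static), each thread receives a contiguous block of
 * iterations, i.e. its own slice of the array, so the workload is divided
 * evenly between processors. Compile with OpenMP enabled (e.g. -fopenmp). */
void scale(double *a, int n, double factor)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] *= factor;
}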

Multiprocessor Parallelism
[Figure: three configurations compared: a single processor (one CPU with its memory); a symmetric multiprocessor with shared memory (several CPUs sharing one memory); an MPP system (CPUs, each with its own memory, connected by a network)]

Clusters
• Built using COTS components.
• Brought about by improved processor speed as well as networking and switching technology.
• Mass-produced commodity off-the-shelf (COTS) hardware, rather than expensive proprietary hardware built solely for supercomputers.

Clusters
Clusters are simpler to manage:
• Single image, single identity.
• Often run familiar operating systems.
  - Linux is probably the most popular.
  - Commodity compilers and support.
• Node-for-node swap-out on failure.
• Can run multi-processor parallel tasks.
• Or run sequential tasks for multiple users (job-level parallelism).

Clusters
Clustering of SMPs
• An attractive method of achieving high performance.
• SMPs reduce the network overhead.

Parallel Efficiency
The main issues that affect parallel efficiency are:
• Ratio of computation to communication
  - Higher computation usually yields better performance.
• Communication bandwidth and latency
  - Latency has the biggest impact.
• Scalability
  - How do the bandwidth and latency scale with the number of processors?
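
For reference, the usual quantitative definitions (standard textbook formulas rather than anything stated on the slide): with $T_1$ the run time on one processor and $T_p$ the run time on $p$ processors,

$$ S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}, $$

where $S$ is the speedup and $E$ the parallel efficiency; a well-scaling system keeps $E(p)$ close to 1 as $p$ grows.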

Dependency and Parallelism
• Granularity of parallelism: the size of the computations that are being performed in parallel.
• Four types of parallelism (in order of granularity size):
  - Instruction-level parallelism (e.g. pipelining)
  - Thread-level parallelism (e.g. running a multi-threaded Java program)
  - Process-level parallelism (e.g. running an MPI job on a cluster)
  - Job-level parallelism (e.g. running a batch of independent single-processor jobs on a cluster)

Dependency and Parallelism
• Dependency: if event A must occur before event B, then B is dependent on A.
• Two types of dependency:
  - Control dependency: waiting for the instruction which controls the execution flow to be completed.
    Example: IF (X != 0) THEN Y = 1.0/X; Y has a control dependency on X != 0.
  - Data dependency: dependency because of calculations or memory access (see the sketch below).
    Flow dependency: A = X + Y; B = A + C;
    Anti-dependency: B = A + C; A = X + Y;
    Output dependency: A = 2; X = A + 1; A = 5;
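
The three data dependencies from the slide, written out as a small C fragment; the declarations and initial values are purely illustrative, and the comments note which reorderings each dependency forbids.

void dependency_examples(void)
{
    double A, B, C = 1.0, X = 2.0, Y = 3.0;

    /* Flow (true) dependency: the second statement reads the A that the
     * first one writes, so they cannot be swapped or run in parallel. */
    A = X + Y;
    B = A + C;

    /* Anti-dependency: the first statement reads A, the second overwrites
     * it; swapping them would feed B the new value of A instead. */
    B = A + C;
    A = X + Y;

    /* Output dependency: both statements write A; reordering them changes
     * which value X picks up and what A holds at the end. */
    A = 2.0;
    X = A + 1.0;
    A = 5.0;

    (void)B; (void)X;   /* silence "set but not used" warnings */
}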

Identifying Dependency
• Draw a Directed Acyclic Graph (DAG) to identify the dependencies among a sequence of instructions.
  - Anti-dependency: a variable appears as a parent in a calculation and then as a child in a later calculation.
  - Output dependency: a variable appears as a child in a calculation and then as a child again in a later calculation.