

Introduction to Parallel Processing
• Parallel Computer Architecture: Definition & Broad issues involved
  – A Generic Parallel Computer Architecture
• The Need And Feasibility of Parallel Computing (Why?)
  – Scientific Supercomputing Trends
  – CPU Performance and Technology Trends, Parallelism in Microprocessor Generations
  – Computer System Peak FLOP Rating History/Near Future
• The Goal of Parallel Processing
• Elements of Parallel Computing
• Factors Affecting Parallel System Performance
• Parallel Architectures History
  – Parallel Programming Models
  – Flynn’s 1972 Classification of Computer Architecture
• Current Trends In Parallel Architectures
  – Modern Parallel Architecture Layered Framework
• Shared Address Space Parallel Architectures
• Message-Passing Multicomputers: Message-Passing Programming Tools
• Data Parallel Systems
• Dataflow Architectures
• Systolic Architectures: Matrix Multiplication Systolic Array Example
PCA Chapter 1.1, 1.2
CMPE 655 - Shaaban, lec #1, Fall 2015, 8-25-2015

Parallel Computer Architecture
A parallel computer (or multiple processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems fast by dividing such problems into parallel tasks, exploiting Thread-Level Parallelism (TLP), i.e. parallel processing.
Task = computation done on one processor.
• Broad issues involved:
– The concurrency and communication characteristics of parallel algorithms for a given computational problem (represented by dependency graphs).
– Computing resources and computation allocation:
  • The number of processing elements (PEs), the computing power of each element, and the amount/organization of physical memory used.
  • What portions of the computation and data are allocated or mapped to each PE.
– Data access, communication and synchronization (ordering):
  • How the processing elements cooperate and communicate.
  • How data is shared/transmitted between processors.
  • Abstractions and primitives for cooperation/communication and synchronization.
  • The characteristics and performance of the parallel system network (system interconnects).
– Parallel processing performance and scalability goals:
  • Maximize the performance enhancement of parallelism: maximize parallel speedup, by minimizing parallelization overheads and balancing the workload on processors.
  • Scalability of performance to larger systems/problems.
Processor = programmable computing element that runs stored programs written using a pre-defined instruction set.
Processing Elements = PEs = Processors

A Generic Parallel Computer Architecture
[Figure: processing (compute) nodes, each with a network interface AKA Communication Assist (CA), attached to the parallel machine network (interconnects, custom or industry standard); software layers: operating system, parallel programming environments.]
1 Processing nodes: Each processing node contains one or more processing elements (PEs) or processor(s), a memory system, plus a communication assist (network interface and communication controller).
  • One or more processing elements or processors per node: custom or commercial microprocessors.
  • Single or multiple processors per chip (2-8 cores per chip).
  • Homogeneous or heterogeneous.
2 Parallel machine network (system interconnects): The function of a parallel machine network is to transfer information (data, results, ...) efficiently (reducing communication cost) from source node to destination node as needed, to allow cooperation among parallel processing nodes to solve large computational problems divided into a number of parallel computational tasks.
Parallel Computer = Multiple Processor System

The Need And Feasibility of Parallel Computing
• Application demands (driving force): more computing cycles/data memory needed
– Scientific/Engineering computing: CFD, Biology, Chemistry, Physics, ...
– General-purpose computing: Video, Graphics, CAD, Databases, Transaction Processing, Gaming…
– Mainstream multithreaded programs are similar to parallel programs.
• Technology trends (Moore’s Law still alive):
– Number of transistors on chip growing rapidly. Clock rates expected to continue to go up, but only slowly. Actual performance returns diminishing due to deeper pipelines.
– Increased transistor density allows integrating multiple processor cores per chip, creating Chip-Multiprocessors (CMPs), even for mainstream computing applications (desktop/laptop, ...).
• Architecture trends:
– Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited.
– Increased clock rates require deeper pipelines with longer latencies and higher CPIs.
– Coarser-level parallelism (at the task or thread level, TLP), utilized in multiprocessor systems, is the most viable approach to further improve performance (+ multi-tasking: multiple independent programs).
  • Main motivation for the development of chip-multiprocessors (CMPs), i.e. multi-core processors.
• Economics:
– The increased utilization of commodity off-the-shelf (COTS) components in high performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost.
  • Today’s microprocessors offer high performance and have multiprocessor support, eliminating the need for designing expensive custom PEs.
  • Commercial System Area Networks (SANs) offer an alternative to more costly custom networks.

Why is Parallel Processing Needed?
Challenging Applications in Applied Science/Engineering (traditional driving force for HPC/parallel processing)
• Astrophysics
• Atmospheric and Ocean Modeling
• Bioinformatics
• Biomolecular simulation: Protein folding
• Computational Chemistry
• Computational Fluid Dynamics (CFD)
• Computational Physics
• Computer vision and image understanding
• Data Mining and Data-intensive Computing
• Engineering analysis (CAD/CAM)
• Global climate modeling and forecasting
• Material Sciences
• Military applications
• Quantum chemistry
• VLSI design
• ….
Such applications have very high 1- computational and 2- data memory requirements that cannot be met with single-processor architectures. Many applications contain a large degree of computational parallelism.
Driving force for High Performance Computing (HPC) and multiple processor system development.

Why is Parallel Processing Needed?
Scientific Computing Demands (memory requirement)
Driving force for HPC and multiple processor system development.
Computational and memory demands exceed the capabilities of even the fastest current uniprocessor systems (5-16 GFLOPS for a uniprocessor).
GFLOP = 10^9 FLOPS
TeraFLOP = 1000 GFLOPS = 10^12 FLOPS
PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS

Scientific Supercomputing Trends
• Proving ground and driver for innovative architecture and advanced high performance computing (HPC) techniques:
– Market is much smaller relative to the commercial (desktop/server) segment.
– Dominated by costly vector machines starting in the 1970s through the 1980s.
– Microprocessors have made huge gains in the performance needed for such applications (enabled by high transistor density/chip):
  • High clock rates. (Bad: Higher CPI?)
  • Multiple pipelined floating point units.
  • Instruction-level parallelism.
  • Effective use of caches.
  • Multiple processor cores/chip (2 cores 2002-2005, 4 end of 2006, 6-12 cores 2011, 16 cores in 2013).
However, even the fastest current single microprocessor systems still cannot meet the needed computational demands (as shown in the last slide).
• Currently: Large-scale microprocessor-based multiprocessor systems and computer clusters are replacing (replaced?) vector supercomputers that utilize custom processors.

Uniprocessor Performance Evaluation
• CPU performance benchmarking is heavily program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:
– Total CPU time (in seconds): $T = TC / f = TC \times C = I \times CPI \times C = I \times (CPI_{execution} + M \times k) \times C$
  TC = total program execution clock cycles
  f = clock rate
  C = CPU clock cycle time = 1/f
  I = instructions executed count
  CPI = cycles per instruction
  CPI_execution = CPI with ideal memory
  M = memory stall cycles per memory access
  k = memory accesses per instruction
– MIPS rating (in million instructions per second): $MIPS = I / (T \times 10^6) = f / (CPI \times 10^6) = f \times I / (TC \times 10^6)$
– Throughput rate (in programs per second): $W_p = 1/T = f / (I \times CPI) = MIPS \times 10^6 / I$
• Performance factors (I, CPI_execution, M, k, C) are influenced by: instruction-set architecture (ISA), compiler design, CPU micro-architecture, implementation and control, cache and memory hierarchy, program access locality, and program instruction mix and instruction dependencies.
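The performance measures above can be checked numerically. The following is a minimal sketch of the slide's formulas; the function names and the sample workload values (instruction count, stall cycles, clock rate) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def effective_cpi(cpi_execution, mem_stall_cycles, mem_accesses_per_instr):
    """CPI = CPI_execution + M x k (M, k as defined on the slide)."""
    return cpi_execution + mem_stall_cycles * mem_accesses_per_instr


def cpu_time(instructions, cpi, clock_hz):
    """T = I x CPI x C, with C = 1/f, written as I x CPI / f."""
    return instructions * cpi / clock_hz


def mips_rating(cpi, clock_hz):
    """MIPS = f / (CPI x 10^6)."""
    return clock_hz / (cpi * 1e6)


def throughput(instructions, cpi, clock_hz):
    """W_p = 1/T = f / (I x CPI), in programs per second."""
    return clock_hz / (instructions * cpi)


# Hypothetical program: 1e9 instructions, ideal CPI of 1.0,
# 4 stall cycles per memory access, 0.5 accesses per instruction, 3 GHz clock.
cpi = effective_cpi(1.0, 4, 0.5)       # 1.0 + 4 * 0.5 = 3.0
t = cpu_time(1e9, cpi, 3e9)            # 1e9 * 3.0 / 3e9 seconds
print(cpi, t, mips_rating(cpi, 3e9), throughput(1e9, cpi, 3e9))
```

With these assumed values the memory stalls triple the CPI relative to the ideal-memory case, which is why the cache and memory hierarchy appear among the performance factors.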

Single CPU Performance Trends
• The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance.
• This is even more true with the development of cost-effective multi-core microprocessors that support TLP at the chip level.
[Figure: performance trends of custom processors versus commodity processors.]

Microprocessor Frequency Trend
[Figure: processor clock frequency (MHz, log scale) and gate delays per clock versus year, 1987-2005, for Intel (386, 486, Pentium, Pentium Pro, Pentium II), IBM PowerPC (601, 603, 604+, MPC750) and DEC (21064A, 21066, 21164, 21164A, 21264, 21264S) processors. Annotation: processor frequency scales by 2X per generation.]
• Frequency doubles each generation? No longer the case.
• The number of gate delays per clock is reduced by 25%.
• This leads to deeper pipelines with more stages (e.g. Intel Pentium 4E has 30+ pipeline stages). T = I x CPI x C
Reality check: Clock frequency scaling is slowing down! (Did silicon finally hit the wall?)
Why? 1- Static power leakage 2- Clock distribution delays
Result: Deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle).
Solution: Exploit TLP at the chip level with Chip-Multiprocessors (CMPs).

Transistor Count Growth Rate
Enabling Technology for Chip-Level Thread-Level Parallelism (TLP)
[Figure: transistors per chip versus year, starting from the Intel 4004 (2300 transistors); currently ~7 billion. Annotation: ~ 3, 000 x transistor density increase in the last 40 years.]
Moore’s Law: 2X transistors/chip every 1.5 years (circa 1970) still holds.
Enables Thread-Level Parallelism (TLP) at the chip level: Chip-Multiprocessors (CMPs) + Simultaneous Multithreaded (SMT) processors (the solution).
• One billion transistors/chip reached in 2005, two billion in 2008-9, now ~seven billion.
• Transistor count grows faster than clock rate: currently ~40% per year.
• Single-threaded uniprocessors do not efficiently utilize the increased transistor count (limited ILP, increased size of cache).
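The slide quotes growth both as a doubling period (2X every 1.5 years) and as an annual rate (~40% per year). The sketch below converts between the two forms assuming simple compound growth; the helper names are hypothetical, and the comparison is an illustration rather than a statement from the lecture.

```python
import math


def annual_growth_rate(doubling_years):
    """Annual growth rate implied by doubling every `doubling_years` years."""
    return 2 ** (1.0 / doubling_years) - 1.0


def doubling_time(annual_rate):
    """Years needed to double at a compound annual growth rate."""
    return math.log(2) / math.log(1.0 + annual_rate)


# Doubling every 1.5 years corresponds to roughly 59% growth per year.
print(f"{annual_growth_rate(1.5):.1%}")
# 40% growth per year corresponds to doubling roughly every 2 years.
print(f"{doubling_time(0.40):.2f} years")
```

Under this compound-growth assumption the two figures describe somewhat different rates, so the "~40% per year" number is closer to the "doubling every two years" statement of Moore's Law.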

Parallelism in Microprocessor VLSI Generations
• Not pipelined: single-issue (multi-cycle, non-pipelined), CPI >> 1
• Pipelined: CPI = 1
• Superscalar/VLIW: CPI < 1 (ILP, AKA operation-level parallelism: multiple micro-operations per cycle, single thread)
• Chip-level TLP/parallel processing (per chip), even more important due to the slowing clock rate increase:
– Simultaneous Multithreading (SMT): e.g. Intel’s Hyper-Threading
– Chip-Multiprocessors (CMPs): e.g. IBM POWER4, 5; Intel Pentium D, Core Duo; AMD Athlon 64 X2, Dual Core Opteron; Sun UltraSPARC T1 (Niagara)
ILP = Instruction-Level Parallelism
TLP = Thread-Level Parallelism
Improving microprocessor generation performance by exploiting more levels of parallelism.

Dual-Core Chip-Multiprocessor (CMP) Architectures
• Single die, shared L2 (or L3) cache: cores communicate using the shared cache (lowest communication latency). Examples: IBM POWER4/5, Intel Pentium Core Duo (Yonah), Conroe (Core 2), i7, Sun UltraSPARC T1 (Niagara), AMD Phenom ….
• Single die, private caches, shared system interface (on-chip crossbar/switch): cores communicate using on-chip interconnects (shared system interface). Examples: AMD Dual Core Opteron, Athlon 64 X2, Intel Itanium 2 (Montecito).
• Two dies in a shared package, private caches, private system interface: cores communicate over the external Front Side Bus (FSB) (highest communication latency). Examples: Intel Pentium D, Intel Quad core (two dual-core chips).
Source: Real World Technologies, http://www.realworldtech.com/page.cfm?ArticleID=RWT101405234615

Example Six-Core CMP: AMD Phenom II X6
Six processor cores sharing 6 MB of level 3 (L3) cache.
CMP = Chip-Multiprocessor

Example Eight-Core CMP: Intel Nehalem-EX
Eight processor cores sharing 24 MB of level 3 (L3) cache. Each core is 2-way SMT (2 threads per core), for a total of 16 threads.

Example 100-Core CMP: Tilera TILE-Gx Processor
No shared cache. Network-on-Chip (NoC) example: communication links and an on-chip network switch.
For more information see: http://www.tilera.com/

Microprocessors Vs. Vector Processors
Uniprocessor Performance: LINPACK
[Figure: LINPACK performance of vector processors and microprocessors over time; reference level 1 GFLOP (10^9 FLOPS).]
Now about 5-16 GFLOPS per microprocessor core.

Parallel Performance: LINPACK
Current top LINPACK performance (since ~June 2013): now about 33,862,700 GFLOPS = 33,862.7 TeraFLOPS = 33.86 PetaFLOPS
Tianhe-2 (MilkyWay-2) (@ National University of Defense Technology, Changsha, China)
3,120,000 total processor cores: 384,000 Intel Xeon cores (32,000 Xeon E5-2692 12-core processors @ 2.2 GHz) + 2,736,000 Intel Xeon Phi cores (48,000 Xeon Phi 31S1P 57-core processors @ 1.1 GHz)
[Figure: LINPACK performance of parallel systems over time; reference level 1 TeraFLOP (10^12 FLOPS = 1000 GFLOPS).]
GFLOP = 10^9 FLOPS
TeraFLOP = 1000 GFLOPS = 10^12 FLOPS
PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS
The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org

Why is Parallel Processing Needed? LINPACK Performance Trends
[Figure: uniprocessor performance (reference level 1 GFLOP = 10^9 FLOPS) versus parallel system performance (reference level 1 TeraFLOP = 10^12 FLOPS = 1000 GFLOPS).]
GFLOP = 10^9 FLOPS
TeraFLOP = 1000 GFLOPS = 10^12 FLOPS
PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS

Computer System Peak FLOP Rating History
Current top peak FP performance (since ~June 2013): now about 54,902,400 GFLOPS = 54,902.4 TeraFLOPS = 54.9 PetaFLOPS
Tianhe-2 (MilkyWay-2) (@ National University of Defense Technology, Changsha, China)
3,120,000 total processor cores: 384,000 Intel Xeon cores (32,000 Xeon E5-2692 12-core processors @ 2.2 GHz) + 2,736,000 Intel Xeon Phi cores (48,000 Xeon Phi 31S1P 57-core processors @ 1.1 GHz)
[Figure: peak FLOP rating of computer systems over time, with reference levels TeraFLOP (10^12 FLOPS = 1000 GFLOPS) and PetaFLOP (10^15 FLOPS = 1000 TeraFLOPS, quadrillion FLOPS); Tianhe-2 (MilkyWay-2) marked.]
GFLOP = 10^9 FLOPS
TeraFLOP = 1000 GFLOPS = 10^12 FLOPS
PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS
The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org
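The Tianhe-2 figures quoted in this lecture can be cross-checked with a few lines of arithmetic. This is only a sketch: the variable names are hypothetical, the LINPACK value of 33,862,700 GFLOPS is the one quoted on the parallel LINPACK performance slide, and the ratio of LINPACK to peak performance is a derived quantity that the slides do not state.

```python
# Core counts as quoted on the slide.
xeon_cores = 32_000 * 12        # Xeon E5-2692, 12 cores each
phi_cores = 48_000 * 57         # Xeon Phi 31S1P, 57 cores each
total_cores = xeon_cores + phi_cores

# Performance figures quoted in the lecture, in GFLOPS.
linpack_gflops = 33_862_700
peak_gflops = 54_902_400

# Unit conversion: 1 PetaFLOP = 10^6 GFLOPS.
linpack_pflops = linpack_gflops / 1e6
peak_pflops = peak_gflops / 1e6

# Fraction of peak actually achieved on LINPACK (derived, not from the slides).
linpack_fraction_of_peak = linpack_gflops / peak_gflops

print(total_cores, linpack_pflops, peak_pflops, round(linpack_fraction_of_peak, 3))
```

The core counts add up to the quoted 3,120,000, and the sustained LINPACK rate is a bit over 60% of the peak rating.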

TOP500 Supercomputers: The Top 10 (table slides, power in KW)
Source (and for current list): www.top500.org
• November 2005
• 32nd List (November 2008)
• 34th List (November 2009)
• 36th List (November 2010)
• 38th List (November 2011)
• 40th List (November 2012)
  Cray XK7 - Titan (@ Oak Ridge National Laboratory)
  LINPACK performance: 17.59 PetaFLOPS (quadrillion FLOPS)
  Peak performance: 27.1 PetaFLOPS
  560,640 total processor cores: 299,008 Opteron cores (18,688 AMD Opteron 6274 16-core processors @ 2.2 GHz) + 261,632 GPU cores (18,688 Nvidia Tesla Kepler K20x GPUs @ 0.7 GHz)
• 41st List (June 2013)
• 42nd List (Nov. 2013)
• 43rd List (June 2014)
• 45th List (June 2015)

The Goal of Parallel Processing
• Goal of applications in using parallel machines: maximize speedup over single processor performance.
  $\text{Parallel Speedup}(p \text{ processors}) = \frac{\text{Performance}(p \text{ processors})}{\text{Performance}(1 \text{ processor})}$
• For a fixed problem size (input data set), performance = 1/time, which gives the fixed problem size parallel speedup, Speedup_p:
  $\text{Speedup}_{\text{fixed problem}}(p \text{ processors}) = \frac{\text{Time}(1 \text{ processor})}{\text{Time}(p \text{ processors})}$
• Ideal speedup = number of processors = p
  Very hard to achieve, due to parallelization overheads (communication cost, dependencies, ...) + load imbalance.

The Goal of Parallel Processing
• The goal of parallel processing is to maximize parallel speedup (fixed problem size):
  $\text{Speedup} = \frac{\text{Time}(1)}{\text{Time}(p)} < \frac{\text{Sequential Work on one processor}}{\max(\text{Work} + \text{Synch Wait Time} + \text{Comm Cost} + \text{Extra Work})}$
  Work may be measured as work or time. The maximum is taken over all processors, i.e. it is set by the processor with the maximum execution time. Synch wait time, communication cost and extra work are the parallelization overheads.
• Ideal speedup = number of processors
– Very hard to achieve: it implies (requires) no parallelization overheads and perfect load balance among all processors.
• Maximize parallel speedup by:
1 – Balancing computations on processors (every processor does the same amount of work and incurs the same amount of overheads).
2 – Minimizing communication cost and other overheads associated with each step of parallel program creation and execution.
• Performance scalability:
+ Achieve a good speedup for the parallel application on the parallel architecture as problem size and machine size (number of processors) are increased.
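The speedup bound above can be sketched as a small function: the parallel time is set by the processor with the largest sum of useful work and overheads. The function name, the tuple layout and the sample numbers are hypothetical, chosen to contrast a perfectly balanced run with an imbalanced one.

```python
def speedup_bound(sequential_work, per_processor_costs):
    """Sequential work divided by the cost of the slowest processor.

    per_processor_costs: one (work, synch_wait, comm_cost, extra_work)
    tuple per processor, all in the same time units as sequential_work.
    """
    slowest = max(work + synch + comm + extra
                  for work, synch, comm, extra in per_processor_costs)
    return sequential_work / slowest


# Ideal case: 100 units of work split evenly over 4 processors, no overheads.
ideal = speedup_bound(100, [(25, 0, 0, 0)] * 4)

# Imbalanced case with overheads: one processor does 40 units of work,
# the others do 20 and then wait; every processor pays 5 units of communication.
imbalanced = speedup_bound(100, [(40, 0, 5, 0)] + [(20, 15, 5, 0)] * 3)

print(ideal, round(imbalanced, 2))
```

In the imbalanced case the bound drops from 4 to about 2.2 even though four processors are used, which illustrates why both load balance and low overheads are needed to approach the ideal speedup.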

Elements of Parallel Computing
[Figure: block diagram relating the elements of parallel computing. Labels: Computing Problems (HPC driving force); Parallel Algorithms and Data Structures (dependency analysis, task dependency graphs); Mapping (assign parallel computations (tasks) to processors); Parallel Programming; High-level Languages; Parallel Program; Binding (compile, load); Operating System; Applications Software; Parallel Hardware Architecture (processing nodes/network); Performance Evaluation (e.g. parallel speedup).]

Elements of Parallel Computing
1 Computing Problems (driving force):
– Numerical computing: Science and engineering numerical problems demand intensive integer and floating point computations.
– Logical reasoning: Artificial intelligence (AI) problems demand logic inferences, symbolic manipulations, and large space searches.
2 Parallel Algorithms and Data Structures:
– Special algorithms and data structures are needed to specify the computations and communication present in computing problems (from dependency analysis).
– Most numerical algorithms are deterministic and use regular data structures.
– Symbolic processing may use heuristics or non-deterministic searches.
– Parallel algorithm development requires interdisciplinary interaction.

Elements of Parallel Computing
3 Hardware Resources (parallel architecture):
A – Computing power: Processors, memory, and peripheral devices (processing nodes) form the hardware core of a computer system.
B – Communication/connectivity: Processor connectivity (system interconnects, network) and memory organization influence the system architecture.
4 Operating Systems:
– Manage the allocation of resources to running processes.
– Mapping to match algorithmic structures with hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
  • Parallelism exploitation is possible at: 1- algorithm design, 2- program writing, 3- compilation, and 4- run time.

Elements of Parallel Computing
5 System Software Support:
– Needed for the development of efficient programs in high-level languages (HLLs).
– Assemblers, loaders.
– Portable parallel programming languages/libraries.
– User interfaces and tools.
6 Compiler Support (approaches to parallel programming, illustrated next):
(a) – Implicit parallelism approach:
  • Parallelizing compiler: can automatically detect parallelism in sequential source code and transform it into parallel constructs/code.
  • Source code is written in conventional sequential languages.
(b) – Explicit parallelism approach:
  • The programmer explicitly specifies parallelism using:
    – A sequential compiler (conventional sequential HLL) and the low-level library of the target parallel computer, or
    – A concurrent (parallel) HLL.
  • Concurrency-preserving compiler: the compiler in this case preserves the parallelism explicitly specified by the programmer. It may perform some program flow analysis, dependence checking, and limited optimizations for parallelism detection.

Two Approaches to Parallel Programming
(a) Implicit parallelism: Programmer → source code written in sequential languages (C, C++, FORTRAN, LISP, ...) → parallelizing compiler → parallel object code → execution by runtime system.
  The compiler automatically detects parallelism in sequential source code and transforms it into parallel constructs/code.
(b) Explicit parallelism: Programmer → source code written in concurrent dialects of C, C++, FORTRAN, LISP, ... → concurrency-preserving compiler → concurrent object code → execution by runtime system.
  The programmer explicitly specifies parallelism using parallel constructs.

Factors Affecting Parallel System Performance
• Parallel algorithm related (i.e. inherent parallelism):
– Available concurrency and profile, grain size, uniformity, patterns.
  • Dependencies between computations represented by a dependency graph.
– Type of parallelism present: functional and/or data parallelism.
– Required communication/synchronization, uniformity and patterns.
– Data size requirements.
– Communication to computation ratio (C-to-C ratio, lower is better).
• Parallel program related:
– Programming model used.
– Resulting data/code memory requirements, locality and working set characteristics.
– Parallel task grain size.
– Assignment (mapping) of tasks to processors: dynamic or static.
– Cost of communication/synchronization primitives.
• Hardware/architecture related:
– Total CPU computational power available + number of processors (hardware parallelism).
– Types of computation modes supported.
– Shared address space vs. message passing.
– Communication network characteristics (topology, bandwidth, latency).
– Memory hierarchy properties.
Concurrency = Parallelism

A simple parallel execution example
[Figure: task dependency graph for tasks A-G; the sequential execution schedule on one processor (time 0 to 21); and a possible parallel execution schedule on two processors P0, P1 (time 0 to 16), with Comm and Idle intervals marked.]
Task = computation run on one processor.
Assume computation time for each task A-G = 3.
Assume communication time between parallel tasks = 1.
Assume communication can overlap with computation.
T1 = 21, T2 = 16
Speedup on two processors = T1/T2 = 21/16 = 1.3
What would the speedup be with 3 processors? 4 processors? 5 … ?
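The numbers in this example can be reproduced directly. The sketch below takes the two-processor time of 16 from the slide's schedule rather than deriving it, since the dependency graph itself is not recoverable from the text; the variable names are hypothetical, and parallel efficiency (speedup divided by the number of processors) is a standard derived measure that the slide does not quote.

```python
task_time = 3        # computation time of each task A-G (from the slide)
num_tasks = 7        # tasks A, B, C, D, E, F, G

t1 = num_tasks * task_time    # sequential time on one processor
t2 = 16                       # two-processor schedule length, taken from the slide

speedup = t1 / t2
efficiency = speedup / 2      # fraction of the two processors' capacity doing useful work

print(t1, speedup, efficiency)
```

The sequential time works out to 7 x 3 = 21 as stated, and the speedup of about 1.3 means roughly a third of the two processors' combined time goes to idle and communication intervals.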

Evolution of Computer Architecture
[Figure: tree diagram of architectural evolution. Labels: Scalar Sequential (non-pipelined); Lookahead; I/E Overlap (limited pipelining); Functional Parallelism; Multiple Func. Units; Pipeline (pipelined, single or multiple issue); Implicit Vector; Explicit Vector (vector/data parallel); Memory-to-Memory; Register-to-Register; SIMD (data parallel): Associative Processor, Processor Array; MIMD (parallel machines): Multicomputer (message passing), Multiprocessor (shared memory); Computer Clusters; Massively Parallel Processors (MPPs).]
I/E: Instruction Fetch and Execute
SIMD: Single Instruction stream over Multiple Data streams
MIMD: Multiple Instruction streams over Multiple Data streams

Parallel Architectures History
Historically, parallel architectures were tied to parallel programming models:
• Divergent architectures, with no predictable pattern of growth.
[Figure: application software and system software layered over divergent architectures: Systolic Arrays, Dataflow, SIMD (data parallel architectures), Message Passing, Shared Memory.]
More on this next lecture.

Parallel Programming Models
• Programming methodology used in coding parallel applications.
• Specifies: 1- communication and 2- synchronization.
• Examples:
– Multiprogramming, or multi-tasking (not true parallel processing!): No communication or synchronization at the program level. A number of independent programs run on different processors in the system. However, it is a good way to utilize multi-core processors for the masses!
– Shared memory address space (SAS): Parallel program threads or tasks communicate implicitly using a shared memory address space (shared data in memory).
– Message passing: Explicit point-to-point communication (via send/receive pairs) is used between parallel program tasks using messages.
– Data parallel: More regimented, global actions on data (i.e. the same operations over all elements of an array or vector).
  • Can be implemented with shared address space or message passing.
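The difference between the shared address space and message passing models can be sketched with Python threads. This is an illustration of the two communication styles only, not of real parallel hardware or of MPI: both versions run as threads in one process, the queue merely stands in for send/receive pairs, and all function names are hypothetical.

```python
import queue
import threading


def shared_address_space_sum(data, num_workers=4):
    """Workers communicate implicitly by updating shared data in memory."""
    total = [0]                      # shared data visible to every thread
    lock = threading.Lock()          # synchronization primitive

    def worker(chunk):
        partial = sum(chunk)
        with lock:                   # protect the shared update
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(data[i::num_workers],))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def message_passing_sum(data, num_workers=4):
    """Workers communicate explicitly by sending messages to a collector."""
    mailbox = queue.Queue()          # stands in for a send/receive channel

    def worker(chunk):
        mailbox.put(sum(chunk))      # explicit send of the partial result

    threads = [threading.Thread(target=worker, args=(data[i::num_workers],))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    total = sum(mailbox.get() for _ in range(num_workers))   # explicit receives
    for t in threads:
        t.join()
    return total


values = list(range(1000))
print(shared_address_space_sum(values), message_passing_sum(values))
```

In the first version synchronization is a separate concern handled by the lock; in the second, the blocking receive provides both the communication and the synchronization.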

Flynn’s 1972 Classification of Computer Architecture (Taxonomy)
Classified according to number of instruction streams (threads) and number of data streams in architecture.
Instruction Stream = Thread of Control or Hardware Context
(a) • Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines or uniprocessors.
(b) • Single Instruction stream over Multiple Data streams (SIMD): Vector computers, array of synchronized processing elements. Data parallel systems (GPUs?)
(c) • Multiple Instruction streams and a Single Data stream (MISD): Systolic arrays for pipelined execution.
(d) • Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers:
• Tightly coupled processors: Shared memory multiprocessors.
• Loosely coupled processors: Multicomputers. Unshared distributed memory, message-passing used instead (e.g. clusters)

Flynn’s Classification of Computer Architecture (Taxonomy)
Classified according to number of instruction streams (threads) and number of data streams in architecture.
[Figure: block diagrams of the four classes. CU = Control Unit, PE = Processing Element, M = Memory]
• Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines or uniprocessors. (Uniprocessor shown.)
• Single Instruction stream over Multiple Data streams (SIMD): Vector computers, array of synchronized processing elements. (Shown here: array of synchronized processing elements.)
• Multiple Instruction streams and a Single Data stream (MISD): Systolic arrays for pipelined execution.
• Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers or multiprocessor systems. (Distributed memory multiprocessor system shown.)

Current Trends In Parallel Architectures
• The extension of “computer architecture” to support communication and cooperation:
– OLD: Instruction Set Architecture (ISA) (conventional or sequential)
– NEW: Communication Architecture
• Defines:
1 – Critical abstractions, boundaries, and primitives (interfaces).
2 – Organizational structures that implement interfaces (hardware or software), i.e. the implementation of interfaces.
• Compilers, libraries and OS are important bridges today, i.e. software abstraction layers
More on this next lecture

Modern Parallel Architecture Layered Framework
[Figure: layers from top to bottom]
Software, user space:
• Parallel applications: CAD, database, scientific modeling
• Programming models: multiprogramming, shared address, message passing, data parallel
• Compilation or library
• Communication abstraction (user/system boundary)
Software, system space:
• Operating systems support
Hardware/software boundary (ISA)
Hardware (processing nodes & interconnects):
• Communication hardware
• Physical communication medium
More on this next lecture

Shared Address Space (SAS) Parallel Architectures
• Any processor can directly reference any memory location (in shared address space)
– Communication occurs implicitly as result of loads and stores
• Convenient: Communication is implicit via loads/stores
– Location transparency
– Similar programming model to time-sharing in uniprocessors
• Except processes run on different processors
• Good throughput on multiprogrammed workloads (i.e. multi-tasking)
• Naturally provided on a wide range of platforms
– Wide range of scale: few to hundreds of processors
• Popularly known as shared memory machines or model
– Ambiguous: Memory may be physically distributed among processing nodes, i.e. distributed shared memory multiprocessors
Sometimes called Tightly-Coupled Parallel Computers

Shared Address Space (SAS) Parallel Programming Model
• Process: virtual address space plus one or more threads of control
• Portions of address spaces of processes are shared:
[Figure: virtual address spaces for a collection of processes P0 ... Pn communicating via shared addresses. The shared portion of each address space maps to common physical addresses in the machine physical address space; each process also has a private portion (P0 private ... Pn private). Loads and stores to the shared space carry the communication.]
• Writes to shared address visible to other threads (in other processes too)
• Natural extension of the uniprocessor model:
• Conventional memory operations used for communication. Thus communication is implicit via loads/stores
• Special atomic operations needed for synchronization, i.e. for event ordering and mutual exclusion
• Using locks, semaphores etc. Thus synchronization is explicit
• OS uses shared memory to coordinate processes.
In SAS: Communication is implicit via loads/stores. Ordering/synchronization is explicit using synchronization primitives.
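
The split between implicit communication and explicit synchronization can be illustrated with a minimal sketch using Python's standard `threading` module. The names `shared`, `worker` and `run` are illustrative assumptions. The threads live in one process and so share its address space. The sketch shows the programming model only and says nothing about performance.

```python
import threading

shared = {"total": 0}        # shared data: every thread reads and writes it directly
lock = threading.Lock()      # explicit synchronization primitive (mutual exclusion)

def worker(values):
    for v in values:
        with lock:                   # synchronization is explicit
            shared["total"] += v     # communication is implicit: a plain load and store

def run(num_threads=4, n=1000):
    """Sum 0..n-1 with several threads updating one shared accumulator."""
    shared["total"] = 0
    chunks = [range(i, n, num_threads) for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                     # wait until every thread has finished
    return shared["total"]
```

No thread ever sends a value to another; each one updates the shared location, and the lock only orders those updates.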

Models of Shared-Memory Multiprocessors
1 • The Uniform/Centralized Memory Access (UMA) Model:
– All physical memory is shared by all processors.
– All processors have equal access (i.e. equal memory bandwidth and access latency) to all memory addresses.
– Also referred to as Symmetric Multiprocessors (SMPs).
2 • Distributed memory or Non-uniform Memory Access (NUMA) Model:
– Shared memory is physically distributed locally among processors. Access latency to remote memory is higher.
3 • The Cache-Only Memory Architecture (COMA) Model:
– A special case of a NUMA machine where all distributed main memory is converted to caches.
– No memory hierarchy at each processor.

Models of Shared-Memory Multiprocessors
[Figure: block diagrams of the three models. P: Processor, Mem (M): Memory, C ($): Cache, D: Cache directory. Interconnect: bus, crossbar or multistage network.]
1 Uniform Memory Access (UMA) Model or Symmetric Multiprocessors (SMPs)
2 Distributed memory or Non-uniform Memory Access (NUMA) Model
3 Cache-Only Memory Architecture (COMA)

Uniform Memory Access (UMA) Example: Intel Pentium Pro Quad
Circa 1997
4-way SMP, shared FSB
• All coherence and multiprocessing glue in processor module
• Highly integrated, targeted at high volume
• Computing node used in Intel’s ASCI-Red MPP
Bus-based Symmetric Multiprocessors (SMPs): a single Front Side Bus (FSB) is shared among processors. This severely limits scalability to only ~2-4 processors.

Non-Uniform Memory Access (NUMA) Example: 8-Socket AMD Opteron Node
Circa 2003
• Dedicated point-to-point interconnects (HyperTransport links) are used to connect processors, alleviating the traditional limitations of FSB-based SMP systems.
• Each processor has two integrated DDR memory channel controllers: memory bandwidth scales up with number of processors.
• NUMA architecture, since a processor can access its own memory at a lower latency than remote memory directly connected to other processors in the system.
• Total of 32 processor cores when quad-core Opteron processors are used (128 cores with 16-core processors)

Non-Uniform Memory Access (NUMA) Example: 4-Socket Intel Nehalem-EX Node

Distributed Shared-Memory Multiprocessor System Example: Cray T3E
Circa 1995-1999
NUMA MPP example (MPP = Massively Parallel Processor system)
[Figure: T3E node with Communication Assist (CA), attached to a 3D torus point-to-point network]
• Scale up to 2048 processors, DEC Alpha EV6 microprocessor (COTS)
• Custom 3D torus point-to-point network, 480 MB/s links
• Memory controller generates communication requests for non-local references
• No hardware mechanism for coherence (SGI Origin etc. provide this)
More recent Cray MPP example: Cray X1E supercomputer
Example of Non-uniform Memory Access (NUMA)

Message-Passing Multicomputers
• Composed of multiple autonomous computers (computing nodes) connected via a suitable network: industry standard System Area Network (SAN) or proprietary network.
• Each node consists of one or more processors, local memory, attached storage and I/O peripherals, and a Communication Assist (CA).
• Local memory is only accessible by local processors in a node (no shared memory among nodes).
• Inter-node communication is carried out explicitly by message passing through the connection network via send/receive operations. Thus communication is explicit.
• Process communication is achieved using a message-passing programming environment (e.g. PVM, MPI), which is portable and platform-independent.
– Programming model more removed or abstracted from basic hardware operations
• Include:
1 – A number of commercial Massively Parallel Processor systems (MPPs).
2 – Computer clusters that utilize commodity off-the-shelf (COTS) components.
Also called Loosely-Coupled Parallel Computers

Message-Passing Abstraction
[Figure: sender process P executes Send(X, Q, t), naming the data at address X in its local process address space, the recipient Q and a tag t. Recipient process Q executes Receive(Y, P, t), naming address Y in its local process address space, the sender P and the tag t. The send and receive are matched on process and tag; the recipient blocks (waits) until the message is received.]
• Send specifies buffer to be transmitted and receiving process.
• Receive specifies sending process and application storage to receive into.
• Memory to memory copy possible, but need to name processes.
• Optional tag on send and matching rule on receive.
• User process names local data and entities in process/tag space too
• In simplest form, the send/receive match achieves an implicit pairwise synchronization event
– Ordering of computations according to dependencies (i.e. event ordering, in this case)
• Many possible overheads: copying, buffer management, protection...
Communication is explicit via sends/receives. Synchronization is implicit: pairwise synchronization using the send/receive match (data dependency/ordering between sender P and the blocking receive in recipient Q).
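
The send/receive match can be sketched as a small simulation. It uses only Python's standard `threading` and `queue` modules; it is not MPI or PVM. The `Mailbox` class, `send` helper and `run_example` function are hypothetical names introduced for illustration. Two threads stand in for processes P and Q, and data is copied on send so that nothing is shared between them.

```python
import queue
import threading

class Mailbox:
    """Hypothetical per-process mailbox with (source, tag) matching."""
    def __init__(self):
        self._incoming = queue.Queue()
        self._unmatched = []          # messages that arrived but did not match yet

    def deliver(self, source, tag, data):
        self._incoming.put((source, tag, data))

    def receive(self, source, tag):
        """Blocking receive: waits until a message from `source` with `tag` arrives."""
        for i, (s, t, d) in enumerate(self._unmatched):
            if (s, t) == (source, tag):
                del self._unmatched[i]
                return d
        while True:
            s, t, d = self._incoming.get()        # blocks while the mailbox is empty
            if (s, t) == (source, tag):
                return d
            self._unmatched.append((s, t, d))     # keep for a later receive

def run_example():
    mailboxes = {"P": Mailbox(), "Q": Mailbox()}
    received = {}

    def send(sender, data, dest, tag):
        # Copy the buffer: the processes have no shared memory in this model.
        mailboxes[dest].deliver(sender, tag, list(data))

    def process_p():
        x = [1, 2, 3]
        send("P", [0], "Q", tag=99)   # different tag: must not match the receive below
        send("P", x, "Q", tag=7)      # Send(X, Q, t)

    def process_q():
        received["Y"] = mailboxes["Q"].receive(source="P", tag=7)   # Receive(Y, P, t)

    q = threading.Thread(target=process_q)
    p = threading.Thread(target=process_p)
    q.start()
    p.start()
    p.join()
    q.join()
    return received["Y"]
```

Process Q cannot proceed past its receive until P has sent the matching message, which is the implicit pairwise synchronization described above.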

Message-Passing Example: Intel Paragon
Circa 1993
[Figure: Paragon node with Communication Assist (CA)]
• Each node is a 2-way SMP
• 2D grid point-to-point network

Message-Passing Example: IBM SP-2 MPP
Circa 1994-1998
MPP = Massively Parallel Processor system
[Figure: SP-2 node with Communication Assist (CA)]
• Made out of essentially complete RS/6000 workstations
• Network interface integrated in I/O bus (bandwidth limited by I/O bus)
• Multi-stage network

Message-Passing MPP Example: IBM Blue Gene/L
Circa 2005
System location: Lawrence Livermore National Laboratory
• (2 processors/chip) • (2 chips/compute card) • (16 compute cards/node board) • (32 node boards/tower) • (64 towers) = 128k = 131072 (0.7 GHz PowerPC 440) processors (64k nodes)
• 2.8 GFLOPS peak per processor core
• Networks (both proprietary): 3D torus point-to-point network; global tree network
• Design goals:
– High computational power efficiency
– High computational density per volume
• LINPACK performance: 280,600 GFLOPS = 280.6 TeraFLOPS = 0.2806 PetaFLOPS
• Top peak FP performance: about 367,000 GFLOPS = 367 TeraFLOPS = 0.367 PetaFLOPS
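
The processor count and peak rating quoted above follow from the packaging hierarchy, as the arithmetic sketch below shows. All figures come from the slide. The variable names are illustrative, and the LINPACK-to-peak ratio is a derived number not stated on the slide.

```python
# Blue Gene/L packaging hierarchy, figures from the slide.
PROCESSORS_PER_CHIP = 2
CHIPS_PER_COMPUTE_CARD = 2
COMPUTE_CARDS_PER_NODE_BOARD = 16
NODE_BOARDS_PER_TOWER = 32
TOWERS = 64

processors = (PROCESSORS_PER_CHIP * CHIPS_PER_COMPUTE_CARD *
              COMPUTE_CARDS_PER_NODE_BOARD * NODE_BOARDS_PER_TOWER * TOWERS)
nodes = processors // PROCESSORS_PER_CHIP     # one node per 2-processor chip (64k nodes)

PEAK_GFLOPS_PER_CORE = 2.8
peak_gflops = processors * PEAK_GFLOPS_PER_CORE       # system peak rating

LINPACK_GFLOPS = 280_600
linpack_fraction_of_peak = LINPACK_GFLOPS / peak_gflops   # derived, not on the slide
```

With these inputs the peak works out to about 367,000 GFLOPS, and the measured LINPACK figure is roughly three quarters of that peak.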

Message-Passing Programming Tools
• Message-passing programming environments include:
– Message Passing Interface (MPI):
• Provides a standard for writing concurrent message-passing programs.
• MPI implementations include parallel libraries used by existing programming languages (C, C++).
– Parallel Virtual Machine (PVM):
• Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource.
• PVM support software executes on each machine in a user-configurable pool, and provides a computational environment for concurrent applications.
• User programs written for example in C, Fortran or Java are provided access to PVM through the use of calls to PVM library routines.
Both MPI and PVM are examples of the explicit parallelism approach to parallel programming: they are portable (platform-independent) and allow the user to explicitly specify parallelism.

Data Parallel Systems (SIMD in Flynn taxonomy)
• Programming model (Data Parallel)
– Similar operations performed in parallel on each element of data structure
– Logically single thread of control, performs sequential or parallel steps
– Conceptually, a processor is associated with each data element
• Architectural model
– Array of many simple processors each with little memory
• Processors don’t sequence through instructions
– Attached to a control processor that issues instructions
– Specialized and general communication, global synchronization
• Example machines:
– Thinking Machines CM-1, CM-2 (and CM-5)
– Maspar MP-1 and MP-2
All PEs are synchronized (same instruction or operation in a given cycle). PE = Processing Element
Other data parallel architectures: vector machines
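
A minimal sequential sketch of this model follows. The names `ProcessingElement`, `broadcast` and `scale_and_offset` are hypothetical. One control routine issues each operation and every PE applies it to its own local data. The Python loop only stands in for what an SIMD machine would do on all PEs in the same cycle.

```python
class ProcessingElement:
    """One simple PE with a tiny local memory (here a single value)."""
    def __init__(self, value):
        self.value = value

def broadcast(pes, operation):
    """Control processor: issue one operation that every PE applies to its own data."""
    for pe in pes:                      # conceptually simultaneous on real SIMD hardware
        pe.value = operation(pe.value)

def scale_and_offset(data, scale, offset):
    pes = [ProcessingElement(x) for x in data]   # one PE per data element
    broadcast(pes, lambda v: v * scale)          # step 1: same operation on all elements
    broadcast(pes, lambda v: v + offset)         # step 2: same operation on all elements
    return [pe.value for pe in pes]
```

For example, `scale_and_offset([1, 2, 3, 4], 2, 1)` applies two global steps to the whole array; the PEs never choose their own instruction.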

Dataflow Architectures
• Represent computation as a graph of essential data dependencies
– Non-Von Neumann architecture (not program counter (PC)-based architecture)
– Logical processor at each node, activated by availability of operands (i.e. data or results)
– Messages (tokens) carrying tag of next instruction sent to next processor
– Tag compared with others in matching store; match fires execution
[Figure: dependency graph for an entire computation (program), and the structure of one node: token matching, execution, and a token distribution network. Tokens = copies of computation results.]
Research dataflow machine prototypes include:
• The MIT Tagged Architecture
• The Manchester Dataflow Machine
The Tomasulo approach for dynamic instruction execution utilizes a dataflow-driven execution engine:
• The data dependency graph for a small window of instructions is constructed dynamically when instructions are issued in order of the program.
• The execution of an issued instruction is triggered by the availability of its operands (data it needs) over the CDB.
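
The firing rule (a node executes once all of its operands have arrived) can be sketched with a toy interpreter. This illustrates the idea only and does not model the MIT or Manchester machines. The graph encoding and the name `run_dataflow` are assumptions made for this example.

```python
import operator
from collections import deque

def run_dataflow(nodes, initial_tokens):
    """Toy dataflow interpreter.

    nodes:  name -> (operation, number_of_operands, [(successor, port), ...])
    tokens: (destination_node, operand_port, value)
    """
    matching_store = {}                 # node -> {port: value} for operands that arrived
    results = {}
    tokens = deque(initial_tokens)
    while tokens:
        dest, port, value = tokens.popleft()
        operands = matching_store.setdefault(dest, {})
        operands[port] = value
        op, arity, successors = nodes[dest]
        if len(operands) == arity:      # all operands present: the node fires
            out = op(*(operands[p] for p in range(arity)))
            del matching_store[dest]
            results[dest] = out
            for succ, succ_port in successors:
                tokens.append((succ, succ_port, out))   # result travels on as a token
    return results

# Graph for (a + b) * (a - b) with a = 5, b = 3.
GRAPH = {
    "add": (operator.add, 2, [("mul", 0)]),
    "sub": (operator.sub, 2, [("mul", 1)]),
    "mul": (operator.mul, 2, []),
}
TOKENS = [("add", 0, 5), ("add", 1, 3), ("sub", 0, 5), ("sub", 1, 3)]
RESULTS = run_dataflow(GRAPH, TOKENS)
```

There is no program counter anywhere in the interpreter: the order of execution is determined only by when operands become available.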

Speculative Tomasulo Processor
Speculative Execution + Tomasulo’s Algorithm = Speculative Tomasulo
The Tomasulo approach for dynamic instruction execution utilizes a dataflow-driven execution engine:
• The data dependency graph for a small window of instructions is constructed dynamically when instructions are issued in order of the program.
• The execution of an issued instruction is triggered by the availability of its operands (data it needs) over the CDB.
[Figure: speculative Tomasulo processor. Instruction Queue (IQ): instructions to issue in order. FIFO, usually implemented as a circular buffer, for commit or retirement in order (next to commit; store results).]
From 550 Lecture 6

Example of Flynn’s Taxonomy’s MISD (Multiple Instruction Streams Single Data Stream): Systolic Architectures
• Replace single processor with an array of regular processing elements
• Orchestrate data flow for high throughput with less memory access
[Figure: a single PE with memory M compared with an array of PEs. PE = Processing Element, M = Memory]
• Different from linear pipelining
– Nonlinear array structure, multidirection data flow, each PE may have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI Application-Specific Integrated Circuits (ASICs)
• Represent algorithms directly by chips connected in regular pattern
A possible example of MISD in Flynn’s Classification of Computer Architecture

Systolic Array Example: 3 x 3 Systolic Array Matrix Multiplication (C = A X B)
• Processors arranged in a 2-D grid
• Each processor accumulates one element of the product
[Figure, T = 0: rows of A and columns of B are staged at the edges of the array with alignments in time; successive rows and columns are each offset by one step.]
Rows of A: Row 0 = a0,0 a0,1 a0,2; Row 1 = a1,0 a1,1 a1,2; Row 2 = a2,0 a2,1 a2,2
Columns of B: Column 0 = b0,0 b1,0 b2,0; Column 1 = b0,1 b1,1 b2,1; Column 2 = b0,2 b1,2 b2,2
Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/

Systolic Array Example: 3 x 3 Systolic Array Matrix Multiplication
[Figure, T = 1: array state with alignments in time]
• C00 processor: a0,0*b0,0
• All other processors are still waiting for their first operands.

Systolic Array Example: 3 x 3 Systolic Array Matrix Multiplication
[Figure, T = 2: array state with alignments in time]
• C00 processor: a0,0*b0,0 + a0,1*b1,0
• C01 processor: a0,0*b0,1
• C10 processor: a1,0*b0,0

Systolic Array Example: 3 x 3 Systolic Array Matrix Multiplication
[Figure, T = 3: array state with alignments in time]
• C00 (done): a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0
• C01 processor: a0,0*b0,1 + a0,1*b1,1
• C02 processor: a0,0*b0,2
• C10 processor: a1,0*b0,0 + a1,1*b1,0
• C11 processor: a1,0*b0,1
• C20 processor: a2,0*b0,0

Systolic Array Example: 3 x 3 Systolic Array Matrix Multiplication
[Figure, T = 4: array state with alignments in time]
• Done earlier: C00
• C01 (done): a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1
• C10 (done): a1,0*b0,0 + a1,1*b1,0 + a1,2*b2,0
• C02 processor: a0,0*b0,2 + a0,1*b1,2
• C11 processor: a1,0*b0,1 + a1,1*b1,1
• C12 processor: a1,0*b0,2
• C20 processor: a2,0*b0,0 + a2,1*b1,0
• C21 processor: a2,0*b0,1

Systolic Array Example: 3 x 3 Systolic Array Matrix Multiplication
[Figure, T = 5: array state with alignments in time]
• Done earlier: C00, C01, C10
• C02 (done): a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2
• C11 (done): a1,0*b0,1 + a1,1*b1,1 + a1,2*b2,1
• C20 (done): a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0
• C12 processor: a1,0*b0,2 + a1,1*b1,2
• C21 processor: a2,0*b0,1 + a2,1*b1,1
• C22 processor: a2,0*b0,2

Systolic Array Example: 3 x 3 Systolic Array Matrix Multiplication
[Figure, T = 6: array state with alignments in time]
• Done earlier: C00, C01, C02, C10, C11, C20
• C12 (done): a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2
• C21 (done): a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1
• C22 processor: a2,0*b0,2 + a2,1*b1,2

Systolic Array Example: 3 x 3 Systolic Array Matrix Multiplication
[Figure, T = 7: done. All nine processors hold their finished element of the product.]
• C00 = a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0
• C01 = a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1
• C02 = a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2
• C10 = a1,0*b0,0 + a1,1*b1,0 + a1,2*b2,0
• C11 = a1,0*b0,1 + a1,1*b1,1 + a1,2*b2,1
• C12 = a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2
• C20 = a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0
• C21 = a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1
• C22 = a2,0*b0,2 + a2,1*b1,2 + a2,2*b2,2 (completed at T = 7)
On one processor: O(n^3), t = 27? Speedup = 27/7 ≈ 3.86
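
The frame-by-frame behaviour above can be reproduced with a short simulation. It is a sketch under the assumption, consistent with the frames, that A values move one processor to the right and B values one processor down on each step, with row i and column j delayed by i and j steps. The function name `systolic_matmul` and the register names are illustrative.

```python
def systolic_matmul(A, B):
    """Simulate an n x n systolic array computing C = A x B; returns (C, steps)."""
    n = len(A)
    a_reg = [[0] * n for _ in range(n)]   # A value currently held by each processor
    b_reg = [[0] * n for _ in range(n)]   # B value currently held by each processor
    acc = [[0] * n for _ in range(n)]     # each processor accumulates one element of C
    steps = 3 * n - 2                     # step at which the last processor finishes
    for t in range(1, steps + 1):
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:                # left edge: row i of A, delayed by i steps
                    k = t - 1 - i
                    new_a[i][j] = A[i][k] if 0 <= k < n else 0
                else:                     # otherwise take the left neighbour's old value
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:                # top edge: column j of B, delayed by j steps
                    k = t - 1 - j
                    new_b[i][j] = B[k][j] if 0 <= k < n else 0
                else:                     # otherwise take the upper neighbour's old value
                    new_b[i][j] = b_reg[i - 1][j]
                acc[i][j] += new_a[i][j] * new_b[i][j]
        a_reg, b_reg = new_a, new_b
    return acc, steps

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C, STEPS = systolic_matmul(A, B)
SPEEDUP = len(A) ** 3 / STEPS             # 27 multiply-accumulates on one processor
```

In this model the processor for element (i, j) finishes at step i + j + n, so the whole product takes 3n - 2 steps, which is 7 for n = 3 and matches the final frame.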