CS 61C: Great Ideas in Computer Architecture


CS 61C: Great Ideas in Computer Architecture (Machine Structures)
SIMD I
Instructors: Randy H. Katz, David A. Patterson
http://inst.eecs.Berkeley.edu/~cs61c/sp11
11/3/2020 Spring 2011 -- Lecture #13


Review
• To access a cache, the memory address is divided into 3 fields: Tag, Index, Block Offset
• Cache size is Data + Management (tags, valid, dirty bits)
• Write misses are trickier to implement than reads
  – Write back vs. write through
  – Write allocate vs. no write allocate
• Cache performance equations:
  – CPU time = IC × CPI_stall × CC = IC × (CPI_ideal + Memory-stall cycles) × CC
  – AMAT = Time for a hit + Miss rate × Miss penalty
• If you understand caches, you can adapt software to improve cache performance and thus program performance
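The AMAT equation above can be evaluated directly. The following is a minimal sketch; the function name `amat` and the numbers in `print_amat_example` are illustrative assumptions, not values from the lecture.

```c
#include <stdio.h>

/* Average memory access time, in the same time unit as the inputs.
 * miss_rate is a fraction between 0 and 1. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

/* Print AMAT for a made-up configuration (assumed values):
 * 1-cycle hit, 5% miss rate, 100-cycle miss penalty. */
void print_amat_example(void)
{
    printf("AMAT = %.1f cycles\n", amat(1.0, 0.05, 100.0));
}
```

With these assumed inputs the miss term (0.05 × 100 = 5 cycles) dominates the 1-cycle hit time, which is why lowering the miss rate through cache-friendly software pays off.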

New-School Machine Structures (It’s a bit more complicated!)
Software:
• Parallel Requests: assigned to computer, e.g., Search “Katz”
• Parallel Threads: assigned to core, e.g., Lookup, Ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., Add of 4 pairs of words (A0+B0, A1+B1, A2+B2, A3+B3) (Today’s Lecture)
• Hardware descriptions: all gates @ one time
Hardware: Harness Parallelism & Achieve High Performance
[Figure: Warehouse Scale Computer and Smart Phone; Computer with Core(s), Memory (Cache), Input/Output, Main Memory; Core with Instruction Unit(s) and Functional Unit(s); Logic Gates]

Agenda
• Flynn Taxonomy
• Administrivia
• DLP and SIMD
• Technology Break
• Intel SSE
• (Amdahl’s Law if time permits)

Alternative Kinds of Parallelism: The Programming Viewpoint
• Job-level parallelism/process-level parallelism
  – Running independent programs on multiple processors simultaneously
  – Example?
• Parallel processing program
  – Single program that runs on multiple processors simultaneously
  – Example?

Alternative Kinds of Parallelism: Hardware vs. Software
• Concurrent software can also run on serial hardware
• Sequential software can also run on parallel hardware
• Focus is on parallel processing software: sequential or concurrent software running on parallel hardware

Alternative Kinds of Parallelism: Single Instruction/Single Data Stream
• Single Instruction, Single Data stream (SISD)
  – Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
[Figure: a single Processing Unit]

Alternative Kinds of Parallelism: Multiple Instruction/Single Data Stream
• Multiple Instruction, Single Data streams (MISD)
  – Computer that exploits multiple instruction streams against a single data stream for data operations that can be naturally parallelized. For example, certain kinds of array processors.
  – No longer commonly encountered, mainly of historical interest only

Alternative Kinds of Parallelism: Single Instruction/Multiple Data Stream
• Single Instruction, Multiple Data streams (SIMD)
  – Computer that exploits multiple data streams against a single instruction stream to perform operations that may be naturally parallelized, e.g., SIMD instruction extensions or Graphics Processing Unit (GPU)

Alternative Kinds of Parallelism: Multiple Instruction/Multiple Data Streams
• Multiple Instruction, Multiple Data streams (MIMD)
  – Multiple autonomous processors simultaneously executing different instructions on different data.
  – MIMD architectures include multicore and Warehouse Scale Computers
  – (Discuss after midterm)

Flynn Taxonomy
• In 2011, SIMD and MIMD are the most common parallel computers
• Most common parallel processing programming style: Single Program Multiple Data (“SPMD”)
  – Single program that runs on all processors of an MIMD
  – Cross-processor execution coordination through conditional expressions (thread parallelism after midterm)
• SIMD (aka hw-level data parallelism): specialized function units for handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video processing)

Data-Level Parallelism (DLP) (from 2nd lecture, January 20)
• 2 kinds of DLP
  – Lots of data in memory that can be operated on in parallel (e.g., adding together 2 arrays)
  – Lots of data on many disks that can be operated on in parallel (e.g., searching for documents)
• 2nd lecture (and 1st project) did DLP across 10s of servers and disks using MapReduce
• Today’s lecture (and 3rd project) does Data-Level Parallelism (DLP) in memory

SIMD Architectures
• Data parallelism: executing one operation on multiple data streams
• Example to provide context:
  – Multiplying a coefficient vector by a data vector (e.g., in filtering)
    y[i] := c[i] × x[i], 0 ≤ i < n
• Sources of performance improvement:
  – One instruction is fetched & decoded for the entire operation
  – Multiplications are known to be independent
  – Pipelining/concurrency in memory access as well
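As a rough illustration of the coefficient-vector example, the sketch below computes y[i] = c[i] × x[i] first with a scalar loop and then four elements at a time with SSE intrinsics. The function names are hypothetical; unaligned loads (`_mm_loadu_ps`) are used so the caller need not guarantee 16-byte alignment.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Scalar reference: one multiplication per loop iteration. */
void vec_mul_scalar(float *y, const float *c, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = c[i] * x[i];
}

/* SSE version: one mulps handles four independent multiplications.
 * A scalar loop finishes the last n % 4 elements. */
void vec_mul_sse(float *y, const float *c, const float *x, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vc = _mm_loadu_ps(c + i);   /* c[i..i+3] */
        __m128 vx = _mm_loadu_ps(x + i);   /* x[i..i+3] */
        _mm_storeu_ps(y + i, _mm_mul_ps(vc, vx));
    }
    for (; i < n; i++)
        y[i] = c[i] * x[i];
}
```

The SIMD loop fetches and decodes one multiply instruction per four elements, which is the first source of improvement listed above.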

“Advanced Digital Media Boost”
• To improve performance, Intel’s SIMD instructions
  – Fetch one instruction, do the work of multiple instructions
  – MMX (MultiMedia eXtension, Pentium II processor family)
  – SSE (Streaming SIMD Extension, Pentium III and beyond)

Example: SIMD Array Processing
Task:
    for each f in array
        f = sqrt(f)
Scalar version:
    for each f in array {
        load f to the floating-point register
        calculate the square root
        write the result from the register to memory
    }
SIMD version:
    for each 4 members in array {
        load 4 members to the SSE register
        calculate 4 square roots in one operation
        write the result from the register to memory
    }
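The SIMD pseudocode above might look as follows with SSE intrinsics. This is a sketch under assumptions: the function name `sqrt_array_sse` is hypothetical, unaligned loads are used, and leftover elements (when the length is not a multiple of 4) are handled one at a time with the scalar SSE square root.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_sqrt_ps, _mm_storeu_ps, _mm_sqrt_ss */

/* Replace each element of a[0..n-1] with its square root. */
void sqrt_array_sse(float *a, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(a + i);   /* load 4 members to an SSE register */
        v = _mm_sqrt_ps(v);               /* 4 square roots in one operation */
        _mm_storeu_ps(a + i, v);          /* write the result back to memory */
    }
    for (; i < n; i++)                    /* scalar tail, one element at a time */
        _mm_store_ss(a + i, _mm_sqrt_ss(_mm_load_ss(a + i)));
}
```

Calling `sqrt_array_sse(a, 6)` runs one packed iteration for the first four elements and two scalar iterations for the rest.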

Administrivia
• Lab #7 posted
• Midterm in 1 week:
  – Exam: Tu, Mar 8, 6-9 PM, 145/155 Dwinelle
    • Split: A-Lew in 145, Li-Z in 155
  – Covers everything through lecture March 3
  – Closed book, can bring one sheet of notes, both sides
  – Copy of Green card will be supplied
  – No phones, calculators, …; just bring pencils & eraser
  – TA Review: Su, Mar 6, 2-5 PM, 2050 VLSB
• Sent (anonymous) 61C midway survey before Midterm

Scores on Project 2 Part 2
• Top 25%: ≥ 79 / 85
• Next 50%: ≥ 60, < 79 / 85
[Figure: Score (Max 85) plotted against Fraction of Students]

• Inclusive: all welcome, it works!
  – 82%: reaffirms CS major, will finish degree
  – 30% ugrads, 40% grads, 30% …
• In beautiful San Francisco: tapiaconference.org/2011
• CDC is a joint org. of ACM, IEEE/CS, CRA
• If you care, come!
• 8 great speakers
• Workshops on Grad School Success, Early Career Success, Resume Preparation + BOFs
• Volunteer poster for student opportunities (work remote or import student)
• Luminaries: Deborah Estrin UCLA, Blaise Aguera y Arcas Microsoft, Alan Eustace Google, Bill Wulf UVA, Irving Wladawsky-Berger IBM, John Kubiatowicz UC Berkeley
• Rising Stars: Hicks Rice, Howard Georgia Tech, Lopez Intel
• General Chair: Dave Patterson
• Banquet and Dance
• San Francisco Activity: Alcatraz Tour, Chinatown, Bike over Golden Gate Bridge, …
• Encourage grad students to apply to the doctoral consortium
• If interested in diversity, by today (3/1) email Sheila Humphrys with name, year, topic interest + 2 to 3 sentences on why you want to go to Tapia: humphrys@EECS.Berkeley.EDU
• http://tapiaconference.org/2011/participate.html

[Photos: Tapia organizers, Tapia awardee, and speakers. Names shown: Lopez, Patterson, Taylor, Estrin, Hicks, Wladawsky, Kubiatowicz, Eustace, Lanius, Howard, Vargas, Aguera y Arcas, Wulf, Perez-Quinones]


SSE Instruction Categories for Multimedia Support
• SSE-2+ supports wider data types to allow 16 x 8-bit and 8 x 16-bit operands

Intel Architecture SSE2+ 128-Bit SIMD Data Types
[Figure: a 128-bit register divided into 16, 8, 4, or 2 equal fields]
• 16 / 128 bits: sixteen 8-bit fields
• 8 / 128 bits: eight 16-bit fields
• 4 / 128 bits: four 32-bit fields
• 2 / 128 bits: two 64-bit fields
• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
  – Single precision FP: Double word (32 bits)
  – Double precision FP: Quad word (64 bits)

XMM Registers
• Architecture extended with eight 128-bit data registers: XMM registers
  – In the 64-bit address architecture, eight more (XMM8 – XMM15) are available, for 16 128-bit registers in total
  – E.g., the 128-bit packed single-precision floating-point data type (doublewords) allows four single-precision operations to be performed simultaneously

SSE/SSE2 Floating Point Instructions
• xmm: one operand is a 128-bit SSE2 register
• mem/xmm: other operand is in memory or an SSE2 register
• {SS} Scalar Single precision FP: one 32-bit operand in a 128-bit register
• {PS} Packed Single precision FP: four 32-bit operands in a 128-bit register
• {SD} Scalar Double precision FP: one 64-bit operand in a 128-bit register
• {PD} Packed Double precision FP: two 64-bit operands in a 128-bit register
• {A} means the 128-bit operand is aligned in memory
• {U} means the 128-bit operand is unaligned in memory
• {H} means move the high half of the 128-bit operand
• {L} means move the low half of the 128-bit operand

Example: Add Two Single Precision FP Vectors
Computation to be performed:
    vec_res.x = v1.x + v2.x;
    vec_res.y = v1.y + v2.y;
    vec_res.z = v1.z + v2.z;
    vec_res.w = v1.w + v2.w;
SSE Instruction Sequence:
• movaps: move from mem to XMM register, memory aligned, packed single precision
• addps: add from mem to XMM register, packed single precision
• movaps: move from XMM register to mem, memory aligned, packed single precision
    movaps address-of-v1, %xmm0       // v1.w | v1.z | v1.y | v1.x -> xmm0
    addps  address-of-v2, %xmm0       // v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x -> xmm0
    movaps %xmm0, address-of-vec_res
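The same three-step sequence (aligned load, packed add, aligned store) can be written in C with intrinsics. This is a sketch, not the lecture's code: the function name `vec4_add` is hypothetical, and each vector is assumed to be a 16-byte-aligned array of four floats holding x, y, z, w in that order.

```c
#include <xmmintrin.h>  /* SSE: _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* res = v1 + v2, element by element.
 * All three pointers must be 16-byte aligned (required by the
 * aligned load/store intrinsics, which correspond to movaps). */
void vec4_add(float *res, const float *v1, const float *v2)
{
    __m128 a = _mm_load_ps(v1);            /* aligned load of v1 */
    __m128 b = _mm_load_ps(v2);            /* aligned load of v2 */
    _mm_store_ps(res, _mm_add_ps(a, b));   /* addps, then aligned store */
}
```

A caller on GCC or Clang could declare its operands as `float v1[4] __attribute__((aligned(16)))` to satisfy the alignment requirement; passing a misaligned pointer to `_mm_load_ps` typically faults at run time.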

Example: Image Converter
• Converts BMP (bitmap) image to a YUV (color space) image format:
  – Read individual pixels from the BMP image, convert pixels into YUV format
  – Can pack the pixels and operate on a set of pixels with a single instruction
• E.g., bitmap image consists of 8-bit monochrome pixels
  – Pack these pixel values in a 128-bit register (8 bits × 16 pixels), can operate on 16 values at a time
  – Significant performance boost
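To make "16 values at a time" concrete, here is a small sketch that brightens 8-bit monochrome pixels with a saturating add. It is a stand-in for illustration only, not the BMP-to-YUV conversion itself; the function name `brighten_sse` and the choice of operation are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_loadu_si128, _mm_adds_epu8, _mm_storeu_si128 */

/* Add delta to every pixel, clamping at 255 instead of wrapping around. */
void brighten_sse(uint8_t *pixels, size_t n, uint8_t delta)
{
    const __m128i vdelta = _mm_set1_epi8((char)delta);  /* delta in all 16 bytes */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(pixels + i));
        v = _mm_adds_epu8(v, vdelta);        /* 16 unsigned saturating adds */
        _mm_storeu_si128((__m128i *)(pixels + i), v);
    }
    for (; i < n; i++) {                     /* scalar tail */
        unsigned sum = (unsigned)pixels[i] + delta;
        pixels[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

Each iteration of the first loop processes 16 pixels with a single add instruction, compared with one pixel per iteration in the scalar tail.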

Example: Image Converter
• FMADDPS – Multiply and add packed single precision floating point instruction
• One of the typical operations computed in transformations (e.g., DFT or FFT)
    $P = \sum_{n=1}^{N} f(n) \times x(n)$

Example: Image Converter
Floating point numbers f(n) and x(n) in src1 and src2; p in dest.
C implementation for N = 4 (128 bits):
    for (int i = 0; i < 4; i++)
        p = p + src1[i] * src2[i];
Regular x86 instructions for the inner loop:
    // src1 is on the top of the stack; src1 * src2 -> src1
    fmul DWORD PTR _src2$[%esp+148]
    // p = ST(1), src1 = ST(0); ST(0)+ST(1) -> ST(1); ST = Stack Top
    faddp %ST(0), %ST(1)
(Note: destination on the right in x86 assembly)
Number of regular x86 floating-point instructions executed: 4 * 2 = 8

Example: Image Converter
Floating point numbers f(n) and x(n) in src1 and src2; p in dest.
C implementation for N = 4 (128 bits):
    for (int i = 0; i < 4; i++)
        p = p + src1[i] * src2[i];
• SSE2 instructions for the inner loop:
    // xmm0 = p, xmm1 = src1[i], xmm2 = src2[i]
    mulps %xmm1, %xmm2    // xmm2 * xmm1 -> xmm2
    addps %xmm2, %xmm0    // xmm0 + xmm2 -> xmm0
• Number of instructions executed: 2 SSE2 instructions vs. 8 x86
• SSE5 instruction accomplishes the same in one instruction:
    fmaddps %xmm0, %xmm1, %xmm2, %xmm0
    // xmm2 * xmm1 + xmm0 -> xmm0
    // multiply xmm1 x xmm2 packed single,
    // then add product packed single to sum in xmm0
• Number of instructions executed: 1 SSE5 instruction vs. 8 x86
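One detail the instruction counts gloss over: a packed multiply leaves four separate products in four lanes, so obtaining the single scalar p of the C loop needs a final sum across lanes. The sketch below shows one simple way to do that with SSE intrinsics; the function name `dot4_sse` is hypothetical, and the lane sum is done in plain C after a store rather than with shuffle instructions.

```c
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Returns src1[0]*src2[0] + ... + src1[3]*src2[3]. */
float dot4_sse(const float *src1, const float *src2)
{
    /* One mulps computes all four products at once. */
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(src1), _mm_loadu_ps(src2));

    /* Horizontal sum: spill the four lanes and add them in scalar code. */
    float lanes[4];
    _mm_storeu_ps(lanes, prod);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

For longer vectors, the usual pattern is to keep four running partial sums in one register with a packed add per iteration and perform the horizontal sum only once after the loop.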

Intel SSE Intrinsics
• Intrinsics are C functions and procedures that give access to assembly language instructions, including SSE instructions
  – With intrinsics, you can program using these instructions indirectly
  – Close to one-to-one correspondence between SSE instructions and intrinsics (a few intrinsics expand to a short instruction sequence)

Example SSE Intrinsics
Intrinsics:            Corresponding SSE instructions:
• Vector data type:
    __m128d
• Load and store operations:
    _mm_load_pd        MOVAPD/aligned, packed double
    _mm_store_pd       MOVAPD/aligned, packed double
    _mm_loadu_pd       MOVUPD/unaligned, packed double
    _mm_storeu_pd      MOVUPD/unaligned, packed double
• Load and broadcast across vector:
    _mm_load1_pd       MOVSD + shuffling/duplicating
• Arithmetic:
    _mm_add_pd         ADDPD/add, packed double
    _mm_mul_pd         MULPD/multiply, packed double
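A short sketch that exercises several of the intrinsics in the table: it computes y = alpha × x + y on two doubles. The function name `axpy2` is hypothetical, and the unaligned load/store variants are chosen so the caller does not need 16-byte-aligned arrays.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_load1_pd, _mm_loadu_pd, _mm_mul_pd, _mm_add_pd, _mm_storeu_pd */

/* y[0..1] = (*alpha) * x[0..1] + y[0..1] */
void axpy2(double *y, const double *alpha, const double *x)
{
    __m128d va = _mm_load1_pd(alpha);   /* [ alpha | alpha ]: load and broadcast */
    __m128d vx = _mm_loadu_pd(x);       /* [ x0 | x1 ], unaligned load */
    __m128d vy = _mm_loadu_pd(y);       /* [ y0 | y1 ], unaligned load */
    vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));   /* packed multiply, then packed add */
    _mm_storeu_pd(y, vy);               /* unaligned store */
}
```

The broadcast-then-multiply-add shape here is the same one the 2 x 2 matrix multiply example uses inside its loop.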

Example: 2 x 2 Matrix Multiply
Definition of Matrix Multiply:
    $C_{i,j} = (A \times B)_{i,j} = \sum_{k=1}^{2} A_{i,k} \times B_{k,j}$

    | A1,1  A1,2 |   | B1,1  B1,2 |   | C1,1 = A1,1 B1,1 + A1,2 B2,1    C1,2 = A1,1 B1,2 + A1,2 B2,2 |
    | A2,1  A2,2 | x | B2,1  B2,2 | = | C2,1 = A2,1 B1,1 + A2,2 B2,1    C2,2 = A2,1 B1,2 + A2,2 B2,2 |

Example: 2 x 2 Matrix Multiply
• Using the XMM registers
  – 64-bit/double precision/two doubles per XMM reg
    C1 = [ C1,1 | C2,1 ]
    C2 = [ C1,2 | C2,2 ]
    A  = [ A1,i | A2,i ]
    B1 = [ Bi,1 | Bi,1 ]
    B2 = [ Bi,2 | Bi,2 ]
• Stored in memory in column order


Example: 2 x 2 Matrix Multiply
• Initialization
    C1 = [ 0 | 0 ]
    C2 = [ 0 | 0 ]
• I = 1
    A  = [ A1,1 | A2,1 ]
    B1 = [ B1,1 | B1,1 ]
    B2 = [ B1,2 | B1,2 ]
• _mm_load_pd: loads 2 doubles into an XMM reg; stored in memory in column order
• _mm_load1_pd: SSE instruction that loads one double and stores it in both the high and low halves of the XMM register (duplicates value in both halves of XMM)

Example: 2 x 2 Matrix Multiply
• First iteration intermediate result
    C1 = [ 0 + A1,1 B1,1 | 0 + A2,1 B1,1 ]
    C2 = [ 0 + A1,1 B1,2 | 0 + A2,1 B1,2 ]
• I = 1
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
  SSE instructions first do parallel multiplies and then parallel adds in XMM registers
    A  = [ A1,1 | A2,1 ]
    B1 = [ B1,1 | B1,1 ]
    B2 = [ B1,2 | B1,2 ]
• _mm_load_pd: loads 2 doubles into an XMM reg; stored in memory in column order
• _mm_load1_pd: SSE instruction that loads one double and stores it in both the high and low halves of the XMM register (duplicates value in both halves of XMM)

Example: 2 x 2 Matrix Multiply
• First iteration intermediate result
    C1 = [ 0 + A1,1 B1,1 | 0 + A2,1 B1,1 ]
    C2 = [ 0 + A1,1 B1,2 | 0 + A2,1 B1,2 ]
• I = 2 (operands loaded for the second iteration)
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
  SSE instructions first do parallel multiplies and then parallel adds in XMM registers
    A  = [ A1,2 | A2,2 ]
    B1 = [ B2,1 | B2,1 ]
    B2 = [ B2,2 | B2,2 ]
• _mm_load_pd: loads 2 doubles into an XMM reg; stored in memory in column order
• _mm_load1_pd: SSE instruction that loads one double and stores it in both the high and low halves of the XMM register (duplicates value in both halves of XMM)

Example: 2 x 2 Matrix Multiply
• Second iteration intermediate result
    C1 = [ A1,1 B1,1 + A1,2 B2,1 | A2,1 B1,1 + A2,2 B2,1 ] = [ C1,1 | C2,1 ]
    C2 = [ A1,1 B1,2 + A1,2 B2,2 | A2,1 B1,2 + A2,2 B2,2 ] = [ C1,2 | C2,2 ]
• I = 2
    c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
    c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
  SSE instructions first do parallel multiplies and then parallel adds in XMM registers
    A  = [ A1,2 | A2,2 ]
    B1 = [ B2,1 | B2,1 ]
    B2 = [ B2,2 | B2,2 ]
• _mm_load_pd: loads 2 doubles into an XMM reg; stored in memory in column order
• _mm_load1_pd: SSE instruction that loads one double and stores it in both the high and low halves of the XMM register (duplicates value in both halves of XMM)

Live Example: 2 x 2 Matrix Multiply
Definition of Matrix Multiply:
    $C_{i,j} = (A \times B)_{i,j} = \sum_{k=1}^{2} A_{i,k} \times B_{k,j}$

    | A1,1  A1,2 |   | B1,1  B1,2 |   | C1,1 = A1,1 B1,1 + A1,2 B2,1    C1,2 = A1,1 B1,2 + A1,2 B2,2 |
    | A2,1  A2,2 | x | B2,1  B2,2 | = | C2,1 = A2,1 B1,1 + A2,2 B2,1    C2,2 = A2,1 B1,2 + A2,2 B2,2 |

    | 1  0 |   | 1  3 |   | C1,1 = 1*1 + 0*2 = 1    C1,2 = 1*3 + 0*4 = 3 |
    | 0  1 | x | 2  4 | = | C2,1 = 0*1 + 1*2 = 2    C2,2 = 0*3 + 1*4 = 4 |

Example: 2 x 2 Matrix Multiply (Part 1 of 2)
    #include <stdio.h>
    // header file for SSE compiler intrinsics
    #include <emmintrin.h>

    // NOTE: vector registers will be represented in comments as v1 = [ a | b ]
    // where v1 is a variable of type __m128d and a, b are doubles

    int main(void) {
        // allocate A, B, C aligned on 16-byte boundaries
        double A[4] __attribute__ ((aligned (16)));
        double B[4] __attribute__ ((aligned (16)));
        double C[4] __attribute__ ((aligned (16)));
        int lda = 2;
        int i = 0;
        // declare several 128-bit vector variables
        __m128d c1, c2, a, b1, b2;

        // Initialize A, B, C for example
        /* A =          (note column order!)
             1 0
             0 1
         */
        A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;

        /* B =          (note column order!)
             1 3
             2 4
         */
        B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;

        /* C =          (note column order!)
             0 0
             0 0
         */
        C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;

Example: 2 x 2 Matrix Multiply (Part 2 of 2)
        // use aligned loads to set
        // c1 = [c_11 | c_21]
        c1 = _mm_load_pd(C+0*lda);
        // c2 = [c_12 | c_22]
        c2 = _mm_load_pd(C+1*lda);

        for (i = 0; i < 2; i++) {
            /* a =
               i = 0: [a_11 | a_21]
               i = 1: [a_12 | a_22]
             */
            a = _mm_load_pd(A+i*lda);
            /* b1 =
               i = 0: [b_11 | b_11]
               i = 1: [b_21 | b_21]
             */
            b1 = _mm_load1_pd(B+i+0*lda);
            /* b2 =
               i = 0: [b_12 | b_12]
               i = 1: [b_22 | b_22]
             */
            b2 = _mm_load1_pd(B+i+1*lda);

            /* c1 =
               i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
               i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21]
             */
            c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
            /* c2 =
               i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
               i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22]
             */
            c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
        }

        // store c1, c2 back into C for completion
        _mm_store_pd(C+0*lda, c1);
        _mm_store_pd(C+1*lda, c2);

        // print C, one row per line (C is stored in column order)
        printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
        return 0;
    }
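When experimenting with the intrinsics version, a plain scalar routine is handy for cross-checking the result. The following is a sketch under the same column-order convention as the slides (element in row r, column c at index r + c*n); the function name `matmul_colmajor` is hypothetical.

```c
/* C += A * B for n x n matrices stored in column order:
 * the element in row r, column c lives at index r + c*n. */
void matmul_colmajor(int n, const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; j++)           /* column of C and B */
        for (int k = 0; k < n; k++)       /* summation index */
            for (int i = 0; i < n; i++)   /* row of C and A */
                C[i + j * n] += A[i + k * n] * B[k + j * n];
}
```

With n = 2, A the identity, B holding 1, 2, 3, 4 in column order, and C zeroed, the routine leaves C equal to B, matching the live example (C1,1 = 1, C2,1 = 2, C1,2 = 3, C2,2 = 4).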

Inner loop from gcc –O -S
    L2: movapd  (%rax,%rsi), %xmm1    // Load aligned A[i,i+1] -> m1
        movddup (%rdx), %xmm0         // Load B[j], duplicate -> m0
        mulpd   %xmm1, %xmm0          // Multiply m0*m1 -> m0
        addpd   %xmm0, %xmm3          // Add m0+m3 -> m3
        movddup 16(%rdx), %xmm0       // Load B[j+1], duplicate -> m0
        mulpd   %xmm0, %xmm1          // Multiply m0*m1 -> m1
        addpd   %xmm1, %xmm2          // Add m1+m2 -> m2
        addq    $16, %rax             // rax+16 -> rax (i+=2)
        addq    $8, %rdx              // rdx+8 -> rdx (j+=1)
        cmpq    $32, %rax             // rax == 32?
        jne     L2                    // jump to L2 if not equal
        movapd  %xmm3, (%rcx)         // store aligned m3 into C[k,k+1]
        movapd  %xmm2, (%rdi)         // store aligned m2 into C[l,l+1]

Performance-Driven ISA Extensions
• Subword parallelism, used primarily for multimedia applications
  – Intel MMX: multimedia extension
    • 64-bit registers can hold multiple integer operands
  – Intel SSE: Streaming SIMD extension
    • 128-bit registers can hold several floating-point operands
• Adding instructions that do more work per cycle
  – Shift-add: replace two instructions with one (e.g., multiply by 5)
  – Multiply-add: replace two instructions with one (x := c + a × b)
  – Multiply-accumulate: reduce round-off error (s := s + a × b)
  – Conditional copy: to avoid some branches (e.g., in if-then-else)

Big Idea: Amdahl’s (Heartbreaking) Law
• Speedup due to enhancement E is
    Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)
• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected
    Execution Time w/ E = Execution Time w/o E × [ (1 - F) + F/S ]
    Speedup w/ E = 1 / [ (1 - F) + F/S ]
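The speedup formula translates directly into code. A minimal sketch follows; the function names are hypothetical, and `amdahl_limit` adds the standard observation that as S grows without bound the speedup approaches 1 / (1 - F).

```c
/* Speedup w/ E = 1 / [ (1 - F) + F/S ]
 * f: fraction of the task that is accelerated (0 <= f < 1)
 * s: factor by which that fraction is accelerated (s > 0) */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound on speedup when the accelerated fraction takes no time at all. */
double amdahl_limit(double f)
{
    return 1.0 / (1.0 - f);
}
```

For instance, `amdahl_speedup(0.5, 2.0)` evaluates 1 / (0.5 + 0.25), and `amdahl_limit(0.5)` shows that no enhancement of half the program can ever exceed a speedup of 2.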


Big Idea: Amdahl’s Law
    Speedup = 1 / [ (1 - F) + F/S ]
  – (1 - F): non-speed-up part
  – F/S: speed-up part
Example: the execution time of half of the program can be accelerated by a factor of 2. What is the program speed-up overall?
    Speedup = 1 / (0.5 + 0.5/2) = 1 / (0.5 + 0.25) = 1.33

Big Idea: Amdahl’s Law
• If the portion of the program that can be parallelized is small, then the speedup is limited
• The non-parallel portion limits the performance

Example #1: Amdahl’s Law
    Speedup w/ E = 1 / [ (1 - F) + F/S ]
• Consider an enhancement which runs 20 times faster but which is only usable 25% of the time
    Speedup w/ E = 1 / (.75 + .25/20) = 1.31
• What if it’s usable only 15% of the time?
    Speedup w/ E = 1 / (.85 + .15/20) = 1.17
• Amdahl’s Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar!
• To get a speedup of 90 from 100 processors, the percentage of the original program that could be scalar would have to be 0.1% or less
    Speedup w/ E = 1 / (.001 + .999/100) = 90.99

Example #2: Amdahl’s Law
    Speedup w/ E = 1 / [ (1 - F) + F/S ]
• Consider summing 10 scalar variables and two 10 by 10 matrices (matrix sum) on 10 processors
    Speedup w/ E = 1 / (.091 + .909/10) = 1 / 0.1819 = 5.5
• What if there are 100 processors?
    Speedup w/ E = 1 / (.091 + .909/100) = 1 / 0.10009 = 10.0
• What if the matrices are 100 by 100 (or 10,010 adds in total) on 10 processors?
    Speedup w/ E = 1 / (.001 + .999/10) = 1 / 0.1009 = 9.9
• What if there are 100 processors?
    Speedup w/ E = 1 / (.001 + .999/100) = 1 / 0.01099 = 91
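The fractions in this example come from counting additions: the 10 scalar adds are serial, and the matrix adds (100 or 10,000) are spread across the processors. The sketch below derives F from those counts and applies the formula; the function name `speedup_from_counts` is hypothetical.

```c
/* Amdahl speedup when serial_ops operations cannot be parallelized and
 * parallel_ops operations are divided evenly among `processors`. */
double speedup_from_counts(double serial_ops, double parallel_ops, double processors)
{
    double total = serial_ops + parallel_ops;
    double f = parallel_ops / total;          /* parallelizable fraction F */
    return 1.0 / ((1.0 - f) + f / processors);
}
```

Calling it with (10, 100, 10), (10, 100, 100), (10, 10000, 10), and (10, 10000, 100) reproduces the four cases on the slide. Growing the matrices raises F from about 0.909 to about 0.999, which is what lets 100 processors become worthwhile.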

Strong and Weak Scaling
• Getting good speedup on a multiprocessor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem
  – Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
  – Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor doing the same amount of work
  – Just 1 unit with twice the load of others cuts speedup almost in half
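The load-balancing claim can be checked with a simple model. The sketch below assumes perfectly parallel work with no communication overhead, so parallel time equals the largest per-processor load and speedup is total work divided by that maximum; the function name `speedup_with_loads` is hypothetical.

```c
/* Speedup of p processors over one, given each processor's share of work.
 * Assumes p > 0, all loads positive, and no overhead:
 * serial time = sum of loads, parallel time = largest load. */
double speedup_with_loads(const double *load, int p)
{
    double total = 0.0;
    double max = 0.0;
    for (int i = 0; i < p; i++) {
        total += load[i];
        if (load[i] > max)
            max = load[i];
    }
    return total / max;
}
```

Under this model, p evenly loaded processors give a speedup of p, while doubling the load on just one of them gives (p + 1) / 2: for example 2.5 instead of 4 with four processors, and 50.5 instead of 100 with a hundred.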

Review
• Flynn Taxonomy of Parallel Architectures
  – SIMD: Single Instruction Multiple Data
  – MIMD: Multiple Instruction Multiple Data
  – SISD: Single Instruction Single Data
  – (unused) MISD: Multiple Instruction Single Data
• Intel SSE SIMD Instructions
  – One instruction fetch that operates on multiple operands simultaneously
  – 128/64-bit XMM registers
• SSE Instructions in C
  – Embed the SSE machine instructions directly into C programs through use of intrinsics
  – Achieve efficiency beyond that of an optimizing compiler