Computer Architecture SIMD and GPUs Part III and
- Slides: 62
Computer Architecture: SIMD and GPUs (Part III) (and briefly VLIW, DAE, Systolic Arrays) Prof. Onur Mutlu Carnegie Mellon University
A Note on This Lecture n n These slides are partly from 18 -447 Spring 2013, Computer Architecture, Lecture 20: GPUs, VLIW, DAE, Systolic Arrays Video of the part related to only SIMD and GPUs: q http: //www. youtube. com/watch? v=vr 5 hb. Skb 1 Eg&list=PL 5 PHm 2 j kk. Xmid. JOd 59 REog 9 j. Dn. PDTG 6 IJ&index=20 2
Last Lecture n n SIMD Processing GPU Fundamentals 3
Today n Wrap up GPUs VLIW n If time permits n q q q Decoupled Access Execute Systolic Arrays Static Scheduling 4
Approaches to (Instruction-Level) Concurrency n Pipelined execution n n n Out-of-order execution Dataflow (at the ISA level) SIMD Processing VLIW Systolic Arrays Decoupled Access Execute 5
Graphics Processing Units SIMD not Exposed to Programmer (SIMT)
Review: High-Level View of a GPU 7
Review: Concept of “Thread Warps” and SIMT n Warp: A set of threads that execute the same instruction n (on different data elements) SIMT (Nvidia-speak) All threads run the same kernel n Warp: The threads that run lengthwise in a woven fabric … Thread Warp Common PC Scalar Thread W X Y Scalar Thread Z Thread Warp 3 Thread Warp 8 Thread Warp 7 SIMD Pipeline 8
Review: Loop Iterations as Threads for (i=0; i < N; i++) C[i] = A[i] + B[i]; Vectorized Code Scalar Sequential Code load Time Iter. 1 add store load Iter. 2 load Iter. 1 load add store Iter. 2 Vector Instruction add store Slide credit: Krste Asanovic 9
Review: SIMT Memory Access n Same instruction in different threads uses thread id to index and access different data elements Let’s assume N=16, block. Dim=4 4 blocks + 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 + Slide credit: Hyesoon Kim + + +
Review: Sample GPU SIMT Code (Simplified) CPU code for (ii = 0; ii < 100; ++ii) { C[ii] = A[ii] + B[ii]; } CUDA code // there are 100 threads __global__ void Kernel. Function(…) { int tid = block. Dim. x * block. Idx. x + thread. Idx. x; int var. A = aa[tid]; int var. B = bb[tid]; C[tid] = var. A + var. B; } Slide credit: Hyesoon Kim
Review: Sample GPU Program (Less Simplified) Slide credit: Hyesoon Kim 12
Review: Latency Hiding with “Thread Warps” n Warp: A set of threads that execute the same instruction (on different data elements) n Fine-grained multithreading Thread Warp 7 RF ALU q ALU n SIMD Pipeline Decode RF n Warps available for scheduling I-Fetch q RF One instruction per thread in pipeline at a time (No branch prediction) q Interleave warp execution to hide latencies Register values of all threads stay in register file No OS context switching Memory latency hiding Thread Warp 3 Thread Warp 8 D-Cache All Hit? Data Writeback Warps accessing memory hierarchy Miss? Thread Warp 1 Thread Warp 2 Thread Warp 6 Graphics has millions of pixels Slide credit: Tor Aamodt 13
Review: Warp-based SIMD vs. Traditional SIMD n Traditional SIMD contains a single thread q q q n Lock step Programming model is SIMD (no threads) SW needs to know vector length ISA contains vector/SIMD instructions Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i. e. , same instruction executed by all threads) q q Does not have to be lock step Each thread can be treated individually (i. e. , placed in a different warp) programming model not SIMD n SW does not need to know vector length n Enables memory and branch latency tolerance ISA is scalar vector instructions formed dynamically Essentially, it is SPMD programming model implemented on SIMD hardware 14
Review: SPMD n Single procedure/program, multiple data q n Each processing element executes the same procedure, except on different data elements q n This is a programming model rather than computer organization Procedures can synchronize at certain points in program, e. g. barriers Essentially, multiple instruction streams execute the same program q q q Each program/procedure can 1) execute a different control-flow path, 2) work on different data, at run-time Many scientific applications programmed this way and run on MIMD computers (multiprocessors) Modern GPUs programmed in a similar way on a SIMD computer 15
Branch Divergence Problem in Warp-based SIMD n SPMD Execution on SIMD Hardware q NVIDIA calls this “Single Instruction, Multiple Thread” (“SIMT”) execution A Thread Warp B C D F Common PC Thread 1 2 3 4 E G Slide credit: Tor Aamodt 16
Control Flow Problem in GPUs/SIMD n GPU uses SIMD pipeline to save area on control logic. q n Group scalar threads into warps Branch divergence occurs when threads inside warps branch to different execution paths. Slide credit: Tor Aamodt Branch Path A Path B 17
Branch Divergence Handling (I) Stack A/1111 A Reconv. PC B/1111 B C/1001 C E E TOS TOS D/0110 D F Active Mask 1111 0110 1001 Common PC Thread 1 2 3 4 G/1111 G A A B G E D C E Thread Warp E/1111 E Next PC B C D E G A Time Slide credit: Tor Aamodt 18
Branch Divergence Handling (II) A; if (some condition) { B; } else { C; } D; A One per warp TOS Control Flow Stack Next PC Recv PC Amask D A -1111 B D 1110 C D D 0001 Execution Sequence B C D Slide credit: Tor Aamodt A 1 1 C 0 0 0 1 B 1 1 1 0 D 1 1 Time 19
Dynamic Warp Formation n n Idea: Dynamically merge threads executing the same instruction (after branch divergence) Form new warp at divergence q Enough threads branching to each path to create full new warps 20
Dynamic Warp Formation/Merging n Idea: Dynamically merge threads executing the same instruction (after branch divergence) Branch Path A Path B n Fung et al. , “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, ” MICRO 2007. 21
Dynamic Warp Formation Example A x/1111 y/1111 A x/1110 y/0011 B x/1000 Execution of Warp x at Basic Block A x/0110 C y/0010 D y/0001 F E Legend A x/0001 y/1100 Execution of Warp y at Basic Block A D A new warp created from scalar threads of both Warp x and y executing at Basic Block D x/1110 y/0011 x/1111 G y/1111 A A B B C C D D E E F F G G A A Baseline Time Dynamic Warp Formation A A B B C D E E F G G A A Time Slide credit: Tor Aamodt 22
What About Memory Divergence? n n Modern GPUs have caches Ideally: Want all threads in the warp to hit (without conflicting with each other) Problem: One thread in a warp can stall the entire warp if it misses in the cache. Need techniques to q q Tolerate memory divergence Integrate solutions to branch and memory divergence 23
NVIDIA Ge. Force GTX 285 n n NVIDIA-speak: q 240 stream processors q “SIMT execution” Generic speak: q 30 cores q 8 SIMD functional units per core Slide credit: Kayvon Fatahalian 24
NVIDIA Ge. Force GTX 285 “core” 64 KB of storage for fragment contexts (registers) … = SIMD functional unit, control shared across 8 units = multiply-add = multiply Slide credit: Kayvon Fatahalian = instruction stream decode = execution context storage 25
NVIDIA Ge. Force GTX 285 “core” 64 KB of storage for thread contexts (registers) … n n n Groups of 32 threads share instruction stream (each group is a Warp) Up to 32 warps are simultaneously interleaved Up to 1024 thread contexts can be stored Slide credit: Kayvon Fatahalian 26
NVIDIA Ge. Force GTX 285 Tex … … … … … … … Tex … … Tex … 30 cores on the GTX 285: 30, 720 threads Slide credit: Kayvon Fatahalian 27
VLIW and DAE
Remember: SIMD/MIMD Classification of Computers n n n Mike Flynn, “Very High Speed Computing Systems, ” Proc. of the IEEE, 1966 SISD: Single instruction operates on single data element SIMD: Single instruction operates on multiple data elements q q n MISD? Multiple instructions operate on single data element q n Array processor Vector processor Closest form: systolic array processor? MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams) q q Multiprocessor Multithreaded processor 29
SISD Parallelism Extraction Techniques n We have already seen q q n Superscalar execution Out-of-order execution Are there simpler ways of extracting SISD parallelism? q q VLIW (Very Long Instruction Word) Decoupled Access/Execute 30
VLIW
VLIW (Very Long Instruction Word) n A very long instruction word consists of multiple independent instructions packed together by the compiler q n n Packed instructions can be logically unrelated (contrast with SIMD) Idea: Compiler finds independent instructions and statically schedules (i. e. packs/bundles) them into a single VLIW instruction Traditional Characteristics q q q Multiple functional units Each instruction in a bundle executed in lock step Instructions in a bundle statically aligned to be directly fed into the functional units 32
VLIW Concept n Fisher, “Very Long Instruction Word architectures and the ELI-512, ” ISCA 1983. q ELI: Enormously longword instructions (512 bits) 33
SIMD Array Processing vs. VLIW n Array processor 34
VLIW Philosophy n Philosophy similar to RISC (simple instructions and hardware) q n Except multiple instructions in parallel RISC (John Cocke, 1970 s, IBM 801 minicomputer) q Compiler does the hard work to translate high-level language code to simple instructions (John Cocke: control signals) n q n And, to reorder simple instructions for high performance Hardware does little translation/decoding very simple VLIW (Fisher, ISCA 1983) q q Compiler does the hard work to find instruction level parallelism Hardware stays as simple and streamlined as possible n n Executes each instruction in a bundle in lock step Simple higher frequency, easier to design 35
VLIW Philosophy (II) Fisher, “Very Long Instruction Word architectures and the ELI-512, ” ISCA 1983. 36
Commercial VLIW Machines n n Multiflow TRACE, Josh Fisher (7 -wide, 28 -wide) Cydrome Cydra 5, Bob Rau Transmeta Crusoe: x 86 binary-translated into internal VLIW TI C 6000, Trimedia, STMicro (DSP & embedded processors) q n Most successful commercially Intel IA-64 q q Not fully VLIW, but based on VLIW principles EPIC (Explicitly Parallel Instruction Computing) Instruction bundles can have dependent instructions A few bits in the instruction format specify explicitly which instructions in the bundle are dependent on which other ones 37
VLIW Tradeoffs n Advantages + No need for dynamic scheduling hardware simple hardware + No need for dependency checking within a VLIW instruction simple hardware for multiple instruction issue + no renaming + No need for instruction alignment/distribution after fetch to different functional units simple hardware n Disadvantages -- Compiler needs to find N independent operations -- If it cannot, inserts NOPs in a VLIW instruction -- Parallelism loss AND code size increase -- Recompilation required when execution width (N), instruction latencies, functional units change (Unlike superscalar processing) -- Lockstep execution causes independent operations to stall -- No instruction can progress until the longest-latency instruction completes 38
VLIW Summary n n VLIW simplifies hardware, but requires complex compiler techniques Solely-compiler approach of VLIW has several downsides that reduce performance -- Too many NOPs (not enough parallelism discovered) -- Static schedule intimately tied to microarchitecture -- Code optimized for one generation performs poorly for next -- No tolerance for variable or long-latency operations (lock step) ++ Most compiler optimizations developed for VLIW employed in optimizing compilers (for superscalar compilation) q Enable code optimizations ++ VLIW successful in embedded markets, e. g. DSP 39
DAE
Decoupled Access/Execute n Motivation: Tomasulo’s algorithm too complex to implement q 1980 s before HPS, Pentium Pro Idea: Decouple operand access and execution via two separate instruction streams that communicate via ISA-visible queues. n Smith, “Decoupled Access/Execute Computer Architectures, ” ISCA 1982, ACM TOCS 1984. n 41
Decoupled Access/Execute (II) n Compiler generates two instruction streams (A and E) q Synchronizes the two upon control flow instructions (using branch queues) 42
Decoupled Access/Execute (III) n Advantages: + Execute stream can run ahead of the access stream and vice versa + If A takes a cache miss, E can perform useful work + If A hits in cache, it supplies data to lagging E + Queues reduce the number of required registers + Limited out-of-order execution without wakeup/select complexity n Disadvantages: -- Compiler support to partition the program and manage queues -- Determines the amount of decoupling -- Branch instructions require synchronization between A and E -- Multiple instruction streams (can be done with a single one, though) 43
Astronautics ZS-1 n n Single stream steered into A and X pipelines Each pipeline inorder Smith et al. , “The ZS-1 central processor, ” ASPLOS 1987. Smith, “Dynamic Instruction Scheduling and the Astronautics ZS-1, ” IEEE Computer 1989. 44
Astronautics ZS-1 Instruction Scheduling n Dynamic scheduling q q q A and X streams are issued/executed independently Loads can bypass stores in the memory unit (if no conflict) Branches executed early in the pipeline n n n To reduce synchronization penalty of A/X streams Works only if the register a branch sources is available Static scheduling q Move compare instructions as early as possible before a branch n q q So that branch source register is available when branch is decoded Reorder code to expose parallelism in each stream Loop unrolling: n Reduces branch count + exposes code reordering opportunities 45
Loop Unrolling n Idea: Replicate loop body multiple times within an iteration + Reduces loop maintenance overhead q Induction variable increment or loop condition test + Enlarges basic block (and analysis scope) q Enables code optimization and scheduling opportunities -- What if iteration count not a multiple of unroll factor? (need extra code to detect this) -- Increases code size 46
Systolic Arrays 47
Why Systolic Architectures? n n Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory Similar to an assembly line q q q n Different people work on the same car Many cars are assembled simultaneously Can be two-dimensional Why? Special purpose accelerators/architectures need q q q Simple, regular designs (keep # unique parts small and regular) High concurrency high performance Balanced computation and I/O (memory access) 48
Systolic Architectures n H. T. Kung, “Why Systolic Architectures? , ” IEEE Computer 1982. Memory: heart PEs: cells Memory pulses data through cells 49
Systolic Architectures n n Basic principle: Replace a single PE with a regular array of PEs and carefully orchestrate flow of data between the PEs achieve high throughput w/o increasing memory bandwidth requirements Differences from pipelining: q q q Array structure can be non-linear and multi-dimensional PE connections can be multidirectional (and different speed) PEs can have local memory and execute kernels (rather than a piece of the instruction) 50
Systolic Computation Example n Convolution q q Used in filtering, pattern matching, correlation, polynomial evaluation, etc … Many image processing tasks 51
Systolic Computation Example: Convolution n y 1 = w 1 x 1 + w 2 x 2 + w 3 x 3 y 2 = w 1 x 2 + w 2 x 3 + w 3 x 4 y 3 = w 1 x 3 + w 2 x 4 + w 3 x 5 52
Systolic Computation Example: Convolution n Worthwhile to implement adder and multiplier separately to allow overlapping of add/mul executions 53
More Programmability n Each PE in a systolic array q q q n Can store multiple “weights” Weights can be selected on the fly Eases implementation of, e. g. , adaptive filtering Taken further q q q Each PE can have its own data and instruction memory Data memory to store partial/temporary results, constants Leads to stream processing, pipeline parallelism n More generally, staged execution 54
Pipeline Parallelism 55
File Compression Example 56
Systolic Array n Advantages q q q n Makes multiple uses of each data item reduced need for fetching/refetching High concurrency Regular design (both data and control flow) Disadvantages q q Not good at exploiting irregular parallelism Relatively special purpose need software, programmer support to be a general purpose model 57
The WARP Computer n n n n HT Kung, CMU, 1984 -1988 Linear array of 10 cells, each cell a 10 Mflop programmable processor Attached to a general purpose host machine HLL and optimizing compiler to program the systolic array Used extensively to accelerate vision and robotics tasks Annaratone et al. , “Warp Architecture and Implementation, ” ISCA 1986. Annaratone et al. , “The Warp Computer: Architecture, Implementation, and Performance, ” IEEE TC 1987. 58
The WARP Computer 59
The WARP Computer 60
Systolic Arrays vs. SIMD n Food for thought… 61
Some More Recommended Readings n Recommended: q q q Fisher, “Very Long Instruction Word architectures and the ELI 512, ” ISCA 1983. Huck et al. , “Introducing the IA-64 Architecture, ” IEEE Micro 2000. Russell, “The CRAY-1 computer system, ” CACM 1978. Rau and Fisher, “Instruction-level parallel processing: history, overview, and perspective, ” Journal of Supercomputing, 1993. Faraboschi et al. , “Instruction Scheduling for Instruction Level Parallel Processors, ” Proc. IEEE, Nov. 2001. 62
- Simd in computer architecture
- Simd architecture ppt
- Analyzing and leveraging decoupled l1 caches in gpus
- Gpu sql database
- Understanding the efficiency of ray traversal on gpus
- Fast bvh construction on gpus
- Hamlet act iii scene iii
- Organization and architecture difference
- Bus architecture in computer architecture
- Sse simd
- Simd units
- Simd extensions
- Intel simd instruction set
- Simd postcode lookup 2021
- Associative array processing in parallel computing
- Multiple instruction single data example
- Systolic array vs simd
- Simd vs gpu
- Sisd in computer architecture
- Nprg054
- Register in computer organization
- Va professional standards board
- Part whole model subtraction
- Part to part ratio definition
- Brainpop ratios
- Technical descriptions
- It is the heart of the entire beverage operation
- The phase of the moon you see depends on ______.
- Part to part variation
- Call and return architecture in software engineering
- Computer organization and architecture 10th solution
- Virtual lab computer organization
- Introduction to computer organization and architecture
- Timing and control in computer architecture
- Evolution and interpretation of computer architecture
- Digital design and computer architecture: arm edition
- Computer organization & architecture: themes and variations
- Computer organization and architecture 10th edition
- Linear pipeline and non linear pipeline
- Digital design and computer architecture
- What is mux in computer architecture
- Digital design and computer architecture
- Digital design and computer architecture
- Assembly language computer architecture
- Computer arithmetic
- Computer organisation and architecture
- Hazard detection and resolution in computer architecture
- Pipelined datapath
- 1s complement
- Cs341 umb
- Digital design and computer architecture
- Digital logic and computer architecture
- Verb 99 noun 62
- Digital design and computer architecture: arm edition
- Abc in software architecture
- Modular architecture vs integrated architecture
- Integral and modular architecture
- Computer part of your life grade 11
- Computer part 3
- Prywer
- Computer components
- Computer parts images
- Computer aided part programming