COMPUTER ARCHITECTURE
CS 6354: Multi-Cores
Samira Khan, University of Virginia
Feb 18, 2016
The content and concept of this course are adapted from CMU ECE 740

AGENDA
• Logistics
• Review from last lecture
• Issues in out-of-order execution
• Latency tolerance in OoO
• Multi-Core

LOGISTICS
• Course feedback form: Feb 18
• Project proposal: Feb 18, will have 4 late days
  – Use the LaTeX template
  – Follow the format of the sample proposals
  – Each group will prepare one document, but upload the file individually via Collab

REVIEW: MEMORY DEPENDENCE HANDLING
• When do you schedule a load instruction in an OOO engine?
  – Problem: A younger load can have its address ready before an older store's address is known
  – Known as the memory disambiguation problem or the unknown-address problem
• Approaches
  – Conservative: Stall the load until all previous stores have computed their addresses (or even retired from the machine)
  – Aggressive: Assume the load is independent of unknown-address stores and schedule it right away
  – Intelligent: Predict (with a more sophisticated predictor) whether the load depends on the/any unknown-address store
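A minimal C fragment (hypothetical, not from the slides) makes the ambiguity concrete: until the older store's address *p is computed, the hardware cannot tell whether the younger load a[i] reads the stored value.

```c
/* Illustrative only: an older store whose address arrives late,
 * followed by a younger load. The load's address (a + i) may be
 * ready first, but the hardware cannot prove p != &a[i] until the
 * store's address is known -- the memory disambiguation problem. */
int read_after_store(int *p, int *a, int i, int v)
{
    *p = v;        /* older store: address known only when p arrives */
    return a[i];   /* younger load: may or may not alias *p */
}
```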

REVIEW: MEMORY DISAMBIGUATION
• Option 1: Assume load dependent on all previous stores
  + No need for recovery
  -- Too conservative: delays independent loads unnecessarily
• Option 2: Assume load independent of all previous stores
  + Simple, and can be the common case: no delay for independent loads
  -- Requires recovery and re-execution of the load and its dependents on misprediction
• Option 3: Predict the dependence of a load on an outstanding store
  + More accurate; load-store dependencies persist over time
  -- Still requires recovery/re-execution on misprediction
  – Alpha 21264: Initially assume load independent; delay loads found to be dependent
  – Moshovos et al., "Dynamic Speculation and Synchronization of Data Dependences," ISCA 1997.
  – Chrysos and Emer, "Memory Dependence Prediction Using Store Sets," ISCA 1998.
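Option 3 is what the store-set work implements in hardware. Below is a minimal C sketch in the spirit of Chrysos and Emer's store sets; the table sizes, hash, and function names are illustrative assumptions, not the paper's exact design.

```c
#include <stdint.h>

#define SSIT_SIZE 1024     /* Store Set ID Table: PC hash -> store-set ID  */
#define LFST_SIZE 128      /* Last Fetched Store Table: SSID -> store tag  */
#define INVALID   0xFFFFu

static uint16_t ssit[SSIT_SIZE];
static uint16_t lfst[LFST_SIZE];

static uint16_t hash_pc(uint64_t pc) { return (pc >> 2) % SSIT_SIZE; }

/* Call once at reset. */
void predictor_init(void)
{
    for (int i = 0; i < SSIT_SIZE; i++) ssit[i] = INVALID;
    for (int i = 0; i < LFST_SIZE; i++) lfst[i] = INVALID;
}

/* Load renamed: returns the tag of the store it must wait for,
 * or INVALID if it may be scheduled speculatively. */
uint16_t load_fetched(uint64_t load_pc)
{
    uint16_t ssid = ssit[hash_pc(load_pc)];
    return (ssid == INVALID) ? INVALID : lfst[ssid % LFST_SIZE];
}

/* Store renamed: it becomes the last fetched store of its set. */
void store_fetched(uint64_t store_pc, uint16_t store_tag)
{
    uint16_t ssid = ssit[hash_pc(store_pc)];
    if (ssid != INVALID)
        lfst[ssid % LFST_SIZE] = store_tag;
}

/* Memory-order violation detected: put the offending load and store
 * into the same store set so the load waits next time. */
void violation(uint64_t load_pc, uint64_t store_pc)
{
    uint16_t ssid = hash_pc(store_pc) % LFST_SIZE;  /* simplistic SSID choice */
    ssit[hash_pc(load_pc)]  = ssid;
    ssit[hash_pc(store_pc)] = ssid;
}
```

The property the slide highlights (dependencies persist over time) is what makes a PC-indexed table worth its area: once a static load/store pair has conflicted, the same pair tends to keep conflicting.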

REVIEW: SCHEDULING OF LOAD DEPENDENTS
• Assume load will hit
  + No delay for dependents (load hit is the common case)
  -- Need to squash and re-schedule if the load actually misses
• Assume load will miss (i.e., schedule dependents when load data is ready)
  + No need to re-schedule (simpler logic)
  -- Significant delay for load dependents if the load hits
• Predict load hit/miss
  + No delay for dependents on accurate prediction
  -- Need to predict and re-schedule on misprediction
• Yoaz et al., "Speculation Techniques for Improving Load Related Instruction Scheduling," ISCA 1999.
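The third option needs only a small structure. A minimal sketch, assuming a table of 2-bit saturating counters indexed by load PC (Yoaz et al. explore more elaborate designs; the size and threshold here are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define HMP_SIZE 4096

static uint8_t ctr[HMP_SIZE];   /* 0..3; >= 2 means "predict hit" */

static unsigned idx(uint64_t pc) { return (pc >> 2) % HMP_SIZE; }

/* Consulted at schedule time: if true, wake up dependents assuming the
 * load will hit; if wrong, they are squashed and re-scheduled. */
bool predict_hit(uint64_t load_pc)
{
    return ctr[idx(load_pc)] >= 2;
}

/* Trained when the load's hit/miss outcome is known. */
void train(uint64_t load_pc, bool did_hit)
{
    uint8_t *c = &ctr[idx(load_pc)];
    if (did_hit) { if (*c < 3) (*c)++; }
    else         { if (*c > 0) (*c)--; }
}
```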

WHAT TO DO WITH DEPENDENTS ON A LOAD MISS? (I)
• A load miss can take hundreds of cycles
• If many instructions depend on a missed load, they can clog the scheduling window
• Independent instructions then cannot be allocated reservation stations, and scheduling can stall

SMALL WINDOWS: FULL-WINDOW STALLS
• When a long-latency instruction is not complete, it blocks retirement
• Incoming instructions fill the instruction window
• Once the window is full, the processor cannot place new instructions into the window
  – This is called a full-window stall
• A full-window stall prevents the processor from making progress in the execution of the program

SMALL WINDOWS: FULL-WINDOW STALLS
8-entry instruction window:

  Oldest ->  LOAD R1, mem[R5]     <- L2 miss! Takes 100s of cycles
             BEQ  R1, R0, target
             ADD  R2, 8
             LOAD R3, mem[R2]      |
             MUL  R4, R3           |  Independent of the L2 miss, executed
             ADD  R4, R5           |  out of program order, but cannot be retired
             STOR mem[R2], R4      |
             ADD  R2, 64
  ----------- window full ----------
             LOAD R3, mem[R2]     <- Younger instructions cannot be executed:
                                     there is no space in the instruction window

The processor stalls until the L2 miss is serviced.
• L2 cache misses are responsible for most full-window stalls.

IMPACT OF L2 CACHE MISSES
[Figure: L2 Misses]
512 KB L2 cache, 500-cycle DRAM latency, aggressive stream-based prefetcher
Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model

IMPACT OF L2 CACHE MISSES
[Figure: L2 Misses]
500-cycle DRAM latency, aggressive stream-based prefetcher
Data averaged over 147 memory-intensive benchmarks on a high-end x86 processor model

THE PROBLEM
• Out-of-order execution requires large instruction windows to tolerate today's main memory latencies
• As main memory latency increases, instruction window size should also increase to fully tolerate the memory latency
• Building a large instruction window is a challenging task if we would like to achieve
  – Low power/energy consumption (tag matching logic, ld/st buffers)
  – Short cycle time (access, wakeup/select latencies)
  – Low design and verification complexity

EFFICIENT SCALING OF INSTRUCTION WINDOW SIZE
• One of the major research issues in out-of-order execution
• How to achieve the benefits of a large window with a small one (or in a simpler way)?
  – Runahead execution?
    • Upon L2 miss, checkpoint architectural state, speculatively execute only for prefetching, re-execute when data is ready
  – Continual flow pipelines?
    • Upon L2 miss, deallocate everything belonging to an L2-miss dependent, reallocate/re-rename and re-execute upon data return
  – Dual-core execution?
    • One core runs ahead and does not stall on L2 misses; it feeds another core that commits instructions

RUNAHEAD EXECUTION (I)
• A technique to obtain the memory-level parallelism benefits of a large instruction window
• When the oldest instruction is a long-latency cache miss:
  – Checkpoint architectural state and enter runahead mode
• In runahead mode:
  – Speculatively pre-execute instructions
  – The purpose of pre-execution is to generate prefetches
  – L2-miss-dependent instructions are marked INV and dropped
• Runahead mode ends when the original miss returns
  – Checkpoint is restored and normal execution resumes
• Mutlu et al., "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors," HPCA 2003.
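A minimal C sketch of the control flow described above, in the spirit of Mutlu et al.; the instruction type, helper names, and INV bit here are illustrative assumptions that show the mode transitions, not a real simulator interface:

```c
#include <stdbool.h>

typedef struct {
    bool is_load, l2_miss;   /* does this instruction miss in L2?      */
    bool src_inv;            /* does any source operand carry INV?     */
    bool inv;                /* result marked invalid (bogus) if set   */
} inst_t;

static bool runahead_mode = false;

/* Oldest instruction blocks retirement with an L2 miss:
 * checkpoint and enter runahead instead of stalling. */
void retire_stage(const inst_t *oldest)
{
    if (!runahead_mode && oldest->is_load && oldest->l2_miss) {
        /* checkpoint_arch_state();  -- regs, rename map, PC */
        runahead_mode = true;
    }
}

/* In runahead mode, pre-execute speculatively. Results that depend on
 * the miss are marked INV and dropped; valid memory ops act as very
 * accurate prefetches into the caches. */
void execute_stage(inst_t *i)
{
    if (!runahead_mode) return;
    if ((i->is_load && i->l2_miss) || i->src_inv)
        i->inv = true;   /* dependents inherit INV through registers */
}

/* Original miss returns: restore the checkpoint and resume normally. */
void miss_returned(void)
{
    if (runahead_mode) {
        /* restore_checkpoint(); flush_pipeline(); refetch from PC */
        runahead_mode = false;
    }
}
```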

RUNAHEAD EXAMPLE

Perfect caches:
  Compute | Load 1 hit | Compute | Load 2 hit | Compute

Small window:
  Compute | Load 1 miss .. stall (Miss 1) .. | Compute | Load 2 miss .. stall (Miss 2) .. | Compute

Runahead:
  Compute | Load 1 miss -> runahead (Load 2 misses during runahead, so Miss 2 overlaps Miss 1) | Load 1 hit | Compute | Load 2 hit | Compute  => saved cycles

BENEFITS OF RUNAHEAD EXECUTION
Instead of stalling during an L2 cache miss:
• Pre-executed loads and stores independent of L2-miss instructions generate very accurate data prefetches:
  – For both regular and irregular access patterns
• Instructions on the predicted program path are prefetched into the instruction/trace cache and L2
• Hardware prefetcher and branch predictor tables are trained using future access information

RUNAHEAD EXECUTION (III)
• Advantages:
  + Very accurate prefetches for data/instructions (all cache levels)
  + Follows the program path
  + Simple to implement; most of the hardware is already built in
  + Uses the same thread context as the main thread, no waste of context
  + No need to construct a pre-execution thread
• Disadvantages/Limitations:
  -- Extra executed instructions
  -- Limited by branch prediction accuracy
  -- Cannot prefetch dependent cache misses
  -- Effectiveness limited by available "memory-level parallelism" (MLP)
  -- Prefetch distance limited by memory latency
• Implemented in IBM POWER6, Sun "Rock"

WHAT TO DO WITH DEPENDENTS ON A LOAD MISS? (II)
• Idea: Move miss-dependent instructions into a separate buffer (see the sketch below)
  – Example: Pentium 4's "scheduling loops"
  – Lebeck et al., "A Large, Fast Instruction Window for Tolerating Cache Misses," ISCA 2002.
• But the dependents still hold on to their physical registers
• Cannot scale the size of the register file indefinitely, since it is on the critical path
• Possible solution: Deallocate the physical registers of dependents
  – Difficult to re-allocate. See Srinivasan et al., "Continual Flow Pipelines," ASPLOS 2004.
• Can you think of any other solution?
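A minimal sketch of the separate-buffer idea, loosely in the spirit of Lebeck et al.'s waiting instruction buffer; the structure, sizes, and names are illustrative assumptions:

```c
#include <stdio.h>

#define WIB_SIZE 64

typedef struct { int tag; int miss_id; } uop_t;

static uop_t wib[WIB_SIZE];
static int   wib_count = 0;

/* Scheduler detects a uop whose source depends on missed load `miss_id`:
 * park it in the side buffer, freeing its scheduling-window slot for
 * independent instructions. */
int wib_insert(uop_t u)
{
    if (wib_count == WIB_SIZE) return -1;   /* buffer full: must stall */
    wib[wib_count++] = u;
    return 0;
}

/* Miss `miss_id` returned from memory: move its dependents back into
 * the scheduling window (here: just report them). */
void wib_wakeup(int miss_id)
{
    int kept = 0;
    for (int i = 0; i < wib_count; i++) {
        if (wib[i].miss_id == miss_id)
            printf("reinsert uop %d\n", wib[i].tag);
        else
            wib[kept++] = wib[i];
    }
    wib_count = kept;
}
```

Note what the sketch does not fix, which is the slide's main caveat: parked uops still hold their physical registers until they re-execute.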

GENERAL ORGANIZATION OF AN OOO PROCESSOR
• Smith and Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, Dec. 1995.

A MODERN OOO DESIGN: INTEL PENTIUM 4
• Boggs et al., "The Microarchitecture of the Pentium 4 Processor," Intel Technology Journal, 2001.

INTEL PENTIUM 4 SIMPLIFIED
• Mutlu+, "Runahead Execution," HPCA 2003.

ALPHA 21264
• Kessler, "The Alpha 21264 Microprocessor," IEEE Micro, March-April 1999.

MIPS R10000
• Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, April 1996.

IBM POWER4
• Tendler et al., "POWER4 System Microarchitecture," IBM Journal of Research and Development, 2002.

IBM POWER4
• 2 cores, out-of-order execution
• 100-entry instruction window in each core
• 8-wide instruction fetch, issue, execute
• Large, local+global hybrid branch predictor
• 1.5 MB, 8-way L2 cache
• Aggressive stream-based prefetching

IBM POWER5
• Kalla et al., "IBM POWER5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro, 2004.

RECOMMENDED READINGS
• Out-of-order execution processor designs
  – Kessler, "The Alpha 21264 Microprocessor," IEEE Micro, March-April 1999.
  – Boggs et al., "The Microarchitecture of the Pentium 4 Processor," Intel Technology Journal, 2001.
  – Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, April 1996.
  – Tendler et al., "POWER4 System Microarchitecture," IBM Journal of Research and Development, January 2002.

AND MORE READINGS…
• Stark et al., "On Pipelining Dynamic Instruction Scheduling Logic," MICRO 2000.
• Brown et al., "Select-Free Instruction Scheduling Logic," MICRO 2001.
• Palacharla et al., "Complexity-Effective Superscalar Processors," ISCA 1997.

MULTIPLE CORES ON CHIP
• Simpler and lower power than a single large core
• Large-scale parallelism on chip
  – Intel Core i7: 8 cores
  – AMD Barcelona: 4 cores
  – IBM Cell BE: 8+1 cores
  – IBM POWER7: 8 cores
  – Sun Niagara II: 8 cores
  – Nvidia Fermi: 448 "cores"
  – Intel SCC: 48 cores, networked
  – Tilera TILE Gx: 100 cores, networked

MOORE'S LAW
• Moore, "Cramming More Components onto Integrated Circuits," Electronics, 1965.


MULTI-CORE
• Idea: Put multiple processors on the same die
• Technology scaling (Moore's Law) enables more transistors to be placed on the same die area
• What else could you do with the die area you dedicate to multiple processors?
  – Have a bigger, more powerful core
  – Have larger caches in the memory hierarchy
  – Simultaneous multithreading
  – Integrate platform components on chip (e.g., network interface, memory controllers)

WHY MULTI-CORE?
• Alternative: Bigger, more powerful single core
  – Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc.
+ Improves single-thread performance transparently to programmer, compiler
- Very difficult to design (scalable algorithms for improving single-thread performance remain elusive)
- Power hungry: many out-of-order execution structures consume significant power/area when scaled. Why?
- Diminishing returns on performance
- Does not significantly help memory-bound application performance (scalable algorithms for this remain elusive)

LARGE SUPERSCALAR VS. MULTI-CORE
• Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.

MULTI-CORE VS. LARGE SUPERSCALAR
• Multi-core advantages
  + Simpler cores: more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
  + Higher system throughput on multiprogrammed workloads: reduced context switches
  + Higher system throughput in parallel applications
• Multi-core disadvantages
  - Requires parallel tasks/threads to improve performance (parallel programming)
  - Resource sharing can reduce single-thread performance
  - Shared hardware resources need to be managed
  - Number of pins limits data supply for increased demand

LARGE SUPERSCALAR VS. MULTI-CORE
• Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996.
• Technology push
  – Instruction issue queue size limits the cycle time of the superscalar OoO processor -> diminishing performance
    • Quadratic increase in complexity with issue width
  – Large, multi-ported register files to support large instruction windows and issue widths -> reduced frequency or longer RF access, diminishing performance
• Application pull
  – Integer applications: little parallelism?
  – FP applications: abundant loop-level parallelism
  – Others (transaction processing, multiprogramming): CMP a better fit

COMPARISON POINTS…

WHY MULTI-CORE?
• Alternative: Bigger caches
+ Improves single-thread performance transparently to programmer, compiler
+ Simple to design
- Diminishing single-thread performance returns from cache size. Why?
- Multiple levels complicate memory hierarchy

CACHE VS. CORE

WHY MULTI-CORE?
• Alternative: (Simultaneous) Multithreading
• Idea: Dispatch instructions from multiple threads in the same cycle, to keep multiple execution units utilized (a minimal dispatch sketch follows below)
  – Hirata et al., "An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads," ISCA 1992.
  – Yamamoto et al., "Performance Estimation of Multistreamed, Superscalar Processors," HICSS 1994.
  – Tullsen et al., "Simultaneous Multithreading: Maximizing On-Chip Parallelism," ISCA 1995.
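A minimal sketch of the dispatch idea, assuming a hypothetical 4-thread, 4-wide machine and a simple round-robin policy (real designs such as ICOUNT in Tullsen et al.'s follow-up work prioritize threads more cleverly):

```c
#include <stdio.h>

#define NTHREADS    4
#define ISSUE_WIDTH 4

/* Ready-to-issue instruction counts per thread (illustrative state). */
static int ready[NTHREADS] = {3, 0, 2, 1};

/* One cycle: fill up to ISSUE_WIDTH slots from any threads that have
 * ready instructions. The key SMT property is that a single cycle's
 * slots may mix instructions from different threads. */
int dispatch_cycle(void)
{
    int issued = 0, progress = 1;
    while (issued < ISSUE_WIDTH && progress) {
        progress = 0;
        for (int t = 0; t < NTHREADS && issued < ISSUE_WIDTH; t++) {
            if (ready[t] > 0) {
                ready[t]--;                 /* dispatch one uop from thread t */
                issued++;
                printf("slot %d <- thread %d\n", issued, t);
                progress = 1;
            }
        }
    }
    return issued;
}

int main(void)
{
    printf("dispatched %d instructions this cycle\n", dispatch_cycle());
    return 0;
}
```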

BASIC SUPERSCALAR OOO PIPELINE
Fetch -> Decode/Map -> Queue -> Reg Read -> Execute -> Dcache/Store Buffer -> Reg Write -> Retire
(Structures along the pipeline: PC, Icache, Register Map, Regs, Dcache)
• The pipeline is thread-blind

SMT PIPELINE
• Physical register file needs to become larger. Why?
Fetch -> Decode/Map -> Queue -> Reg Read -> Execute -> Dcache/Store Buffer -> Reg Write -> Retire
(Per-thread PCs and register maps; Icache, Dcache, and the physical register file are shared)

WHY MULTI-CORE?
• Alternative: (Simultaneous) Multithreading
+ Exploits thread-level parallelism (just like multi-core)
+ Good single-thread performance with SMT
+ No need to have an entire core for another thread
+ Parallel performance aided by tight sharing of caches
- Scalability is limited: many threads need bigger register files and larger issue width (and their associated costs), which becomes complex
- Parallel performance limited by shared fetch bandwidth
- Extensive resource sharing at the pipeline and memory system reduces both single-thread and parallel application performance

WHY MULTI-CORE?
• Alternative: Integrate platform components on chip instead
+ Speeds up many system functions (e.g., network interface cards, Ethernet controller, memory controller, I/O controller)
- Not all applications benefit (e.g., CPU-intensive code sections)

WHY MULTI-CORE?
• Other alternatives?
  – Dataflow?
  – Vector processors (SIMD)?
  – Integrating DRAM on chip?
  – Reconfigurable logic? (general purpose?)

COMPUTER ARCHITECTURE
CS 6354: Multi-Cores
Samira Khan, University of Virginia
Feb 18, 2016
The content and concept of this course are adapted from CMU ECE 740