CPE 631: Multithreading: Thread-Level Parallelism Within a Processor

CPE 631: Multithreading: Thread-Level Parallelism Within a Processor
Electrical and Computer Engineering
University of Alabama in Huntsville
Aleksandar Milenkovic
milenka@ece.uah.edu
http://www.ece.uah.edu/~milenka

Outline
• Trends in microarchitecture
• Exploiting thread-level parallelism
• Exploiting TLP within a processor
• Resource sharing
• Performance implications
• Design challenges
• Intel’s HT technology

Trends in microarchitecture
• Higher clock speeds
  • To achieve a high clock frequency, make the pipeline deeper (superpipelining)
  • Events that disrupt the pipeline (branch mispredictions, cache misses, etc.) become very expensive in terms of lost clock cycles
• ILP: Instruction-Level Parallelism
  • Extract parallelism in a single program
  • Superscalar processors have multiple execution units working in parallel
  • Challenge: to find enough instructions that can be executed concurrently
  • Out-of-order execution => instructions are sent to execution units based on instruction dependencies rather than program order (see the example below)
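To make the dependence argument concrete, here is a small C example (mine, not from the slides): the first loop is one serial dependence chain, so an out-of-order core cannot overlap its additions, while the second exposes four independent chains that a superscalar core can execute in parallel.

```c
/* Illustrative only: two loops with different amounts of ILP. */
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];
    for (int i = 0; i < N; i++) a[i] = i;

    /* Low ILP: each add waits for the previous result. */
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i];

    /* Higher ILP: four independent dependence chains the
       hardware can execute concurrently. */
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < N; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    printf("%ld %ld\n", sum, s0 + s1 + s2 + s3);
    return 0;
}
```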

Trends in microarchitecture
• Cache hierarchies
  • Processor-memory speed gap
  • Use caches to reduce memory latency
  • Multiple levels of caches: smaller and faster closer to the processor core
• Thread-level parallelism
  • Multiple programs execute concurrently (see the example below)
    • Web servers have an abundance of software threads
    • Users: surfing the web, listening to music, encoding/decoding video streams, etc.
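As a minimal illustration of software threads (not from the slides), the following C program uses POSIX threads to run two independent workers that a multithreaded processor can execute concurrently; compile with -pthread.

```c
/* A minimal sketch of thread-level parallelism in software. */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg) {
    long id = (long)arg;
    long sum = 0;
    for (long i = 0; i < 1000000; i++)
        sum += i ^ id;            /* independent per-thread work */
    printf("thread %ld: %ld\n", id, sum);
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0);
    pthread_create(&t1, NULL, worker, (void *)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```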

Exploiting thread-level parallelism
• CMP – Chip Multiprocessing
  • Multiple processors, each with a full set of architectural resources, reside on the same die
  • Processors may share an on-chip cache or each can have its own cache
  • Examples: HP Mako, IBM POWER4
  • Challenges: power, die area (cost)
• Time-slice multithreading
  • The processor switches between software threads after a predefined time slice
  • Can minimize the effects of long-lasting events
  • Still, some execution slots are wasted

Multithreading Within a Processor
• Until now, we have executed multiple threads of an application on different processors – can multiple threads execute concurrently on the same processor?
• Why is this desirable?
  • Most processors can’t find enough work – peak IPC is 6, average IPC is 1.5!
  • Threads can share resources
  • We can increase the number of threads without a corresponding linear increase in area
• Why does this make sense?
  • Inexpensive – one CPU, no external interconnects
  • No remote or coherence misses (more capacity misses)

What Resources are Shared?
• Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
• For correctness, each thread needs its own PC and its own logical registers (and its own mapping from logical to physical registers)
• For performance, each thread could have its own ROB (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing => better utilization of resources
• Each additional thread costs a PC, a rename table, and a ROB – cheap! (see the schematic below)
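The split between replicated and shared state can be summarized in a schematic C structure. This is an illustrative sketch only; the field names and sizes (32 logical registers, 128 physical registers, 4 threads) are assumptions, not details from the slides.

```c
/* Schematic of per-thread vs. shared state in an SMT core. */
#include <stdint.h>
#include <stdio.h>

#define NUM_ARCH_REGS 32   /* logical registers per thread (assumed) */
#define NUM_PHYS_REGS 128  /* physical registers, shared (assumed)   */
#define NUM_THREADS   4

/* Replicated per thread: the cheap state (PC, rename table, ROB window). */
struct thread_context {
    uint64_t pc;                         /* program counter              */
    uint8_t  rename_map[NUM_ARCH_REGS];  /* logical -> physical mapping  */
    int      rob_head, rob_tail;         /* private ROB window pointers  */
};

/* Shared across threads: the expensive structures. */
struct smt_core {
    int64_t phys_regs[NUM_PHYS_REGS];    /* shared physical register file */
    struct thread_context ctx[NUM_THREADS];
    /* caches, branch predictor, functional units ... also shared */
};

int main(void) {
    printf("per-thread context: %zu bytes; core state: %zu bytes\n",
           sizeof(struct thread_context), sizeof(struct smt_core));
    return 0;
}
```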

Approaches to Multithreading Within a Processor
• Fine-grained multithreading: switches threads on every clock cycle
  • Pro: hides the latency of both short and long stalls
  • Con: slows down the execution of individual threads that are ready to go
• Coarse-grained multithreading: switches threads only on costly stalls (e.g., L2 misses) – a toy model follows below
  • Pros: no switching every clock cycle, no slowdown for ready-to-go threads
  • Con: limited ability to hide shorter stalls
• Simultaneous multithreading (SMT): exploits TLP at the same time it exploits ILP
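A toy model of the coarse-grained policy, assuming a made-up trace of L2 misses: the core keeps issuing from one thread and switches only when that thread misses.

```c
/* Coarse-grained multithreading: switch only on a costly event
   (modelled here as an L2 miss); the miss trace is invented. */
#include <stdbool.h>
#include <stdio.h>

#define THREADS 2
#define CYCLES  10

int main(void) {
    /* l2_miss[t][c]: does thread t miss in the L2 at cycle c? */
    bool l2_miss[THREADS][CYCLES] = {
        {false, false, true,  false, false, false, false, true,  false, false},
        {false, false, false, false, false, true,  false, false, false, false},
    };
    int current = 0;
    for (int c = 0; c < CYCLES; c++) {
        printf("cycle %d: issuing from thread %d\n", c, current);
        if (l2_miss[current][c])              /* costly stall: switch */
            current = (current + 1) % THREADS;
    }
    return 0;
}
```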

How Are Resources Shared?

[Figure: issue slots of a 4-wide machine over consecutive cycles, one column per approach (Superscalar, Coarse-grained, Fine-grained, Simultaneous Multithreading); each box is an issue slot for a functional unit, shaded by thread (Thread 1–4) or left idle. Peak throughput is 4 IPC.]

• A superscalar processor has high under-utilization – not enough work every cycle, especially when there is a cache miss
• Fine-grained multithreading can only issue instructions from a single thread in a cycle – it cannot find the maximum work every cycle, but cache misses can be tolerated
• Simultaneous multithreading can issue instructions from any thread every cycle – it has the highest probability of finding work for every issue slot (see the toy simulation below)
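The figure's comparison can be mimicked with a toy C simulation. The per-cycle "ready instruction" counts are invented and the model ignores real scheduling constraints; it only shows why SMT fills more of the 4 issue slots than a single-threaded superscalar or a fine-grained (one-thread-per-cycle) design.

```c
/* Toy model of the issue-slot figure: a 4-wide core over 8 cycles. */
#include <stdio.h>

#define WIDTH   4
#define THREADS 4
#define CYCLES  8

int main(void) {
    /* ready[t][c]: instructions thread t could issue in cycle c (made up). */
    int ready[THREADS][CYCLES] = {
        {3, 0, 0, 2, 4, 1, 0, 2},   /* thread 0 stalls in cycles 1-2 */
        {1, 2, 3, 0, 1, 2, 4, 0},
        {2, 4, 1, 1, 0, 3, 2, 1},
        {0, 1, 2, 4, 2, 0, 1, 3},
    };
    int super = 0, fine = 0, smt = 0;

    for (int c = 0; c < CYCLES; c++) {
        /* Superscalar: only thread 0 can issue. */
        super += ready[0][c] < WIDTH ? ready[0][c] : WIDTH;

        /* Fine-grained MT: one thread per cycle, round robin. */
        int t = c % THREADS;
        fine += ready[t][c] < WIDTH ? ready[t][c] : WIDTH;

        /* SMT: fill the 4 slots from any thread with ready work. */
        int slots = WIDTH;
        for (int u = 0; u < THREADS && slots > 0; u++) {
            int take = ready[u][c] < slots ? ready[u][c] : slots;
            slots -= take;
        }
        smt += WIDTH - slots;
    }
    printf("slots filled of %d: superscalar=%d fine-grained=%d smt=%d\n",
           WIDTH * CYCLES, super, fine, smt);
    return 0;
}
```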

Resource Sharing

Two threads flow through a shared pipeline (Instr Fetch → Instr Rename → Issue Queue → Register File → FUs). Each thread keeps its own rename map, but destination registers come from a shared physical register file, and the renamed instructions of both threads mix in a shared issue queue:

Thread-1            After rename
R1 ← R1 + R2        P73 ← P1 + P2
R3 ← R1 + R4        P74 ← P73 + P4
R5 ← R1 + R3        P75 ← P73 + P74

Thread-2            After rename
R2 ← R1 + R2        P76 ← P33 + P34
R5 ← R1 + R2        P77 ← P33 + P76
R3 ← R5 + R3        P78 ← P77 + P35
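A minimal sketch of this renaming in C, assuming each thread's initial logical-to-physical mappings (thread 1: R1..R5 → P1..P5; thread 2: R1..R5 → P33..P37) and a shared free list starting at P73 so the output reproduces the slide; free-list recycling and instruction issue are omitted.

```c
/* Per-thread rename tables over a shared physical register pool. */
#include <stdio.h>

#define NUM_LOGICAL 8

static int next_free = 73;        /* shared free pointer (simplified)   */
static int map[2][NUM_LOGICAL];   /* per-thread rename tables           */

/* Rename "Rd <- Rs1 + Rs2" for one thread. */
static void rename_add(int tid, int rd, int rs1, int rs2) {
    int ps1 = map[tid][rs1], ps2 = map[tid][rs2];
    int pd = next_free++;         /* grab a free physical register      */
    map[tid][rd] = pd;            /* update this thread's map only      */
    printf("T%d: R%d <- R%d + R%d   =>   P%d <- P%d + P%d\n",
           tid + 1, rd, rs1, rs2, pd, ps1, ps2);
}

int main(void) {
    /* Assumed initial mappings, chosen to match the slide's numbers. */
    for (int r = 1; r <= 5; r++) {
        map[0][r] = r;            /* thread 1: R1..R5 -> P1..P5   */
        map[1][r] = 32 + r;       /* thread 2: R1..R5 -> P33..P37 */
    }
    rename_add(0, 1, 1, 2);   /* P73 <- P1  + P2  */
    rename_add(0, 3, 1, 4);   /* P74 <- P73 + P4  */
    rename_add(0, 5, 1, 3);   /* P75 <- P73 + P74 */
    rename_add(1, 2, 1, 2);   /* P76 <- P33 + P34 */
    rename_add(1, 5, 1, 2);   /* P77 <- P33 + P76 */
    rename_add(1, 3, 5, 3);   /* P78 <- P77 + P35 */
    return 0;
}
```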

Performance Implications of SMT
• Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared) – this effect can be mitigated by trying to prioritize one thread
• While fetching instructions, thread priority can dramatically influence total throughput – a widely accepted heuristic (ICOUNT) is to fetch such that each thread has an equal share of processor resources (a sketch follows below)
• With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4
• The Alpha 21464 and Intel Pentium 4 are examples of SMT processors
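A sketch of the ICOUNT heuristic named above: each cycle, fetch from the thread with the fewest instructions in the front end and issue queue. The counts below are invented; a real implementation would update them as instructions dispatch and issue.

```c
/* ICOUNT fetch policy sketch: favor the least-represented thread. */
#include <stdio.h>

#define THREADS 4

/* in_flight[t]: instructions thread t has in decode/rename/issue. */
static int pick_fetch_thread(const int in_flight[THREADS]) {
    int best = 0;
    for (int t = 1; t < THREADS; t++)
        if (in_flight[t] < in_flight[best])
            best = t;
    return best;
}

int main(void) {
    int in_flight[THREADS] = {12, 3, 7, 9};   /* invented counts */
    printf("fetch from thread %d\n", pick_fetch_thread(in_flight)); /* -> 1 */
    return 0;
}
```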

Design Challenges
• How many threads?
  • Many threads are needed to find enough parallelism
  • However, mixing many threads will compromise the execution of individual threads
• Processor front-end (instruction fetch)
  • Fetch as far as possible in a single thread (to maximize thread performance)
  • However, this limits the number of instructions available for scheduling from other threads
• Larger register files (multiple contexts)
• Minimizing the clock cycle time
• Cache conflicts

Pentium 4: Hyperthreading architecture
• One physical processor appears as multiple logical processors
• The HT implementation on the NetBurst microarchitecture has 2 logical processors

[Figure: two copies of the architectural state on top of one set of shared processor execution resources]

Architectural state:
• general-purpose registers
• control registers
• APIC: advanced programmable interrupt controller
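A hedged sketch of how software can see this: CPUID leaf 1 sets the HTT flag in EDX bit 28, and EBX bits 16-23 report the maximum number of addressable logical processors per package. The flag indicates multi-threading capability, not that HT is actually enabled; this uses GCC/Clang's <cpuid.h> and is x86-specific.

```c
/* Query the CPUID HTT flag (leaf 1, EDX bit 28) on x86 with GCC/Clang. */
#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 1 not supported\n");
        return 1;
    }
    int htt = (edx >> 28) & 1;          /* multi-threading capable package */
    int logical = (ebx >> 16) & 0xff;   /* max addressable logical CPUs    */
    printf("HTT flag: %d, max logical processors per package: %d\n",
           htt, logical);
    return 0;
}
```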

Pentium 4: Hyperthreading architecture
• Main processor resources are shared
  • caches, branch predictors, execution units, buses, control logic
• Duplicated resources
  • register alias tables (map the architectural registers to physical rename registers)
  • next-instruction pointer and associated control logic
  • return stack pointer
  • instruction streaming buffer and trace cache fill buffers

Pentium 4: Die Size and Complexity
[figure-only slide]

Pentium 4: Resource sharing schemes
• Partition – dedicate equal resources to each logical processor
  • Good when utilization is expected to be high and somewhat unpredictable
• Threshold – flexible resource sharing with a limit on maximum resource usage
  • Good for small resources with bursty utilization, where the micro-ops stay in the structure for short, predictable periods
• Full sharing – flexible resource sharing with no limits
  • Good for large structures with variable working-set sizes
(A toy model of the three schemes follows below.)
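The three schemes can be contrasted with a toy allocation check in C. The structure size, thread count, and the 3/4 threshold are illustrative assumptions, not Pentium 4 parameters.

```c
/* Toy allocation rule for partitioned / threshold / fully shared structures. */
#include <stdbool.h>
#include <stdio.h>

#define TOTAL       64   /* entries in the structure (assumed) */
#define NUM_THREADS 2

enum scheme { PARTITIONED, THRESHOLD, FULLY_SHARED };

static bool may_allocate(enum scheme s, int used_by_thread, int used_total) {
    switch (s) {
    case PARTITIONED:   /* hard per-thread cap of an equal share */
        return used_by_thread < TOTAL / NUM_THREADS;
    case THRESHOLD:     /* flexible, but capped below a full takeover */
        return used_total < TOTAL && used_by_thread < (3 * TOTAL) / 4;
    case FULLY_SHARED:  /* any thread may use any free entry */
        return used_total < TOTAL;
    }
    return false;
}

int main(void) {
    /* One thread already holds 40 of the 48 used entries. */
    printf("partitioned: %d\n", may_allocate(PARTITIONED, 40, 48));  /* 0 */
    printf("threshold:   %d\n", may_allocate(THRESHOLD, 40, 48));    /* 1 */
    printf("shared:      %d\n", may_allocate(FULLY_SHARED, 40, 48)); /* 1 */
    return 0;
}
```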

Pentium 4: Shared vs. partitioned queues
[figure comparing shared and partitioned queue organizations]

NetBurst Pipeline
[figure: the NetBurst pipeline, with threshold-shared and partitioned structures marked]

Pentium 4: Shared vs. partitioned resources
• Partitioned resources
  • E.g., the major pipeline queues
• Threshold sharing
  • Puts a threshold on the number of resource entries a logical processor can use
  • E.g., the scheduler
• Fully shared resources
  • E.g., caches
  • Modest interference
  • Benefit if the threads share code and/or data

Pentium 4: Scheduler occupancy
[figure-only slide]

Pentium 4: Shared vs. partitioned cache
[figure-only slide]

Pentium 4: Performance Improvements
[figure-only slide]

Multi-Programmed Speedup
[figure-only slide]
