Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall

  • Slides: 28
Download presentation
Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer

Lecture 11 Multithreaded Architectures Graduate Computer Architecture Fall 2005 Shih-Hao Hung Dept. of Computer Science and Information Engineering National Taiwan University

Concept • Data Access Latency – Cache misses (L 1, L 2) – Memory

Concept • Data Access Latency – Cache misses (L 1, L 2) – Memory latency (remote, local) – Often unpredictable • Multithreading (MT) – Tolerate or mask long and often unpredictable latency operations by switching to another context, which is able to do useful work.

Why Multithreading Today? • ILP is exhausted, TLP is in. • Large performance gap

Why Multithreading Today? • ILP is exhausted, TLP is in. • Large performance gap bet. MEM and PROC. • Too many transistors on chip • More existing MT applications Today. • Multiprocessors on a single chip. • Long network latency, too.

Classical Problem, 60’ & 70’ • • • I/O latency prompted multitasking IBM mainframes

Classical Problem, 60’ & 70’ • • • I/O latency prompted multitasking IBM mainframes Multitasking I/O processors Caches within disk controllers

Requirements of Multithreading • Storage need to hold multiple context’s PC, registers, status word,

Requirements of Multithreading • Storage need to hold multiple context’s PC, registers, status word, etc. • Coordination to match an event with a saved context • A way to switch contexts • Long latency operations must use resources not in use

Processor Utilization vs. Latency R = the run length to a long latency event

Processor Utilization vs. Latency R = the run length to a long latency event L = the amount of latency

Problem of 80’ • Problem was revisited due to the advent of graphics workstations

Problem of 80’ • Problem was revisited due to the advent of graphics workstations – Xerox Alto, TI Explorer – Concurrent processes are interleaved to allow for the workstations to be more responsive. – These processes could drive or monitor display, input, file system, network, user processing – Process switch was slow so the subsystems were microprogrammed to support multiple contexts

Scalable Multiprocessor (90’) • Dance hall – a shared interconnect with memory on one

Scalable Multiprocessor (90’) • Dance hall – a shared interconnect with memory on one side and processors on the other. • Or processors may have local memory

How do the processors communicate? • Shared Memory • Potential long latency on every

How do the processors communicate? • Shared Memory • Potential long latency on every load – Cache coherency becomes an issue – Examples include NYU’s Ultracomputer, IBM’s RP 3, BBN’s Butterfly, MIT’s Alewife, and later Stanford’s Dash. – Synchronization occurs through share variables, locks, flags, and semaphores. • Message Passing – Programmer deals with latency. This enables them to minimize the number of messages, while maximizing the size, and this scheme allows for delay minimization by sending a message so that it reaches the receiver at the time it expects it. – Examples include Intel’s PSC and Paragon, Caltech’s Cosmic Cube, and Thinking Machines’ CM-5 – Synchronization occurs through send and receive

Cycle-by-Cycle Interleaved Multithreading • Denelcor HEP 1 (1982), HEP 2 • Horizon, which was

Cycle-by-Cycle Interleaved Multithreading • Denelcor HEP 1 (1982), HEP 2 • Horizon, which was never built • Tera, MTA

Cycle-by-Cycle Interleaved Multithreading • Features – An instruction from a different context is launched

Cycle-by-Cycle Interleaved Multithreading • Features – An instruction from a different context is launched at each clock cycle – No interlocks or bypasses thanks to a non-blocking pipeline • Optimizations: – Leaving context state in proc (PC, register #, status) – Assigning tags to remote request and then matching it on completion

Challenges with this approach • I-Cache: – Instruction bandwidth – I-Cache misses: Since instructions

Challenges with this approach • I-Cache: – Instruction bandwidth – I-Cache misses: Since instructions are being grabbed from many different contexts, instruction locality is degraded and the I-cache miss rate rises. • Register file access time: – Register file access time increases due to the fact that the regfile had to significantly increase in size to accommodate many separate contexts. – In fact, the HEP and Tera use SRAM to implement the regfile, which means longer access times. • Single thread performance – Single thread performance significantly degraded since the context is forced to switch to a new thread even if none are available. • • Very high bandwidth network, which is fast and wide Retries on load empty or store full

Improving Single Thread Performance • Do more operations per instruction (VLIW) • Allow multiple

Improving Single Thread Performance • Do more operations per instruction (VLIW) • Allow multiple instructions to issue into pipeline from each context. – This could lead to pipeline hazards, so other safe instructions could be interleaved into the execution. – For Horizon & Tera, the compiler detects such data dependencies and the hardware enforces it by switching to another context if detected. • Switch on load • Switch on miss – Switching on load or miss will increase the context switch time.

Simultaneous Multithreading (SMT) • Tullsen, et. al. (U. of Washington), ISCA ‘ 95 •

Simultaneous Multithreading (SMT) • Tullsen, et. al. (U. of Washington), ISCA ‘ 95 • A way to utilize pipeline with increased parallelism from multiple threads.

Simultaneous Multithreading

Simultaneous Multithreading

SMT Architecture • Straightforward extension to conventional superscalar design. – multiple program counters and

SMT Architecture • Straightforward extension to conventional superscalar design. – multiple program counters and some mechanism by which the fetch unit selects one each cycle, – a separate return stack for each thread for predicting subroutine return destinations, – per-thread instruction retirement, instruction queue flush, and trap mechanisms, – a thread id with each branch target buffer entry to avoid predicting phantom branches, and – a larger register file, to support logical registers for all threads plus additional registers for register renaming. • The size of the register file affects the pipeline and the scheduling of load-dependent instructions.

SMT Performance Tullsen ‘ 96

SMT Performance Tullsen ‘ 96

Commercial Machines w/ MT Support • Intel Hyperthreding (HT) – Dual threads – Pentium

Commercial Machines w/ MT Support • Intel Hyperthreding (HT) – Dual threads – Pentium 4, XEON • Sun Cool. Threads – Ultra. SPARC T 1 – 4 -threads per core • IBM – POWER 5

IBM Power 5 http: //www. research. ibm. com/journal/rd/494/mathis. pdf

IBM Power 5 http: //www. research. ibm. com/journal/rd/494/mathis. pdf

IBM Power 5 http: //www. research. ibm. com/journal/rd/494/mathis. pdf

IBM Power 5 http: //www. research. ibm. com/journal/rd/494/mathis. pdf

SMT Summary • Pros: – Increased throughput w/o adding much cost – Fast response

SMT Summary • Pros: – Increased throughput w/o adding much cost – Fast response for multitasking environment • Cons: – Slower single processor performance

Multicore • Multiple processor cores on a chip – Chip multiprocessor (CMP) – Sun’s

Multicore • Multiple processor cores on a chip – Chip multiprocessor (CMP) – Sun’s Chip Multithreading (CMT) • Ultra. SPARC T 1 (Niagara) – Intel’s Pentium D – AMD dual-core Opteron • Also a way to utilize TLP, but – 2 cores 2 X costs – No good for single thread performacne • Can be used together with SMT

Chip Multithreading (CMT)

Chip Multithreading (CMT)

Sun Ultra. SPARC T 1 Processor http: //www. sun. com/servers/wp. jsp? tab=3&group=Cool. Threads%20 servers

Sun Ultra. SPARC T 1 Processor http: //www. sun. com/servers/wp. jsp? tab=3&group=Cool. Threads%20 servers

8 Cores vs 2 Cores • Is 8 -cores too aggressive? – Good for

8 Cores vs 2 Cores • Is 8 -cores too aggressive? – Good for server applications, given • Lots of threads • Scalable operating environment • Large memory space (64 bit) – Good for power efficiency • Simple pipeline design for each core – Good for availability – Not intended for PCs, gaming, etc

SPECWeb 2005 IBM X 346: 3 Ghz Xeon T 2000: 8 core 1. 0

SPECWeb 2005 IBM X 346: 3 Ghz Xeon T 2000: 8 core 1. 0 GHz T 1 Processor

Sun Fire T 2000 Server

Sun Fire T 2000 Server

Server Pricing • Ultra. SPARC • X 86 – Sun Fire T 1000 Server

Server Pricing • Ultra. SPARC • X 86 – Sun Fire T 1000 Server • 6 core 1. 0 GHz T 1 Processor • 2 GB memory, 1 x 80 GB disk • List price: $5, 745 – Sun Fire X 2100 Server • Dual core AMD Opteron 175 • 2 GB memory, 1 x 80 GB disk • List price: $2, 295 – Sun Fire T 2000 Server • 8 core 1. 0 GHz T 1 Processor • 8 GB DDR 2 memory, 2 X 73 GB disk • List price: $13, 395 – Sun Fire X 4200 Server • 2 x Dual core AMD Opteron 275 • 4 GB memory, 2 x 73 GB disk • List price: $7, 595