CSE 431 Computer Architecture Fall 2005 Lecture 28

CSE 431 Computer Architecture, Fall 2005
Lecture 28. CMPs & SMTs
Mary Jane Irwin (www.cse.psu.edu/~mji)
www.cse.psu.edu/~cg431
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]
CSE 431 L28 CMP&SMT.1 Irwin, PSU, 2005

Review: Multiprocessor Basics
• Q1 – How do they share data?
• Q2 – How do they coordinate?
• Q3 – How scalable is the architecture? How many processors?

                                            # of Proc
  Communication model   Message passing       8 to 2048
                        Shared address NUMA   8 to 256
                        Shared address UMA    2 to 64
  Physical connection   Network               8 to 256
                        Bus                   2 to 36

CMP: Multiprocessors On One Chip
• By placing multiple processors, their memories, and the interconnection network (IN) all on one chip, the latencies of chip-to-chip communication are drastically reduced
  - ARM multi-chip core
[Figure: ARM multi-chip core block diagram. Labels: configurable between 1 & 4 symmetric CPUs; per-CPU aliased peripherals; private peripheral bus; configurable # of hardware interrupts; private IRQ; Interrupt Distributor; CPU Interface; CPU; L1$s; Snoop Control Unit; I & D 64-b bus; CCB; primary AXI R/W 64-b bus; optional AXI R/W 64-b bus]

Multithreading on A Chip
• Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions
• Multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor
  - Processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer for each thread
  - The caches, TLBs, BHT, BTB can be shared (although the miss rates may increase if they are not sized accordingly)
  - The memory can be shared through virtual memory mechanisms
  - Hardware must support efficient thread context switching
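The split between duplicated and shared hardware state can be sketched as a small data model. This is only an illustration: the class names, the `switch_to` method, and the choice of 32 registers are assumptions made for the sketch, not details from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadState:
    """State the slide lists as duplicated per hardware thread."""
    pc: int = 0
    # 32 registers is an assumed size, chosen only for illustration.
    registers: list = field(default_factory=lambda: [0] * 32)
    instruction_buffer: list = field(default_factory=list)
    store_buffer: list = field(default_factory=list)


@dataclass
class SharedState:
    """Structures the slide lists as shareable among threads."""
    cache: dict = field(default_factory=dict)
    tlb: dict = field(default_factory=dict)
    bht: dict = field(default_factory=dict)
    btb: dict = field(default_factory=dict)


class MultithreadedCore:
    """Hypothetical core: one shared set of structures, one ThreadState per thread."""

    def __init__(self, num_threads):
        self.shared = SharedState()
        self.threads = [ThreadState() for _ in range(num_threads)]

    def switch_to(self, thread_id):
        # A switch only selects another copy of the per-thread state;
        # nothing has to be saved to or restored from memory.
        return self.threads[thread_id]


core = MultithreadedCore(num_threads=4)
core.threads[0].registers[1] = 7      # write in thread 0
print(core.switch_to(1).registers[1]) # thread 1 has its own register file
```

Because every thread keeps a private copy of its state, the switch in this model is a selection rather than a save and restore, which is what makes switching cheap enough to do often.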

Types of Multithreading
• Fine-grain – switch threads on every instruction issue
  - Round-robin thread interleaving (skipping stalled threads)
  - Processor must be able to switch threads on every clock cycle
  - Advantage – can hide throughput losses that come from both short and long stalls
  - Disadvantage – slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads
• Coarse-grain – switches threads only on costly stalls (e.g., L2 cache misses)
  - Advantages – thread switching doesn’t have to be essentially free, and it is much less likely to slow down the execution of an individual thread
  - Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss
    - Pipeline must be flushed and refilled on thread switches
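The fine-grain policy (round-robin interleaving that skips stalled threads) can be sketched as a small thread-select function. This is an illustrative model, not a description of any particular processor: the name `fine_grain_schedule` and the `stall_until` list are hypothetical, and each thread is assumed to have an instruction available whenever it is not stalled.

```python
def fine_grain_schedule(stall_until, cycles):
    """Return the thread id that issues on each cycle (None if all are stalled).

    stall_until[t] is the first cycle on which thread t is ready to issue.
    """
    n = len(stall_until)
    last = -1          # thread that issued most recently
    trace = []
    for cycle in range(cycles):
        chosen = None
        # Round-robin: start with the thread after the last one that issued
        # and skip any thread that is still stalled.
        for step in range(1, n + 1):
            t = (last + step) % n
            if stall_until[t] <= cycle:
                chosen = t
                break
        if chosen is not None:
            last = chosen
        trace.append(chosen)
    return trace


# Thread 1 is stalled until cycle 3, so it is skipped on cycle 1.
print(fine_grain_schedule([0, 3, 0, 0], 6))  # [0, 2, 3, 0, 1, 2]
```

In the trace, thread 0 issues on cycles 0 and 3 even though it is never stalled. That gap is the per-thread slowdown the slide lists as the disadvantage of fine-grain multithreading.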

Multithreaded Example: Sun’s Niagara (UltraSparc T1)
• Eight fine grain multithreaded single-issue, in-order cores (no speculation, no dynamic branch prediction)

                    Ultra III               Niagara
  Data width        64-b                    64-b
  Clock rate        1.2 GHz                 1.0 GHz
  Cache (I/D/L2)    32K/64K/(8M external)   16K/8K/3M
  Issue rate        4 issue                 1 issue
  Pipe stages       14 stages               6 stages
  BHT entries       16K x 2-b               None
  TLB entries       128I/512D               64I/64D
  Memory BW         2.4 GB/s                ~20 GB/s
  Transistors       29 million              200 million
  Power (max)       53 W                    <60 W

[Figure: Niagara block diagram. Eight 4-way MT SPARC pipes, a crossbar, a 4-way banked L2$, memory controllers, and I/O shared functions]

Niagara Integer Pipeline
• Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient
[Figure: Niagara integer pipeline, from MPR, Vol. 18, #9, Sept. 2004. Stages: Fetch, Thrd Sel, Decode, Execute, Memory, WB. Labels: I$, ITLB, PC logic x4, Inst buf x4, Thrd Sel Mux, Thread Select Logic (inputs: instr type, cache misses, traps & interrupts, resource conflicts), Decode, Reg File x4, ALU, Mul, Shft, Div, D$, DTLB, St buf x4, Crossbar Interface]

Simultaneous Multithreading (SMT)
• A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP)
  - Most SS processors have more machine level parallelism than most programs can effectively use (i.e., than have ILP)
  - With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
    - Need separate rename tables (ROBs) for each thread
    - Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle
• Intel’s Pentium 4 SMT called hyperthreading
  - Supports just two threads (doubles the architecture state)

Threading on a 4-way SS Processor Example
[Figure: issue slots (horizontal) versus time (vertical) on a 4-way superscalar under Coarse MT, Fine MT, and SMT, with slots filled by Thread A, Thread B, Thread C, and Thread D]
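The issue-slot picture for a 4-way superscalar can be approximated with a toy model. In this sketch, `ready[t][c]` is the number of independent instructions thread t could issue in cycle c. The function names, the sample numbers, and the assumption that unissued instructions do not carry over to later cycles are all simplifications introduced here, and the fine MT model does not skip stalled threads.

```python
WIDTH = 4  # issue slots per cycle on a 4-way superscalar


def slots_single(ready, thread=0):
    """Slots filled each cycle when only one thread runs."""
    return [min(WIDTH, r) for r in ready[thread]]


def slots_fine_mt(ready):
    """Fine MT: one thread per cycle, chosen round-robin."""
    n = len(ready)
    cycles = len(ready[0])
    return [min(WIDTH, ready[c % n][c]) for c in range(cycles)]


def slots_smt(ready):
    """SMT: instructions from all threads compete for the same slots."""
    return [min(WIDTH, sum(per_cycle)) for per_cycle in zip(*ready)]


def utilization(slots):
    """Fraction of issue slots that were filled."""
    return sum(slots) / (WIDTH * len(slots))


# Two threads over four cycles (made-up numbers).
ready = [
    [2, 0, 1, 3],  # thread A
    [1, 2, 0, 2],  # thread B
]

print(utilization(slots_single(ready)))   # 0.375
print(utilization(slots_fine_mt(ready)))  # 0.4375
print(utilization(slots_smt(ready)))      # 0.625
```

Fine MT fills cycles that would otherwise be empty, but each cycle is still limited by the ILP of a single thread. In this model only SMT combines instructions from several threads within one cycle.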

Multicore Xbox 360 – “Xenon” processor
• To provide game developers with a balanced and powerful platform
  - Three SMT processors, 32 KB L1 D$ & I$, 1 MB UL2 cache
  - 165 M transistors total
  - 3.2 GHz Near-POWER ISA
  - 2-issue, 21 stage pipeline, with 128-bit registers
  - Weak branch prediction – supported by software hinting
  - In-order instructions
  - Narrow cores – 2 INT units, 2 128-bit VMX units, 1 of anything else
• An ATI-designed 500 MHz GPU w/ 512 MB of DDR3 DRAM
  - 337 M transistors, 10 MB framebuffer
  - 48 pixel shader cores, each with 4 ALUs

Xenon Diagram
[Figure: Xenon system block diagram. Labels: Core 0, Core 1, Core 2; L1D, L1I; 1 MB UL2; BIU/IO Intf; MC0, MC1; 512 MB DRAM; GPU; 3D Core; 10 MB EDRAM; Video Out; Analog Chip; SMC; XMA Dec; DVD; HDD Port; Front USBs (2); Wireless; MU ports (2 USBs); Rear USB (1); Ethernet; IR; Audio Out; Flash; Systems Control]

The PS3 “Cell” Processor Architecture
• Composed of a Non-SMP Architecture
  - 234 M transistors @ 4 GHz
  - 1 Power Processing Element, 8 “Synergistic” (SIMD) PEs
  - 512 KB L2$
    - Massively high bandwidth (200 GB/s) bus connects it to everything else
  - The PPE is strangely similar to one of the Xenon cores
    - Almost identical, really. Slight ISA differences, and fine-grained MT instead of real SMT
  - The real differences lie in the SPEs (21 M transistors each)
    - An attempt to ‘fix’ the memory latency problem by giving each processor complete control over its own 256 KB “scratchpad” – 14 M transistors
      - Direct mapped for low latency
    - 4 vector units per SPE, 1 of everything else – 7 M transistors

How to make use of the SPEs

What about the Software?
• Makes use of special IBM “Hypervisor”
  - Like an OS for OS’s
  - Runs both a real time OS (for sound) and non-real time (for things like AI)
• Software must be specially coded to run well
  - The single PPE will be quickly bogged down
  - Must make use of SPEs wherever possible
  - This isn’t easy, by any standard
• What about Microsoft?
  - Development suite identifies which 6 threads you’re expected to run
  - Four of them are DirectX based, and handled by the OS
  - Only need to write two threads, functionally

Next Lecture and Reminders
• Next lecture
  - Reading assignment – none (or all)
• Reminders
  - Check grade posting on-line (by your midterm exam number) for correctness
  - Final exam (tentative) schedule
    - Tuesday, December 13th, 2:30-4:20, 22 Deike