EE 193 Parallel Computing Fall 2017 Tufts University
![EE 193: Parallel Computing Fall 2017 Tufts University Instructor: Joel Grodstein joel. grodstein@tufts. edu EE 193: Parallel Computing Fall 2017 Tufts University Instructor: Joel Grodstein joel. grodstein@tufts. edu](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-1.jpg)
![Caches • Why do we have caches? When do caches work well, and not Caches • Why do we have caches? When do caches work well, and not](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-2.jpg)
![Branch predictors • What is branch prediction? – Instead of stalling until you know Branch predictors • What is branch prediction? – Instead of stalling until you know](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-3.jpg)
![Out of order • What is OOO? – Forget about executing instructions in program Out of order • What is OOO? – Forget about executing instructions in program](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-4.jpg)
![SMT • What is SMT? – One core has several threads to choose from SMT • What is SMT? – One core has several threads to choose from](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-5.jpg)
![How far should we go? • How many transistors should we spend on big How far should we go? • How many transistors should we spend on big](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-6.jpg)
![How far should we go? • How many transistors should we spend on branch How far should we go? • How many transistors should we spend on branch](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-7.jpg)
![How far should we go? • How many transistors should we spend on OOO How far should we go? • How many transistors should we spend on OOO](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-8.jpg)
![So what do we do? • Keep adding more transistors? – Bigger caches, bigger So what do we do? • Keep adding more transistors? – Bigger caches, bigger](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-9.jpg)
![CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18 CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-10.jpg)
![CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18 CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-11.jpg)
![CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18 CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-12.jpg)
![Power Implications of Parallelism • Consider doubling number of cores, same power budget – Power Implications of Parallelism • Consider doubling number of cores, same power budget –](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-13.jpg)
![Approaches to the serial problem • Rewrite serial programs so that they’re parallel – Approaches to the serial problem • Rewrite serial programs so that they’re parallel –](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-14.jpg)
![What kind of multicore? • We will build multicore CPUs – Everyone has conceded What kind of multicore? • We will build multicore CPUs – Everyone has conceded](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-15.jpg)
![What to remember • Caches: – – Write your code to have temporal and What to remember • Caches: – – Write your code to have temporal and](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-16.jpg)
- Slides: 16
![EE 193 Parallel Computing Fall 2017 Tufts University Instructor Joel Grodstein joel grodsteintufts edu EE 193: Parallel Computing Fall 2017 Tufts University Instructor: Joel Grodstein joel. grodstein@tufts. edu](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-1.jpg)
EE 193: Parallel Computing Fall 2017 Tufts University Instructor: Joel Grodstein joel. grodstein@tufts. edu What have we learned about architecture? 1
![Caches Why do we have caches When do caches work well and not Caches • Why do we have caches? When do caches work well, and not](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-2.jpg)
Caches • Why do we have caches? When do caches work well, and not so well? – While main-memory density has greatly increased over the years, its speed has not kept up with CPU speed – Caches let you put a small amount of high-speed memory near the CPU so it doesn't spend lots of time waiting for memory – They work well when your program has temporal and spatial locality. Otherwise, not. EE 193 Joel Grodstein 2
![Branch predictors What is branch prediction Instead of stalling until you know Branch predictors • What is branch prediction? – Instead of stalling until you know](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-3.jpg)
Branch predictors • What is branch prediction? – Instead of stalling until you know if a branch is taken, just make your best guess and execute your choice – If your guess turns out correct, then good. You run fast – If not, then undo everything you've done (at considerable energy cost), using a Reorder Buffer – Making your best guess usually relies on past history at each branch EE 193 Joel Grodstein 3
![Out of order What is OOO Forget about executing instructions in program Out of order • What is OOO? – Forget about executing instructions in program](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-4.jpg)
Out of order • What is OOO? – Forget about executing instructions in program order – Look ahead at the next 10 -100 instructions to find any of them whose operands are ready – Execute in dataflow order (i. e. . , whoever is ready) – If something goes wrong (like an earlier instruction takes an exception), then undo all instructions after the exception (using a reorder buffer) EE 193 Joel Grodstein 4
![SMT What is SMT One core has several threads to choose from SMT • What is SMT? – One core has several threads to choose from](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-5.jpg)
SMT • What is SMT? – One core has several threads to choose from – Whenever one thread would be stalled, it switches to another thread – Requires (essentially) a separate register file for each thread – Allows one core to execute many threads with minimal stalling – Does not help single-thread execution speed EE 193 Joel Grodstein 5
![How far should we go How many transistors should we spend on big How far should we go? • How many transistors should we spend on big](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-6.jpg)
How far should we go? • How many transistors should we spend on big caches? – Caches cost power & area, and don't do any computing – We already see pretty good hit rates – If you organize your code very cleverly, you can often get it to run fast without needing a lot of cache • But – taking the time to organize your code cleverly takes time – and time is money – and most customers prefer software to be cheap or free EE 194/Comp 140 Mark Hempstead 6
![How far should we go How many transistors should we spend on branch How far should we go? • How many transistors should we spend on branch](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-7.jpg)
How far should we go? • How many transistors should we spend on branch prediction? – The BP itself costs power and area, but does no computation. – A BP is necessary for good single-stream performance – We already saw diminishing returns on bigger BHTs – The difficult-to-predict branches are often data dependent; a better/bigger algorithm won't help much • It would be really nice if we just didn't care about single-stream performance – But we do – usually. EE 194/Comp 140 Mark Hempstead 7
![How far should we go How many transistors should we spend on OOO How far should we go? • How many transistors should we spend on OOO](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-8.jpg)
How far should we go? • How many transistors should we spend on OOO infrastructure? – A big ROB costs area and power, and doesn't do any computing – Instructions per cycle is hitting a wall; there's just not that much parallelism in most code (no matter how hard your OOO transistors try) • But – OOO makes a really big difference in single-stream performance. EE 194/Comp 140 Mark Hempstead 8
![So what do we do Keep adding more transistors Bigger caches bigger So what do we do? • Keep adding more transistors? – Bigger caches, bigger](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-9.jpg)
So what do we do? • Keep adding more transistors? – Bigger caches, bigger branch predictors, more OOO – Will cost more and more power for very little execution speed • Everybody stopped doing this 10 years ago • Instead: more cores – And, in fact, single-stream performance is no longer improving very quickly EE 193 Joel Grodstein 9
![CPU vs GPU once more Haswell Server Nvidia Pascal P 100 cores 18 CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-10.jpg)
CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18 3840 Die area 660 mm 2 (22 nm) 610 mm 2 (16 nm) Frequency 2. 3 GHz normal 1. 3 GHz Max DRAM BW 100 GB/s 720 GB/s LLC size 2. 5 MB/core (L 3) 4 MB L 2/chip LLC-1 size 256 K/core(L 2), 64 B/c 64 KB/SM (64 cores) Registers/core 180 per 2 threads 1000 Power 165 watts 300 watts Company market cap $160 B $90 B • How can a GPU fit so many cores in the same area? – Their cores do not have OOO, speculation, BP, large caches, … • But won't a GPU have lousy single-thread performance? – Yes. That's not their market EE 193 Joel Grodstein 10
![CPU vs GPU once more Haswell Server Nvidia Pascal P 100 cores 18 CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-11.jpg)
CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18 3840 Die area 660 mm 2 (22 nm) 610 mm 2 (16 nm) Frequency 2. 3 GHz normal 1. 3 GHz Max DRAM BW 100 GB/s 720 GB/s LLC size 2. 5 MB/core (L 3) 4 MB L 2/chip LLC-1 size 256 K/core(L 2), 64 B/c 64 KB/SM (64 cores) Registers/core 180 per 2 threads 1000 Power 165 watts 300 watts Company market cap $160 B $90 B • How can GPU get away with so little cache? – programmer is responsible for highly optimizing algorithms • But won’t the software be hard to write? – Yes EE 193 Joel Grodstein 11
![CPU vs GPU once more Haswell Server Nvidia Pascal P 100 cores 18 CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-12.jpg)
CPU vs. GPU, once more Haswell Server Nvidia Pascal P 100 # cores 18 3840 Die area 660 mm 2 (22 nm) 610 mm 2 (16 nm) Frequency 2. 3 GHz normal 1. 3 GHz Max DRAM BW 100 GB/s 720 GB/s LLC size 2. 5 MB/core (L 3) 4 MB L 2/chip LLC-1 size 256 K/core(L 2), 64 B/c 64 KB/SM (64 cores) Registers/core 180 per 2 threads 1000 Power 165 watts 300 watts Company market cap $160 B $90 B • Why does a GPU have so many registers? – Because it depends on SMT to get multi-thread performance EE 193 Joel Grodstein 12
![Power Implications of Parallelism Consider doubling number of cores same power budget Power Implications of Parallelism • Consider doubling number of cores, same power budget –](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-13.jpg)
Power Implications of Parallelism • Consider doubling number of cores, same power budget – By reducing clock frequency and voltage. . . but how much? • First, a few equations (approximate, first order) – Frequency ~ Voltage (higher voltage -> transistors switch faster) – Dynamic Power ~ Transistors * Frequency * Voltage 2 – Thus, Dynamic Power ~ Transistors * Frequency 3 • How? – Doubling number of cores (transistors) will double the power – Reducing frequency & voltage by 20% will cut power in half (0. 83 is 0. 5). . . – 2 x # of cores, each 20% slower → 1. 6 x performance increase • Parallelism is great – If we can write software for it! Copyright © 2010, Elsevier Inc. All rights Reserved 13
![Approaches to the serial problem Rewrite serial programs so that theyre parallel Approaches to the serial problem • Rewrite serial programs so that they’re parallel –](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-14.jpg)
Approaches to the serial problem • Rewrite serial programs so that they’re parallel – This is usually slow & error prone • Write translation programs that automatically convert serial programs into parallel programs. – This is very difficult to do. – Success has been limited. • No magic bullet has been found yet Copyright © 2010, Elsevier Inc. All rights Reserved 14
![What kind of multicore We will build multicore CPUs Everyone has conceded What kind of multicore? • We will build multicore CPUs – Everyone has conceded](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-15.jpg)
What kind of multicore? • We will build multicore CPUs – Everyone has conceded this – Even the Iphone X is a six-core chip. • But what kind of cores? – "Big" cores, with lots of cache, OOO, BP – Less power efficient, but gives you good single-thread performance – Intel Xeon has followed this route • Small cores: much less cache, OOO, BP – More power efficient; more cores can fit. So multithread performance is better – But single-thread performance is unacceptable, and programming is challenging – GPUs have taken this route, to the extreme. The one feature they've kept is SMT EE 193 Joel Grodstein 15
![What to remember Caches Write your code to have temporal and What to remember • Caches: – – Write your code to have temporal and](https://slidetodoc.com/presentation_image_h2/de1f7905ffde636df284d9f8510015bf/image-16.jpg)
What to remember • Caches: – – Write your code to have temporal and spatial locality Easier said than done sometimes! But extremely important And try to make it fit in cache (or break it into sub-problems) • Branch prediction: – big, power-hungry and usually still necessary for single-stream perf – data-dependent branches are slow • OOO: – big, power-hungry and usually still necessary for single-stream perf • SMT – Can reduce stalls without needing as much BP and OOO – Can only go so far before the regfile explodes EE 193 Joel Grodstein 16
Tufts blast
Conservation of mechanical energy examples
Nkb 193
Cs 193
Uiuc applied aerodynamics group
193 brand
Antolini catcher in the rye
Mei=mef
A 193 kg curtain needs to be raised
Art 193 clt
Cms-r-193
Item 193
Types of inverse
2017 dvhimss annual fall conference
Conventional computing and intelligent computing
Tufts hpc
Tufts anesthesiology residency