Where did it all start? ENIAC [picture from Wikipedia]
- At Penn
- Lt Gillon, Eckert and Mauchly
- Cost $486,804.22 in 1946 (~$6.3M today)
- 5000 ops/second
- 19K vacuum tubes
- Power = 200 kW
- 67 m³, 27 tons(!)

Where are we ~today?
Intel’s Haswell-EX:
- 5.6 billion transistors
- 12 cores
- 45 MB L3 cache
- 2.5 GHz
- Roughly 165 W
- 662 mm² die
Newer i9:
- 8 cores, 16 threads
- 3.6 -> 5 GHz clock
- TDP 95 W
- 16 MB cache

The chips got bigger but not faster!
[Figure: transistors (100,000's) and performance (GOPS) vs. year, 1985-2020, log scale; performance growth annotated as > 50%/year and ~15%/year]

Why? Must keep power at ~100 W!
[Figure: transistors (100,000's), power (W), performance (GOPS) and efficiency (GOPS/W) vs. year, 1985-2020, log scale; "Era of Uniprocessors" until c. 2005, "Era of Multiprocessors" after]

What is Power?
For fixed capacitance: Power ∝ V²F
  V = operating voltage
  F = clock frequency

CPU type         Power          Constraint
Mobile           < a few W      Battery usage
Laptop           < 10s of W     Battery + heat
Desktop/Server   < 100 W        Cooling
Supercomputer    < 10s of MW    Cooling + electricity

What happened to Power?
- Voltages used to go down:
  - From 5 V (1970’s) to 1 V (2000’s)
- Power ∝ V²F
  - But, voltage is squared!
- Great! Power went down: 25x reduction!
  - Gave us enough room to increase clock frequency
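The 25x figure follows directly from the V²F relation. A minimal sketch of the arithmetic, assuming capacitance and clock frequency are held constant and normalized to 1 (the function name `dynamic_power` is only illustrative):

```python
def dynamic_power(voltage, frequency, capacitance=1.0):
    """Relative dynamic power, using Power ∝ C * V^2 * F."""
    return capacitance * voltage ** 2 * frequency

# Same (normalized) clock frequency, supply voltage scaled from 5 V to 1 V.
power_at_5v = dynamic_power(voltage=5.0, frequency=1.0)
power_at_1v = dynamic_power(voltage=1.0, frequency=1.0)

reduction = power_at_5v / power_at_1v
print(f"Power reduction from 5 V to 1 V: {reduction:.0f}x")

# The same factor is the headroom that was available for raising the clock
# at 1 V without exceeding the original 5 V power budget.
headroom = power_at_5v / dynamic_power(voltage=1.0, frequency=1.0)
print(f"Frequency headroom at constant power: {headroom:.0f}x")
```

Because voltage enters squared, a 5x drop in voltage buys a 25x drop in power, which is the room that earlier generations spent on higher clock frequencies.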

Voltages are not going down!
[Figure: CPU supply voltage (0 to 1.2 V) vs. year, 2001-2026; legend: Today, Projections 2001, 2013; annotated slopes: .014 and .053. Source: ITRS]

New Moore’s Law: more cores
2x transistors => 2x cores
[Figure: transistors (x1000) for the Intel family of processors: 4004, 8086, 286, 386, 486, Pentium Pro, Pentium 4, Pentium M, Dual Core, Quad Core]

And with more cores, come simpler cores…
- Prius instead of Audi
- Each core fewer joules/op
- Need parallel software!
With fixed voltages & clocks:
- Parallelism/concurrency is the only solution to more transistors
- Processor @ 100 W
- Lots of tiny “cores” (i.e., CPUs)
[Figure: Conventional Server CPU (e.g., Xeon) next to a Modern Multicore/Manycore CPU or GPU (e.g., Tilera)]

Terms of interest
- Multicores & Multiprocessors
- Cache Coherence
- Memory Ordering
- Synchronization
- Interconnects

Hardware Jargon
- Processor & core used interchangeably
- Cores run threads
- Each chip has multiple cores
- Cores may be heterogeneous
  - CPU cores vs. GPU cores vs. accelerators
- Each board can have multiple chips/sockets
  - Mobile platforms usually a single chip/single board
- Each platform can have multiple boards
  - Datacenters 10’s to 100’s of thousands of boards

Issues: Where are we going?
- How do we connect the cores together?
- How do we make sure they see one copy of memory (or something we can understand & program)?
- How do we implement synchronization?
- How do we build cores that run multiple operations with a single instruction?
- How do we build cores that run multiple threads?
- What is inside a GPU?

Review: Sources of Cache Misses
- Computer Organization 3 C’s: Compulsory, conflict, capacity misses
- 4th type of miss:
  - Coherence: reads/writes from multiple processors
=> 4 C’s: Compulsory, conflict, capacity, coherence misses

Why Cache Coherence?
- Initially x = 0
- A and B both read and cache address x (with value = 0)
- A writes to x, new value is 42
  - Updates its cache, following reads by A return 42
- B re-reads x, should read 0 or 42?
- How does B find out there was a change?

Processor Issues Load Request
[Figure: a processor issues "load x"; caches and memory (holding the data) are connected by a bus]

Memory Responds
[Figure: memory answers "Got it!" and supplies the data over the bus]

Processor Issues Load Request
[Figure: a processor issues "load x"; one cache and memory already hold the data]

Other Processor Responds
[Figure: the other cache answers "Got it" and supplies the data over the bus]

Modify Cached Data
[Figure: a processor issues "store x" on cached data; "Oops!" flags the other copies of the data on the bus]

Now Store
Invariant: Only one changed value in the system!
[Figure: a processor issues "store x"; caches and memory are connected by a bus]
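The load/store walk-through above can be sketched as a toy simulation. This is only an illustration under stated assumptions, not the protocol of any real machine: the `Cache` and `Bus` classes are hypothetical, the caches are write-through (so memory always holds the latest value), and coherence is modeled as simple invalidate-on-write.

```python
class Cache:
    """Hypothetical private cache: maps address -> cached value."""
    def __init__(self, name):
        self.name = name
        self.lines = {}


class Bus:
    """Hypothetical shared bus; invalidates other copies on a store if coherent."""
    def __init__(self, memory, caches, coherent):
        self.memory = memory
        self.caches = caches
        self.coherent = coherent

    def load(self, cache, addr):
        if addr not in cache.lines:               # miss: fetch from memory
            cache.lines[addr] = self.memory[addr]
        return cache.lines[addr]                  # hit: return the cached copy

    def store(self, cache, addr, value):
        if self.coherent:
            for other in self.caches:             # invalidate every other copy
                if other is not cache:
                    other.lines.pop(addr, None)
        cache.lines[addr] = value
        self.memory[addr] = value                 # write-through (assumption)


def b_rereads_x(coherent):
    a, b = Cache("A"), Cache("B")
    bus = Bus(memory={"x": 0}, caches=[a, b], coherent=coherent)
    bus.load(a, "x")                              # A caches x = 0
    bus.load(b, "x")                              # B caches x = 0
    bus.store(a, "x", 42)                         # A writes 42
    return bus.load(b, "x")                       # B re-reads x


print("without coherence, B reads", b_rereads_x(coherent=False))
print("with invalidation, B reads", b_rereads_x(coherent=True))
```

Without invalidation B keeps hitting its stale copy; with it, B's copy is removed by A's store, so B misses and fetches the new value.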

Memory Ordering
- Programs are written assuming
  - Instructions are executed in program order
  - Instruction execution is atomic
- What are possible interleavings of the loads/stores?
[Figure: processors issuing "load x", "store y", "load y" through caches on a bus]

Example: Memory Ordering
In the previous example:
- Assume x and y have the value 0 to begin with
- Both stores write the value 1 into x and y
- What are all possible state values read in various interleavings?
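One way to answer the question is to enumerate every interleaving by brute force. The exact instruction sequence is not recoverable from the transcript, so the sketch below assumes a common two-thread pattern (thread 0: store x, load y; thread 1: store y, load x) with in-order, atomic execution. The names `T0`, `T1`, `r0` and `r1` are illustrative.

```python
# Assumed program (not taken from the slide): each thread stores 1 to one
# variable, then loads the other variable into a register.
T0 = [("store", "x", 1), ("load", "y", "r0")]
T1 = [("store", "y", 1), ("load", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each thread's program order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one interleaving, one atomic instruction at a time."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, addr, arg in schedule:
        if op == "store":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["r0"], regs["r1"]


schedules = list(interleavings(T0, T1))
outcomes = sorted({run(s) for s in schedules})
print(len(schedules), "interleavings, (r0, r1) outcomes:", outcomes)
```

Under these assumptions there are 6 interleavings and three distinct outcomes; (0, 0) never appears, because at least one store always precedes the other thread's load.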

WHAT IF I TOLD YOU
That instructions are NOT executed in program order OR atomically?

Memory Ordering
- Loads/stores are not atomic
  - Stores are written in buffers and not visible by all
  - Caches allow multiple outstanding misses
- Instructions are not executed in program order
  - Cores execute out of order
  - Compiler moves instructions around
- Simple model is “Sequential Consistency”
  - we need to define more realistic models
  - …and to implement them
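A minimal sketch of why buffered stores break the simple model. It assumes the same hypothetical two-thread program as before (core 0: store x, load y; core 1: store y, load x) and models each core's store buffer as a private dictionary that drains to memory only later; real store buffers are more involved.

```python
def run_with_store_buffers():
    mem = {"x": 0, "y": 0}
    buf0, buf1 = {}, {}               # per-core store buffers, private to each core

    buf0["x"] = 1                     # core 0: store x = 1 (sits in its buffer)
    buf1["y"] = 1                     # core 1: store y = 1 (sits in its buffer)

    # A load checks the core's own buffer first, then memory.
    r0 = buf0.get("y", mem["y"])      # core 0: load y, cannot see core 1's buffer
    r1 = buf1.get("x", mem["x"])      # core 1: load x, cannot see core 0's buffer

    mem.update(buf0)                  # buffers drain to memory afterwards
    mem.update(buf1)
    return r0, r1, mem


r0, r1, final_mem = run_with_store_buffers()
print("r0 =", r0, "r1 =", r1, "final memory =", final_mem)
```

Both loads return 0 even though both stores eventually reach memory, an outcome that no in-order, atomic interleaving of the four instructions can produce.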

Synchronization
- A spectrum of synchronization primitives is defined
  - Lock, compare and swap, etc
  - Barrier
  - Messages
  - Higher level constructs: Transactional Memory (in Haswell, etc)
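Two of the primitives listed above, a lock and a barrier, can be illustrated with Python's standard `threading` module. This is a software-level sketch only; the thread count and iteration count are arbitrary, and it says nothing about how the hardware implements the primitives.

```python
import threading

NUM_THREADS = 4
INCREMENTS = 1000

counter = 0
lock = threading.Lock()
barrier = threading.Barrier(NUM_THREADS)
seen_after_barrier = []


def worker():
    global counter
    for _ in range(INCREMENTS):
        with lock:                    # lock: one thread at a time updates counter
            counter += 1
    barrier.wait()                    # barrier: wait until every thread is done
    with lock:
        seen_after_barrier.append(counter)


threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("final counter:", counter)
print("values seen after the barrier:", seen_after_barrier)
```

The lock makes each read-modify-write of `counter` indivisible, and the barrier guarantees that no thread reads the total until all threads have finished incrementing.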

Interconnects
- Bus
  - Broadcast medium
  - Connects a few cores/chips
  - Few wires => low B/W
- Network
  - Packet-switched medium
  - Non-atomic transactions
  - Scalable (lots of connectivity)
  - Connecting more cores/chips
  - Many wires => high B/W, but multiple steps => higher latency

Modern mobile chip: Heterogeneous
- Specialized processors everywhere
  - Maximize ops/sec
  - Minimize joules/op
  - Exploit parallelism
- Example cores:
  - Superscalar
  - Graphics
  - Media
  - Security
[Figure: NVIDIA Tegra 2]

Modern tablet chip: CPU + GPU
- A few CPUs
  - General-purpose computation
- One big Vector accelerator
  - Multimedia
  - Graphics
  - Games
  - SIMD computation
- Coherent shared memory
  - CPUs + Accelerator can communicate on-chip
[Figure: AMD Fusion]

Modern servers: NUMA (non-uniform access)
- Multiple CPUs/chip
- Multiple levels of memory
- Multiple levels of network
[Figure: chips containing cores with L2 caches joined by a NoC; each chip has its own DRAM; chips are linked by an interconnect]

NUMA systems
- Non-uniform latency to memory
  - First-level cache ~ 2 cycles ( => 4 !)
  - Second-level cache ~ 10 cycles
  - Third-level cache ~ 50 cycles
  - Off-chip own DRAM ~ 150 cycles
  - Off-chip others’ DRAM ~ 300 cycles
- Contention everywhere
  - On-chip network
  - Off-chip network
  - Cache ports, memory channels and cache/memory controllers
- Data placement is everything!
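A rough sketch of why data placement matters, using the latencies listed above. The fractions of accesses served at each level are assumptions chosen only for illustration; they are not from the slides, and contention is ignored.

```python
# Latencies in cycles, as listed on the slide.
LATENCY = {"L1": 2, "L2": 10, "L3": 50, "own DRAM": 150, "other DRAM": 300}

# Assumed fraction of accesses served by each level (illustrative only).
FRACTION = {"L1": 0.90, "L2": 0.06, "L3": 0.03, "DRAM": 0.01}


def average_latency(dram_level):
    """Average cycles per access when DRAM accesses go to `dram_level`."""
    cache_part = sum(FRACTION[level] * LATENCY[level] for level in ("L1", "L2", "L3"))
    return cache_part + FRACTION["DRAM"] * LATENCY[dram_level]


local = average_latency("own DRAM")
remote = average_latency("other DRAM")
print(f"data in own DRAM:    {local:.2f} cycles/access")
print(f"data in others' DRAM: {remote:.2f} cycles/access")
print(f"slowdown from remote placement: {remote / local:.2f}x")
```

Even with only 1% of accesses reaching DRAM in this assumed mix, placing the data in another chip's DRAM raises the average access time noticeably, and the gap widens as the cache hit fractions drop.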

Example: AMD Quad Opteron
- All coherence and multiprocessing glue in processor chip & module
- Highly integrated, targeted at high volume
- Low latency, moderate bandwidth

Summary
- Multiprocessor architecture
  - Cache coherence
  - Memory ordering
  - Interconnects
  - Multithreading/Vector/GPU
- Lots of fundamental tradeoffs to consider in design
- Lots and lots of critical techniques to handle multiprocessor problems
- Lots of interesting examples of good/bad design choices