Where did it all start ENIAC picture from

  • Slides: 33
Download presentation

Where did it all start? ENIAC [picture from Wikipedia] n n n At Penn

Where did it all start? ENIAC [picture from Wikipedia] n n n At Penn Lt Gillon, Eckert and Mauchley Cost $486, 804. 22, in 1946 (~$6. 3 M today) 5000 ops/second 19 K vacuum tubes Power = 200 K Watts 67 m 3, 27 tons(!) 2

Where are we ~today? Intel’s Haswell-EX 5. 6 billion transistors n 12 cores n

Where are we ~today? Intel’s Haswell-EX 5. 6 billion transistors n 12 cores n 45 MB L 3 cache n 2. 5 GHz n Roughly 165 W Newer i 9: n 8 cores, 16 threads n 3. 6 ->5 GHz clock n TDP 95 W n 16 MB Cache n 3 core core core 662 mm 2

2 UDC ERC Vision E The chips got bigger but not faster!!!! Transistors 100000

2 UDC ERC Vision E The chips got bigger but not faster!!!! Transistors 100000 (100, 000's) 10000 Performance (GOPS) 1000 10 ~15%/year 1 0. 01 > 50%/year 0. 001 1985 4 1990 1995 2000 2005 2010 2015 2020

2 UDCkeep Why? EMust power at ~100 W! ERC Vision Transistors (100, 000's) 100000

2 UDCkeep Why? EMust power at ~100 W! ERC Vision Transistors (100, 000's) 100000 10000 Power (W) Performance (GOPS) Efficiency (GOPS/W) 1000 10 1 0. 01 0. 001 1985 1990 Era of Uniprocesors 5 1995 2000 c. 2005 2010 2015 Era of Multiprocessors 2020

What is Power? For fixed Capacitance: Power V = Operating voltage F = Clock

What is Power? For fixed Capacitance: Power V = Operating voltage F = Clock frequency cpu V 2 F 6 CPU type Power Constraint Mobile < a few W Battery usage Laptop < 10 s of W Battery + heat Desktop/Server < 100 W Cooling Supercomputer < 10 s of W Cooling + electricity

What happened to Power? n Voltages used to go down: n n n 7

What happened to Power? n Voltages used to go down: n n n 7 From 5 v (1970’s) to 1 v (2000’s) Power V 2 F But, voltage is squared! Great! Power went down: 25 x reduction! Gave us enough room to increase clock frequency

Voltages are not going down! CPU Supply Voltage 1. 2 1 Today 0. 8

Voltages are not going down! CPU Supply Voltage 1. 2 1 Today 0. 8 Projection s 2001 0. 6 0. 4 0. 2 Slope =. 014 Slope =. 053 0 2001 2006 2011 2016 2021 2026 8 2013 [source: ITRS

New Moore’s Law: more cores Transistors (x 1000) Example: Intel family of processors 286

New Moore’s Law: more cores Transistors (x 1000) Example: Intel family of processors 286 Intel 4004 8086 Quad Core Dual Core Pentium M Pentium 4 Pentium Pro 486 386 cpu 2 x transistors 2 x cores 9 cpu cpu

Conventional Server CPU (e. g. , Xeon) And with more cores, come simpler cores…

Conventional Server CPU (e. g. , Xeon) And with more cores, come simpler cores… n n n 10 Prius instead of Audi Each core fewer joules/op Need parallel software! Modern Multicore/ or GPU (e. g. , Tilera) With fixed voltages & clocks: n. Parallelism/concurrency is the only solution to more transistors n. Processor @ 100 W n. Lots of tiny “cores” (i. e. , CPUs) Manycore CPU

Terms of interest n n n 11 Multicores & Multiprocessors Cache Coherence Memory Ordering

Terms of interest n n n 11 Multicores & Multiprocessors Cache Coherence Memory Ordering Synchronization Interconnects

Hardware Jargon n n Processor & core used interchangeably Cores run threads Each chip

Hardware Jargon n n Processor & core used interchangeably Cores run threads Each chip has multiple cores Cores may be heterogeneous n n n Each board can have multiple chips/sockets Each platform can have multiple boards n n 12 CPU cores vs. GPU cores vs. accelerators Mobile platforms usually a single chip/single board Datacenters 10’s to 100’s of thousands of boards

Issues: Where we are going? n n n 13 How do we connect the

Issues: Where we are going? n n n 13 How do we connect the cores together? How do we make sure they see one copy of memory (or something we can understand & program)? How do we implement synchronization? How do we build cores that run multiple operations with a single instruction? How do we build cores that run multiple threads? What is inside a GPU?

Review: Sources of Cache Misses n Computer Organization 3 C’s: Compulsory, conflict, capacity misses

Review: Sources of Cache Misses n Computer Organization 3 C’s: Compulsory, conflict, capacity misses n 4 type of miss: n th n Coherence: reads/writes from multiple processors => 4 C’s: Compulsory, conflict, capacity, coherence misses 14

Why Cache Coherence? n n n Initially x = 0 A and B both

Why Cache Coherence? n n n Initially x = 0 A and B both read and cache address x (with value = 0) A writes to x, new value is 42 n 15 Updates its cache, following reads by A return 42 n B re-reads x, should read 0 or 42? n How does B find out there was a change?

Processor Issues Load Request load x cache Bus memory 16 data

Processor Issues Load Request load x cache Bus memory 16 data

Memory Responds cache Bus Got it! memory 17 data

Memory Responds cache Bus Got it! memory 17 data

Processor Issues Load Request load x data cache Bus memory 18 data

Processor Issues Load Request load x data cache Bus memory 18 data

Other Processor Responds Got it data cache Bus memory 19 data

Other Processor Responds Got it data cache Bus memory 19 data

Modify Cached Data store x Oops! data cache Bus memory 20 data

Modify Cached Data store x Oops! data cache Bus memory 20 data

Now Store Invariant: Only one changed value in the system! data cache data store

Now Store Invariant: Only one changed value in the system! data cache data store x cache Bus memory 21 data

Memory Ordering u Programs are written assuming l l Instructions are executed in program

Memory Ordering u Programs are written assuming l l Instructions are executed in program order Instruction execution is atomic load x store y load y cache Bus u 22 What are possible interleavings of the loads/stores?

Example: Memory Ordering In the previous example: n Assume x and y have the

Example: Memory Ordering In the previous example: n Assume x and y have the value 0 to begin with n Both stores write the value 1 into x and y n What are all possible state values read in various interleavings? 23

WHAT IF I TOLD YOU 24 That instructions are NOT executed in program order

WHAT IF I TOLD YOU 24 That instructions are NOT executed in program order OR atomically?

Memory Ordering n Loads/stores are not atomic n n n Instructions are not executed

Memory Ordering n Loads/stores are not atomic n n n Instructions are not executed in program order n n n Cores execute out of order Compiler moves instructions around Simple model is “Sequential Consistency” n n 25 Stores are written in buffers and not visible by all Caches allow multiple outstanding misses we need to define more realistic models …and to implement them

Synchronization n A spectrum of synchronization primitives is defined n n 26 Lock, compare

Synchronization n A spectrum of synchronization primitives is defined n n 26 Lock, compare and swap, etc Barrier Messages Higher level constructs: Transactional Memory (in Haswell, etc)

Interconnects n Bus n n Network n n n 27 Broadcast medium Connects a

Interconnects n Bus n n Network n n n 27 Broadcast medium Connects a few cores/chips Few wires => low B/W Packet-switched medium Non-atomic transactions Scalable (lots of connectivity) Connecting more cores/chips Many wires => high B/W, but multiple steps => higher latency memory

Modern mobile chip: Heterogeneous n n n Specialized processors everywhere Maximize ops/sec Minimize joules/op

Modern mobile chip: Heterogeneous n n n Specialized processors everywhere Maximize ops/sec Minimize joules/op Exploit parallelism Example cores: n n 28 Superscalar Graphics Media Security n. Vidia Tegra 2

Modern tablet chip: CPU + GPU n A few CPUs n n One big

Modern tablet chip: CPU + GPU n A few CPUs n n One big Vector accelerator n n n Multimedia Graphics Games SIMD computation Coherent shared memory n 29 General-purpose computation CPUs + Accelerator can communicate on-chip AMD Fusion

n core L 2 L 2 No. C L 2 L 2 core In

n core L 2 L 2 No. C L 2 L 2 core In core L 2 L 2 No. C L 2 L 2 core DRAM 30 core ct ter co nn e DRAM n core DRAM n Multiple CPUs/chip Multiple levels of memory Multiple levels of network DRAM Modern servers: NUMA (non-uniform access)

NUMA systems n Non-uniform latency to memory n n n First-level cache ~ 2

NUMA systems n Non-uniform latency to memory n n n First-level cache ~ 2 cycles ( => 4 !) Second-level cache ~ 10 cycles Third-level cache ~ 50 cycles Off-chip own DRAM ~ 150 cycles Off-chip others’ DRAM ~ 300 cycles Contention everywhere n n n On-chip network Off-chip network Cache ports, memory channels and cache/memory controllers Data placement is everything! 31

Example: AMD Quad Opteron n 32 All coherence and multiprocessing glue in processor chip

Example: AMD Quad Opteron n 32 All coherence and multiprocessing glue in processor chip & module Highly integrated, targeted at high volume Low latency, moderate bandwidth

Summary n Multiprocessor architecture n n n n 33 Cache coherence Memory ordering Interconnects

Summary n Multiprocessor architecture n n n n 33 Cache coherence Memory ordering Interconnects Multithreading/Vector/GPU Lots of fundamental tradeoffs to consider in design Lots and lots of critical techniques to handle multiprocessor problems Lots of interesting examples of good/bad design choices