Exascale Computing
July 1, 2016, Sung Bae Park
ICR Revision: 2016.6.29 / 2016.6.30 / 2016.7.1
Outline
• Trend
• Direction
• Future
Trend
IT Waves
• Contents & Service: Rich, High-Quality Ubiquitous Web & Real-Life VR
• Network & Device: Variety of Topology & Network Infra, 5G
[Figure: 1980-2017 timeline of IT waves across contents (Web 1.0 through Web 4.0, multimedia from 1D color through FHD, stereoscopic/multi-view 3D VR, super MV, hologram), network (1G AMPS analog, 2G GSM CDMA, 3G phone, 4G HSPA/MIMO/OFDM, 5G, UPN*) and computing & memory (single core, multi core, GPGPU*, massive core, ADV CPU, DSP, HPP*), spanning PC, TV, mobile, note/pad, smart, IoT, big data, 3D printer and AI]
* UPN: ULP Personal Networking   * GPGPU: General-Purpose GPU   * HPP: Hybrid Parallel Processor
Exascale Computing
• More than Moore (100x in 10 years, 2x/18 mo): 1000x in 10 years, 2x/12 mo
  ※ 1 GFLOPS ('00 CPU) → 1 TFLOPS ('10 GPU) → 1 PFLOPS ('20) → 1 EFLOPS ('30)
• Crisis in Power, Efficiency and SW: China's Sunway TaihuLight

  Processor         Architecture  Company          Year  Speed     Cores/Chip  FLOPS
  SW26010           CPU           Shanghai HPICDC  2016  1.45 GHz  260         2.6 T
  Xeon E5-2600      CPU           Intel            2015  3.7 GHz   18          1 T
  P100              GPU           Nvidia           2015  1.5 GHz   3584        10 T
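The two growth rates above can be checked with a line of arithmetic: "100x in 10 years" works out to a doubling roughly every 18 months (classic Moore), while the 1000x per decade needed for the GFLOPS-to-EFLOPS trajectory requires a doubling every 12 months. A minimal sketch (the function name is mine):

```python
# Doubling periods implied by the slide's two growth rates:
# "100x in 10 years" (Moore) vs the "1000x in 10 years" needed to go
# 1 GFLOPS ('00) -> 1 TFLOPS ('10) -> 1 PFLOPS ('20) -> 1 EFLOPS ('30).
import math

def doubling_period_months(growth_factor, years):
    """Months per 2x, given total growth over a span of years."""
    doublings = math.log2(growth_factor)
    return years * 12 / doublings

moore = doubling_period_months(100, 10)      # ~18 months per 2x
exascale = doubling_period_months(1000, 10)  # ~12 months per 2x
print(f"100x/10yr  -> 2x every {moore:.1f} months")
print(f"1000x/10yr -> 2x every {exascale:.1f} months")
```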
Direction
Crisis in Speed & Power
• 180 nm, 1 GHz: Samsung/DEC EV6 CPU in 1999; 14 nm yet only 4 GHz in 2016!!!
• No power-efficiency improvement with Si scaling, due to unscalable VDD (~0.4 V)
[Figure: Data Center Power & Cost]
PPA Crisis in Computing Efficiency
• Massive cores on a chip
  ※ Si scaling changes the unit of integration: TR → gate/cell array → core array
• More cores on a chip, less efficiency of computing
[Figure: spectrum from general to specific applications, trading compiler support against dedicated-hardware cost]
  - General applications: general-purpose processor (PC/server CPU, mobile CPU), general-purpose controller (MCU)
  - Homogeneous chip-multiprocessor, many cores, array processor
  - Programmable hardware: FPGA, reconfigurable systems
  - Reconfigurable processor: massive CPU/GPU/DSP/HW cores, run-time-reconfigurable massive cores
  - Specific applications: special-purpose processors with domain-specific ISAs: DSP (Digital Signal Processor), GPU (Graphics Processor), NPU (Network Processor), NMP (Neuromorphic Processor)
Crisis in SW
• Direct API for minimum memory transfer
• EPIC: the market begging for a CPU-like programmable GPU, for differentiation and productivity

              1-Core CPU   500-Core GPU   500-Core FPPA
  Parallel    200          5.4            < 6
  Sequential  200          750            < 220

Easier HW beats faster HW!
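The table's point ("easier HW beats faster HW") can be sketched with a toy two-phase model: treat the table entries as per-phase execution times and run the phases back to back. The slide does not state units or methodology, so the back-to-back combination and variable names below are illustrative assumptions, with the FPPA upper bounds (< 6, < 220) taken at their limits:

```python
# Toy model: a GPU that crushes the parallel phase but stalls on sequential
# code can lose overall to hardware that is merely decent at both.
def total_time(par_time, seq_time):
    # Run the parallel and sequential phases back to back (assumed model)
    return par_time + seq_time

cpu_1core    = total_time(par_time=200, seq_time=200)   # slide's CPU column
gpu_500core  = total_time(par_time=5.4, seq_time=750)   # fast parallel, slow sequential
fppa_500core = total_time(par_time=6,   seq_time=220)   # "easier" HW: decent at both
print(cpu_1core, gpu_500core, fppa_500core)
```

On these numbers the 500-core FPPA finishes in 226 units while the nominally faster GPU needs 755.4, which is the slide's argument in miniature.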
Future
Key Enabler for Next Wave
• x86 CPU: highly programmable & high performance, but power & price
• ARM + HW SoC: low power & price, medium performance, but programmability
• Massive Cores: highly programmable & high performance, low power & price
[Figure: market inflection points, 1980-2017: IBM PC on Intel x86 CPU (PC CPU, $50B), Nokia phone on ARM CPU (HW SoC, $100B), creative consumer on Massive Cores (Smart SoC, $250B)]
• x86 drivers: x86-binary-compatible mass infra for IHV/ISV; high performance (3-4 GHz, 6-24 cores). Obstacles: ~100 W power, ~$100 price, memory bottleneck
• ARM + HW SoC drivers: low-power, low-price CPU plus low-power, low-price dedicated HW IPs; ARM mass infra for IHV/ISV. Obstacles: HW IPs have no programmability; CPU/DSP incur 10x the power, price & memory bottleneck of HW IPs
• Massive Cores drivers: extreme P4 (Price, Power, Performance & Programmability) enabled by massive-core-based RTR* FPPA, an on-chip dynamic compiler, an on-chip kernel, and system SW such as GCD and MapReduce
  * RTR: Run-Time-Reconfigurable
Smart SoC based on RTR FPPA
• Run-time-reconfigurable heterogeneous Field Programmable Processor Array
• Run-time-reconfigurable 2D/3D vector-accessible memories
  - Fine-thread CPU: x86/ARM, on-chip dynamic compiler, on-chip kernel
  - Mid-thread DSP: SIMD/vector
  - Massive-thread GPU: SIMT
  - X-Y stack register file: multi-GHz, MB, wide I/O
  - Reconfigurable buses: low-swing wide I/O
  - Reconfigurable memories: multi-GHz, GB, wide I/O
  - FPGA for special IP & I/O: HDMI, SerDes, ...
  - Design methodology: structured custom to SoC; PM: PG/CG with DVFS
  - Tool chains: integrated compiler, system simulator
  - Seamless platform: open OS to std. drivers; OpenCL, MPI, GCD; total solutions
  - Device: 0.4 V, 3 mA, 1 pA @ 14 nm
  - Analog IPs: low-swing bus driver/receiver, high-Q PLLs
  - Package: 3D integration
100 mW 1 TFLOPS for Exascale Computing in 2020
• More than Moore: challenge to HW-ASIC-level massive cores
• 1/10,000 power revolution in 10 years: ExaFlop @ 230 kW instead of ExaFlop @ 70 MW
  - 1/100 from scaling, plus an additional 1/100 from innovation in HW-like run-time-reconfigurable computing
• 1/30 power efficiency in 5 years: 1/3 from computing, 1/10 from scaling
• Si technology for 0.1 V ELV devices & circuits
• Multi-GHz, multi-GB reconfigurable memory with 2D/3D vectored access
• Exabyte/sec 3D integration
[Figure: log-log plot of FLOPS (1 G to 1 E) vs. power (0.01 to 1000 W), 2005-2020, with iso-efficiency lines from 1 GFLOPS/100 mW to 1 PFLOPS/100 mW, trend lines for PC CPU, mobile CPU, GPGPU and HW ASIC, a PetaFlop @ 2.3 MW reference point, and Moore's law at x2/18 months (x10/5 years)]
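The slide's power numbers hang together arithmetically: a PetaFlop-class machine at roughly 2.3 MW, scaled 1000x at constant efficiency, would draw about 2.3 GW; applying the claimed 1/10,000 power revolution (1/100 from scaling times 1/100 from reconfigurable-computing innovation) lands at 230 kW, the "ExaFlop @ 230 kW" target. A quick check, assuming the straightforward reading of those figures:

```python
# Arithmetic behind the slide's power targets.
peta_watts = 2.3e6               # PFLOPS-class machine at ~2.3 MW (slide)
exa_naive = peta_watts * 1000    # 1000x performance at constant efficiency
exa_target = exa_naive / 10_000  # 1/100 scaling x 1/100 innovation
print(f"naive ExaFlop: {exa_naive/1e9:.1f} GW, target: {exa_target/1e3:.0f} kW")
```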
Acknowledgements
The author would like to thank Dan Dobberpuhl (founder of SiByte, PA Semi), David Ditzel (founder of Transmeta), Jim Keller (DEC EV6 chief architect), Anantha Chandrakasan (MIT), Dimitri Antoniadis (MIT), Li-Shiuan Peh (MIT), Shekhar Borkar (Intel Fellow), Le Nguyen (founder of AIT), Peter Song (founder of Montalvo Systems) and Derek Lentz (GPU architect) for their valuable comments and advice, which enabled this presentation.
Appendix
Movie Quality Virtual Reality
• Fully SW pipeline (CPU-like, programmable, random, dynamic): procedural primitives, conditional RI evaluation, traced deep shadows, physically plausible shaders, face-varying class specifiers, organized point clouds, ambient occlusion, true RiSphere primitives, blobby implicit surfaces, procedural PQ illumination, PQ shaders
• OpenGL/D3D API (GPU + HW, GPGPU-like, fixed/programmable, regular/random, static/dynamic): polygon rasterization modeling
Live Computer Vision
Wave Kanizsa Triangle
0.4 V 3.5 mA/um MOSFET: Diffusion to Ballistic
• Intel 14 nm FinFET, 2014 IEDM
• Peking Univ. 9 nm DG FET, 2015 IEEE EDSSC
0. 4 V 134 W 14 nm 42 GHz A-CPU
0.2 V 1 mA/um MOSFET: Ballistic to Tunneling
• Peking Univ. 9 nm DG FET, 2015 IEEE EDSSC
• Chenming Hu, 40 nm, 2008 VLSI-TSA
FQHE: Zero Resistance Zero Power
• Fractional Quantum Hall Effect @ certain magnetic field
• "Sharp resonance" as impedance matching and/or superconductor

In 1980, Klaus von Klitzing [103] found that at temperatures of only a few Kelvin and high magnetic field (3-10 Tesla), the Hall resistance did not vary linearly with the field. Instead, he found that it varied in a stepwise fashion. It was also found that where the Hall resistance was flat, the longitudinal resistance disappeared. This dissipation-free transport looked very similar to superconductivity. The field at which the plateaus appeared, or where the longitudinal resistance vanished, quite surprisingly, was independent of the material, temperature, or other variables of the experiment, but only depended on a combination of fundamental constants, h/e². The quantization of resistivity seen in these early experiments came as a grand surprise and would lead to a new international standard of resistivity, the Klitzing, defined as the Hall resistance measured at the fourth step. By 1982, semiconductor technology had greatly advanced and it became possible to produce interfaces of much higher quality than were available only a few years before. That same year, Horst Stormer and Dan Tsui [105] repeated Klitzing's earlier experiments with much cleaner samples and higher magnetic fields. What they found was the same stepwise behavior as seen previously, but to everyone's surprise, steps also appeared at fractional filling factors ν = 1/3, 1/5, 2/5, ... Strongly correlated systems are notoriously difficult to understand, but in 1983, Robert Laughlin [106] proposed his now celebrated ansatz for a variational wavefunction which contained no free parameters. [Cooper Pairs to Molecules: J. N. Milstein]
PPA Crisis: Learn from Dedicated HW IP

"Years of research in low-power embedded computing have shown only one design technique to reduce power: reduce waste." - Mark Horowitz, Stanford University & Rambus Inc.

                              CPU     GPU     mCPU    DSP     HW IP I  HW IP II
  Power (W)                   60      80      0.6     0.24    0.12     0.015
  Performance (# of H.264)    1       2       0.1     1       1        1
  Area (mm²)                  200     400     10      3       2        0.5
  PPA                         8.3E-5  6.3E-5  1.6E-3  1.4     4.2      133.3
  PPA vs. CPU                 1       0.75    19      1.7E4   5.0E4    1.6E6

Reduce power: reduce waste (wasted Si area, wasted computation, wasted bandwidth, wasted voltage, wasted design resources).
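The table's figure of merit is PPA = Performance / (Power x Area), which reproduces the listed values. A small check, with two caveats: the rightmost column labels are partly my reconstruction of the garbled header, and the mCPU performance is taken as 0.01 H.264 streams, since the table's own PPA column (1.6E-3, i.e. 19x CPU) implies ~0.01 rather than the printed 0.1:

```python
# PPA = Performance / (Power * Area), per the slide's table.
rows = {
    #           Power(W)  Perf(H.264)  Area(mm^2)
    "CPU":      (60.0,    1.0,    200.0),
    "GPU":      (80.0,    2.0,    400.0),
    "mCPU":     (0.6,     0.01,   10.0),   # 0.01 assumed (see note above)
    "DSP":      (0.24,    1.0,    3.0),
    "HW IP I":  (0.12,    1.0,    2.0),
    "HW IP II": (0.015,   1.0,    0.5),
}
ppa = {n: perf / (power * area) for n, (power, perf, area) in rows.items()}
for n in rows:
    print(f"{n:9s} PPA = {ppa[n]:.2g}  ({ppa[n] / ppa['CPU']:.2g}x CPU)")
```

The recomputed column matches the slide: dedicated HW IP comes out about 1.6 million times more power- and area-efficient than a general-purpose CPU on this workload.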
Make HW IP Programmable: Reconfigurable Computing
• Reprogrammable FSM with microcode + domain-specific HW FU with ISA
• Extreme RISC in horizontal control, and extreme CISC in vertical data
  - Radio ISA RC-FU: 4G/5G modem, channel
  - Media ISA RC-FU: AV/image, 3D/ray-tracing/VR
  - Intelligence ISA RC-FU: recognition, mining, synthesis
Reconfigurable Memory
• HW IP's outstanding PPA comes from implicit, distributed, stacked queue memory
• Reconfigurable memory for HW-IP-level PPA
H.264 Luma Inter Prediction Algorithm
Worst case: 16x16 mode (position i)
• Vertical filtering (6-tap filter, 21x16): y = x0 - 5*x1 + 20*x2 + 20*x3 - 5*x4 + x5
  21x21 pixels (8-bit) → 21x16 pixels (16-bit)
• Horizontal filtering (6-tap filter + scaling, 17x16): z = ((y0 - 5*y1 + 20*y2 + 20*y3 - 5*y4 + y5) + 512) >> 10
  21x16 pixels (16-bit) → 17x16 pixels (8-bit)
• ¼-pel (16x16): r = (z0 + z1 + 1) >> 1
  17x16 pixels (8-bit) → 16x16 pixels (8-bit)
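The three filtering steps above can be sketched for a single output sample. This is a minimal illustration, not a full 21x21-block implementation: the horizontal pass is fed a degenerate flat neighborhood of identical intermediates, and the 0-255 clip reflects standard 8-bit H.264 behavior:

```python
# One-sample sketch of the slide's H.264 luma half/quarter-pel interpolation.
def tap6(a):
    """6-tap FIR with coefficients (1, -5, 20, 20, -5, 1)."""
    x0, x1, x2, x3, x4, x5 = a
    return x0 - 5*x1 + 20*x2 + 20*x3 - 5*x4 + x5

def clip8(v):
    """Clip to the 8-bit pixel range."""
    return max(0, min(255, v))

# Vertical filtering: six 8-bit pixels -> one 16-bit intermediate (no rounding yet)
col = [10, 20, 30, 40, 50, 60]
y = tap6(col)

# Horizontal filtering over six 16-bit intermediates, then round-scale-clip
ys = [y] * 6                      # flat neighborhood, for illustration only
z = clip8((tap6(ys) + 512) >> 10)

# Quarter-pel: bilinear average of two half-pel samples (here the same sample)
r = (z + z + 1) >> 1
print(y, z, r)
```

Note how the +512 and >>10 undo the two passes of 32x filter gain with rounding: the linear ramp 10..60 interpolates to 35, its midpoint value.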
2D Reconfigurable Memory
• No need to calculate addresses: implicit/local/distributed 32 x 64-bit RF
• X-Y bi-directional random access for extreme spatial locality in bit/pixel-stream applications
Example: H.264 FHD decoding, luma interpolation (vertical filtering, horizontal filtering, ¼-pel)
• Conventional: ~170 cycles total: data load & address generation (64-bit loads, adds), data shuffling, computation, data store & address generation (loop: II=2, loop count=16)
• X-Y stack: ~18 cycles total: load from stack, 64-bit FIR filter with round/saturate, 64-bit average with round/saturate, store to stack; no separate address generation
3D Reconfigurable Memory
• Computer vision, virtual reality, and AI all need depth processing: pixel-to-voxel processing
Run-Time-Reconfigurable Computing PPA
• On-chip run-time compiler & kernel
• One of the biggest overheads has been reconfiguration-memory PPA
• 3D XPoint can accelerate the RTR FPPA (Run-Time-Reconfigurable Field Programmable Processor Array)
(New York Univ., 2011 IEEE CVPR)
GPU Case
• 5120 cores x 1 GHz x 2 FLOP/core = ~10 TFLOPS
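The peak-throughput arithmetic is cores x clock x FLOP/core/cycle, where the 2 FLOP/cycle corresponds to a fused multiply-add counted as two operations:

```python
# Peak-throughput arithmetic for the GPU case on this slide.
cores, clock_hz, flop_per_cycle = 5120, 1e9, 2  # 2 = fused multiply-add
peak = cores * clock_hz * flop_per_cycle
print(f"{peak/1e12:.2f} TFLOPS")  # 10.24 TFLOPS, the slide's ~10 T
```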
Hide the Memory Latency
• Massive threading in GPUs hides long latency, but in a much more limited way
• Good for massively threaded applications, but very damaging for CSP and/or random control flow and/or less massively parallel applications
Crisis in Memory Latency
• CPU-like work: small threads
  - Detailed, sophisticated rendering; ray tracing; procedural graphics
  ※ Random-dynamic-irregular control flow & data access
  ※ Severe slowdown on PC/mobile GPUs
• Mobile-GPU-like work: medium threads
  - Some repetitive, some detailed; mid-size regular patterns
  ※ Half fixed-static-regular, half random-dynamic-irregular control flow & data access
  ※ Optimized for mobile GPUs
• PC-GPU-like work: massive threads
  - Repetitive, simple rendering; large chunks of regular patterns; rasterized modeling
  ※ Fixed-static-regular control flow & data access
  ※ Outstanding speedup on PC GPUs
Random Thread Variations
On-Chip DRAM Memory
• GHz random-access on-chip DRAM, but 5x larger Si area than commodity DRAM
  - Samsung 40 nm 2 Gb, 2011 ISSCC: 25 Mb/mm²
  - IBM 45 nm SOI eDRAM, 2010 CICC: 5 Mb/mm²
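The 5x area claim follows directly from the two cited densities:

```python
# Area-per-bit ratio from the densities cited on this slide.
commodity_mb_per_mm2 = 25  # Samsung 40 nm 2 Gb DRAM, ISSCC 2011
ondie_mb_per_mm2 = 5       # IBM 45 nm SOI eDRAM, CICC 2010
area_ratio = commodity_mb_per_mm2 / ondie_mb_per_mm2
print(f"on-chip DRAM needs {area_ratio:.0f}x the Si area per bit")
```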
Direct API: 10 → 1 File Copy for 10x PPA
• SW: minimize memory access with a freeway to compute: algorithm, applications, UI, many APIs, OS, drivers, cores
• Along with an on-chip API MMU to minimize file copy and transfer (Sep. 2013, AMD Mantle GPU)