Low-Power Scientific Computing
NDCA 2009
Ganesh Dasika, Ankit Sethia, Trevor Mudge, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan

The Advent of the GPGPU
• Growing popularity for scientific computing
  – Medical imaging
  – Astrophysics
  – Weather prediction
  – EDA
  – Financial instrument pricing
• Commodity item
• Increasingly programmable
  – Fermi
  – ARM Mali?
  – “Larrabee”?

Disadvantages of GPGPUs
• Gap between computation and bandwidth
  – 933 GFLOPS : 142 GB/s bandwidth (0.15 B of data per FLOP, ~26:1 compute:mem ratio; see the worked numbers below)
• Very high power consumption (inefficiencies in all general-purpose architectures):
  – Graphics-specific hardware
  – Several thread contexts
  – Large register files and memories
  – Fully general datapath
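A quick check of those numbers (a sketch; assuming the figures are a GTX 280-class peak single-precision rate and memory bandwidth, and counting 4-byte single-precision words):

$$\frac{142\ \text{GB/s}}{933\ \text{GFLOP/s}} \approx 0.15\ \text{B/FLOP}, \qquad \frac{933\ \text{GFLOP/s}}{142/4 \approx 35.5\ \text{Gword/s}} \approx 26\ \text{FLOPs per word fetched}$$

So the ~26:1 ratio is FLOPs performed per 4-byte operand delivered from memory.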

Goals
• Architecture for improved power efficiency for high-performance scientific applications
  – Reduced data center power
  – Improved portability for mobile devices
  – 100s of GFLOPS for 10-20 W
• GPU-like structure to exploit SIMD
• Domain-specific add-ons
• System design for best memory/performance balance
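For scale, the low end of that target is 100 GFLOPS / 10 W = 10 GFLOPS/W; a GTX 280, at 933 GFLOPS and a roughly 200-240 W board power (an assumption from public TDP figures, not from the talk), delivers closer to 4-5 GFLOPS/W.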

Performance vs. Compute Power
[Scatter plot: performance vs. compute power for the Tesla S1070, GTX 295, GTX 280, IBM Cell, Core i7, Core 2, Cortex A8, and Pentium M]

High Throughput at Low Power
• Medical domains
  – Image reconstruction
• Communications, signal processing
  – Real-time FFT for GPS receivers
  – Parity checking for WiMAX and WiFi
• Financial applications
  – Fluctuation analysis for various market indices
  – SDE- or Monte Carlo-based pricing models (see the sketch below)
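To make the character of these workloads concrete, here is a minimal, hypothetical Monte Carlo pricing kernel (illustrative code, not from the talk): a European call priced under geometric Brownian motion. Nearly every operation is an independent floating-point multiply-add, which is exactly what a wide SIMD FP pipeline exploits.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Draw a standard normal via the Box-Muller transform. */
static double gaussian(void) {
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * 3.14159265358979323846 * u2);
}

int main(void) {
    /* Illustrative parameters: spot, strike, rate, volatility, maturity. */
    const double S0 = 100.0, K = 100.0, r = 0.05, sigma = 0.2, T = 1.0;
    const int paths = 1000000;
    double sum = 0.0;
    for (int i = 0; i < paths; i++) {
        /* Terminal price under GBM:
         * S_T = S0 * exp((r - sigma^2/2)*T + sigma*sqrt(T)*Z). */
        double ST = S0 * exp((r - 0.5 * sigma * sigma) * T
                             + sigma * sqrt(T) * gaussian());
        sum += ST > K ? ST - K : 0.0;              /* call payoff, per path */
    }
    printf("price ~ %.4f\n", exp(-r * T) * sum / paths);  /* discounted mean */
    return 0;
}
```

Each path is data-independent, so the loop maps directly onto SIMD lanes.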

Application Analysis
• Primarily FP computation
• Significant memory usage (approx. 0.9 B/instr)
• Some complex address generation
[Bar chart: instruction mix (0-100%) for the ac, fdtd2d, sde, volatility, and mbir benchmarks, broken down into I-ALU, CF, AGU, Mem, SFU, and FPU]

Performance vs. Compute Power vs. Bandwidth
[Same scatter plot (S1070, GTX 295, GTX 280, IBM Cell, Core i7, Core 2, Cortex A8, Pentium M), annotated with memory bandwidth]
Bandwidth limited!! GPUs deliver ~0.15 B/instr vs. the ~0.9 B/instr these applications consume.

eco-GPGPU
• FP MAC pipeline (see the sketch below)
• Shuffle-swizzle networks
• Co-processors for math functions
• Significantly less power than NVIDIA GPUs
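The FP MAC (multiply-accumulate) pipeline is the centerpiece because the inner loops of these kernels reduce to chains of a = a + b*c. A minimal illustration (hypothetical code, not the eco-GPGPU ISA), using the standard C fma() to show the fused operation:

```c
#include <math.h>   /* fma(): fused multiply-add with a single rounding */

/* Dot product as a chain of multiply-accumulates. On a MAC pipeline
 * each iteration is one FP operation; a 64-wide SIMD core runs 64
 * such lanes per cycle, yielding 2 FLOPs per lane per cycle. */
double dot(const double *x, const double *y, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = fma(x[i], y[i], acc);   /* acc += x[i] * y[i], fused */
    return acc;
}
```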

Current Architecture
• 500 MHz @ 65 nm
• 64-way SIMD
• 64 GFLOPS
• ~2.5 W/core
• 1 TFLOP @ 16 cores, 40 W
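These figures are self-consistent if each SIMD lane completes one fused multiply-accumulate (counted as 2 FLOPs) per cycle, which is an assumption but the natural reading of the FP MAC pipeline:

$$64\ \text{lanes} \times 2\ \tfrac{\text{FLOPs}}{\text{lane} \cdot \text{cycle}} \times 0.5\ \text{GHz} = 64\ \text{GFLOPS}, \qquad 16 \times 64\ \text{GFLOPS} \approx 1\ \text{TFLOP at}\ 16 \times 2.5\ \text{W} = 40\ \text{W}$$

That is roughly 25 GFLOPS/W per core.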

Performance vs. Compute Power
[Scatter plot as before, with the eco-GPGPU added alongside the S1070, GTX 295, GTX 280, IBM Cell, Core i7, Core 2, Cortex A8, and Pentium M]

Memory System?
• Multiple eco-GPGPU cores will eventually hit the memory wall
• GPGPUs use 1,000s of thread contexts to hide latency
  – Too much area
  – Too much power
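The cost of that approach follows from Little's law: outstanding requests = request rate × latency. An upper-bound sketch, assuming up to one memory access per lane per cycle and the ~200-cycle DRAM latency cited on the next slide:

$$N_{\text{outstanding}} \approx 64\ \text{lanes} \times 1\ \tfrac{\text{req}}{\text{lane} \cdot \text{cycle}} \times 200\ \text{cycles} \approx 12{,}800\ \text{in-flight operations}$$

Covering that with thread contexts means thousands of register sets, hence the area and power objections above.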

Options for Memory System
• 3D-stacked DRAM?
  – Increased B/W
  – Reduced latency
  – Multi-threading not necessary
• Caches?
  – 32-64 KB required, assuming 200-cycle DRAM latency
  – Helps when temporal locality is present
• Pre-fetching, streaming? (see the sketch below)
  – Most data accesses are streaming/stride-based
  – Addresses predictable
• Compression?
  – Sparse-matrix data easily compressible
  – Somewhat application-specific
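Because the addresses are predictable, a small amount of state suffices to stream data ahead of use. Below is a minimal software model of a classic per-PC stride prefetcher (a hypothetical sketch, not the proposed hardware): remember each load's last address and stride, and prefetch ahead once the stride repeats.

```c
#include <stdint.h>

#define TABLE_SIZE    256
#define PREFETCH_DIST 4           /* strides ahead to fetch */

/* One entry per load PC: last address seen, last stride,
 * and whether the stride has repeated (confidence bit). */
typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confident;
} StrideEntry;

static StrideEntry table[TABLE_SIZE];

/* Call on every load; issues a prefetch when the access is striding. */
void on_load(uint64_t pc, uint64_t addr) {
    StrideEntry *e = &table[pc % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);
    e->confident = (stride == e->stride && stride != 0);
    e->stride    = stride;
    e->last_addr = addr;
    if (e->confident) {
        /* GCC/Clang hint; a hardware streaming engine would issue
         * these as DRAM burst requests instead. */
        __builtin_prefetch((const void *)(uintptr_t)(addr + PREFETCH_DIST * stride), 0, 0);
    }
}
```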

Speedup from Streaming
[Bar chart: speedup from streaming (y-axis 0-35%) for ac, fdtd2d, sde, volatility, mbir, and the average]

Options for Memory System (revisited: the same four options as above; compression is explored next)

Data Compression in Medical Imaging
• Normally used for reducing disk space
• Use compression to reduce data transfer instead
• ~10:1 lossless compression
[Image: reconstruction after 12:1 JPEG lossy compression of the sinogram]

JPEG-LS Compression Hardware
• Low area footprint, compared to FPUs
  – ~0.25 mm²
• High throughput
  – ~250 Mpixels/sec
• Very low power dissipation
  – ~2 mW
• Compressing more data => better ratios
  – [De]compress data at the off-chip mem bus
  – 10X more bandwidth
[Chart: compression ratio (0-12) vs. image rows per compression (0-600)]
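For context on what that ~0.25 mm² block computes: the core of JPEG-LS is the LOCO-I median edge detection (MED) predictor, followed by Golomb coding of the prediction residual. A sketch of just the predictor, per the standard definition (context modeling and coding omitted):

```c
/* JPEG-LS MED predictor. For the current pixel, a is the left
 * neighbor, b the one above, c the one above-left. It selects
 * min(a,b) or max(a,b) near an edge, else a planar estimate. */
static int med_predict(int a, int b, int c) {
    int lo = a < b ? a : b;
    int hi = a < b ? b : a;
    if (c >= hi) return lo;   /* edge: predict the smaller neighbor */
    if (c <= lo) return hi;   /* edge: predict the larger neighbor */
    return a + b - c;         /* smooth region: planar prediction */
}
```

Because the predictor is a handful of comparisons and one add/subtract per pixel, it is cheap in area and power relative to an FPU, consistent with the figures above.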

Current/Future Work
• Thorough analysis of scientific compute domains
  – % FP
  – Mem:compute ratios
  – Data access patterns
• Improved GPU measurements
  – CUDA profiler to determine performance
  – Power measurements
• Memory system options

Conclusions
• Low-power “supercomputing” is an important direction of study in computer architecture
• Current solutions are either over-designed or far too inefficient
• Significant efficiency improvements from:
  – Datapath optimizations
  – Reduced thread contexts
  – Improved memory systems

Thank you! Questions?