Feeding the Multicore Beast: It’s All About the Data!
Michael Perrone, IBM Master Inventor
Manager, Cell Solutions Dept., IBM Research
© 2008
Outline
§ History: The data challenge
§ Motivation for multicore
§ Implications for programmers
§ How Cell addresses these implications
§ Examples
• 2D/3D FFT – Medical imaging, petroleum, general HPC…
• Green’s functions – Seismic imaging (petroleum)
• String matching – Network processing: DPI & intrusion detection
• Neural networks – Finance
Chapter 1: The Beast is Hungry!
The Hungry Beast
[Diagram: data (“food”) flows through a data pipe to the processor (“beast”)]
§ Pipe too small = starved beast
§ Pipe big enough = well-fed beast
§ Pipe too big = wasted resources
§ If flops grow faster than pipe capacity… the beast gets hungrier!
Move the food closer
§ Example: Intel Tulsa
– Xeon MP 7100 series
– 65 nm, 349 mm², 2 cores
– 3.4 GHz @ 150 W
– ~54.4 SP GFLOPS
– http://www.intel.com/products/processor/xeon/index.htm
§ Large cache on chip
– ~50% of area
– Keeps data close for efficient access
§ If the data is local, the beast is happy!
– True for many algorithms
What happens if the beast is still hungry?
§ If the data set doesn’t fit in cache
– Cache misses
– Memory latency exposed
– Performance degraded
§ Several important application classes don’t fit
– Graph searching algorithms
– Network security
– Natural language processing
– Bioinformatics
– Many HPC workloads
Make the food bowl larger
§ Cache size steadily increasing
§ Implications
– Chip real estate reserved for cache
– Less space on chip for compute
– More power required for fewer FLOPS
§ But…
– Important application working sets are growing faster
– Multicore is even more demanding on cache than uni-core
Chapter 2: The Beast Has Babies
Power Density – The Fundamental Problem
What’s causing the problem?
[Figure: gate-stack power density (W/cm²) vs. gate length (microns), rising steeply as gate length shrinks toward 65 nm]
§ Gate dielectric approaching a fundamental limit (a few atomic layers)
§ Power, signal jitter, etc.
Diminishing Returns on Frequency
§ In a power-constrained environment, raising chip clock speed yields diminishing returns.
§ The industry has moved to lower-frequency multicore architectures.
[Figure: frequency-driven design points]
Power vs. Performance Trade-offs
§ We need to adapt our algorithms to get performance out of multicore
[Figure: power vs. performance across design points (relative values ~0.85 to ~1.7)]
Implications of Multicore
§ There are mouths to feed
– Data movement will take center stage
§ Complexity of cores will stop increasing… and has started to decrease in some cases
§ Complexity increases will center around communication
§ Assumption: achieving a significant % of peak performance is important
Chapter 3: The Proper Care and Feeding of Hungry Beasts
Cell/B.E. Processor: 200 GFLOPS (SP) @ ~70 W
Feeding the Cell Processor
§ 8 SPEs, each with an SPU (SXU), a local store (LS), and a memory flow controller (MFC); 16 B/cycle per SPE port
§ EIB (element interconnect bus): up to 96 B/cycle
§ PPE (PPU/PXU with L1 and L2, 32 B/cycle to L2): 64-bit Power Architecture with VMX
– OS functions
– Disk IO
– Network IO
§ MIC to dual XDR™ memory, 16 B/cycle (2x); BIC to FlexIO™
Cell Approach: Feed the beast more efficiently
§ Explicitly “orchestrate” the data flow between main memory and each SPE’s local store
– Use the SPE’s DMA engine to gather & scatter data between main memory and local store
– Enables detailed programmer control of data flow
• Get/put data when & where you want it
• Hides latency: simultaneous reads, writes & computes
– Avoids restrictive HW cache management
• Unlikely to determine the optimal data flow
• Potentially very inefficient
– Allows more efficient use of the existing bandwidth
§ BOTTOM LINE: It’s all about the data!
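To make the get/compute/put pattern concrete, here is a minimal double-buffering sketch for one SPE using the Cell SDK’s spu_mfcio.h DMA intrinsics. The buffer size, tag assignments, and the process_block() kernel are illustrative assumptions, not code from the talk:

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define BUF_BYTES 16384   /* 16 KB per buffer; size is an illustrative assumption */

static volatile uint8_t buf[2][BUF_BYTES] __attribute__((aligned(128)));

/* Compute kernel; stands in for whatever the SPE does with a block. */
extern void process_block(volatile uint8_t *data, int bytes);

void stream_blocks(uint64_t ea_in, uint64_t ea_out, int nblocks)
{
    int cur = 0;
    mfc_get(buf[0], ea_in, BUF_BYTES, 0, 0, 0);   /* prime: fetch block 0, tag 0 */

    for (int i = 0; i < nblocks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nblocks)
            /* Fenced get, ordered after the earlier put in this tag group,
             * so we never overwrite a buffer whose writeback is in flight. */
            mfc_getf(buf[nxt], ea_in + (uint64_t)(i + 1) * BUF_BYTES,
                     BUF_BYTES, nxt, 0, 0);

        mfc_write_tag_mask(1 << cur);             /* wait only for our buffer */
        mfc_read_tag_status_all();

        process_block(buf[cur], BUF_BYTES);       /* compute overlaps the in-flight get */

        mfc_put(buf[cur], ea_out + (uint64_t)i * BUF_BYTES,
                BUF_BYTES, cur, 0, 0);            /* write back under the same tag */
        cur = nxt;
    }
    mfc_write_tag_mask(3);                        /* drain any outstanding puts */
    mfc_read_tag_status_all();
}
```

While block i is being processed, the get for block i+1 and the put for block i-1 are both in flight, which is exactly the latency hiding the slide describes.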
Cell Comparison: ~4× the FLOPS @ ~½ the power
§ Both 65 nm technology (shown to scale)
Memory-Managing Processor vs. Traditional General-Purpose Processor
[Figure: die photos compared – Cell/B.E. (IBM) vs. AMD and Intel general-purpose processors]
Examples of Feeding Cell
§ 2D and 3D FFTs
§ Seismic imaging
§ String matching
§ Neural networks (function approximation)
Feeding FFTs to Cell
§ SIMDized data
§ DMAs double buffered
§ Pass 1: For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
§ Pass 2: For each buffer
• DMA Get buffer
• Do four 1D FFTs in SIMD
• Transpose tiles
• DMA Put buffer
[Figure: tiles of the input image are read into a buffer, transposed, and written to a transposed buffer]
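A skeleton of one such pass, under stated assumptions: fft4_rows_simd() and transpose_tiles() stand in for the SIMD kernels the slide names, dma_get()/dma_put() stand in for the MFC calls, and the double buffering shown earlier is omitted for brevity:

```c
typedef struct { float re, im; } cplx;

/* Stand-ins for the SIMD kernels and DMA calls the slide refers to. */
extern void fft4_rows_simd(cplx *buf, int n);   /* four 1D FFTs at once       */
extern void transpose_tiles(cplx *buf, int n);  /* transpose tiles in buffer  */
extern void dma_get(cplx *ls, const cplx *mem, unsigned bytes);
extern void dma_put(const cplx *ls, cplx *mem, unsigned bytes);

#define NMAX 1024            /* max row length the LS buffer allows (assumed) */
static cplx buf[4 * NMAX];   /* one buffer shown; the real code double buffers */

/* One pass: 1D FFTs along rows, tiles transposed on the way out, so the
 * second pass sees the original columns as rows. Two passes = 2D FFT. */
void fft2d_pass(cplx *src, cplx *dst, int n)
{
    for (int r = 0; r < n; r += 4) {
        dma_get(buf, &src[r * n], 4 * n * sizeof(cplx));
        fft4_rows_simd(buf, n);
        transpose_tiles(buf, n);
        /* The put lands the tiles in transposed positions; combined with
         * the local tile transpose this implements the global transpose
         * (strided in the real kernel, shown as one call here). */
        dma_put(buf, &dst[r], 4 * n * sizeof(cplx));
    }
}
/* Usage: fft2d_pass(a, b, n); fft2d_pass(b, a, n); */
```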
3D FFTs
§ Long stride trashes cache: along the third dimension, consecutive elements are stride N² apart (vs. stride 1 in the leading dimension)
§ Cell DMA allows prefetch
[Figure: a single element’s stride-N² walk through the data envelope]
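One way Cell makes that stride-N² walk cheap is a DMA list: one list element per z plane fetches a small contiguous run, and the MFC streams the whole “pencil” into the local store in a single command. A sketch using the Cell SDK’s mfc_list_element_t/mfc_getl() interface; the cube size, the quadword-per-plane layout, and the single-4 GB-region addressing are illustrative assumptions:

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define N 256   /* cube edge; illustrative */

/* Each list element gathers one aligned quadword (4 floats) per z plane. */
static volatile float pencil[N][4] __attribute__((aligned(128)));
static mfc_list_element_t list[N] __attribute__((aligned(8)));

/* ea_volume supplies the high address bits; the 32-bit eal offsets assume
 * the volume fits in one 4 GB region. x is assumed to be a multiple of 4. */
void get_z_pencils(uint64_t ea_volume, int x, int y)
{
    for (int z = 0; z < N; z++) {
        list[z].notify = 0;
        list[z].size   = 4 * sizeof(float);
        list[z].eal    = (uint32_t)ea_volume +
                         ((uint32_t)(z * N + y) * N + x) * sizeof(float);
    }
    mfc_getl(pencil, ea_volume, list, N * sizeof(mfc_list_element_t), 0, 0, 0);
    mfc_write_tag_mask(1 << 0);                  /* wait for the whole gather */
    mfc_read_tag_status_all();
}
```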
Feeding Seismic Imaging to Cell
[Figure: a Green’s function G applied to the data at point (X, Y)]
§ New G at each (x, y)
§ Radial symmetry of G reduces BW requirements
Feeding Seismic Imaging to Cell
[Figure: the data partitioned across SPE0–SPE7]
Feeding Seismic Imaging to Cell
§ For each X
– Load next column of data
– Load next column of indices
– For each Y
• Load Green’s functions
• SIMDize Green’s functions
• Compute convolution at (X, Y)
– Cycle buffers
[Figure: a (2R+1)-wide window of data columns of height H around (X, Y), with a data buffer and a Green’s index buffer]
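In outline, that loop might look like the following sketch. The buffer sizes, the ring of 2R+1 columns, and all helper kernels are illustrative stand-ins, with boundary handling and DMA double buffering omitted:

```c
#define H  1024          /* column height (assumed)           */
#define R  16            /* Green's function radius (assumed) */
#define W  (2 * R + 1)   /* window width: 2R+1 columns        */

/* Stand-ins for the DMA loads and SIMD kernels the slide lists. */
extern void  dma_get_column(float *ls, int x);          /* next data column   */
extern void  dma_get_indices(int *ls, int x);           /* next index column  */
extern void  load_greens_simd(float g[W][W], int idx);  /* load + SIMDize G   */
extern float convolve(const float (*cols)[H], const float g[W][W], int y);

void image_columns(int nx, int ny, float *out)   /* assumes ny <= H */
{
    static float cols[W][H];   /* ring of 2R+1 data columns around X */
    static int   idx[H];

    for (int x = 0; x < nx; x++) {
        dma_get_column(cols[(x + R) % W], x + R);   /* load next column  */
        dma_get_indices(idx, x);                    /* load next indices */
        for (int y = 0; y < ny; y++) {
            float g[W][W];
            load_greens_simd(g, idx[y]);            /* new G at each (x, y) */
            out[x * ny + y] = convolve(cols, g, y);
        }
        /* "Cycle buffers": the ring index retires the oldest column. */
    }
}
```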
Feeding String Matching to Cell
§ Sample word list: “the”, “that”, “math”
§ Find (lots of) substrings in a (long) string
§ Build a graph of words & represent it as a DFA
§ Problem: the graph doesn’t fit in LS
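For reference, a minimal table-driven DFA scanner of the kind described (plain C; the table layout is an assumed one). The point is the dependent load per input byte: the next table address is unknown until the previous load returns, which is what exposes main-memory latency once the graph cannot stay in the LS:

```c
#include <stdint.h>

#define NSTATES 4096   /* illustrative size; real pattern graphs are larger */

extern uint16_t next_state[NSTATES][256];  /* transition table (assumed layout) */
extern uint8_t  is_match[NSTATES];         /* flags states that end a word      */

int count_matches(const uint8_t *text, int len)
{
    int hits = 0;
    uint16_t s = 0;                  /* start state */
    for (int i = 0; i < len; i++) {
        s = next_state[s][text[i]];  /* dependent load: next address unknown
                                        until the previous load returns */
        hits += is_match[s];
    }
    return hits;
}
```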
Hiding Main Memory Latency
Software Multithreading
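One hedged sketch of the idea: interleave several independent DFA scans (“software threads”), each with its own DMA tag, so one scan’s transition-row fetch is in flight while the others advance. The thread count, table layout, non-empty inputs, and the blocking wait (a production kernel would poll tag status instead) are illustrative assumptions:

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define NTHREADS 8   /* in-flight scans; count is an illustrative assumption */

typedef struct {
    uint16_t state;                    /* current DFA state              */
    const uint8_t *pos, *end;          /* this scan's input (non-empty)  */
    volatile uint16_t row[256] __attribute__((aligned(128))); /* DMA'd row */
} scan_t;

/* ea_table: effective address of the 256-entries-per-state transition
 * table in main memory (layout assumed). */
void scan_interleaved(scan_t t[NTHREADS], uint64_t ea_table)
{
    int live = NTHREADS;
    for (int i = 0; i < NTHREADS; i++)           /* prime: one fetch per scan */
        mfc_get(t[i].row, ea_table + t[i].state * 256 * sizeof(uint16_t),
                256 * sizeof(uint16_t), i, 0, 0);

    while (live > 0)
        for (int i = 0; i < NTHREADS; i++) {
            if (t[i].pos >= t[i].end) continue;  /* this scan already done   */
            mfc_write_tag_mask(1 << i);          /* wait for this scan's row */
            mfc_read_tag_status_all();
            t[i].state = t[i].row[*t[i].pos++];  /* one DFA step             */
            if (t[i].pos < t[i].end)             /* launch the next fetch    */
                mfc_get(t[i].row, ea_table + t[i].state * 256 * sizeof(uint16_t),
                        256 * sizeof(uint16_t), i, 0, 0);
            else
                live--;
        }
}
```

While any one scan is stalled on its row fetch, the other NTHREADS-1 scans have work or fetches in flight, so the SPE stays busy.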
Feeding Neural Networks to Cell
§ Neural net function F(X): RBF, MLP, KNN, etc.
§ N basis functions (dot product + nonlinearity) map the D input dimensions of X through a D×N matrix of parameters to the output F
§ If too big for LS, BW bound
Convert BW Bound to Compute Bound
§ Split the function over multiple SPEs, then merge
§ Avoids unnecessary memory traffic
§ Reduces compute time per SPE
§ Minimal merge overhead
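A plain-C sketch of the split-and-merge idea: each SPE evaluates its slice of the N basis functions, and a cheap reduction merges the partial results. The tanh nonlinearity, the row-major D×N layout, the output weights alpha, and all names are assumptions standing in for whatever the real network used:

```c
#include <math.h>

/* Each SPE evaluates basis functions n0..n1-1 of F(X) against its own
 * slice of the parameter matrix, so no SPE touches the others' weights. */
void eval_slice(const float *W,     /* D x N parameter matrix, row-major */
                const float *x,     /* input vector, length D            */
                float *out,         /* partial outputs, length n1 - n0   */
                int D, int N, int n0, int n1)
{
    for (int n = n0; n < n1; n++) {
        float dot = 0.0f;
        for (int d = 0; d < D; d++)
            dot += W[d * N + n] * x[d];   /* dot product with column n      */
        out[n - n0] = tanhf(dot);         /* nonlinearity (assumed MLP-style) */
    }
}

/* Merge step (on the PPE or one SPE): weighted sum of the per-SPE
 * partial outputs; alpha holds the assumed output weights, length N. */
float merge_outputs(const float *const parts[], const int lens[], int nspe,
                    const float *alpha)
{
    float F = 0.0f;
    int base = 0;
    for (int s = 0; s < nspe; s++) {
        for (int k = 0; k < lens[s]; k++)
            F += alpha[base + k] * parts[s][k];
        base += lens[s];
    }
    return F;
}
```

The merge touches only the N partial outputs, not the D×N weights, which is why its overhead stays minimal as the network grows.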
Moral of the Story: It’s All About the Data!
§ The data problem is growing: multicore
§ Intelligent software prefetching
– Use DMA engines
– Don’t rely on HW prefetching
§ Efficient data management
– Multibuffering: hide the latency!
– BW utilization: make every byte count!
– SIMDization: make every vector count!
– Problem/data partitioning: make every core work!
– Software multithreading: keep every core busy!
Backup
Abstract
Technological obstacles have prevented the microprocessor industry from continuing to increase performance through higher chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the Cell/B.E. processor’s design addresses these challenges. As examples, I will review one or two applications that highlight the strengths of Cell.