Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu

Google Consumer Workloads
  • Chrome: Google's web browser
  • TensorFlow Mobile: Google's machine learning framework
  • Video Playback and Video Capture: Google's video codec (VP9)

Data Movement Cost
We look at widely-used Google workloads to identify major sources of energy consumption.
  • 1st key observation: 62.7% of total system energy is spent on data movement
  • 2nd key observation: a significant fraction of data movement often comes from simple functions
[Figure: an SoC (CPU, L1, L2, compute unit) connected to DRAM, with data movement between them, contrasted with processing-in-memory (PIM)]

Goals
  1. Understand the data movement related bottlenecks in modern consumer workloads
  2. Analyze the benefits of processing-in-memory (PIM) to mitigate data movement cost
  3. Investigate the PIM logic that can maximize energy efficiency given the limited area and energy budget in consumer devices

Chrome Browser

Chrome Rendering Pipeline
[Figure: HTML and CSS pass through the HTML parser and CSS parser into the render tree, followed by layout, rasterization, and compositing]

Scrolling
  • 41.9% of page scrolling energy is spent on texture tiling and color blitting
  • 77% of total energy consumption goes to data movement
  • 49.1% of total data movement comes from texture tiling and color blitting
[Figure: fraction of total energy spent on texture tiling, color blitting, and other work for Google Docs, Gmail, Google Calendar, WordPress, Twitter, Animation, and AVG]
[Figure: energy (pJ) broken down by CPU, L1, LLC, interconnect, memory controller, and DRAM]

Texture Tiling
[Figure: rasterization produces a linear bitmap, which is converted into texture tiles before compositing is invoked. CPU-Only: read bitmap, conversion, write back (high data movement). CPU + PIM: the conversion is performed by PIM]
  • Requires simple primitives: memcopy, bitwise operations, and simple arithmetic operations
  • Texture tiling is a good candidate for PIM execution
  • Both a PIM core and a PIM accelerator are feasible ways to implement in-memory texture tiling
  • Area figures shown for the PIM logic: 7.1% and 9.4% of the area available for PIM logic

Color Blitting
  • Generates a large amount of data movement
  • Accounts for 19.1% of the total system energy during scrolling
  • Requires low-cost operations: memset, simple arithmetic, and shift operations
  • Color blitting is a good candidate for PIM execution
  • Color blitting is feasible to implement in PIM logic

Browser Tab Switching
  • Chrome employs a multi-process architecture
[Figure: a Chrome process and one render process per tab (tab 1, tab 2, ..., tab n)]
  • Main operations during tab switching:
    – Context switch
    – Load the new page
  • Chrome uses compression to reduce each tab's memory footprint
[Figure: the CPU compresses a tab held in DRAM and stores the compressed tab in ZRAM]
  • To study data movement during tab switching, we perform an experiment:
    – A user opens 50 tabs (most-accessed websites)
    – Scrolls through each for a few seconds
    – Switches to the next tab
  • 19.6 GB of data move between CPU and ZRAM
  • Compression and decompression contribute to 18.1% of the total system energy
[Figure: CPU-Only timeline: swap out N pages, read N pages, compress, write back to ZRAM (high data movement). CPU + PIM timeline: PIM compresses the uncompressed pages with no off-chip data movement while the CPU runs other tasks]

TensorFlow Mobile
  • 57.3% of the inference energy is spent on data movement
  • 54.4% of the data movement energy comes from packing/unpacking and quantization

Packing
[Figure: matrix, packing, packed matrix]
  • Reorders elements of matrices to minimize cache misses
  • Up to 40% of the inference energy and 31% of inference execution time
  • Packing's data movement accounts for up to 35.3% of the inference energy
  • A simple data reorganization process that requires simple arithmetic

Quantization
[Figure: floating point, quantization, integer]
  • Converts 32-bit floating point to 8-bit integers
  • Up to 16.8% of the inference energy and 16.1% of inference execution time
  • Majority of quantization energy comes from data movement
  • A simple data conversion operation that requires shift, addition, and multiplication

Video Playback and Capture

VP9 Decoder
[Figure: compressed video, VP9 decoder, display]
  • 63.5% of the system energy is spent on data movement
  • 80.4% of the data movement energy comes from sub-pixel interpolation and deblocking filter
  • Sub-pixel interpolation: interpolates the value of pixels at non-integer locations
  • Deblocking filter: a simple low-pass filter that attempts to remove discontinuity in pixels
  • Both functions can benefit from PIM execution and can be implemented at PIM logic

VP9 Encoder
[Figure: captured video, VP9 encoder, compressed video]
  • 59.1% of the system energy is spent on data movement
  • Majority of the data movement energy comes from motion estimation
  • Motion estimation: compresses the frames using temporal redundancy between them

Evaluation
[Figure: normalized energy for CPU-Only, PIM-Core, and PIM-Acc across texture tiling, color blitting, compression, decompression, packing, quantization, sub-pixel interpolation, deblocking filter, and motion estimation]
[Figure: normalized runtime for CPU-Only, PIM-Core, and PIM-Acc across texture tiling and compression (Chrome Browser), sub-pixel interpolation and motion estimation (Video Playback and Capture), and TensorFlow Mobile]
  • On average, energy consumption reduces by 49.1% using PIM core and 55.4% using PIM accelerator
  • On average, energy consumption reduces by 44.6% using PIM core and 54.2% using PIM accelerator
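
The poster describes texture tiling as converting a linear bitmap into texture tiles using only memcopy, bitwise operations, and simple arithmetic. The following is a minimal sketch of that idea, not Chrome's implementation: the function name tile_bitmap, the one-byte-per-pixel format, the tile size, and the tile-major output order are all assumptions made for illustration.

```python
def tile_bitmap(linear, width, height, tile_w=4, tile_h=4):
    """Reorder a row-major bitmap into tile-major order.

    Assumptions (not from the poster): one byte per pixel, and image
    dimensions that are exact multiples of the tile dimensions.
    """
    if len(linear) != width * height:
        raise ValueError("bitmap size does not match width * height")
    if width % tile_w or height % tile_h:
        raise ValueError("dimensions must be multiples of the tile size")

    tiled = bytearray(len(linear))
    out = 0  # next write position in the tiled buffer
    for tile_y in range(0, height, tile_h):        # walk tiles top to bottom
        for tile_x in range(0, width, tile_w):     # then left to right
            for row in range(tile_y, tile_y + tile_h):
                start = row * width + tile_x       # simple index arithmetic
                # One short contiguous copy per tile row (the "memcopy").
                tiled[out:out + tile_w] = linear[start:start + tile_w]
                out += tile_w
    return bytes(tiled)


# A 4x4 bitmap with pixel values 0..15, split into 2x2 tiles.
example = tile_bitmap(bytes(range(16)), width=4, height=4, tile_w=2, tile_h=2)
```

Every pixel is read once and written once, and the only computation is the index arithmetic, which is consistent with the poster's point that the cost of this step is dominated by data movement rather than compute.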
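
The poster describes quantization as a simple conversion from 32-bit floating point to 8-bit integers. The sketch below shows one common way such a conversion can be done, a linear mapping of the observed value range onto 0..255. It is an illustration under assumptions, not TensorFlow Mobile's code: the function names and the min/max range choice are hypothetical, and it uses floating-point addition and multiplication only, so the fixed-point shift mentioned on the poster is not modeled.

```python
def quantize_to_uint8(values):
    """Map a non-empty list of floats linearly onto the integers 0..255.

    Returns the quantized values plus the offset and scale needed to
    approximately recover the original floats.
    """
    lo, hi = min(values), max(values)
    # A constant input has no range to spread over 0..255.
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    quantized = [
        min(255, max(0, int((v - lo) * scale + 0.5)))  # round, then clamp
        for v in values
    ]
    return quantized, lo, scale


def dequantize(quantized, lo, scale):
    """Approximate inverse of quantize_to_uint8."""
    if scale == 0.0:
        return [lo for _ in quantized]
    return [q / scale + lo for q in quantized]


q, offset, scale = quantize_to_uint8([-1.0, 0.0, 1.0])
restored = dequantize(q, offset, scale)
```

Each element needs one subtraction, one multiplication, and one rounding step, while the whole tensor has to be read and rewritten, which matches the poster's observation that most of the quantization energy goes to data movement.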