An Overflowfree Quantized Memory Hierarchy in Generalpurpose Processors
- Slides: 39
An Overflow-free Quantized Memory Hierarchy in General-purpose Processors Marzieh Lenjani , Patricia Gonzalez , Elaheh Sadredini , M Arif Rahman, Mircea R. Stan
Characterization for design optimizations 2
OS and MMU @characteristics <…, …, …. > How about summarizing characterizations into an abstraction? 3
Characteristics as abstractions for memory-intensive application What is different about memory-intensive applications? Few computations per loaded datum 4
Cost of data movement in memory-intensive applications • Cost of each data movement >> Cost of each computation Memory to L 1 >1000 x more energy >100 x more latency An ADD operation [1]. Han, Song, et al. "EIE" ISCA 2016. • Memory-intensive application – Few computations per data movement Data movement dominates the execution time and energy consumption 5
Outline • Reducing the cost of data movement – Approach 1: Cache compression – Approach 2: Truncating LSBs – Approach 3: Quantization • • Quantization in general-purpose processors Characteristics as abstractions Evaluation Conclusion 6
Processor A cache block L 2 A compressed cache block Compressor/ decompressor LLC Memory Approach 1: Cache compression 7
Processor A cache block L 2 A compressed cache block Compressor/ decompressor LLC Memory Compressor/decompressor obliviously looks for value locality Problem 1: Complex compression/decompression 8
Metadata overhead A compressed cache block Compressor/ decompressor Metadata per block A cache block Problem 2: Significant metadata overhead 9
Variable block size Pointer to B 2 Pointer to B 1 B 3 B 2 B 1 L 3: Data array L 3: Tag array L 2 Problem 3: Complex address translation 10
Low value locality • High variation in LSB – Low value locality Sign Exponent (8 bits) Mantissa ( 23 bits) • Complex structures – Low value locality Struct element{ price float Price; bool type; 87. 89 float yield} type Unused yield True XXXXXXX 0. 2 Problem 4: Low compression ratio 11
Approach 2: Truncating LSBs • Approximate computing – Many applications can tolerate imprecise variables • [Example: image classification] Model Cat✓ Dog × Tolerate imprecise variables Shorter variables Less data movement 12
Approach 2: Truncating LSBs • Omitting Least Significant Bits (OLSB) [Jain et al. , MICRO 2016] – Complies with the exponent-mantissa format of the floating point standard Mantissa ( 23 bits) Sign Exponent (8 bits) Mantissa (14 bits) Eliminated bits The shorter variable 13
Problem 1: High cost with exponent Keeps the exponent intact to reduce error 1 bit 7 bits 8 bits 4 bits 3 bits High error High overhead 12 bits variables 14
Problem 2: Exponent’s impact on error Exponent =-126 Error= =(M-M’) × 10 -43 LOW Exponent =+127 Error=(M-M’)× 10+34 HIGH Sign Exponent (8 bits) Mantissa (14 bits) Error grows with the magnitude of value 15
Traditional reason for the exponent-mantissa format 0 -3. 4 × 1038 1. 2 × 10 -38 3. 4 × 1038 Minimum positive Maximum Advantage: supporting a wide range of values 16
Approach 3: Quantization ∆=0. 25 -3 -2 -1 110 101 2. 0 0 000 Error=∆=fixed Range of Values Error=∆=fixed 1. 25 Error=∆=fixed Converting floating point numbers to short integers 2. 75 +1 +2 +3 +001 +010 +011 Shorter bit-width and limited error 17
Problem : Overflows Minimum Maximum Values outside the supported range cause overflows 18
Quantization is getting very popular 19
Where to use quantization? • Applications that can tolerate overflows • Accelerators: – Can exploit non-standard bit-width (e. g. 4 -bit, 12 -bit, 18 -bit) 20
How about overflow-free quantization in general-purpose processors? 21
Quantization for general-purpose processors • Software-based Implementation – Can not exploit non-standard bit-width • (e. g. 4 -bit, 12 -bit, 17 -bit) – Significant Programmer’s effort – Significant performance overhead • For conversions – Prone to bugs and overflows 22
800 600 400 200 0 80 160 240 320 400 0 Activation values Open price Deep learning Financial apps 600 400 200 0 -3 -2 -1 0 1 2 3 600000 400000 200000 0 -14 -9 -4 1 6 11 800000 Frequency 1000 Frequency 120000 100000 80000 60000 40000 20000 0 -1, 25 -0, 75 -0, 25 0, 75 1, 25 Frequency Key observations Net change Sound amplitude Financial apps Signal processing Significant portion of values lie within a narrow range There are only few outliers outside the range 24
Quantization + overflow handling can benefit a wide range of applications Quantization + overflow handling can be employed in general-purpose processors 25
Goals: 1. Support for overflow-free quantization in general-purpose processors 2. Hardware-accelerated conversions 3. Minimal programmer’s effort 4. Flexible bit-width 26
Key mechanisms • Converting an array of values to a quantized bitstream • Representing values within the range with few bits – Fixed bit-width, e. g. , 4 bits • Assigning an outlier indicator with same bit-width – E. g. , “ 1111” • Replacing outliers with the indicator in the quantized bitstream • Storing the real value of outliers in a different dedicated space 27
How to avoid overflows? Processor Multiple de-quantized blocks L 1 A quantized block Converter L 2 Memory @quant <3, 0. 25, 1. 5> float price [1000]; 1. Computation: Convert to float before computation 2. Conversion: Retrieve outliers from the dedicated space 28
Quantization looks like cache compression Processor Multiple de-quantized blocks L 1 A quantized block Converter L 2 Memory @quant <3, 0. 25, 1. 5> float price [1000]; How to avoid cache compression problems? 29
A high-level abstraction can help MMU transfers quantization parameters, array’s boundaries, and layout to hardware 30
How do quantization parameters help? • A low-overhead compressor/decompressor – Quantizer/dequantize – Simple arithmetic-based conversions @quant <3, 0. 25, 1. 5> float price [1000]; • Instead of Obliviously looking for value locality 31
How do array’s boundaries help? • Keep the metadata for the whole array – Vs. keeping metadata per block (per page) 32
How does layout help? • Higher compression ratio for arrays of structure Struct element{ float Price; bool type; float yield} price type Unused yield 87. 89 True XXXXXXX 0. 2 9 1 4 bits bit 33
Address translation using precomputed (offline-computed) values • Precomputed values for low-overhead address translation – Reciprocate of the number of quantized words that fit in a block – Accumulated quantized bit-width of variables in a structure B 3 B 2 B 1 34
Evaluation 35
Absolute Error Quantization vs LSB truncation 12000 10000 8000 6000 4000 2000 0 16 OLSB Quantization provides 16 times lower absolute error
Quantization vs OLSB Up to 31% performance improvement over OLSB
N u m Performance improvement over OLSB (%) to achieve absolute error of 5% Performance Evaluation 12 10 8 6 4 2 0 4 16 8 Number of cores Performance improvement increases as the number of cores increases
Quantization vs cache compression Latency of access to the decoupled tag and data array nullifies the higher compression rate by Qnt+Comp
Conclusion • Most values lie within a narrow range – No need to have exponent and mantissa format except for outliers • General-purpose processors can exploit quantization • Quantization can be a low-overhead compression technique – Simpler compressor/decompressor – Less metadata – Low-overhead address translation – Higher compression ratio • For floating point variables and arrays of structures
- Virtual memory in memory hierarchy consists of
- Light and quantized energy worksheet
- Electrons in atoms section 1 light and quantized energy
- Quantized inertia
- Electrons in atoms section 1 light and quantized energy
- Section 1 light and quantized energy
- Memory hierarchy
- Memory hierarchy definition
- Large and fast: exploiting memory hierarchy
- Computer memory hierarchy
- Draw memory hierarchy
- Computer architecture
- Hierarchy of trends
- Memory hierarchy
- Memory hierarchy
- Memory hierarchy in computer architecture
- Computer memory hierarchy diagram
- Memory hierarchy pyramid
- Memory hierarchy in os
- Virtual memory tlb
- Computer memory hierarchy diagram
- Hit ratio in computer architecture
- Magnetic disk in memory hierarchy
- Memory hierarchy
- Explain memory hierarchy
- Comparison of word processors
- Networks and telecommunications presentation
- Programming massively parallel processors
- Processor history
- Intel vs amd advantages and disadvantages
- One pass macro processor algorithm
- Characterization of query processors
- Programming massively parallel processors, kirk et al.
- Programming massively parallel processors
- Gas processors association
- Handlers classification
- Embeded processors
- Vliw vs superscalar
- Parallel processors from client to cloud
- Difference between linear and nonlinear pipeline processors