An Overflow-free Quantized Memory Hierarchy in General-purpose Processors

Marzieh Lenjani, Patricia Gonzalez, Elaheh Sadredini, M. Arif Rahman, Mircea R. Stan

Characterization for design optimizations

OS and MMU: @characteristics <…, …, …>
How about summarizing characterizations into an abstraction?

Characteristics as abstractions for memory-intensive applications
What is different about memory-intensive applications? Few computations per loaded datum.

Cost of data movement in memory-intensive applications
• Cost of each data movement >> cost of each computation: moving a datum from memory to L1 takes >1000× more energy and >100× more latency than an ADD operation [1]
• Memory-intensive applications perform few computations per data movement
Data movement dominates the execution time and energy consumption.
[1] Han, Song, et al. "EIE." ISCA 2016.

Outline
• Reducing the cost of data movement
– Approach 1: Cache compression
– Approach 2: Truncating LSBs
– Approach 3: Quantization
• Quantization in general-purpose processors
• Characteristics as abstractions
• Evaluation
• Conclusion

Approach 1: Cache compression
[Figure: a compressor/decompressor sits between the L2 and the LLC, turning each cache block into a compressed cache block on its way to the LLC and memory.]

The compressor/decompressor obliviously looks for value locality in each block.
Problem 1: Complex compression/decompression

Metadata overhead
The compressor/decompressor must keep metadata for every compressed cache block.
Problem 2: Significant metadata overhead

Variable block size
[Figure: with variable-size compressed blocks, the L3 tag array must hold pointers to blocks B1, B2, and B3 scattered in the L3 data array.]
Problem 3: Complex address translation

Low value locality
• High variation in the least significant bits of floating-point values (sign, 8-bit exponent, 23-bit mantissa) leads to low value locality
• Complex structures also show low value locality, e.g., a struct holding a price (87.89), a type flag (true), and a yield (0.2), with unused padding bytes in between:

struct element {
    float price;  // 87.89
    bool  type;   // true (followed by unused padding bytes)
    float yield;  // 0.2
};

Problem 4: Low compression ratio
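A minimal demo (my own illustration, repeating the slide's struct so it compiles standalone) shows the wasted padding that defeats locality-based compression:

#include <cstdio>

// The slide's struct: on a typical ABI, 3 padding bytes follow `type`
// so that `yield` is 4-byte aligned; those "unused" bytes hold garbage.
struct element {
    float price;  // e.g., 87.89
    bool  type;   // e.g., true
    float yield;  // e.g., 0.2
};

int main() {
    // 4 (price) + 1 (type) + 3 (padding) + 4 (yield) = 12 on most ABIs.
    std::printf("sizeof(element) = %zu bytes\n", sizeof(element));
}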

Approach 2: Truncating LSBs
• Approximate computing: many applications can tolerate imprecise variables (e.g., an image-classification model still outputs "cat ✓, dog ✗" with less precise inputs)
• Tolerating imprecise variables enables shorter variables, which means less data movement

Approach 2: Truncating LSBs
• Omitting Least Significant Bits (OLSB) [Jain et al., MICRO 2016] complies with the exponent-mantissa format of the floating-point standard: the sign and the 8-bit exponent are kept, and the low-order bits of the 23-bit mantissa are eliminated (e.g., leaving a 14-bit mantissa) to form the shorter variable.
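As a concrete sketch (my own illustration of the idea, not the authors' code), OLSB-style truncation amounts to masking off the low-order mantissa bits of an IEEE-754 float:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Keep the sign and 8-bit exponent intact; zero out the eliminated
// low-order mantissa bits so only kept_mantissa_bits of the 23 remain.
float truncate_lsbs(float value, int kept_mantissa_bits) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));           // reinterpret the float's bits safely
    bits &= ~((1u << (23 - kept_mantissa_bits)) - 1u);  // clear the eliminated bits
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

int main() {
    float x = 87.89f;
    std::printf("%f -> %f\n", x, truncate_lsbs(x, 14)); // keep 14 of 23 mantissa bits
}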

Problem 1: High cost of the exponent
OLSB keeps the exponent intact to reduce error, but for 12-bit variables the sign bit plus the 8-bit exponent leave only 3 mantissa bits (even a 7-bit exponent would leave only 4): the exponent is high overhead, and the tiny mantissa causes high error.

Problem 2: The exponent's impact on error
With the truncated format (sign, 8-bit exponent, 14-bit mantissa), the error scales with the exponent: for exponent = -126, Error = (M - M') × 10^-43 (low); for exponent = +127, Error = (M - M') × 10^+34 (high). Error grows with the magnitude of the value.
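A small self-contained demo (my own illustration) makes this concrete: dropping the same nine mantissa bits costs almost nothing for a value with a small exponent and an enormous amount for one with a large exponent:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Zero the 9 low-order mantissa bits (23 -> 14 kept), as in the slide.
float drop_nine_mantissa_bits(float v) {
    uint32_t b;
    std::memcpy(&b, &v, sizeof(b));
    b &= ~0x1FFu;                    // clear the 9 eliminated bits
    std::memcpy(&v, &b, sizeof(v));
    return v;
}

int main() {
    for (float v : {1.234567e-30f, 1.234567e+30f})
        std::printf("value %g -> absolute error %g\n",
                    (double)v, (double)(v - drop_nine_mantissa_bits(v)));
}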

Traditional reason for the exponent-mantissa format
A 32-bit float ranges from -3.4 × 10^38 to a maximum of +3.4 × 10^38, with a minimum positive value of 1.2 × 10^-38.
Advantage: supporting a wide range of values

Approach 3: Quantization
Converting floating-point numbers to short integers. [Figure: a number line with step ∆ = 0.25; values such as 1.25, 2.0, and 2.75 map to short signed codes, and within the supported range the error is fixed at ∆.]
Shorter bit-width and limited error
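A minimal sketch of this mapping (illustrative, not the paper's exact encoding):

#include <cmath>
#include <cstdio>

// Uniform quantization: map a float to a short integer code with a fixed
// step delta, so the rounding error is bounded by the fixed step
// regardless of the value's magnitude.
int quantize(float x, float delta)   { return (int)std::lround(x / delta); }
float dequantize(int q, float delta) { return q * delta; }

int main() {
    const float delta = 0.25f;                 // step from the slide
    for (float x : {1.25f, 2.0f, 2.75f})       // points on the slide's number line
        std::printf("%.2f -> code %+d -> %.2f\n",
                    x, quantize(x, delta), dequantize(quantize(x, delta), delta));
}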

Problem: Overflows
Values outside the supported range (below the minimum or above the maximum) cause overflows.

Quantization is getting very popular.

Where to use quantization?
• Applications that can tolerate overflows
• Accelerators, which can exploit non-standard bit-widths (e.g., 4-bit, 12-bit, 18-bit)

How about overflow-free quantization in general-purpose processors?

Quantization for general-purpose processors
• A software-based implementation:
– Cannot exploit non-standard bit-widths (e.g., 4-bit, 12-bit, 17-bit)
– Requires significant programmer effort
– Incurs significant performance overhead for conversions
– Is prone to bugs and overflows

Key observations
[Figure: value histograms for four memory-intensive workloads: activation values (deep learning), open price (financial apps), net change (financial apps), and sound amplitude (signal processing).]
A significant portion of values lies within a narrow range; there are only a few outliers outside the range.

Quantization + overflow handling can benefit a wide range of applications, and can be employed in general-purpose processors.

Goals:
1. Support for overflow-free quantization in general-purpose processors
2. Hardware-accelerated conversions
3. Minimal programmer effort
4. Flexible bit-width

Key mechanisms
• Converting an array of values to a quantized bitstream
• Representing values within the range with a fixed, small bit-width (e.g., 4 bits)
• Assigning an outlier indicator with the same bit-width (e.g., "1111")
• Replacing outliers with the indicator in the quantized bitstream
• Storing the real value of each outlier in a separate dedicated space
A sketch of these mechanisms follows.
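Put together (the names, the delta/offset parameterization, and the code layout below are my own illustration, not the paper's exact format):

#include <cmath>
#include <cstdint>
#include <vector>

constexpr uint8_t kOutlier = 0xF;   // "1111": the reserved outlier indicator

struct QuantizedArray {
    std::vector<uint8_t> codes;     // one 4-bit code per element (one byte each here, for clarity)
    std::vector<float>   outliers;  // real values of the outliers, in order of appearance
};

// code = round((x - offset) / delta); codes 0..14 are in range,
// 15 ("1111") is reserved to mark an outlier.
QuantizedArray quantize(const std::vector<float>& xs, float delta, float offset) {
    QuantizedArray qa;
    for (float x : xs) {
        long q = std::lround((x - offset) / delta);
        if (q >= 0 && q <= 14) {
            qa.codes.push_back((uint8_t)q);  // in range: short fixed-width code
        } else {
            qa.codes.push_back(kOutlier);    // out of range: emit the indicator...
            qa.outliers.push_back(x);        // ...and keep the real value in the dedicated space
        }
    }
    return qa;
}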

How to avoid overflows?
[Figure: a converter between L1 and L2 expands a quantized block into multiple de-quantized blocks; the array is declared as "@quant <3, 0.25, 1.5> float price[1000];".]
1. Computation: convert values to float before computation
2. Conversion: retrieve outliers from the dedicated space
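Dequantization mirrors the sketch above (reusing its QuantizedArray type and kOutlier indicator; again an illustration, not the paper's code):

#include <cstddef>

// In-range codes are expanded arithmetically to float before computation;
// the indicator triggers a lookup in the dedicated outlier space, so
// out-of-range values are recovered exactly instead of overflowing.
std::vector<float> dequantize(const QuantizedArray& qa, float delta, float offset) {
    std::vector<float> xs;
    size_t next = 0;                             // cursor into the outlier space
    for (uint8_t code : qa.codes) {
        if (code == kOutlier)
            xs.push_back(qa.outliers[next++]);   // 2. retrieve the real outlier value
        else
            xs.push_back(code * delta + offset); // 1. convert to float for computation
    }
    return xs;
}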

Quantization looks like cache compression: the converter between L1 and L2 plays the role of the compressor/decompressor.
How can quantization avoid cache compression's problems?

A high-level abstraction can help: the MMU transfers the quantization parameters, the array's boundaries, and the layout to the hardware.

How do quantization parameters help?
• They enable a low-overhead compressor/decompressor: the quantizer/dequantizer performs simple arithmetic-based conversions driven by the annotation ("@quant <3, 0.25, 1.5> float price[1000];"), instead of obliviously looking for value locality; see the sketch below.
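Reading the annotation as <bit-width, delta, offset> (my assumption; the slides do not spell the fields out), the converter reduces to two arithmetic expressions, with no dictionaries or pattern matching:

#include <cmath>

struct QuantParams {   // carried by the @quant annotation
    unsigned bits;     // e.g., 3
    float    delta;    // e.g., 0.25
    float    offset;   // e.g., 1.5
};

// Pure arithmetic in both directions: a subtract, a divide (or multiply
// by a precomputed reciprocal), and a round -- cheap enough for hardware.
unsigned encode(float x, const QuantParams& p) {
    return (unsigned)std::lround((x - p.offset) / p.delta);
}
float decode(unsigned q, const QuantParams& p) {
    return q * p.delta + p.offset;
}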

How do the array's boundaries help?
• The metadata is kept once for the whole array, instead of per block (or per page); see the sketch below.
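A sketch of the contrast (the descriptor layout is my own illustration):

#include <cstddef>

// One descriptor covers the entire array, e.g., "float price[1000]";
// a block-based compression scheme would instead keep metadata for every
// cache block (or page) that the array spans.
struct PerArrayMeta {
    void*    base;    // where the array starts
    size_t   length;  // the array's boundary (number of elements)
    unsigned bits;    // quantized bit-width
    float    delta;   // quantization step
    float    offset;  // quantization base value
};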

How does layout help?
• It gives a higher compression ratio for arrays of structures: each field gets its own quantized bit-width (price: 9 bits, type: 1 bit, yield: 4 bits), and the unused padding bytes disappear entirely:

struct element {
    float price;  // 87.89 -> 9-bit code
    bool  type;   // true  -> 1 bit
    float yield;  // 0.2   -> 4-bit code
};
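A sketch of the resulting packed layout (the field widths are from the slide; the bit-field packing is my own illustration):

#include <cstdio>

// 9 + 1 + 4 = 14 bits of payload per element, versus 12 bytes for the
// padded struct above.
struct PackedElement {
    unsigned price : 9;   // quantized price code
    unsigned type  : 1;   // the bool, as a single bit
    unsigned yield : 4;   // quantized yield code
};

int main() {
    // Typically 4 bytes due to alignment, still 3x smaller than the
    // 12-byte padded struct -- and a bitstream packs the raw 14 bits.
    std::printf("sizeof(PackedElement) = %zu bytes\n", sizeof(PackedElement));
}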

Address translation using precomputed (offline-computed) values
• Precomputed values enable low-overhead address translation:
– The reciprocal of the number of quantized words that fit in a block
– The accumulated quantized bit-widths of the variables in a structure
A sketch of the reciprocal trick follows.
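The reciprocal lets the hardware locate the block holding element i without a divider (the fixed-point scheme below is my own illustration of that idea):

#include <cstdint>
#include <cstdio>

int main() {
    // Example: 14-bit quantized elements in a 512-bit (64-byte) block
    // -> 36 whole words per block.
    const uint32_t words_per_block = 36;
    // Precomputed (offline) fixed-point reciprocal: ceil(2^32 / 36).
    const uint64_t recip = ((uint64_t)1 << 32) / words_per_block + 1;

    for (uint32_t i : {0u, 35u, 36u, 1000u}) {
        uint32_t block = (uint32_t)((i * recip) >> 32);  // ~ i / words_per_block via multiply + shift
        std::printf("element %4u -> block %u (exact: %u)\n",
                    i, block, i / words_per_block);
    }
}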

Evaluation

Quantization vs. LSB truncation
[Figure: absolute error of OLSB vs. quantization.]
Quantization provides 16 times lower absolute error than OLSB.

Quantization vs. OLSB: up to 31% performance improvement over OLSB.

Performance evaluation
[Figure: performance improvement over OLSB (%) to achieve an absolute error of 5%, at 4, 8, and 16 cores.]
The performance improvement increases as the number of cores increases.

Quantization vs. cache compression: the latency of accessing the decoupled tag and data arrays nullifies the higher compression ratio of quantization + compression (Qnt+Comp).

Conclusion
• Most values lie within a narrow range: no need for the exponent-mantissa format except for outliers
• General-purpose processors can exploit quantization
• Quantization can be a low-overhead compression technique:
– Simpler compressor/decompressor
– Less metadata
– Low-overhead address translation
– Higher compression ratio for floating-point variables and arrays of structures