Disciplined Approximate Computing From Language to Hardware and

  • Slides: 49
Download presentation
Disciplined Approximate Computing: From Language to Hardware and Beyond Adrian Sampson, Hadi Esmaelizadeh, 1

Disciplined Approximate Computing: From Language to Hardware and Beyond Adrian Sampson, Hadi Esmaelizadeh, 1 Michael Ringenburg, Reneé St. Amant, 2 Luis Ceze, Dan Grossman, Mark Oskin, Karin Strauss, 3 and Doug Burger 3 University of Washington Now at Georgia Tech 2 UT-Austin 3 Microsoft Research 1

Approximation with HW eyes. . .

Approximation with HW eyes. . .

Approximation with SW eyes. . . • Approximation is yet another complexity ‒ Most

Approximation with SW eyes. . . • Approximation is yet another complexity ‒ Most of a code base can’t handle it • But many expensive kernels are naturally error-tolerant ‒ Multiple “good enough” results

Must work across the stack… User Application Algorithm Language Compiler Architecture Circuits Devices •

Must work across the stack… User Application Algorithm Language Compiler Architecture Circuits Devices • From above: Where approximation acceptable • From below: More savings for less effort Working at one level is inherently limiting

Errors Energy

Errors Energy

Across the stack • Higher level knows pixel buffer vs. header bits and control

Across the stack • Higher level knows pixel buffer vs. header bits and control flow • Lower level knows how to save energy for approximate pixel buffer We are working on: 1. Expressing and monitoring disciplined approximation 2. Exploiting approximation in architectures and devices 3. Much, much more…

Plan for today Application language for where-toapproximate Ener. J Monitoring dynamic quality evaluation Compiler

Plan for today Application language for where-toapproximate Ener. J Monitoring dynamic quality evaluation Compiler traditional architecture Truffle Architecture. NPU neural networks as accelerators Circuits Approximate Storage Approximate Wi-Fi give high-level results mention in passing

Application language for where-toapproximate Ener. J: Approximate Data Types for Safe and General Low-Power

Application language for where-toapproximate Ener. J: Approximate Data Types for Safe and General Low-Power Computation, PLDI 2011

Safety ✓ Statically separate critical and non -critical program components Precise ✗Approximate Generality Various

Safety ✓ Statically separate critical and non -critical program components Precise ✗Approximate Generality Various approximation strategies encapsulated Disciplined Approximate Programming Storage Logic Algorithms λ Comunication

Type qualifiers @approx int a =. . . ; @precise int p =. .

Type qualifiers @approx int a =. . . ; @precise int p =. . . ; ✗ ✓ p = a; a = p;

Control flow: implicit flows @approx int a =. . . ; @precise int p

Control flow: implicit flows @approx int a =. . . ; @precise int p =. . . ; ✗ if (a == 10) { p = 2; }

Endorsement: escape hatch @approx int a = expensive(); @precise int p; ✓ p =

Endorsement: escape hatch @approx int a = expensive(); @precise int p; ✓ p = endorse(a); quick. Checksum(p); output(p);

Algorithmic approximation with Polymorphism class Float. Set { @context float[] nums =. . .

Algorithmic approximation with Polymorphism class Float. Set { @context float[] nums =. . . ; Also in the PLDI’ 11 paper: float mean() { calculate mean Formal semantics } Noninterference claim @approx float mean_APPROX() { Object-layout calculate meanissues of half } Simulation results } … @approx Float. Set some. Set =. . . ; some. Set. mean();

Application language for where-toapproximate Ener. J Monitoring [under submission] dynamic quality evaluation

Application language for where-toapproximate Ener. J Monitoring [under submission] dynamic quality evaluation

Controlling approximation • • Ener. J: approximate cannot interfere with precise But semantically, approximate

Controlling approximation • • Ener. J: approximate cannot interfere with precise But semantically, approximate results are unspecified “best effort” • • • “Worst effort” of “skip; return 0” is “correct” Only higher levels can measure quality: inherently app-specific Lower levels can make monitoring convenient

Quality of Result (Qo. R) monitoring Offline: Profile, auto-tune • Pluggable energy model Online:

Quality of Result (Qo. R) monitoring Offline: Profile, auto-tune • Pluggable energy model Online: Can react – recompute or decrease approximation-level • Handle situations not seen during testing • But needs to be cheap! First design space of online monitors • Identify several common patterns • And apps where they work well

Online monitoring approaches General Strawman: Rerun computations precisely; compare. More energy than no approximation.

Online monitoring approaches General Strawman: Rerun computations precisely; compare. More energy than no approximation. Sampling: Rerun some computations precisely; compare. Verification Functions: Run a cheap check after every computation (e. g. , within expected bounds, solves equation, etc. ) Fuzzy Memoization: Record past inputs and outputs of computations. Check that “similar” inputs produce “similar” outputs.

Results [See paper for energy model, benchmarks, etc. ] Application Monitor Type Ray Tracer

Results [See paper for energy model, benchmarks, etc. ] Application Monitor Type Ray Tracer Sampling Asteroids Verifier Triangle Verifier Sobel Fuzzy Memo FFT Fuzzy Memo

Results [See paper for energy model, benchmarks, etc. ] Application Monitor Type Precise Approx

Results [See paper for energy model, benchmarks, etc. ] Application Monitor Type Precise Approx No Monitor Ray Tracer Sampling 100% 67% Asteroids Verifier 100% 91% Triangle Verifier 100% 83% Sobel Fuzzy Memo 100% 86% FFT Fuzzy Memo 100% 73%

Results [See paper for energy model, benchmarks, etc. ] Application Monitor Type Precise Approx

Results [See paper for energy model, benchmarks, etc. ] Application Monitor Type Precise Approx Retained No Monitor Savings Ray Tracer Sampling 100% 67% 85% 44% Asteroids Verifier 100% 91% 95% 56% Triangle Verifier 100% 83% 87% 78% Sobel Fuzzy Memo 100% 86% 93% 49% FFT Fuzzy Memo 100% 73% 82% 70%

Application language for where-toapproximate Ener. J Monitoring Compiler traditional architecture Truffle Architecture Support for

Application language for where-toapproximate Ener. J Monitoring Compiler traditional architecture Truffle Architecture Support for Disciplined Approximate Programming, ASPLOS 2012 dynamic quality evaluation

Hardware support for disciplined approximate execution int p = 5; @approx int a =

Hardware support for disciplined approximate execution int p = 5; @approx int a = 7; for (int x = 0. . ) { a += func(2); @approx int z; z = p * 2; p += 4; } a /= 9; func 2(p); a += func(2); @approx int y; z = p * 22 + z; p += 10; VDDH Compiler Truffle Core VDDL

Approximation-aware ISA ld ld add st 0 x 04 r 1 0 x 08

Approximation-aware ISA ld ld add st 0 x 04 r 1 0 x 08 r 2 r 1 r 2 r 3 0 x 0 c r 3

Approximation-aware ISA ld ld add. a st. a 0 x 04 r 1 0

Approximation-aware ISA ld ld add. a st. a 0 x 04 r 1 0 x 08 r 2 r 1 r 2 r 3 0 x 0 c r 3 operations ALU storage registers caches main memory

Approximate semantics add. a r 1 r 2 r 3: some value writes the

Approximate semantics add. a r 1 r 2 r 3: some value writes the sum of r 1 and r 2 to r 3 Informally: r 3 gets something approximating Still guarantee: r 1 + r 2 • No other register modified • No floating point trap raised Actual error pattern depends on • Program counter doesn’t jump approximation technique, • No missiles launched microarchitecture, voltage, process, • … variation, …

Dual-voltage pipeline replicated functional units Fetch Decode Reg Read Branch Predictor Instruction Cache Execute

Dual-voltage pipeline replicated functional units Fetch Decode Reg Read Branch Predictor Instruction Cache Execute Memory Write Back Integer FU Data Cache Decoder ITLB Register File FP FU DTLB dual-voltage SRAM arrays 7– 24% energy saved on average (fft, game engines, raytracing, QR code readers, etc) (scope: processor + memory)

Diminishing returns on data Fetch Decode Reg Read Branch Predictor Instruction Cache ITLB Execute

Diminishing returns on data Fetch Decode Reg Read Branch Predictor Instruction Cache ITLB Execute Memory Write Back Integer FU Data Cache Decoder Register File FP FU DTLB • Instruction control cannot be approximated ‒ Can we eliminate instruction bookkeeping?

Application language for where-toapproximate Ener. J Monitoring dynamic quality evaluation Compiler traditional architecture Truffle

Application language for where-toapproximate Ener. J Monitoring dynamic quality evaluation Compiler traditional architecture Truffle Architecture. NPU Neural Acceleration for General-Purpose Approximate Programs, MICRO 2012 neural networks as accelerators

Using Neural Networks as Approximate Accelerators Very efficient hardware implementations! Trainable to mimic many

Using Neural Networks as Approximate Accelerators Very efficient hardware implementations! Trainable to mimic many computations! CPU Recall is imprecise. Fault tolerant [Temam, ISCA 2012] Map entire regions of code to an (approximate) accelerator: ‒ Get rid of instruction processing (von Neumann bottleneck)

Neural acceleration Program

Neural acceleration Program

Neural acceleration Find an approximable program component Program

Neural acceleration Find an approximable program component Program

Neural acceleration Find an approximable program component Program Compile the program and train a

Neural acceleration Find an approximable program component Program Compile the program and train a neural network

Neural acceleration Find an approximable program component Program Compile the program and train a

Neural acceleration Find an approximable program component Program Compile the program and train a neural network Execute on a fast Neural Processing Unit (NPU)

Example: Sobel filter @approx float grad(approx float[3][3] p) { … } @approx float dst[][];

Example: Sobel filter @approx float grad(approx float[3][3] p) { … } @approx float dst[][]; edge. Detection() void edge. Detection(a. Image &src, a. Image &dst) { for (int y = …) { for (int x = …) { dst[x][y] = grad(window(src, x, y)); } } }

Code region criteria √ Approximable √ Well-defined no side effects inputs and outputs small

Code region criteria √ Approximable √ Well-defined no side effects inputs and outputs small errors do not corrupt output

Code observation record(p); record(result); p [[NPU]] float grad(float[3][3] p) { … } + test

Code observation record(p); record(result); p [[NPU]] float grad(float[3][3] p) { … } + test cases void edge. Detection(Image &src, Image &dst) { for (int y = …) { for (int x = …) { dst[x][y] = grad(window(src, x, y)); } } } instrumented program => grad(p) 323, 231, 122, 93, 321, 49 49, 423, 293, 2 34, 129, 493, 49, 31, 11 21, 85, 47, 62, 21, 577 7, 55, 28, 96, 552, 921 5, 129, 493, 49, 31, 11 49, 423, 293, 2 34, 129, 72, 49, 5, 2 323, 231, 122, 93, 321, 49 6, 423, 293, 2 ➝ ➝ ➝ ➝ ➝ 53. 2 94. 2 1. 2 64. 2 18. 1 92. 2 6. 5 120 53. 2 49. 7 sample arguments & outputs

Training Inputs Backpropagation Training

Training Inputs Backpropagation Training

Training Inputs 70% 98% 99% faster less robust slower more accurate

Training Inputs 70% 98% 99% faster less robust slower more accurate

Neural Processing Unit (NPU) Core NPU

Neural Processing Unit (NPU) Core NPU

Software interface: ISA extensions input enq. d Core output deq. d enq. c deq.

Software interface: ISA extensions input enq. d Core output deq. d enq. c deq. c configuration NPU

A digital NPU Bus Scheduler scheduling Processing Engines input output

A digital NPU Bus Scheduler scheduling Processing Engines input output

A digital NPU scheduling multiply-add unit Bus Scheduler input neuron weights output accumulator sigmoid

A digital NPU scheduling multiply-add unit Bus Scheduler input neuron weights output accumulator sigmoid LUT Processing Engines Many other implementations FPGAs Analog input output Hybrid HW/SW SW only? . . .

How does the NN look in practice? edge detection triangle intersection 88 static instructions

How does the NN look in practice? edge detection triangle intersection 88 static instructions 56% of dynamic instructions 1, 079 static x 86 -64 instructions 97% of dynamic instructions 18 neurons 60 neurons 2 hidden layers

Results Summary 2. 3 x average speedup Ranges from 0. 8 x to 11.

Results Summary 2. 3 x average speedup Ranges from 0. 8 x to 11. 1 x Also in the MICRO paper: 3. 0 x average energy reductionlatency Sensitivity to communication Sensitivitybenefit to NN evaluation efficiency All benchmarks Sensitivity to PE count Quality loss belowstatistics 10% in all cases Benchmark All-software NN slowdown • As always, quality metrics application-specific

Ongoing effort Disciplined Approximate Programming (Ener. J, Ener. C, . . . ) Approximate

Ongoing effort Disciplined Approximate Programming (Ener. J, Ener. C, . . . ) Approximate Compiler X 86, ARM off-the-shelf Truffle: dual-Vdd u. Arch Approximate Accelerators Approximate compiler (no special HW) • Analog components • Storage via PCM • Wireless communication • Approximate HDL • Approximate Wireless Networks and Storage

Conclusion Goal: approximate computing across the system Key: Co-design programming model with approximation techniques:

Conclusion Goal: approximate computing across the system Key: Co-design programming model with approximation techniques: disciplined approximate programming Early results encouraging Approximate computing can potentially save our bacon in a post-Dennard era and be in the survival kit for dark silicon

Disciplined Approximate Computing: From Language to Hardware and Beyond University of Washington Adrian Sampson,

Disciplined Approximate Computing: From Language to Hardware and Beyond University of Washington Adrian Sampson, Hadi Esmaelizadeh, Michael Ringenburg, Reneé St. Amant, Luis Ceze, Dan Grossman, Mark Oskin, Karin Strauss, and Doug Burger Thanks!