Image processing language Halide
Dmitry Kurtaev
Internet of Things Group

Agenda
§ Halide’s basics
§ Samples
  – RGB → Gray
  – Box filter (blurring)
  – Histogram computation
  – K-means (color reduction)
§ Scheduling
§ Gamma correction on GPU in Halide
§ Ahead-of-time compilation

Halide
§ Programming language embedded in C++ (C++11 or higher)
§ Designed for image processing tasks
§ LLVM compiler backend
§ Generates C code, OpenCL/CUDA kernels, OpenGL shaders
§ Cross-platform compilation with AVX/SSE/FMA/F16 features
§ High-level scheduling
§ It isn’t Turing complete

Halide
(diagram boxes: Algorithm, Scheduling; Halide intermediate representation (IR); LLVM; Machine code; OpenCL compiler; CUDA)

Halide’s basic entities
§ Variables
    Halide::Var x("x"), xo("xo"), xi("xi");
§ Reduction domains
    Halide::RDom r(-1, 3);  // r in [-1, 1]
§ Expressions
    Halide::Expr e = sin((x + r) * 0.1f);
§ Functions
    Halide::Func f("f");
    f(x) = sum(e);
§ Scheduling directives
    f.split(x, xo, xi, 16).parallel(xo).vectorize(xi);
    f.compile_jit(Halide::get_host_target());
§ Buffers
    Halide::Buffer<float> output(1000);
    f.realize(output);
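As a cross-check of what the algorithm above defines, here is a minimal plain-C++ sketch of the same computation without Halide. It assumes `RDom r(-1, 3)` means "minimum -1, extent 3", as the slide's comment states; `reference_f` is a hypothetical helper name, not part of Halide.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Plain-C++ reference for f(x) = sum over r in [-1, 1] of sin((x + r) * 0.1f).
// No scheduling is applied: this is just the definition written as loops.
std::vector<float> reference_f(int size) {
    assert(size >= 0);
    std::vector<float> out(size);
    for (int x = 0; x < size; ++x) {
        float acc = 0.0f;
        for (int r = -1; r <= 1; ++r) {  // RDom r(-1, 3): min -1, extent 3
            acc += std::sin((x + r) * 0.1f);
        }
        out[x] = acc;
    }
    return out;
}
```

Calling `reference_f(1000)` gives a buffer that can be compared element by element, within a small tolerance, against the output of `f.realize(output)`.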

Halide pipeline
§ Write algorithm
    Halide::Var x("x"), xo("xo"), xi("xi");
    Halide::RDom r(-1, 3);  // r in [-1, 1]
    Halide::Expr e = sin((x + r) * 0.1f);
    Halide::Func f("f");
    f(x) = sum(e);
§ Schedule
    f.split(x, xo, xi, 16).parallel(xo).vectorize(xi);
§ Choose OS, architecture, features
§ Compile
    f.compile_jit(Halide::get_host_target());
§ Realize
    Halide::Buffer<float> output(1000);
    f.realize(output);

RGB2Gray
    void rgb2gray(const uint8_t* src, uint8_t* dst, int height, int width) {
        for (int i = 0; i < width * height; ++i) {
            dst[i] = (uint8_t)(0.299f * src[i * 3 + 0] +   // red
                               0.587f * src[i * 3 + 1] +   // green
                               0.114f * src[i * 3 + 2]);   // blue
        }
    }

RGB2Gray using TBB
    tbb::parallel_for(
        tbb::blocked_range<int>(0, height * width),
        [&](const tbb::blocked_range<int>& r) {
            // Fixed-point weights scaled by 256
            uint16_t red, green, blue, R2GRAY = 77, G2GRAY = 150, B2GRAY = 29;
            int begin = r.begin();
            int end = r.end();
            const uint8_t* __restrict__ srcData = src;
            uint8_t* __restrict__ dstData = dst;
            for (int i = begin; i < end; ++i) {
                red = srcData[i * 3];
                green = srcData[i * 3 + 1];
                blue = srcData[i * 3 + 2];
                dstData[i] = (R2GRAY * red + G2GRAY * green + B2GRAY * blue) >> 8;
            }
        });
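The constants 77, 150 and 29 are the floating-point weights 0.299, 0.587 and 0.114 multiplied by 256 and rounded, so the final `>> 8` divides the weighted sum by 256. A small standalone sketch of that arithmetic follows; `gray_q8` is a hypothetical helper name used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Fixed-point luma weights scaled by 256:
// round(0.299 * 256) = 77, round(0.587 * 256) = 150, round(0.114 * 256) = 29.
constexpr uint16_t R2GRAY = 77, G2GRAY = 150, B2GRAY = 29;
static_assert(R2GRAY + G2GRAY + B2GRAY == 256, "weights must sum to 256");

inline uint8_t gray_q8(uint8_t r, uint8_t g, uint8_t b) {
    // The largest possible sum is 256 * 255 = 65280, which fits in 16 bits.
    const uint32_t sum = R2GRAY * r + G2GRAY * g + B2GRAY * b;
    return static_cast<uint8_t>(sum >> 8);  // divide by 256, truncating
}
```

Because the weights sum to exactly 256, a white pixel maps to 255 and a black pixel to 0; intermediate values are truncated rather than rounded.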

RGB2Gray in Halide
    uint16_t R2GRAY = 77, G2GRAY = 150, B2GRAY = 29;
    Func f("rgb2gray");
    auto input = Buffer<uint8_t>::make_interleaved(src, width, height, 3);
    Var x("x"), y("y");
    Expr r = cast<uint16_t>(input(x, y, 0));
    Expr g = cast<uint16_t>(input(x, y, 1));
    Expr b = cast<uint16_t>(input(x, y, 2));
    f(x, y) = cast<uint8_t>((R2GRAY * r + G2GRAY * g + B2GRAY * b) >> 8);
    Buffer<uint8_t> output(dst, {width, height});
    f.realize(output);

RGB2Gray efficiency comparison (1920x1280)
Intel® Core™ i5-4460 CPU @ 3.20GHz x 4

    Implementation               Time
    OpenCV (GNU 5.4.0)           0.76 ms
    TBB (Intel® C++ Compiler)    0.674 ms
    Halide (LLVM 5.0.1)          1.315 ms

Scheduling
    Var x("x"), y("y");
    Func f("f");
    f(x, y) = 0;
    f.print_loop_nest();
    f.trace_stores();
    f.realize(10, 10);

Loop nest:
    produce f:
      for y:
        for x:
          f(...) = ...

Scheduling
    Var x("x"), y("y"), yo("yo"), yi("yi");
    Func f("f");
    f(x, y) = 0;
    f.bound(x, 0, 10).bound(y, 0, 10)
     .split(y, yo, yi, 5)
     .parallel(yo);
    f.print_loop_nest();
    f.trace_stores();
    f.realize(10, 10);

Loop nest:
    produce f:
      parallel y.yo:
        for y.yi in [0, 4]:
          for x:
            f(...) = ...
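The split above rewrites the `y` loop as `y = yo * 5 + yi`. The sketch below is a serial plain-C++ stand-in for that loop nest, written under the assumption that the extent divides evenly by the split factor (as it does for 10 and 5); `split_rows` is a hypothetical name, and the outer loop that Halide would run in parallel is kept serial here.

```cpp
#include <cassert>
#include <vector>

// Serial stand-in for the nest produced by split(y, yo, yi, factor).
// Returns how many times each pixel was written, to show full coverage.
std::vector<int> split_rows(int width, int height, int factor) {
    assert(factor > 0 && height % factor == 0);  // the sketch ignores ragged tails
    std::vector<int> visits(width * height, 0);
    for (int yo = 0; yo < height / factor; ++yo) {   // "parallel y.yo" in Halide
        for (int yi = 0; yi < factor; ++yi) {        // "for y.yi in [0, factor - 1]"
            const int y = yo * factor + yi;
            for (int x = 0; x < width; ++x) {
                visits[y * width + x] += 1;
            }
        }
    }
    return visits;
}
```

Every pixel is visited exactly once, so the split changes only the iteration order and the unit of parallel work, not the result.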

Scheduling
    Var x("x"), y("y"), yo("yo"), yi("yi"), xo("xo"), xi("xi"), tile("tile");
    Func f("f");
    f(x, y) = 0;
    f.bound(x, 0, 10).bound(y, 0, 10)
     .split(y, yo, yi, 5)
     .split(x, xo, xi, 5)
     .reorder(xi, yi, xo, yo)
     .fuse(xo, yo, tile)
     .parallel(tile);
    f.print_loop_nest();
    f.trace_stores();
    f.realize(10, 10);

Loop nest:
    produce f:
      parallel x.xo.tile:
        for y.yi in [0, 4]:
          for x.xi in [0, 4]:
            f(...) = ...
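A serial plain-C++ sketch of this tiled traversal may help when reading the loop nest. It assumes the fused index enumerates `xo` fastest (`xo` is the inner variable passed to `fuse`) and that the extents divide evenly by the tile size; `tile_order` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Returns the linear pixel indices (y * width + x) in the order a serial
// version of split + reorder(xi, yi, xo, yo) + fuse(xo, yo, tile) visits them.
std::vector<int> tile_order(int width, int height, int ts) {
    assert(ts > 0 && width % ts == 0 && height % ts == 0);
    const int nxo = width / ts, nyo = height / ts;
    std::vector<int> order;
    order.reserve(width * height);
    for (int tile = 0; tile < nxo * nyo; ++tile) {  // "parallel x.xo.tile" in Halide
        const int xo = tile % nxo;                  // inner fused variable
        const int yo = tile / nxo;                  // outer fused variable
        for (int yi = 0; yi < ts; ++yi) {
            for (int xi = 0; xi < ts; ++xi) {
                order.push_back((yo * ts + yi) * width + (xo * ts + xi));
            }
        }
    }
    return order;
}
```

For a 10x10 image with 5x5 tiles, the first 25 entries cover the top-left tile row by row before the traversal moves to the tile on its right.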

Scheduling
    Var x("x"), y("y"), yo("yo"), yi("yi");
    Func f("f");
    f(x, y) = 0;
    f.bound(x, 0, 10).bound(y, 0, 10)
     .split(y, yo, yi, 5)
     .parallel(yo)
     .vectorize(x, 4);
    f.print_loop_nest();
    f.trace_stores();
    f.realize(10, 10);

Loop nest:
    produce f:
      parallel y.yo:
        for y.yi in [0, 4]:
          for x.x:
            vectorized x.v8 in [0, 3]:
              f(...) = ...

Scheduling
    Var x("x"), y("y"), yo("yo"), yi("yi");
    Func f("f");
    f(x, y) = 0;
    f.estimate(x, 0, 10).estimate(y, 0, 10);
    Pipeline(f).auto_schedule(get_host_target());
    f.print_loop_nest();
    f.trace_stores();
    f.realize(10, 10);

Loop nest:
    produce f:
      parallel y:
        parallel x.x_vo:
          vectorized x.x_vi in [0, 7]:
            f(...) = ...

RGB2Gray efficiency comparison (1920x1280)
Intel® Core™ i5-4460 CPU @ 3.20GHz x 4

    OpenCV (GNU 5.4.0)           0.76 ms
    TBB (Intel® C++ Compiler)    0.674 ms

Halide (LLVM 5.0.1), by schedule:
    1.315 ms  (serial)
    0.796 ms  f.split(y, yo, yi, 64).parallel(yo).vectorize(x, 8);
    1.221 ms  f.split(y, yo, yi, 64).split(x, xo, xi, 64).reorder(xi, yi, xo, yo).fuse(xo, yo, tile).parallel(tile).vectorize(x, 8);
    0.869 ms  f.parallel(y).vectorize(x, 32);  (auto scheduling)

Box filter in Halide
    Func f("box_filter");
    auto input = Buffer<uint8_t>::make_interleaved(src, width, height, 3);
    Func padded = BoundaryConditions::constant_exterior(input, 0);
    Var x("x"), y("y"), c("c");
    Func input_uint16("input_uint16");
    input_uint16(x, y, c) = cast<uint16_t>(padded(x, y, c));
    RDom r(-1, 3, -1, 3);
    Expr s = sum(input_uint16(x + r.x, y + r.y, c));
    float ratio = 1.0f / 9;
    f(x, y, c) = cast<uint8_t>(s * ratio);
    f.output_buffer().dim(0).set_stride(3).set_bounds(0, width);
    f.output_buffer().dim(1).set_stride(3 * width).set_bounds(0, height);
    f.output_buffer().dim(2).set_stride(1).set_bounds(0, 3);
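For comparison, the same 3x3 box filter can be written as plain loops. This is a minimal sketch under the slide's conventions (interleaved channels, zero padding outside the image, 16-bit accumulator, multiplication by 1/9); `box3x3` is a hypothetical name and no scheduling is implied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 3x3 box filter over an interleaved image; pixels outside the image count as 0,
// which mirrors BoundaryConditions::constant_exterior(input, 0).
std::vector<uint8_t> box3x3(const std::vector<uint8_t>& src,
                            int width, int height, int channels) {
    assert(src.size() == static_cast<std::size_t>(width) * height * channels);
    std::vector<uint8_t> dst(src.size());
    const float ratio = 1.0f / 9;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            for (int c = 0; c < channels; ++c) {
                uint16_t s = 0;  // at most 9 * 255 = 2295, fits in 16 bits
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx) {
                        const int xx = x + dx, yy = y + dy;
                        if (xx < 0 || xx >= width || yy < 0 || yy >= height) continue;
                        s += src[(static_cast<std::size_t>(yy) * width + xx) * channels + c];
                    }
                dst[(static_cast<std::size_t>(y) * width + x) * channels + c] =
                    static_cast<uint8_t>(s * ratio);
            }
    return dst;
}
```

With zero padding, border pixels come out darker than interior ones because fewer than nine real samples contribute to the sum.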

Box filter efficiency comparison (1920x1280)
Intel® Core™ i5-4460 CPU @ 3.20GHz x 4

    Implementation               Time
    OpenCV (GNU 5.4.0)           3.603 ms
    TBB (Intel® C++ Compiler)    4.779 ms (is not well auto-vectorized)
    Halide (LLVM 5.0.1)          3.784 ms

Scheduling: producer-consumer (81.5 ms @ 1920x1280)
    Var x("x"), y("y");
    Func producer("producer"), consumer("consumer");
    producer(x, y) = sin(x + y);
    consumer(x, y) = producer(x, y - 1) + producer(x - 1, y) +
                     producer(x + 1, y) + producer(x, y + 1);
    consumer.realize(5, 5);

producer is inlined into consumer ⇒ #sin – 125!
    Var x("x"), y("y");
    Func consumer("consumer");
    consumer(x, y) = sin(x + y - 1) + sin(x - 1 + y) +
                     sin(x + 1 + y) + sin(x + y + 1);
    consumer.realize(5, 5);

Scheduling: producer-consumer (33.4 ms @ 1920x1280)
    Var x("x"), y("y");
    Func producer("producer"), consumer("consumer");
    producer(x, y) = sin(x + y);
    consumer(x, y) = producer(x, y - 1) + producer(x - 1, y) +
                     producer(x + 1, y) + producer(x, y + 1);
    producer.compute_root();
    producer.trace_loads();
    producer.trace_stores();
    consumer.realize(5, 5);
(figure: producer and consumer)
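One way to picture `compute_root` is as two separate loop nests: the producer is first evaluated over the whole region the consumer will read, and the consumer then only loads from that buffer. The sketch below is a plain-C++ approximation under the assumption that this region is the bounding box of the stencil's footprint (one extra pixel on every side); `consumer_compute_root` and `StencilResult` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct StencilResult {
    std::vector<float> out;  // consumer values, row-major
    int sin_calls;           // how many times the producer body ran
};

StencilResult consumer_compute_root(int w, int h) {
    assert(w > 0 && h > 0);
    const int pw = w + 2, ph = h + 2;  // x and y each range over [-1, extent]
    std::vector<float> producer(pw * ph);
    int calls = 0;
    // Pass 1: evaluate the producer once per point of the required region.
    for (int y = -1; y <= h; ++y)
        for (int x = -1; x <= w; ++x) {
            producer[(y + 1) * pw + (x + 1)] = std::sin(static_cast<float>(x + y));
            ++calls;
        }
    auto p = [&](int x, int y) { return producer[(y + 1) * pw + (x + 1)]; };
    // Pass 2: the consumer only loads stored producer values.
    std::vector<float> out(w * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y * w + x] = p(x, y - 1) + p(x - 1, y) + p(x + 1, y) + p(x, y + 1);
    return {out, calls};
}
```

For a 5x5 output this sketch evaluates `sin` over a 7x7 region, 49 calls, at the cost of storing the whole intermediate buffer.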

Scheduling: producer-consumer (82.5 ms @ 1920x1280)
    Var x("x"), y("y");
    Func producer("producer"), consumer("consumer");
    producer(x, y) = sin(x + y);
    consumer(x, y) = producer(x, y - 1) + producer(x - 1, y) +
                     producer(x + 1, y) + producer(x, y + 1);
    producer.compute_at(consumer, y);
    producer.trace_loads();
    producer.trace_stores();
    consumer.realize(5, 5);
(figure: producer and consumer)

Scheduling: producer-consumer (28.1 ms @ 1920x1280)
    Var x("x"), y("y");
    Func producer("producer"), consumer("consumer");
    producer(x, y) = sin(x + y);
    consumer(x, y) = producer(x, y - 1) + producer(x - 1, y) +
                     producer(x + 1, y) + producer(x, y + 1);
    producer.store_root();
    producer.compute_at(consumer, y);
    producer.trace_loads();
    producer.trace_stores();
    consumer.realize(5, 5);
(figure: producer and consumer)

Scheduling: producer-consumer (50.5 ms @ 1920x1280; needs to be parallelized)
    Var x("x"), y("y");
    Func producer("producer"), consumer("consumer");
    producer(x, y) = sin(x + y);
    consumer(x, y) = producer(x, y - 1) + producer(x - 1, y) +
                     producer(x + 1, y) + producer(x, y + 1);
    producer.store_root();
    producer.compute_at(consumer, x);
    producer.trace_loads();
    producer.trace_stores();
    consumer.realize(5, 5);
(figure: producer and consumer)

Histogram computation in Halide
    void histogram(uint8_t* src, int* dst, int height, int width) {
        static Func f("hist");
        static Buffer<int> output(dst, {256, 3});
        if (!f.defined()) {
            auto input = Buffer<uint8_t>::make_interleaved(src, width, height, 3);
            Var c("c"), i("i");
            RDom r(0, width, 0, height);
            f(i, c) = 0;
            Expr lum = clamp(input(r.x, r.y, c), 0, 255);
            f(lum, c) += 1;
            f.estimate(i, 0, 256).estimate(c, 0, 3);
            Pipeline(f).auto_schedule(get_host_target());
        }
        f.realize(output);
    }
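The update definition `f(lum, c) += 1` over the reduction domain is a scatter: each input pixel increments one bin. A plain-C++ sketch of the same per-channel histogram follows, assuming interleaved input and an output laid out as 256 bins per channel, the dense layout of a `{256, 3}` buffer; `histogram_rgb` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-channel histogram of an interleaved 3-channel image.
// Output layout: hist[c * 256 + value].
std::vector<int> histogram_rgb(const std::vector<uint8_t>& src, int width, int height) {
    assert(src.size() == static_cast<std::size_t>(width) * height * 3);
    std::vector<int> hist(256 * 3, 0);          // pure definition: f(i, c) = 0
    for (int y = 0; y < height; ++y)            // RDom r(0, width, 0, height)
        for (int x = 0; x < width; ++x)
            for (int c = 0; c < 3; ++c) {
                const uint8_t v = src[(static_cast<std::size_t>(y) * width + x) * 3 + c];
                hist[c * 256 + v] += 1;         // update: f(lum, c) += 1
            }
    return hist;
}
```

Since a `uint8_t` value is already in [0, 255], no clamp is needed here; in Halide the `clamp` gives bounds inference a provable range for the bin index.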

K-means in Halide
(images: 256 colors vs. 4 colors)

K-means in Halide
    Func clustersFunc("clustersFunc"), kmeans("kmeans");
    Buffer<uint8_t> clusters({k});
    Buffer<uint8_t> input((uint8_t*)src, width * height);

    Func clustersMap("clustersMap");
    Var x("x");
    RDom r(0, k);
    clustersMap(x) = argmin(abs(cast<int16_t>(input(x)) -
                                cast<int16_t>(clusters(r))))[0];

K-means in Halide
    Var i("i");
    RDom im(input);

    Func count("count");
    count(i) = 0;
    Expr clusterId = clamp(clustersMap(im), 0, k - 1);
    count(clusterId) += 1;

    Func s("s");
    s(i) = 0;
    s(clusterId) += cast<uint32_t>(input(im));

K-means in Halide
    clustersFunc(i) = cast<uint8_t>(s(i) / max(count(i), 1));
    clustersFunc.estimate(i, 0, k);

    kmeans(x) = clustersFunc(clamp(clustersMap(x), 0, k - 1));
    kmeans.estimate(x, 0, width * height);

    Pipeline({clustersFunc, kmeans}).auto_schedule(get_host_target());

    for (int i = 0; i < k; ++i)
        clusters(i) = rand() % 256;

    for (int i = 0; i < 14; ++i)
        clustersFunc.realize(clusters);

    Buffer<uint8_t> output((uint8_t*)dst, width * height);
    kmeans.realize(output);
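Each `clustersFunc.realize(clusters)` call above performs one k-means update on 8-bit values: assign every sample to its nearest center, then replace each center with the mean of its samples. A minimal plain-C++ sketch of one such iteration is given below; `kmeans_step` is a hypothetical name, and ties are resolved toward the lowest cluster index as an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// One k-means iteration on 8-bit samples.
std::vector<uint8_t> kmeans_step(const std::vector<uint8_t>& input,
                                 const std::vector<uint8_t>& clusters) {
    assert(!clusters.empty());
    const std::size_t k = clusters.size();
    std::vector<uint32_t> sum(k, 0), count(k, 0);
    for (uint8_t v : input) {
        std::size_t best = 0;
        int best_d = 256;  // larger than any possible |a - b| for 8-bit values
        for (std::size_t j = 0; j < k; ++j) {  // argmin over the cluster centers
            const int d = std::abs(static_cast<int>(v) - static_cast<int>(clusters[j]));
            if (d < best_d) { best_d = d; best = j; }
        }
        sum[best] += v;     // s(clusterId) += input
        count[best] += 1;   // count(clusterId) += 1
    }
    std::vector<uint8_t> next(k);
    for (std::size_t j = 0; j < k; ++j)
        next[j] = static_cast<uint8_t>(sum[j] / std::max<uint32_t>(count[j], 1));
    return next;
}
```

As in the slide's formula, a cluster that receives no samples is reset to 0, because its sum is 0 and the divisor is clamped to 1.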

Run Halide code on GPU

    void gamma(const uint8_t* src, uint8_t* dst, int height, int width,
               float gamma, bool gpu) {
        static Func f("correction");
        static Buffer<uint8_t> inp((uint8_t*)src, 3, width, height),
                               out(dst, 3, width, height);
        if (!f.defined()) {
            Var x("x"), y("y"), c("c"), xo("xo"), xi("xi"), yo("yo"), yi("yi");
            f(c, x, y) = Halide::cast<uint8_t>(255 * pow(inp(c, x, y) * 1.0f / 255, gamma));
            Halide::Target t = Halide::get_host_target();
            if (gpu) {
                t.set_feature(Halide::Target::OpenCL);
                f.bound(x, 0, width).bound(y, 0, height).bound(c, 0, 3)
                 .split(x, xo, xi, 16).split(y, yo, yi, 16)
                 .reorder(xi, yi, c, xo, yo)
                 .gpu_blocks(c, xo, yo)
                 .gpu_threads(xi, yi);
            } else {
                f.estimate(x, 0, width).estimate(y, 0, height).estimate(c, 0, 3);
                Pipeline(f).auto_schedule(t);
            }
            f.compile_jit(t);
        }
        inp.set_host_dirty();
        f.realize(out);
        out.copy_to_host();
    }
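The GPU branch maps the outer loops `(c, xo, yo)` to blocks and the inner 16x16 loops `(xi, yi)` to threads. The following serial plain-C++ emulation is a sketch of that decomposition, not of the generated kernel; `gamma_tiled` is a hypothetical name, and the bounds check used for partial tiles is an assumption of the sketch rather than a statement about how Halide handles them.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Serial emulation: one "block" per (c, xo, yo), one "thread" per (xi, yi).
// Data layout matches Buffer(ptr, 3, width, height): index = (y * width + x) * 3 + c.
void gamma_tiled(const std::vector<uint8_t>& in, std::vector<uint8_t>& out,
                 int width, int height, float gamma) {
    assert(in.size() == out.size());
    assert(in.size() == static_cast<std::size_t>(3) * width * height);
    const int tile = 16;
    const int nxo = (width + tile - 1) / tile;
    const int nyo = (height + tile - 1) / tile;
    for (int yo = 0; yo < nyo; ++yo)
        for (int xo = 0; xo < nxo; ++xo)
            for (int c = 0; c < 3; ++c)                    // block indices
                for (int yi = 0; yi < tile; ++yi)
                    for (int xi = 0; xi < tile; ++xi) {    // thread indices
                        const int x = xo * tile + xi, y = yo * tile + yi;
                        if (x >= width || y >= height) continue;  // partial tile
                        const std::size_t i =
                            (static_cast<std::size_t>(y) * width + x) * 3 + c;
                        out[i] = static_cast<uint8_t>(
                            255 * std::pow(in[i] * 1.0f / 255, gamma));
                    }
}
```

Each (block, thread) pair touches exactly one element, which is why the per-pixel `pow` parallelizes without any synchronization.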

Ahead-of-time compilation
(diagram boxes: Func f("bgr2gray"); Halide; Intermediate representation (IR); LLVM; bgr2gray.a (*.lib); bgr2gray.o (*.obj); bgr2gray.h)

Ahead-of-time compilation
    uint16_t R2GRAY = 77, G2GRAY = 150, B2GRAY = 29;
    Func f("bgr2gray");
    ImageParam input(Halide::UInt(8), 3, "input");
    input.dim(0).set_bounds_estimate(0, 3).set_stride(1);
    input.dim(1).set_bounds_estimate(0, 640).set_stride(3);
    input.dim(2).set_bounds_estimate(0, 480).set_stride(3*640);
    Var x("x"), y("y");
    Expr b = cast<uint16_t>(input(0, x, y));
    Expr g = cast<uint16_t>(input(1, x, y));
    Expr r = cast<uint16_t>(input(2, x, y));
    f(x, y) = cast<uint8_t>((R2GRAY * r + G2GRAY * g + B2GRAY * b) >> 8);
    f.estimate(x, 0, 640).estimate(y, 0, 480);
    Pipeline(f).auto_schedule(get_host_target());
    f.compile_to_static_library("bgr2gray", {input}, "bgr2gray");

    #include <opencv2/opencv.hpp>
    #include "bgr2gray.h"

    int main(int argc, char** argv) {
        cv::Mat frame(480, 640, CV_8UC3), res(480, 640, CV_8UC1);

        // Zero-initialize so that fields not set below (device, flags, ...) are 0.
        halide_buffer_t inpBuffer = {0};
        inpBuffer.type = halide_type_t(halide_type_uint, 8);
        inpBuffer.host = frame.data;
        // (min, extent, stride) for channel, x, y of the interleaved BGR frame
        halide_dimension_t inpDims[] = {halide_dimension_t(0, 3, 1),
                                        halide_dimension_t(0, 640, 3),
                                        halide_dimension_t(0, 480, 3*640)};
        inpBuffer.dim = &inpDims[0];
        inpBuffer.dimensions = 3;

        halide_buffer_t outBuffer = {0};
        outBuffer.type = halide_type_t(halide_type_uint, 8);
        outBuffer.host = res.data;
        halide_dimension_t outDims[] = {halide_dimension_t(0, 640, 1),
                                        halide_dimension_t(0, 480, 640)};
        outBuffer.dim = &outDims[0];
        outBuffer.dimensions = 2;

        cv::VideoCapture cap(0);
        while (cv::waitKey(1) < 0) {
            cap >> frame;
            if (frame.empty()) break;
            bgr2gray(&inpBuffer, &outBuffer);
            cv::imshow("Output", res);
        }
        return 0;
    }
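The three numbers in each dimension above are minimum, extent and stride, and together they determine where an element lives in memory. The sketch below shows that address arithmetic with a small stand-in struct; `Dim` and `offset_of` are hypothetical names that only mirror the (min, extent, stride) triple and are not part of the Halide runtime.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a (min, extent, stride) dimension descriptor.
struct Dim {
    int min;
    int extent;
    int stride;
};

// Element offset of `coords` in a buffer described by `dims`:
// the sum over dimensions of (coord - min) * stride.
inline std::ptrdiff_t offset_of(const Dim* dims, const int* coords, int n) {
    std::ptrdiff_t off = 0;
    for (int i = 0; i < n; ++i) {
        assert(coords[i] >= dims[i].min && coords[i] < dims[i].min + dims[i].extent);
        off += static_cast<std::ptrdiff_t>(coords[i] - dims[i].min) * dims[i].stride;
    }
    return off;
}
```

With the input layout from the example, channel 2 of pixel (x = 1, y = 1) sits at offset 2 * 1 + 1 * 3 + 1 * 1920 = 1925, which matches the interleaved BGR layout of a 640-pixel-wide `cv::Mat`.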

Summary
§ One code – many devices
§ Split algorithm and optimization
§ High-level computations management
§ Domain-specific compilation with no Halide dependency
§ Write algorithms “by definition”

Q&A
