Cell Applications and Solutions: Histogram Equalization with Cell Broadband

Content
- Overview: Histogram Equalization
- Definitions
- Assumptions, Highlights
- Approach: Histogram Computation
- Approach: Transform Image
- Performance Results

IBM Confidential 9/30/2020 © 2006 IBM Research

Overview: Histogram Equalization
- One of the most significant parts of image processing
- Improves contrast by redistributing pixel intensities
- Goal: an output image whose histogram is as close to uniform as possible

Three stages:
1. Compute
2. Normalize
3. Transform

Definitions
First stage: computing the histogram
- Parse the input image
- Count the occurrences of each distinct pixel value in the image
- Example: for 8-bit pixels, the maximum pixel value is 255 and the histogram array has 256 entries

Second stage: computing the normalized sum of the histogram
- For each pixel value, store the running sum of the histogram counts up to and including that value
- Normalize by multiplying each running sum by (maximum pixel value / number of pixels)

Third stage: transforming the input image into the output image
- Use the normalized array from the second stage as a lookup table that maps each input pixel value to its new value
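The second and third stages can be written compactly as a formula. This is a sketch consistent with the definitions above; the symbols h (histogram), N (number of pixels), and L_max (maximum pixel value, 255 for 8-bit data) are notation introduced here, not taken from the slides, and the rounding by adding 0.5 is assumed to match the scalar code.

```latex
\[
\mathrm{map}[k] \;=\; \min\!\Bigl(L_{\max},\ \Bigl\lfloor \frac{L_{\max}}{N} \sum_{j=0}^{k} h[j] + 0.5 \Bigr\rfloor\Bigr),
\qquad k = 0, \dots, L_{\max}
\]
\[
\mathrm{out}[p] \;=\; \mathrm{map}\bigl[\mathrm{in}[p]\bigr]
\quad \text{for every pixel position } p
\]
```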

Assumptions, Highlights
Assumptions for demo:
- 8-bit color scale

Approach highlights:
- Parallelize
- Reduce dependencies
- Unroll loops
- SIMDize the code using vectors and SPE intrinsics

Scalar Code Flow

    #define ROUND(v)    ((int)((v) + 0.5))   //!-- Round it to the closest integer
    #define __min(a, b) ( ((a) < (b)) ? (a) : (b) )
    #define __max(a, b) ( ((a) > (b)) ? (a) : (b) )
    #define BOUND(v)    (unsigned char)(__min(255, __max((v), 0)))   // 0-255

    {
        int size = PIXEL_DATA_SIZE;
        unsigned char map[256];      /* lookup table: one entry per pixel value */
        unsigned char src[size];
        unsigned char dest[size];
        unsigned int counts[256];    /* histogram: one bin per pixel value */
        double sc;
        long v;
        int i;
        unsigned int sum = 0;

        /* Clear the 256 histogram bins (independent of the image size). */
        for (i = 0; i < 256; i++) {
            counts[i] = 0;
        }
        /* Fill the input with random test pixels. */
        for (i = 0; i < size; i++) {
            src[i] = random() & 0xFF;
        }

        /* Compute histogram */
        for (i = 0; i < size; i++) {
            counts[src[i]]++;
        }

        /* Normalized sum of histogram (one entry per pixel value, not per pixel) */
        sc = PIXEL_MAX_VALUE / (double) IMAGE_SIZE;
        for (i = 0; i < 256; i++) {
            sum += counts[i];
            v = ROUND(sc * sum);
            map[i] = BOUND(v);
        }

        /* Transform */
        for (i = 0; i < size; i++) {
            dest[i] = map[src[i]];
        }
    }
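For readers who want to try the scalar flow outside the demo harness, the following is a minimal self-contained sketch of the same three stages. The function name equalize_u8 and its signature are hypothetical and do not appear in the slides; the maximum pixel value is fixed at 255 under the 8-bit assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the slides): equalize n 8-bit pixels. */
void equalize_u8(const uint8_t *src, uint8_t *dst, size_t n)
{
    uint32_t counts[256] = {0};   /* stage 1: histogram bins */
    uint8_t map[256];             /* stage 2: lookup table */
    uint64_t sum = 0;
    double sc;
    size_t i;
    int k;

    assert(src != NULL && dst != NULL && n > 0);

    /* Stage 1: count each pixel value. */
    for (i = 0; i < n; i++)
        counts[src[i]]++;

    /* Stage 2: running sum, scaled by 255 / n, rounded and clamped. */
    sc = 255.0 / (double)n;
    for (k = 0; k < 256; k++) {
        long v;
        sum += counts[k];
        v = (long)(sc * (double)sum + 0.5);
        if (v > 255)
            v = 255;
        map[k] = (uint8_t)v;
    }

    /* Stage 3: map every input pixel through the table. */
    for (i = 0; i < n; i++)
        dst[i] = map[src[i]];
}
```

For the four pixels {10, 10, 20, 30} the scale factor is 255/4 = 63.75, so the running sums 2, 3, 4 map to 128, 191, 255 and the output is {128, 128, 191, 255}.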

Histogram Computation
- Input is read as vector unsigned char: 16 bytes are loaded at a time to match the 128-bit register boundary.
- The upper 6 bits of each byte determine which element of the 64-element counter array it goes to.
- The lower 2 bits decide which of the four 32-bit slots inside that element is incremented.
- Example: the byte 110000 10 increments slot '10' (the 3rd slot) of Counter0[48].
- The counters are 64-element arrays of vector unsigned int (128 bits), each element containing four 32-bit counters. Four such arrays (Counter 0 to Counter 3) are created to enable parallel computation and loop unrolling.

[Figure: a 16-byte data array (Byte 0 to Byte F) feeding four 64-element counter arrays, each element holding slots 00, 01, 10, 11]
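The index and slot split can be tried in portable C without SPE intrinsics. The sketch below is an emulation under stated assumptions: the struct quad_u32 stands in for one vector unsigned int, the function name hist_split_counters is hypothetical, and assigning input bytes to the four counter arrays in round-robin order is an assumption rather than the exact order used on the SPE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for one vector unsigned int: four 32-bit slots. */
typedef struct { uint32_t slot[4]; } quad_u32;

/* Hypothetical helper: histogram via four independent 64-element counter arrays. */
void hist_split_counters(const uint8_t *data, size_t n, uint32_t counts[256])
{
    quad_u32 cnts[4][64];
    size_t i;
    unsigned idx, s, set;

    assert(data != NULL && counts != NULL);
    memset(cnts, 0, sizeof cnts);

    for (i = 0; i < n; i++) {
        uint8_t b = data[i];
        set = (unsigned)(i & 3u);   /* assumed round-robin choice of counter array */
        idx = b >> 2;               /* upper 6 bits: element 0..63 */
        s   = b & 3u;               /* lower 2 bits: slot 0..3 */
        cnts[set][idx].slot[s]++;
    }

    /* Roll the four counter arrays into the overall count array. */
    for (idx = 0; idx < 64; idx++) {
        for (s = 0; s < 4; s++) {
            counts[4 * idx + s] = cnts[0][idx].slot[s] + cnts[1][idx].slot[s]
                                + cnts[2][idx].slot[s] + cnts[3][idx].slot[s];
        }
    }
}
```

Keeping four separate arrays means consecutive updates never touch the same counter array, which is what removes the dependency between the unrolled updates.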

Code sections for Histogram computation

Histogram loop (i, size, data, and one are declared outside this excerpt):

    unsigned int idx_0, idx_1, idx_2, idx_3;
    int slot_0, slot_1, slot_2, slot_3;
    vector unsigned char in;
    vector unsigned char *vdata;
    vector unsigned int *vcounts;
    vector unsigned int in_0, in_1, in_2, in_3;
    vector unsigned int cnts_0[64];
    vector unsigned int cnts_1[64];
    vector unsigned int cnts_2[64];
    vector unsigned int cnts_3[64];

    vdata = (vector unsigned char *)(data);
    for (i=15; i<size; i+=16) {
        in = *vdata++;

        //!-- Loop Unroll 1:
        //!-- Handle the first 16 bytes from the input string
        /* Separate the four bytes of each 32-bit word. */
        in_0 = spu_and((vector unsigned int)(in), 0xFF);
        in_1 = spu_and(spu_rlmask((vector unsigned int)(in), -8), 0xFF);
        in_2 = spu_and(spu_rlmask((vector unsigned int)(in), -16), 0xFF);
        in_3 = spu_rlmask((vector unsigned int)(in), -24);

        idx_0 = spu_extract(in_0, 0);
        idx_1 = spu_extract(in_1, 0);
        idx_2 = spu_extract(in_2, 0);
        idx_3 = spu_extract(in_3, 0);

        /* Byte-rotate amount that places the increment in the slot chosen by the low 2 bits. */
        slot_0 = (0 - idx_0) << 2;
        slot_1 = (0 - idx_1) << 2;
        slot_2 = (0 - idx_2) << 2;
        slot_3 = (0 - idx_3) << 2;

        /* Upper 6 bits: element 0..63 of the counter array. */
        idx_0 >>= 2;
        idx_1 >>= 2;
        idx_2 >>= 2;
        idx_3 >>= 2;

        cnts_0[idx_0] = spu_add(cnts_0[idx_0], spu_rlqwbyte(one, slot_0));
        cnts_1[idx_1] = spu_add(cnts_1[idx_1], spu_rlqwbyte(one, slot_1));
        cnts_2[idx_2] = spu_add(cnts_2[idx_2], spu_rlqwbyte(one, slot_2));
        cnts_3[idx_3] = spu_add(cnts_3[idx_3], spu_rlqwbyte(one, slot_3));

        //!-- Repeat for 1, 2, 3 (this is repeated four times)
    }

Counter roll-up:

    //!-- Loop Unroll 2:
    /* Roll the counters into the overall (external) count array. */
    for (i=0; i<64; i+=4) {
        vector unsigned int sum0, sum1, sum2, sum3;

        sum0 = spu_add(cnts_0[i],   cnts_1[i]);
        sum1 = spu_add(cnts_0[i+1], cnts_1[i+1]);
        sum2 = spu_add(cnts_0[i+2], cnts_1[i+2]);
        sum3 = spu_add(cnts_0[i+3], cnts_1[i+3]);

        sum0 = spu_add(sum0, cnts_2[i]);
        sum1 = spu_add(sum1, cnts_2[i+1]);
        sum2 = spu_add(sum2, cnts_2[i+2]);
        sum3 = spu_add(sum3, cnts_2[i+3]);

        vcounts[i]   = spu_add(sum0, cnts_3[i]);
        vcounts[i+1] = spu_add(sum1, cnts_3[i+1]);
        vcounts[i+2] = spu_add(sum2, cnts_3[i+2]);
        vcounts[i+3] = spu_add(sum3, cnts_3[i+3]);
    }

The roll-up section above combines the 4 counter arrays into one.

Normalized Sum
1. Compute the running sum over the 64 vector entries
2. Multiply by the normalization constant
3. Clamp to 0-255
4. Store in a character map LUT

[Figure: each count vector v = count[i] is expanded into four copies v0 to v3, with X marking zeroed slots, and the copies are added to form the running sum]

    float sc = PIXEL_MAX_VALUE / (float) IMAGE_SIZE;
    vector float vc = spu_splats((float)sc);
    float scr = 0.5;
    vector float vr = spu_splats((float)scr);
    vector float vf1, vf2;
    /* Shuffle patterns written out to all 16 bytes; a control byte of 128 yields a zero byte. */
    vector unsigned char splat0 = (vector unsigned char){0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3};
    vector unsigned char splat1 = (vector unsigned char){128, 128, 128, 128, 4, 5, 6, 7, 4, 5, 6, 7, 4, 5, 6, 7};
    vector unsigned char splat2 = (vector unsigned char){128, 128, 128, 128, 128, 128, 128, 128, 8, 9, 10, 11, 8, 9, 10, 11};
    vector unsigned char splat3 = (vector unsigned char){12, 13, 14, 15, 12, 13, 14, 15, 12, 13, 14, 15, 12, 13, 14, 15};
    vector unsigned int mask3 = (vector unsigned int){0, 0, 0, -1};

    //!-- TODO: Convert it so the computation is pipelined.
    TRACE("Print the final character map: \n");
    for(i=0; i<size; i++) {                      /* one iteration per vector entry of counts */
        v = counts[i];
        sum = spu_shuffle(sum, sum, splat3);     /* broadcast the previous running total */
        v0 = spu_shuffle(v, v, splat0);          /* (c0, c0, c0, c0) */
        v1 = spu_shuffle(v, v, splat1);          /* (0,  c1, c1, c1) */
        v2 = spu_shuffle(v, v, splat2);          /* (0,  0,  c2, c2) */
        v3 = spu_and(v, mask3);                  /* (0,  0,  0,  c3) */
        sum = spu_add(spu_add(spu_add(sum, v3), v2), spu_add(v1, v0));

        //!-- Normalize, round it
        vf2 = spu_convtf(sum, 0);
        vf1 = spu_madd(vf2, vc, vr);
        mapvi[i] = spu_convtu(vf1, 0);

        for(j=0; j<4; j++) {
            var = spu_extract(mapvi[i], j);
            map[k] = BOUND(var);                 //!-- TODO vectorize this
            TRACE("%d ", map[k]);
            k++;
        }
    }
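The four-way running sum can be checked in portable C. The sketch below mirrors the v0 to v3 expansion with plain arrays in place of vectors; the function name build_map_by_quads is hypothetical, the maximum pixel value is assumed to be 255, and ordinary float arithmetic stands in for spu_madd, so rounding in edge cases may differ from the SPE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build the 256-entry map four histogram bins at a time. */
void build_map_by_quads(const uint32_t counts[256], uint32_t npixels, uint8_t map[256])
{
    float sc;
    uint32_t carry = 0;   /* running total carried from the previous group of four */
    int i, j;

    assert(counts != NULL && map != NULL && npixels > 0);
    sc = 255.0f / (float)npixels;

    for (i = 0; i < 64; i++) {
        const uint32_t *v = &counts[4 * i];
        /* Shifted copies of the four counts, zero where a count must not contribute. */
        uint32_t v0[4] = { v[0], v[0], v[0], v[0] };
        uint32_t v1[4] = { 0,    v[1], v[1], v[1] };
        uint32_t v2[4] = { 0,    0,    v[2], v[2] };
        uint32_t v3[4] = { 0,    0,    0,    v[3] };
        uint32_t sum[4];

        for (j = 0; j < 4; j++) {
            float f;
            uint32_t u;
            sum[j] = carry + v0[j] + v1[j] + v2[j] + v3[j];
            f = (float)sum[j] * sc + 0.5f;              /* normalize and round */
            u = (uint32_t)f;
            map[4 * i + j] = (uint8_t)(u > 255u ? 255u : u);   /* clamp to 0..255 */
        }
        carry = sum[3];
    }
}
```

Adding the four shifted copies produces all four partial sums of a group with no dependency between lanes; only the carry links one group to the next.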

Transform the image
- Byte shuffle using the MSB 5 bits
- Select using index bit 2
- Select using index bit 1
- Select using index bit 0

[Figure: lookup-table blocks 0-15, 16-31, 32-47, 48-63, 64-79, 80-95, 96-111, 112-127, ..., 224-239, 240-255 feed byte shuffles whose eight results (0 to 7) are narrowed to one output by three levels of selects]
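The shuffle-then-select lookup can be emulated in portable C. This sketch assumes the 256-entry table is kept in order as eight groups of 32 entries, that five bits of each pixel index within a group (one byte shuffle per group on the SPE), and that the remaining three bits drive the select tree. The exact bit assignment in the original implementation is not recoverable from the slide, and the function name transform_shuffle_select is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models a select driven by one index bit: returns b when the bit is set, else a. */
static uint8_t sel(uint8_t a, uint8_t b, unsigned bit)
{
    return bit ? b : a;
}

/* Hypothetical helper: table lookup done as 8 group lookups plus 3 select levels. */
void transform_shuffle_select(const uint8_t map[256], const uint8_t *src,
                              uint8_t *dst, size_t n)
{
    size_t i;

    assert(map != NULL && src != NULL && dst != NULL);

    for (i = 0; i < n; i++) {
        unsigned x  = src[i];
        unsigned lo = x & 31u;   /* position inside a 32-entry group (assumed split) */
        unsigned g  = x >> 5;    /* 3-bit group index */
        uint8_t r[8], a[4], b[2];
        unsigned p;

        /* One candidate per group; on the SPE each would be one byte shuffle. */
        for (p = 0; p < 8; p++)
            r[p] = map[32u * p + lo];

        /* Three select levels, driven by index bits 2, 1, 0. */
        for (p = 0; p < 4; p++)
            a[p] = sel(r[p], r[p + 4], (g >> 2) & 1u);
        for (p = 0; p < 2; p++)
            b[p] = sel(a[p], a[p + 2], (g >> 1) & 1u);
        dst[i] = sel(b[0], b[1], g & 1u);
    }
}
```

The result equals map[src[i]] for every pixel; the point of the structure is that all eight candidates are computed unconditionally, so the lookup contains no data-dependent memory access or branch.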

Performance Results
Environment:
- The benchmark was written in C and compiled with the xlc compiler.
- IBM Systemsim and a Cell Blade were used to collect performance numbers.
- Sample grayscale image (pieh2.pgm)

Configuration:
- The Cell Blade runs at 3.2 GHz.
- DMA operations are not counted in the calculation.
- Performance numbers are derived from cycle counts collected on a single SPE.

Performance numbers:
- Histogram computation and image mapping (stages 1, 2, 3) combined at 0.50 Gigapixels/second for 100 K
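As a rough cross-check, the reported throughput can be converted into a per-pixel cycle budget. This arithmetic uses only the two figures above (3.2 GHz clock, 0.50 Gigapixels/second on a single SPE) and, like the reported number, excludes DMA time.

```latex
\[
\frac{3.2 \times 10^{9}\ \text{cycles/s}}{0.50 \times 10^{9}\ \text{pixels/s}}
\;=\; 6.4\ \text{cycles per pixel}
\]
```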