RASTER IMAGE PROCESSING ON THE TMS320C6X VLIW DSP

Prof. Brian L. Evans
in collaboration with Niranjan Damera-Venkata and Wade Schwartzkopf
Embedded Signal Processing Laboratory
The University of Texas at Austin, TX 78712-1084
http://signal.ece.utexas.edu/

[Figure: accumulator, memory-register, and load-store architectures]

Outline

• Introduction
• Color conversion
• Interpolation
• Halftoning
• C/C++ coding tips
• Conclusion

Introduction

• Raster scan
• Raster image processing
  - Process one or more rows at a time
  - Pixel operations: color conversion, ordered dither halftoning
  - Local operations: JPEG coding, FIR filtering, interpolation, error diffusion halftoning

Raster Image Processing on the TMS320C6x

• The TMS320C6x works best with 16-bit data
  - Bytes per image pixel: 1 for greyscale, 3 or 4 for color
  - Trade-off: process bytes at reduced processor performance, or store each byte in a 16-bit word and double the memory
• Number of 4800-pixel rows (e.g. 8 in. at 600 dpi) of a greyscale image that can fit into memory

[Table: rows per processor and memory configuration; values not preserved]

** The C6211 has 512 kbits of L2 on-chip cache, all of it used for the image.

Color Spaces

• RGB: Red Green Blue
  - Additive color
  - CRT displays
• YCrCb: Luminance Chrominance
  - Decouples intensity and color information
  - Digital image/video compression standards (digital TV)
  - Eye is less sensitive to chrominance than luminance: subsample Cr/Cb without significant visual degradation
• CMY(K): Cyan Magenta Yellow (Black)
  - Subtractive color
  - Printing and photography
  - Black ink is used for improved color gamut, faster drying, and purer rendering of black and greys

ITU RGB to YCrCb Standards

• Nested conversion formulas
• 8-bit format

    Y  = Cred R + Cgreen G + Cblue B       Y, R, G, B in [0, 255]
    Cr = (R - Y) / (2 - 2 Cred)            Cr in [-128, 127]
    Cb = (B - Y) / (2 - 2 Cblue)           Cb in [-128, 127]

• Y is lossless, but Cr/Cb is clipped to [-128, 127]
• Assume that RGB has been gamma corrected
• Rec. 601-1 is used with the TIFF and JPEG standards

ITU YCrCb to RGB Standards

• Nested conversion formulas
• 8-bit format

    R = Y + (2 - 2 Cred) Cr                Y, R, G, B in [0, 255]
    B = Y + (2 - 2 Cblue) Cb               Cr in [-128, 127]
    G = (Y - Cblue B - Cred R) / Cgreen    Cb in [-128, 127]

• Range of G is [-134, 390] for Rec. 601-1
• RGB values are clipped to [0, 255] and rounded

http://www.neuro.sfc.keio.ac.jp/~aly/polygon/info/color-space-faq.html
http://www.inforamp.net/~poynton/notes/colour_and_gamma/ColorFAQ.html

RGB/YCrCb Conversion in Floating Point

• Nested formulas

    Y  = 0.2989 R + 0.5866 G + 0.1145 B    R = Y + 1.4022 Cr
    Cr = 0.7132 (R - Y)                    B = Y + 1.7710 Cb
    Cb = 0.5647 (B - Y)                    G = 1.7047 Y - 0.1952 B - 0.5095 R

• Matrix multiplication

    [Y Cr Cb]^T = M [R G B]^T, with

        |  0.2989   0.5866   0.1145 |
    M = |  0.5000  -0.4183  -0.0817 |
        | -0.1688  -0.3312   0.5000 |

    [R G B]^T = M^-1 [Y Cr Cb]^T, with

           | 1   1.4022   0      |
    M^-1 = | 1  -0.7145  -0.3458 |
           | 1   0        1.7710 |

• Round and clip each quantity to eight bits
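The floating-point conversion can be sketched in C as follows. This is a minimal illustration, not the presenters' code: the function names and the `round_clip` helper are hypothetical, and the coefficients are the ones listed on the slide.

```c
#include <stdint.h>

/* Round to the nearest integer, then clip to [lo, hi]. */
static int round_clip(double x, int lo, int hi)
{
    int v = (int)(x >= 0.0 ? x + 0.5 : x - 0.5);
    return v < lo ? lo : (v > hi ? hi : v);
}

/* RGB -> YCrCb using the nested formulas (hypothetical function name). */
void rgb_to_ycrcb(uint8_t r, uint8_t g, uint8_t b, int *y, int *cr, int *cb)
{
    double yf = 0.2989 * r + 0.5866 * g + 0.1145 * b;
    *y  = round_clip(yf, 0, 255);
    *cr = round_clip(0.7132 * (r - yf), -128, 127);
    *cb = round_clip(0.5647 * (b - yf), -128, 127);
}

/* YCrCb -> RGB using the inverse matrix; G may leave [0, 255] and is clipped. */
void ycrcb_to_rgb(int y, int cr, int cb, uint8_t *r, uint8_t *g, uint8_t *b)
{
    *r = (uint8_t)round_clip(y + 1.4022 * cr, 0, 255);
    *g = (uint8_t)round_clip(y - 0.7145 * cr - 0.3458 * cb, 0, 255);
    *b = (uint8_t)round_clip(y + 1.7710 * cb, 0, 255);
}
```

Saturated red shows why clipping is needed: Cr evaluates to about 127.5, which rounds to 128 and is clipped to 127.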

RGB/YCrCb Conversion in Fixed Point

• Multiplication by direct calculation
  - Quantize coefficients
  - Put coefficients in registers and stream pixels
  - Highly accurate under extended precision accumulation
  - Well-matched to DSPs and graphics cards
• Multiplication by table lookup
  - Precalculate multiplications (floating point times byte)
  - Store in 5 (9) 256-byte tables for nested (matrix) formulas, which means a 5 (9) times increase in memory bandwidth and poor cache performance
  - Do not need extended precision accumulators
  - Well-matched to ASICs and microcontrollers
• Additions: 4 (6) for nested (matrix) formulas
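The table-lookup alternative can be sketched for the luminance term alone. This is an illustrative sketch under stated assumptions: the table and function names are hypothetical, and the products are stored as 32-bit Q15 values rather than in the byte-wide tables the slide mentions.

```c
#include <stdint.h>

/* One 256-entry table per coefficient (hypothetical names, Q15 entries). */
static int32_t tab_r[256], tab_g[256], tab_b[256];

/* Precompute coefficient-times-byte products once. */
void build_luma_tables(void)
{
    for (int i = 0; i < 256; i++) {
        tab_r[i] = (int32_t)(0.2989 * i * 32768.0 + 0.5);
        tab_g[i] = (int32_t)(0.5866 * i * 32768.0 + 0.5);
        tab_b[i] = (int32_t)(0.1145 * i * 32768.0 + 0.5);
    }
}

/* Y from three lookups and additions; no multiplication per pixel. */
int luma_lookup(uint8_t r, uint8_t g, uint8_t b)
{
    int32_t y = (tab_r[r] + tab_g[g] + tab_b[b] + 16384) >> 15;
    return y > 255 ? 255 : (int)y;
}
```

Each pixel now costs three table reads instead of three multiplications, which is the memory-bandwidth trade-off the slide points out.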

RGB/YCrCb Conversion on 16-bit DSP

• Move the 8-bit color quantity to the upper 8 of 16 bits
• Use 16 x 16 multiplication, 32-bit accumulation
• Nested formulas (coefficients scaled by 2^15 - 1)

    Y  = 9794 R + 19221 G + 3752 B         R = Y + 2 (22973 Cr)
    Cr = 23369 (R - Y)                     B = Y + 2 (29015 Cb)
    Cb = 18504 (B - Y)                     G = 2 (27929) Y - 6396 B - 16695 R

• Matrix multiplication (coefficients scaled by 2^15 - 1)

    [Y Cr Cb]^T = M [R G B]^T, with

        |  9794   19221    3752 |
    M = | 16384  -13706   -2677 |
        | -5531  -10852   16384 |

    [R G B]^T = M^-1 [Y Cr Cb]^T, with

           | 32767   2 (22973)    0        |
    M^-1 = | 32767  -23412       -11331    |
           | 32767   0            2 (29015) |
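The scaled coefficients can be exercised in portable C. This is a minimal sketch with hypothetical function names; as a simplifying assumption it keeps each 8-bit value in the low bits of a 32-bit accumulator instead of moving it to the upper byte of a 16-bit word as the slide describes.

```c
#include <stdint.h>

/* Divide a Q15 accumulator by 2^15, rounding to nearest (sign-aware). */
static int32_t q15_round(int32_t acc)
{
    return acc >= 0 ? (acc + 16384) / 32768 : -((-acc + 16384) / 32768);
}

static int clip(int32_t v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : (int)v);
}

/* Nested RGB -> YCrCb with 16-bit coefficients and 32-bit accumulation. */
void rgb_to_ycrcb_q15(uint8_t r, uint8_t g, uint8_t b, int *y, int *cr, int *cb)
{
    int32_t yq = q15_round(9794 * (int32_t)r + 19221 * (int32_t)g
                           + 3752 * (int32_t)b);
    *y  = clip(yq, 0, 255);
    *cr = clip(q15_round(23369 * ((int32_t)r - yq)), -128, 127);
    *cb = clip(q15_round(18504 * ((int32_t)b - yq)), -128, 127);
}
```

The largest accumulator value is about 32767 x 255, well inside 32 bits, so no intermediate overflow occurs.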

RGB to CMY and CMYK

• RGB to CMY (ideal case)
  - C = 255 - R
  - M = 255 - G
  - Y = 255 - B
• CMY to CMYK
  - K = min(C, M, Y)
  - C = 255 (C - K) / (255 - K)
  - M = 255 (M - K) / (255 - K)
  - Y = 255 (Y - K) / (255 - K)
  - 2-D lookup tables
• RGB to CMYK, with m = max(R, G, B)
  - K = 255 - max(R, G, B)
  - C = 255 (m - R) / m
  - M = 255 (m - G) / m
  - Y = 255 (m - B) / m
• Division takes 1 or 2 instructions per bit of precision in the result
• R, G, B, C, M, Y, and K have a range of [0, 255]
• Useful in printers and copiers
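The direct RGB to CMYK formulas can be sketched in C. The type and function names are hypothetical, and the handling of pure black (m = 0, where the formulas would divide by zero) is an assumption: here K is set to 255 and C, M, Y to 0.

```c
#include <stdint.h>

typedef struct { uint8_t c, m, y, k; } Cmyk;   /* hypothetical type */

Cmyk rgb_to_cmyk(uint8_t r, uint8_t g, uint8_t b)
{
    Cmyk out = { 0, 0, 0, 255 };   /* pure black: avoids dividing by m == 0 */
    int m = r > g ? r : g;         /* m = max(R, G, B) */
    if (b > m)
        m = b;
    if (m > 0) {
        out.k = (uint8_t)(255 - m);
        out.c = (uint8_t)(255 * (m - r) / m);
        out.m = (uint8_t)(255 * (m - g) / m);
        out.y = (uint8_t)(255 * (m - b) / m);
    }
    return out;
}
```

Integer division truncates, so (128, 64, 0) yields M = 127 rather than 127.5.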

Matrix Computation Example

    ; Texas Instruments, INC.
    ; MATRIX VECTOR MULTIPLY
    ; ftp://ftp.ti.com/pub/tms320bbs/c67xfiles/mvm.asm
    ;
    ; DESCRIPTION
    ;   A[][] * B[] = C[]
    ; ARGUMENTS PASSED
    ;   a[]     -> A4
    ;   b[]     -> B4
    ;   c[]     -> A6
    ;   rows    -> B6
    ;   columns -> A8
    ; CYCLES
    ;   (n + 20)*m + 1   (m = # of rows, n = # of columns)
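For reference, the computation that the assembly routine performs can be written as a plain C loop nest. This is a sketch of the arithmetic only, with a hypothetical function name and an assumed row-major layout for A; it does not reproduce the software pipelining of the TI listing.

```c
#include <stddef.h>

/* c = A * b, with A stored row by row (rows x cols), single precision. */
void mat_vec_mul(const float *a, const float *b, float *c,
                 size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {        /* outer loop: one output per row */
        float sum = 0.0f;                      /* zero the running sum */
        for (size_t j = 0; j < cols; j++)      /* inner loop: dot product */
            sum += a[i * cols + j] * b[j];
        c[i] = sum;
    }
}
```

The inner loop is the part the assembly version pipelines: two loads, one multiply, and one add per column.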

Matrix Computation Example (cont.)

    *** begin pipelining inner loop
               SUB   .L1X   rows,1,ocntr       ;
     ||        ADD   .L2    bptr,4,btmp        ;
     ||        LDW   .D1T1  *aptr++(4),aa0     ; 1 load a[i] from memory
     ||        LDW   .D2T2  *bptr,bb0          ; 1 load b[i] from memory
     ||        SUB   .S2X   colms,1,lcntr      ; load cntr = columns - 1

    oloop:
       [lcntr] LDW   .D1T1  *aptr++(4),aa0     ; 2 if(lcntr) load a[i] from memory
     ||[lcntr] LDW   .D2T2  *btmp++(4),bb0     ; 2 if(lcntr) load b[i] from memory
     ||[lcntr] SUB   .L2    lcntr,1,lcntr      ; if(lcntr) lcntr -= 1
     ||        SUB   .S1    colms,2,icntr      ; inner cntr = columns - 2
     ||        ZERO  .L1    sum0               ; zero the running sum

       [lcntr] LDW   .D1T1  *aptr++(4),aa0     ; 3 if(lcntr) load a[i] from memory
     ||[lcntr] LDW   .D2T2  *btmp++(4),bb0     ; 3 if(lcntr) load b[i] from memory
     ||[lcntr] SUB   .L2    lcntr,1,lcntr      ; if(lcntr) lcntr -= 1

       [lcntr] LDW   .D1T1  *aptr++(4),aa0     ; 4 if(lcntr) load a[i] from memory
     ||[lcntr] LDW   .D2T2  *btmp++(4),bb0     ; 4 if(lcntr) load b[i] from memory
     ||[lcntr] SUB   .L2    lcntr,1,lcntr      ; if(lcntr) lcntr -= 1

Matrix Computation Example (cont.)

       [lcntr] LDW   .D1T1  *aptr++(4),aa0     ; 5 if(lcntr) load a[i] from memory
     ||[lcntr] LDW   .D2T2  *btmp++(4),bb0     ; 5 if(lcntr) load b[i] from memory
     ||[lcntr] SUB   .L2    lcntr,1,lcntr      ; if(lcntr) lcntr -= 1
     ||        B     .S2    iloop              ; 1 branch to iloop

       [lcntr] LDW   .D1T1  *aptr++(4),aa0     ; 6 if(lcntr) load a[i] from memory
     ||[lcntr] LDW   .D2T2  *btmp++(4),bb0     ; 6 if(lcntr) load b[i] from memory
     ||[lcntr] SUB   .L2    lcntr,1,lcntr      ; if(lcntr) lcntr -= 1
     ||[icntr] SUB   .L1    icntr,1,icntr      ; if(icntr) icntr -= 1
     ||        MPYSP .M1X   aa0,bb0,mult0      ; 1 mult0 = a[i]*b[i]
     ||[icntr] B     .S2    iloop              ; 2 if(icntr) branch to iloop

       [lcntr] LDW   .D1T1  *aptr++(4),aa0     ; 7 if(lcntr) load a[i] from memory
     ||[lcntr] LDW   .D2T2  *btmp++(4),bb0     ; 7 if(lcntr) load b[i] from memory
     ||[lcntr] SUB   .L2    lcntr,1,lcntr      ; if(lcntr) lcntr -= 1
     ||[icntr] SUB   .L1    icntr,1,icntr      ; if(icntr) icntr -= 1
     ||        MPYSP .M1X   aa0,bb0,mult0      ; 2 mult0 = a[i]*b[i]
     ||[icntr] B     .S2    iloop              ; 3 if(icntr) branch to iloop

Matrix Computation Example (cont.)

       [lcntr] LDW   .D1T1  *aptr++(4),aa0     ; 8 if(lcntr) load a[i] from memory
     ||[lcntr] LDW   .D2T2  *btmp++(4),bb0     ; 8 if(lcntr) load b[i] from memory
     ||[lcntr] SUB   .L2    lcntr,1,lcntr      ; if(lcntr) lcntr -= 1
     ||[icntr] SUB   .L1    icntr,1,icntr      ; if(icntr) icntr -= 1
     ||        MPYSP .M1X   aa0,bb0,mult0      ; 3 mult0 = a[i]*b[i]
     ||[icntr] B     .S2    iloop              ; 4 if(icntr) branch to iloop

       [lcntr] LDW   .D1T1  *aptr++(4),aa0     ; 9 if(lcntr) load a[i] from memory
     ||[lcntr] LDW   .D2T2  *btmp++(4),bb0     ; 9 if(lcntr) load b[i] from memory
     ||[lcntr] SUB   .L2    lcntr,1,lcntr      ; if(lcntr) lcntr -= 1
     ||[icntr] SUB   .L1    icntr,1,icntr      ; if(icntr) icntr -= 1
     ||        MPYSP .M1X   aa0,bb0,mult0      ; 4 mult0 = a[i]*b[i]
     ||[icntr] B     .S2    iloop              ; 5 if(icntr) branch to iloop

    iloop:
       [lcntr] LDW   .D1T1  *aptr++(4),aa0     ; 10 if(lcntr) load a[i] from memory
     ||[lcntr] LDW   .D2T2  *btmp++(4),bb0     ; 10 if(lcntr) load b[i] from memory
     ||[lcntr] SUB   .L2    lcntr,1,lcntr      ; if(lcntr) lcntr -= 1
     ||[icntr] SUB   .S1    icntr,1,icntr      ; if(icntr) icntr -= 1
     ||        MPYSP .M1X   aa0,bb0,mult0      ; 5 mult0 = a[i]*b[i]
     ||        ADDSP .L1    mult0,sum0,sum0    ; 1 sum0 = sum0+mult0
     ||[icntr] B     .S2    iloop              ; 6 if(icntr) branch to iloop

Matrix Computation Example (cont.)

    *** add up the running sums ***
               MV    .D1    sum0,temp1         ; temp1 = sum0
               ADDSP .L1    sum0,temp1,temp2   ; temp2 = temp1 + sum0 (2nd sum0)
               MV    .D1    sum0,temp1         ; temp1 = sum0 (the 3rd sum0)
               ADDSP .L1    sum0,temp1,temp3   ; temp3 = temp1 + sum0 (4th sum0)
               NOP          2                  ; wait for temp3
       [ocntr] B     .S2    oloop              ; if(ocntr) branch to oloop
               ADDSP .L1    temp2,temp3,sum0   ; sum0 = temp2 + temp3

               MV    .D2    bptr,btmp          ; reset *b to beginning of b
     ||        SUB   .S1    colms,2,icntr      ; inner cntr = columns - 2
     ||        SUB   .S2X   colms,1,lcntr      ; load cntr = columns - 1

               LDW   .D1T1  *aptr++(4),aa0     ; 1 load a[i] from memory
     ||        LDW   .D2T2  *btmp++(4),bb0     ; 1 load b[i] from memory

               STW   .D1    sum0,*cptr++(4)    ; c[i] = sum0
     ||[ocntr] SUB   .L1    ocntr,1,ocntr      ; if(ocntr) ocntr -= 1

Nearest Neighbor Interpolation

[Figure: pixels of the original image u, interleaved with pixels to be interpolated (assigned a value of 0), and the convolution mask for the interpolation (entries 1 1)]

Nearest Neighbor Interpolation

• Interpolation by pixel replication
  - Computationally simple
  - Aliasing

    /* v is the zoomed (interpolated) version of u */
    v[m,n] = u[round(m/2), round(n/2)]

• May be implemented as a 2-D FIR filter by H
  - Alternate pixels may be skipped

    H = | 1 1 |
        | 1 1 |
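Pixel replication for a zoom factor of 2 can be sketched in C. The function name is hypothetical, and as an assumption the index is truncated (m/2, n/2 in integer arithmetic) rather than rounded, so that the last output row and column stay inside the source image.

```c
#include <stdint.h>
#include <stddef.h>

/* Zoom u (rows x cols) by 2 in each direction into v (2*rows x 2*cols). */
void zoom2_nearest(const uint8_t *u, size_t rows, size_t cols, uint8_t *v)
{
    for (size_t m = 0; m < 2 * rows; m++)
        for (size_t n = 0; n < 2 * cols; n++)
            v[m * 2 * cols + n] = u[(m / 2) * cols + (n / 2)];
}
```

Each source pixel ends up as a 2 x 2 block in the output, which is the same result as zero insertion followed by the all-ones mask H.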

Bilinear Interpolation

• Interpolate rows then columns (or vice-versa)
  - Increased complexity
  - Reduced aliasing

    /* v is the zoomed (interpolated) version of u */
    v1[m,2n]   = u[m,n]
    v1[m,2n+1] = a1*u[m,n] + a2*u[m,n+1]
    v[2m,n]    = v1[m,n]
    v[2m+1,n]  = b1*v1[m,n] + b2*v1[m+1,n]

• May be implemented as a 2-D FIR filter by H followed by a shift

        | 1 2 1 |
    H = | 2 4 2 |  >> 4
        | 1 2 1 |
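The row-then-column scheme can be sketched in C for a zoom factor of 2. The names are hypothetical, and two choices here are assumptions not stated on the slide: the weights a1 = a2 = b1 = b2 = 1/2, and replication of the last row and column at the image border.

```c
#include <stdint.h>
#include <stddef.h>

/* Row-interpolated sample v1[m, j] for a 2x horizontal zoom. */
static unsigned interp_row(const uint8_t *u, size_t cols, size_t m, size_t j)
{
    size_t n  = j / 2;
    size_t n1 = (n + 1 < cols) ? n + 1 : n;    /* replicate the last column */
    unsigned left = u[m * cols + n];
    if (j % 2 == 0)
        return left;                           /* v1[m,2n] = u[m,n] */
    return (left + u[m * cols + n1] + 1) >> 1; /* average with rounding */
}

/* Zoom u (rows x cols) by 2 in each direction into v (2*rows x 2*cols). */
void zoom2_bilinear(const uint8_t *u, size_t rows, size_t cols, uint8_t *v)
{
    for (size_t i = 0; i < 2 * rows; i++) {
        size_t m  = i / 2;
        size_t m1 = (m + 1 < rows) ? m + 1 : m;  /* replicate the last row */
        for (size_t j = 0; j < 2 * cols; j++) {
            unsigned top = interp_row(u, cols, m, j);
            unsigned out = top;                  /* v[2m,n] = v1[m,n] */
            if (i % 2 != 0)
                out = (top + interp_row(u, cols, m1, j) + 1) >> 1;
            v[i * 2 * cols + j] = (uint8_t)out;
        }
    }
}
```

Averaging two bytes with a shift keeps the arithmetic in integers, in line with the shift-based normalization on the slide.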

2-D FIR Filter

• Difference equation

    y(n1, n2) = 2 x(n1, n2) + 3 x(n1-1, n2) + x(n1, n2-1) + x(n1-1, n2-1)

• Flow graph

[Figure: coefficient array a(m1, m2) with axes m1 and m2, applied to the image x(n1, n2) with axes n1 (rows) and n2]

• Implementation: vector dot product, keeping M1 rows in memory and circularly buffering the input

2-D Filter Implementations

• Store M1 x M2 filter coefficients in sequential memory (vector) of length M = M1 M2
• For each output, form a vector from the N1 x N2 image
  1. M1 separate dot products of length M2 as bytes
  2. Form image vector by raster scanning image as bytes
  3. Form image vector by raster scanning image as words

[Figure: raster scan]
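The first option, M1 separate dot products of length M2, can be sketched in C for a single output sample. The function name and argument order are hypothetical, and samples outside the image are assumed to be zero.

```c
#include <stddef.h>

/*
 * y(n1, n2) = sum over m1, m2 of a(m1, m2) * x(n1 - m1, n2 - m2).
 * x is an image with N2 columns stored row by row; a holds M1 x M2
 * coefficients stored row by row. Samples outside the image count as zero.
 */
long fir2d_at(const unsigned char *x, size_t N2,
              const int *a, size_t M1, size_t M2, size_t n1, size_t n2)
{
    long y = 0;
    for (size_t m1 = 0; m1 < M1 && m1 <= n1; m1++) {
        const unsigned char *row = x + (n1 - m1) * N2;  /* one image row */
        const int *coef = a + m1 * M2;                  /* one coefficient row */
        for (size_t m2 = 0; m2 < M2 && m2 <= n2; m2++)
            y += (long)coef[m2] * row[n2 - m2];         /* dot product term */
    }
    return y;
}
```

With a = {2, 1, 3, 1} (M1 = M2 = 2) this evaluates the difference equation y(n1, n2) = 2 x(n1, n2) + 3 x(n1-1, n2) + x(n1, n2-1) + x(n1-1, n2-1).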

2-D FIR Implementation #1 on C6x

    ; registers: A5=&a(0,0) B5=&x(n1,n2) B7=M A9=M2 B8=N2
    fir2d1  MV    .D1  A9,A2        ; inner product length
     ||     SUB   .D2  B8,B7,B10    ; offset to next row
     ||     CMPLT .L1  B7,A9,A1     ; A1=no more rows to do
     ||     ZERO  .S1  A4           ; initialize accumulator
     ||     SUB   .S2  B7,A9,B7     ; number of taps left
    fir1    LDBU  .D1  *A5++,A6     ; load a(m1,m2), zero fill
     ||     LDBU  .D2  *B5++,B6     ; load x(n1-m1,n2-m2)
     ||     MPYU  .M1X A6,B6,A3     ; A3=a(m1,m2) x(n1-m1,n2-m2)
     ||     ADD   .L1  A3,A4        ; y(n1,n2) += A3
     ||[A2] SUB   .S1  A2,1,A2      ; decrement loop counter
     ||[A2] B     .S2  fir1         ; if A2 != 0, then branch
            MV    .D1  A9,A2        ; inner product length
     ||     CMPLT .L1  B7,A9,A1     ; A1=no more rows to do
     ||     ADD   .L2  B5,B10,B5    ; advance to next image row
     ||[!A1]B     .S1  fir1         ; outer loop
     ||     SUB   .S2  B7,A9,B7     ; count number of taps left
    ; A4=y(n1,n2)

2-D FIR Implementation #2 on C6x

    ; registers: A5=&a(0,0) B5=&x(n1,n2) A2=M B7=M2 B8=N2
    fir2d2  SUB   .D2  B8,B7,B9     ; byte offset between rows
     ||     ZERO  .L1  A4           ; initialize accumulator
     ||     SUB   .L2  B7,1,B7      ; B7 = numFilCols - 1
     ||     ZERO  .S2  B2           ; offset into image data
    fir2    LDBU  .D1  *A5++,A6     ; load a(m1,m2), zero fill
     ||     LDBU  .D2  *B6[B2],B6   ; load x(n1-m1,n2-m2)
     ||     MPYU  .M1X A6,B6,A3     ; A3=a(m1,m2) x(n1-m1,n2-m2)
     ||     ADD   .L1  A3,A4        ; y(n1,n2) += A3
     ||     CMPLT .L2  B2,B7,B1     ; need to go to next row?
     ||     ADD   .S2  B2,1,B2      ; incr offset into image
      [!B1] ADD   .L2  B2,B9,B2     ; move offset to next row
     ||[A2] SUB   .S1  A2,1,A2      ; decrement loop counter
     ||[A2] B     .S2  fir2         ; if A2 != 0, then branch
    ; A4=y(n1,n2)

2-D FIR Implementation #3 on C6x

Initialization

    ; registers: A5=&a(0,0) B5=&x(n1,n2) A2=M B7=M2 B8=N2
    fir2d3  ZERO  .D1  A4           ; initialize accumulator #1
     ||     SUB   .D2  B8,B7,B9     ; index offset between rows
     ||     ZERO  .L2  B2           ; offset into image data
     ||     MVKH  .S1  0xFF,A8      ; mask to get lowest 8 bits
     ||     SHR   .S2  B7,1,B7      ; divide by 2: 16 bit address
            ZERO  .D2  B4           ; initialize accumulator #2
     ||     ZERO  .L1  A6           ; current coefficient value
     ||     ZERO  .L2  B6           ; current image value
     ||     SHR   .S1  A2,1,A2      ; divide by 2: 16 bit address
     ||     SHR   .S2  B9,1,B9      ; divide by 2: 16 bit address

2-D FIR Implementation #3 on C6x (cont.)

Main Loop

    fir3    LDHU  .D1  *A5++,A6     ; load a(m1,m2) a(m1+1,m2+1)
     ||     LDHU  .D2  *B6[B2],B6   ; load two pixels of image x
     ||     CMPLT .L2  B2,B7,B1     ; need to go to next row?
     ||     ADD   .S2  B2,1,B2      ; incr offset into image
            AND   .L1  A6,A8,A6     ; extract a(m1,m2)
     ||     AND   .L2  B6,A8,B6     ; extract x(n1-m1,n2-m2)
     ||     EXTU  .S1  A6,0,8,A9    ; extract a(m1+1,m2+1)
     ||     EXTU  .S2  B6,0,8,B9    ; extract x(n1-m1+1,n2-m2+1)
            MPYHU .M1X A6,B6,A3     ; A3=a(m1,m2) x(n1-m1,n2-m2)
     ||     MPYHU .M2X A9,B3        ; B3=a*x offset by 1 index
     ||     ADD   .L1  A3,A4        ; y(n1,n2) += A3
     ||     ADD   .L2  B3,B4        ; y(n1+1,n2+1) += B3
     ||[!B1]ADD   .D2  B2,B9,B2     ; move offset to next row
     ||[A2] SUB   .S1  A2,1,A2      ; decrement loop counter
     ||[A2] B     .S2  fir3         ; if A2 != 0, then branch
    ; A4=y(n1,n2) and B4=y(n1+1,n2+1)

FIR Filter Implementation on the C6x

            MVK   .S1  0x0001,AMR   ; modulo block size 2^2
            MVKH  .S1  0x4000,AMR   ; modulo addr register B6
            MVK   .S2  2,A2         ; A2 = 2 (four-tap filter)
            ZERO  .L1  A4           ; initialize accumulators
            ZERO  .L2  B4
            ; initialize pointers A5, B6, and A7
    fir     LDW   .D1  *A5++,A0     ; load a(n) and a(n+1)
            LDW   .D2  *B6++,B1     ; load x(n) and x(n+1)
            MPY   .M1X A0,B1,A3     ; A3 = a(n) * x(n)
            MPYH  .M2X A0,B1,B3     ; B3 = a(n+1) * x(n+1)
            ADD   .L1  A3,A4        ; yeven(n) += A3
            ADD   .L2  B3,B4        ; yodd(n) += B3
     [A2]   SUB   .S1  A2,1,A2      ; decrement loop counter
     [A2]   B     .S2  fir          ; if A2 != 0, then branch
            ADD   .L1  A4,B4,A4     ; Y = Yodd + Yeven
            STH   .D1  A4,*A7       ; *A7 = Y

Throughput of two multiply-accumulates per cycle

Ordered Dithering on a TMS320C62x

Periodic array of thresholds (the 2 x 2 tile repeats):

    1/8  5/8
    7/8  3/8

    ; remove next two lines if thresholds in linear array
            MVK    .S1  0x0001,AMR  ; modulo block size 2^2
            MVKH   .S1  0x4000,AMR  ; modulo addr reg B6
    ; initialize A6 and B6
            .trip  100              ; minimum loop count
    dith:   LDB    .D1  *A6++,A4    ; read pixel
     ||     LDB    .D2  *B6++,B4    ; read threshold
     ||     CMPGTU .L1x A4,B4,A1    ; threshold pixel
     ||     ZERO   .S1  A5          ; 0 if <= threshold
      [A1]  MVK    .S1  255,A5      ; 255 if > threshold
     ||     STB    .D1  A5,*A6++    ; store result
     ||[B0] SUB    .L2  B0,1,B0     ; decrement counter
     ||[B0] B      .S2  dith        ; branch if not zero

Throughput of two cycles per pixel
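The same thresholding can be sketched in portable C. The names are hypothetical, and scaling the 2 x 2 tile {1/8, 5/8; 7/8, 3/8} by 256 to get 8-bit thresholds is an assumption about how the fractions map to pixel values.

```c
#include <stdint.h>
#include <stddef.h>

/* 2x2 threshold tile {1/8, 5/8; 7/8, 3/8}, scaled by 256 (assumed scaling). */
static const uint8_t kThreshold[2][2] = { { 32, 160 }, { 224, 96 } };

/* Halftone an 8-bit image in place: 255 if pixel > threshold, else 0. */
void ordered_dither(uint8_t *image, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            uint8_t *p = &image[r * cols + c];
            /* the tile repeats every 2 pixels in each direction */
            *p = (*p > kThreshold[r & 1][c & 1]) ? 255 : 0;
        }
    }
}
```

Indexing the tile with `r & 1` and `c & 1` plays the role of the circular (modulo) addressing that the assembly sets up through AMR.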

More Efficient Ordered Dithering on the C6x

Initialization

            MVK    .S1  0x00ff,A8    ; white pixel #1
     ||     MVK    .S2  0x0001,AMR   ; modulo block size 2^2
            SHL    .S1  A8,8,A9      ; white pixel #2
     ||     MVKH   .S2  0x4000,AMR   ; modulo addr reg. B6
            SHL    .S1  A8,16,A10    ; white pixel #3
     ||     SHL    .S2  A8,24,B9     ; white pixel #4
    ; initialize
    ;   A2 number of pixels divided by 4
    ;   A6 pointer to pixels (will be overwritten)
    ;   B6 pointer to thresholds
    dith2:  LDW    .D1  *A6,A4       ; read 4 pixels (bytes)
            LDW    .D2  *B6++,B4     ; read 4 thresholds
            EXTU   .S1  A4,24,A12    ; extract pixel #1
            EXTU   .S2  B4,24,B12    ; extract threshold #1
            ZERO   .L1  A5           ; store output in A5
            CMPLTU .L2  A12,B0       ; B0 = (A12 < B12)

Throughput of 1.25 cycles per pixel

More Efficient Ordered Dithering on the C6x

            EXTU   .S1  A4,16,24,A13 ; extract pixel #2
            EXTU   .S2  B4,16,24,B13 ; extract threshold #2
     [!B0]  OR     .L1  A5,A8,A5     ; output of pixel 1
            CMPLTU .L2  A13,B1       ; B1 = (A13 < B13)

            EXTU   .S1  A4,8,24,A14  ; extract pixel #3
            EXTU   .S2  B4,8,24,B14  ; extract threshold #3
     [!B1]  OR     .L1  A5,A9,A5     ; output of pixels 1-2
            CMPLTU .L2  A14,B2       ; B2 = (A14 < B14)

            EXTU   .S1  A4,0,24,A15  ; extract pixel #4
            EXTU   .S2  B4,0,24,B15  ; extract threshold #4
     [!B2]  OR     .L1  A5,B9,B5     ; output of pixels 1-3
            CMPLTU      A15,B15,A1   ; A1 = (A15 < B15)

     [!A1]  OR     .S1  B5,A11,A5    ; output of pixels 1-4
            STW    .D1  A5,*A6++     ; store results
     [A2]   SUB    .L1  A2,1,A2      ; decrement loop count
     [A2]   B      .L2  dith2        ; if A2 != 0, branch

Floyd-Steinberg Error Diffusion

• Noise-shaped feedback coder (2-D sigma delta)
• Error filter H(z)

[Figure: error diffusion feedback loop, with the quantization error passed through the error filter H(z)]

Floyd-Steinberg Error Diffusion

• C implementation of color/grayscale error diffusion
• Replacing multiplications with adds and shifts
  - 3*error = (error << 2) - error
  - 5*error = (error << 2) + error
  - 7*error = (error << 3) - error
  - Can reuse the (error << 2) calculation
• Replace division by 16 with adds and shifts
  - n >> 4 does not give the right answer for negative n
  - Add an offset of 2^4 - 1 = 15 for negative n: (n + 15) >> 4
  - Alternative is to work with |error|
• Combine nested for loops into one for loop that can be pipelined by the C6x tools
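The shift-and-add tricks can be sketched in C. The names are hypothetical; the assignment of 7/16, 3/16, 5/16, and 1/16 to the right, below-left, below, and below-right neighbors is the standard Floyd-Steinberg kernel. The first function takes the |error| route, and the second uses the offset of 15, which assumes the compiler shifts negative values arithmetically (implementation-defined in C, though common).

```c
#include <stdlib.h>

typedef struct { int right, below_left, below, below_right; } FsTerms;

/* Split a quantization error into the four diffused terms using |error|. */
FsTerms fs_split(int error)
{
    int a  = abs(error);
    int a4 = a << 2;                          /* reused for 3*a and 5*a */
    FsTerms t;
    t.right       = ((a << 3) - a) >> 4;      /* 7/16 of |error| */
    t.below_left  = (a4 - a) >> 4;            /* 3/16 of |error| */
    t.below       = (a4 + a) >> 4;            /* 5/16 of |error| */
    t.below_right = a >> 4;                   /* 1/16 of |error| */
    if (error < 0) {                          /* restore the sign */
        t.right = -t.right;
        t.below_left = -t.below_left;
        t.below = -t.below;
        t.below_right = -t.below_right;
    }
    return t;
}

/* n / 16 truncated toward zero; assumes arithmetic right shift for n < 0. */
int div16(int n)
{
    return (n < 0 ? n + 15 : n) >> 4;
}
```

Working on the magnitude avoids shifting negative values altogether, which is why the slide lists it as an alternative.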

C/C++ Coding Tips

• Local variables
  - Define only when and where needed to assist the compiler in mapping variables to registers (especially on the C6x)
  - Give initial values to avoid uninitialized read errors
  - Choose names to indicate purpose and data type
  - In C (C89), may only be defined at the start of a block
  - In C++, may be defined anywhere
  - Function arguments act as local variables (may be updated)
• Reading strings from files using fgets
  - Reads at most N-1 characters, stopping after a newline, whichever comes first
  - Does not guarantee that a newline is read
  - Null terminates the buffer on success, but the buffer must not be used when fgets returns NULL (end of file or error)
• Define as many constants as possible

C/C++ Coding Tips

Not robust

    int fileHasLine(FILE *filePtr, char *searchStr) {
        char bufStr[128], *strPtr;
        int foundFlag;
        foundFlag = 0;
        while ( ! feof(filePtr) ) {
            strPtr = fgets(bufStr, 127, filePtr);
            if (strPtr && strcmp(bufStr, searchStr) == 0) {
                foundFlag = 1;
                break;
            }
        }
        return(foundFlag);
    }

Robust

    #include <stdio.h>
    #include <string.h>

    #define BUFLEN 128
    #define FALSE 0
    #define TRUE 1

    int fileHasLine(FILE *filePtr, const char *searchStr) {
        int foundFlag = FALSE;
        while ( ! feof(filePtr) ) {
            char bufStr[BUFLEN];
            size_t bufStrLen = 0;
            char *strPtr = fgets(bufStr, BUFLEN-1, filePtr);
            if (strPtr == NULL) break;   /* end of file or error: bufStr is not valid */
            bufStr[BUFLEN-1] = '\0';
            bufStrLen = strlen(bufStr);
            /* strip the trailing newline, if one was read */
            if ( bufStrLen > 0 && bufStr[bufStrLen-1] == '\n' )
                bufStr[bufStrLen - 1] = '\0';
            if (strcmp(bufStr, searchStr) == 0) {
                foundFlag = TRUE;
                break;
            }
        }
        return(foundFlag);
    }

C/C++ Coding Tips

• Allocating dynamic memory
  - malloc allocates but does not initialize values: use calloc (allocate/initialize) or memset (initialize)
  - In C++, the new operator allocates memory (typically through malloc) and then calls the constructor for each created object
  - On failure, malloc returns 0; new returns 0 only in its pre-standard or nothrow form (standard new throws std::bad_alloc); when new fails, the new handler is called if set (by set_new_handler)
• Deallocating dynamic memory
  - free ignores a null pointer in ANSI C, but some pre-ANSI libraries crash on it
  - In C++, the delete operator first calls the destructor of the object(s) and then frees the memory: delete ignores null pointers
  - Use delete [] arrayPtr to deallocate an array
  - Zero a pointer after deallocating it to prevent redeletion
  - Deallocate a pointer before reassigning it

C/C++ Coding Tips

Not robust

    Filter::Filter() { buf = 0; }
    void Filter::AllocateBuffer(int n) { buf = new int [n]; }
    void Filter::DeallocateBuffer() { if (buf) delete buf; }
    Filter::~Filter() { DeallocateBuffer(); }

Robust (keep constructor and destructor)

    #include <iostream>
    #include <cstdlib>
    #include <cstring>

    void Filter::AllocateBuffer(int n) {
        DeallocateBuffer();               // release any previous buffer first
        buf = new int [n];
        if (buf == 0) {                   // reached only if new returns 0 on failure
            std::cerr << "allocation failed";
            exit(1);
        }
        memset(buf, 0, n * sizeof(int));  // initialize the new buffer
    }
    void Filter::DeallocateBuffer() {
        delete [] buf;                    // array form; safe on a null pointer
        buf = 0;                          // prevent redeletion
    }

C/C++ Coding Tips

• Static string length

    #define KEYSTR "MarketShare"
    #define STATIC_STRLEN(s) (sizeof(s) - 1)

    if (strncmp(strBuf, KEYSTR, STATIC_STRLEN(KEYSTR)) == 0)

• Dynamic string length

    #define IS_STRING_NULL(s) (! *(s))
    #define IS_STRING_NOT_NULL(s) (*(s))

    char* robustGetString(char* bufStr, int bufLen, FILE* filePtr) {
        char *strPtr = fgets(bufStr, bufLen-1, filePtr);
        size_t bufStrLen;
        if (strPtr == NULL) bufStr[0] = '\0';   /* nothing read: return an empty string */
        bufStr[bufLen-1] = '\0';
        bufStrLen = strlen(bufStr);
        /* strip the trailing newline, if one was read */
        if ( bufStrLen > 0 && bufStr[bufStrLen-1] == '\n' )
            bufStr[bufStrLen - 1] = '\0';
        return(strPtr);
    }

Conclusion

• Printer pipeline
  - RGB to YCrCb conversion
  - JPEG compression and decompression
  - Document segmentation and enhancement
  - YCrCb to RGB to CMYK conversion
  - Interpolation (e.g. nearest neighbor or bilinear)
  - Halftoning (e.g. ordered dither or error diffusion)
• Split embedded software systems
  - C++ for non-real-time tasks: GUIs and file input/output
  - C for low-level image processing operations
  - ANSI C can be cross-compiled onto DSPs
  - Program C code to work with blocks or rows because embedded processors have little on-chip memory

Conclusion

• Web resources
  - comp.dsp newsgroup FAQ: www.bdti.com/faq/dsp_faq.html
  - Embedded processors and systems: www.eg3.com
  - On-line courses and DSP boards: www.techonline.com
  - Software development: www.ece.utexas.edu/~bevans/talks/software_development
  - TI color laser printer xStream technology: www.ti.com/sc/docs/dsps/xstream/index.htm
• References
  - B. L. Evans, "Software Development in the Unix Environment." http://www.ece.utexas.edu/~bevans/talks/software_development/
  - B. L. Evans, "EE 379K-17 Real-Time DSP Laboratory," UT Austin. http://www.ece.utexas.edu/~bevans/courses/realtime/
  - B. L. Evans, "EE 382C Embedded Software Systems," UT Austin. http://www.ece.utexas.edu/~bevans/courses/ee382c/