INTRODUCTION TO THE TMS320C6000 VLIW DSP

Prof. Brian L. Evans
in collaboration with Niranjan Damera-Venkata and Magesh Valliappan
Embedded Signal Processing Laboratory
The University of Texas at Austin, TX 78712-1084
http://signal.ece.utexas.edu/

[Figure labels: accumulator architecture, memory-register architecture, load-store architecture]

Outline
- C6000 instruction set architecture review
- Vector dot product example
- Pipelining
- Finite impulse response filtering
- Vector dot product example
- Conclusion

TI TMS320C6000 DSP Architecture (Review)

[Figure: simplified architecture. The CPU contains functional units .D1, .M1, .L1, .S1 with registers A0-A15, functional units .D2, .M2, .L2, .S2 with registers B0-B15, and control registers. Internal address and data buses connect the CPU to program RAM or cache, data RAM, external memory (sync and async), DMA, serial port, host port, boot load, timers, and power down.]

- C6200: fixed point
- C6400: fixed point
- C6700: floating point

TI TMS320C6000 DSP Architecture (Review)
- Addresses 8/16/32-bit data, plus 64-bit data on C67x
- Load-store RISC architecture with 2 data paths
  - 16 32-bit registers per data path (A0-A15 and B0-B15)
  - 48 instructions (C6200) and 79 instructions (C6700)
- Two parallel data paths with 32-bit RISC units
  - Data unit: 32-bit address calculations (modulo, linear)
  - Multiplier unit: 16 bit x 16 bit with 32-bit result
  - Logical unit: 40-bit (saturation) arithmetic and compares
  - Shifter unit: 32-bit integer ALU and 40-bit shifter
  - Conditionally executed based on registers A1-2 and B0-2
  - Work with two 16-bit halfwords packed into 32 bits

TI TMS320C6000 DSP Architecture (Review)
- .M multiplication unit
  - 16 bit x 16 bit signed/unsigned packed/unpacked
- .L arithmetic logic unit
  - Comparisons and logic operations (and, or, and xor)
  - Saturation arithmetic and absolute value calculation
- .S shifter unit
  - Bit manipulation (set, get, shift, rotate) and branching
  - Addition and packed addition
- .D data unit
  - Load/store to memory
  - Addition and pointer arithmetic

C6000 Restrictions on Register Accesses
- Each function unit has read/write ports
  - Data path 1 (2) units read/write A (B) registers
  - Data path 2 (1) can read one A (B) register per cycle
- Two simultaneous memory accesses cannot use registers of the same register file as address pointers
- Limit of four 32-bit reads per register per cycle
- 40-bit longs stored in adjacent even/odd registers
  - Extended precision accumulation of 32-bit numbers
  - Only one 40-bit result can be written per cycle
  - 40-bit read cannot occur in same cycle as 40-bit write
  - 4:1 performance penalty using 40-bit mode

Other C6000 Disadvantages
- No ALU acceleration for bit stream manipulation
  - 50% of computation in MPEG-2 decoder spent on variable length decoding on C6200 in C
  - C6400 direct memory access controllers shred bit streams (for video conferencing and wireless basestations)
- Branch in pipeline disables interrupts: avoid branches by using conditional execution
- No hardware protection against pipeline hazards: programmer and tools must guard against them
- Must emulate many conventional DSP features
  - No hardware looping: use register/conditional branch
  - No bit-reversed addressing: use fast algorithm by Elster
  - No status register: only saturation bit given by .L units
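To make the missing bit-reversed addressing mode concrete, the sketch below computes a bit-reversed index order in software, as an FFT would need. It uses a plain bit-by-bit loop, not the fast algorithm by Elster that the slide recommends; the function names are hypothetical and the table length is assumed to be a power of two.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index (naive loop, one bit per pass)."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)  # shift in the current lowest bit
        index >>= 1
    return result


def bit_reversed_order(n):
    """Return the bit-reversed permutation of 0..n-1; n is assumed to be a power of two."""
    num_bits = n.bit_length() - 1
    return [bit_reverse(i, num_bits) for i in range(n)]


order = bit_reversed_order(8)
print(order)
```

On a DSP with hardware bit-reversed addressing this permutation costs nothing extra per access; on the C6000 it has to be computed or looked up.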

FIR Filter
- Difference equation (vector dot product)
    y(n) = 2 x(n) + 3 x(n-1) + 4 x(n-2) + 5 x(n-3)
- Signal flow graph
    [Figure: tapped delay line. x(n) passes through three z^-1 delays; the taps are weighted by 2, 3, 4, and 5 and summed to give y(n).]
- Dot product of input vector and coefficient vector
- Store input in circular buffer, coefficients in array

FIR Filter
- Each tap requires
  - Fetching data sample
  - Fetching coefficient
  - Fetching operand
  - Multiplying two numbers
  - Accumulating multiplication result
  - Shifting one sample in the delay line
- Computing an FIR tap in one instruction cycle
  - Two data memory and one program memory accesses
  - Auto-increment or auto-decrement addressing modes
  - Modulo addressing to implement delay line as circular buffer
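The circular-buffer idea above can be sketched in a few lines of Python. This is only an illustration of modulo addressing for the four-tap filter y(n) = 2 x(n) + 3 x(n-1) + 4 x(n-2) + 5 x(n-3); the class name CircularFir and its layout are assumptions made for the example, not code from the slides.

```python
class CircularFir:
    """FIR filter whose delay line is a circular buffer (hypothetical helper)."""

    def __init__(self, coefficients):
        self.coefficients = list(coefficients)
        self.buffer = [0] * len(self.coefficients)  # delay line, initially zero
        self.newest = 0                             # index of the most recent sample

    def step(self, sample):
        n_taps = len(self.coefficients)
        # Modulo addressing: advance the write index and wrap around,
        # overwriting the oldest sample instead of shifting the whole line.
        self.newest = (self.newest + 1) % n_taps
        self.buffer[self.newest] = sample
        acc = 0
        for k, coeff in enumerate(self.coefficients):
            # x(n-k) sits k positions behind the newest sample, modulo the length.
            acc += coeff * self.buffer[(self.newest - k) % n_taps]
        return acc


fir = CircularFir([2, 3, 4, 5])
impulse_response = [fir.step(x) for x in [1, 0, 0, 0, 0]]
print(impulse_response)
```

Feeding a unit impulse returns the coefficients in order, which is a quick check that the index arithmetic is right. No samples are ever moved; only the index wraps.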

Example: Vector Dot Product (Unoptimized)
- A vector dot product is common in filtering
- Store a(n) and x(n) into an array of N elements
- C6000 peaks at 8 RISC instructions/cycle
  - For 300-MHz C6000, RISC instructions per sample: 300,000 for speech; 54,421 for audio CD; and 290 for luminance NTSC video
  - Generally requires hand coding for peak performance
- First dot product example will not be optimized
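The per-sample instruction budgets quoted above follow from dividing the peak instruction rate by the sampling rate. The sketch below reproduces the speech and audio CD figures; the sampling rates of 8 kHz for speech and 44.1 kHz for audio CD are assumptions (the slide does not state them), and the function name is hypothetical.

```python
def instructions_per_sample(clock_hz, instructions_per_cycle, sample_rate_hz):
    """Peak RISC instructions available per input sample (integer part)."""
    return (clock_hz * instructions_per_cycle) // sample_rate_hz


CLOCK_HZ = 300_000_000  # 300-MHz C6000, from the slide
PEAK_IPC = 8            # peak of 8 RISC instructions per cycle, from the slide

# Assumed sampling rates: 8 kHz speech, 44.1 kHz audio CD.
speech_budget = instructions_per_sample(CLOCK_HZ, PEAK_IPC, 8_000)
audio_cd_budget = instructions_per_sample(CLOCK_HZ, PEAK_IPC, 44_100)
print(speech_budget, audio_cd_budget)
```

These are peak numbers; sustained throughput depends on how well the code fills all eight functional units.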

Example: Vector Dot Product (Unoptimized)
- Prologue
  - Initialize pointers: A5 for a(n), A6 for x(n), and A7 for Y
  - Move the number of times to loop (N) into A2
  - Set accumulator (A4) to zero
- Inner loop
  - Put a(n) into A0 and x(n) into A1
  - Multiply a(n) and x(n)
  - Accumulate multiplication result into A4
  - Decrement loop counter (A2)
  - Continue inner loop if counter is not zero
- Epilogue
  - Store the result into Y

Example: Vector Dot Product (Unoptimized)

Coefficients a(n), data x(n), using A data path only

    ; clear A4 and initialize pointers A5, A6, and A7
            MVK  .S1  40, A2        ; A2 = 40 (loop counter)
    loop    LDH  .D1  *A5++, A0     ; A0 = a(n)
            LDH  .D1  *A6++, A1     ; A1 = x(n)
            MPY  .M1  A0, A1, A3    ; A3 = a(n) * x(n)
            ADD  .L1  A3, A4, A4    ; Y = Y + A3
            SUB  .L1  A2, 1, A2     ; decrement loop counter
     [A2]   B    .S1  loop          ; if A2 != 0, then branch
            STH  .D1  A4, *A7       ; *A7 = Y
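For readers who do not know C6000 assembly, the loop above can be mirrored step by step in Python. This is a behavioral sketch only: the dictionary standing in for registers and the function name are assumptions, pipeline delays are ignored, and the final halfword store is modeled as a plain return value without 16-bit truncation.

```python
def dot_product_unoptimized(a, x):
    """Mirror the unoptimized loop; a and x must be non-empty and equally long."""
    regs = {"A2": len(a), "A4": 0}  # MVK: loop counter; A4 cleared in the prologue
    a_ptr = 0                       # plays the role of pointer A5
    x_ptr = 0                       # plays the role of pointer A6
    while True:
        regs["A0"] = a[a_ptr]       # LDH *A5++, A0
        a_ptr += 1
        regs["A1"] = x[x_ptr]       # LDH *A6++, A1
        x_ptr += 1
        regs["A3"] = regs["A0"] * regs["A1"]  # MPY A0, A1, A3
        regs["A4"] += regs["A3"]    # ADD: accumulate into A4
        regs["A2"] -= 1             # SUB A2, 1, A2
        if regs["A2"] == 0:         # [A2] B loop: branch only while A2 != 0
            break
    return regs["A4"]               # STH A4, *A7


result = dot_product_unoptimized([1, 2, 3], [4, 5, 6])
print(result)
```

The loop body always runs at least once, as in the assembly, because the counter is tested after the decrement.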

Example: Vector Dot Product (Unoptimized)
- MoVe Konstant
  - MVK .S 40, A2 ; A2 = 40
  - Lower 16 bits of A2 are loaded
- Conditional branch
  - [condition] B .S loop
  - [A2] means to execute the instruction if A2 != 0
  - Only A1, A2, B0, B1, and B2 can be used
- Loading registers
  - LDH .D *A5, A0 ; Loads half-word into A0 from memory
- Registers may be used as pointers (*A1++)
- Implementation not efficient due to pipeline effects

Pipelining
- CPU operations
  - Fetch instruction from (on-chip) program memory
  - Decode instruction
  - Execute instruction including reading data values
- Overlap operations to increase performance
  - Pipeline CPU operations to increase clock speed over a sequential implementation
  - Separate parallel functional units
  - Peripheral interfaces for I/O do not burden CPU

Pipelining

[Figure: instruction timing diagrams built from Fetch, Decode, Read, and Execute stages for four cases]
- Sequential (Motorola 56000)
- Pipelined (most conventional DSP processors)
- Superscalar (Pentium, MIPS)
- Superpipelined (TMS320C6000)

Managing pipelines
- Compiler or programmer (TMS320C6000)
- Pipeline interlocking in processor (TMS320C30)
- Hardware instruction scheduling
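The benefit of overlapping stages can be estimated with a simple cycle count. The sketch below compares a sequential machine with an ideal pipeline, assuming every stage takes exactly one clock cycle and there are no stalls or hazards; both function names are hypothetical.

```python
def sequential_cycles(n_instructions, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: fill it once, then one instruction completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


# Four stages as in the figure: fetch, decode, read, execute.
seq = sequential_cycles(100, 4)
pipe = pipelined_cycles(100, 4)
print(seq, pipe)
```

For long instruction streams the ideal pipeline approaches one instruction per cycle, which is the "one instruction cycle every clock cycle" behavior the C6000 aims for when hazards are avoided.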

TMS320C6000 Pipeline
- One instruction cycle every clock cycle
- Deep pipeline
  - 7-11 stages in C62x: fetch 4, decode 2, execute 1-5
  - 7-16 stages in C67x: fetch 4, decode 2, execute 1-10
  - If a branch is in the pipeline, interrupts are disabled
  - Avoid branches by using conditional execution
- No hardware protection against pipeline hazards
  - Compiler and assembler must prevent pipeline hazards
- Dispatches instructions in packets

Program Fetch (F)
- Program fetching consists of 4 phases
  - Generate fetch address (FG)
  - Send address to memory (FS)
  - Wait for data ready (FW)
  - Read opcode (FR)
- Fetch packet consists of 8 32-bit instructions

[Figure: fetch phases FG, FS, FW, and FR between the C6000 and memory]

Decode Stage (D)
- Decode stage consists of two phases
  - Dispatch instruction to functional unit (DP)
  - Instruction decoded at functional unit (DC)

[Figure: fetch phases FG, FS, FW, FR followed by decode phases DP and DC, between the C6000 and memory]

Execute Stage (E)

Vector Dot Product with Pipeline Effects

    ; clear A4 and initialize pointers A5, A6, and A7
            MVK  .S1  40, A2        ; A2 = 40 (loop counter)
    loop    LDH  .D1  *A5++, A0     ; A0 = a(n)
            LDH  .D1  *A6++, A1     ; A1 = x(n)
            MPY  .M1  A0, A1, A3    ; A3 = a(n) * x(n)
            ADD  .L1  A3, A4, A4    ; Y = Y + A3
            SUB  .L1  A2, 1, A2     ; decrement loop counter
     [A2]   B    .S1  loop          ; if A2 != 0, then branch
            STH  .D1  A4, *A7       ; *A7 = Y

- Multiplication has a delay of 1 cycle
- Load has a delay of four cycles

Fetch packet
[Pipeline diagram at time t = 4 clock cycles. Stage columns: F, DP, DC, E1-E6. The fetch packet MVK, LDH, MPY, ADD, SUB, B, STH is in fetch (F1-4).]

Dispatch
[Pipeline diagram at time t = 5 clock cycles. Same stage columns. MVK, LDH, MPY, ADD, SUB, B, STH are at dispatch; the next fetch packet, F(2-5), is in fetch.]

Decode
[Pipeline diagram at time t = 6 clock cycles. Same stage columns. MVK has advanced ahead of LDH, MPY, ADD, SUB, B, STH; F(2-5) is in fetch.]

Execute (E1)
[Pipeline diagram at time t = 7 clock cycles. Same stage columns. MVK and the first LDH have advanced ahead of LDH, MPY, ADD, SUB, B, STH; F(2-5) is in fetch.]

Execute (MVK done, LDH in E1)
[Pipeline diagram at time t = 8 clock cycles. Same stage columns. MVK is done and LDH is in E1; MPY, ADD, SUB, B, STH follow; F(2-5) is in fetch.]

Vector Dot Product with Pipeline Effects

    ; clear A4 and initialize pointers A5, A6, and A7
            MVK  .S1  40, A2        ; A2 = 40 (loop counter)
    loop    LDH  .D1  *A5++, A0     ; A0 = a(n)
            LDH  .D1  *A6++, A1     ; A1 = x(n)
            NOP  4
            MPY  .M1  A0, A1, A3    ; A3 = a(n) * x(n)
            NOP
            ADD  .L1  A3, A4, A4    ; Y = Y + A3
            SUB  .L1  A2, 1, A2     ; decrement loop counter
     [A2]   B    .S1  loop          ; if A2 != 0, then branch
            NOP  5
            STH  .D1  A4, *A7       ; *A7 = Y

- Assembler will automatically insert NOP instructions
- Assembler can also make sequential code parallel
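The cost of the inserted NOPs can be tallied with a short script. The sketch below assumes each instruction in the NOP-padded listing issues alone in its own cycle, that NOP n occupies n cycles, and that there are no memory stalls; the schedule table is transcribed from the listing and the variable names are hypothetical.

```python
# (mnemonic, cycles occupied) for one pass through the loop body.
LOOP_SCHEDULE = [
    ("LDH", 1),  # load a(n)
    ("LDH", 1),  # load x(n)
    ("NOP", 4),  # wait for the load delay
    ("MPY", 1),
    ("NOP", 1),  # wait for the multiply delay
    ("ADD", 1),
    ("SUB", 1),
    ("B", 1),
    ("NOP", 5),  # fill the branch delay slots
]

cycles_per_iteration = sum(cycles for _, cycles in LOOP_SCHEDULE)
useful_cycles = sum(cycles for name, cycles in LOOP_SCHEDULE if name != "NOP")

iterations = 40  # loop counter loaded by MVK
total_cycles = 1 + iterations * cycles_per_iteration + 1  # MVK + loop + STH
print(cycles_per_iteration, useful_cycles, total_cycles)
```

Under these assumptions most of each iteration is spent in NOPs, which is the motivation for retiming the loop and using both data paths in the optimized version.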

Optimized Vector Dot Product on the C6000
- Prologue
  - Retime dot product to compute two terms per cycle
  - Initialize pointers: A5 for a(n), B6 for x(n), A7 for y(n)
  - Move number of times to loop (N) divided by 2 into A2
- Inner loop
  - Put a(n) and a(n+1) in A0 and x(n) and x(n+1) in B1 (packed data)
  - Multiply a(n) x(n) and a(n+1) x(n+1)
  - Accumulate even (odd) indexed terms in A4 (B4)
  - Decrement loop counter (A2)
- Store result

FIR Filter Implementation on the C6000

            MVK  .S1  0x0001, AMR   ; modulo block size 2^2
            MVKH .S1  0x4000, AMR   ; modulo addr register B6
            MVK  .S2  2, A2         ; A2 = 2 (four-tap filter)
            ZERO .L1  A4            ; initialize accumulators
            ZERO .L2  B4
    ; initialize pointers A5, B6, and A7
    fir     LDW  .D1  *A5++, A0     ; load a(n) and a(n+1)
            LDW  .D2  *B6++, B1     ; load x(n) and x(n+1)
            MPY  .M1X A0, B1, A3    ; A3 = a(n) * x(n)
            MPYH .M2X A0, B1, B3    ; B3 = a(n+1) * x(n+1)
            ADD  .L1  A3, A4, A4    ; yeven(n) += A3
            ADD  .L2  B3, B4, B4    ; yodd(n) += B3
     [A2]   SUB  .S1  A2, 1, A2     ; decrement loop counter
     [A2]   B    .S2  fir           ; if A2 != 0, then branch
            ADD  .L1  A4, B4, A4    ; Y = Yodd + Yeven
            STH  .D1  A4, *A7       ; *A7 = Y

Throughput of two multiply-accumulates per instruction cycle
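The packed-data trick in the listing above can be illustrated in Python: one 32-bit word load brings in two 16-bit values, MPY multiplies the low halfwords, and MPYH multiplies the high halfwords. This is a behavioral sketch under stated assumptions: the helper names are hypothetical, the earlier sample is assumed to sit in the low halfword, the input length is assumed to be even, and circular addressing and pipeline timing are left out.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pack_halfwords(low, high):
    """Pack two 16-bit values into one 32-bit word, as a word load would see them."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def mpy(word1, word2):
    """Model of MPY: signed product of the low halfwords."""
    return to_signed16(word1) * to_signed16(word2)


def mpyh(word1, word2):
    """Model of MPYH: signed product of the high halfwords."""
    return to_signed16(word1 >> 16) * to_signed16(word2 >> 16)


def dot_product_packed(a, x):
    """Two multiply-accumulates per iteration; len(a) is assumed to be even."""
    even_acc = 0  # plays the role of A4
    odd_acc = 0   # plays the role of B4
    for n in range(0, len(a), 2):
        a_word = pack_halfwords(a[n], a[n + 1])  # LDW: a(n) and a(n+1)
        x_word = pack_halfwords(x[n], x[n + 1])  # LDW: x(n) and x(n+1)
        even_acc += mpy(a_word, x_word)          # a(n) * x(n)
        odd_acc += mpyh(a_word, x_word)          # a(n+1) * x(n+1)
    return even_acc + odd_acc                    # Y = Yodd + Yeven


y = dot_product_packed([2, 3, 4, 5], [1, 2, 3, 4])
print(y)
```

Keeping separate even and odd accumulators lets the two data paths work independently inside the loop, with a single addition at the end to combine them.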

Selected TMS320C6000 Fixed-Point DSPs

C6416 has Viterbi and Turbo decoder coprocessors. Unit price is for 1,000 units. Prices effective June 3, 2005.
For more information: http://www.ti.com

Conclusion
- Conventional digital signal processors
  - High performance vs. power consumption/cost/volume
  - Excel at one-dimensional processing
  - Have instructions tailored to specific applications
- TMS320C6000 VLIW DSP
  - High performance vs. cost/volume
  - Excel at multidimensional signal processing
  - Maximum of 8 RISC instructions per cycle

Conclusion
- Web resources
  - comp.dsp news group: FAQ www.bdti.com/faq/dsp_faq.html
  - embedded processors and systems: www.eg3.com
  - on-line courses and DSP boards: www.techonline.com
- References
  - R. Bhargava, R. Radhakrishnan, B. L. Evans, and L. K. John, "Evaluating MMX Technology Using DSP and Multimedia Applications," Proc. IEEE Sym. Microarchitecture, pp. 37-46, 1998. http://www.ece.utexas.edu/~ravib/mmxdsp/
  - B. L. Evans, "EE 345S Real-Time DSP Laboratory," UT Austin. http://www.ece.utexas.edu/~bevans/courses/realtime/
  - B. L. Evans, "EE 382C Embedded Software Systems," UT Austin. http://www.ece.utexas.edu/~bevans/courses/ee382c/

Supplemental Slides

FIR Filter on a TMS320C5000

[Figure labels: Coefficients, Data]

    COEFFP  .set  02000h        ; Program mem address
    X       .set  037Fh         ; Newest data sample
    LASTAP  .set  037FH         ; Oldest data sample
            …
            LAR   AR3, #LASTAP  ; Point to oldest sample
            RPT   #127          ; Repeat next inst. 126 times
            MACD  COEFFP, *     ; Compute one tap of FIR
            APAC
            SACH  Y, 1          ; Store result -- note shift

Supplemental Slides

TMS320C6200 vs. StarCore S140

* Does not count equivalent RISC operations for modulo addressing
** On the C6200, there is a performance penalty for 40-bit accumulation