INTRODUCTION TO THE TMS320C6000 VLIW DSP


Prof. Brian L. Evans
in collaboration with Niranjan Damera-Venkata and Magesh Valliappan
Embedded Signal Processing Laboratory
The University of Texas at Austin, TX 78712-1084
http://signal.ece.utexas.edu/

[Figure labels: accumulator architecture, memory-register architecture, load-store architecture]

Outline
• C6000 instruction set architecture review
• Vector dot product example
• Pipelining
• Finite impulse response filtering
• Vector dot product example
• Conclusion

TI TMS320C6000 DSP Architecture (Review)
[Figure: simplified architecture. The CPU holds functional units .D1, .M1, .L1, .S1 with registers A0-A15, functional units .D2, .M2, .L2, .S2 with registers B0-B15, and control registers. Internal address and data buses connect the CPU to program RAM or cache, data RAM, external memory (sync and async), DMA, serial port, host port, boot load, timers, and power down.]
• C6200: fixed point
• C6400: fixed point
• C6700: floating point

TI TMS320C6000 DSP Architecture (Review)
• Address 8/16/32-bit data, plus 64-bit data on C67x
• Load-store RISC architecture with 2 data paths
  - Sixteen 32-bit registers per data path (A0-A15 and B0-B15)
  - 48 instructions (C6200) and 79 instructions (C6700)
• Two parallel data paths with 32-bit RISC units
  - Data unit: 32-bit address calculations (modulo, linear)
  - Multiplier unit: 16 bit x 16 bit with 32-bit result
  - Logical unit: 40-bit (saturation) arithmetic and compares
  - Shifter unit: 32-bit integer ALU and 40-bit shifter
  - Conditionally executed based on registers A1-A2 and B0-B2
  - Work with two 16-bit halfwords packed into 32 bits

TI TMS320C6000 DSP Architecture (Review)
• .M multiplication unit
  - 16 bit x 16 bit signed/unsigned packed/unpacked
• .L arithmetic logic unit
  - Comparisons and logic operations (and, or, and xor)
  - Saturation arithmetic and absolute value calculation
• .S shifter unit
  - Bit manipulation (set, get, shift, rotate) and branching
  - Addition and packed addition
• .D data unit
  - Load/store to memory
  - Addition and pointer arithmetic

C6000 Restrictions on Register Accesses
• Each function unit has read/write ports
  - Data path 1 (2) units read/write A (B) registers
  - Data path 2 (1) can read one A (B) register per cycle
• Two simultaneous memory accesses cannot use registers of the same register file as address pointers
• Limit of four 32-bit reads per register per cycle
• 40-bit longs stored in adjacent even/odd registers
  - Extended precision accumulation of 32-bit numbers
  - Only one 40-bit result can be written per cycle
  - 40-bit read cannot occur in same cycle as 40-bit write
  - 4:1 performance penalty using 40-bit mode
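Why the 40-bit longs matter for accumulation can be seen in a small Python model. This is only a sketch: the helper name wrap_signed and the test values are assumptions made for illustration, and a Python integer stands in for the even/odd register pair.

```python
def wrap_signed(value, bits):
    """Wrap an integer to a signed two's-complement field of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    # Reinterpret the top bit of the field as the sign bit.
    return value - (1 << bits) if value >> (bits - 1) else value


# Accumulate the largest positive 32-bit value four times.
terms = [2**31 - 1] * 4

acc32 = 0
acc40 = 0
for term in terms:
    acc32 = wrap_signed(acc32 + term, 32)  # 32-bit accumulator wraps around
    acc40 = wrap_signed(acc40 + term, 40)  # 8 guard bits absorb the growth

exact = sum(terms)
```

In this model the 40-bit accumulator matches the exact sum while the 32-bit one has wrapped. The 8 extra bits are enough for up to 256 full-scale 32-bit terms.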

Other C6000 Disadvantages
• No ALU acceleration for bit stream manipulation
  - 50% of computation in MPEG-2 decoder spent on variable length decoding on C6200 in C
  - C6400 direct memory access controllers shred bit streams (for video conferencing and wireless basestations)
• Branch in pipeline disables interrupts: avoid branches by using conditional execution
• No hardware protection against pipeline hazards: programmer and tools must guard against them
• Must emulate many conventional DSP features
  - No hardware looping: use register/conditional branch
  - No bit-reversed addressing: use fast algorithm by Elster
  - No status register: only saturation bit given by .L units
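Emulating bit-reversed addressing in software means computing the index permutation explicitly. The sketch below builds the table with a simple doubling recurrence. It is an illustration only and is not claimed to be the exact algorithm by Elster named above; the function name is hypothetical.

```python
def bit_reversed_indices(num_bits):
    """Return the bit-reversed permutation of range(2**num_bits)."""
    table = [0]
    step = (1 << num_bits) >> 1
    # Each pass doubles the table: the new half is the old half plus `step`.
    while step:
        table += [index + step for index in table]
        step >>= 1
    return table
```

For an 8-point FFT, bit_reversed_indices(3) gives the order in which the samples would be read or written.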

FIR Filter
• Difference equation (vector dot product)
    y(n) = 2 x(n) + 3 x(n-1) + 4 x(n-2) + 5 x(n-3)
• Signal flow graph
  [Figure: tapped delay line. x(n) passes through three z^-1 delays; the four taps are scaled by 2, 3, 4, and 5 and summed to give y(n).]
• Dot product of input vector and coefficient vector
• Store input in circular buffer, coefficients in array
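The difference equation can be checked with a short Python sketch. The function names and the zero initial conditions are assumptions for illustration; for clarity the delay line is shifted rather than kept in a circular buffer.

```python
COEFFS = [2, 3, 4, 5]  # taps from the difference equation above


def fir_output(coeffs, delay_line):
    """Dot product of the coefficient vector and the delay-line contents.

    delay_line[k] holds x(n - k), newest sample first.
    """
    return sum(c * x for c, x in zip(coeffs, delay_line))


def fir_filter(coeffs, samples):
    """Filter a sample sequence, assuming zero initial conditions."""
    delay_line = [0] * len(coeffs)
    outputs = []
    for sample in samples:
        # Shift the tapped delay line and insert the newest sample.
        delay_line = [sample] + delay_line[:-1]
        outputs.append(fir_output(coeffs, delay_line))
    return outputs
```

Feeding a unit impulse through fir_filter returns the coefficients themselves, which is a quick sanity check for any FIR implementation.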

FIR Filter
• Each tap requires
  - Fetching data sample
  - Fetching coefficient
  - Fetching operand
  - Multiplying two numbers
  - Accumulating multiplication result
  - Shifting one sample in the delay line
• Computing an FIR tap in one instruction cycle
  - Two data memory and one program memory accesses
  - Auto-increment or auto-decrement addressing modes
  - Modulo addressing to implement delay line as circular buffer
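Modulo addressing replaces the per-sample shift of the delay line with a pointer that wraps around. A minimal Python sketch follows, assuming a power-of-two buffer length so the wraparound is a bit mask; the class and function names are hypothetical.

```python
class CircularDelayLine:
    """Delay line of power-of-two length addressed modulo its size."""

    def __init__(self, length):
        if length <= 0 or length & (length - 1):
            raise ValueError("length must be a power of two")
        self.mask = length - 1
        self.buffer = [0] * length
        self.newest = 0  # index of the most recent sample

    def push(self, sample):
        # Advance the pointer with wraparound and overwrite the oldest sample.
        self.newest = (self.newest + 1) & self.mask
        self.buffer[self.newest] = sample

    def tap(self, k):
        """Return x(n - k), the sample written k pushes ago."""
        return self.buffer[(self.newest - k) & self.mask]


def fir_step(coeffs, line, sample):
    """Insert one sample and compute one output as a dot product."""
    line.push(sample)
    return sum(c * line.tap(k) for k, c in enumerate(coeffs))
```

No sample is ever moved; only the pointer changes, which is what lets one tap complete per instruction cycle on hardware with modulo addressing.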

Example: Vector Dot Product (Unoptimized)
• A vector dot product is common in filtering
• Store a(n) and x(n) into an array of N elements
• C6000 peaks at 8 RISC instructions/cycle
  - For 300-MHz C6000, RISC instructions per sample: 300,000 for speech; 54,421 for audio CD; and 290 for luminance NTSC video
  - Generally requires hand coding for peak performance
• First dot product example will not be optimized
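The instruction budgets follow from dividing the peak instruction rate by the sampling rate. The sketch below assumes 8 kHz for speech and 44.1 kHz for audio CD. The slide does not state the video sampling rate, so the code only back-computes the rate implied by the quoted figure of 290.

```python
CLOCK_HZ = 300e6          # 300-MHz C6000, from the slide
PEAK_INSTR_PER_CYCLE = 8  # peak RISC instructions per cycle, from the slide


def instructions_per_sample(sample_rate_hz):
    """Peak RISC instructions available between consecutive samples."""
    return CLOCK_HZ * PEAK_INSTR_PER_CYCLE / sample_rate_hz


speech_budget = instructions_per_sample(8_000)   # assumed 8 kHz speech
audio_budget = instructions_per_sample(44_100)   # assumed 44.1 kHz CD audio

# Sampling rate implied by the quoted video budget of 290 instructions.
implied_video_rate_hz = CLOCK_HZ * PEAK_INSTR_PER_CYCLE / 290
```

The speech and audio results reproduce the 300,000 and 54,421 figures once the audio value is truncated to an integer. The implied video rate comes out near 8.3 million samples per second.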

Example: Vector Dot Product (Unoptimized)
• Prologue
  - Initialize pointers: A5 for a(n), A6 for x(n), and A7 for Y
  - Move the number of times to loop (N) into A2
  - Set accumulator (A4) to zero
• Inner loop
  - Put a(n) into A0 and x(n) into A1
  - Multiply a(n) and x(n)
  - Accumulate multiplication result into A4
  - Decrement loop counter (A2)
  - Continue inner loop if counter is not zero
• Epilogue
  - Store the result into Y
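The prologue, inner loop, and epilogue can be traced in Python with variables named after the registers. This is a sketch only: list indices stand in for pointers, Python integers ignore the 16-bit and 32-bit register widths, and the function name is hypothetical.

```python
def dot_product_unoptimized(a, x):
    """Follow the prologue / inner loop / epilogue steps register by register."""
    # Prologue
    a5 = 0        # index standing in for the pointer to a(n)
    a6 = 0        # index standing in for the pointer to x(n)
    a2 = len(a)   # loop counter N (assumed to be at least 1)
    a4 = 0        # accumulator
    # Inner loop: the counter is tested after the body, as in the slide
    while True:
        a0 = a[a5]    # load a(n)
        a5 += 1       # post-increment pointer
        a1 = x[a6]    # load x(n)
        a6 += 1       # post-increment pointer
        a3 = a0 * a1  # multiply
        a4 = a4 + a3  # accumulate
        a2 = a2 - 1   # decrement loop counter
        if a2 == 0:   # continue only while the counter is not zero
            break
    # Epilogue: the value that would be stored to Y
    return a4
```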

Example: Vector Dot Product (Unoptimized)
[Figure: arrays of coefficients a(n) and data x(n)]
Using A data path only

            ; clear A4 and initialize pointers A5, A6, and A7
            MVK  .S1  40, A2        ; A2 = 40 (loop counter)
    loop    LDH  .D1  *A5++, A0     ; A0 = a(n)
            LDH  .D1  *A6++, A1     ; A1 = x(n)
            MPY  .M1  A0, A1, A3    ; A3 = a(n) * x(n)
            ADD  .L1  A3, A4, A4    ; Y = Y + A3
            SUB  .L1  A2, 1, A2     ; decrement loop counter
      [A2]  B    .S1  loop          ; if A2 != 0, then branch
            STH  .D1  A4, *A7       ; *A7 = Y

Example: Vector Dot Product (Unoptimized)
• MoVeKonstant (MVK)
  - MVK .S 40, A2 ; A2 = 40
  - Lower 16 bits of A2 are loaded
• Conditional branch
  - [condition] B .S loop
  - [A2] means to execute the instruction if A2 != 0
  - Only A1, A2, B0, B1, and B2 can be used
• Loading registers
  - LDH .D *A5, A0 ; Loads half-word into A0 from memory
• Registers may be used as pointers (*A1++)
• Implementation not efficient due to pipeline effects
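MVK and the [A2] condition can be modeled in a few lines of Python. This is a rough sketch: the function names are hypothetical, registers are represented as unsigned 32-bit integers, and the model relies on MVK sign-extending its 16-bit constant into the full register.

```python
def mvk(constant):
    """Model MVK: a 16-bit signed constant, sign-extended to 32 bits."""
    low = constant & 0xFFFF
    return (low | 0xFFFF0000) if low & 0x8000 else low


def predicated(condition_register, operation, *args):
    """Model [reg] predication: run the operation only if the register is nonzero."""
    if condition_register != 0:
        return operation(*args)
    return None  # instruction is skipped


a2 = mvk(40)                                   # A2 = 40
taken = predicated(a2, lambda: "branch")       # [A2] B loop, A2 != 0
skipped = predicated(0, lambda: "branch")      # [A2] B loop, A2 == 0
```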

Pipelining
• CPU operations
  - Fetch instruction from (on-chip) program memory
  - Decode instruction
  - Execute instruction including reading data values
• Overlap operations to increase performance
  - Pipeline CPU operations to increase clock speed over a sequential implementation
  - Separate parallel functional units
  - Peripheral interfaces for I/O do not burden CPU

Pipelining
[Figure: timing diagrams of the Fetch, Decode, Read, and Execute stages for four processor organizations]
• Sequential (Motorola 56000)
• Pipelined (most conventional DSP processors)
• Superscalar (Pentium, MIPS)
• Superpipelined (TMS320C6000)

Managing Pipelines
• compiler or programmer (TMS320C6000)
• pipeline interlocking in processor (TMS320C30)
• hardware instruction scheduling
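The gain from overlapping stages can be estimated with an idealized cycle count. The sketch assumes one clock cycle per stage and no stalls or hazards, which real processors do not guarantee; the function names are hypothetical.

```python
def sequential_cycles(num_instructions, num_stages):
    """Each instruction finishes every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages):
    """Stages overlap: fill the pipeline once, then retire one instruction per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1
```

For example, 10 instructions through 4 stages take 40 cycles sequentially but 13 cycles when pipelined under these assumptions.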

TMS320C6000 Pipeline
• One instruction cycle every clock cycle
• Deep pipeline
  - 7-11 stages in C62x: fetch 4, decode 2, execute 1-5
  - 7-16 stages in C67x: fetch 4, decode 2, execute 1-10
  - If a branch is in the pipeline, interrupts are disabled
  - Avoid branches by using conditional execution
• No hardware protection against pipeline hazards
  - Compiler and assembler must prevent pipeline hazards
• Dispatches instructions in packets
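The stage counts are the sum of the fetch, decode, and execute phases. A short Python check of the arithmetic, with a hypothetical function name:

```python
FETCH_PHASES = 4   # from the slide
DECODE_PHASES = 2  # from the slide


def pipeline_depth_range(min_execute, max_execute):
    """Return (shortest, longest) pipeline depth for a range of execute phases."""
    base = FETCH_PHASES + DECODE_PHASES
    return base + min_execute, base + max_execute


c62x_depth = pipeline_depth_range(1, 5)    # execute 1-5
c67x_depth = pipeline_depth_range(1, 10)   # execute 1-10
```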

Program Fetch (F)
• Program fetching consists of 4 phases
  - Generate fetch address (FG)
  - Send address to memory (FS)
  - Wait for data ready (FW)
  - Read opcode (FR)
• Fetch packet consists of eight 32-bit instructions
[Figure: fetch phases FG, FS, FW, and FR between the C6000 and memory]

Decode Stage (D)
• Decode stage consists of two phases
  - Dispatch instruction to functional unit (DP)
  - Instruction decoded at functional unit (DC)
[Figure: fetch phases FG, FS, FW, and FR between the C6000 and memory, followed by DP and DC]

Execute Stage (E)
[Figure: execute stage diagram; contents not recoverable]

Vector Dot Product with Pipeline Effects

            ; clear A4 and initialize pointers A5, A6, and A7
            MVK  .S1  40, A2        ; A2 = 40 (loop counter)
    loop    LDH  .D1  *A5++, A0     ; A0 = a(n)
            LDH  .D1  *A6++, A1     ; A1 = x(n)
            MPY  .M1  A0, A1, A3    ; A3 = a(n) * x(n)
            ADD  .L1  A3, A4, A4    ; Y = Y + A3
            SUB  .L1  A2, 1, A2     ; decrement loop counter
      [A2]  B    .S1  loop          ; if A2 != 0, then branch
            STH  .D1  A4, *A7       ; *A7 = Y

• Multiplication has a delay of 1 cycle
• Load has a delay of four cycles

Fetch packet
[Figure: pipeline table with columns F, DP, DC, E1-E6 and rows MVK, LDH, MPY, ADD, SUB, B, STH. The fetch packet is in the fetch phases (F1-4). Time (t) = 4 clock cycles.]

Dispatch
[Figure: pipeline table with columns F, DP, DC, E1-E6 and rows MVK, LDH, MPY, ADD, SUB, B, STH. The first packet is in dispatch while the next fetch packet is in F(2-5). Time (t) = 5 clock cycles.]

Decode
[Figure: pipeline table with columns F, DP, DC, E1-E6. MVK has advanced to decode, with LDH, MPY, ADD, SUB, B, and STH behind it and the next fetch packet in F(2-5). Time (t) = 6 clock cycles.]

Execute (E1)
[Figure: pipeline table with columns F, DP, DC, E1-E6. MVK is in E1 and the first LDH follows it; the second LDH, MPY, ADD, SUB, B, and STH are behind, with the next fetch packet in F(2-5). Time (t) = 7 clock cycles.]

Execute (MVK done, LDH in E1)
[Figure: pipeline table with columns F, DP, DC, E1-E6. MVK is done and LDH is in E1; MPY, ADD, SUB, B, and STH are behind, with the next fetch packet in F(2-5). Time (t) = 8 clock cycles.]

Vector Dot Product with Pipeline Effects

            ; clear A4 and initialize pointers A5, A6, and A7
            MVK  .S1  40, A2        ; A2 = 40 (loop counter)
    loop    LDH  .D1  *A5++, A0     ; A0 = a(n)
            LDH  .D1  *A6++, A1     ; A1 = x(n)
            NOP       4
            MPY  .M1  A0, A1, A3    ; A3 = a(n) * x(n)
            NOP
            ADD  .L1  A3, A4, A4    ; Y = Y + A3
            SUB  .L1  A2, 1, A2     ; decrement loop counter
      [A2]  B    .S1  loop          ; if A2 != 0, then branch
            NOP       5
            STH  .D1  A4, *A7       ; *A7 = Y

• Assembler will automatically insert NOP instructions
• Assembler can also make sequential code parallel
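The cost of the inserted NOPs can be tallied from the listing. The sketch below assumes each instruction issues in one cycle and that NOP n occupies n cycles; the table and function names are hypothetical.

```python
# (mnemonic, cycles issued) for one pass through the loop body in the listing.
LOOP_BODY = [
    ("LDH", 1),    # load a(n)
    ("LDH", 1),    # load x(n)
    ("NOP 4", 4),  # wait out the load delay
    ("MPY", 1),
    ("NOP", 1),    # wait out the multiply delay
    ("ADD", 1),
    ("SUB", 1),
    ("B", 1),
    ("NOP 5", 5),  # wait for the branch to take effect
]


def cycles_per_iteration(body):
    """Sum the issue cycles of one loop iteration."""
    return sum(cycles for _, cycles in body)


def useful_cycles(body):
    """Count cycles that issue something other than a NOP."""
    return sum(cycles for name, cycles in body if not name.startswith("NOP"))


def total_loop_cycles(body, iterations):
    return cycles_per_iteration(body) * iterations
```

Under these assumptions each iteration takes 16 cycles, of which only 6 do useful work, so the 40-iteration loop needs 640 cycles for 40 multiply-accumulates.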

Optimized Vector Dot Product on the C6000
• Prologue
  - Retime dot product to compute two terms per cycle
  - Initialize pointers: A5 for a(n), B6 for x(n), A7 for y(n)
  - Move number of times to loop (N) divided by 2 into A2
• Inner loop
  - Put a(n) and a(n+1) in A0 and x(n) and x(n+1) in A1 (packed data)
  - Multiply a(n) x(n) and a(n+1) x(n+1)
  - Accumulate even (odd) indexed terms in A4 (B4)
  - Decrement loop counter (A2)
• Store result
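The retimed loop keeps two independent accumulators and combines them at the end. A minimal Python sketch follows, assuming equal-length vectors with an even number of elements; the function name is hypothetical, and Python integers ignore register widths.

```python
def dot_product_two_accumulators(a, x):
    """Accumulate even- and odd-indexed terms separately, then combine."""
    if len(a) != len(x) or len(a) % 2:
        raise ValueError("vectors must have equal, even length")
    even_acc = 0  # plays the role of A4
    odd_acc = 0   # plays the role of B4
    for n in range(0, len(a), 2):  # N / 2 iterations
        even_acc += a[n] * x[n]
        odd_acc += a[n + 1] * x[n + 1]
    return even_acc + odd_acc
```

Because the two running sums do not depend on each other, the two multiply-accumulates in each iteration can proceed on separate data paths.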

FIR Filter Implementation on the C6000

            MVK   .S1   0x0001, AMR   ; modulo block size 2^2
            MVKH  .S1   0x4000, AMR   ; modulo addr register B6
            MVK   .S2   2, A2         ; A2 = 2 (four-tap filter)
            ZERO  .L1   A4            ; initialize accumulators
            ZERO  .L2   B4
            ; initialize pointers A5, B6, and A7
    fir     LDW   .D1   *A5++, A0     ; load a(n) and a(n+1)
            LDW   .D2   *B6++, B1     ; load x(n) and x(n+1)
            MPY   .M1X  A0, B1, A3    ; A3 = a(n) * x(n)
            MPYH  .M2X  A0, B1, B3    ; B3 = a(n+1) * x(n+1)
            ADD   .L1   A3, A4, A4    ; yeven(n) += A3
            ADD   .L2   B3, B4, B4    ; yodd(n) += B3
      [A2]  SUB   .S1   A2, 1, A2     ; decrement loop counter
      [A2]  B     .S2   fir           ; if A2 != 0, then branch
            ADD   .L1   A4, B4, A4    ; Y = Yodd + Yeven
            STH   .D1   A4, *A7       ; *A7 = Y

Throughput of two multiply-accumulates per instruction cycle
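The listing depends on each 32-bit word holding two 16-bit samples, with MPY multiplying the lower halfwords and MPYH the upper halfwords. The Python sketch below models that behavior. The helper names are hypothetical, and placing a(n) in the lower halfword is an assumption taken from the comments in the listing.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pack_halfwords(low, high):
    """Pack two 16-bit values into one 32-bit word (low in bits 15-0)."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def mpy(word_a, word_b):
    """Model MPY: signed product of the lower halfwords."""
    return to_signed16(word_a) * to_signed16(word_b)


def mpyh(word_a, word_b):
    """Model MPYH: signed product of the upper halfwords."""
    return to_signed16(word_a >> 16) * to_signed16(word_b >> 16)


a_word = pack_halfwords(3, -2)   # a(n) = 3, a(n+1) = -2
x_word = pack_halfwords(-4, 5)   # x(n) = -4, x(n+1) = 5
even_term = mpy(a_word, x_word)  # a(n) * x(n)
odd_term = mpyh(a_word, x_word)  # a(n+1) * x(n+1)
```

One word load per vector therefore feeds both multipliers, which is how the loop reaches two multiply-accumulates per cycle.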

Selected TMS320C6000 Fixed-Point DSPs
[Table: device comparison; contents not recoverable]
C6416 has Viterbi and Turbo decoder coprocessors.
Unit price is for 1,000 units. Prices effective June 3, 2005.
For more information: http://www.ti.com

Conclusion
• Conventional digital signal processors
  - High performance vs. power consumption/cost/volume
  - Excel at one-dimensional processing
  - Have instructions tailored to specific applications
• TMS320C6000 VLIW DSP
  - High performance vs. cost/volume
  - Excel at multidimensional signal processing
  - Maximum of 8 RISC instructions per cycle

Conclusion
• Web resources
  - comp.dsp news group: FAQ www.bdti.com/faq/dsp_faq.html
  - embedded processors and systems: www.eg3.com
  - on-line courses and DSP boards: www.techonline.com
• References
  - R. Bhargava, R. Radhakrishnan, B. L. Evans, and L. K. John, "Evaluating MMX Technology Using DSP and Multimedia Applications," Proc. IEEE Sym. Microarchitecture, pp. 37-46, 1998. http://www.ece.utexas.edu/~ravib/mmxdsp/
  - B. L. Evans, "EE 345S Real-Time DSP Laboratory," UT Austin. http://www.ece.utexas.edu/~bevans/courses/realtime/
  - B. L. Evans, "EE 382C Embedded Software Systems," UT Austin. http://www.ece.utexas.edu/~bevans/courses/ee382c/

Supplemental Slides: FIR Filter on a TMS320C5000
[Figure labels: Coefficients, Data]

    COEFFP  .set  02000h         ; Program mem address
    X       .set  037Fh          ; Newest data sample
    LASTAP  .set  037FH          ; Oldest data sample
    …
            LAR   AR3, #LASTAP   ; Point to oldest sample
            RPT   #127           ; Repeat next inst. 126 times
            MACD  COEFFP, *-     ; Compute one tap of FIR
            APAC
            SACH  Y, 1           ; Store result -- note shift

Supplemental Slides: TMS320C6200 vs. StarCore S140
[Table: processor comparison; contents not recoverable]
* Does not count equivalent RISC operations for modulo addressing
** On the C6200, there is a performance penalty for 40-bit accumulation