
Introduction to Digital Signal Processors (DSPs)

Outline/objectives
• Identify the most important DSP processor architecture features and how they relate to DSP applications
• Understand the types of code appropriate for DSP implementation

What is a DSP?
• A specialized microprocessor for real-time DSP applications
  – Digital filtering (FIR and IIR)
  – FFT
  – Convolution, matrix multiplication, etc.

Hardware used in DSP
Columns: ASIC, FPGA, GPP, DSP
Performance: Very High Medium High
Flexibility: Very low High
Power consumption: Very low low Medium Low Medium
Development time: Long Medium Short

Common DSP features
• Harvard architecture
• Dedicated single-cycle Multiply-Accumulate (MAC) instruction (hardware MAC units)
• Single Instruction, Multiple Data (SIMD) / Very Long Instruction Word (VLIW) architecture
• Pipelining
• Saturation arithmetic
• Zero overhead looping
• Hardware circular addressing
• Cache
• DMA

Harvard Architecture
• Physically separate memories and paths for instructions and data

Single-Cycle MAC unit
• Can compute a sum of n products in n cycles
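
As a rough illustration of the sum-of-products pattern, the sketch below accumulates n products in a plain C loop. The function name `mac_dot`, the 16-bit sample type and the 64-bit accumulator are assumptions made for this example; on a DSP with a hardware MAC unit each loop iteration would ideally map to one MAC instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of n products: acc = x[0]*h[0] + ... + x[n-1]*h[n-1].
 * The name mac_dot and the chosen integer widths are illustrative
 * assumptions, not taken from any particular DSP toolchain. */
int64_t mac_dot(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;                      /* wide accumulator to avoid overflow */
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)x[i] * h[i];      /* one multiply-accumulate per element */
    }
    return acc;
}
```

For example, calling `mac_dot` on the vectors {1, 2, 3} and {4, 5, 6} returns 1*4 + 2*5 + 3*6 = 32.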

Single Instruction, Multiple Data (SIMD)
• A technique for data-level parallelism: a number of processing elements apply the same instruction to different data elements in parallel
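
The shape of SIMD-friendly code can be sketched in portable C: the same operation is applied to a group of independent elements at a time. The lane count `LANES` and the function name `vec_add` are assumptions for this example, and whether a compiler actually emits SIMD instructions for the inner loop depends on the target and its settings.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* assumed number of processing elements, for illustration */

/* Element-wise addition written so that each group of LANES elements
 * undergoes the same operation with no dependencies between lanes. */
void vec_add(int16_t *dst, const int16_t *a, const int16_t *b, size_t n)
{
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++) {
            dst[i + lane] = (int16_t)(a[i + lane] + b[i + lane]);
        }
    }
    for (; i < n; i++) {                  /* leftover elements, one at a time */
        dst[i] = (int16_t)(a[i] + b[i]);
    }
}
```

The key property is that every lane performs the same operation on different data, so no lane has to wait for another.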

Very Long Instruction Word (VLIW)
• A technique for instruction-level parallelism: instructions with no dependencies on each other (known at compile time) are executed in parallel
• Example of a single VLIW instruction: F=a+b; c=e/g; d=x&y; w=z*h;
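
The compile-time dependency check can be sketched as follows. This is a simplified model under stated assumptions: the type `vliw_op` and the functions `depends_on` and `can_bundle` are hypothetical names, registers are identified by single characters, and only data dependencies are considered. A real VLIW compiler must also respect the number and type of functional units available.

```c
#include <stdbool.h>

/* One three-operand operation: dst = src1 <op> src2.
 * Single-character register names are an illustrative assumption. */
typedef struct {
    char dst;
    char src1;
    char src2;
} vliw_op;

/* True if `later` cannot issue in the same word as `earlier`:
 * it reads the earlier result or writes the same destination. */
bool depends_on(vliw_op later, vliw_op earlier)
{
    bool reads_result = (later.src1 == earlier.dst) || (later.src2 == earlier.dst);
    bool same_target  = (later.dst == earlier.dst);
    return reads_result || same_target;
}

/* True if no operation depends on any operation that precedes it. */
bool can_bundle(const vliw_op *ops, int count)
{
    for (int i = 0; i < count; i++) {
        for (int j = 0; j < i; j++) {
            if (depends_on(ops[i], ops[j])) {
                return false;
            }
        }
    }
    return true;
}
```

Applied to the four operations of the example instruction above (F=a+b; c=e/g; d=x&y; w=z*h), `can_bundle` returns true because no operation uses another's result.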

CISC vs. RISC vs. VLIW

Pipelining
• DSPs commonly feature deep pipelines
• TMS320C6x processors have 3 pipeline stages, each with a number of phases (cycles):
  – Fetch
    • Program Address Generate (PG)
    • Program Address Send (PS)
    • Program ready wait (PW)
    • Program receive (PR)
  – Decode
    • Dispatch (DP)
    • Decode (DC)
  – Execute
    • 6 to 10 phases
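
A common back-of-the-envelope model for pipeline timing, assuming an idealized pipeline with no stalls where one instruction (or one instruction word) enters per cycle: the first one needs as many cycles as there are phases, and each further one completes one cycle later. The function name `pipeline_cycles` is an assumption for this sketch, and real processors deviate from this model.

```c
#include <stdint.h>

/* Cycles for `count` instructions to complete on an ideal pipeline
 * with `depth` phases: the first takes `depth` cycles to drain through,
 * every following one finishes one cycle after its predecessor. */
uint32_t pipeline_cycles(uint32_t depth, uint32_t count)
{
    if (count == 0) {
        return 0;                 /* nothing issued, nothing to wait for */
    }
    return depth + (count - 1);
}
```

For instance, a 5-phase pipeline completes 8 back-to-back instructions in 5 + 7 = 12 cycles under this model.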

Saturation Arithmetic
• Fixed range for operations like addition and multiplication
• Results that would overflow or underflow are clamped to the maximum and minimum allowed value, respectively
• Associativity and distributivity no longer apply
• Signed one-byte saturation arithmetic examples:
  – 64 + 69 = 127
  – -127 - 5 = -128
  – (64 + 70) - 25 = 102 ≠ 64 + (70 - 25) = 109
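
A minimal software sketch of signed one-byte saturation, assuming the result is computed in a wider type and then clamped. The helper names `saturate8`, `sat_add8`, `sat_sub8` and `sat_mul8` are chosen for this example; DSPs typically provide this behavior in hardware rather than through code like this.

```c
#include <stdint.h>

/* Clamp a wide intermediate result into the signed 8-bit range [-128, 127]. */
static int8_t saturate8(int32_t v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}

/* Each operation is computed exactly in 32 bits, then saturated. */
int8_t sat_add8(int8_t a, int8_t b) { return saturate8((int32_t)a + b); }
int8_t sat_sub8(int8_t a, int8_t b) { return saturate8((int32_t)a - b); }
int8_t sat_mul8(int8_t a, int8_t b) { return saturate8((int32_t)a * b); }
```

With these helpers, `sat_sub8(sat_add8(64, 70), 25)` and `sat_add8(64, sat_sub8(70, 25))` give different results, which shows the loss of associativity: the first expression saturates at the intermediate sum, the second never leaves the representable range.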

Examples
• Perform the following operations using one-byte saturation arithmetic:
  – 0x77 + 0x99 =
  – 0x4 * 0x42 =
  – 0x3 * 0x51 =

Zero Overhead Looping
• Hardware support for loops with a constant number of iterations, using hardware loop counters and loop buffers
• No branching
• No loop overhead
• No pipeline stalls or branch prediction
• No need for loop unrolling

Hardware Circular Addressing
• A circular buffer is a data structure implementing a fixed-length queue of fixed-size objects, where objects are added at the head of the queue and removed from the tail
• Requires at least 2 pointers (head and tail)
• Extensively used in digital filtering: y[n] = a_0 x[n] + a_1 x[n-1] + … + a_k x[n-k]
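
The filter equation can be sketched in software with a circular buffer holding the most recent inputs. The names `fir_state`, `fir_step` and `FIR_TAPS` and the integer sample type are assumptions for this example. Because each new sample simply overwrites the oldest one in a filter delay line, this sketch gets by with a single index instead of separate head and tail pointers. Hardware circular addressing performs the wrap-around that is written out explicitly here.

```c
#include <stddef.h>
#include <stdint.h>

#define FIR_TAPS 4  /* k + 1 coefficients; value chosen for illustration */

typedef struct {
    int32_t history[FIR_TAPS]; /* circular buffer of the most recent inputs */
    size_t  head;              /* index where the next sample is written */
} fir_state;

/* Store input x and return y[n] = a[0]*x[n] + a[1]*x[n-1] + ... + a[k]*x[n-k]. */
int32_t fir_step(fir_state *s, const int32_t a[FIR_TAPS], int32_t x)
{
    s->history[s->head] = x;                 /* newest sample replaces the oldest */

    int32_t acc = 0;
    size_t idx = s->head;                    /* start at x[n], walk back in time */
    for (size_t j = 0; j < FIR_TAPS; j++) {
        acc += a[j] * s->history[idx];       /* one multiply-accumulate per tap */
        idx = (idx == 0) ? FIR_TAPS - 1 : idx - 1;  /* step back with wrap-around */
    }

    s->head = (s->head + 1) % FIR_TAPS;      /* advance write index with wrap-around */
    return acc;
}
```

Starting from a zero-initialized state, feeding a single 1 followed by zeros returns the coefficients a[0], a[1], ... in order, which is the impulse response of the filter.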

Direct Memory Access (DMA)
• A feature that allows peripherals to access main memory without the intervention of the CPU
• Typically, the CPU initiates a DMA transfer, does other operations while the transfer is in progress, and receives an interrupt from the DMA controller once the operation is complete
• Can create cache coherency problems (after a DMA transfer, the data in the cache may differ from the data in external memory)
• Requires a DMA controller

Cache memory
• Separate instruction and data L1 caches (Harvard architecture)
• Cache coherence protocols are required, since most systems use DMA

DSP vs. Microcontroller
• DSP
  – Harvard architecture
  – VLIW/SIMD (parallel execution units)
  – No bit-level operations
  – Hardware MACs
  – DSP applications
• Microcontroller
  – Mostly von Neumann architecture
  – Single execution unit
  – Flexible bit-level operations
  – No hardware MACs
  – Control applications

Examples
• Estimate how long the following code fragment will take to execute on:
  – A general-purpose processor with a 1 GHz operating frequency, five-stage pipelining, 5 cycles required for multiplication and 1 cycle for addition
  – A DSP running at 500 MHz, with zero overhead looping, 6 independent ALUs and 2 independent single-cycle MAC units

  for (i=0; i<8; i++) {
      a[i] = 2*i + 3;
      b[i] = 3*i + 5;
  }

Review Questions
• Which of the following code fragments is appropriate for SIMD implementation?
  – a[0]=b[0]+c[0]; a[2]=b[2]+c[2]; a[4]=b[4]+c[4]; a[6]=b[6]+c[6];
  – a[0]=b[0]&c[0]; a[0]=b[0]%c[0]; a[0]=b[0]+c[0]; a[0]=b[0]/c[0];
• Can the following instructions be merged into one VLIW instruction? If not, how many VLIW instructions are needed?
  – a=b+c; d=c/e; f=d&a; g=b%c;

Review Questions
• Which of the following is not a typical DSP feature?
  – Dedicated multiplier/MAC
  – Von Neumann memory architecture
  – Pipelining
  – Saturation arithmetic
• Which implementation would you choose for lowest power consumption?
  – ASIC
  – FPGA
  – General-Purpose Processor
  – DSP

Examples
• How many VLIW instructions does the following program fragment require if there are two independent data paths (a, b), each with 3 ALUs and 1 MAC available, and 8 instructions/word?
• How many cycles will it take to execute if these are the first instructions in the program and all instructions require 1 cycle, assuming the pipeline architecture of the Pipelining slide with 6 phases of execution?

  ADD a1, a2, a3   ; a3 = a1 + a2
  SUB b1, b3, b4   ; b4 = b1 - b3
  MUL a2, a3, a5   ; a5 = a2 * a3
  MUL b3, b4, b2   ; b2 = b3 * b4
  AND a7, a0, a1   ; a1 = a7 AND a0
  MUL a3, a4, a5   ; a5 = a3 * a4
  OR  a6, a3, a2   ; a2 = a6 OR a3
