INTRODUCTION TO DIGITAL SIGNAL PROCESSORS (DSPs)
Prof. Brian L. Evans
Contributions by Niranjan Damera-Venkata and Magesh Valliappan
Embedded Signal Processing Laboratory
The University of Texas at Austin, Austin, TX 78712-1084
http://signal.ece.utexas.edu/
[Figure labels: accumulator architecture, memory-register architecture, load-store architecture]

Outline
• Signal processing applications
• Conventional DSP architecture
• Pipelining in DSP processors
• RISC vs. DSP processor architectures
• TI TMS320C6000 DSP architecture introduction
• Signal processing on general-purpose processors
• Conclusion

Signal Processing Applications
• Embedded system demand: volume, …
  - 400 million units/year: automobiles, PCs, cell phones
  - 30 million units/year: ADSL modems and printers
• Embedded system cost and input/output rates
  - Low-cost, medium-throughput (single DSP): low-end printers, handsets, sound cards, car audio, disk drives
  - High-cost, high-throughput (multiple DSPs): high-end printers, wireless basestations, 3-D sonar, 3-D images from 2-D X-rays (tomographic reconstruction)
• Embedded processor requirements
  - Inexpensive with small area and volume
  - Predictable input/output (I/O) rates to/from processor
  - Power constraints (severe for handheld devices)

Conventional DSP Processors
• Low cost: $3/processor in volume
• Deterministic interrupt service routine latency guarantees predictable input/output rates
  - On-chip direct memory access (DMA) controllers
    * Process streaming input/output separately from CPU
    * Send interrupt to CPU when block has been read/written
  - Ping-pong buffering
    * CPU reads/writes buffer 1 as DMA reads/writes buffer 2
    * After DMA finishes buffer 2, roles of buffers 1 & 2 switch
• Low power consumption: 10-100 mW
  - TI TMS320C54: 0.32 mA/MIP, 76.8 mW at 1.5 V, 160 MHz
  - TI TMS320C55: 0.05 mA/MIP, 22.5 mW at 1.5 V, 300 MHz
• Based on conventional (pre-1996) architecture
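The ping-pong buffering scheme above can be sketched in software. This is only an illustration under stated assumptions: the DMA controller and its interrupt are simulated sequentially in one thread, and the names `ping_pong`, `blocks`, and `process` are hypothetical, not part of any TI API.

```python
def ping_pong(blocks, process):
    """Process a stream of blocks using two alternating buffers.

    Sequential simulation only: on real hardware the DMA would fill one
    buffer while the CPU works on the other at the same time.
    """
    buffers = [None, None]
    dma_index = 0                 # buffer the simulated DMA fills next
    results = []
    for block in blocks:
        buffers[dma_index] = list(block)   # "DMA" writes a complete block
        cpu_index = dma_index              # "interrupt": this block is ready
        dma_index = 1 - dma_index          # roles of buffers 1 and 2 switch
        results.append(process(buffers[cpu_index]))
    return results
```

For example, `ping_pong([[1, 2], [3, 4], [5, 6]], sum)` processes each block in turn while alternating between the two buffers.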

Conventional DSP Architecture
• Multiply-accumulate (MAC) in 1 instruction cycle
• Harvard architecture for fast on-chip I/O
  - Data memory/bus separate from program memory/bus
  - One read from program memory per instruction cycle
  - Two reads/writes from/to data memory per instruction cycle
• Instructions to keep pipeline (3-6 stages) full
  - Zero-overhead looping (one pipeline flush to set up)
  - Delayed branches
• Special addressing modes supported in hardware
  - Bit-reversed addressing (e.g. fast Fourier transforms)
  - Modulo addressing for circular buffers (e.g. FIR filters)
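Bit-reversed addressing can be illustrated in software. The sketch below computes the index order that the hardware addressing mode would generate for a radix-2 FFT; the function names are hypothetical, and the buffer length is assumed to be a power of two.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    reversed_index = 0
    for _ in range(num_bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reversed_order(length):
    """Return the bit-reversed index sequence for a power-of-two length."""
    num_bits = length.bit_length() - 1   # assumes length is a power of two
    return [bit_reverse(i, num_bits) for i in range(length)]
```

For a length of 8, the sequence is 0, 4, 2, 6, 1, 5, 3, 7. A DSP produces this ordering as part of the address update, so no extra instructions are spent reordering FFT data.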

Conventional DSP Architecture (cont'd)
• Data shifting
  - Used in finite and infinite impulse response filters
• Linear buffer
  - Sort by time index
  - Update: discard oldest data, copy old data left, insert new data
• Circular buffer
  - Oldest data index
  - Update: insert new data at oldest index, update oldest index

Data Shifting Using a Linear Buffer of length K
Time     Buffer contents                                  Next sample
n=N      x[N-K+1]  x[N-K+2]  ...  x[N-1]  x[N]            x[N+1]
n=N+1    x[N-K+2]  x[N-K+3]  ...  x[N]    x[N+1]          x[N+2]
n=N+2    x[N-K+3]  x[N-K+4]  ...  x[N+1]  x[N+2]          x[N+3]

Modulo Addressing Using a Circular Buffer
Time     Buffer contents                                                  Next sample
n=N      ... x[N-2]  x[N-1]  x[N]    x[N-K+1]  x[N-K+2] ...               x[N+1]
n=N+1    ... x[N-2]  x[N-1]  x[N]    x[N+1]    x[N-K+2]  x[N-K+3] ...     x[N+2]
n=N+2    ... x[N-1]  x[N]    x[N+1]  x[N+2]    x[N-K+3]  x[N-K+4] ...     x[N+3]
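The circular-buffer update can be sketched in software as follows. This is a minimal illustration, not DSP code: the class and function names are hypothetical, and the `%` operator stands in for the modulo addressing that a DSP performs in hardware.

```python
class CircularBuffer:
    """Fixed-length sample history with an 'oldest data' index."""

    def __init__(self, length):
        self.data = [0.0] * length
        self.oldest = 0               # index of the oldest sample

    def insert(self, sample):
        # Insert new data at the oldest index, then update the oldest index.
        self.data[self.oldest] = sample
        self.oldest = (self.oldest + 1) % len(self.data)

    def newest_first(self):
        # Walk backwards from the newest sample, wrapping modulo the length.
        k = len(self.data)
        return [self.data[(self.oldest - 1 - i) % k] for i in range(k)]


def fir_filter(coefficients, samples):
    """FIR filter: each output is a sum of multiply-accumulates."""
    buffer = CircularBuffer(len(coefficients))
    output = []
    for sample in samples:
        buffer.insert(sample)
        history = buffer.newest_first()
        output.append(sum(h * x for h, x in zip(coefficients, history)))
    return output
```

Filtering a unit impulse returns the coefficients themselves, which is a quick sanity check. Unlike the linear buffer, no samples are copied on each update; only one index changes.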

Conventional DSP Processors Summary

Conventional DSP Processor Families
• Floating-point DSPs
  - Used in initial prototyping of algorithms
  - Resurgence due to professional and car audio
• DSP market: fixed-point 95%, floating-point 5%
• Different on-chip configurations in each family
  - Size and map of data and program memory
  - A/D, input/output buffers, interfaces, timers, and D/A
• Drawbacks to conventional DSP processors
  - No byte addressing (needed for images and video)
  - Limited on-chip memory
  - Limited addressable memory on fixed-point DSPs (exceptions include Motorola 56300 and TI C5409)
  - Non-standard C extensions for fixed-point data type

Pipelining
[Timing diagrams, each over the stages Fetch, Decode, Read, Execute]
• Sequential (Motorola 56000)
• Pipelined (most conventional DSPs)
• Superscalar (Pentium, MIPS)
• Superpipelined (TMS320C6000)

Pipelining
• Process instruction stream in stages (as oil flows through a pipeline)
• Increase throughput

Managing Pipelines
• Compiler or programmer
• Pipeline interlocking

Pipelining: Operation
• Time-stationary pipeline model
  - Programmer controls each cycle
  - Example: Motorola DSP56001 (has separate X/Y data memories/registers)
      MAC X0,Y0,A   X:(R0)+,X0   Y:(R4)-,Y0
• Data-stationary pipeline model
  - Programmer specifies data operations
  - Example: TI TMS320C30
      MPYF *++AR0(1),*++AR1(IR0),R0
• Interlocked pipeline
  - "Protection" from pipeline effects
  - May not be reported by simulators: inner loops may take extra cycles

MAC means multiplication-accumulation.

Cycle     1  2  3  4  5  6  7  8  9  10 11 12
Fetch     D  E  F  G  H  I  J  K  L  -  -  -
Decode    C  D  E  F  G  H  I  J  K  L  -  -
Read      B  C  D  E  F  G  H  I  J  K  L  -
Execute   A  B  C  D  E  F  G  H  I  J  K  L
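The stage-occupancy table for an ideal pipeline (no hazards, no stalls) can be generated with a short sketch. The function name and the "-" marker for an empty stage are assumptions made for illustration.

```python
def pipeline_table(instructions, stages=("Fetch", "Decode", "Read", "Execute")):
    """Return, per stage, which instruction occupies it in each cycle.

    Assumes an ideal pipeline: one instruction enters per cycle and
    nothing stalls. "-" marks an empty stage.
    """
    count = len(instructions)
    cycles = count + len(stages) - 1
    table = {}
    for offset, stage in enumerate(stages):
        row = []
        for cycle in range(cycles):
            i = cycle - offset    # instruction index in this stage this cycle
            row.append(instructions[i] if 0 <= i < count else "-")
        table[stage] = row
    return table
```

With four stages, n instructions finish in n + 3 cycles instead of the 4n cycles that strictly sequential execution would take.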

Pipelining: Hazards
• A control hazard occurs when a branch instruction is decoded
  - Processor "flushes" the pipeline, or
  - Use delayed branch (expose pipeline)
• A data hazard occurs because an operand cannot be read yet
  - Intended by programmer, or
  - Interlock hardware inserts "bubble"
  - TI TMS320C5000 (20 CPU & 16 I/O registers, one accumulator, and one address pointer ARP implied by *)
      LAR  AR2,ADDR   ; load address reg.
      LACC *          ; load accumulator w/ contents of AR2
    LAR: 2 cycles to update AR2 & ARP; need NOP after it

Fetch  Decode  Read  Execute
D      C       B     A
E      D       C     B
F      E       D     C
br     F       E     D
G      br      F     E
-      -       br    F
-      -       -     br
X      -       -     -
Y      X       -     -
Z      Y       X     -

Pipelining: Avoiding Control Hazards
• High throughput performance of DSPs is helped by on-chip dedicated logic for looping (downcounters/looping registers)
      ; repeat TBLR inst. COUNT-1 times
      RPT  COUNT
      TBLR *+
• A repeat instruction repeats one instruction or a block of instructions after repeat
• The pipeline is filled with repeated instruction (or block of instructions)
• Cost: one pipeline flush only
[Pipeline diagram over the stages Fetch, Decode, Read, Execute: instructions A-F are followed by rpt, after which the repeated instruction X fills the pipeline]

RISC vs. DSP: Instruction Encoding
• RISC: superscalar, out-of-order execution
  [Block diagram: reorder, load/store, memory, floating-point unit, integer unit]
• DSP: horizontal microcode, in-order execution
  [Block diagram: load/store, memory, ALU, multiplier, address]

RISC vs. DSP: Memory Hierarchy
• RISC
  [Block diagram: registers, out of order, I/D cache, TLB, physical memory]
• DSP
  [Block diagram: registers, I cache, internal memories, external memories, DMA controller]
TLB: Translation Lookaside Buffer
DMA: Direct Memory Access

TI TMS320C6000 DSP Architecture
Simplified Architecture [block diagram]
• Program RAM or cache and data RAM, connected by internal buses (address, data)
• CPU
  - Functional units .D1, .M1, .L1, .S1 with registers A0-A15
  - Functional units .D2, .M2, .L2, .S2 with registers B0-B15
  - Control registers
• External memory (sync, async)
• Peripherals: DMA, serial port, host port, boot load, timers, power down

TI TMS320C6000 DSP Architecture
• Families: all support same C6000 instruction set
    C6200   fixed-point      150-300 MHz     ADSL, printers
    C6400   fixed-point      300-1,000 MHz   video, wireless basestations
    C6700   floating-point   100-225 MHz     medical imaging, pro-audio
• TMS320C6211: 150 MHz, $21 in volume
    300 million multiply-accumulates/s, 1200 RISC MIPS
    On-chip memory: 16 kwords program, 16 kwords data
• TMS320C6701 Evaluation Module Board: 167 MHz
    334 million multiply-accumulates/s, 1336 RISC MIPS
    On-chip memory: 16 kwords program, 16 kwords data
    External: one 133-MHz 64-kword, two 100-MHz 1-Mword
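The peak rates quoted above appear to follow directly from the clock rate. The sketch below assumes eight functional units each issuing one instruction per cycle and two multiplier units (.M1, .M2) each completing one multiply-accumulate per cycle; the function name is hypothetical.

```python
def peak_rates(clock_mhz, functional_units=8, multipliers=2):
    """Peak throughput, assuming every unit is busy on every cycle."""
    return {
        "risc_mips": clock_mhz * functional_units,        # million instr./s
        "million_macs_per_s": clock_mhz * multipliers,    # million MACs/s
    }
```

At 150 MHz this gives 1200 RISC MIPS and 300 million MACs/s, and at 167 MHz it gives 1336 RISC MIPS and 334 million MACs/s, which matches the figures on the slide. These are peak numbers; sustained rates depend on how well the code keeps all units busy.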

TI TMS320C6000 DSP Architecture
• Very long instruction word (VLIW) size of 256 bits
  - Eight 32-bit functional units with single cycle throughput
  - One instruction cycle per clock cycle
• Data word size is 32 bits
  - 16 (32 on C64) 32-bit registers in each of two data paths
  - 40 bits can be stored in adjacent even/odd registers
• Two parallel data paths
  - Data unit: 32-bit address calculations (modulo, linear)
  - Multiplier unit: 16-bit multiply with 32-bit result
  - Logical unit: 40-bit (saturation) arithmetic & compares
  - Shifter unit: 32-bit integer ALU and 40-bit shifter
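Storing a 40-bit value across an even/odd register pair can be sketched as follows. The sketch assumes the even register holds the 32 least significant bits and the odd register holds the upper 8 bits; the function names are hypothetical and the value is treated as unsigned.

```python
MASK_40 = (1 << 40) - 1
MASK_32 = (1 << 32) - 1


def split_40bit(value):
    """Split a 40-bit value into (even_register, odd_register).

    Assumption: even register = low 32 bits, odd register = high 8 bits.
    """
    value &= MASK_40
    return value & MASK_32, value >> 32


def join_40bit(even_register, odd_register):
    """Rebuild the 40-bit value from the register pair."""
    return ((odd_register & 0xFF) << 32) | (even_register & MASK_32)
```

The 8 extra bits act as guard bits, so a sum of many 32-bit products can grow beyond 32 bits without overflowing the accumulator.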

TI TMS320C6000 Instruction Set
C6000 Instruction Set by Functional Unit
.S Unit: ADD, ADDK, ADD2, AND, B, CLR, EXT, MV, MVC, MVK, MVKH, NEG, NOT, OR, SET, SHL, SHR, SSHL, SUB, SUB2, XOR, ZERO
.L Unit: ABS, ADD, AND, CMPEQ, CMPGT, CMPLT, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBC, XOR, ZERO
.D Unit: ADD, ADDA, LD, MV, NEG, ST, SUB, SUBA, ZERO
.M Unit: MPY, MPYH, SMPY, SMPYH
Other: NOP, IDLE
Six of the eight functional units can perform integer add, subtract, and move operations.

TI TMS320C6000 Instruction Set
C6000 Instruction Set by Category
Arithmetic: ABS, ADDA, ADDK, ADD2, MPYH, NEG, SMPYH, SADD, SAT, SSUB, SUBA, SUBC, SUB2, ZERO
Logical: AND, CMPEQ, CMPGT, CMPLT, NOT, OR, SHL, SHR, SSHL, XOR
Bit Management: CLR, EXT, LMBD, NORM, SET
Data Management: LD, MV, MVC, MVKH, ST
Program Control: B, IDLE, NOP
[Callouts: (un)signed multiplication; saturation/packed arithmetic]

C6000 vs. C5000 Addressing Modes
• Immediate: the operand is part of the instruction
• Register: operand is specified in a register
• Direct: address of operand is part of the instruction (added to imply memory page)
• Indirect: address of operand is stored in a register

Mode        TI C5000      TI C6000
Immediate   ADD #0FFh     add .L1 -13,A1,A6
Register    (implied)     add .L1 A7,A6,A7
Direct      ADD 010h      not supported
Indirect    ADD *         ldw .D1 *A5++[8],A1

TI TMS320C6000 DSP Architecture
• Deep pipeline
  - 7-11 stages in C6200: fetch 4, decode 2, execute 1-5
  - 7-16 stages in C6700: fetch 4, decode 2, execute 1-10
  - Pentium IV has an estimated 20 pipeline stages
• Avoid using branch instructions in code
  - Branch instruction in pipeline disables interrupts: latency of a branch is 5 cycles
  - Avoid branches by using conditional execution: every instruction can be conditionally executed
• No hardware protection against pipeline hazards
  - Compiler and assembler must prevent pipeline hazards

TI TMS320C6700 Extensions
C6700 Floating Point Extensions by Unit
.S Unit: ABSDP, ABSSP, CMPEQDP, CMPEQSP, CMPGTDP, CMPGTSP, CMPLTDP, CMPLTSP, RCPDP, RCPSP, RSQRDP, RSQRSP, SPDP
.L Unit: ADDDP, ADDSP, DPINT, DPSP, DPTRUNC, INTDP, INTSP, SPINT, SPTRUNC, SUBDP, SUBSP
.D Unit: ADDAD, LDDW
.M Unit: MPYDP, MPYI, MPYID, MPYSP
Four functional units can perform IEEE single-precision (SP) and double-precision (DP) floating-point add, subtract, and move. Operations beginning with R are reciprocal calculations.

C6712 vs. C6713
                               C6712         C6713
Clock                          150 MHz       225 MHz
Peak performance               900 MFLOPS    1350 MFLOPS
L1 program/data memory         4 kB/4 kB     4 kB/4 kB
L2 cache                       64 kB         256 kB
L2 SRAM                        0 kB          192 kB
On-chip data bus bandwidth     1200 MB/s     1800 MB/s
Price each in volume           $13.50        $26.85
Information as of December 14, 2003

Digital Signal Processor Cores
• ASIC with
  - Programmable digital signal processor core
  - RAM
  - ROM
  - Standard cells
  - Codec
  - Peripherals
  - Gate array
  - Microcontroller core

General Purpose Processors
• Multimedia applications on PCs
  - Video, audio, graphics and animation
  - Repetitive parallel sequences of instructions
• Single Instruction Multiple Data (SIMD)
  - One instruction acts on multiple data in parallel
  - Well-suited for graphics
• Native signal processing extensions use SIMD
  - Sun Visual Instruction Set [1995] (UltraSPARC 1/2)
  - Intel MMX [1996] (Pentium I/II/IV)
  - Intel Streaming SIMD Extensions (Pentium III)

DSP on General Purpose Processors (cont'd)
• Programming is considerably tougher
  - Ability of compilers to generate code for instruction set extensions may lag (e.g. four years for Pentium MMX)
  - Libraries of routines using native signal processing
  - Hand code in assembly for best performance
• Single-instruction multiple-data (SIMD) approach
  - Pack/unpack data not aligned on SIMD word boundaries
  - Saturation arithmetic in MMX; not supported in VIS
  - Extended-precision accumulation in MMX; none in VIS
• Application speedup for Intel MMX and Sun VIS
  - Signal and image processing: 1.5:1 to 2:1
  - Graphics: 4:1 to 6:1 (no packing/unpacking)

Intel MMX Instruction Set
• 64-bit SIMD register (4 data types)
  - 64-bit quad word
  - Packed byte (8 bytes packed into 64 bits)
  - Packed word (4 16-bit words packed into 64 bits)
  - Packed double word (2 double words packed into 64 bits)
• 57 new instructions
  - Pack and unpack
  - Add, subtract, multiply, and multiply/accumulate
• Saturation and wraparound arithmetic
• Maximum parallelism possible
  - 8:1 for 8-bit additions
  - 4:1 for 8 x 16 multiplication or 16-bit additions
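The difference between wraparound and saturation arithmetic on packed bytes can be illustrated in software. The sketch below models one 64-bit register as a list of eight unsigned bytes; it does not call MMX instructions, and the function names are hypothetical.

```python
def packed_add_wraparound(a, b):
    """Add eight unsigned bytes lane by lane; overflow wraps modulo 256."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def packed_add_saturate(a, b):
    """Add eight unsigned bytes lane by lane; overflow clamps to 255."""
    return [min(x + y, 0xFF) for x, y in zip(a, b)]
```

Adding 250 and 10 in one lane gives 4 with wraparound but 255 with saturation. For pixel data, saturation keeps a bright value bright, whereas wraparound would turn it dark.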

Concluding Remarks
• Conventional digital signal processors
  - High performance vs. power consumption/cost/volume
  - Excel at one-dimensional processing
  - Per cycle: 1 16-bit MAC & 4 16-bit RISC instructions
• TMS320C6000 VLIW DSP family
  - High performance vs. cost/volume
  - Excel at multidimensional signal processing
  - Per cycle: 2 16-bit MACs & 4 32-bit RISC instructions
• Native signal processing
  - Available on desktop computers
  - Excels at graphics
  - Per cycle: 2 8 x 16 MACs OR 8 8-bit RISC instructions
• Use assembly for computational kernels and C for main program (control code, interrupt def.)

Concluding Remarks
• Digital signal processor market
  - 40% annual growth 1990-2000: #1 in semiconductor market
  - Revenue: $4.4B '99, $6.1B '00, $4.5B '01, $4.9B '02, $6B '03
  - 2000: 44% TI, 23% Agere, 13% Motorola, 10% Analog Dev.
  - 2001: 40% TI, 16% Agere, 12% Motorola, 8% Analog Dev.
  - 2002: 43% TI, 14% Motorola, 14% Agere, 9% Analog Dev.
• Independent processor benchmarking by industry
  - Berkeley Design Technology Inc. http://www.bdti.com
  - EDN Embedded Microprocessor Benchmark Consortium http://www.eembc.org
• Web resources
  - Newsgroup comp.dsp: FAQ http://www.bdti.com/faq/dsp_faq.html
  - Embedded processors and systems: http://www.eg3.com
  - On-line courses: http://www.techonline.com
