INTRODUCTION TO DIGITAL SIGNAL PROCESSORS (DSPs)

Prof. Brian L. Evans
Contributions by Niranjan Damera-Venkata and Magesh Valliappan
Embedded Signal Processing Laboratory
The University of Texas at Austin, TX 78712-1084
http://signal.ece.utexas.edu/

[Figure labels: Accumulator architecture, Memory-register architecture, Load-store architecture]

Outline
• Signal processing applications
• Conventional DSP architecture
• Pipelining in DSP processors
• RISC vs. DSP processor architectures
• TI TMS320C6000 DSP architecture introduction
• Signal processing on general-purpose processors
• Conclusion

Signal Processing Applications
• Embedded system demand: volume, …
  - 400 million units/year: automobiles, PCs, cell phones
  - 30 million units/year: ADSL modems and printers
• Embedded system cost and input/output rates
  - Low-cost, medium-throughput (single DSP): low-end printers, handsets, sound cards, car audio, disk drives
  - High-cost, high-throughput (multiple DSPs): high-end printers, wireless basestations, 3-D sonar, 3-D images from 2-D X-rays (tomographic reconstruction)
• Embedded processor requirements
  - Inexpensive with small area and volume
  - Predictable input/output (I/O) rates to/from processor
  - Power constraints (severe for handheld devices)

Conventional DSP Processors
• Low cost: $3/processor in volume
• Deterministic interrupt service routine latency guarantees predictable input/output rates
  - On-chip direct memory access (DMA) controllers
      - Processes streaming input/output separately from CPU
      - Sends interrupt to CPU when block has been read/written
  - Ping-pong buffering
      - CPU reads/writes buffer 1 as DMA reads/writes buffer 2
      - After DMA finishes buffer 2, roles of buffers 1 & 2 switch
• Low power consumption: 10-100 mW
  - TI TMS320C54: 0.32 mA/MIP, 76.8 mW at 1.5 V, 160 MHz
  - TI TMS320C55: 0.05 mA/MIP, 22.5 mW at 1.5 V, 300 MHz
• Based on conventional (pre-1996) architecture
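
The ping-pong buffering scheme on this slide can be sketched in software. The following is a minimal simulation, not DSP driver code: the function name `ping_pong_process`, its parameters, and the sequential stand-in for the DMA controller and its interrupt are all assumptions made for illustration.

```python
def ping_pong_process(samples, block_size, process_block):
    """Simulate ping-pong buffering with two equally sized buffers.

    While the (simulated) DMA fills one buffer, the CPU processes the
    other; the roles swap each time a block completes.
    Trailing samples that do not fill a whole block are ignored.
    """
    buffers = [[0] * block_size, [0] * block_size]
    dma_index = 0              # buffer currently owned by the DMA
    have_full_buffer = False   # True once the CPU has a block to work on
    results = []

    for start in range(0, len(samples) - block_size + 1, block_size):
        # CPU side: work on the buffer the DMA filled during the previous block.
        if have_full_buffer:
            results.append(process_block(buffers[1 - dma_index]))
        # DMA side: stream the next block into the DMA-owned buffer.
        buffers[dma_index][:] = samples[start:start + block_size]
        # "Interrupt": the block is complete, so the two buffers swap roles.
        dma_index = 1 - dma_index
        have_full_buffer = True

    # Process the last block that the DMA delivered.
    if have_full_buffer:
        results.append(process_block(buffers[1 - dma_index]))
    return results
```

For example, `ping_pong_process([1, 2, 3, 4, 5, 6, 7, 8], 4, sum)` processes two blocks of four samples. On real hardware the two sides run concurrently, which is why the CPU never has to wait on I/O as long as it finishes a block before the DMA does.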

Conventional DSP Architecture
• Multiply-accumulate (MAC) in 1 instruction cycle
• Harvard architecture for fast on-chip I/O
  - Data memory/bus separate from program memory/bus
  - One read from program memory per instruction cycle
  - Two reads/writes from/to data memory per instruction cycle
• Instructions to keep pipeline (3-6 stages) full
  - Zero-overhead looping (one pipeline flush to set up)
  - Delayed branches
• Special addressing modes supported in hardware
  - Bit-reversed addressing (e.g. fast Fourier transforms)
  - Modulo addressing for circular buffers (e.g. FIR filters)
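
To make bit-reversed addressing concrete, here is a small software sketch of the index sequence that such an addressing mode generates for a power-of-two FFT length. The function names are illustrative assumptions; a DSP produces these addresses in hardware at no extra cycle cost, whereas this code computes them explicitly.

```python
def bit_reverse(index, num_bits):
    """Return index with its lowest num_bits bits in reversed order."""
    reversed_index = 0
    for _ in range(num_bits):
        # Shift the result left and append the current lowest bit of index.
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def bit_reversed_order(length):
    """List the bit-reversed access order for a power-of-two length."""
    num_bits = length.bit_length() - 1
    if length < 1 or (1 << num_bits) != length:
        raise ValueError("length must be a power of two")
    return [bit_reverse(i, num_bits) for i in range(length)]
```

For an 8-point transform, `bit_reversed_order(8)` gives the familiar order 0, 4, 2, 6, 1, 5, 3, 7 in which a radix-2 FFT reads or writes its data.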

Conventional DSP Architecture (cont'd)
• Linear buffer
  - Sort by time index
  - Update: discard oldest data, copy old data left, insert new data
• Circular buffer
  - Oldest data index
  - Update: insert new data at oldest index, update oldest index
• Used in finite and infinite impulse response filters

Data Shifting Using a Linear Buffer of length K

  Time    Buffer contents                                Next sample
  n=N     x[N-K+1]  x[N-K+2]  ...  x[N-1]  x[N]          x[N+1]
  n=N+1   x[N-K+2]  x[N-K+3]  ...  x[N]    x[N+1]        x[N+2]
  n=N+2   x[N-K+3]  x[N-K+4]  ...  x[N+1]  x[N+2]        x[N+3]

Modulo Addressing Using a Circular Buffer

  Time    Buffer contents                                              Next sample
  n=N     ...  x[N-2]  x[N-1]  x[N]  x[N-K+1]  x[N-K+2]  x[N-K+3] ...  x[N+1]
  n=N+1   ...  x[N-2]  x[N-1]  x[N]  x[N+1]    x[N-K+2]  x[N-K+3] ...  x[N+2]
  n=N+2   ...  x[N-2]  x[N-1]  x[N]  x[N+1]    x[N+2]    x[N-K+3] ...  x[N+3]
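
The circular-buffer update above can be sketched in a few lines. This is an illustrative software model under stated assumptions: the class `CircularBuffer` and the function `fir_filter` are hypothetical names, and the `%` operator stands in for the modulo addressing that a DSP performs in hardware.

```python
class CircularBuffer:
    """Fixed-length buffer that overwrites its oldest entry on insert."""

    def __init__(self, length):
        self.data = [0.0] * length
        self.oldest = 0  # index of the oldest sample

    def insert(self, sample):
        # Insert new data at the oldest index, then update the oldest index.
        self.data[self.oldest] = sample
        self.oldest = (self.oldest + 1) % len(self.data)

    def newest_first(self):
        # Walk backwards from the newest sample using modulo addressing.
        k = len(self.data)
        return [self.data[(self.oldest - 1 - i) % k] for i in range(k)]


def fir_filter(coefficients, samples):
    """Compute y[n] = sum over i of h[i] * x[n - i], with zero initial state."""
    buffer = CircularBuffer(len(coefficients))
    outputs = []
    for sample in samples:
        buffer.insert(sample)
        accumulator = 0.0
        for h, x in zip(coefficients, buffer.newest_first()):
            accumulator += h * x  # one multiply-accumulate per tap
        outputs.append(accumulator)
    return outputs
```

Unlike the linear buffer, no stored sample is ever copied: each update touches one buffer entry and one index, regardless of the buffer length K.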

Conventional DSP Processors Summary

Conventional DSP Processor Families
• Floating-point DSPs
  - Used in initial prototyping of algorithms
  - Resurgence due to professional and car audio
• DSP market: fixed-point 95%, floating-point 5%
• Different on-chip configurations in each family
  - Size and map of data and program memory
  - A/D, input/output buffers, interfaces, timers, and D/A
• Drawbacks to conventional DSP processors
  - No byte addressing (needed for images and video)
  - Limited on-chip memory
  - Limited addressable memory on fixed-point DSPs (exceptions include Motorola 56300 and TI C5409)
  - Non-standard C extensions for fixed-point data type

Pipelining
• Process instruction stream in stages (like oil flows in an oil pipeline)
• Increase throughput

Managing Pipelines
• Compiler or programmer
• Pipeline interlocking

[Figure: Fetch/Decode/Read/Execute timing diagrams for four organizations: sequential (Motorola 56000), pipelined (most conventional DSPs), superscalar (Pentium, MIPS), and superpipelined (TMS320C6000)]

Pipelining: Operation
• Time-stationary pipeline model
  - Programmer controls each cycle
  - Example: Motorola DSP56001 (has separate X/Y data memories/registers)
        MAC X0,Y0,A   X:(R0)+,X0   Y:(R4)-,Y0
• Data-stationary pipeline model
  - Programmer specifies data operations
  - Example: TI TMS320C30
        MPYF *++AR0(1),*++AR1(IR0),R0
• Interlocked pipeline
  - "Protection" from pipeline effects
  - May not be reported by simulators: inner loops may take extra cycles

MAC means multiplication-accumulation.

Pipeline contents over successive cycles (instructions A through L, one column per cycle):

  Fetch     D  E  F  G  H  I  J  K  L
  Decode    C  D  E  F  G  H  I  J  K  L
  Read      B  C  D  E  F  G  H  I  J  K  L
  Execute   A  B  C  D  E  F  G  H  I  J  K  L
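
The occupancy diagram on this slide follows a simple rule: an instruction issued in cycle c sits in stage s during cycle c + s. The sketch below generates such a table for an ideal four-stage pipeline with no stalls; the function name `pipeline_occupancy` and the use of "-" for an empty stage are assumptions for illustration.

```python
STAGES = ["Fetch", "Decode", "Read", "Execute"]


def pipeline_occupancy(instructions, num_cycles):
    """Return, per stage, which instruction occupies it in each cycle.

    Assumes an ideal pipeline: one instruction enters Fetch per cycle
    and every instruction advances one stage per cycle (no stalls).
    """
    table = {stage: [] for stage in STAGES}
    for cycle in range(num_cycles):
        for depth, stage in enumerate(STAGES):
            # The instruction in this stage entered Fetch `depth` cycles ago.
            index = cycle - depth
            if 0 <= index < len(instructions):
                table[stage].append(instructions[index])
            else:
                table[stage].append("-")
    return table
```

Under these assumptions, n instructions finish in n + 3 cycles on the four-stage pipeline, so the twelve instructions A through L would take 15 cycles rather than the 48 needed if each instruction passed through all four stages before the next one started.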

Pipelining: Hazards
• A control hazard occurs when a branch instruction is decoded
  - Processor "flushes" the pipeline, or
  - Use delayed branch (expose pipeline)
• A data hazard occurs because an operand cannot be read yet
  - Intended by programmer, or
  - Interlock hardware inserts "bubble"
  - TI TMS320C5000 (20 CPU & 16 I/O registers, one accumulator, and one address pointer ARP implied by *)
        LAR  AR2, ADDR   ; load address reg.
        LACC *           ; load accumulator w/ contents of AR2
    LAR: 2 cycles to update AR2 & ARP; need NOP after it

[Figure: Fetch/Decode/Read/Execute occupancy around a branch (br): the stages behind the branch empty out as the pipeline is flushed, then instructions X, Y, Z enter from the branch target]

Pipelining: Avoiding Control Hazards
High throughput performance of DSPs is helped by on-chip dedicated logic for looping (downcounters/looping registers)

        ; repeat TBLR inst. COUNT-1 times
        RPT  COUNT
        TBLR *+

• A repeat instruction repeats one instruction or a block of instructions after repeat
• The pipeline is filled with repeated instruction (or block of instructions)
• Cost: one pipeline flush only

[Figure: Fetch/Decode/Read/Execute occupancy showing the pipeline filling with the repeated instruction X after rpt]

RISC vs. DSP: Instruction Encoding
• RISC: Superscalar, out-of-order execution
  [Figure blocks: Reorder, Load/store, Memory, Floating-Point Unit, Integer Unit]
• DSP: Horizontal microcode, in-order execution
  [Figure blocks: Load/store, Memory, ALU, Multiplier, Address]

RISC vs. DSP: Memory Hierarchy
• RISC
  [Figure blocks: Registers, Out of order, I/D Cache, TLB, Physical memory]
  TLB: Translation Lookaside Buffer
• DSP
  [Figure blocks: Registers, I Cache, Internal memories, External memories, DMA Controller]
  DMA: Direct Memory Access

TI TMS320C6000 DSP Architecture
Simplified Architecture

[Figure: block diagram. Program RAM or cache and data RAM connect over internal buses (Addr, Data) to the CPU. The CPU contains functional units .D1, .M1, .L1, .S1 with registers A0-A15, functional units .D2, .M2, .L2, .S2 with registers B0-B15, and control registers. Other blocks: external memory (sync, async), DMA, serial port, host port, boot load, timers, power down.]

TI TMS320C6000 DSP Architecture
• Families: All support same C6000 instruction set

    C6200   fixed-pt.   150-300 MHz     ADSL, printers
    C6400   fixed-pt.   300-1,000 MHz   video, wireless basestations
    C6700   floating    100-225 MHz     medical imaging, pro-audio

• TMS320C6211: 150 MHz, $21 in volume
  - 300 million multiply-accumulates/s, 1200 RISC MIPS
  - On-chip memory: 16 kwords program, 16 kwords data
• TMS320C6701 Evaluation Module Board: 167 MHz
  - 334 million multiply-accumulates/s, 1336 RISC MIPS
  - On-chip memory: 16 kwords program, 16 kwords data
  - External: one 133-MHz 64-kword, two 100-MHz 1-Mword

TI TMS320C6000 DSP Architecture
• Very long instruction word (VLIW) size of 256 bits
  - Eight 32-bit functional units with single cycle throughput
  - One instruction cycle per clock cycle
• Data word size is 32 bits
  - 16 (32 on C64) 32-bit registers in each of two data paths
  - 40 bits can be stored in adjacent even/odd registers
• Two parallel data paths, each with four functional units
  - Data unit: 32-bit address calculations (modulo, linear)
  - Multiplier unit: 16 bit × 16 bit with 32-bit result
  - Logical unit: 40-bit (saturation) arithmetic & compares
  - Shifter unit: 32-bit integer ALU and 40-bit shifter
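
The peak rates quoted for the C6211 and the C6701 board appear to follow directly from the clock rate and the unit counts given here. The sketch below shows that arithmetic under two assumptions that are mine rather than the slides': each of the two multiplier units contributes one multiply-accumulate per cycle, and each of the eight functional units counts as one RISC instruction per cycle. The function name `peak_rates` is hypothetical.

```python
def peak_rates(clock_mhz, functional_units=8, multipliers=2):
    """Return (million MACs per second, peak MIPS) for a given clock.

    Assumes one instruction per functional unit per clock cycle and
    one multiply-accumulate per multiplier unit per clock cycle.
    """
    million_macs_per_second = clock_mhz * multipliers
    peak_mips = clock_mhz * functional_units
    return million_macs_per_second, peak_mips
```

With these assumptions, a 150 MHz clock gives 300 million MACs/s and 1200 MIPS, and a 167 MHz clock gives 334 million MACs/s and 1336 MIPS, matching the quoted figures. These are peak numbers; sustained rates depend on how well the compiler or programmer fills all eight units every cycle.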

TI TMS320C6000 Instruction Set by Functional Unit

  .S Unit:  ADD ADDK ADD2 AND B CLR EXT MV MVC MVK MVKH NEG NOT OR SET SHL SHR SSHL SUB SUB2 XOR ZERO
  .L Unit:  ABS ADD AND CMPEQ CMPGT CMPLT LMBD MV NEG NORM NOT OR SADD SAT SSUB SUB SUBC XOR ZERO
  .D Unit:  ADD ADDA LD MV NEG ST SUB SUBA ZERO
  .M Unit:  MPY MPYH SMPY SMPYH
  Other:    NOP IDLE

Six of the eight functional units can perform integer add, subtract, and move operations.

TI TMS320C6000 Instruction Set
C6000 Instruction Set by Category

  Arithmetic:       ABS ADDA ADDK ADD2 MPYH NEG SMPYH SADD SAT SSUB SUBA SUBC SUB2 ZERO
  Logical:          AND CMPEQ CMPGT CMPLT NOT OR SHL SHR SSHL XOR
  Bit Management:   CLR EXT LMBD NORM SET
  Data Management:  LD MV MVC MVKH ST
  Program Control:  B IDLE NOP

[Figure annotations: (un)signed multiplication; saturation/packed arithmetic]

C6000 vs. C5000 Addressing Modes

  Mode        TI C5000     TI C6000
  Immediate   ADD #0FFh    add .L1 -13,A1,A6
  Register    (implied)    add .L1 A7,A6,A7
  Direct      ADD 010h     not supported
  Indirect    ADD *        ldw .L1 *A5++[8],A1

• Immediate: the operand is part of the instruction
• Register: operand is specified in a register
• Direct: address of operand is part of the instruction (added to imply memory page)
• Indirect: address of operand is stored in a register

TI TMS320C6000 DSP Architecture
• Deep pipeline
  - 7-11 stages in C6200: fetch 4, decode 2, execute 1-5
  - 7-16 stages in C6700: fetch 4, decode 2, execute 1-10
  - Pentium IV has an estimated 20 pipeline stages
• Avoid using branch instructions in code
  - Branch instruction in pipeline disables interrupts: latency of a branch is 5 cycles
  - Avoid branches by using conditional execution: every instruction can be conditionally executed
• No hardware protection against pipeline hazards
  - Compiler and assembler must prevent pipeline hazards

TI TMS320C6700 Extensions
C6700 Floating Point Extensions by Unit

  .S Unit:  ABSDP ABSSP CMPEQDP CMPEQSP CMPGTDP CMPGTSP CMPLTDP CMPLTSP RCPDP RCPSP RSQRDP RSQRSP SPDP
  .L Unit:  ADDDP ADDSP DPINT DPSP DPTRUNC INTDP INTSP SPINT SPTRUNC SUBDP SUBSP
  .D Unit:  ADDAD LDDW
  .M Unit:  MPYDP MPYI MPYID MPYSP

Four functional units can perform IEEE single-precision (SP) and double-precision (DP) floating-point add, subtract, and move. Operations beginning with R are reciprocal calculations.

C6712 vs. C6713

                                 C6712          C6713
  Clock                          150 MHz        225 MHz
  Peak floating-point rate       900 MFLOPS     1350 MFLOPS
  L1 program/data memory         4 kB/4 kB      4 kB/4 kB
  L2 cache                       64 kB          256 kB
  L2 SRAM                        0 kB           192 kB
  On-chip data bus bandwidth     1200 MB/s      1800 MB/s
  Price each in volume           $13.50         $26.85

Information as of December 14, 2003

Digital Signal Processor Cores
• ASIC with
  - Programmable digital signal processor core
  - RAM
  - ROM
  - Standard cells
  - Codec
  - Peripherals
  - Gate array
  - Microcontroller core

General Purpose Processors
• Multimedia applications on PCs
  - Video, audio, graphics and animation
  - Repetitive parallel sequences of instructions
• Single Instruction Multiple Data (SIMD)
  - One instruction acts on multiple data in parallel
  - Well-suited for graphics
• Native signal processing extensions use SIMD
  - Sun Visual Instruction Set [1995] (UltraSPARC 1/2)
  - Intel MMX [1996] (Pentium I/II/IV)
  - Intel Streaming SIMD Extensions (Pentium III)

DSP on General Purpose Processors (cont'd)
• Programming is considerably tougher
  - Ability of compilers to generate code for instruction set extensions may lag (e.g. four years for Pentium MMX)
  - Libraries of routines using native signal processing
  - Hand code in assembly for best performance
• Single-instruction multiple-data (SIMD) approach
  - Pack/unpack data not aligned on SIMD word boundaries
  - Saturation arithmetic in MMX; not supported in VIS
  - Extended-precision accumulation in MMX; none in VIS
• Application speedup for Intel MMX and Sun VIS
  - Signal and image processing: 1.5:1 to 2:1
  - Graphics: 4:1 to 6:1 (no packing/unpacking)

Intel MMX Instruction Set
• 64-bit SIMD register (4 data types)
  - 64-bit quad word
  - Packed byte (8 bytes packed into 64 bits)
  - Packed word (4 16-bit words packed into 64 bits)
  - Packed double word (2 double words packed into 64 bits)
• 57 new instructions
  - Pack and unpack
  - Add, subtract, multiply, and multiply/accumulate
• Saturation and wraparound arithmetic
• Maximum parallelism possible
  - 8:1 for 8-bit additions
  - 4:1 for 8 × 16 multiplication or 16-bit additions
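
The difference between saturation and wraparound arithmetic on packed bytes can be illustrated in software. The sketch below models a 64-bit register as a Python integer holding eight unsigned byte lanes; the function name `packed_add_bytes` is an assumption for illustration, and the loop only emulates what a SIMD instruction does to all eight lanes in one step (in MMX, roughly the behavior of PADDB for wraparound and PADDUSB for unsigned saturation).

```python
def packed_add_bytes(a, b, saturate):
    """Add eight unsigned 8-bit lanes packed into two 64-bit integers.

    With saturate=True each lane clamps at 255; otherwise each lane
    wraps around modulo 256. Carries never cross lane boundaries.
    """
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        total = x + y
        if saturate:
            total = min(total, 0xFF)   # clamp to the largest byte value
        else:
            total &= 0xFF              # discard the carry out of the lane
        result |= total << shift
    return result
```

Saturation matters for signal and image data: adding brightness to a near-white pixel should leave it white, whereas wraparound would turn it nearly black.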

Concluding Remarks
• Conventional digital signal processors
  - High performance vs. power consumption/cost/volume
  - Excel at one-dimensional processing
  - Per cycle: one 16 × 16 MAC and four 16-bit RISC instructions
• TMS320C6000 VLIW DSP family
  - High performance vs. cost/volume
  - Excel at multidimensional signal processing
  - Per cycle: two 16 × 16 MACs and four 32-bit RISC instructions
• Native signal processing
  - Available on desktop computers
  - Excels at graphics
  - Per cycle: two 8 × 16 MACs or eight 8-bit RISC instructions
• Use assembly for computational kernels and C for main program (control code, interrupt def.)

Concluding Remarks
• Digital signal processor market
  - 40% annual growth 1990-2000: #1 in semiconductor market
  - Revenue: $4.4B '99, $6.1B '00, $4.5B '01, $4.9B '02, $6B '03
  - 2000: 44% TI, 23% Agere, 13% Motorola, 10% Analog Dev.
  - 2001: 40% TI, 16% Agere, 12% Motorola, 8% Analog Dev.
  - 2002: 43% TI, 14% Motorola, 14% Agere, 9% Analog Dev.
• Independent processor benchmarking by industry
  - Berkeley Design Technology Inc. http://www.bdti.com
  - EDN Embedded Microprocessor Benchmark Consortium http://www.eembc.org
• Web resources
  - Newsgroup comp.dsp: FAQ http://www.bdti.com/faq/dsp_faq.html
  - Embedded processors and systems: http://www.eg3.com
  - On-line courses: http://www.techonline.com
