Lecture 9 Digital Signal Processors Applications and Architectures

Lecture 9: Digital Signal Processors: Applications and Architectures Prepared by: Professor Kurt Keutzer Computer Science 252, Spring 2000 With contributions from: Dr. Jeff Bier, BDTI; Dr. Brock Barton, TI; Prof. Bob Brodersen, Prof. David Patterson Kurt Keutzer 1

Processor Applications
• General purpose, high performance
  - Pentiums, Alphas, SPARC
  - Used for general-purpose software
  - Heavyweight OS: UNIX, NT
  - Workstations, PCs
• Embedded processors and processor cores
  - ARM, 486SX, Hitachi SH7000, NEC V800
  - Single program
  - Lightweight, often real-time OS
  - DSP support
  - Cellular phones, consumer electronics (e.g., CD players)
• Microcontrollers
  - Extremely cost sensitive
  - Small word size: 8-bit common
  - Highest-volume processors by far
  - Automobiles, toasters, thermostats, ...
[Figure axes: increasing cost, increasing volume]

Processor Markets ($30 B total)
[Pie chart; labels and values in extraction order: 32-bit micro, $1.2 B/4%, $5.2 B/17%, 32-bit DSP, $10 B/33%, 16-bit micro, $5.7 B/19%, 8-bit micro, $9.3 B/31%]

The Processor Design Space
[Figure: performance versus cost; regions labeled below]
• Application-specific architectures for performance
• Embedded processors
• Microprocessors: performance is everything, and software rules
• Microcontrollers: cost is everything

Market for DSP Products
[Chart; series labels: Mixed-Signal/Analog, DSP]
DSP is the fastest growing segment of the semiconductor market.

DSP Applications
• Audio applications
  - MPEG audio
  - Portable audio
• Digital cameras
• Wireless
  - Cellular telephones
  - Base station
• Networking
  - Cable modems
  - ADSL
  - VDSL

Another Look at DSP Applications
• High-end
  - Wireless base station: TMS320C6000
  - Cable modem
  - Gateways
• Mid-end
  - Cellular phone: TMS320C540
  - Fax/voice server
• Low end
  - Storage products: TMS320C27
  - Digital camera: TMS320C5000
  - Portable phones
  - Wireless headsets
  - Consumer audio
  - Automobiles, toasters, thermostats, ...
[Figure axes: increasing cost, increasing volume]

Serving a range of applications

World’s Cellular Subscribers
[Chart: subscribers in millions by year; series: Digital, Analog]
Will provide a ubiquitous infrastructure for wireless data as well as voice.
Source: Ericsson Radio Systems, Inc.

CELLULAR TELEPHONE SYSTEM
[Block diagram; blocks: keypad and display (415-555-1212), CONTROLLER, PHYSICAL LAYER PROCESSING, SPEECH ENCODE, SPEECH DECODE, A/D, DAC, BASEBAND CONVERTER, RF MODEM]

HW/SW/IC PARTITIONING
[Block diagram: the cellular telephone blocks (keypad and display, CONTROLLER, PHYSICAL LAYER PROCESSING, SPEECH ENCODE, SPEECH DECODE, A/D, DAC, BASEBAND CONVERTER, RF MODEM) partitioned across MICROCONTROLLER, ASIC, DSP, and ANALOG IC]

Mapping onto a system on a chip
[Block diagram; labels: µC, RAM, DMA, S/P, phone book, keypad intfc, control, protocol, ASIC LOGIC, DSP CORE, speech quality enhancement, voice recognition, de-intl & decoder, RPE-LTP speech decoder, demodulator and synchronizer, Viterbi equalizer]

Example Wireless Phone Organization
[Figure; labels: C540, ARM7]

Multimedia I/O Architecture
[Block diagram; labels: Radio Modem, Embedded Processor, Sched, ECC, Pact Interface, Low Power Bus, FB, Data Flow, Fifo, Video Decomp, Pen, SRAM, Graphics, Audio, Video]

Multimedia System on a Chip
E.g., multimedia terminal electronics
[Block diagram; labels: Graphics Out, Uplink Radio, Downlink Radio, Video I/O, Voice I/O, Pen In, µP, Video Unit, Memory, Coms, custom DSP]
Future chips will be a mix of processors, memory, and dedicated hardware for specific algorithms and I/O.

Requirements of the Embedded Processors
• Optimized for a single program; code is often in on-chip ROM or off-chip EPROM
• Minimum code size (one of the initial motivations for Java)
• Performance obtained by optimizing the datapath
• Low cost
  - Lowest possible area
  - Technology behind the leading edge
  - High level of integration of peripherals (reduces system cost)
• Fast time to market
  - Compatible architectures (e.g., ARM) allow reusable code
  - Customizable core
• Low power if the application requires portability

Area of processor cores = Cost
[Figure; labels: Nintendo processor, Cellular phones]

Another figure of merit
Computation per unit area ???
[Figure; labels: Nintendo processor, Cellular phones]

Code size
• If a majority of the chip is the program stored in ROM, then code size is a critical issue
• The Piranha has three instruction sizes: a basic 2-byte instruction, and 2 bytes plus a 16- or 32-bit immediate

BENCHMARKS - DSPstone
ZIVOJNOVIC, VELARDE, SCHLAGER: UNIVERSITY OF AACHEN
APPLICATION BENCHMARKS
• ADPCM TRANSCODER - CCITT G.721
• REAL_UPDATE
• COMPLEX_UPDATES
• DOT_PRODUCT
• MATRIX_1X3
• CONVOLUTION
• FIR2DIM
• HR_ONE_BIQUAD
• LMS
• FFT_INPUT_SCALED

Evolution of GP and DSP
• The general-purpose microprocessor traces its roots back to Eckert, Mauchly, Von Neumann (ENIAC)
• DSP evolved from analog signal processors (ASP), which use analog hardware to transform physical signals (classical electrical engineering)
• ASP gave way to DSP because
  - DSP is insensitive to environment (e.g., same response in snow or desert, if it works at all)
  - DSP performance is identical even with variations in components; the behavior of two analog systems varies even if they are built with the same components at 1% variation
• Different history and different applications led to different terms, different metrics, some new inventions
• Convergence of markets will lead to an architectural showdown

Embedded Systems vs. General Purpose Computing - 1
Embedded system:
• Runs a few applications, often known at design time
• Not end-user programmable
• Operates in fixed run-time constraints; additional performance may not be useful/valuable
General purpose computing:
• Intended to run a fully general set of applications
• End-user programmable
• Faster is always better

Embedded Systems vs. General Purpose Computing - 2
Embedded system, differentiating features:
• Power
• Cost
• Speed (must be predictable)
General purpose computing, differentiating features:
• Speed (need not be fully predictable)
• Speed
• Did we mention speed?
• Cost (largest component: power)

DSP vs. General Purpose MPU
• DSPs tend to be programmed for one application, not many.
  - Hence OSes are much simpler; there is no virtual memory or protection, ...
• DSPs sometimes run hard real-time apps
  - You must account for anything that could happen in a time slot
  - All possible interrupts or exceptions must be accounted for and their collective time subtracted from the time interval
  - Therefore, exceptions are BAD!
• DSPs have an infinite continuous data stream

DSP vs. General Purpose MPU
• The “MIPS/MFLOPS” of DSPs is the speed of multiply-accumulate (MAC).
  - DSPs are judged by whether they can keep the multipliers busy 100% of the time.
• The "SPEC" of DSPs is 4 algorithms:
  - Infinite impulse response (IIR) filters
  - Finite impulse response (FIR) filters
  - FFT, and
  - convolvers
• In DSPs, algorithms are king!
  - Binary compatibility is not an issue
• Software is not (yet) king in DSPs.
  - People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.

TYPES OF DSP PROCESSORS
• DSP multiprocessors on a die
  - TMS320C80
  - TMS320C6000
• 32-BIT FLOATING POINT
  - TI TMS320C4X
  - MOTOROLA 96000
  - AT&T DSP32C
  - ANALOG DEVICES ADSP21000
• 16-BIT FIXED POINT
  - TI TMS320C2X
  - MOTOROLA 56000
  - AT&T DSP16
  - ANALOG DEVICES ADSP2100

Note of Caution on DSP Architectures
Successful DSP architectures have two aspects:
• Key architectural and micro-architectural features that enabled product success in key parameters
  - Speed
  - Code density
  - Low power
• Architectural and micro-architectural features that are artifacts of the era in which they were designed
We will focus on the former!

Architectural Features of DSPs
• Data path configured for DSP
  - Fixed-point arithmetic
  - MAC: multiply-accumulate
• Multiple memory banks and buses
  - Harvard architecture
  - Multiple data memories
• Specialized addressing modes
  - Bit-reversed addressing
  - Circular buffers
• Specialized instruction set and execution control
  - Zero-overhead loops
  - Support for MAC
• Specialized peripherals for DSP
THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!

DSP Data Path: Arithmetic
• DSPs dealing with numbers representing the real world => want “reals”/fractions
• DSPs dealing with numbers for addresses => want integers
• Support “fixed point” as well as integers
  - Fraction (radix point just after the sign bit S): $-1 \le x < 1$
  - Integer (radix point at the right end of the word): $-2^{N-1} \le x < 2^{N-1}$
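As an illustration of the fractional format above, here is a minimal sketch assuming a 16-bit word with 15 fraction bits (commonly called Q15). The helper names `to_q15`, `from_q15`, and `q15_mul` are hypothetical and not taken from the slides.

```python
Q15_FRAC_BITS = 15
Q15_MIN = -(1 << 15)       # represents -1.0
Q15_MAX = (1 << 15) - 1    # largest fraction, just below +1.0


def to_q15(x: float) -> int:
    """Convert a real value to a 16-bit fraction, clamping to [-1, 1)."""
    n = int(round(x * (1 << Q15_FRAC_BITS)))
    return max(Q15_MIN, min(Q15_MAX, n))


def from_q15(n: int) -> float:
    """Interpret a 16-bit integer as a fraction with 15 fraction bits."""
    return n / (1 << Q15_FRAC_BITS)


def q15_mul(a: int, b: int) -> int:
    """Multiply two fractions: the 32-bit product has 30 fraction bits,
    so shift right by 15 to return to the 16-bit format. The single
    overflow case (-1.0 * -1.0) is clamped to the largest fraction."""
    product = (a * b) >> Q15_FRAC_BITS
    return max(Q15_MIN, min(Q15_MAX, product))


half = to_q15(0.5)
print(half, from_q15(q15_mul(half, half)))
```

The same bit pattern can be read as an integer (for addresses) or as a fraction (for signal values); only the position of the radix point differs.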

DSP Data Path: Precision
• Word size affects the precision of fixed-point numbers
• DSPs have 16-bit, 20-bit, or 24-bit data words
• Floating-point DSPs cost 2X - 4X vs. fixed point, and are slower than fixed point
• DSP programmers will scale values inside code
  - SW libraries
  - Separate explicit exponent
• “Blocked floating point”: a single exponent for a group of fractions
• Floating-point support simplifies development
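The blocked floating point idea can be sketched as follows: one shared exponent is chosen from the largest magnitude in a block, and every value is stored as a 16-bit fraction relative to it. The function names and the 15-fraction-bit default are assumptions made for illustration.

```python
import math


def to_block_fp(values, frac_bits=15):
    """Encode a block of reals as integer fractions plus one shared exponent."""
    peak = max(abs(v) for v in values)
    # frexp returns (m, e) with peak == m * 2**e and 0.5 <= m < 1 (or m == 0)
    _, exponent = math.frexp(peak)
    scale = 1 << frac_bits
    mantissas = []
    for v in values:
        m = int(round(math.ldexp(v, -exponent) * scale))
        mantissas.append(max(-scale, min(scale - 1, m)))
    return mantissas, exponent


def from_block_fp(mantissas, exponent, frac_bits=15):
    """Decode the block back to reals."""
    return [math.ldexp(m / (1 << frac_bits), exponent) for m in mantissas]


mantissas, exponent = to_block_fp([0.5, 3.0, -1.5])
print(mantissas, exponent, from_block_fp(mantissas, exponent))
```

Small values in a block lose precision when one value is much larger, which is the price paid for storing only one exponent.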

DSP Data Path: Overflow?
• DSPs are descended from analog: what should happen to the output when you “peg” an input? (e.g., turn up the volume control knob on a stereo)
  - Modulo arithmetic???
• Set to the most positive ($2^{N-1}-1$) or most negative ($-2^{N-1}$) value: “saturation”
• Many algorithms were developed in this model
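To make the contrast concrete, the sketch below compares modulo (wraparound) and saturating behavior for N = 16. The helper names are hypothetical.

```python
INT16_MIN = -(1 << 15)
INT16_MAX = (1 << 15) - 1


def wrap16(x: int) -> int:
    """Modulo arithmetic: keep the low 16 bits, reinterpret as signed."""
    return ((x + 0x8000) & 0xFFFF) - 0x8000


def sat16(x: int) -> int:
    """Saturation: clamp to the most positive or most negative value."""
    return max(INT16_MIN, min(INT16_MAX, x))


loud = 30000 + 10000          # sum exceeds the 16-bit range
print(wrap16(loud))           # wraps to a large negative value
print(sat16(loud))            # pegs at the most positive value
```

With wraparound, a slightly too-loud signal flips sign; with saturation it clips, which is closer to what an analog circuit does.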

DSP Data Path: Multiplier
• Specialized hardware performs all key arithmetic operations in 1 cycle
• ≥ 50% of instructions can involve the multiplier => single-cycle latency multiplier
• Need to perform multiply-accumulate (MAC)
• n-bit multiplier => 2n-bit product

DSP Data Path: Accumulator
• Don’t want overflow or to have to scale the accumulator
• Option 1: accumulator wider than product: “guard bits”
  - Motorola DSP: 24b x 24b => 48b product, 56b accumulator
• Option 2: shift right and round the product before the adder
[Figure: datapath diagrams with blocks Multiplier, Shift, ALU, Accumulator, and guard bits (G)]
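A short calculation shows what the guard bits buy in the 48-bit product, 56-bit accumulator case quoted above. This is a sketch of the arithmetic only; `fits_signed` is a hypothetical helper.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a two's complement number of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


PRODUCT_BITS = 48
ACC_BITS = 56
guard_bits = ACC_BITS - PRODUCT_BITS            # 8 guard bits
max_product = (1 << (PRODUCT_BITS - 1)) - 1     # largest 48-bit signed value
safe_terms = 1 << guard_bits                    # 256 accumulations

# 256 worst-case products still fit in the accumulator, 257 do not.
print(fits_signed(safe_terms * max_product, ACC_BITS))
print(fits_signed((safe_terms + 1) * max_product, ACC_BITS))
```

In other words, each guard bit doubles the number of full-scale products that can be summed before the accumulator can overflow.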

DSP Data Path: Rounding
• Even with guard bits, will need to round when storing the accumulator into memory
• 3 standard DSP options:
  - Truncation: chop results => biases results down (toward more negative values in two's complement)
  - Round to nearest: < 1/2 round down, ≥ 1/2 round up (more positive) => smaller bias
  - Convergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make the lsb a zero (+1 if 1, +0 if 0) => no bias
• IEEE 754 calls convergent rounding “round to nearest even”
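The three options can be sketched on integers by discarding the low `shift` bits of a two's complement value. The function names are illustrative assumptions; note that Python's `>>` on a negative integer floors, which matches chopping bits off a two's complement word.

```python
def truncate(x: int, shift: int) -> int:
    """Chop the low bits (floor in two's complement)."""
    return x >> shift


def round_nearest(x: int, shift: int) -> int:
    """Add half an lsb, then chop: exact ties go to the more positive value."""
    return (x + (1 << (shift - 1))) >> shift


def round_convergent(x: int, shift: int) -> int:
    """Like round_nearest, but exact ties go to the result with lsb = 0."""
    half = 1 << (shift - 1)
    quotient = x >> shift
    remainder = x & ((1 << shift) - 1)
    if remainder > half or (remainder == half and (quotient & 1)):
        quotient += 1
    return quotient


# With shift=2 the inputs are in units of 1/4: 6 -> 1.5, 10 -> 2.5, -6 -> -1.5
for value in (6, 10, -6):
    print(value / 4, truncate(value, 2), round_nearest(value, 2), round_convergent(value, 2))
```

Only the exact-tie cases differ between the last two modes, but in a long accumulation those ties are what introduce a systematic bias.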

Data Path
DSP Processor
• Specialized hardware performs all key arithmetic operations in 1 cycle
• Hardware support for managing numeric fidelity:
  - Shifters
  - Guard bits
  - Saturation
General-Purpose Processor
• Multiplies often take >1 cycle
• Shifts often take >1 cycle
• Other operations (e.g., saturation, rounding) typically take multiple cycles

320C54x DSP Functional Block Diagram

FIR Filtering: A Motivating Problem
• M most recent samples in the delay line (Xi)
• New sample moves data down the delay line
• A “tap” is a multiply-add
• Each tap (M+1 taps total) nominally requires:
  - Two data fetches
  - Multiply
  - Accumulate
  - Memory write-back to update the delay line
• Goal: 1 FIR tap / DSP instruction cycle
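The per-tap work listed above can be sketched in software as follows. This is a plain direct-form FIR written for clarity, not a model of any particular DSP; `fir_step` and `fir` are hypothetical names.

```python
def fir_step(delay_line, coeffs, sample):
    """Process one input sample and return one output sample."""
    # Update the delay line: the new sample enters, the oldest drops out.
    delay_line.insert(0, sample)
    delay_line.pop()
    acc = 0
    for x, h in zip(delay_line, coeffs):
        # One tap: two data fetches (x and h), a multiply, an accumulate.
        acc += x * h
    return acc


def fir(samples, coeffs):
    """Filter a whole sequence, starting from an all-zero delay line."""
    delay_line = [0] * len(coeffs)
    return [fir_step(delay_line, coeffs, s) for s in samples]


# An impulse input reproduces the coefficients at the output.
print(fir([1, 0, 0, 0], [1, 2, 3]))
```

On a general-purpose processor each line of the inner loop costs at least one instruction; the DSP features described in the following slides aim to fold the whole tap into a single instruction cycle.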

BENCHMARKS - FIR FILTER
FINITE-IMPULSE RESPONSE FILTER

Micro-architectural impact - MAC
Element of finite-impulse response filter computation
[Figure; blocks: X, Y, MPY, ADD/SUB, ACC REG]

Mapping of the filter onto a DSP execution unit
[Figure: filter signal-flow graph with numbered operations (1-6) mapped onto the execution unit]
• The critical hardware unit in a DSP is the multiplier; much of the architecture is organized around allowing use of the multiplier on every cycle
• This means providing two operands on every cycle, through multiple data and address buses, multiple address units, and local accumulator feedback

MAC E.g. - 320C54x DSP Functional Block Diagram

DSP Memory
• An FIR tap implies multiple memory accesses
• DSPs want multiple data ports
• Some DSPs have ad hoc techniques to reduce memory bandwidth demand
  - Instruction repeat buffer: do 1 instruction 256 times
  - Often disables interrupts, thereby increasing interrupt response time
• Some recent DSPs have instruction caches
  - Even then may allow the programmer to “lock in” instructions into the cache
  - Option to turn the cache into fast program memory
• No DSPs have data caches
• May have multiple data memories

Conventional “Von Neumann” memory

HARVARD ARCHITECTURE in DSP
[Block diagram; memories: PROGRAM MEMORY, X MEMORY, Y MEMORY; buses: GLOBAL, P DATA, X DATA, Y DATA]

Memory Architecture
DSP Processor
• Harvard architecture
• 2-4 memory accesses/cycle
• No caches; on-chip SRAM
General-Purpose Processor
• Von Neumann architecture
• Typically 1 access/cycle
• May use caches
[Figure: processor with separate program memory and data memory, versus processor with a single memory]

E.g. TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture

E.g. 320C62x/67x DSP

DSP Addressing
• Have standard addressing modes: immediate, displacement, register indirect
• Want to keep the MAC datapath busy
• Assumption: any extra instructions imply clock cycles of overhead in the inner loop
  => complex addressing is good
  => don’t use the datapath to calculate fancy addresses
• Autoincrement/autodecrement register indirect
  - lw r1, 0(r2)+ => r1 <- M[r2]; r2 <- r2+1
  - Option to do it before addressing, positive or negative

DSP Addressing: FFTs start or end with data in weird butterfly order
  0 (000) => 0 (000)
  1 (001) => 4 (100)
  2 (010) => 2 (010)
  3 (011) => 6 (110)
  4 (100) => 1 (001)
  5 (101) => 5 (101)
  6 (110) => 3 (011)
  7 (111) => 7 (111)
• What can be done to avoid the overhead of address-checking instructions for the FFT?
• Have an optional “bit reverse” addressing mode for use with autoincrement addressing
• Many DSPs have “bit reverse” addressing for radix-2 FFT
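The index mapping in the table can be reproduced in software with a few lines; a bit-reversed addressing mode lets the address generator produce this sequence without spending datapath instructions on it. The function name `bit_reverse` is an assumption for illustration.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the low `bits` bits of index (e.g. 001 -> 100 for bits=3)."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


# Reproduces the 8-point table: 0, 4, 2, 6, 1, 5, 3, 7
order = [bit_reverse(i, 3) for i in range(8)]
print(order)

# Reordering a buffer the way an 8-point radix-2 FFT would need it
samples = ["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7"]
print([samples[j] for j in order])
```

Done in software, this loop costs several instructions per sample, which is the overhead the slide's question refers to.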

BIT REVERSED ADDRESSING
Data flow in the radix-2 decimation-in-time FFT algorithm

DSP Addressing: Buffers
• DSPs deal with continuous I/O
• Often interact with an I/O buffer (delay lines)
• To save memory, the buffer is often organized as a circular buffer
• What can be done to avoid the overhead of address-checking instructions for a circular buffer?
  - Option 1: keep a start register and an end register per address register for use with autoincrement addressing; reset to start when the end of the buffer is reached
  - Option 2: keep a buffer length register, assuming the buffer starts on an aligned address; reset to start when the end is reached
• Every DSP has “modulo” or “circular” addressing
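The address update performed by modulo addressing can be sketched as below, in the spirit of Option 2 (a base address plus a length register). The function name and the example addresses are hypothetical; real hardware typically exploits the aligned base to avoid the explicit subtraction.

```python
def next_circular(addr: int, base: int, length: int, step: int = 1) -> int:
    """Advance addr by step, wrapping inside [base, base + length)."""
    return base + (addr - base + step) % length


BASE, LENGTH = 0x100, 4
addr = BASE
walk = []
for _ in range(6):
    walk.append(hex(addr))
    addr = next_circular(addr, BASE, LENGTH)
print(walk)   # the address wraps back to the base after four steps

# A negative step walks backward and wraps at the start of the buffer.
print(hex(next_circular(BASE, BASE, LENGTH, step=-1)))
```

With this update built into the address generator, a delay line never needs to be physically shifted: only the pointer to the newest sample moves.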

CIRCULAR BUFFERS
Instructions accommodate three elements:
• buffer address
• buffer size
• increment
Allows for cycling through:
• delay elements
• coefficients in data memory

Addressing
DSP Processor
• Dedicated address generation units
• Specialized addressing modes; e.g.:
  - Autoincrement
  - Modulo (circular)
  - Bit-reversed (for FFT)
• Good immediate data support
General-Purpose Processor
• Often, no separate address generation unit
• General-purpose addressing modes

Address calculation unit for DSP
• Supports modulo and bit reversal arithmetic
• Often duplicated to calculate multiple addresses per cycle

DSP Instructions and Execution
• May specify multiple operations in a single instruction
• Must support multiply-accumulate (MAC)
• Need parallel move support
• Usually have special loop support to reduce branch overhead
  - Loop an instruction or sequence
  - A 0 value in the register usually means loop the maximum number of times
  - If the loop count is calculated, be sure a result of 0 is not intended to mean zero iterations
• May have saturating shift left arithmetic
• May have conditional execution to reduce branches

ADSP 2100: ZERO-OVERHEAD LOOP
DO <addr> UNTIL condition
[Figure: loop body running from the DO instruction to address X]
Address generation:
  PCS = PC + 1
  if (PC == x && !condition) PC = PCS
  else PC = PC + 1
• Eliminates a few instructions in loops
• Important in loops with small bodies
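The address-generation rule above can be traced with a toy model. This is a simplified sketch, not a description of the actual ADSP 2100 sequencer: it assumes the UNTIL condition is "a counter has expired", and `trace_loop` and its parameters are hypothetical.

```python
def trace_loop(do_addr: int, end_addr: int, iterations: int):
    """Return the sequence of PC values fetched for a hardware loop.

    do_addr:    address of the DO instruction
    end_addr:   address of the last instruction in the loop body (x)
    iterations: how many times the body runs before the condition holds
    """
    pcs = do_addr + 1          # PCS = PC + 1, saved when DO executes
    pc = pcs
    remaining = iterations
    fetched = []
    while True:
        fetched.append(pc)
        if pc == end_addr:
            remaining -= 1
            if remaining > 0:  # condition not yet true: jump back for free
                pc = pcs
                continue
            break              # condition true: fall through past the loop
        pc += 1
    return fetched


# DO at address 10, two-instruction body at 11..12, run twice
print(trace_loop(10, 12, 2))
```

Every fetched address is a body instruction: no cycles go to decrementing a counter, comparing, or branching, which is why the benefit is largest for small loop bodies.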

Instruction Set
DSP Processor
• Specialized, complex instructions
• Multiple operations per instruction
    mac x0,y0,a  x:(r0)+,x0  y:(r4)+,y0
General-Purpose Processor
• General-purpose instructions
• Typically one operation per instruction
    mov *r0,x0
    mov *r1,y0
    mpy x0,y0,a
    add a,b
    mov y0,*r2
    inc r0
    inc r1

Specialized Peripherals for DSPs
• Synchronous serial ports
• Host ports
• Parallel ports
• Bit I/O ports
• Timers
• On-chip DMA controller
• On-chip A/D, D/A converters
• Clock generators
• On-chip peripherals often designed for “background” operation, even when core is powered down.

Specialized peripherals

TMS320C203/LC203 BLOCK DIAGRAM
DSP Core Approach - 1995

Summary of Architectural Features of DSPs
• Data path configured for DSP
  - Fixed-point arithmetic
  - MAC: multiply-accumulate
• Multiple memory banks and buses
  - Harvard architecture
  - Multiple data memories
• Specialized addressing modes
  - Bit-reversed addressing
  - Circular buffers
• Specialized instruction set and execution control
  - Zero-overhead loops
  - Support for MAC
• Specialized peripherals for DSP
THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!