Computing Engine Choices General Purpose Processors GPPs Intended


Computing Engine Choices
• General Purpose Processors (GPPs): Intended for general purpose computing (desktops, servers, clusters, ...)
  – General purpose ISAs (RISC or CISC)
• Application-Specific Processors (ASPs): Processors with ISAs and architectural features tailored towards specific application domains
  – e.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs), Vector Processors (?), ...
  – Special purpose ISAs
• Co-Processors: A hardwired implementation of specific algorithms with a limited programming interface (augment GPPs or ASPs)
• Configurable Hardware:
  – Field Programmable Gate Arrays (FPGAs)
  – Configurable array of simple processing elements
• Application Specific Integrated Circuits (ASICs): A custom VLSI hardware solution for a specific computational task
The choice of one or more depends on a number of factors including:
  - Type and complexity of computational algorithm (general purpose vs. specialized)
  - Desired level of flexibility and programmability
  - Performance requirements
  - Desired level of computational efficiency (e.g. computations per watt or computations per chip area)
  - Power requirements
  - Development time and cost
  - Real-time constraints
  - System cost
EECC 722 - Shaaban, lec # 8, Fall 2006, 10-16-2006

Computing Engine Choices
[Figure: computing engine choices ordered from most programmable/flexible to most specialized: General Purpose Processors (GPPs); Application-Specific Processors (ASPs), e.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs), physics processors; Configurable Hardware; Co-Processors; Application Specific Integrated Circuits (ASICs). One axis is Programmability / Flexibility; the opposite axis is Specialization, Development cost/time, and Performance/Chip Area/Watt (Computational Efficiency).]
Processor = Programmable computing element that runs programs written using a pre-defined set of instructions.
Selection factors:
  - Type and complexity of computational algorithm (general purpose vs. specialized)
  - Desired level of flexibility and programmability
  - Performance requirements
  - Desired level of computational efficiency
  - Power requirements
  - Real-time constraints
  - Development time and cost
  - System cost

Why Application-Specific Processors (ASPs)?
Observation:
• Generality and efficiency are in some sense inversely related to one another:
  – The more general-purpose a computing element is, and thus the greater the number of tasks it can perform, the less efficient (e.g. computations per chip area or per watt) it will be in performing any one of those tasks.
  – Design decisions are therefore almost always compromises; designers identify key features or requirements of applications that must be met and compromise on other, less important features.
• To handle computationally intense and specialized problems for which general purpose machines cannot achieve the necessary performance or other requirements:
  – Special-purpose processors (Application-Specific Processors, ASPs), attached processors, and coprocessors have been designed and built for many years for specific application domains, such as image or digital signal processing (for which many of the computational tasks are specialized and can be very well defined).
Generality = Flexibility = Programmability ?
Efficiency = Computations per watt or chip area

Digital Signal Processor (DSP) Architecture
• Classification of Main Processor Types/Applications
• Requirements of Embedded Processors
• DSP vs. General Purpose CPUs
• DSP Cores vs. Chips
• Classification of DSP Applications
• DSP Algorithm Format
• DSP Benchmarks
• Basic Architectural Features of DSPs
• DSP Software Development Considerations
• Classification of Current DSP Architectures and example DSPs:
  – Conventional DSPs: TI TMS320C54xx
  – Enhanced Conventional DSPs: TI TMS320C55xx
  – Multiple-Issue DSPs:
    • VLIW DSPs: TI TMS320C62xx, TMS320C64xx
    • Superscalar DSPs: LSI Logic ZSP400/500 DSP core

Main Processor Types/Applications
(Cost/complexity increases toward general purpose processors; volume increases toward microcontrollers.)
• General Purpose Computing & General Purpose Processors (GPPs)
  – High performance: In general, faster is always better.
  – RISC or CISC: Intel P4, IBM Power4, SPARC, PowerPC, MIPS, ...
  – Used for general purpose software
  – End-user programmable
  – Real-time performance may not be fully predictable (due to dynamic architectural features)
  – Heavy weight, multi-tasking OS - Windows, UNIX
  – Normally, low cost and power not a requirement (changing)
  – Servers, Workstations, Desktops (PCs), Notebooks, Clusters, ...
• Embedded Processing: Embedded processors and processor cores
  – Cost, power, code-size and real-time requirements and constraints
  – Once real-time constraints are met, a faster processor may not be better
  – e.g. Intel XScale, ARM, 486SX, Hitachi SH7000, NEC V800, ...
  – Often require digital signal processing (DSP) support or other application-specific support (e.g. network, media processing); these are examples of Application-Specific Processors (ASPs)
  – Single or few specialized programs, known at system design time
  – Not end-user programmable
  – Real-time performance must be fully predictable (avoid dynamic architectural features)
  – Lightweight, often real-time OS or no OS
  – Examples: Cellular phones, consumer electronics, ...
• Microcontrollers
  – Extremely code size/cost/power sensitive
  – Single program
  – Small word size - 8 bit common
  – Usually no OS
  – Highest volume processors by far
  – Examples: Control systems, automobiles, industrial control, thermostats, ...

The Processor Design Space
[Figure: performance (vertical axis) versus processor cost, i.e. chip area and power complexity (horizontal axis). Regions shown: Microcontrollers ("cost is everything"); Embedded processors (real-time constraints, specialized applications, low power/cost constraints); Microprocessors/GPPs ("performance is everything & software rules"); application-specific architectures for performance.]

Requirements of Embedded Processors
• Usually must meet strict real-time constraints:
  – Real-time performance must be fully predictable:
    • Avoid dynamic processor architectural features that make real-time performance harder to predict (e.g. cache, dynamic scheduling, hardware speculation, ...)
  – Once real-time constraints are met, a faster processor is not desirable (overkill) due to increased cost/power requirements.
• Optimized for a single (or few) program(s) - code often in on-chip ROM or on/off-chip EPROM/flash memory.
• Minimum code size (one of the motivations initially for Java)
• Performance obtained by optimizing datapath
• Low cost
  – Lowest possible area
    • High computational efficiency: Computation per unit area
  – VLSI implementation technology usually behind the leading edge
  – High level of integration of peripherals (System-on-Chip, SoC, approach reduces system cost/power)
• Fast time to market
  – Compatible architectures (e.g. ARM family) allow reusable code
  – Customizable cores (System-on-Chip, SoC).
• Low power if application requires portability

Embedded Processors
Area of processor cores = Cost (and Power requirements)
[Figure: chart of embedded processor core area; labeled points include a Nintendo processor and cellular phones.]

Embedded Processors
Another figure of merit: Computation per unit area (Computational Efficiency)
[Figure: chart of computation per unit area for embedded processors; labeled points include a Nintendo processor and cellular phones.]

Embedded Processors: Code size
• Smaller is better
• If a majority of the chip is the program stored in ROM, then minimizing code size is a critical issue
• Common embedded processor ISA features to minimize code size:
  – Variable length instruction encoding is common:
    • e.g. the Piranha has three instruction sizes: basic 2 byte, and 2 byte plus 16 or 32 bit immediate
  – Complex/specialized instructions
  – Complex addressing modes

Embedded Systems vs. General Purpose Computing
Embedded Systems (and embedded processors) | General Purpose Computing Systems (and GPPs)
Run a single or few specialized applications, often known at system design time | Used for general purpose software: intended to run a fully general set of applications that may not be known at design time
Not end-user programmable | End-user programmable
May require application-specific capability (e.g. DSP) | No application-specific capability required
Minimum code size is highly desirable | Minimizing code size is not an issue
Lightweight, often real-time OS or no OS | Heavy weight, multi-tasking OS - Windows, UNIX
Low power and cost constraints/requirements | Higher power and cost constraints/requirements
Usually must meet strict real-time constraints (e.g. real-time sampling rate) | In general, no real-time constraints
Real-time performance must be fully predictable: avoid dynamic processor architectural features that make real-time performance harder to predict | Real-time performance may not be fully predictable due to dynamic processor architectural features (superscalar: dynamic scheduling, hardware speculation, branch prediction, cache)
Once real-time constraints are met, a faster processor is not desirable (overkill) due to increased cost/power requirements | Faster (higher-performance) is always better

Evolution of GPPs and DSPs
• General Purpose Processors (GPPs) trace roots back to Eckert, Mauchly, Von Neumann (ENIAC) + EDSAC
• Digital Signal Processors (DSPs) are microprocessors designed for efficient mathematical manipulation of digital signals utilizing digital signal processing algorithms.
  – DSPs usually process infinite continuous sampled data streams (signals) while meeting real-time and power constraints.
  – DSPs evolved from Analog Signal Processors (ASPs) that utilize analog hardware to transform physical signals (classical electrical engineering)
  – ASP to DSP because:
    • DSP is insensitive to environment (e.g. same response in snow or desert, if it works at all)
    • DSP performance is identical even with variations in components; the behavior of two analog systems varies even if they are built with the same components with 1% variation
• Different history and different application requirements led to different terms, different metrics, different architectures, and some new inventions.

DSP vs. General Purpose CPUs
• DSPs tend to run one (or few) program(s), not many programs.
  – Hence OSes (if any) are much simpler; there is no virtual memory or protection, ...
• DSPs usually run applications with hard real-time constraints:
  – The DSP must meet the computational requirements of the application signal sampling rate:
    • A faster DSP is overkill (higher DSP cost, power, ...)
  – You must account for anything that could happen in a time slot (DSP algorithm inner loop, data sampling rate)
  – All possible interrupts or exceptions must be accounted for and their collective time subtracted from the time interval.
    • Therefore, exceptions are BAD.
• DSPs usually process infinite continuous data streams:
  – Requires high memory bandwidth (with predictable latency, e.g. no data cache) for streaming real-time data samples and predictable processing time on the data samples
• The design of DSP ISAs and processor architectures is driven by the requirements of DSP algorithms.
  – Thus DSPs are application-specific processors

DSP vs. GPP
• The "MIPS/MFLOPS" of DSPs is the speed of Multiply-Accumulate (MAC), i.e. the main performance measure of DSPs is MAC speed. Why?
  – MAC is common in DSP algorithms that involve computing a vector dot product, such as digital filters, correlation, and Fourier transforms.
  – DSPs are judged by whether they can keep the multipliers busy 100% of the time and by how many MACs are performed in each cycle.
• The "SPEC" of DSPs is 4 algorithms:
  – Infinite Impulse Response (IIR) filters
  – Finite Impulse Response (FIR) filters
  – FFT, and
  – convolvers
• In DSPs, target algorithms are important:
  – Binary compatibility is not a major issue
• High-level software is not as important in DSPs as in GPPs.
  – People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.
Note: While this is still mostly true, programming DSPs in high level languages (HLLs) has been gaining acceptance due to the development of more efficient HLL DSP compilers in recent years.

Types of DSP Processors According to Type of Arithmetic/Operand Size Supported
• 32-bit floating point (5% of DSP market):
  – TI TMS320C3X, TMS320C67xx (VLIW)
  – AT&T DSP32C
  – Analog Devices ADSP21xxx
  – Hitachi SH-4
• 16-bit fixed point (95% of DSP market):
  – TI TMS320C2X, TMS320C62xx (VLIW)
  – Infineon TC1xxx (TriCore1) (VLIW)
  – Motorola DSP568xx, MSC810x (VLIW)
  – Analog Devices ADSP21xx
  – Agere Systems DSP16xxx, Starpro2000
  – LSI Logic LSI140x (ZSP400), superscalar
  – Hitachi SH3-DSP
  – StarCore SC110, SC140 (VLIW)

DSP Cores vs. Chips
DSPs are usually available as synthesizable cores or off-the-shelf packaged chips.
• Synthesizable cores:
  – Map into chosen fabrication process
    • Speed, power, and size vary
  – Choice of peripherals, etc. (SoC = System on Chip)
  – Require extensive hardware development effort, resulting in more development time and cost
• Off-the-shelf packaged chips:
  – Highly optimized for speed, energy efficiency, and/or cost.
  – Limited performance and integration options.
  – Tools and third-party support often more mature

DSP Architecture: Enabling Technologies
Time Frame | Approach | Primary Application | Enabling Technologies
Early 1970s | Discrete logic | Non-real time processing; Simulation | Bipolar SSI, MSI; FFT algorithm
Late 1970s | Building block | Military radars; Digital Comm. | Single chip bipolar multiplier; Flash A/D
Early 1980s (generation 1) | Single Chip DSP µP | Telecom; Control | µP architectures; NMOS/CMOS
Late 1980s (generation 2) | Function/Application specific chips | Computers; Communication | Vector processing; Parallel processing
Early 1990s (generation 3) | Multiprocessing | Video/Image Processing | Advanced multiprocessing; VLIW, MIMD, etc.
Late 1990s (generation 4) | Single-chip multiprocessing | Wireless telephony; Internet related | Low power single-chip DSP; VLIW/Multiprocessing
First microprocessor DSP: TI TMS32010.
Generations 1 to 4 are the generations of single-chip (microprocessor) DSPs.

Texas Instruments TMS320 Family: Multiple DSP µP Generations
[Figure: TMS320 family members grouped by generation 1, 2, 3 (VLIW), and 4 of single-chip (microprocessor) DSPs.]

DSP Applications
• Digital audio applications
  – MPEG Audio
  – Portable audio
• Digital cameras
• Cellular telephones
• Wearable medical appliances
• Storage products:
  – disk drive servo control
• Military applications:
  – radar
  – sonar
• Industrial control
• Seismic exploration
• Networking (telecom infrastructure):
  – Wireless
  – Base station
  – Cable modems
  – ADSL
  – VDSL
  – ...
Current DSP killer applications: cell phones and telecom infrastructure

DSP Algorithms & Applications
[Figure: DSP algorithms and the applications that use them.]

Another Look at DSP Applications
(Cost increases toward the high end; volume increases toward the low end.)
• High-end:
  – Military applications (e.g. radar/sonar)
  – Wireless base station - TMS320C6000
  – Cable modem
  – Gateways
• Mid-range:
  – Industrial control
  – Cellular phone - TMS320C540
  – Fax/voice server
• Low end:
  – Storage products - TMS320C27 (hard drive controllers)
  – Digital camera - TMS320C5000
  – Portable phones
  – Wireless headsets
  – Consumer audio
  – Automobiles, thermostats, ...

DSP Range of Applications & Possible Target DSPs
[Figure: range of DSP applications and possible target DSPs.]

Cellular Phone System (Example DSP Application)
[Figure: block diagram of a cellular phone. Blocks: keypad and display, controller, physical layer processing, speech encode, speech decode, baseband converter, A/D, DAC, RF modem.]

Cellular Phone: HW/SW/IC Partitioning (Example DSP Application)
[Figure: the cellular phone block diagram (controller, physical layer processing, speech encode, speech decode, baseband converter, A/D, DAC, RF modem) partitioned across a microcontroller, an ASIC, a DSP, and an analog IC.]

Mapping Onto System-on-Chip (SoC) (Cellular Phone) (Example DSP Application)
[Figure: SoC floorplan for a cellular phone. Blocks: µC, RAM, DMA, S/P, keypad interface, phone book, control, protocol, ASIC logic, and a DSP core running speech quality enhancement, voice recognition, de-interleaving and decoding, RPE-LTP speech decoder, demodulator and synchronizer, and Viterbi equalizer.]

Example Cellular Phone Organization (Example DSP Application)
[Figure: cellular phone organization built around a C540 (DSP) and an ARM7 (µC).]

Multimedia System-on-Chip (SoC) (Example DSP Application)
e.g. multimedia terminal electronics
[Figure: multimedia terminal SoC. Blocks: µP, DSP, custom video unit (ASIC), memory, comms, and an ASIC co-processor or ASP; interfaces: graphics out, video I/O, voice I/O, pen in, uplink radio, downlink radio.]
• Future chips will be a mix of processors, memory, and dedicated hardware for specific algorithms and I/O

DSP Algorithm Format
• DSP culture has a graphical format to represent formulas.
• It is like a flowchart for formulas and inner loops, not for whole programs.
• Some symbols seem natural: Σ is add, X is multiply
• Others are obtuse: z^–1 means take the variable from an earlier iteration (delay).
• These graphs are trivial to decode

DSP Algorithm Notation
• Uses "flowchart" notation instead of equations
• Multiply is drawn as a multiplier symbol or X
• Add is drawn as an adder symbol or +
• Delay/Storage is drawn as a box labeled Delay, z^–1, or D

Typical DSP Algorithm: Finite-Impulse Response (FIR) Filter
• Filters reduce signal noise and enhance image or signal quality by removing unwanted frequencies.
• Finite Impulse Response (FIR) filters compute:
  $$ y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k) $$
  where
  – x is the input sequence
  – y is the output sequence
  – h is the impulse response (filter coefficients)
  – N is the number of taps (coefficients) in the filter
• Output sequence depends only on input sequence and impulse response.

Typical DSP Algorithms: Finite-Impulse Response (FIR) Filter
• N most recent samples in the delay line (Xi)
• New sample moves data down delay line
• Filter "tap" is a multiply-add (Multiply And Accumulate, MAC)
• Each tap (N taps total) nominally requires:
  – Two data fetches: requires real-time data sample streaming
    • Predictable data bandwidth/latency
    • Special addressing modes
  – Multiply and accumulate: repetitive computations, multiply and accumulate (MAC)
    • Requires efficient MAC support
  – Memory write-back to update delay line
    • Special addressing modes (e.g. modulo)
• Goal: At least 1 FIR tap / DSP instruction cycle
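The per-tap work described above (fetch a sample and a coefficient, multiply, accumulate, then shift the delay line) can be sketched in software. This is a minimal illustration in Python, not DSP code: the function name `fir_filter` is hypothetical, samples before the start of the input are assumed to be zero, and a `deque` stands in for the delay line that a DSP would implement with modulo (circular) addressing.

```python
from collections import deque

def fir_filter(samples, h):
    """Direct-form FIR: y(n) = sum over k of h(k) * x(n-k), with N = len(h) taps."""
    n_taps = len(h)
    # Delay line holds the N most recent samples, newest first.
    # Assumption: samples before the start of the input are zero.
    delay_line = deque([0.0] * n_taps, maxlen=n_taps)
    out = []
    for x in samples:
        delay_line.appendleft(x)  # new sample enters; the oldest falls off the end
        acc = 0.0
        for k in range(n_taps):
            acc += h[k] * delay_line[k]  # one tap = one multiply-accumulate (MAC)
        out.append(acc)
    return out
```

Feeding a unit impulse returns the coefficients themselves, which is a quick way to check that the taps are wired in the right order.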

Finite-Impulse Response (FIR) Filter
[Figure: FIR filter signal-flow graph. Input X passes through a chain of delays; each delayed sample is multiplied by a coefficient h0, h1, ..., hN-2, ... and summed into an accumulator register to produce output Y. One multiply-add stage is marked as one FIR filter tap.]
• The filter is a vector dot product.
• Performance goal: at least 1 FIR tap / DSP instruction cycle
• The DSP must meet the computational requirements of the application signal sampling rate: a faster DSP is overkill (more cost/power than really needed)

Sample Computational Rates for FIR Filtering
[Table: sample computational rates by FIR type (1-D and 2-D); the recoverable entries are two 2-D cases at 4.37 GOPs and 23.3 GOPs.]
• A 1-D FIR has n_op = 2N and a 2-D FIR has n_op = 2N^2 (OP = operation).
• The DSP must meet the computational requirements of the application signal sampling rate:
  – A faster DSP is overkill (higher DSP cost, power, ...)
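As a rough worked example of how such rates arise, the operation counts n_op = 2N (1-D) and n_op = 2N^2 (2-D) can be multiplied by the sample rate. The sketch below assumes that n_op is counted per output sample and that each tap costs one multiply plus one add; the function name and the example numbers (64 taps at 48 kHz, a 9x9 kernel at 1 Msample/s) are hypothetical and are not taken from the slide's table.

```python
def fir_ops_per_second(n_taps, sample_rate_hz, dims=1):
    """Required operations per second for an N-tap FIR filter.

    Assumes 2 operations (multiply + add) per tap and N**dims taps per output
    sample, i.e. n_op = 2N for 1-D and n_op = 2N^2 for 2-D.
    """
    if dims not in (1, 2):
        raise ValueError("dims must be 1 or 2")
    ops_per_sample = 2 * n_taps ** dims
    return ops_per_sample * sample_rate_hz

# Hypothetical workloads, for illustration only.
audio_ops = fir_ops_per_second(64, 48_000)            # 1-D, 64 taps, 48 kHz
image_ops = fir_ops_per_second(9, 1_000_000, dims=2)  # 2-D, 9x9 kernel, 1 Msample/s
```

A DSP sized for a given workload only needs to sustain this rate with some margin for interrupts; anything much faster is the "overkill" the slide refers to.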

FIR filter on (simple) General Purpose Processor
    loop: lw  x0, 0(r0)
          lw  y0, 0(r1)
          mul a, x0, y0
          add y0, a, b
          sw  y0, (r2)
          inc r0
          inc r1
          inc r2
          dec ctr
          tst ctr
          jnz loop
• Problems:
  – Bus / memory bandwidth bottleneck
  – Control/loop code overhead
  – No suitable addressing modes or instructions, e.g. a multiply and accumulate (MAC) instruction
  – GPP real-time performance (needed to meet the signal sampling rate) may not be fully predictable due to dynamic processor architectural features:
    • Superscalar: dynamic scheduling, hardware speculation, branch prediction, cache.

Typical DSP Algorithms: Infinite-Impulse Response (IIR) Filter
• Infinite Impulse Response (IIR) filters compute:
  $$ y(n) = \sum_{k=1}^{M-1} a(k)\, y(n-k) + \sum_{k=0}^{N-1} b(k)\, x(n-k) $$
• Output sequence depends on input sequence, previous outputs, and impulse response (i.e. the filter coefficients a(k), b(k)).
• Both FIR and IIR filters
  – Require vector dot product (multiply-accumulate, MAC) operations
  – Use fixed coefficients
• Adaptive filters update their coefficients to minimize the distance between the filter output and the desired signal.
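To make the feedback term concrete, here is a minimal Python sketch of an IIR filter. It assumes the common direct form y(n) = sum_{k>=1} a(k) y(n-k) + sum_{k>=0} b(k) x(n-k), with the feedback terms added (some texts subtract them instead), zero initial conditions, and a hypothetical function name `iir_filter`. The entry `a[0]` is an unused placeholder so that `a[k]` lines up with y(n-k).

```python
def iir_filter(samples, a, b):
    """IIR filter: feedforward taps b(k) on inputs, feedback taps a(k) on past outputs."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        # Feedforward part: same dot product as an FIR filter.
        for k in range(len(b)):
            if n - k >= 0:
                acc += b[k] * samples[n - k]
        # Feedback part: previous outputs re-enter the sum (a[0] is unused).
        for k in range(1, len(a)):
            if n - k >= 0:
                acc += a[k] * out[n - k]
        out.append(acc)
    return out
```

With a single feedback coefficient of 0.5, a unit impulse produces 1, 0.5, 0.25, ... and never reaches exactly zero, which is where the "infinite" in the name comes from.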

Typical DSP Algorithms: Discrete Fourier Transform (DFT)
• The Discrete Fourier Transform (DFT) allows for spectral analysis in the frequency domain.
• It is computed as
  $$ y(k) = \sum_{n=0}^{N-1} W_N^{nk}\, x(n), \qquad W_N = e^{-2\pi j / N} $$
  for k = 0, 1, ..., N-1, where
  – x is the input sequence in the time domain
  – y is an output sequence in the frequency domain
• The Inverse Discrete Fourier Transform is computed as
  $$ x(n) = \frac{1}{N} \sum_{k=0}^{N-1} W_N^{-nk}\, y(k), \qquad n = 0, 1, \ldots, N-1 $$
• The Fast Fourier Transform (FFT) provides an efficient method for computing the DFT.
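The DFT and its inverse can be written directly from their definitions. The sketch below uses the standard convention y(k) = sum_n x(n) e^{-2 pi j n k / N} with the 1/N factor on the inverse (an assumption about normalization; other conventions exist). It is the O(N^2) definition, not an FFT, and the function names are hypothetical.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT: y(k) = sum_n x(n) * exp(-2*pi*j*n*k/N)."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]

def idft(y):
    """Inverse DFT: x(n) = (1/N) * sum_k y(k) * exp(+2*pi*j*n*k/N)."""
    n_pts = len(y)
    return [
        sum(y[k] * cmath.exp(2j * cmath.pi * n * k / n_pts) for k in range(n_pts)) / n_pts
        for n in range(n_pts)
    ]
```

Each output bin is again a dot product of the input with a fixed coefficient vector, so the inner loop is the same multiply-accumulate pattern as in the filters; an FFT reduces the number of such operations from O(N^2) to O(N log N).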

Typical DSP Algorithms: Discrete Cosine Transform (DCT)
• The Discrete Cosine Transform (DCT) is frequently used in image & video compression (e.g. JPEG, MPEG-2).
• The DCT and Inverse DCT (IDCT) are computed as:
  $$ y(k) = e(k) \sum_{n=0}^{N-1} \cos\!\left[\frac{(2n+1)\pi k}{2N}\right] x(n), \qquad k = 0, 1, \ldots, N-1 $$
  $$ x(n) = \frac{2}{N} \sum_{k=0}^{N-1} e(k) \cos\!\left[\frac{(2n+1)\pi k}{2N}\right] y(k), \qquad n = 0, 1, \ldots, N-1 $$
  where e(k) = 1/sqrt(2) if k = 0; otherwise e(k) = 1.
• An N-point 1-D DCT requires N^2 MAC operations.
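A direct implementation shows where the N^2 MAC count comes from: N outputs, each a sum over N inputs. This sketch assumes the DCT-II form y(k) = e(k) sum_n cos[(2n+1) pi k / (2N)] x(n) with a 2/N factor on the inverse and e(0) = 1/sqrt(2); the normalization is an assumption, chosen so that the two functions below invert each other. Function names are hypothetical.

```python
import math

def e(k):
    """Scale factor: 1/sqrt(2) for k = 0, otherwise 1."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

def dct(x):
    """N-point 1-D DCT: N outputs x N MACs each = N^2 MAC operations."""
    n_pts = len(x)
    return [
        e(k) * sum(math.cos((2 * n + 1) * math.pi * k / (2 * n_pts)) * x[n]
                   for n in range(n_pts))
        for k in range(n_pts)
    ]

def idct(y):
    """Inverse DCT matching dct() above (assumed 2/N normalization)."""
    n_pts = len(y)
    return [
        (2.0 / n_pts) * sum(e(k) * math.cos((2 * n + 1) * math.pi * k / (2 * n_pts)) * y[k]
                            for k in range(n_pts))
        for n in range(n_pts)
    ]
```

A constant input puts all of its energy into the k = 0 coefficient, which is the property image and video coders exploit for smooth regions.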

DSP Benchmarks
• DSPstone: University of Aachen, application benchmarks
  – ADPCM TRANSCODER - CCITT G.721
  – REAL_UPDATE, COMPLEX_UPDATES
  – DOT_PRODUCT, MATRIX_1X3, CONVOLUTION
  – FIR, FIR2DIM, HR_ONE_BIQUAD
  – LMS, FFT_INPUT_SCALED
• BDTImark2000: Berkeley Design Technology Inc. (BDTI)
  – 12 DSP kernels in hand-optimized assembly language:
    • FIR, IIR, vector dot product, vector add, vector maximum, FFT, ...
  – Returns a single number (higher means faster) per processor
  – Uses only on-chip memory (memory bandwidth is the major bottleneck in performance of embedded applications).
• EEMBC (pronounced "embassy"): EDN Embedded Microprocessor Benchmark Consortium
  – Consortium of 30 companies formed by Electrical Design News (EDN)
  – Benchmark evaluates compiled C code on a variety of embedded processors (microcontrollers, DSPs, etc.)
  – Application domains: automotive-industrial, consumer, office automation, networking and telecommunications

[Figure: DSP performance by generation (1st, 2nd, 3rd, and 4th generation), annotated "> 800x faster than first generation".]

Basic ISA/Architectural Features of DSPs
• Data path configured for DSP algorithms
  – Fixed-point arithmetic (most DSPs) (DSP ISA feature)
    • Saturation arithmetic to handle overflow
  – MAC: multiply-accumulate unit(s) (DSP architectural feature)
  – Hardware rounding support
• Multiple memory banks and buses (DSP architectural feature)
  – Harvard architecture
  – Multiple data memories
  – Usually with no data cache, for predictable fast data sample streaming
• Specialized addressing modes (DSP ISA feature)
  – Bit-reversed addressing
  – Circular buffers
  – Dedicated address generation units are usually used (DSP architectural feature)
• Specialized instruction set and execution control, to meet real-time signal sampling/processing constraints
  – Zero-overhead loops
  – Support for fast MAC
  – Fast interrupt handling
• Specialized peripherals for DSP (System on Chip, SoC, style)
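The two specialized addressing modes named above can be illustrated in a few lines. This is a software sketch of what a DSP address generation unit computes in hardware; the function names are hypothetical. Bit-reversed addressing produces the index order used by radix-2 FFTs, and circular (modulo) addressing makes a pointer wrap around a fixed-length buffer without an explicit compare-and-branch.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (radix-2 FFT data ordering)."""
    rev = 0
    for _ in range(bits):
        rev = (rev << 1) | (index & 1)  # move the current low bit into the result
        index >>= 1
    return rev

def circular_next(ptr, step, length):
    """Advance a buffer pointer with wraparound (modulo addressing)."""
    return (ptr + step) % length

# Access order for an 8-point FFT (3 address bits).
fft_order = [bit_reverse(i, 3) for i in range(8)]
```

On a DSP both updates happen alongside the data fetch, so neither costs an extra instruction inside the inner loop.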

DSP ISA Features: DSP Data Path: Arithmetic
• DSPs deal with numbers representing real-world signals => want "reals"/fractions
• DSPs deal with numbers for addresses => want integers
• A DSP ISA (and DSP) must support "fixed point" as well as integers:
  – Fixed-point fraction: sign bit S, then the radix point, then the fraction bits: –1 ≤ x < 1
  – Integer: sign bit S, then the integer bits, then the radix point: –2^(N–1) ≤ x < 2^(N–1)
  – Usually 16-bit
• In DSP ISAs, fixed-point arithmetic must be supported; floating point support is optional and is much less common.
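A 16-bit fixed-point fraction with one sign bit and 15 fraction bits is often called Q15 (a name not used on the slide). The sketch below is a hedged illustration of how a value in [-1, 1) maps to a 16-bit integer and how two such fractions are multiplied; the function names are hypothetical, and the multiply truncates rather than rounds.

```python
Q15_ONE = 1 << 15  # scale factor: 2**15 = 32768 steps per unit

def float_to_q15(x):
    """Convert a real value to a 16-bit fraction, saturating outside [-1, 1)."""
    scaled = int(round(x * Q15_ONE))
    return max(-32768, min(32767, scaled))

def q15_to_float(q):
    """Convert a 16-bit fraction back to a real value."""
    return q / Q15_ONE

def q15_mul(a, b):
    """Multiply two Q15 fractions: the 32-bit product has 30 fraction bits,
    so shift right by 15 to return to Q15 (truncating the low bits).
    Note: (-1) * (-1) would give +1, which does not fit; not handled here."""
    return (a * b) >> 15
```

For example, 0.5 is stored as 16384, and 0.5 times 0.5 comes back as 8192, i.e. 0.25. The value +1.0 is not representable and saturates to 32767.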

DSP ISA Features: DSP Data Path: Precision
• Word size affects precision of fixed point numbers
• DSPs have 16-bit, 20-bit, or 24-bit data words; 16-bit is most common
• Floating point DSPs cost 2X - 4X vs. fixed point and are slower than fixed point
• DSP programmers will scale values inside code
  – SW libraries
  – Separate explicit exponent
• "Blocked floating point": a single exponent for a group of fractions
• Floating point support simplifies development for high-end DSP applications.
• In DSP ISAs, fixed-point arithmetic must be supported; floating point support is optional and is much less common.

DSP ISA Feature: DSP Data Path: Overflow
• DSPs are descended from analog signal processing, where an overdriven signal clips at its limit rather than wrapping around as in modulo arithmetic.
• On overflow, the result is set to the most positive (2^(N–1) – 1) or most negative (–2^(N–1)) value: "saturation"
• Many DSP algorithms were developed in this model.
• Why support saturation? Due to the physical nature of signals.
[Figure: output versus input, flattening (saturating) at 2^(N–1) – 1 and at –2^(N–1).]
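The difference between wraparound (modulo) overflow and saturation is easy to see with 16-bit values. This is an illustrative Python sketch with hypothetical function names; a DSP performs the saturating version in hardware as part of the add.

```python
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1

def wrap_add16(a, b):
    """16-bit two's complement add with wraparound (modulo 2**16)."""
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s & 0x8000 else s  # reinterpret bit 15 as the sign

def sat_add16(a, b):
    """16-bit add that clamps to the most positive / most negative value."""
    return max(INT16_MIN, min(INT16_MAX, a + b))
```

Adding 30000 and 10000 wraps to -25536 but saturates to 32767. For an audio sample, the wrapped result is a loud full-scale glitch of the wrong sign, while the saturated result is ordinary clipping.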

DSP Architectural Features: DSP Data Path: Specialized Hardware
• Specialized hardware functional units perform all key arithmetic operations in 1 cycle, including:
  – Shifters
  – Saturation
  – Guard bits
  – Rounding modes
  – Multiplication/addition (MAC)
• 50% of instructions can involve the multiplier => single cycle latency multiplier
• Need to perform multiply-accumulate (MAC) fast
• n-bit multiplier => 2n-bit product

DSP Data Path: Accumulator
• Don't want overflow or to have to scale the accumulator
• Option 1: accumulator wider than product: "guard bits"
  – Motorola DSP: 24b x 24b => 48b product, 56b accumulator
• Option 2: shift right and round product before adder
[Figure: two MAC unit datapaths. Option 1: Multiplier -> ALU -> Accumulator with guard bits (G). Option 2: Multiplier -> Shift -> ALU -> Accumulator.]
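A small calculation shows what the extra accumulator width buys. The sketch below is illustrative only: the function names are hypothetical, and the worst case is assumed to be the product of the two most negative 24-bit operands, (-2^23) x (-2^23) = 2^46. Under that assumption, 8 guard bits on top of a 48-bit product leave room for at least 256 worst-case products.

```python
def guard_bits_needed(n_terms):
    """Smallest g such that 2**g >= n_terms."""
    return (n_terms - 1).bit_length()

def fits_signed(value, bits):
    """True if value is representable as a signed two's complement number of `bits` bits."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1

# Worst-case product of two signed 24-bit operands (assumption stated above).
worst_product = (-(1 << 23)) * (-(1 << 23))  # 2**46
```

With a 56-bit accumulator, a 256-tap filter can run its whole inner loop without checking for overflow and round or saturate once at the end; a 48-bit accumulator overflows after only two worst-case products.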

DSP Data Path: Rounding Modes
• Even with guard bits, will need to round when storing accumulator into memory
• 3 DSP standard options (supported in hardware):
  – Truncation: chop results => biases results down (toward the more negative value in two's complement)
  – Round to nearest: < 1/2 round down, ≥ 1/2 round up (more positive) => smaller bias
  – Convergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make the lsb a zero (+1 if 1, +0 if 0) => no bias
    • IEEE 754 calls this round to nearest even
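The three modes differ only in how they treat the bits being discarded. The sketch below models them on Python integers, where `acc` is a two's complement accumulator value and `shift` is the number of low-order bits dropped when storing to memory (`shift` >= 1 is assumed). Function names are hypothetical.

```python
def truncate(acc, shift):
    """Chop the low bits; in two's complement this always rounds toward -infinity."""
    return acc >> shift

def round_nearest(acc, shift):
    """Add half an lsb, then chop: exact ties round up (more positive)."""
    return (acc + (1 << (shift - 1))) >> shift

def round_convergent(acc, shift):
    """Round to nearest; exact ties go to the value whose lsb is zero (even)."""
    half = 1 << (shift - 1)
    quotient = acc >> shift
    fraction = acc & ((1 << shift) - 1)  # discarded bits, as an unsigned value
    if fraction > half or (fraction == half and (quotient & 1)):
        quotient += 1
    return quotient
```

With `shift = 1`, the accumulator value 5 represents 2.5: truncation gives 2, round to nearest gives 3, and convergent rounding gives 2 (even), while 7 (3.5) becomes 4 under both rounding modes. Over many samples the ties cancel under convergent rounding, which is why it introduces no net bias.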

Data Path Comparison
DSP Processor:
• Specialized hardware performs all key arithmetic operations in 1 cycle.
  – e.g. MAC
• Hardware support for managing numeric fidelity:
  – Shifters
  – Guard bits
  – Saturation
General-Purpose Processor:
• Multiplies often take > 1 cycle
• Shifts often take > 1 cycle
• Other operations (e.g. saturation, rounding) typically take multiple cycles.

TI 320C54x DSP (1995) Functional Block Diagram
[Figure: functional block diagram of the TI 320C54x; annotated features: multiple memory banks and buses, MAC unit, hardware support for rounding/saturation.]

First Commercial DSP (1982): Texas Instruments TMS32010
• 16-bit fixed-point arithmetic
• Introduced at 5 MHz (200 ns) instruction cycle.
• “Harvard architecture” – separate instruction, data memories
• Accumulator
• Specialized instruction set – Load and Accumulate
• Two-cycle (400 ns) Multiply-Accumulate (MAC) time.
[Figure: Instruction Memory and Data Memory attached to the Processor. Datapath: Mem, T-Register, Multiplier, P-Register, ALU, Accumulator.]

First Generation DSP µP
Texas Instruments TMS32010 - 1982
Features
• 200 ns instruction cycle (5 MIPS)
• 144 words (16 bit) on-chip data RAM
• 1.5K words (16 bit) on-chip program ROM - TMS32010
• External program memory expansion to a total of 4K words at full speed
• 16-bit instruction/data word
• Single-cycle 32-bit ALU/accumulator
• Single-cycle 16 x 16-bit multiply in 200 ns
• Two-cycle MAC (5 MOPS)
• Zero to 15-bit barrel shifter
• Eight input and eight output channels

First Generation DSP µP
TI TMS32010 Block Diagram
[Figure: block diagram. Callouts: Program Memory (ROM/EPROM); MAC unit.]

TMS32010 FIR Filter Code
• Here X4, H4, ... are direct (absolute) memory addresses:
    LT  X4    ; Load T with x(n-4)
    MPY H4    ; P = H4*X4
    LTD X3    ; Load T with x(n-3); x(n-4) = x(n-3); Acc = Acc + P
    MPY H3    ; P = H3*X3
    LTD X2    ; (LTD = Load and Accumulate)
    MPY H2
    ...
• Two instructions per tap, but requires unrolling
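To make the LT/MPY/LTD pattern concrete, the following sketch mimics the register-level behavior described in the comments of the listing above. It is an illustrative model, not TI code: the function name `fir_lt_mpy_ltd` is hypothetical, `x[k]` stands for the memory word holding x(n-k), and the final accumulation of the last product is an assumption, since the listing is elided before that point.

```python
def fir_lt_mpy_ltd(x, h):
    """One FIR output using the LT / MPY / LTD pattern.

    x[k] holds sample x(n-k), h[k] the matching coefficient.
    The delay line x is shifted in place, as LTD does in memory.
    """
    last = len(x) - 1
    acc = 0
    t = x[last]                # LT  X4 : load T with the oldest sample
    p = t * h[last]            # MPY H4 : P = H4 * X4
    for k in range(last - 1, -1, -1):
        t = x[k]               # LTD Xk : load T with x(n-k) ...
        x[k + 1] = x[k]        #   ... move that sample one slot down the line ...
        acc += p               #   ... and accumulate the previous product
        p = t * h[k]           # MPY Hk : P = Hk * Xk
    acc += p                   # assumed final accumulate of the last product
    return acc


samples = [1, 2, 3, 4, 5]      # x(n), x(n-1), ..., x(n-4)
coeffs = [10, 20, 30, 40, 50]
print(fir_lt_mpy_ltd(samples, coeffs))   # 550
print(samples)                           # [1, 1, 2, 3, 4]
```

After the call, slot 0 is free to receive the next input sample, which is why the data move folded into LTD saves a separate shifting pass over the delay line.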

DSP Architectural Features
DSP Memory
• FIR tap implies multiple memory accesses
• DSPs require multiple data ports
• Some DSPs have ad hoc techniques to reduce memory bandwidth demand:
  – Instruction repeat buffer: do 1 instruction 256 times
  – Often disables interrupts, thereby increasing interrupt response time
• Some recent DSPs have instruction caches
  – Even then may allow programmer to “lock in” instructions into cache
  – Option to turn cache into fast program memory
• Usually DSPs have no data caches (for better real-time performance predictability).
• May have multiple data memories, e.g. one for signal data samples and one for filter coefficients

Conventional “Von Neumann” memory
[Figure: conventional Von Neumann memory organization.]

HARVARD MEMORY ARCHITECTURE in DSP
[Figure: PROGRAM MEMORY (ROM/EPROM/FLASH?) on the P DATA bus; data memory banks (SRAM) X MEMORY and Y MEMORY on the X DATA and Y DATA buses, e.g. one for signal data samples and one for filter coefficients; GLOBAL.]
Multiple memory banks and buses

Memory Architecture Comparison
DSP Processor
• Harvard architecture
• 2-4 memory accesses/cycle
• No caches: on-chip SRAM (for real-time performance predictability)
General-Purpose Processor
• Von Neumann architecture
• Typically 1 access/cycle
• Use caches (makes real-time performance harder to predict)
[Figure: DSP side shows a Processor with separate Program Memory and Data Memory; GPP side shows a Processor with a single Memory.]

TI TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture
[Figure: memory block diagram. Callouts: instruction cache; separate data and program paths; multiple memory banks and buses.]

TI 320C62x/67x DSP (1997) – (Fourth Generation DSP)
[Figure: block diagram.]

DSP ISA Features
DSP Addressing Modes
• Have standard addressing modes: immediate, displacement, register indirect
• Want to keep MAC datapath busy
• Assumption: any extra instructions imply clock cycles of overhead in inner loop => complex addressing is good
  – To match data access patterns in DSP algorithms
  – And to reduce the number of instructions (code size)
• Autoincrement/autodecrement register indirect
  – lw r1, 0(r2)+ => r1 <- M[r2]; r2 <- r2+1
  – Option to do it before addressing, positive or negative
• “Bit reverse” addressing mode
• “Modulo” or “circular” addressing
=> Don’t use normal datapath to calculate fancy addressing modes:
  – Use dedicated address generation units (related DSP architectural feature)

DSP ISA Features
DSP Addressing: FFT
• FFTs start or end with data in butterfly order (bit-reversed addressing):
    0 (000) => 0 (000)
    1 (001) => 4 (100)
    2 (010) => 2 (010)
    3 (011) => 6 (110)
    4 (100) => 1 (001)
    5 (101) => 5 (101)
    6 (110) => 3 (011)
    7 (111) => 7 (111)
• How to avoid overhead of address checking instructions for FFT?
• Have an optional “bit reverse” addressing mode for use with autoincrement addressing
• Thus most DSPs have “bit reverse” addressing for radix-2 FFT
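The mapping in the table above can be reproduced in software. The sketch below shows two views of it, both with hypothetical function names: `bit_reverse` reverses the index bits directly, while `next_bit_reversed` steps from one bit-reversed address to the next by adding half the buffer size with the carry propagating toward the low-order bits, which is one common way to pair bit reversal with autoincrement. That reverse-carry formulation is offered as an illustration, not as a description of a specific DSP's address unit.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def next_bit_reversed(addr, bits):
    """Advance a bit-reversed offset: add N/2 with carry moving rightward."""
    bit = 1 << (bits - 1)
    while bit and (addr & bit):      # clear set bits from the top down (the carry)
        addr ^= bit
        bit >>= 1
    return addr | bit                # set the first clear bit; wraps to 0 at the end


print([bit_reverse(i, 3) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]

seq, addr = [], 0
for _ in range(8):
    seq.append(addr)
    addr = next_bit_reversed(addr, 3)
print(seq)                                      # [0, 4, 2, 6, 1, 5, 3, 7]
```

On a processor without this addressing mode, the equivalent of `bit_reverse` would have to run as extra instructions for every element accessed.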

DSP ISA Features
Bit Reversed Addressing
[Figure: data flow in the radix-2 decimation-in-time FFT algorithm.]

DSP Addressing: Circular Buffers and Addressing
• DSPs deal with continuous I/O
• Often interact with an I/O buffer (delay lines)
• To save memory, buffers are often organized as circular buffers
• What can be done to avoid the overhead of address checking instructions for a circular buffer?
• Option 1: Keep a start register and an end register per address register for use with autoincrement addressing; reset to start when the end of the buffer is reached
• Option 2: Keep a buffer length register, assuming the buffer starts on an aligned address; reset to start when addressing reaches the end
• Every DSP has “modulo” or “circular” addressing
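Both options amount to a wrapped pointer update that the address hardware performs for free. The sketch below models them in software; the function names are hypothetical, `end` is taken to be inclusive, and for Option 2 the buffer is assumed to start on a boundary that is a multiple of the smallest power of two not less than its length (one plausible reading of "aligned").

```python
def advance_start_end(addr, step, start, end):
    """Option 1: explicit start/end registers (end inclusive)."""
    length = end - start + 1
    return start + (addr - start + step) % length


def advance_length(addr, step, length):
    """Option 2: only a length register; the base is recovered from alignment."""
    size = 1 << (length - 1).bit_length()   # power-of-two alignment boundary
    base = addr & ~(size - 1)               # aligned start implied by addr
    return base + (addr - base + step) % length


# A 5-entry delay line at an arbitrary (unaligned) address, Option 1.
print(hex(advance_start_end(0x207, 1, 0x203, 0x207)))   # 0x203

# The same length at an aligned address, Option 2.
addr = 0x100
visited = []
for _ in range(7):
    visited.append(addr)
    addr = advance_length(addr, 1, 5)
print([hex(a) for a in visited])
# ['0x100', '0x101', '0x102', '0x103', '0x104', '0x100', '0x101']
```

Option 2 trades the flexibility of arbitrary placement for one fewer register per pointer, since the start address never has to be stored.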

DSP ISA Features
Circular Buffers Addressing Support
• Every DSP has a “modulo” or “circular” addressing mode. Why?
• Allows for cycling through:
  – Delay elements (signal samples)
  – Filter coefficients in data memory
• Instructions accommodate three elements:
  – Buffer address
  – Buffer size
  – Increment

DSP Architectural Features
Address Calculation for DSPs
• Dedicated address generation units
• Support modulo and bit-reversal arithmetic
• Often duplicated to calculate multiple addresses per cycle

Addressing Comparison
DSP Processor
• Dedicated address generation units
• Specialized addressing modes (DSP ISA feature), e.g.:
  – Autoincrement
  – Modulo (circular)
  – Bit-reversed (for FFT)
• Good immediate data support
General-Purpose Processor
• Often, no separate address generation units
• General-purpose addressing modes (GPP ISA feature)

DSP ISA Features
DSP Instructions and Execution
• May specify multiple operations in a single instruction, to reduce the number of instructions and the code size
  – e.g. a compound instruction may perform: multiply + add + load + modify address register
• Must support Multiply-Accumulate (MAC)
• Need parallel move support
• Usually have special loop support to reduce branch overhead
  – Loop an instruction or sequence
  – 0 value in register usually means loop the maximum number of times
  – When the loop count is calculated, must make sure a result of 0 is not meant as zero iterations
• May have saturating shift left arithmetic
• May have conditional execution to reduce branches (in 4th generation VLIW DSPs)

DSP ISA Features
DSP Low/Zero Overhead Loops
• Example FIR inner loop on TI TMS320C54xx
[Figure: code listing; callouts: number of filter taps, repeat.]
• In ADSP 2100: “DO <addr> UNTIL condition”
    DO X
    ...
    X
• Address generation:
    PCS = PC + 1
    if (PC == X && !condition) PC = PCS
    else PC = PC + 1
• Lowers loop overhead
  – Eliminates a few instructions in loops
  – Important in loops with small bodies

Instruction Set Comparison
DSP Processor ISA
• Specialized, complex instructions (e.g. MAC)
• Multiple operations per instruction
    mac x0,y0,a  x:(r0)+,x0  y:(r4)+,y0
  Code size = 16 bits
• Zero or reduced overhead loops.
General-Purpose Processor ISA
• General-purpose instructions (less complex)
• Typically one operation per instruction
    mov *r0,x0
    mov *r1,y0
    mpy x0,y0,a
    add a,b
    mov y0,*r2
    inc r0
    inc r1
  Code size = 7 x 32 = 224 bits (14X)
• No support for zero or reduced overhead loops

DSP Architectural Features
Specialized Peripherals for DSPs
• Synchronous serial ports
• Parallel ports
• Timers
• On-chip A/D, D/A converters
• Host ports
• Bit I/O ports
• On-chip DMA controller
• Clock generators
• On-chip peripherals often designed for “background” operation, even when core is powered down.
[Figure: System on Chip (SoC): DSP Core with A/D Converter, D/A Converter, Instruction Memory, Data Memory and Serial Ports.]

TI TMS320C203/LC203 Block Diagram
DSP Core Approach - 1995
[Figure: block diagram. Callout: integrated DSP peripherals.]

Summary of Architectural Features of DSPs
• Data path configured for DSP
  – Fixed-point arithmetic (most common: 95% of all DSPs)
  – Fast MAC (multiply-accumulate)
• Multiple memory banks and buses
  – Harvard architecture
  – Multiple data memories
  – Dedicated address generation units
• Specialized addressing modes
  – Bit-reversed addressing
  – Circular buffers
• Specialized instruction set and execution control
  – Zero-overhead loops
  – Support for MAC
• Specialized peripherals for DSP (SoC)
• Avoiding dynamic processor architectural features that make real-time performance harder to predict (e.g. dynamic scheduling, hardware speculation, branch prediction, cache).
  – Why? To achieve predictable real-time performance
THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN (or algorithm driven, DSP algorithms in this case).

DSP Software Development Considerations
• Different from general-purpose software development:
  – Resource-hungry, complex algorithms.
  – Specialized and/or complex processor architectures.
  – Severe cost/storage limitations.
  – Hard real-time constraints.
  – Optimization is essential (program in DSP assembly).
  – Increased testing challenges.
• Essential tools:
  – Assembler, linker.
  – Instruction set simulator.
  – HLL code generation: C compiler.
  – Debugging and profiling tools.
• Increasingly important (HLL/tools becoming more mature and gaining popularity):
  – DSP software libraries.
  – Real-time operating systems.

Classification of Current DSP Architectures
(Ordered from lower cost/power to higher cost/power and performance.)
• Modern Conventional DSPs (Second Generation):
  – Similar to the original DSPs of the early 1980s
  – Single instruction/cycle. Example: TI TMS320C54x
  – Complex instructions, not compiler friendly
  – Usually one MAC unit
• Enhanced Conventional DSPs (Third Generation):
  – Add parallel execution units: SIMD operation
  – Complex, compound instructions. Example: TI TMS320C55x
  – Not compiler friendly
  – Usually more than one MAC unit
• Multiple-Issue DSPs (Fourth Generation):
  – VLIW. Example: TI TMS320C62xx, TMS320C64xx
    • Simpler (RISC-like, fixed-width) instructions than conventional DSPs; more instructions and instruction bandwidth needed
    • More compiler friendly, but higher cost/power
    • SIMD instruction support added to recent DSPs of this class
  – Superscalar. Example: LSI Logic ZSP400, ZSP500
DSPs from all three of these generations are still available today.

A Conventional DSP: TI TMS320C54xx (Second Generation DSP)
• 16-bit fixed-point DSP.
• Issues one 16-bit instruction/cycle
• Modified Harvard memory architecture
• Has one MAC unit
• Peripherals typical of conventional DSPs:
  – 2-3 synch. serial ports, parallel port
  – Bit I/O, timer, DMA
• Inexpensive (100 MHz ~$5 qty 10K).
• Low power (60 mW @ 1.8 V, 100 MHz).

A Current Conventional DSP: TI TMS320C54xx (Second Generation DSP)
[Figure: block diagram. Callout: one MAC unit.]

An Enhanced Conventional DSP: TI TMS320C55xx (Third Generation DSP)
• The TMS320C55xx is based on Texas Instruments' earlier TMS320C54xx family, but adds significant enhancements to the architecture and instruction set, including:
  – Two instructions/cycle (limited VLIW?)
    • Instructions are scheduled for parallel execution by the assembly programmer or compiler.
  – Two MAC units.
• Complex, compound instructions:
  – Assembly source code compatible with C54xx
  – Mixed-width instructions: 8 to 48 bits.
  – 200 MHz @ 1.5 V, ~130 mW, $17 qty 10k
• Poor compiler target.

An Enhanced Conventional DSP: TI TMS320C55xx (Third Generation DSP)
[Figure: block diagram. Callout: 2 MAC units.]

Multiple-Issue DSPs
16-bit Fixed-Point VLIW DSP: TI TMS320C6201 Revision 2 (1997) (Fourth Generation DSP)
The TMS320C62xx is the first fixed-point DSP processor from Texas Instruments that is based on a VLIW-like architecture, which allows it to execute up to eight 32-bit RISC-like instructions per clock cycle.
• More compiler friendly
• Higher cost/power
• SIMD instruction support added to recent DSPs of this class (TMS320C64xx)
• TMS320C67xx: floating-point version
[Figure: C6201 CPU megamodule block diagram. Program Cache / Program Memory (32-bit address, 256-bit data, 512K bits RAM); Program Fetch, Instruction Dispatch, Instruction Decode; Data Path 1 (A Register File; units L1, S1, M1, D1) and Data Path 2 (B Register File; units D2, M2, S2, L2); Control Registers, Control Logic, Test, Emulation, Interrupts; Data Memory (32-bit address; 8-, 16-, 32-bit data; 512K bits RAM); Pwr Dwn, Host Port Interface, 4-DMA, Ext. Memory Interface, 2 Timers, 2 multichannel buffered serial ports (T1/E1).]

TI TMS320C62xx Internal Memory Architecture
• Separate internal program and data spaces
• Program
  – 16K 32-bit instructions (2K fetch packets)
  – 256-bit fetch width
  – Configurable as either
    • Direct-mapped cache
    • Memory-mapped program memory
• Data
  – 32K x 16
  – Single ported, accessible by both CPU data buses
  – 4 x 8K 16-bit banks
    • 2 possible simultaneous memory accesses (4 banks)
    • 4-way interleave; banks and interleave minimize access conflicts
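Why interleaving helps can be seen with a toy bank-conflict check. This is a sketch under a loudly stated assumption: banks are selected by the low-order bits of the 16-bit halfword address (low-order interleave), which is a common scheme but is not confirmed by the slide and may differ from the actual C62xx mapping. The names `bank_of` and `conflicts` are hypothetical, and accesses wider than 16 bits (which would touch more than one bank) are ignored.

```python
BANKS = 4            # 4 x 8K 16-bit banks, per the slide
BYTES_PER_WORD = 2   # each bank is 16 bits wide


def bank_of(byte_addr):
    """Assumed low-order interleave: consecutive halfwords hit consecutive banks."""
    return (byte_addr // BYTES_PER_WORD) % BANKS


def conflicts(addr_a, addr_b):
    """Two same-cycle 16-bit accesses collide if they map to the same bank."""
    return bank_of(addr_a) == bank_of(addr_b)


print(conflicts(0x0000, 0x0002))   # False: neighbouring halfwords, banks 0 and 1
print(conflicts(0x0000, 0x0008))   # True: four halfwords apart, both bank 0
```

Under this assumption two streams walking through memory at unit stride only collide when their start addresses are a multiple of four halfwords apart, which the programmer or compiler can arrange to avoid.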

Fourth Generation DSP
TI TMS320C62xx Datapaths
[Figure: datapath diagram. Register file A (A0 - A15) with functional units L1, S1, M1, D1; register file B (B0 - B15) with functional units D2, M2, S2, L2; cross paths 1X and 2X; load data DDATA_I1 and DDATA_I2; store data DDATA_O1 and DDATA_O2; addresses DADR1 and DADR2; 40-bit write paths (8 MSBs); 40-bit read paths/store paths.]

TI TMS320C62xx Functional Units (Statically Scheduled)
• L-Unit (L1, L2)
  – 40-bit integer ALU, comparisons
  – Bit counting, normalization
• S-Unit (S1, S2)
  – 32-bit ALU, 40-bit shifter
  – Bitfield operations, branching
• M-Unit (M1, M2)
  – 16 x 16 -> 32
• D-Unit (D1, D2)
  – 32-bit add/subtract
  – Address calculations

TI TMS320C62xx Instruction Packing (Statically Scheduled VLIW)
Advanced 8-way VLIW
• Fetch packet
  – CPU fetches 8 instructions/cycle
• Execute packet
  – CPU executes 1 to 8 instructions/cycle
  – Fetch packets can contain multiple execute packets
• Parallelism determined at compile / assembly time
• Examples (each fetch packet holds instructions A B C D E F G H):
  – 1) 8 parallel instructions
  – 2) 8 serial instructions
  – 3) Mixed serial/parallel groups:
    • A // B
    • C
    • D
    • E // F // G // H
• Reduces code size, number of program fetches, power consumption
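The split of a fetch packet into execute packets can be modeled in a few lines. On C6000 devices the grouping is carried by a per-instruction parallel bit (p-bit): a 1 means the next instruction executes in parallel with this one. In the sketch below the p-bits are simply given as a list of 0/1 flags rather than decoded from instruction words, and the function name `execute_packets` is hypothetical.

```python
def execute_packets(names, p_bits):
    """Group one fetch packet's instructions into execute packets.

    p_bits[i] == 1 means instruction i+1 issues in the same cycle as
    instruction i; a 0 closes the current execute packet.
    """
    packets, current = [], []
    for name, p in zip(names, p_bits):
        current.append(name)
        if not p:
            packets.append(current)
            current = []
    if current:                      # tolerate a trailing open group
        packets.append(current)
    return packets


fetch = list("ABCDEFGH")

# Example 1: 8 parallel instructions -> one execute packet
print(execute_packets(fetch, [1, 1, 1, 1, 1, 1, 1, 0]))
# Example 2: 8 serial instructions -> eight execute packets
print(len(execute_packets(fetch, [0] * 8)))          # 8
# Example 3: A // B, C, D, E // F // G // H
print(execute_packets(fetch, [1, 0, 0, 0, 1, 1, 1, 0]))
# [['A', 'B'], ['C'], ['D'], ['E', 'F', 'G', 'H']]
```

Because a serial instruction does not need to be padded out to a full-width word of NOPs, the same fetch packet can hold anywhere from one to eight cycles of work, which is the source of the code-size and fetch savings noted above.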

TI TMS320C62xx Pipeline Operation
Pipeline Phases
• Single-cycle throughput
• Operate in lock step
• Fetch
  – PG: Program Address Generate
  – PS: Program Address Send
  – PW: Program Access Ready Wait
  – PR: Program Fetch Packet Receive
• Decode
  – DP: Instruction Dispatch
  – DC: Instruction Decode
• Execute
  – E1 - E5: Execute 1 through Execute 5
[Figure: pipeline diagram. Execute packets 1 through 7 enter on successive cycles, each passing through PG PS PW PR DP DC E1 E2 E3 E4 E5.]

C62x Pipeline Operation
Delay Slots (Statically Scheduled VLIW)
• Delay slots: number of extra cycles until result is:
  – written to register file
  – available for use by subsequent instructions
  – Multi-cycle NOP instruction can fill delay slots while minimizing code size impact
• For better real-time performance predictability

    Instruction type     Execute stages     Delay slots
    Most instructions    E1                 No delay
    Integer multiply     E1 E2              1 delay slot
    Loads                E1 E2 E3 E4 E5     4 delay slots
    Branches             E1                 5 delay slots (branch target: PG PS PW PR DP DC E1)
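The delay-slot table translates directly into the bookkeeping a compiler or assembly programmer has to do on a statically scheduled machine. The sketch below uses the slot counts from the table; the dictionary keys and function names are hypothetical, and cycles are counted from the issue of the producing instruction.

```python
DELAY_SLOTS = {
    "single": 0,     # most instructions
    "multiply": 1,   # integer multiply
    "load": 4,
    "branch": 5,
}


def earliest_use_cycle(issue_cycle, kind):
    """First cycle in which a dependent instruction may issue."""
    return issue_cycle + 1 + DELAY_SLOTS[kind]


def nops_needed(kind, independent_cycles):
    """NOP cycles left after filling delay slots with independent work."""
    return max(0, DELAY_SLOTS[kind] - independent_cycles)


print(earliest_use_cycle(0, "load"))     # 5
print(nops_needed("load", 1))            # 3
print(nops_needed("multiply", 3))        # 0
```

Since nothing in the hardware stalls to cover these latencies, the counts are fixed at compile time, which is what makes the timing predictable.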

C6000 Instruction Set Features
Conditional Instruction Execution
• All instructions can be conditional (similar to Intel IA-64)
  – A1, A2, B0, B1, B2 can be used as conditions
  – Based on zero or non-zero value
  – Compare instructions can allow other conditions (<, >, etc.)
• Reduces branching
• Increases parallelism

C6000 Instruction Set Addressing Features
• Load-store architecture
• Two addressing units (D1, D2)
• Orthogonal
  – Any register can be used for addressing or indexing
• Signed/unsigned byte, half-word, double-word addressable
  – Indexes are scaled by type
• Register or 5-bit unsigned constant index

C6000 Instruction Set Addressing Features
• Indirect addressing modes
  – Pre-increment      *++R[index]
  – Post-increment     *R++[index]
  – Pre-decrement      *--R[index]
  – Post-decrement     *R--[index]
  – Positive offset    *+R[index]
  – Negative offset    *-R[index]
• 15-bit positive/negative constant offset from either B14 or B15
• Circular addressing
  – Fast and low cost: power-of-2 sizes and alignment
  – Up to 8 different pointers/buffers, up to 2 different buffer sizes
• Bit-reversal addressing
• Dual endian support

FIR Filter On TMS320C54xx vs. TMS320C62xx
[Figure: FIR filter code listings side by side. TMS320C54xx: 2nd Gen conventional DSP. TMS320C62xx: 4th Gen VLIW DSP; larger code size; two filter taps.]

TI TMS320C64xx
• Announced in February 2000, the TMS320C64xx is an extension of Texas Instruments' earlier TMS320C62xx architecture.
• The TMS320C64xx has 64 32-bit general-purpose registers, twice as many as the TMS320C62xx.
• The TMS320C64xx instruction set is a superset of that used in the TMS320C62xx and, among other enhancements, adds significant SIMD/media processing capabilities:
  – 8-bit operations for image/video processing (media processing SIMD).
• Introduced at 600 MHz clock speed (1 GHz now), but:
  – 11-stage pipeline with long latencies
  – Dynamic caches.
• $100 qty 10k.
• The only DSP family with compatible fixed and floating-point versions.

[Figure: memory use comparison. Annotations:]
• C64xx (also C62xx and C67xx): VLIW; higher memory use due to simpler (RISC-like, fixed-width) instructions than conventional DSPs, so more instructions and instruction bandwidth are needed
• Also VLIW but with variable-length instruction encoding (16-32 bits): less memory use than C64xx

Computational (XScale)

Multiple-Issue 4th Generation DSPs
Superscalar DSP: LSI Logic ZSP400
• A 4-way superscalar, dynamically scheduled, 16-bit fixed-point DSP core.
• 16-bit RISC-like instructions
• Separate on-chip caches for instructions and data
• Two MAC units, two ALU/shifter units
  – Limited SIMD support.
  – MACs can be combined for 32-bit operations.
• Possible disadvantage:
  – Dynamic behavior complicates DSP software development:
    • Ensuring real-time behavior
    • Optimizing code.

TI not actively improving their flagship FP DSP (fixed-point more important!)