Motivation • Mobile embedded systems are present in: – Cell phones – PDAs – MP3 players – GPS units
Mobile Computing Design Considerations • Low power • Real-time data processing • Small size • Low cost • Quick time to market
Metric Introduction • Processor specialization • Instruction set specialization • Interconnect • Memory specialization • Functional & data path units • Power specialization
Metric: Processor Specialization • Central controlling point of embedded system • Examples: – VLIW to perform multiple instructions in parallel. – RISC architecture
Metric: Instruction Set Specialization • Introduction of new instructions to extract optimal performance from the processor • Examples: – Multiply-accumulate – Vector operations
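As a rough illustration of why a multiply-accumulate (MAC) instruction matters, the sketch below models the inner loop of a dot product, where each step is one multiply plus one add into an accumulator. The function name and sample data are hypothetical and are not taken from any of the architectures in these slides.

```python
# Illustrative model of a multiply-accumulate (MAC) loop.
# The function name and the sample data are hypothetical.

def dot_product_mac(xs, ys):
    """Sum x*y over paired elements, one multiply and one add per step."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y  # a MAC-capable processor can do this step in one instruction
    return acc

samples = [1, 2, 3, 4]
coeffs = [4, 3, 2, 1]
result = dot_product_mac(samples, coeffs)
print(result)
```

Filters and transforms are built almost entirely from loops of this shape, which is why a single-instruction MAC is a common specialization.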
Metric: Interconnect • Provides means for different modules to communicate • Optimizations can lead to reduced complexity, cost, and power consumption
Metric: Memory Specialization • Specialization is achieved through optimization of number and size of memory banks, number and size of access ports • Optimizations can improve performance, power consumption, and chip area
Metric: Functional & Data Path Units • Functional units are often specialized hardware units implementing a frequently used software algorithm • Examples: – DSP co-processors, interrupt priority coprocessors, memory access modules, and timer modules
Metric: Power Specialization • Major concern in mobile systems • Kept under control by: – Using low voltage – Slow clock speed – Custom circuit solutions
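The effect of low voltage and slow clock speed can be sketched with the standard first-order model for CMOS dynamic power, P = a * C * V^2 * f. The capacitance, voltages, and frequencies below are hypothetical values chosen only to show the scaling; they are not measurements of any chip in these slides.

```python
# First-order CMOS dynamic power model: P = activity * C * V^2 * f.
# All numeric inputs below are hypothetical and only illustrate scaling.

def dynamic_power(capacitance_f, voltage_v, freq_hz, activity=1.0):
    """Return switching power in watts for the given operating point."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

base = dynamic_power(1e-9, 3.3, 100e6)   # reference operating point
low_v = dynamic_power(1e-9, 1.8, 100e6)  # same clock, lower supply voltage
slow = dynamic_power(1e-9, 1.8, 50e6)    # lower voltage and half the clock

ratio_v = low_v / base      # scales with (1.8 / 3.3) squared
ratio_both = slow / base    # halving the clock halves it again
print(round(ratio_v, 3), round(ratio_both, 3))
```

Because voltage enters squared, lowering the supply has a larger effect than lowering the clock by the same factor.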
Architectures to be discussed • M*CORE • D30V/MPEG • SuperENC • 1.3-GOPS Parallel DSP • IA-32 w/ Enhanced Data Streaming
M*CORE • Low power embedded applications • Wireless mobile devices • Cellular phones
M*CORE Processor Specialization • Simple RISC architecture • 4-stage pipeline • 16-bit instruction word length • Compiler designed in parallel with architecture • Barrel shifter built into ALU
M*CORE Instruction Set Specialization • Multimedia instructions – Multiple data transfers from memory to register and register to memory – Fast register saves • FF1 – Find First 1 – Finding highest priority interrupt in hardware
M*CORE Interconnect Specialization • 16-bit data bus to match 16-bit instruction word length – Reduces memory bandwidth, complexity, chip area layout, and power consumption • MDI – MCU-to-DSP Interface – Dual access memory messaging unit • General I/O bus for peripherals
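The use of a find-first-one instruction for interrupt dispatch can be sketched as follows. This is a behavioral model only: it assumes the scan runs from the most significant bit and returns the offset of the first set bit (or the register width when no bit is set), and it assumes bit 31 is the highest-priority source. Both assumptions, and the helper names, are illustrative rather than taken from the slides.

```python
# Behavioral sketch of a find-first-one (FF1) operation on a 32-bit value.
# Assumptions: scan from the MSB, return the offset of the first set bit,
# return the width when no bit is set, and treat bit 31 as highest priority.

def ff1(value, width=32):
    """Offset from the MSB to the first set bit, or width if value is zero."""
    value &= (1 << width) - 1
    if value == 0:
        return width
    return width - value.bit_length()

def highest_priority_source(pending):
    """Map a pending-interrupt bitmask to the highest pending bit number."""
    offset = ff1(pending)
    return None if offset == 32 else 31 - offset

pending = 0x410  # hypothetical mask: sources 10 and 4 are pending
print(ff1(pending), highest_priority_source(pending))
```

In software the same search needs a loop or a table; doing it in one instruction shortens interrupt latency.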
M*CORE Memory Specialization • Alternate register bank – Fast register saves for context switches
M*CORE Functional & Data Path Units • 32 channel programmable interrupt controller • Protocol timer • DSP core
M*CORE Power Specialization • 1.8 Volts • Uses 0.5 Watts • Power-aware pipeline • Programmable power states – Stop – Wait – Doze – Normal
M*CORE Summary • Low power and programmable power states make it ideal for mobile devices • Interface to built-in DSP core makes it ideal for cell phone applications
650 MHz IA-32 • Microprocessor designed to accelerate data-streaming applications • Three-dimensional graphics • Video encode/decode
650 MHz IA-32 Processor Specialization • IA-32 architecture • 70 new instructions • SIMD floating point data type • Improvements in regard to circuit implementation
650 MHz IA-32 Instruction Set Specialization • 70 new instructions – SIMD FP operations – Control for new 8-entry register file – Multimedia extension • 12 new integer instructions
650 MHz IA-32 Interconnect Specialization • Front Side Bus of 66, 100, or 133 MHz • Back Side Bus – Half the clock frequency for mobile and desktop applications – Full clock frequency for server/workstation applications
650 MHz IA-32 Memory Specialization • 3 new non-temporal store instructions with write combining buffers – Burst write protocol – Write data throughput of 1.066 Gbytes/sec on a 133 MHz bus • 4 new data prefetch instructions – Overlap memory accesses with computation, reducing cache miss penalties
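The quoted write throughput is consistent with the peak rate of a bus that moves 8 bytes per clock at a nominal 133 MHz. The sketch below works through that arithmetic; the 64-bit bus width and one-transfer-per-clock rate are assumptions used to reproduce the figure, not details stated on the slide.

```python
# Peak bus bandwidth = clock rate * bytes moved per clock.
# Assumptions: nominal 133.33 MHz clock, 64-bit (8-byte) data bus,
# one transfer per clock cycle.

BUS_CLOCK_HZ = 133.33e6
BUS_WIDTH_BYTES = 8

peak_bytes_per_sec = BUS_CLOCK_HZ * BUS_WIDTH_BYTES
peak_gbytes_per_sec = peak_bytes_per_sec / 1e9
print(f"{peak_gbytes_per_sec:.3f} Gbytes/sec")
```

Under these assumptions, the quoted figure would mean that write-combined burst stores keep the bus close to its peak rate.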
650 MHz IA-32 Functional Specialization • 8-entry register file – Reduces register starvation for SIMD unit – 128 bits wide • Four independent single-precision elements packed in parallel • Dedicated table-based lookup unit for reciprocal operations – Completes reciprocal operations in one clock cycle – Error of 1.5 * 2^-12
650 MHz IA-32 Low Power Usage • 1.4 V to 2.2 V at 650 MHz, close to room temperature
650 MHz IA-32 Performance • 1.5X to 2.0X performance boost for 3-D transform and lighting kernels • Real-time MPEG-2 video/audio encoding at 30 frames per second – Achieved through improvement to SIMD unit, at a cost of only 2% increase of unit area size
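To give a feel for what an error of 1.5 * 2^-12 means for the reciprocal lookup, the sketch below converts the bound to bits of precision and checks sample approximations against it. It reads the bound as a relative error, which is an assumption, and the helper name and test values are hypothetical.

```python
import math

# Interpret the quoted reciprocal error as a relative-error bound
# (an assumption) and express it as bits of precision.
MAX_REL_ERROR = 1.5 * 2 ** -12
bits_of_precision = -math.log2(MAX_REL_ERROR)

def within_bound(x, approx):
    """True if approx is within the relative-error bound of the exact 1/x."""
    exact = 1.0 / x
    return abs(approx - exact) <= MAX_REL_ERROR * abs(exact)

# Hypothetical approximations of 1/4 with small and large relative errors.
good = 0.25 * (1 + 2 ** -13)
bad = 0.25 * (1 + 2 ** -10)
print(round(bits_of_precision, 2), within_bound(4.0, good), within_bound(4.0, bad))
```

Roughly 11 bits is well short of the 24-bit single-precision significand, the usual trade of accuracy for one-cycle latency in a table-based unit.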
D30V/MPEG • Multimedia applications – Decoding MPEG-2
D30V/MPEG Processor Specialization • 2-way VLIW • Dual-issue RISC pipeline • 2-way assigned SIMD module • Pipeline has ability to re-route data through execution path
D30V/MPEG Instruction Set Specialization • Saturate and Add • DSP instructions built in – Modulo addressing – Block repeat – Multiply accumulate • Half-word instructions – Effectively double number of usable registers
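The difference between a saturating add and an ordinary wrapping add can be sketched as below. The 16-bit signed width is an assumption chosen to match half-word data, and the function names are hypothetical; the slides do not give the operand widths of the actual instruction.

```python
# Saturating vs. wrapping addition on signed 16-bit values.
# The 16-bit width is an assumption for illustration.

INT16_MIN, INT16_MAX = -32768, 32767

def saturating_add16(a, b):
    """Clamp the sum to the signed 16-bit range instead of overflowing."""
    return max(INT16_MIN, min(INT16_MAX, a + b))

def wrapping_add16(a, b):
    """Two's-complement add that silently wraps on overflow."""
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s & 0x8000 else s

print(saturating_add16(30000, 10000), wrapping_add16(30000, 10000))
```

For pixel and audio samples, clamping is the desired behavior: a wrapped sum turns a bright pixel dark, while a saturated one only clips.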
D30V/MPEG Interconnect Specialization • Chip layout specialized for decoding streaming MPEG data
D30V/MPEG Memory Specialization • 32 Kbyte data RAM • 64 Kbyte instruction RAM • 4 Kbyte RAM for Variable Length Encoder/Decoder (VLC/VLD) tables • Special Registers – MOD_S & MOD_E for modulo addressing – RPT_S, RPT_E, and RPT_C for looping
D30V/MPEG Functional Specialization • VLC/VLD (Variable Length Encoding/Decoding) units
D30V/MPEG Low Power Usage • 2.5 Volts at 243 MHz • Uses 2.0 Watts
D30V/MPEG Performance • 12% speedup from inter-pipe bypasses • Special VLC/VLD functional blocks speed up MPEG decoding
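Modulo addressing with start and end registers can be sketched as a pointer that wraps back to the start of a buffer when it passes the end. The sketch borrows the MOD_S and MOD_E names from the slide, but the inclusive end address and the exact wrap rule are assumptions; the slides do not specify the hardware behavior.

```python
# Circular-buffer address update in the style of modulo addressing.
# Assumptions: mod_e is the last valid address (inclusive), and the
# pointer wraps to mod_s after stepping past mod_e.

def next_address(addr, step, mod_s, mod_e):
    """Advance addr by step, wrapping inside [mod_s, mod_e]."""
    size = mod_e - mod_s + 1
    return mod_s + (addr - mod_s + step) % size

MOD_S, MOD_E = 0x100, 0x103  # hypothetical 4-entry buffer
addr = MOD_S
visited = []
for _ in range(6):
    visited.append(addr)
    addr = next_address(addr, 1, MOD_S, MOD_E)
print([hex(a) for a in visited])
```

Doing the wrap in the address unit removes a compare and branch from every iteration of a filter loop.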
1.3 GOPS Parallel DSP • Achieve real-time image processing capability • Employ data parallelism to achieve goal – High-level algorithms, non-parallelizable • Arithmetic encoding – Medium-level algorithms, moderately parallelizable • Contour tracking of binary images – Low-level algorithms, highly parallelizable • Filters and transforms • Data-independent control and data flow • 80% of MPEG-2, 60% of MPEG-4
1.3 GOPS Parallel DSP Processor Specialization • Central control unit – RISC based – Controls multiple SIMD units
1.3 GOPS Parallel DSP Instruction Set Specialization • VLIW instructions – 3 instructions per issue • 1 load/store of 16-bit data • 2 arithmetic operations on 16/32-bit data
1.3 GOPS Parallel DSP Interconnect Specialization • DMA/MCU (Direct Memory Access/Memory Control Unit) – Handles cache misses – Performs prefetch operations from matrix memory – Interfaces with external 64-bit data bus and 32-bit address bus for SRAM and DRAM modules
1.3 GOPS Parallel DSP Memory Specialization • Memory tailored to image processing needs – Provides parallel high bandwidth access to shared data with matrix shaped access patterns • Individual Cache Memory – Services irregular memory requests
1.3 GOPS Parallel DSP Functional Specialization • Multiple SIMD units – Currently 4 units for prototype – 16 units planned for future versions – SIMD approach has been extended with ASIMD, an autonomous instruction selection capability • Improves handling of conditional branches
1.3 GOPS Parallel DSP Low Power Usage • 3.3 Volts • Using 650 milliwatts
1.3 GOPS Summary • Sustained performance 380 MIPS – Around 90% utilization
SuperENC • MPEG-2 video encoder
SuperENC Processor Specialization • Software implemented RISC architecture – 5-stage pipeline – 81 MHz, 32-bit wide data/instruction path • Software implemented SIMD/SDIF (SDRAM Interface) modules
SuperENC Instruction Set Specialization • There is no instruction set specialization mentioned in the paper.
SuperENC Interconnect Specialization • SDIF – All memory access goes through SDIF – Relay data without going to external memory • Reduces memory bandwidth and power consumption
SuperENC Memory Specialization • Uses external RAM – Can access two 16 Mbit SDRAMs or one 64 Mbit SDRAM
SuperENC Functional Specialization • MPEG algorithm is broken up into hardware functional blocks – Examples • DCT, Discrete Cosine Transform • IDCT, Inverse Discrete Cosine Transform • ME, Motion Estimation • MC, Motion Compensation
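As a software illustration of what the DCT and IDCT blocks compute, the sketch below implements a one-dimensional orthonormal DCT-II and its inverse on an 8-sample row and checks that the round trip recovers the input. This is a plain arithmetic model, not the hardware design; MPEG-2 applies the transform to 8x8 blocks, and the sample values here are hypothetical.

```python
import math

# 1-D orthonormal DCT-II and its inverse, as a software model of the
# arithmetic a DCT/IDCT block performs. Sample data is hypothetical.

def _scale(k, n):
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(block):
    """Forward transform: samples to frequency coefficients."""
    n = len(block)
    return [
        _scale(k, n) * sum(
            block[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        for k in range(n)
    ]

def idct(coeffs):
    """Inverse transform: frequency coefficients back to samples."""
    n = len(coeffs)
    return [
        sum(
            _scale(k, n) * coeffs[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n)
        )
        for i in range(n)
    ]

row = [52, 55, 61, 66, 70, 61, 64, 73]
coeffs = dct(row)
restored = idct(coeffs)
print([round(v) for v in restored])
```

Each output needs a full sum of multiply-adds over the row, which is why encoders often move these transforms into dedicated hardware blocks.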
SuperENC Low Power Usage • 2.5 Volts internal • 3.3 Volts I/O • 1.5 Watts
SuperENC Summary • SuperENC makes use of many hardware functional blocks to implement the MPEG encoding algorithm
Metric Results • D30V/MPEG highest rated