Lecture 6: Embedded Processors
Embedded Computing Systems
Mikko

Topics
- Embedded microprocessor market.
- Categories of CPUs.
- RISC, DSP, and multimedia processors.

Demand for Embedded Processors
- Embedded processors account for:
  - Over 97% of total processors sold.
  - Over 60% of total sales from processors.
- Sales are expected to increase by roughly 15% each year.
High Performance Embedded Computing © 2007 Elsevier

Flynn’s taxonomy of processors
- Single-instruction single-data (SISD)
- Single-instruction multiple-data (SIMD)
- Multiple-instruction multiple-data (MIMD)
- Multiple-instruction single-data (MISD)
- What is an example of each?
- Which would you expect to see in embedded systems?
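The taxonomy above depends only on how many instruction streams and data streams a machine processes at once. A minimal sketch of that classification follows; the function name and the example machines in the comments are illustrative assumptions, not part of the lecture.

```python
def flynn_category(instruction_streams: int, data_streams: int) -> str:
    """Classify a machine by its number of concurrent instruction and data streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


# Illustrative (assumed) examples:
print(flynn_category(1, 1))  # a simple scalar microcontroller core
print(flynn_category(1, 4))  # one instruction applied to four subwords
print(flynn_category(2, 2))  # two cores running independent programs
```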

Other axes of comparison
- RISC vs. CISC: instruction set style.
- Instruction issue width.
- Static vs. dynamic scheduling for multiple-issue machines.
- Scalar vs. vector processing.
- Single-threaded vs. multithreading.
- A single CPU can fit into multiple categories.

Embedded vs. general-purpose processors
- Embedded processors may be customized for a category of applications.
  - Customization may be narrow or broad.
- We may judge embedded processors using different metrics:
  - Code size.
  - Energy efficiency.
  - Memory system performance.
  - Predictability.

Embedded RISC processors
- RISC processors often have simple, highly pipelinable instructions.
- Pipelines of embedded RISC processors have grown over time:
  - ARM7 has a 3-stage pipeline.
  - ARM9 has a 5-stage pipeline.
  - ARM11 has an 8-stage pipeline.
[Figure: ARM11 pipeline [ARM05].]

RISC processor families
- ARM:
  - ARM7 has in-order execution and no memory management or branch prediction.
  - ARM9 and ARM11 have out-of-order execution, memory management, and branch prediction.
- MIPS:
  - MIPS32 4K has a 5-stage pipeline.
  - 4KE family has a DSP extension.
  - 4KS is designed for security.
- PowerPC:
  - PowerPC 400 series includes several embedded processors.
  - Motorola and IBM offer superscalar versions of the PowerPC.

Embedded DSP Processors
- Embedded DSP processors are optimized to perform DSP algorithms: speech coding, filtering, convolution, fast Fourier transforms, discrete cosine transforms.
- DSP processors feature:
  - Deterministic execution times.
  - Fast multiply-accumulate instructions.
  - Multiple data accesses per cycle.
  - Specialized addressing modes.
  - Efficient support for loops and interrupts.
  - Efficient processing of “streaming” data.
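Several of these features show up together in a FIR filter: the inner loop is one multiply-accumulate per tap, and the delay line is walked with modulo (circular) addressing as samples stream in. The following is a software sketch only, with illustrative names; a DSP would perform the multiply, the add, and the address update in hardware, typically in a single cycle.

```python
def fir_filter(samples, coeffs):
    """Streaming FIR filter using a circular delay line and a MAC inner loop."""
    n_taps = len(coeffs)
    delay = [0] * n_taps  # circular buffer holding the most recent samples
    head = 0              # index where the newest sample is written
    out = []
    for x in samples:
        delay[head] = x
        acc = 0
        idx = head
        for c in coeffs:
            acc += c * delay[idx]        # one multiply-accumulate per tap
            idx = (idx - 1) % n_taps     # modulo addressing: step back, wrap around
        out.append(acc)
        head = (head + 1) % n_taps
    return out


# An impulse input reproduces the coefficients (the impulse response).
print(fir_filter([1, 0, 0, 0], [1, 2, 3]))  # [1, 2, 3, 0]
```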

Example: TI C55x/C54x DSPs
- 40-bit arithmetic (32-bit values + 8 guard bits).
- Barrel shifter.
- 17 x 17 multiplier.
- Two address generators.
- Lots of special-purpose registers and addressing modes.
- Coprocessors for compute-intensive functions, including pixel interpolation, motion estimation, and DCT/IDCT computations.
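The 8 guard bits matter for long accumulations: a 40-bit accumulator can absorb on the order of 2^8 = 256 additions of full-scale 32-bit values before it overflows, whereas a 32-bit accumulator can wrap after two. The sketch below models two's-complement wraparound in software; it is a generic illustration with assumed helper names, not a model of the actual C55x datapath.

```python
def wrap_signed(value, bits):
    """Wrap an integer to a two's-complement value of the given width."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def accumulate(values, bits):
    """Sum values in an accumulator that wraps at the given width."""
    acc = 0
    for v in values:
        acc = wrap_signed(acc + v, bits)
    return acc


largest = 2**31 - 1          # largest signed 32-bit value
values = [largest] * 256
exact = sum(values)

print(accumulate(values, 32) == exact)  # False: the 32-bit accumulator wraps
print(accumulate(values, 40) == exact)  # True: guard bits absorb the growth
```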

TI C55x microarchitecture

Parallelism extraction
- Static (VLIW):
  - Use compiler to analyze program.
  - Simpler CPU.
  - Can’t depend on data values.
- Dynamic (superscalar):
  - Use hardware to identify opportunities.
  - More complex CPU.
  - Can make use of data values.

VLIW architectures
- Each very long instruction word (VLIW) performs multiple operations in parallel.
  [Diagram: instruction fields Branch | Memory | Arithmetic | Logic | Vector]
- Needs a good compiler that understands the architecture.
- Allows deterministic execution times.
- Code growth can be reduced by allowing:
  - Operations within an instruction to be performed sequentially.
  - A given field to specify different types of operations.
  [Diagram: instruction fields Seq | Branch/Mem | Mem/Arith/Logic | Vector]

Simple VLIW architecture
- Large register file feeds multiple function units.
- Example instruction word: Add r1, r2, r3; Sub r4, r5, r6; Ld r7, foo; St r8, baz; NOP
[Diagram: register file feeding the E box, which contains ALU, load/store, and other function units (FU)]
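To make the compiler's job concrete, here is a minimal sketch of greedy, in-order packing of operations into instruction words, with unused slots filled by NOPs. The slot layout (two ALU, two memory, one branch), the tuple format, and the assumption that the first register operand is the destination are all hypothetical choices for illustration. A production VLIW compiler uses far more sophisticated scheduling.

```python
SLOTS = ["alu", "alu", "mem", "mem", "branch"]  # hypothetical slot layout


def pack_bundles(ops):
    """Pack (unit, dest, sources, text) operations into bundles, in program order.

    A new bundle starts when no slot of the needed type is free, or when the
    operation reads or rewrites a register written earlier in the same bundle.
    Every unit named in ops is assumed to appear in SLOTS.
    """
    bundles = []
    current = [None] * len(SLOTS)
    written = set()
    for unit, dest, sources, text in ops:
        free = [i for i, s in enumerate(SLOTS) if s == unit and current[i] is None]
        hazard = bool(written & set(sources)) or dest in written
        if not free or hazard:
            bundles.append(current)
            current = [None] * len(SLOTS)
            written = set()
            free = [i for i, s in enumerate(SLOTS) if s == unit]
        current[free[0]] = text
        if dest is not None:
            written.add(dest)
    if any(slot is not None for slot in current):
        bundles.append(current)
    return [[slot or "NOP" for slot in bundle] for bundle in bundles]


program = [
    ("alu", "r1", ["r2", "r3"], "Add r1, r2, r3"),
    ("alu", "r4", ["r5", "r6"], "Sub r4, r5, r6"),
    ("mem", "r7", [], "Ld r7, foo"),
    ("mem", None, ["r8"], "St r8, baz"),
]
for bundle in pack_bundles(program):
    print("; ".join(bundle))
```

The four independent operations fit in one word with a NOP in the unused branch slot, matching the slide's example. A dependent pair, such as a load followed by an add that consumes the loaded register, is split across two mostly empty words, which is where VLIW code growth comes from.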

Clustered VLIW architecture
- Register file, function units divided into clusters.
- What are advantages/disadvantages of having clusters in VLIW architectures?
[Diagram: clusters, each with its own execution units and register file, connected by a cluster bus]

TI C62x/C67x DSPs
- VLIW with up to 8 instructions/cycle.
- 32 32-bit registers.
- Function units:
  - Two multipliers.
  - Six ALUs.
- All instructions execute conditionally.

TI C6x data operations
- 8/16/32-bit arithmetic.
- 40-bit operations.
- Bit manipulation operations.
- C67x processors add floating-point arithmetic.

C6x block diagram
[Block diagram: program RAM/cache (512 Kbits), data RAM (512 Kbits), JTAG, bus, timers, execute, DMA, serial, PLL, data path 1/reg file 1, data path 2/reg file 2]

Texas Instruments C62x
N. Seshan, “High VelociTI processing [Texas Instruments VLIW DSP architecture]”, IEEE Signal Processing Magazine, v. 15, no. 2, pp. 86-101, 117, 1998.

Emerging DSP Architectures
- Parallelism at multiple levels:
  - Multiple processors
    - System-on-a-chip designs
  - Multiple simultaneous tasks
    - Multithreaded processors
  - Multiple instructions per cycle
    - Very Long Instruction Word (VLIW) architectures
  - Multiple operations per instruction
    - Single Instruction Multiple Data (SIMD) instructions
- Architecture/compiler pairs improve performance and help manage application complexity.

Superscalar processors
- Instructions are dynamically scheduled.
  - Dependencies are checked at run time in hardware.
- Used to some extent in embedded processors.
  - Embedded Pentium is two-issue in-order.
  - Some PowerPCs are superscalar.
- What advantages/disadvantages do VLIW processors have compared to superscalar processors?

SIMD and subword parallelism
- Many special-purpose SIMD machines.
  - All processors perform the same operation on different data.
- Subword parallelism is widely used for video.
  - ALU is divided into subwords for independent operations on small operands.
- Vector processing is another form of SIMD processing.
- These terms are often used interchangeably.

SIMD Instructions
- Recent multimedia processors commonly support Single Instruction Multiple Data (SIMD) instructions.
- The same operation is performed on multiple data operands using a single instruction:

      A3      A2      A1      A0
    + B3      B2      B1      B0
    = A3+B3   A2+B2   A1+B1   A0+B0

- Exploits low precision and high data parallelism of multimedia applications.
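The lane-wise addition above can be emulated in software on an ordinary 32-bit word, which shows what the subword hardware must guarantee: a carry out of one 8-bit lane must not spill into its neighbour. The sketch below uses the well-known mask-and-XOR trick for four unsigned 8-bit lanes with per-lane wraparound; the function names are illustrative. Real multimedia instruction sets do this in a single instruction and often offer saturating variants as well.

```python
def packed_add_u8(a: int, b: int) -> int:
    """Add four unsigned 8-bit lanes packed in 32-bit words, wrapping per lane."""
    # Add the low 7 bits of every lane; the sum fits in 8 bits, so no carry
    # can cross into the next lane.
    low = (a & 0x7F7F7F7F) + (b & 0x7F7F7F7F)
    # Fold the top bit of each lane back in with XOR (addition without carry-out).
    return (low ^ ((a ^ b) & 0x80808080)) & 0xFFFFFFFF


def pack_lanes(lanes):
    """Pack [x3, x2, x1, x0] (each 0..255) into one 32-bit word."""
    x3, x2, x1, x0 = lanes
    return (x3 << 24) | (x2 << 16) | (x1 << 8) | x0


def unpack_lanes(word):
    """Split a 32-bit word into [x3, x2, x1, x0]."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


a = pack_lanes([1, 2, 3, 250])
b = pack_lanes([10, 20, 30, 10])
print(unpack_lanes(packed_add_u8(a, b)))  # [11, 22, 33, 4]; 250 + 10 wraps to 4
```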

Operand characteristics in MediaBench

Dynamic behavior of loops in MediaBench
- The loops of media applications in many cases are not very deep.
- Path ratio = (instructions executed per iteration) / (total number of loop instructions).
- What does the path ratio reveal?
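As a small worked example of the path ratio definition, the sketch below averages the executed-instruction count over a trace of iterations. The loop size and per-iteration counts are made-up numbers chosen for illustration, not MediaBench measurements.

```python
def path_ratio(executed_per_iteration, total_loop_instructions):
    """Average instructions executed per iteration divided by the loop's static size."""
    if total_loop_instructions <= 0 or not executed_per_iteration:
        raise ValueError("need a non-empty trace and a positive loop size")
    average = sum(executed_per_iteration) / len(executed_per_iteration)
    return average / total_loop_instructions


# Hypothetical loop: 20 static instructions including an if/else, so each
# iteration executes only one side of the branch.
trace = [14, 14, 12, 16]
print(path_ratio(trace, 20))
```

A ratio near 1 means almost every instruction in the loop body runs on every iteration (little control flow inside the loop), while a low ratio means much of the body is skipped on a typical pass.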

TriMedia TM-1 characteristics
- Characteristics:
  - Floating-point support.
  - Sub-word parallelism support.

TriMedia TM-1
[Block diagram labels: memory interface, video in, video out, audio in, audio out, I 2]

TM-1 VLIW CPU
[Block diagram: register file, read/write crossbar, function units FU 1 ... FU 27]

Multithreading
- Low-level parallelism mechanism.
- Interleaved multithreading (IMT) alternately fetches instructions from separate threads.
  - Often used with VLIW and vector processors.
- Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle.
  - Often used with superscalar processors.
- What advantages/disadvantages does IMT have relative to SMT?
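The difference between the two fetch policies can be sketched with a toy model: IMT fetches from one thread per cycle in rotation, while SMT may fetch from several threads in the same cycle, up to an issue width. The function names, the round-robin policy, and the fixed thread priority in the SMT model are simplifying assumptions. Real processors also account for stalls, dependencies, and fairness.

```python
def imt_fetch(threads):
    """Interleaved multithreading: one instruction per cycle, rotating across threads.

    In this toy model a thread that has run out of instructions is skipped
    without costing a cycle.
    """
    positions = [0] * len(threads)
    remaining = sum(len(t) for t in threads)
    order = []
    tid = 0
    while remaining:
        if positions[tid] < len(threads[tid]):
            order.append((tid, threads[tid][positions[tid]]))
            positions[tid] += 1
            remaining -= 1
        tid = (tid + 1) % len(threads)
    return order


def smt_fetch(threads, width):
    """Simultaneous multithreading: up to `width` instructions per cycle,
    at most one from each thread, lower-numbered threads first."""
    if width < 1:
        raise ValueError("width must be at least 1")
    queues = [list(t) for t in threads]
    cycles = []
    while any(queues):
        fetched = []
        for tid, queue in enumerate(queues):
            if queue and len(fetched) < width:
                fetched.append((tid, queue.pop(0)))
        cycles.append(fetched)
    return cycles


threads = [["a0", "a1", "a2"], ["b0"]]
print(imt_fetch(threads))     # one (thread, instruction) pair per cycle
print(smt_fetch(threads, 2))  # one list of pairs per cycle
```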

Dynamic voltage scaling (DVS)
- Power scales with V^2 while performance scales roughly as V.
- Reduce operating voltage, add parallel operating units to make up for lower clock speed.
- DVS doesn’t work well in processors with high leakage power.
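The trade-off can be checked with back-of-the-envelope arithmetic using the standard CMOS dynamic power model P = C * V^2 * f. The sketch below assumes, as on the slide, that achievable clock frequency scales linearly with voltage, and it ignores leakage and the overhead of the extra hardware. All values are normalized and illustrative.

```python
def dynamic_power(c_eff, voltage, frequency):
    """Dynamic CMOS power: effective switched capacitance * V^2 * f."""
    return c_eff * voltage**2 * frequency


# Baseline: one unit at normalized voltage 1.0 and clock 1.0.
base_power = dynamic_power(1.0, 1.0, 1.0)
base_throughput = 1 * 1.0

# Scaled design: two units at half the voltage; the clock is assumed to
# drop to half as well (frequency roughly proportional to voltage).
units = 2
scaled_power = units * dynamic_power(1.0, 0.5, 0.5)
scaled_throughput = units * 0.5

print(base_throughput, scaled_throughput)  # same throughput: 1.0 1.0
print(scaled_power / base_power)           # 0.25
```

Under these assumptions the two-unit design delivers the same throughput at a quarter of the dynamic power. Leakage power does not shrink the same way and grows with the added hardware, which is why the technique loses its appeal in high-leakage processes.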

Dynamic voltage and frequency scaling (DVFS)
- Scale both voltage and clock frequency.
- Can use control algorithms to match performance to application, reduce power.
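One simple control policy, sketched here under stated assumptions, picks the slowest voltage/frequency pair that still meets the application's deadline. The operating-point table and the function name are hypothetical, and the policy assumes the cycle count of the upcoming work is known. Practical governors estimate load from past behaviour and account for transition costs.

```python
# Hypothetical (voltage in volts, frequency in Hz) operating points.
OPERATING_POINTS = [(0.8, 200e6), (1.0, 400e6), (1.2, 600e6)]


def pick_operating_point(cycles_needed, deadline_s, points=OPERATING_POINTS):
    """Return the lowest-frequency point that finishes the work by the deadline.

    Falls back to the fastest point if no point can meet the deadline.
    """
    for voltage, freq in sorted(points, key=lambda p: p[1]):
        if cycles_needed / freq <= deadline_s:
            return voltage, freq
    return max(points, key=lambda p: p[1])


# 3 million cycles due in 10 ms: 200 MHz would take 15 ms, 400 MHz takes 7.5 ms.
print(pick_operating_point(3e6, 0.010))  # (1.0, 400000000.0)
```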

Razor architecture
- Razor runs the clock faster than the worst case allows.
- Uses a specialized latch to detect errors.
- Recovers only on errors; gains average-case performance.