Chapter 2, part 1: CPUs High Performance Embedded Computing Wayne Wolf © 2007 Elsevier

Topics
- CPU metrics.
- Categories of CPUs.
- CPU mechanisms.

Performance as a design metric
- Performance = speed:
  - Latency.
  - Throughput.
- Average vs. peak performance.
- Worst-case and best-case performance.

Other metrics
- Cost (area).
- Energy and power.
- Predictability.
- Security.

Flynn’s taxonomy of processors
- Single-instruction single-data (SISD): RISC, etc.
- Single-instruction multiple-data (SIMD): all processors perform the same operations.
- Multiple-instruction multiple-data (MIMD): homogeneous or heterogeneous multiprocessor.
- Multiple-instruction single-data (MISD).

Other axes of comparison
- RISC vs. CISC: instruction set style.
- Instruction issue width.
- Static vs. dynamic scheduling for multiple-issue machines.
- Vector processing.
- Multithreading.

Embedded vs. general-purpose processors
- Embedded processors may be optimized for a category of applications.
  - Customization may be narrow or broad.
- We may judge embedded processors using different metrics:
  - Code size.
  - Memory system performance.
  - Predictability.

RISC processors
- RISC generally means highly pipelinable, one instruction per cycle.
- Pipelines of embedded RISC processors have grown over time:
  - ARM7 has a 3-stage pipeline.
  - ARM9 has a 5-stage pipeline.
  - ARM11 has an 8-stage pipeline.
Figure: ARM11 pipeline [ARM05].

RISC processor families
- ARM: ARM7 is relatively simple, no memory management; ARM11 has memory management, other features.
- MIPS: MIPS32 4K has a 5-stage pipeline; 4KE family has DSP extension; 4KS is designed for security.
- PowerPC: 400 series includes several embedded processors; MPC7410 is a two-issue machine; 970FX has a 16-stage pipeline.

Digital signal processors
- First DSP was AT&T DSP16:
  - Hardware multiply-accumulate unit.
  - Harvard architecture.
- Today, DSP is often used as a marketing term.
- Modern DSPs are heavily pipelined.

Example: TI C5x DSP
- 40-bit arithmetic unit (32-bit values with 8 guard bits).
- Barrel shifter.
- 17 x 17 multiplier.
- Comparison unit for Viterbi encoding/decoding.
- Single-cycle exponent encoder for wide-dynamic-range arithmetic.
- Two address generators.
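
The role of the guard bits can be illustrated with a small model of a multiply-accumulate loop. This is only a sketch: the names `wrap_signed`, `saturate32`, and `mac_dot` are hypothetical, and saturating the result to 32 bits on write-back is an assumption, not a description of the actual C5x datapath.

```python
ACC_BITS = 40  # 32-bit value plus 8 guard bits, as listed on the slide


def wrap_signed(value, bits):
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def saturate32(value):
    """Clamp a wide accumulator value to the signed 32-bit range (assumed policy)."""
    return max(-(1 << 31), min((1 << 31) - 1, value))


def mac_dot(xs, ys):
    """Dot product of 16-bit samples accumulated in a 40-bit register."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = wrap_signed(acc + x * y, ACC_BITS)  # one multiply-accumulate step
    return acc


samples = [32767] * 4
coeffs = [32767] * 4
acc = mac_dot(samples, coeffs)
# The running sum exceeds 32 bits but still fits in the 40-bit accumulator.
print(acc, saturate32(acc))
```

With four full-scale products the sum no longer fits in 32 bits, yet the intermediate value stays exact because of the guard bits; overflow only has to be handled once, at the end.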

TI C55x microarchitecture
[Figure not reproduced.]

Parallelism extraction
- Static:
  - Use compiler to analyze program.
  - Simpler CPU.
  - Can make use of high-level language constructs.
  - Can’t depend on data values.
- Dynamic:
  - Use hardware to identify opportunities.
  - More complex CPU.
  - Can make use of data values.

Simple VLIW architecture
- Large register file feeds multiple function units.
Figure labels: E box; register file; ALU; load/store; FU.
Example instruction word: Add r1, r2, r3; Sub r4, r5, r6; Ld r7, foo; St r8, baz; NOP
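
How a compiler might fill such an instruction word can be sketched with a greedy, in-order packer. Everything here is an assumption made for illustration: the `Op` tuple, the four-slot layout in `SLOTS`, and the `pack_bundles` function are hypothetical, and the packer ignores operation latencies. It starts a new word whenever an operation needs a register written earlier in the same word or no matching function unit is free, and pads unused slots with NOP.

```python
from collections import namedtuple

# Hypothetical operation record: text, function-unit type, destination, sources.
Op = namedtuple("Op", "text unit dest srcs")

SLOTS = ("alu", "alu", "mem", "mem")  # assumed issue-slot layout, one per function unit


def pack_bundles(ops, slots=SLOTS):
    """Greedily pack operations, in program order, into VLIW instruction words."""
    bundles = []
    current = [None] * len(slots)
    written = set()  # registers written by operations already in the current word

    def flush():
        nonlocal current, written
        bundles.append([op.text if op else "NOP" for op in current])
        current = [None] * len(slots)
        written = set()

    for op in ops:
        conflict = bool(written & set(op.srcs)) or op.dest in written
        free = [i for i, unit in enumerate(slots)
                if unit == op.unit and current[i] is None]
        if conflict or not free:
            flush()  # close the current word and start an empty one
            free = [i for i, unit in enumerate(slots) if unit == op.unit]
        current[free[0]] = op
        if op.dest:
            written.add(op.dest)
    if any(current):
        flush()
    return bundles


program = [
    Op("add r1, r2, r3", "alu", "r1", ("r2", "r3")),
    Op("sub r4, r5, r6", "alu", "r4", ("r5", "r6")),
    Op("ld r7, foo", "mem", "r7", ()),
    Op("st r8, baz", "mem", None, ("r8",)),
    Op("add r9, r1, r7", "alu", "r9", ("r1", "r7")),  # depends on the first word
]
bundles = pack_bundles(program)
for word in bundles:
    print("; ".join(word))
```

The first four operations are independent and share one word; the last one reads r1 and r7, so it is placed in a second word whose remaining slots are NOPs.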

Clustered VLIW architecture
- Register file, function units divided into clusters.
Figure labels: cluster bus; execution; register file.

Superscalar processors
- Instructions are dynamically scheduled.
  - Dependencies are checked at run time in hardware.
- Used to some extent in embedded processors.
  - Embedded Pentium is two-issue in-order.

SIMD and subword parallelism
- Many special-purpose SIMD machines.
- Subword parallelism is widely used for video.
  - ALU is divided into subwords for independent operations on small operands.
- Vector processing is widely used for integer values.
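
Subword parallelism can be imitated in software to show the idea. The sketch below adds four 8-bit lanes packed into one 32-bit word while keeping carries from crossing lane boundaries; the helper names `pack`, `unpack`, and `add_bytes_swar` are hypothetical, and real hardware does this by cutting the carry chain inside the ALU rather than by masking.

```python
LANES_HIGH = 0x80808080  # top bit of each 8-bit lane
LANES_LOW = 0x7F7F7F7F   # low seven bits of each 8-bit lane


def pack(lanes):
    """Pack four 8-bit values into one 32-bit word, lane 0 in the low byte."""
    return sum((value & 0xFF) << (8 * i) for i, value in enumerate(lanes))


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


def add_bytes_swar(a, b):
    """Add four packed 8-bit lanes modulo 256 with no carry between lanes."""
    low = (a & LANES_LOW) + (b & LANES_LOW)  # carries stay inside each lane
    return (low ^ ((a ^ b) & LANES_HIGH)) & 0xFFFFFFFF  # fold in the top bits


a = pack([250, 1, 128, 255])
b = pack([10, 2, 128, 1])
result = unpack(add_bytes_swar(a, b))
print(result)  # [4, 3, 0, 0]
```

Each lane wraps around independently, which is the behavior wanted for independent operations on small operands such as pixels.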

Multithreading
- Low-level parallelism mechanism.
- Hardware multithreading alternately fetches instructions from separate threads.
- Simultaneous multithreading (SMT) fetches instructions from several threads on each cycle.
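
The alternating fetch of hardware multithreading can be sketched as a round-robin over per-thread instruction streams. This is a simplified model under stated assumptions: the function name `interleaved_fetch` is hypothetical, one instruction is fetched per cycle, and stalls are ignored.

```python
from collections import deque


def interleaved_fetch(threads):
    """Return (thread_id, instruction) pairs in round-robin fetch order."""
    ready = deque((tid, deque(stream))
                  for tid, stream in enumerate(threads) if stream)
    trace = []
    while ready:
        tid, stream = ready.popleft()
        trace.append((tid, stream.popleft()))  # fetch one instruction this cycle
        if stream:
            ready.append((tid, stream))        # thread rejoins the rotation
    return trace


trace = interleaved_fetch([["a0", "a1", "a2"], ["b0"], ["c0", "c1"]])
print(trace)
```

An SMT front end would differ by taking instructions from several of these streams within the same cycle instead of one stream per cycle.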

Available parallelism in multimedia applications (Talla et al.)
[Figure not reproduced.]

Operand characteristics in MediaBench (Fritts)
[Figure not reproduced.]

Dynamic behavior of loops in MediaBench (Fritts)
- Path ratio = (instructions executed per iteration) / (total number of loop instructions).
- MediaBench shows a small path ratio, which indicates considerable conditional behavior in loops.
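
A short calculation shows how the path ratio is read. The loop size and per-iteration counts below are made-up numbers for illustration, not MediaBench measurements, and `path_ratio` is a hypothetical helper.

```python
def path_ratio(executed_per_iteration, loop_instructions):
    """Fraction of a loop's static instructions executed in a typical iteration."""
    if loop_instructions <= 0:
        raise ValueError("loop must contain at least one instruction")
    return executed_per_iteration / loop_instructions


# Hypothetical loop: 40 static instructions, three profiled iterations.
executed_counts = [12, 10, 14]
average_executed = sum(executed_counts) / len(executed_counts)
ratio = path_ratio(average_executed, 40)
print(ratio)
```

A ratio well below 1 means each iteration skips most of the loop body, which is the signature of heavy conditional behavior.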

Dynamic voltage scaling (DVS)
- Power scales with V^2 while performance scales roughly as V.
- Reduce operating voltage, add parallel operating units to make up for lower clock speed.
- DVS doesn’t work in high-leakage processors.
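
The trade-off can be worked through with an idealized model. This sketch assumes energy per operation scales as V^2, speed scales linearly with V, and parallel units add no overhead; it ignores leakage entirely, which is exactly the case where DVS breaks down. The function name `dvs_tradeoff` is hypothetical.

```python
def dvs_tradeoff(voltage_scale):
    """Return (parallel units needed, relative power) at equal throughput.

    voltage_scale is the new voltage divided by the original voltage.
    """
    if not 0 < voltage_scale <= 1:
        raise ValueError("voltage_scale must be in (0, 1]")
    speed = voltage_scale               # per-unit performance scales roughly as V
    energy_per_op = voltage_scale ** 2  # switching energy per operation scales as V^2
    units = 1 / speed                   # units required to restore throughput
    throughput = units * speed          # back to 1.0 by construction
    return units, throughput * energy_per_op


units, power = dvs_tradeoff(0.5)
print(units, power)  # 2.0 0.25
```

Under these assumptions, halving the voltage and doubling the number of units keeps throughput constant while cutting dynamic power to a quarter.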

Dynamic voltage and frequency scaling (DVFS)
- Scale both voltage and clock frequency.
- Can use control algorithms to match performance to application, reduce power.
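
One simple control policy is to pick the slowest operating point that still meets the current demand. The sketch below is illustrative only: the table `OPERATING_POINTS` and both function names are hypothetical, and the power estimate uses the standard dynamic-power proportionality P ~ f * V^2 without leakage.

```python
# Hypothetical (frequency in MHz, voltage in V) pairs for an imaginary processor.
OPERATING_POINTS = [(200, 0.9), (400, 1.0), (600, 1.1), (800, 1.2)]


def choose_operating_point(required_mhz, points=OPERATING_POINTS):
    """Pick the lowest-frequency point that meets demand, else the fastest one."""
    for freq, volt in sorted(points):
        if freq >= required_mhz:
            return freq, volt
    return max(points)


def relative_power(freq, volt, reference=(800, 1.2)):
    """Dynamic power relative to the reference point, using P ~ f * V^2."""
    ref_freq, ref_volt = reference
    return (freq * volt ** 2) / (ref_freq * ref_volt ** 2)


point = choose_operating_point(350)
print(point, round(relative_power(*point), 3))
```

A real governor would also account for transition latency between operating points and would usually add hysteresis to avoid oscillating.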

Razor architecture
- Critical path is not always executed.
- Tune the clock to match the average-case path instead of the worst-case critical path.
- Uses a specialized latch to detect errors.
- Recovers only on errors, gains average-case performance.