Lecture 15 Embedded Multiprocessor Architectures Embedded Computing Systems

Lecture 15: Embedded Multiprocessor Architectures
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf, High Performance Embedded Computing, © 2007 Elsevier

Topics
  • Overview and Motivation.
  • Embedded Multiprocessor Design Techniques.
  • Embedded Multiprocessor Architectures.
  • Processing Elements.
© 2006 Elsevier

Generic multiprocessors
  • Shared memory: the processing elements (PE, PE, …, PE) reach a common set of memories (mem, …, mem) through an interconnect network.
  • Message passing: each PE is paired with its own memory (mem/PE, mem/PE, …), and the PEs communicate over an interconnect network.
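The difference between the two organizations can be sketched in software. The following is a minimal illustration only, using Python threads as stand-ins for processing elements; the names `NUM_PES`, `ITEMS_PER_PE`, `shared_memory_sum`, and `message_passing_sum` are hypothetical and do not come from the lecture.

```python
import queue
import threading

NUM_PES = 4          # hypothetical number of processing elements
ITEMS_PER_PE = 1000  # hypothetical amount of work per PE


def shared_memory_sum() -> int:
    """All PEs update one shared location; a lock serializes access."""
    total = [0]                      # shared memory visible to every PE
    lock = threading.Lock()

    def pe() -> None:
        for _ in range(ITEMS_PER_PE):
            with lock:               # without the lock, updates could be lost
                total[0] += 1

    threads = [threading.Thread(target=pe) for _ in range(NUM_PES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def message_passing_sum() -> int:
    """Each PE works in private memory and sends its result as a message."""
    network = queue.Queue()          # stands in for the interconnect network

    def pe() -> None:
        local = 0                    # private memory, so no lock is needed
        for _ in range(ITEMS_PER_PE):
            local += 1
        network.put(local)           # explicit communication

    threads = [threading.Thread(target=pe) for _ in range(NUM_PES)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(network.get() for _ in range(NUM_PES))


if __name__ == "__main__":
    print(shared_memory_sum(), message_passing_sum())
```

Both functions compute the same total. The shared-memory version pays for synchronization on every update, while the message-passing version pays for one explicit transfer per PE.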

Design choices
  • Processing elements:
      - Number.
      - Type.
      - Homogeneous or heterogeneous.
  • Memory:
      - Size and configuration.
      - Shared or private memories.
  • Interconnection networks:
      - Topology.
      - Protocol.
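One way to keep these choices explicit during design-space exploration is to record them in a small data structure. The sketch below is only an illustration; the class and field names (`Platform`, `ProcessingElement`, `MemoryOrg`, `memory_kib`) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryOrg(Enum):
    SHARED = "shared"
    PRIVATE = "private"


@dataclass(frozen=True)
class ProcessingElement:
    kind: str                 # e.g. "RISC", "DSP", "accelerator"


@dataclass(frozen=True)
class Platform:
    pes: tuple                # number and type of processing elements
    memory_org: MemoryOrg     # shared or private memories
    memory_kib: int           # memory size
    topology: str             # interconnect topology, e.g. "bus" or "mesh"
    protocol: str             # interconnect protocol

    @property
    def homogeneous(self) -> bool:
        """True when every PE is of the same type."""
        return len({pe.kind for pe in self.pes}) <= 1


if __name__ == "__main__":
    candidate = Platform(
        pes=(ProcessingElement("RISC"), ProcessingElement("DSP")),
        memory_org=MemoryOrg.SHARED,
        memory_kib=256,
        topology="bus",
        protocol="split-transaction",
    )
    print(len(candidate.pes), candidate.homogeneous)
```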

Why embedded multiprocessors?
  • Real-time performance: segregate tasks to improve predictability and performance.
  • Low power/energy: segregate tasks to allow idling, segregate memory traffic.
  • Cost: several small processors may be more efficient than one large processor.

Example: cell phones
  • Variety of tasks:
      - Error detection and correction.
      - Voice compression/decompression.
      - Protocol processing.
      - Position sensing.
      - Music.
      - Cameras.
      - Web browsing.

Example: video compression
  • QCIF (176 x 144) used in cell phones and portable devices:
      - 11 x 9 macroblocks of 16 x 16 pixels.
      - Frame rate of 15 or 30 frames/sec.
  • Seven correlations per macroblock = 177,408 pixel comparisons per frame.
  • Feig/Winograd DCT algorithm uses 94 multiplications and 454 additions per 8 x 8 2-D DCT.
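The per-frame figure can be checked with a few lines of arithmetic. This sketch assumes the standard QCIF luminance size of 176 x 144, which is what 11 x 9 macroblocks of 16 x 16 pixels implies. The DCT totals additionally assume that every 8 x 8 luminance block in a frame is transformed, which the slide does not state. All function names are hypothetical.

```python
MACROBLOCK = 16               # macroblock edge in pixels
WIDTH, HEIGHT = 176, 144      # QCIF luminance frame (assumed)
CORRELATIONS_PER_MB = 7       # from the slide
DCT_MULTS, DCT_ADDS = 94, 454 # Feig/Winograd cost per 8 x 8 2-D DCT


def macroblocks_per_frame() -> int:
    return (WIDTH // MACROBLOCK) * (HEIGHT // MACROBLOCK)


def comparisons_per_frame() -> int:
    """Pixel comparisons for motion estimation in one frame."""
    return macroblocks_per_frame() * MACROBLOCK ** 2 * CORRELATIONS_PER_MB


def comparisons_per_second(fps: int) -> int:
    return comparisons_per_frame() * fps


def dct_ops_per_frame() -> tuple:
    """(multiplications, additions), assuming every 8 x 8 luminance block is transformed."""
    blocks = (WIDTH // 8) * (HEIGHT // 8)
    return blocks * DCT_MULTS, blocks * DCT_ADDS


if __name__ == "__main__":
    print(macroblocks_per_frame())       # 99
    print(comparisons_per_frame())       # 177408
    print(comparisons_per_second(30))    # 5322240
    print(dct_ops_per_frame())           # (37224, 179784)
```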

Austin et al.: portable supercomputer
  • Next-generation workloads on portable device:
      - Speech compression.
      - Video compression and analysis.
      - High-resolution graphics.
      - High-bandwidth wireless communications.
  • Workload is 10,000 SPECint = 16 x 2 GHz Pentium 4.
  • Power budget of 75 mW.
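To see how demanding these targets are, the workload and power budget can be combined into a required efficiency. The sketch below uses only the slide's numbers for the requirement. The per-processor desktop power passed to `efficiency_gap` is a hypothetical input for illustration, not a figure from the lecture.

```python
WORKLOAD_SPECINT = 10_000      # from the slide
POWER_BUDGET_W = 0.075         # 75 mW, from the slide
PENTIUM4_EQUIVALENTS = 16      # from the slide


def required_specint_per_watt() -> float:
    """Efficiency the portable device must reach."""
    return WORKLOAD_SPECINT / POWER_BUDGET_W


def efficiency_gap(desktop_watts_per_cpu: float) -> float:
    """Ratio of required efficiency to that of one desktop processor.

    desktop_watts_per_cpu is an assumed value supplied by the caller.
    """
    specint_per_cpu = WORKLOAD_SPECINT / PENTIUM4_EQUIVALENTS
    desktop_efficiency = specint_per_cpu / desktop_watts_per_cpu
    return required_specint_per_watt() / desktop_efficiency


if __name__ == "__main__":
    print(round(required_specint_per_watt()))   # SPECint per watt required
    print(efficiency_gap(75.0))                 # 75 W per CPU is hypothetical
```

Whatever desktop power figure is assumed, the gap is several orders of magnitude, which is the motivation for specialization rather than a scaled-down general-purpose processor.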

[Figure: Performance trends on desktop. Source: [Aus04], © 2004 IEEE Computer Society]

[Figure: Energy trends on desktop. Source: [Aus04], © 2004 IEEE Computer Society]

Specialization and multiprocessing
  • Many embedded multiprocessors are heterogeneous:
      - Processing elements.
      - Interconnect.
      - Memory.
  • Why use heterogeneous multiprocessors?
      - Some operations (8 x 8 DCT) are standardized.
      - Some operations are specialized.
      - High-throughput operations may require specialized units.
  • Heterogeneity reduces power consumption.
  • Heterogeneity improves real-time performance.

Multiprocessor design methodologies
  • Analyze workload that represents system’s usage.
      - May include multiple programs.
  • Platform-independent optimizations eliminate side effects due to reference software implementation.
  • Platform design is based on operations, memory, etc.
  • Software can be further optimized to take advantage of platform.

Cai and Gajski modeling levels
  • Implementation: corresponds directly to hardware.
  • Cycle-accurate computation: captures accurate computation times, approximate communication times.
  • Time-accurate communication: captures communication times accurately but computation times only approximately.
  • Bus-transaction: models bus operations but is not cycle-accurate.
  • PE-assembly: communication is untimed, PE execution is approximately timed.
  • Specification: functional model.
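The levels differ mainly in how accurately they time computation and communication, so they can be tabulated and queried. The table below is one reading of the slide, not an authoritative encoding. Treating the specification level as untimed and the implementation level as accurate on both axes is an interpretation, and the slide gives no computation timing for the bus-transaction level, so it is marked "unspecified". The names `ModelLevel`, `LEVELS`, and `levels_with` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelLevel:
    name: str
    computation: str     # timing accuracy of computation
    communication: str   # timing accuracy of communication


# Ordered from most abstract to most detailed (an interpretation of the slide).
LEVELS = [
    ModelLevel("specification", "untimed", "untimed"),
    ModelLevel("PE-assembly", "approximate", "untimed"),
    ModelLevel("bus-transaction", "unspecified", "approximate"),
    ModelLevel("time-accurate communication", "approximate", "accurate"),
    ModelLevel("cycle-accurate computation", "accurate", "approximate"),
    ModelLevel("implementation", "accurate", "accurate"),
]


def levels_with(computation=None, communication=None):
    """Names of the levels matching the requested timing accuracy."""
    return [
        lvl.name
        for lvl in LEVELS
        if (computation is None or lvl.computation == computation)
        and (communication is None or lvl.communication == communication)
    ]


if __name__ == "__main__":
    print(levels_with(communication="accurate"))
    print(levels_with(computation="accurate"))
```

A query such as `levels_with(communication="accurate")` is a quick way to pick the cheapest model that still answers a given question, for example whether a bus is saturated.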

Multiprocessor systems-on-chips
  • MPSoC is a complete platform for an application.
      - Platform is usually tailored for a particular application domain.
  • Generally heterogeneous processing elements.
  • Combine off-chip bulk memory with on-chip specialized memory.

Qualcomm MSM5100
  • Cell phone system-on-chip.
  • Two CDMA standards, analog cell phone standard (AMPS).
  • GPS, Bluetooth, music, mass storage.

[Figure: Qualcomm MSM5100 integration]

[Figure: Philips Viper Nexperia]

Viper Nexperia characteristics
  • Designed to decode 1920 x 1080 HDTV.
  • TriMedia runs video processing functions.
  • MIPS runs operating system.
  • Synchronous DRAM interface for bulk storage.
  • Variety of I/O devices.
  • Accelerators: image composition, scaler, MPEG-2 decoder, video input processors, etc.

Lucent Daytona
  • MIMD for signal processing apps.
  • Processing element is based on SPARC V8.
      - DSP extensions.
  • Reduced precision vector unit has 16 x 64-bit vector register file.
  • Reconfigurable 8 KB level 1 cache.
      - 16 banks configured as I-cache, D-cache, or scratchpad.
  • Daytona split transaction bus.
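The reconfigurable cache can be pictured as a per-bank role assignment. The sketch below assumes the 16 banks are equal-sized (512 bytes each for an 8 KB cache), which the slide does not state, and it models only the capacity split, not the hardware configuration mechanism. `BankRole` and `capacity_by_role` are hypothetical names.

```python
from collections import Counter
from enum import Enum

CACHE_BYTES = 8 * 1024                  # 8 KB level 1 cache, from the slide
NUM_BANKS = 16                          # from the slide
BANK_BYTES = CACHE_BYTES // NUM_BANKS   # 512, assuming equal-sized banks


class BankRole(Enum):
    ICACHE = "I-cache"
    DCACHE = "D-cache"
    SCRATCHPAD = "scratchpad"


def capacity_by_role(config):
    """Bytes available to each role for a 16-entry bank configuration."""
    if len(config) != NUM_BANKS:
        raise ValueError(f"expected {NUM_BANKS} bank roles, got {len(config)}")
    counts = Counter(config)
    return {role: counts[role] * BANK_BYTES for role in BankRole}


if __name__ == "__main__":
    # Hypothetical split for a kernel with a small loop and a large working buffer.
    config = ([BankRole.ICACHE] * 4
              + [BankRole.DCACHE] * 4
              + [BankRole.SCRATCHPAD] * 8)
    for role, size in capacity_by_role(config).items():
        print(role.value, size)
```

A scratchpad-heavy split like the one above trades cache capacity for predictable access times, which is the usual argument for software-managed memory in signal processing kernels.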

Lucent Daytona PE
  • SPARC V8 core
      - 5-stage pipeline.
      - Windowed register file: eight 16-entry register windows plus 16 global registers.

STMicro Nomadik
  • Designed for mobile multimedia.
  • Accelerators built around MMDSP+ core:
      - One instruction per cycle.
      - 16- and 24-bit fixed-point, 32-bit floating-point.

[Figure: STMicro Nomadik accelerators (audio and video)]

TI OMAP
  • Designed for mobile multimedia.
  • C55x DSP performs signal processing as slave.
  • ARM runs operating system, dispatches tasks to DSP.
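The master/slave division of labor can be sketched as a host that posts tasks to a mailbox and a worker that serves them. This is a generic illustration using Python threads and queues, not TI's actual ARM-to-DSP software interface. The names `dsp_worker` and `run_host`, and the sum-of-squares "signal processing" task, are hypothetical.

```python
import queue
import threading


def dsp_worker(mailbox, results):
    """Slave: wait for tasks from the host and process them in order."""
    while True:
        task = mailbox.get()
        if task is None:             # sentinel: host has no more work
            break
        name, samples = task
        energy = sum(s * s for s in samples)   # stand-in for a DSP kernel
        results.put((name, energy))


def run_host(tasks):
    """Master: dispatch tasks to the slave and collect the results."""
    mailbox = queue.Queue()
    results = queue.Queue()
    dsp = threading.Thread(target=dsp_worker, args=(mailbox, results))
    dsp.start()
    for task in tasks:
        mailbox.put(task)
    mailbox.put(None)
    dsp.join()
    return [results.get() for _ in tasks]


if __name__ == "__main__":
    print(run_host([("voice", [1, 2, 3]), ("audio", [4])]))
```

Because a single worker drains a FIFO mailbox, results come back in dispatch order. The host stays free to run the operating system and user interface while the worker handles the numerical load.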

[Figure: TI OMAP 5912]

Processing elements issues
  • How many do we need?
  • What types of processing elements do we need?
  • Analyze performance/power requirements of each process in the application.
  • Choose a processor type for each process.
  • Determine what processes should share processing elements.
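The last two steps can be illustrated with a simple allocation heuristic. The sketch below uses first-fit decreasing bin packing on per-process utilization. That is one common heuristic chosen here for illustration, not a method prescribed by the lecture. The process names and utilization figures are made up, and `Process`, `PE`, and `allocate` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    pe_type: str         # processor type chosen for this process
    utilization: float   # fraction of one PE of that type it needs


@dataclass
class PE:
    pe_type: str
    processes: list = field(default_factory=list)

    @property
    def load(self) -> float:
        return sum(p.utilization for p in self.processes)


def allocate(processes, max_load=1.0):
    """First-fit decreasing: share a PE when type matches and load allows."""
    pes = []
    for proc in sorted(processes, key=lambda p: p.utilization, reverse=True):
        for pe in pes:
            if pe.pe_type == proc.pe_type and pe.load + proc.utilization <= max_load:
                pe.processes.append(proc)
                break
        else:
            # No existing PE fits, so add a new one of the required type.
            pes.append(PE(proc.pe_type, [proc]))
    return pes


if __name__ == "__main__":
    workload = [
        Process("voice", "DSP", 0.5),
        Process("error_correction", "DSP", 0.4),
        Process("video", "DSP", 0.7),
        Process("protocol", "RISC", 0.3),
        Process("ui", "RISC", 0.2),
    ]
    for pe in allocate(workload):
        print(pe.pe_type, [p.name for p in pe.processes], round(pe.load, 2))
```

Lowering `max_load` below 1.0 leaves headroom for real-time guarantees at the cost of more processing elements, which is the trade-off the questions above are probing.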

Embedded Multiprocessor Questions
  • Of the embedded multiprocessors we discussed in this lecture, which one seemed
      - The most general purpose? Why?
      - The most application-specific? Why?
  • What are advantages and disadvantages of the configurable cache used in the Lucent Daytona architecture?
  • What benefits do the accelerators in the Viper Nexperia processor provide?
  • For what types of applications are accelerators most important?