Advanced Embedded Systems, Lecture 11: Multiprocessors in Embedded Systems (1)
- Embedded multiprocessors can be homogeneous or heterogeneous.
- A multiprocessor is made of:
  - processing elements (PEs);
  - memory blocks;
  - interconnection networks.
- Embedded multiprocessors (EMs) vs. typical multiprocessors:
  - Different types of PEs: PEs with different features; programmable and non-programmable PEs;
  - Memory blocks of different sizes; private and shared memory blocks;
  - Specialized interconnection networks;
  - Both have to offer high performance, but EMs must add:
    - Real-time performance: scientific multiprocessors improve average performance at the expense of predictability; EMs must offer predictable performance;
    - Low energy and power: EMs must frequently run at low energy and power levels; low power reduces heating problems and cost, while low energy consumption increases battery life; typical multiprocessors are less sensitive to power and energy consumption;
    - Cost-effectiveness: EMs must provide high performance without excessive hardware.
- Design techniques:
  - Heterogeneous multiprocessors are more energy efficient and cost effective than homogeneous multiprocessors;
  - Heterogeneous memory systems improve real-time performance;
  - Networks-on-chip.
- The combination of high performance, low power and real-time operation leads toward heterogeneous multiprocessors:
  - It is desirable to specialize the blocks of an EM: the processing elements, the memories and the interconnection network;
  - Specialization leads to lower power consumption; examples of operations needing specialized hardware:
    - Bit-level operations: on a CPU they require too many registers;
    - Intensive input/output operations: data must be read, processed and written to meet a tight deadline, for example in engine control;
  - Heterogeneity reduces power consumption because unnecessary hardware is removed; generalizing a function always requires additional hardware;
  - Drawback: specialization increases communication;
  - Using multiple CPUs can improve real-time performance; allocating critical processes to separate CPUs helps to meet deadlines;
  - Specialized memories and interconnections increase the predictability of a process's response time.
- Embedded multiprocessor design techniques:
  - Design methodologies;
  - Modeling and simulation.
- Multiprocessor design methodologies:
  - The program used to design and evaluate the EM is called the workload (the counterpart of benchmarks for computers); many such programs are not written for ESs (real-time performance, low power, limited memory) and their use may lead to wrong decisions; a workload must be tailored to the EM's requirements with platform-independent optimizations;
  - Next, platform-independent measurements must be performed to define an architecture; examples are the dynamic instruction count and data access patterns; they show how close the workload is to the EM that must be designed;
  - An initial candidate architecture is defined; platform-dependent characteristics are measured and the architecture is evaluated; if the platform is not appropriate, it is modified and new measurements are made; if it is appropriate, the blocks of the EM are designed;
  - The software is mapped onto the platform; during this phase, compilers and libraries may be useful; most of the optimizations are platform dependent; operations must be allocated to processing elements, data to memories and communications to the interconnection network.
- Multiprocessor modeling and simulation:
  - Most multiprocessor simulators are systems of communicating simulators; the component simulators model PEs, memory elements and interconnection networks; the multiprocessor simulator itself handles the communication between the component simulators;
  - The multiprocessor simulator can be built using parallel computing techniques:
    - Each component simulator is a process both in the multiprocessor simulator and in the host CPU's operating system;
    - The operating system provides the abstraction necessary for multiprocessing: each simulator has its own state, just as each PE in the implementation has its own state;
    - The simulator uses the host computer's communication mechanisms, such as semaphores and shared memory, to manage the communication between the component simulators;
  - Simulators for classical multiprocessors assume that all the PEs are of the same type; adapting them to heterogeneous multiprocessors requires additional software.
- Multiprocessor architectures:
  - The ESs are separate, or
  - The ESs are implemented on the same chip, also known as a multiprocessor system-on-chip (MPSoC).
- Philips Nexperia: an MPSoC for digital video and television applications:
  - It includes two processors: a MIPS PR3940 RISC CPU running the real-time operating system and a TriMedia TM32 VLIW processor for media operations;
  - It includes a synchronous DRAM that satisfies the requirements of the video memory; the memory controller is connected to the rest of the circuit through a bus;
  - The MIPS processor is connected to a fast bus, which is connected through a bridge to a slower bus for the low-speed peripherals; the TM32 processor has its own bus;
  - Various peripherals are implemented on the chip: a USB controller, 3 UARTs, 2 I2C interfaces, digital audio interfaces, general-purpose I/O pins;
  - The circuit contains special-purpose function units and accelerators for media applications:
    - An image composition engine, a scale unit, an MPEG-2 video decoder, two video input processors that can be used to receive the NTSC and PAL broadcast standards, a drawing engine;
    - These units bring efficiency by off-loading some work from the CPUs.
- TI OMAP multiprocessor:
  - It was designed for mobile multimedia applications: camera phones, portable imaging devices and so forth;
  - The OMAP architecture conforms to the OMAPI standard, which defines hardware and software interfaces for multimedia multiprocessors;
  - The figure shows the overall structure of the OMAP hardware/software architecture; it is based on a RISC processor (an ARM9) and a DSP (a TI C55x); the two processors communicate through a shared memory.
  - OMAP5912:
    - It contains a frame buffer for video as a separate block of memory, distinct from the main data and program memory; the frame buffer is on-chip, while the flash and SDRAM memories are off-chip;
    - There are 4 hardware mailboxes for multiprocessor communication; two are writable by the ARM9 and two are writable by the C55x; all are readable by either processor;
    - Each processor has some dedicated I/O devices; there are also some common devices accessible through a peripheral bridge.
- The components of an EM are:
  - Processing elements;
  - Memories;
  - Interconnection networks.
- The processing elements perform the computations; a PE may run only one process or several processes; frequently, an EM uses different kinds of processors to implement the PEs: programmable processors, hardwired processors, single-function blocks, etc.
- To determine the number of PEs and their types, the following design methodology is recommended:
  - Analyze each application to determine the performance and power requirements of each process in the application;
  - Choose a processor type for each process, usually from a predetermined set of processor types;
  - Determine which processes can share a CPU, to obtain the required number of PEs.
- Software performance analysis can be used to determine how fast a process will run on a particular type of CPU.
- Standard CPUs or configurable processors can be used.
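
The last step of the methodology, deciding which processes can share a CPU, can be viewed as a packing problem. The sketch below is only an illustration under assumptions that are not in the lecture: each process is described by a hypothetical CPU utilization figure (as software performance analysis might provide), a PE is considered full at utilization 1.0, and a first-fit decreasing heuristic is used. A real design would also have to check deadlines and power.

```python
from collections import defaultdict


def pes_per_type(processes, capacity=1.0):
    """Estimate how many PEs of each type are needed.

    processes: iterable of (name, cpu_type, utilization) tuples, where
    utilization is the assumed fraction of one CPU the process needs.
    Packing uses first-fit decreasing per CPU type (an assumed heuristic).
    """
    by_type = defaultdict(list)
    for _name, cpu_type, utilization in processes:
        if not 0.0 < utilization <= capacity:
            raise ValueError("utilization must be in (0, capacity]")
        by_type[cpu_type].append(utilization)

    counts = {}
    for cpu_type, loads in by_type.items():
        pes = []  # current load of each PE of this type
        for u in sorted(loads, reverse=True):
            for i, load in enumerate(pes):
                if load + u <= capacity + 1e-12:
                    pes[i] = load + u  # process shares an existing PE
                    break
            else:
                pes.append(u)  # no PE has room: add a new one
        counts[cpu_type] = len(pes)
    return counts


# Hypothetical workload: names, types and utilizations are made up.
example = [("video", "dsp", 0.6), ("audio", "dsp", 0.3),
           ("ui", "risc", 0.2), ("net", "risc", 0.9)]
print(pes_per_type(example))
```

In this made-up workload the two DSP processes fit on one PE, while the two RISC processes do not.
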
- The memory system is a classical bottleneck in computing: memories are slower than processors and, worse, processor clock rates are increasing much faster than memory cycle times are decreasing.
- Traditional parallel memory systems:
  - Used in classical multiprocessors; the memories are homogeneous;
  - Each bank is separately addressable;
  - If there are n banks, n accesses can be performed in parallel, giving the peak access rate; this rate is achieved only in particular cases, for example if the banks are accessed in the order 0, 1, 2, 3, …
  - In reality, the probability of a sequential access sequence of length k is $p(k) = \lambda (1 - \lambda)^{k-1}$, where λ is the probability of a nonsequential memory access (for example a branch).
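
To get a feel for how quickly long sequential runs become unlikely, the following sketch evaluates a geometric model of the access stream. It assumes that each access is nonsequential with probability λ independently of the others, so that a run of length k has probability $\lambda(1-\lambda)^{k-1}$; the function names and the cap of a run at n banks are choices made for this illustration.

```python
def p_sequential_run(k, lam):
    """Probability of a run of exactly k accesses: k - 1 sequential
    accesses followed by one nonsequential access (geometric model)."""
    if k < 1 or not 0.0 < lam <= 1.0:
        raise ValueError("need k >= 1 and 0 < lam <= 1")
    return lam * (1.0 - lam) ** (k - 1)


def mean_run_length(lam, n_banks):
    """Expected run length when a run cannot use more than n_banks banks.

    E[min(K, n)] = sum_{k=1..n} P(K >= k) = (1 - (1 - lam)^n) / lam.
    """
    if n_banks < 1 or not 0.0 < lam <= 1.0:
        raise ValueError("need n_banks >= 1 and 0 < lam <= 1")
    return (1.0 - (1.0 - lam) ** n_banks) / lam


for k in (1, 2, 4, 8):
    print(k, p_sequential_run(k, 0.25))
print("mean run over 8 banks:", mean_run_length(0.25, 8))
```

Under this model, adding banks beyond the mean run length brings diminishing returns, which is why the peak rate of n parallel accesses is rarely reached.
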
- Heterogeneous memory systems (HMSs) are preferred in EMs but can coexist with homogeneous memory systems:
  - HMSs improve real-time performance:
    - Common memories are adequate when only functionality matters, and less so when real-time performance and predictability are required;
    - If a memory block is shared by several PEs, they will contend for that memory; in general, one PE will have to wait for another PE to finish its access; in most cases it is not possible to predict when these conflicts will occur;
    - Conflicts can be avoided with certainty only if one, or a few, PEs access a memory, that is, if a specialized memory was provided for those PEs;
  - HMSs help to reduce power consumption:
    - One component of the power consumed by a memory access depends on the size of the memory block (because of the access time);
    - A heterogeneous memory can be built with smaller memory blocks, reducing the access time and thus the power consumption;
    - Energy per access also depends on the number of ports on the memory block, so reducing the number of units that can access a given part of memory reduces the energy consumption.
- Interconnection networks (INs):
  - Connect the PEs to the memories;
  - Terminology:
    - Client: a sender or receiver connected to a network;
    - Port: a connection to a network on a client;
    - Link: a connection between two clients;
    - Half-duplex and full-duplex: …
    - Topology: the organization of the links; it determines properties of the network;
  - Attributes for evaluating and comparing INs:
    - Throughput: the maximum available throughput from one node to another is of interest, as are the variations in data rates over time and the effect of those variations on network behavior;
    - Latency: the amount of time it takes a packet to travel from a source to a destination; when the latency varies, the best-case and worst-case latency are also important;
    - Energy consumption: a typical measure is the amount of energy required to send a bit through the network;
    - Area: influences the cost and the dynamic energy consumption (through the metal area of the wires); the total area is the metal area of the wires plus the silicon area of the transistors.
- The simplest interconnection network is the bus:
  - Small size, low performance, high energy consumption;
  - To estimate the performance, it is assumed that the bus is driven by a master clock;
  - With one word per bus transaction (one clock cycle for the word plus the overhead), the bus throughput is $T_1 = \frac{1}{P(1 + C)}$ words/sec, where P = clock period and C = number of clock cycles required for transaction overhead (addressing, etc.);
  - If the bus supports block transfers, the throughput for n-word blocks is $T_n = \frac{n}{P(n + C)}$ words/sec;
  - The main part of the energy consumption is dynamic; it is determined by the capacitance that must be driven;
  - The capacitance of a bus has two components: the bus wires and the loads at the clients; if the number of clients is large, this capacitance becomes significant;
  - The energy consumption may be high because of the length of the wires;
  - The bus is not recommended for large systems because it is easily saturated with traffic, so only a small number of PEs can be connected.
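
The effect of transaction overhead on bus throughput can be made concrete with a small calculation. This sketch assumes that one word is transferred per clock cycle and that each transaction pays C overhead cycles once, which gives n / (P (n + C)) words per second for n-word blocks; the function name and the numeric values are illustrative only.

```python
def bus_throughput(clock_period_s, overhead_cycles, words_per_block=1):
    """Words per second on a clocked bus.

    Assumes one word per clock cycle plus `overhead_cycles` paid once
    per transaction (addressing, etc.).
    """
    if clock_period_s <= 0 or overhead_cycles < 0 or words_per_block < 1:
        raise ValueError("invalid bus parameters")
    cycles_per_transaction = words_per_block + overhead_cycles
    return words_per_block / (clock_period_s * cycles_per_transaction)


P = 10e-9  # hypothetical 100 MHz bus clock
C = 3      # hypothetical overhead cycles per transaction
print("single word:", bus_throughput(P, C))
print("16-word block:", bus_throughput(P, C, words_per_block=16))
```

With these assumed numbers, block transfers amortize the overhead and approach the limit of one word per clock period.
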
- The crossbar, the most complex IN:
  - Is a fully connected network; it provides a path from every input port to every output port (example: a 4 x 4 crossbar);
  - Provides full connectivity for any combination of inputs and outputs;
  - Broadcast from an input to all outputs and multicast from an input to several selected outputs are possible;
  - The disadvantage is its size: for n inputs and n outputs, $n^2$ switches are necessary; however, because the switches are simple and small, crossbars for a moderate number of inputs and outputs (for example 8 x 8 with words of reasonable width) can be built in a modern VLSI chip; a 10000 x 10000 crossbar is not reasonable even for 1-bit-wide words.
  - If the number of inputs is too large for a given crossbar area, the solution is to use buffers;
  - Queues can be added to the inputs of the crossbar, with several traffic sources connected to a queue; a queue controller is needed to decide the order in which the packets enter the queue and what to do when the queue is full;
  - Buffers can be added to the switches; this increases the physical size but also the flexibility of transfers.
- Mesh networks:
  - Every node is connected to all of its neighbors;
  - A mesh network is scalable in that a network of dimension n + 1 includes subnetworks that are meshes of dimension n;
  - The links are short but numerous, which establishes multiple paths for data;
  - The length of the shortest path between two nodes equals their Manhattan distance, which is the sum of the absolute differences between the indexes of the source and destination nodes.
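
As a small worked example of the Manhattan distance in a mesh, the sketch below counts the minimum number of hops between two nodes identified by their coordinates. The coordinate tuples and the function name are assumptions made for illustration; the mesh is assumed to have no wraparound links.

```python
def manhattan_hops(src, dst):
    """Minimum hop count between two mesh nodes given as coordinate
    tuples of equal dimension (mesh without wraparound links)."""
    if len(src) != len(dst):
        raise ValueError("nodes must have the same number of coordinates")
    return sum(abs(s - d) for s, d in zip(src, dst))


print(manhattan_hops((0, 0), (3, 2)))        # 2-D mesh
print(manhattan_hops((1, 4, 2), (1, 0, 5)))  # 3-D mesh
```
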
- Application-specific networks (ASNs) are appropriate for ESs:
  - An ASN is a topology matched to the characteristics of the application;
  - ASNs consume less energy than a regular network of equal overall performance;
  - Because most embedded applications perform several different tasks simultaneously, different parts of the architecture require different network bandwidth;
  - The network becomes more efficient, without sacrificing performance for a given application, by placing bandwidth where it is necessary.
- Routing and flow control determine the cost and the performance of the network:
  - Routing determines the paths; routing algorithms can be deterministic or adaptive, and they may drop packets occasionally or guarantee packet delivery; types of algorithms: circuit switching, store-and-forward, wormhole and virtual cut-through;
  - Flow control determines the way that links and buffers are allocated as packets move through the network.
- Networks-on-chip (NoCs):
  - NoCs are the interconnection networks for single-chip multiprocessors;
  - In one example design, each switch is connected to its four nearest neighbors with two unidirectional links, and to a resource;
  - In a 60 nm CMOS technology:
    - A single chip could include a 10 x 10 mesh with switches and resources;
    - Each network link would have 256 data bits plus control signals;
  - Each switch has a queue at each input;
  - The selection logic at the outputs determines the order of the packets.
  - Another example is the SPIN network, a scalable network with a fat-tree topology:
    - This topology offers more bandwidth at higher levels in order to reduce contention;
    - The leaf nodes are the processing and memory elements; when a PE wants to send a message to another, the message goes up in the tree until a common ancestor node is reached, then it goes back down;
    - One advantage of the fat-tree topology is that all the routing nodes use the same routing function, so the same routers can be used throughout the network;
    - The SPIN network uses two 32-bit data paths, one for each direction, for full-duplex communication; a router can choose any of the several equivalent paths available to it at that moment.
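
The up-then-down routing in a fat tree can be illustrated by counting how many links a message crosses. This is a sketch under assumptions that the slides do not state: leaves are numbered consecutively from 0, every router has the same number of children (`arity`, set to 4 here purely as an example), and a leaf's parent index is obtained by integer division.

```python
def fat_tree_hops(src_leaf, dst_leaf, arity=4):
    """Number of links crossed when a message climbs from src_leaf to the
    lowest common ancestor and then descends to dst_leaf."""
    if src_leaf < 0 or dst_leaf < 0 or arity < 2:
        raise ValueError("leaf indices must be >= 0 and arity >= 2")
    levels_up = 0
    while src_leaf != dst_leaf:
        # Move both endpoints one level up until they meet.
        src_leaf //= arity
        dst_leaf //= arity
        levels_up += 1
    return 2 * levels_up  # same number of links up and down


print(fat_tree_hops(0, 3))  # same first-level router
print(fat_tree_hops(0, 4))  # neighbours under different routers
```

Leaves under the same router are two links apart, while leaves in different subtrees pay two more links for every extra level they must climb.
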
  - Design methodologies for NoCs have been developed; for example, a methodology for designing networks for QoS-intensive applications such as multimedia:
    - The application requirements are specified;
    - The performance required from the network is determined;
    - The topology is determined and the network is configured with PEs and memories;
    - The network is simulated to evaluate its actual performance;
    - The network may be modified based on the performance results.
- Physically distributed embedded systems and networks:
  - Frequently used in cars, airplanes, etc.;
  - These systems are more loosely coupled than multiprocessors; they generally do not share memories;
  - The application is distributed over the PEs;
  - The distributed system must provide guaranteed real-time behavior;
  - Reasons to build network-based embedded systems:
    - To execute tasks near the events; for example, engine control may require short delays;
    - Data reduction: for example, some initial signal processing on the input data to reduce its volume; allocating these operations to a dedicated processor speeds up processing and reduces the load on the processor that uses the data to take decisions;
    - Modularity: for easier design and assembly, for easier debugging (a verified module can be used to probe components in another part of the network), for fault tolerance;
  - The design of a distributed embedded system is an example of hardware/software co-design, since the network topology and the software running on the network nodes must be designed together.
- Time-triggered architecture (TTA):
  - TTA is a distributed architecture for real-time control; it offers reliability for safety-critical systems and accuracy for high-rate physical processes;
  - It differs from other distributed architectures in that it takes time into account; TTA represents time as a 64-bit value, with the three lower bytes holding fractions of a second and the five upper bytes holding seconds;
  - The next figure presents the communication network interface; it links the communications controller, which is the low-level interface, and the host node, which is the TTA's PE.
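
The 64-bit time layout described above (five upper bytes of seconds, three lower bytes of fractions) can be illustrated with a few bit operations. The sketch assumes that the fraction field is a binary fraction in units of 2^-24 s; the function names are made up for this example and do not come from any TTA library.

```python
FRACTION_BITS = 24  # three lower bytes: fractions of a second
SECONDS_BITS = 40   # five upper bytes: whole seconds


def pack_tta_time(seconds, fraction):
    """Pack whole seconds and a 24-bit fraction into one 64-bit value."""
    if not 0 <= seconds < (1 << SECONDS_BITS):
        raise ValueError("seconds do not fit in five bytes")
    if not 0 <= fraction < (1 << FRACTION_BITS):
        raise ValueError("fraction does not fit in three bytes")
    return (seconds << FRACTION_BITS) | fraction


def unpack_tta_time(value):
    """Split a 64-bit time value into (seconds, fraction)."""
    return value >> FRACTION_BITS, value & ((1 << FRACTION_BITS) - 1)


def tta_time_to_seconds(value):
    """Convert to a float, assuming the fraction counts units of 2**-24 s."""
    seconds, fraction = unpack_tta_time(value)
    return seconds + fraction / (1 << FRACTION_BITS)


stamp = pack_tta_time(5, 0x800000)
print(hex(stamp), unpack_tta_time(stamp), tta_time_to_seconds(stamp))
```

Under the assumed interpretation, 24 fraction bits give a resolution of about 60 ns.
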
- The TTA can be implemented on bus and star topologies:
  - A bus-based system uses replicated buses; they are passive, to avoid components that may fail;
  - Each physical node consists of a node, two guardians and a bus transceiver; the guardians monitor the transmissions of the node.
- FlexRay:
  - Is a second-generation standard for automotive networks; it provides higher bandwidth and more abstract services than CAN;
  - It is based on the TTA;
  - The next figure shows a block diagram of a generic FlexRay system:
    - The host runs applications;
    - The host communicates with the communication controller, which provides high-level functions, and with the low-level bus driver;
    - Bus guardians are nodes that monitor the behavior of the network and take action when the behavior is erroneous.
  - FlexRay is organized around 5 levels of abstraction:
    - Physical level: defines the structure of connections;
    - Interface level: defines the physical connections;
    - Protocol engine: defines frame formats and communication modes, and services such as messages and synchronization;
    - Controller host interface: provides information on status, configuration, messages and control for the host layer;
    - Host layer: provides applications.
  - FlexRay has an active star topology (the router node is active):
    - A node may be connected to more than one star to provide redundant connections.
  - Data is coded with the differential non-return-to-zero scheme;
  - The transmission rate is 10 Mbps, independent of the length of the link; bit-level arbitration is not used, so arbitration contention does not limit the link's length;
  - Data is encapsulated in frames; a frame has the following fields:
    - Frame ID: identifies the frame's slot; its value ∈ {0, …, 2047};
    - Payload length: gives the number of 16-bit words in the payload section;
    - Header CRC: provides error detection for the header;
    - Cycle count: enumerates the protocol cycles; this information is used within the protocol engines to guide clock synchronization;
    - Data field: carries a payload of 0 to 254 bytes;
    - Trailer CRC: provides additional error detection.
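
The field ranges listed for a frame can be expressed as simple consistency checks. The sketch below models only the frame ID and the payload; the class and function names are hypothetical, the CRC fields and cycle count are deliberately left out, and the requirement of an even payload size is an inference from the payload length being counted in 16-bit words.

```python
from dataclasses import dataclass

MAX_FRAME_ID = 2047      # frame ID range {0, ..., 2047}
MAX_PAYLOAD_BYTES = 254  # data field of 0 to 254 bytes


@dataclass(frozen=True)
class FrameFields:
    frame_id: int
    payload: bytes

    def payload_length_words(self):
        """Value of the payload-length field: number of 16-bit words."""
        return len(self.payload) // 2


def validate_frame(frame):
    """Return a list of problems found; an empty list means the fields
    are consistent with the ranges above."""
    errors = []
    if not 0 <= frame.frame_id <= MAX_FRAME_ID:
        errors.append("frame ID out of range")
    if len(frame.payload) > MAX_PAYLOAD_BYTES:
        errors.append("payload too long")
    if len(frame.payload) % 2:
        # Assumption: payload must be a whole number of 16-bit words.
        errors.append("payload is not a whole number of 16-bit words")
    return errors


frame = FrameFields(frame_id=5, payload=b"\x01\x02\x03\x04")
print(frame.payload_length_words(), validate_frame(frame))
```
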
  - There are 2 timing structures: the static segment and the dynamic segment;
  - The static segment is scheduled using a TDMA discipline:
    - Static segments are divided into slots of fixed and equal length; all the slots are used in every segment in the same order;
    - The static segment is split across two channels; synchronization frames are provided on both channels; messages can be sent on either one or both channels; less critical messages are sent on only one channel; the slots are occupied by messages with ascending frame ID numbers;
  - The dynamic segment:
    - Provides bandwidth for asynchronous, unpredictable communication; the slots are arbitrated using a deterministic mechanism;
    - Has two channels, each of which can have its own message queue.
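
Because the static segment uses equal-length slots in a fixed order, the slot that owns the bus at a given instant follows from integer division. The following sketch uses a 0-based slot index and microsecond units as illustrative conventions; the mapping from slot index to frame ID and the actual slot lengths depend on the network configuration and are not modeled.

```python
def static_slot_at(offset_us, slot_len_us, n_slots):
    """Return the 0-based index of the static slot active at `offset_us`
    from the start of the static segment, or None once the segment is over."""
    if offset_us < 0 or slot_len_us <= 0 or n_slots <= 0:
        raise ValueError("invalid timing parameters")
    slot = offset_us // slot_len_us
    return slot if slot < n_slots else None


# Hypothetical static segment: 4 slots of 50 microseconds each.
for t in (0, 49, 50, 149, 200):
    print(t, static_slot_at(t, 50, 4))
```
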
  - Because of its complex timing, FlexRay must be started properly:
    - The operation begins with a wake-up procedure that turns on the nodes;
    - Then a coldstart, which initiates the TDMA process, is performed;
    - At least two nodes must be able to perform a coldstart;
  - FlexRay has a global time source to synchronize messages:
    - The global time is synthesized by the clock synchronization process from the nodes' clocks using distributed timekeeping algorithms;
  - The bus guardians:
    - Prevent the nodes from transmitting outside their schedules;
    - Including a bus guardian in a FlexRay system is not mandatory, only recommended;
    - The bus guardian sends an enable signal to every node it guards; removing the enable signal stops the node's transmission;
    - The bus guardian uses its own clock to watch the bus operation; if it detects a message coming at the wrong time, it removes the enable signal;
  - The controller host interface provides services to the host regarding status, control (interrupt service, startup), data (buffering messages) and configuration.
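
The bus guardian's behavior, removing the enable signal when a node transmits outside its schedule, can be sketched as a small state machine. Everything specific in this sketch is an assumption made for illustration: the schedule is a dictionary of time windows within a cycle, the times are plain integers read from the guardian's own clock, and a removed enable signal stays removed.

```python
class BusGuardian:
    """Toy model of a guardian that gates transmissions by schedule."""

    def __init__(self, schedule, cycle_len):
        # schedule maps node name -> (start, end) window within one cycle.
        self.schedule = dict(schedule)
        self.cycle_len = cycle_len
        self.enabled = {node: True for node in self.schedule}

    def observe(self, node, guardian_time):
        """Record a transmission by `node` at `guardian_time` (guardian's
        own clock) and return whether the node is still enabled."""
        start, end = self.schedule[node]
        offset = guardian_time % self.cycle_len
        if not start <= offset < end:
            self.enabled[node] = False  # remove the enable signal
        return self.enabled[node]


guardian = BusGuardian({"A": (0, 50), "B": (50, 100)}, cycle_len=200)
print(guardian.observe("A", 210))  # inside A's window in the second cycle
print(guardian.observe("B", 10))   # B transmits during A's window
```
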
- Aircraft networks:
  - The aircraft domain is similar to the automotive domain, but with more severe requirements:
    - Weight is a more sensitive parameter than in cars;
    - Planes need more complex control because they move in 3D;
    - Most aspects of aircraft design, operation and maintenance are regulated;
  - Aircraft electronics is divided into 3 categories:
    - Instrumentation;
    - Navigation/communication;
    - Control;
  - Instrumentation (such as the altimeter or artificial horizon) uses mechanical, pneumatic or hydraulic methods; the electronics has to display the data and send it to other systems;
  - Navigation/communication: is done by radio and is regulated; communication is done by voice or data; digital electronics control the radios and display navigation data, such as moving maps that integrate navigation data onto a map;
  - Control: operates the engines and flight surfaces (such as aileron, elevator, rudder).
  - Generally, aircraft use different types of networks, such as:
    - Control networks: they perform hard real-time tasks for instrumentation and control;
    - Management networks: they control noncritical devices; they can use nonguaranteed modes, such as Ethernet, to improve average performance and limit weight;
    - Passenger networks: for example, Internet service to passengers; a satellite link is used; these networks are separated from the operation networks by firewalls;
  - Aircraft data networks are governed by several standards; for example, ARINC 664:
    - It is based on Ethernet, which provides higher bandwidth than previous aircraft data networks and allows aircraft manufacturers to use standard network components;
    - However, basic Ethernet is used with protocols and architectures that provide the needed real-time performance and reliability;
    - It divides the aircraft network into 4 domains, with firewalls between them:
      - The flight deck network for real-time control;
      - A network for equipment supplied by outside vendors;
      - A subnetwork for secondary operations, such as in-flight entertainment;
      - The passenger subnetwork, which provides Internet access to passengers.