- Slides: 30
Parallel Computer Architectures • Two distinct models have been proposed for how CPUs in a parallel computer system should communicate. § In the first model, all CPUs share a common physical memory. • This kind of system is called a multiprocessor or shared memory system. § In the second design, each CPU has its own private memory. • Such a design is called a multicomputer or distributed memory system.
Multiprocessors § Consider a program to find all of the objects in a bitmap image. • One copy of the image is kept in memory. • Each CPU runs a single process which inspects one section of the image. • Some objects occupy multiple sections, so it is essential that each process have access to the entire image. § Example multiprocessors include: • Sun Enterprise 10000 • Sequent NUMA-Q • SGI Origin 2000 • HP/Convex Exemplar
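The shared-memory version of the bitmap example can be sketched in a few lines. This is only an illustration: Python threads stand in for CPUs, and `IMAGE`, `scan_section`, and `parallel_scan` are hypothetical names. Each worker inspects its own band of rows but may also read the row just below its band, because every worker sees the single shared copy of the image.

```python
from concurrent.futures import ThreadPoolExecutor

# One shared copy of a tiny bitmap (1 = object pixel). Hypothetical data.
IMAGE = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]

def scan_section(rows):
    """Count object pixels in a band and pixels that continue into the next band."""
    pixels = sum(sum(IMAGE[r]) for r in rows)
    last = rows[-1]
    crossing = 0
    if last + 1 < len(IMAGE):
        # Reading a row owned by another worker needs no message: memory is shared.
        crossing = sum(a & b for a, b in zip(IMAGE[last], IMAGE[last + 1]))
    return pixels, crossing

def parallel_scan(workers=2):
    """Split the rows into equal bands and scan them concurrently."""
    band = len(IMAGE) // workers
    sections = [range(i * band, (i + 1) * band) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scan_section, sections))
```

With two workers, each result pairs the pixel count of a band with the number of pixels that continue across its lower border.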
Multicomputers § In a multicomputer solving the same problem, each CPU has a section of the image in its local memory. • If the CPUs need to follow an object across the border, they must request the information from a neighboring CPU. § This is done via message passing. § Programming multicomputers is more difficult than programming multiprocessors, but they are more scalable. • Building a multicomputer with 10,000 CPUs is straightforward.
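For contrast, here is a rough sketch of the same border check on a simulated multicomputer. It assumes each node keeps its rows private and communicates only through a per-node inbox; `node`, `run`, and the queue-based inboxes are hypothetical. As a simplification, each node pushes its top row to the neighbor above instead of waiting for an explicit request.

```python
import queue
import threading

def node(rank, local_rows, inboxes, results):
    """One simulated CPU: owns local_rows privately, talks only via inboxes."""
    # Send my top row to the node above, which needs it to follow objects downward.
    if rank > 0:
        inboxes[rank - 1].put(local_rows[0])
    crossings = 0
    if rank < len(inboxes) - 1:
        below = inboxes[rank].get()  # blocks until the neighbor's message arrives
        crossings = sum(a & b for a, b in zip(local_rows[-1], below))
    results[rank] = crossings

def run(sections):
    """Start one thread per section and return the crossings each node found."""
    inboxes = [queue.Queue() for _ in sections]
    results = [None] * len(sections)
    threads = [
        threading.Thread(target=node, args=(rank, rows, inboxes, results))
        for rank, rows in enumerate(sections)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The extra send and receive steps are the programming cost mentioned on the slide; nothing is shared, so nothing limits how many nodes can be added.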
Multicomputers § Example multicomputers include: • IBM SP/2 • Intel/Sandia Option Red • Wisconsin COW § Much research focuses on hybrid systems combining the best of both worlds. § Shared memory might be implemented at a higher level than the hardware. • The operating system might simulate a shared memory by providing a single system-wide paged shared address space. § This approach is called DSM (Distributed Shared Memory).
Shared Memory • Each machine has its own virtual memory and its own page table. • When a CPU does a LOAD or STORE on a page it does not have, a trap to the OS occurs. • The OS locates the page and asks the CPU currently holding it to unmap the page and send it over the interconnection network. • When it arrives, the page is mapped in and the faulting instruction restarted. § A third possibility is to have a user-level runtime system implement a form of shared memory.
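The page-fault sequence above can be mimicked with a toy model. This is a minimal sketch, not an operating system: `Node`, `DSM`, the page size of 4 words, and the central `owner` table are all assumptions made for illustration.

```python
class Node:
    """A machine with its own page table (page number -> list of words)."""
    def __init__(self, name):
        self.name = name
        self.pages = {}

class DSM:
    """Toy distributed shared memory: each page lives on exactly one node."""
    def __init__(self, nodes, page_size=4):
        self.nodes = {n.name: n for n in nodes}
        self.owner = {}          # page number -> name of the node holding it
        self.page_size = page_size
        self.faults = 0

    def _fault(self, node, page):
        """Simulated trap: locate the page, unmap it at the holder, map it here."""
        self.faults += 1
        holder = self.owner.get(page)
        if holder is None:
            data = [0] * self.page_size                 # first touch: zero page
        else:
            data = self.nodes[holder].pages.pop(page)   # unmap and "send" it
        node.pages[page] = data
        self.owner[page] = node.name

    def load(self, node, addr):
        page, offset = divmod(addr, self.page_size)
        if page not in node.pages:
            self._fault(node, page)   # afterwards the access is retried below
        return node.pages[page][offset]

    def store(self, node, addr, value):
        page, offset = divmod(addr, self.page_size)
        if page not in node.pages:
            self._fault(node, page)
        node.pages[page][offset] = value
```

For example, if node A stores 42 at address 5 and node B then loads address 5, the page migrates from A to B and B reads 42, at the cost of two faults.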
Shared Memory § The programming language provides a shared memory abstraction implemented by the compiler and runtime system. • The Linda model is based on the abstraction of a shared space of tuples. § Processes can input a tuple from the shared tuple space or output a tuple to the shared tuple space. • The Orca model allows shared generic objects. § Processes can execute object-specific methods on shared objects. § When a change occurs to the internal state of some object, it is up to the runtime system to simultaneously update all copies of the object.
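A tuple space in the spirit of Linda can be sketched as follows. This is an assumption-laden toy, not Linda itself: the class `TupleSpace` is hypothetical, `take` plays the role of Linda's input operation (`in` is a reserved word in Python), and `None` in a pattern is treated as a wildcard.

```python
import threading

class TupleSpace:
    """Toy shared tuple space with output and (blocking) input operations."""
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *tup):
        """Output a tuple to the shared space and wake any waiting readers."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _match(self, pattern):
        for t in self._tuples:
            if len(t) == len(pattern) and all(
                p is None or p == v for p, v in zip(pattern, t)
            ):
                return t
        return None

    def take(self, *pattern, timeout=None):
        """Remove and return the first matching tuple; None if the wait times out."""
        with self._cond:
            found = self._cond.wait_for(
                lambda: self._match(pattern) is not None, timeout
            )
            if not found:
                return None
            t = self._match(pattern)
            self._tuples.remove(t)
            return t
```

A producer might call `space.out("result", 7)` while a consumer calls `space.take("result", None)`; the two never name each other, only the shape of the tuple.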
Interconnection Networks • Multicomputers are held together by interconnection networks which move packets between CPUs and memory. § The CPUs and memory modules of multiprocessors are also interconnected. § Interconnection networks consist of: • CPUs • Memory modules • Interfaces • Links • Switches
Interconnection Networks § The links are the physical channels over which bits move. They can be • electrical or optical fiber • serial or parallel • simplex, half-duplex, or full-duplex § The switches are devices with multiple input ports and multiple output ports. • When a packet arrives at an input port on a switch, some bits are used to select the output port to which the packet is sent.
Switching § An interconnection network consists of switches and wires connecting them. § The following slide shows an example. • Each switch has four input ports and four output ports. • In addition each switch has some CPUs and interconnect circuitry. • The job of the switch is to accept packets arriving on any input port and send each one out on the correct output port. • Each output port is connected to an input port of another switch by a parallel or serial line.
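One way to picture how a four-port switch picks "the correct output port" is to let it read two bits of the destination address. The packet format here is an assumption (the slides do not define one): each stage of switches consumes the next two bits, lowest bits first, and `select_port` and `route` are hypothetical helpers.

```python
def select_port(dest, stage, bits_per_stage=2):
    """Output port chosen at a given stage: the stage-th group of address bits."""
    shift = stage * bits_per_stage
    mask = (1 << bits_per_stage) - 1      # 0b11 for a four-port switch
    return (dest >> shift) & mask

def route(dest, stages, bits_per_stage=2):
    """List of output ports a packet takes through successive switch stages."""
    return [select_port(dest, s, bits_per_stage) for s in range(stages)]
```

With destination `0b1101` and two stages, the first switch uses bits `01` (port 1) and the second uses bits `11` (port 3).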
Switching § Several switching strategies are possible. • In circuit switching, before a packet is sent, the entire path from the source to the destination is reserved in advance. § All ports and buffers are claimed, so that when transmission starts, all necessary resources are guaranteed to be available and the bits can move at full speed from the source, through the switches to the destination. • In store-and-forward packet switching, no advance reservation is needed. § The source sends a complete packet to the first switch where it is stored in its entirety. § The switches may need to buffer packets if an output port is busy.
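The difference between the two strategies shows up in a back-of-the-envelope latency model. The model is an idealization and not from the slides: every link moves one bit per tick, propagation and switching delays are ignored, and reserving a link costs a fixed number of ticks. The function names are hypothetical.

```python
def circuit_switched_ticks(packet_bits, links, setup_ticks_per_link=1):
    """Reserve every link on the path first, then stream the bits at full speed."""
    return links * setup_ticks_per_link + packet_bits

def store_and_forward_ticks(packet_bits, links):
    """Each switch must receive the whole packet before sending it onward."""
    return links * packet_bits
```

Under these assumptions, a 100-bit packet crossing 3 links takes 103 ticks with circuit switching and 300 ticks with store-and-forward; the model leaves out the cost of links sitting reserved but idle, which is what circuit switching pays in exchange.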
Communication Methods § When a program is split up into pieces, the pieces (processes) often need to communicate with one another. § This communication can be done in one of two ways: • shared variables • explicit message passing • Logical sharing of variables is possible even on a multicomputer. • Message passing is easy to implement on a multiprocessor by simply copying from the sender to the receiver.
Taxonomy of Parallel Computers • Although many researchers have tried to come up with a taxonomy of parallel computers, the only one which is widely used is that of Flynn (1972). • This classification is based on two concepts: § instruction streams • corresponding to a program counter § data streams • consisting of a set of operands
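Flynn's four categories follow directly from whether there is one or more than one of each stream. The helper below is a hypothetical illustration of that rule; the category names (SISD, SIMD, MISD, MIMD) are the standard ones from Flynn's taxonomy.

```python
def flynn_class(instruction_streams, data_streams):
    """Name the Flynn category for the given numbers of streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"   # Single or Multiple instruction
    d = "S" if data_streams == 1 else "M"          # Single or Multiple data
    return f"{i}I{d}D"
```

A classic uniprocessor is `flynn_class(1, 1)`, a vector or array machine is `flynn_class(1, n)`, and both multiprocessors and multicomputers fall under `flynn_class(n, n)`.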
Memory Semantics § Even though all multiprocessors present the CPUs with the image of a single shared address space, often there are many memory modules present, each holding some portion of the physical memory. • The CPUs and memories are often interconnected by a complex interconnection network. • Several CPUs may be attempting to read a memory word at the same time several other CPUs are attempting to write the same word. • Multiple copies of some blocks may be in caches.
Memory Semantics § One view of memory semantics is to view it as a contract between the software and the memory hardware. • The rules are called consistency models, and many different ones have been proposed and implemented. • For example, suppose that CPU 0 writes the value 1 to some memory word and a little later CPU 1 writes the value 2 to the same word. • Now CPU 2 reads the word and gets the value 1. • Is this an error?
Memory Semantics § The simplest model is strict consistency. • With this model, any read of a location x always returns the value of the most recent write to x. • This model is great for programmers, but almost impossible to implement. § The next best model is called sequential consistency. • The basic idea is that in the presence of multiple read and write requests, some interleaving of all the requests is chosen by the hardware (nondeterministically), but all CPUs see the same order.
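To see what sequential consistency permits, one can enumerate every interleaving that keeps each CPU's own requests in order and record what a read could return. This is a small sketch for a single memory word; `interleavings`, `read_values`, and the `("W", value)` / `("R", None)` encoding are hypothetical.

```python
def interleavings(programs):
    """Yield every merge of the per-CPU request lists that keeps each list's order."""
    if all(not prog for prog in programs):
        yield []
        return
    for k, prog in enumerate(programs):
        if prog:
            rest = programs[:k] + [prog[1:]] + programs[k + 1:]
            for tail in interleavings(rest):
                yield [prog[0]] + tail

def read_values(programs, initial=0):
    """Set of values a read of the word may return under some interleaving."""
    seen = set()
    for order in interleavings(programs):
        word = initial
        for op, value in order:
            if op == "W":
                word = value
            else:
                seen.add(word)
    return seen
```

For the earlier example (CPU 0 writes 1, CPU 1 writes 2, CPU 2 reads), `read_values([[("W", 1)], [("W", 2)], [("R", None)]])` contains 1, so reading 1 is not an error under sequential consistency, although strict consistency would require 2.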
Memory Semantics § A looser consistency model, but one that is easier to implement on large multiprocessors, is processor consistency. It has two properties: • Writes by any CPU are seen by all CPUs in the order they were issued. • For every memory word, all CPUs see all writes to it in the same order. • If CPU 1 issues writes with values 1A, 1B, and 1C to some memory location in that sequence, then all other processors see them in that order too. • Every memory word has an unambiguous value after several CPUs write to it and stop.
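The two properties can be written as checks over recorded observations. The sketch assumes each write has a unique label such as "1A", that `issued` maps a CPU to its writes in issue order, and that `word_of` maps each label to the memory word it targets; all of these names are hypothetical.

```python
def respects_issue_order(issued, observed):
    """Property 1: each CPU's writes appear in `observed` in the order issued."""
    for writes in issued.values():
        seen = [w for w in observed if w in writes]
        if seen != writes:
            return False
    return True

def same_order_per_word(observations, word_of):
    """Property 2: for every word, all observers see its writes in one order."""
    for word in set(word_of.values()):
        views = {
            tuple(w for w in obs if word_of[w] == word)
            for obs in observations.values()
        }
        if len(views) > 1:
            return False
    return True
```

Two observers may still disagree about how writes to different words interleave; only the per-CPU order and the per-word order are pinned down.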
Memory Semantics § Weak consistency does not even guarantee that writes from a single CPU are seen in the order they were issued. • One CPU might see 1A before 1B and another CPU might see 1A after 1B. • However, to add some order, weakly consistent memories have synchronization variables or a synchronization operation. § When a synchronization is executed, all pending writes are finished and no new ones are started until all the old ones are done and the synchronization itself is done. § In effect a synchronization “flushes the pipeline” and brings the memory to a stable state with no operations pending. § Time is divided into epochs delimited by the synchronizations.
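The flushing effect of a synchronization can be imitated with a buffer of pending writes. This is a toy model under stated assumptions: `WeakMemory` is hypothetical, writes to different addresses may complete in any order, and writes to the same address are kept in issue order so the final value stays well defined.

```python
import random

class WeakMemory:
    """Toy weakly consistent memory: writes wait in a buffer until performed."""
    def __init__(self, seed=0):
        self.memory = {}
        self.pending = []                 # (address, value) pairs not yet performed
        self.rng = random.Random(seed)

    def write(self, addr, value):
        """Issue a write; it becomes visible only when the hardware performs it."""
        self.pending.append((addr, value))

    def drain_one(self):
        """Perform one pending write, not necessarily the oldest one issued."""
        if not self.pending:
            return
        addr = self.rng.choice(self.pending)[0]
        # Take the oldest pending write to that address.
        index = next(i for i, (a, _) in enumerate(self.pending) if a == addr)
        _, value = self.pending.pop(index)
        self.memory[addr] = value

    def sync(self):
        """Synchronization: finish every pending write before returning."""
        while self.pending:
            self.drain_one()
```

Between synchronizations an observer may find any subset of the issued writes in `memory`; after `sync()` the buffer is empty and the memory is in a stable state, which marks the end of an epoch.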
Memory Semantics § Weak consistency has the problem that it is quite inefficient because it must finish off all pending memory operations and hold all new ones until the current ones are done. § Release consistency improves matters by adopting a model akin to critical sections. • The idea behind this model is that when a process exits a critical region it is not necessary to force all writes to complete immediately. It is only necessary to make sure that they are done before any process enters the critical region again.
Memory Semantics § In this model, the synchronization operation offered by weak consistency is split into two different operations. • To read or write a shared data variable, a CPU must first do an acquire operation on the synchronization variable to get exclusive access to the shared data. • When it is done, the CPU does a release operation on the synchronization variable to indicate that it is finished.
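The acquire/release discipline has the same shape as lock-protected code in software. The sketch below is an analogy rather than a hardware model: a `threading.Lock` stands in for the synchronization variable, and `shared`, `worker`, and `run` are hypothetical names.

```python
import threading

sync_var = threading.Lock()      # stands in for the synchronization variable
shared = {"count": 0}            # the shared data it protects

def worker(increments):
    for _ in range(increments):
        sync_var.acquire()       # acquire: get exclusive access to the shared data
        try:
            shared["count"] += 1
        finally:
            sync_var.release()   # release: signal that this CPU is finished

def run(threads=4, increments=1000):
    """Run several workers and return the final value of the shared counter."""
    shared["count"] = 0
    pool = [threading.Thread(target=worker, args=(increments,)) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return shared["count"]
```

Because every update sits between an acquire and a release, `run(4, 1000)` yields 4000; the writes made inside the region only have to be complete by the time the next acquire succeeds.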