(Superficial!) Review of Uniprocessor Architecture, Parallel Architectures, and Related Concepts
CS 433, Laxmikant Kale
University of Illinois at Urbana-Champaign, Department of Computer Science

Early machines
• We will present a series of idealized and simplified models
  – Read more about the real models in architecture textbooks (official prereqs: CS 232, CS 333)
  – The idea here is to review the concepts and define our vocabulary
[Figure: Processor connected to Memory, which contains Location 0, Location 1, …, Location k]

Early machines
• Early machines: complex instruction sets, (let's say) no registers
  – Processor can access any memory location equally fast
  – Instructions:
    • Operations: Add L1, L2, L3 (add the contents of location L1 to that of location L2, and store the result in L3)
    • Branching: Branch to L4 (note that some locations store program instructions)
    • Conditional branching: If (L1 > L2) goto L3
  – (A toy interpreter for such a machine is sketched below.)
[Figure: Processor connected to Memory, which contains Location 0, Location 1, …, Location k]
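To make the model concrete, here is a minimal sketch of an interpreter for such a memory-to-memory machine. It is not from the slides: the opcode names, the instruction layout, and the example program are invented for illustration.

```c
#include <stdio.h>

/* A toy memory-to-memory machine: every operand names a memory location.
   The encoding below is hypothetical, chosen only for illustration. */
enum Op { ADD, BRANCH, BRANCH_IF_GT, HALT };

struct Instr { enum Op op; int a, b, c; };

int main(void) {
    int mem[16] = {0};               /* Location 0 .. Location 15 */
    mem[0] = 5; mem[1] = 7;

    struct Instr prog[] = {
        { ADD,          0, 1, 2 },   /* Add L0, L1, L2 : mem[2] = mem[0] + mem[1] */
        { BRANCH_IF_GT, 2, 0, 3 },   /* If (L2 > L0) goto instruction 3           */
        { ADD,          2, 2, 2 },   /* skipped when the branch is taken          */
        { HALT,         0, 0, 0 },
    };

    for (int pc = 0; ; ) {
        struct Instr in = prog[pc++];
        if      (in.op == ADD)          mem[in.c] = mem[in.a] + mem[in.b];
        else if (in.op == BRANCH)       pc = in.a;
        else if (in.op == BRANCH_IF_GT) { if (mem[in.a] > mem[in.b]) pc = in.c; }
        else break;                     /* HALT */
    }
    printf("L2 = %d\n", mem[2]);        /* prints L2 = 12 */
    return 0;
}
```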

Registers
• Processors are faster than memory
  – They can deal with data within the processor much faster
• So, create some locations in the processor for storing data
  – Called registers; often with a special register called the accumulator
• Now we need new instructions for dealing with data in registers:
  – Data movement instructions
    • Move from register to memory, memory to register, register to register, and memory to memory
  – Computation instructions
    • In addition to the previous ones, we now add instructions that allow one or more operands to be a register
[Figure: CPU containing registers, connected to Memory]

Load-Store architectures (RISC)
• Do not allow memory locations to be operands
  – For computation as well as control instructions
• The only instructions that reference memory are:
  – Load R, L # move contents of memory location L into register R
  – Store R, L # move contents of register R into memory location L
• Notice that the number of instructions is now dramatically reduced
  – Further, allow only relatively simple instructions to do register-to-register operations
  – More complex operations are implemented in software
  – The compiler has a bigger responsibility now (see the sketch below)
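As a sketch of that compiler responsibility, here is how one statement breaks down on a load-store machine. The assembly in the comments is hypothetical RISC-style pseudocode (the register names and syntax are assumptions, not any real ISA); the C itself compiles and runs.

```c
#include <stdio.h>

/* a[i] = b[i] + c[i] on a load-store machine: only Load and Store touch
   memory; Add operates purely on registers. */
void vec_add(long *a, const long *b, const long *c, int n) {
    for (int i = 0; i < n; i++) {
        /* Load  R1, b[i]     # R1 <- memory                      */
        /* Load  R2, c[i]     # R2 <- memory                      */
        /* Add   R3, R1, R2   # register-to-register computation  */
        /* Store R3, a[i]     # R3 -> memory                      */
        a[i] = b[i] + c[i];
    }
}

int main(void) {
    long a[4], b[4] = {1, 2, 3, 4}, c[4] = {10, 20, 30, 40};
    vec_add(a, b, c, 4);
    printf("%ld %ld %ld %ld\n", a[0], a[1], a[2], a[3]);  /* 11 22 33 44 */
    return 0;
}
```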

Caches
• The processor still has to wait for data from memory
  – I.e., Load and Store instructions are slower
  – Although, more often, the CPU is executing register-only instructions
  – Load and store latency
    • Dictionary meaning: latency is the delay between stimulus and response
    • Or: the delay between a data-transfer instruction and the beginning of the data transfer
• But faster SRAM memory is available (although expensive)
• Idea: just like registers, put some of the data in faster memory
  – Which data?
  – Principle of locality (an empirical observation):
    • Data accessed correlates with past accesses, spatially and temporally
    • Without this, caches would be worthless (unless most data fits in the cache)
    • (The sketch below shows locality's effect on a simple loop.)
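A quick way to see spatial locality at work (a sketch, not from the slides): traverse the same matrix by rows, which touches consecutive addresses, and by columns, which jumps N doubles between accesses. On typical hardware the row-major walk is several times faster, purely because of the cache.

```c
#include <stdio.h>
#include <time.h>

#define N 2048
static double m[N][N];                  /* C stores this array row by row */

static double traverse(int by_rows) {
    clock_t t0 = clock();
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += by_rows ? m[i][j]    /* consecutive addresses: line reused  */
                           : m[j][i];   /* stride of N doubles: new line each time */
    if (sum != 0.0) printf("");         /* keep 'sum' live under optimization  */
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    printf("row-major    (cache-friendly): %.3f s\n", traverse(1));
    printf("column-major (cache-hostile) : %.3f s\n", traverse(0));
    return 0;
}
```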

Caches
The processor still issues Load and Store instructions as before, but the cache controller intercepts the requests and, if the location has been cached, services them from the cache. Data transfer between the cache and memory is not seen by the processor.
[Figure: Processor issues requests to the Cache controller, which sits between the Cache and Memory]

Cache Issues
• Level 2 cache
• Cache lines
  – Bring a bunch of data "at once":
    • exploits spatial locality
    • block transfers are faster
  – 64-128 byte cache lines are typical
  – Trade-off: why larger and larger cache lines aren't good either
  – (See the sketch below for how addresses map onto lines.)
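A small sketch of how addresses map onto lines, assuming 64-byte lines (the exact size is machine-dependent): consecutive array elements usually share a line, which is exactly the spatial locality that block transfer exploits.

```c
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64          /* a typical cache-line size, in bytes */

int main(void) {
    int a[32];
    for (int i = 0; i < 32; i += 8) {
        uintptr_t p = (uintptr_t)&a[i];
        /* All addresses with the same p / LINE_SIZE live on one line and
           are fetched from memory together, in a single block transfer. */
        printf("&a[%2d] = %p -> line %llu, offset %2llu\n", i, (void *)&a[i],
               (unsigned long long)(p / LINE_SIZE),
               (unsigned long long)(p % LINE_SIZE));
    }
    return 0;
}
```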

Cache blocks and Cache Lines
A cache block is a physical part of the cache. A cache line is a section of the address space. A line is brought into a cache block. Of course, line size and block size are the same.
[Figure: Processor, L1 cache controller, and Memory; a cache line from memory is brought into a cache block]

Cache Management
• How is the cache managed?
  – Its job: given an address, find out whether it is in the cache, and return the contents if so
    • Also, write data back to memory when needed
    • And bring data in from memory when needed
  – Ideally, a fully associative cache would be good
    • A cache line can be kept anywhere in the physical cache
    • But looking a line up is hard

Cache management
• Alternative scheme (a direct-mapped cache):
  – Each cache line (i.e., address) has exactly one place in the cache memory where it can be stored
  – Of course, more than one cache line will have the same area of cache memory as its possible target
    • Why? Memory is much larger than the cache
  – Only one cache line can live inside a cache block at a time
  – If you want to bring in a new one, the old one must be "emptied"
• A tradeoff: set-associative caches
  – Have each line map to more than one (say, 4) physical locations
  – (The address decomposition sketched below makes the direct mapping concrete.)
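To make the direct mapping concrete, here is a sketch of the usual address decomposition. The sizes (64-byte lines, 512 blocks, i.e. a 32 KB cache) are assumptions chosen for illustration.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical direct-mapped cache: 64-byte lines, 512 physical blocks.
   Each address maps to exactly one block; the tag records which line
   currently lives in that block. */
#define LINE_BITS  6                    /* 64-byte lines       */
#define BLOCK_BITS 9                    /* 512 physical blocks */
#define NUM_BLOCKS (1u << BLOCK_BITS)

int main(void) {
    uint64_t addr   = 0x7ffe1234;
    uint64_t offset = addr & ((1u << LINE_BITS) - 1);
    uint64_t block  = (addr >> LINE_BITS) & (NUM_BLOCKS - 1);
    uint64_t tag    = addr >> (LINE_BITS + BLOCK_BITS);
    printf("addr %#llx -> block %llu, tag %#llx, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)block,
           (unsigned long long)tag, (unsigned long long)offset);
    /* Two addresses whose 'block' fields match compete for the same
       physical block; a 4-way set-associative cache would instead give
       each index four candidate blocks. */
    return 0;
}
```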

Parallel Machines: an abstract introduction
• Our main focus will be on three kinds of machines
  – Bus-based shared memory machines
  – Scalable shared memory machines
    • Cache coherent
    • Hardware support for remote memory access
  – Distributed memory machines

Bus based machines
[Figure: memories Mem 0, Mem 1, …, Mem k and processors PE 0, PE 1, …, PE N-1 sharing a single bus]

Bus based machines
• Any processor can access any memory location
  – Read and write
• Bus bandwidth is a limiting factor
• Also, how do you deal with two processors changing the same data?
  – Locks (more on this later; a small example follows)
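A minimal POSIX-threads sketch (not from the slides) of the shared-data problem and the lock that fixes it: without the mutex, the two read-modify-write sequences can interleave and lose increments.

```c
/* build with: cc -pthread race.c */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;              /* read-modify-write, protected by the lock */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* always 2000000 with the lock */
    return 0;
}
```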

Scalable shared memory m/cs
[Figure: processors PE 0, … connected to memories Mem 0, … through an Interconnection Network with support for remote memory access]
Not popular, as all data is slow to access.

Distributed memory m/cs
[Figure: each processor PE 0, PE 1, …, PE p has its own local memory Mem 0, Mem 1, …, Mem p; processors are connected by an Interconnection Network]

Introducing caches into the picture!
• Now we have more complex problems:
  – They can't be fixed by locks alone:
  – Copies of the same variable in two different caches may contain different values
• The cache controller must do more (see the sketch below)
[Figure: memories Mem 0, Mem 1, …, Mem p-1 on a shared bus; each processor PE 0, PE 1, …, PE p-1 has its own cache]
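A programmer's-eye sketch of why coherence matters (not from the slides): PE 0 publishes a value that PE 1 spins on. If each processor's cache could keep its own stale copy of `ready`, the loop might never terminate; hardware coherence, helped here by C11 atomics (which also stop the compiler from caching the flag in a register), is what makes the update visible.

```c
/* build with: cc -pthread coherence.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data = 0;
static atomic_int ready = 0;

static void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load(&ready))   /* waits until the writer's update is visible */
        ;
    printf("data = %d\n", data);   /* prints 42 on a coherent machine */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);
    data = 42;
    atomic_store(&ready, 1);       /* publishes 'data' along with the flag */
    pthread_join(t, NULL);
    return 0;
}
```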

Distributed memory m/cs
[Figure: each processor PE 0, PE 1, …, PE p-1 now has a cache in front of its local memory Mem 0, …, Mem p-1; an Interconnection Network connects the nodes]

Writing parallel programs
• Programming model
  – How should a programmer view the parallel machine?
  – Sequential programming: the von Neumann model
• Parallel programming models:
  – Shared memory (shared address space) model
  – Message passing model
  – Shared objects model
(A minimal message-passing example follows.)
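As a taste of the message passing model, here is a minimal MPI sketch in C (MPI is one concrete realization of the model; the slides do not prescribe it). In the shared memory model, the two processes would instead read and write the same variable directly.

```c
/* build with mpicc, run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;                       /* no shared variables: rank 0 must */
        MPI_Send(&value, 1, MPI_INT,      /* explicitly send its value        */
                 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT,
                 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```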