Chapter 2 Parallel Architecture
Moore’s Law
• The number of transistors on a chip doubles every 18-24 months.
  – Has been valid for over 40 years
  – Can’t improve performance with increased clock frequency due to heat issues
• What to do with the additional transistors?
  – Faster ways to compute operations
  – Pipelining
  – Cache
  – MULTICORE!!
Levels of Parallelism
• Bit Level
  – Word size related – how many bits do we work on at one time (4, 8, 16, 32, 64). Parallel vs. serial addition
• Pipelining
  – Break an instruction into components and process them like an assembly line:
    • Fetch
    • Decode
    • Execute
    • Write-back
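Bit-level parallelism can be seen even without special hardware. The following is a minimal sketch in C (the function name add8x8 and the test values are illustrative, not from the slides) of a SWAR-style trick that adds eight packed bytes in one pass over a 64-bit word instead of eight serial byte additions:

    #include <stdint.h>
    #include <stdio.h>

    /* SWAR (SIMD-within-a-register) add of eight packed bytes.
       The high bit of each byte is masked off so carries cannot
       cross byte boundaries, then restored with XOR. */
    static uint64_t add8x8(uint64_t a, uint64_t b) {
        uint64_t low = (a & 0x7F7F7F7F7F7F7F7FULL) + (b & 0x7F7F7F7F7F7F7F7FULL);
        return low ^ ((a ^ b) & 0x8080808080808080ULL);
    }

    int main(void) {
        uint64_t a = 0x0102030405060708ULL;
        uint64_t b = 0x1010101010101010ULL;
        printf("%016llx\n", (unsigned long long)add8x8(a, b)); /* 1112131415161718 */
        return 0;
    }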
Parallelism Levels
• Pipeline (cont.)
  – Each component can work on its stage of a different instruction in parallel.
  – Superpipeline – many stages.
• Multiple functional units
  – Have a separate unit for various functions (integer arithmetic, float arithmetic, load/store, etc.) that can execute in parallel as long as there are no dependencies.
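To make the dependency point concrete, here is a hedged C sketch (the function names chained and split are invented for illustration): the first loop is one serial chain of additions, while the second exposes two independent chains that hardware can issue to separate functional units. Note that reassociating floating-point sums can change rounding slightly.

    /* Same arithmetic, different dependency structure. */
    double chained(const double *x, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i];              /* each add depends on the previous one */
        return s;
    }

    double split(const double *x, int n) {
        double s0 = 0.0, s1 = 0.0;  /* two independent chains */
        for (int i = 0; i + 1 < n; i += 2) {
            s0 += x[i];             /* these two adds have no      */
            s1 += x[i + 1];         /* dependency on each other    */
        }
        if (n & 1) s0 += x[n - 1];  /* odd leftover element */
        return s0 + s1;
    }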
Parallelism Levels (cont.)
• Process/Thread Level
  – Used for multicore/multiple processors
  – Issues with shared memory/cache vs. distributed memory
  – Done at the programming level
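As a minimal illustration of thread-level parallelism expressed at the programming level, the POSIX-threads sketch below (the worker function and the thread count of 4 are arbitrary choices) creates several threads and waits for them; compile with -pthread:

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        int id = *(int *)arg;
        printf("thread %d running\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t tid[4];
        int ids[4];
        for (int i = 0; i < 4; i++) {
            ids[i] = i;
            pthread_create(&tid[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < 4; i++)
            pthread_join(tid[i], NULL);   /* wait for all threads to finish */
        return 0;
    }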
Flynn’s Taxonomy

    Data \ Instruction    Single    Multiple
    Single                SISD      MISD
    Multiple              SIMD      MIMD

• SISD – Standard single processor working on a single data item (pair) at a time
• MISD – NOT FEASIBLE
• SIMD – One instruction is performed on multiple data items – also called a vector processor since it looks like the operands are vectors of data
• MIMD – Multiple processing elements working on data independently.
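As a source-level sketch of the SIMD idea, the loop below applies one instruction stream to many independent data elements; a vectorizing compiler can map it onto vector instructions that process several floats at once. The name saxpy is a conventional example, not something defined in the slides:

    /* One instruction stream, many data elements: each iteration is
       independent, so the compiler can emit SIMD (vector) instructions
       that compute 4 or 8 of these multiply-adds per instruction. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }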
Memory Organization
• Distributed Memory
  – Memory is local and private to each processor
  – Sharing information is done via message passing between nodes
  – Faster system if the nodes have DMA
    • Direct Memory Access – the controller can get values from or put values into memory without using the processor
    • The processor can keep processing while information is transferred.
  – Faster with routers to handle data transfer without bothering the processor
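A minimal message-passing sketch, assuming an MPI installation (compile with mpicc, run with mpirun -np 2): rank 0’s data lives in its own private memory and reaches rank 1 only through an explicit message:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;   /* exists only in rank 0's private memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }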
Memory Organization (cont.)
• Shared Memory
  – Memory is global and “public”
  – Processes share variables for communication
  – Concerned about “race” conditions – different results with different execution orders. Ex.:

        Processor 1            Processor 2
        (A) LW   $t1, X        (D) LW   $t1, X
        (B) ADDI $t1, 1        (E) ADDI $t1, 1
        (C) SW   $t1, X        (F) SW   $t1, X

  – ABCDEF gives a different result from ADBECF
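The same race in C: two threads increment a shared counter with no synchronization, so each counter++ is a load/add/store triple that can interleave exactly like the ABCDEF vs. ADBECF orders above. The loop bound is arbitrary; guarding the increment with a pthread mutex would force a serial order and restore the expected total:

    #include <pthread.h>
    #include <stdio.h>

    long counter = 0;   /* shared: visible to both threads */

    static void *increment(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;              /* racy: load, add-immediate, store */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, increment, NULL);
        pthread_create(&t2, NULL, increment, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* usually prints less than 2000000: interleaved increments were lost */
        printf("counter = %ld\n", counter);
        return 0;
    }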
Shared Memory
• Generally easier programming
• Does not scale well (hardware issues with many processors hitting memory at the same time)
• Cache coherence – if each processor/core has its own cache, then the same global memory location may be mapped into two caches that get updated independently (non-shared cache)
• NUMA – non-uniform memory access: there is a hierarchy of memories with different access times.
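Coherence traffic costs even when there is no logical race. In this hedged sketch (the 64-byte cache-line size is an assumption; check your hardware), two threads update entirely separate counters, but without the padding both counters would share one cache line that ping-pongs between the cores’ caches:

    #include <pthread.h>

    /* Padding (assumed 64-byte lines) gives each counter its own line,
       avoiding "false sharing" between the two cores' caches. */
    struct padded { long value; char pad[64 - sizeof(long)]; };
    struct padded counters[2];

    static void *bump(void *arg) {
        long idx = (long)arg;
        for (int i = 0; i < 10000000; i++)
            counters[idx].value++;   /* each thread touches only its own data */
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, bump, (void *)0L);
        pthread_create(&t1, NULL, bump, (void *)1L);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }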
Memory Organization
• Virtually shared memory
  – There is a difference between the programmer’s view and the hardware
  – The programmer writes the code as if the memory were shared, but the memory, in reality, is distributed
  – The system automatically generates the messages to get values to the proper processor
  – Definitely NUMA
Cache Memory
• Small, fast memory between processor and main memory
  – Is feasible because of temporal and spatial locality
  – Holds a subset of main memory
  – Cache hit vs. miss
  – Cache mapping issues
  – COHERENCY
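A standard illustration of spatial locality (the array size N is arbitrary): C stores a[i][j] row by row, so the first loop below walks consecutive addresses and mostly hits in cache, while the second strides through memory and misses far more often, even though both compute the same sum:

    #define N 1024
    double a[N][N];

    double sum_rows(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];   /* consecutive addresses: cache-friendly */
        return s;
    }

    double sum_cols(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];   /* N*8-byte stride per access: cache-hostile */
        return s;
    }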
Thread Level Parallelism
• Multithreading – multiple threads executing “simultaneously” on a single processor.
  – Can’t be truly simultaneous on a single processor
  – CPU swaps between threads based on time (timeslicing) and/or switch-on-event (one thread waiting for I/O).
• Multiple cores allow true simultaneity.
Simultaneous Multithreading
• Requires multiple functional units and replication of the PC register and all the general user registers (the state of the machine)
  – Creates “Logical Processors”
• Allows multiple instructions from different threads to execute simultaneously as long as they do not overlap functional units
Multicore Processors
• Needs an OS that recognizes and schedules tasks on the different cores.
• Can run different programs on different cores
• Programs need to be written so that parts of the program can run simultaneously on separate cores; otherwise multiple cores give no improvement in time (see the sketch below)
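A minimal sketch, assuming a POSIX system, of letting a program match its thread count to the hardware; sysconf(_SC_NPROCESSORS_ONLN) reports the cores currently online:

    #include <unistd.h>
    #include <stdio.h>

    int main(void) {
        long cores = sysconf(_SC_NPROCESSORS_ONLN);
        if (cores < 1) cores = 1;   /* fall back if the query fails */
        printf("online cores: %ld\n", cores);
        /* a real program would spawn `cores` worker threads here */
        return 0;
    }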
Multicore Architecture
• One main issue is the location of the cache(s)
  – Each core has a local cache
  – Every core shares a cache
  – Each core has a local L1 cache and shares an L2 cache
• Caches need to communicate for coherency
  – Network communication
  – Pipeline (each cache passes updates to the next)
Interconnection Networks
• Ideally, every node would be connected to every other node so that two mutually exclusive pairs of nodes could communicate simultaneously.
  – Requires O(n²) connections – does not scale well
• Most use a significantly reduced set of connections (cost vs. speed)
  – Routing technique needed if not fully connected.
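To put numbers on the scaling claim, the sketch below compares a fully connected network, which needs n(n-1)/2 = O(n²) links, against a hypercube, one common reduced topology (the choice of hypercube is our illustration, not from the slides), which needs only (n/2)·log₂(n) links:

    #include <stdio.h>

    int main(void) {
        for (int n = 4; n <= 1024; n *= 4) {
            long full = (long)n * (n - 1) / 2;        /* O(n^2) links */
            int dim = 0;
            for (int m = n; m > 1; m >>= 1) dim++;    /* log2(n), n a power of two */
            long cube = (long)n * dim / 2;            /* hypercube links */
            printf("n=%4d  fully connected=%7ld  hypercube=%5ld\n", n, full, cube);
        }
        return 0;
    }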