Other Architectures & Examples
• Multithreaded architectures
• Dataflow architectures
• Multiprocessor examples
1st May, 2006
Anshul Kumar, CSE IITD

Context switching
• Delays and poor resource utilization due to
  – data/control hazards
  – cache misses
  – waiting for some event
• Solution
  – context switch to another thread
• Context switch mechanism
  – operating system: slow
  – hardware: fast
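To make the slow/fast contrast concrete, here is a toy simulation of switch-on-event multithreading. It is a sketch under stated assumptions, not a model of any real machine: every thread alternates `run_length` cycles of useful work with a stall of `latency` cycles (for example a cache miss), threads resume in round-robin order, and each switch costs `switch_cost` cycles that overlap the stalled thread's wait. The function name and its parameters are hypothetical.

```python
def switch_on_event_utilization(num_threads, run_length, latency, switch_cost, rounds=1000):
    """Fraction of cycles spent on useful work (toy model, assumptions in the text)."""
    ready_at = [0] * num_threads  # cycle at which each thread's stall is over
    t = 0                         # current cycle
    busy = 0                      # cycles of useful work
    for step in range(rounds * num_threads):
        i = step % num_threads    # round-robin choice of the next thread
        t = max(t, ready_at[i])   # processor idles if that thread is still stalled
        t += run_length           # run until the next long-latency event
        busy += run_length
        ready_at[i] = t + latency # the event resolves `latency` cycles later
        if num_threads > 1:
            t += switch_cost      # cost of switching to the next thread
    return busy / t


if __name__ == "__main__":
    # 10 cycles of work between misses, 90-cycle miss latency
    print(switch_on_event_utilization(1, 10, 90, 0))   # single thread: about 0.10
    print(switch_on_event_utilization(8, 10, 90, 1))   # fast (hardware) switch: about 0.80
    print(switch_on_event_utilization(8, 10, 90, 50))  # slow (software) switch: about 0.17
```

With a 1-cycle switch, utilization grows roughly in proportion to the number of threads until the other threads' work covers the latency. With a 50-cycle switch, the switching overhead itself caps utilization at run_length / (run_length + switch_cost) = 10/60 in this model.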

Multithreaded architecture
• Hardware context switching
• Models
  – control flow or hybrid (control flow + data flow)
• Granularity
  – fine grain or coarse grain
• Memory organization
  – shared? distributed? cache coherent?
• Number of threads
  – small, medium, large

ILP and Multithreading
[Figure (Hennessy and Patterson): issue-slot usage compared for ILP, coarse-grained MT, fine-grained MT and SMT]

Chip-level multithreading (Wikipedia)
Executing instructions from multiple threads within one processor chip at the same time.
• Multithreading: interleaved issue of multiple instructions from different threads
• Simultaneous multithreading (SMT): issue of multiple instructions from multiple threads in one cycle
• Chip-level multiprocessing (CMP, or multicore): two or more superscalar processors integrated into one chip, each executing one thread independently
• Any combination of multithreading, SMT and CMP
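The difference between these issue policies shows up when counting filled issue slots. The sketch below assumes a hypothetical 4-wide core and made-up per-cycle counts of ready instructions for three threads. It ignores dependences between cycles, so it only illustrates which threads may issue in a given cycle, not real throughput.

```python
WIDTH = 4  # issue slots per cycle (hypothetical 4-wide core)

# ready[t][c]: instructions thread t could issue in cycle c (made-up numbers)
ready = [
    [2, 0, 1, 3, 0, 1],
    [1, 2, 0, 1, 2, 0],
    [3, 1, 2, 0, 1, 2],
]


def single_thread(ready, width=WIDTH):
    """Superscalar ILP only: thread 0 uses the core alone."""
    return sum(min(width, n) for n in ready[0])


def fine_grained(ready, width=WIDTH):
    """Fine-grained MT: each cycle issues only from thread (cycle mod #threads)."""
    return sum(min(width, ready[c % len(ready)][c]) for c in range(len(ready[0])))


def smt(ready, width=WIDTH):
    """SMT: each cycle fills the issue slots from all threads at once."""
    return sum(min(width, sum(t[c] for t in ready)) for c in range(len(ready[0])))


total = WIDTH * len(ready[0])
for name, policy in [("single thread", single_thread), ("fine-grained MT", fine_grained), ("SMT", smt)]:
    print(f"{name}: {policy(ready)}/{total} slots")
```

With these numbers the single thread fills 7 of 24 slots, fine-grained interleaving 13, and SMT 20, because only SMT can combine instructions from several threads in the same cycle.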

Historical Examples
Machine          Year  Granularity  Processors  Threads/proc    Memory
HEP (Denelcor)   1978  fine         max 16      64 (8 active)   centralized shared
Tera             1990  fine         max 256     max 128         distributed shared

Modern examples
• Pentium 4: Hyper-Threading
• MIPS MT: fine-grained multithreading
• IBM Power5: dual core, 2 threads each
• UltraSPARC T1: 8 cores with 4 threads each

HEP
[Figure: HEP processor organization. Control loop: scheduler function unit, PSW queue, increment control and an 8-stage pipeline. Also shown: program memory, operand fetch, registers, matching unit, SFU, function units FU 1, FU 2, ..., FU n, and the path to/from data memory.]
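The control loop can be sketched as a round-robin queue of process status words (PSWs). The sketch assumes a simplified HEP-like rule: a process has at most one instruction in the 8-stage pipeline, so its PSW rejoins the queue only when that instruction completes. Under that assumption at least eight ready processes are needed to start an instruction every cycle. The function name is hypothetical.

```python
from collections import deque

PIPELINE_DEPTH = 8  # stages in the instruction pipeline


def pipeline_utilization(num_processes: int, cycles: int = 800) -> float:
    """Fraction of cycles in which a new instruction enters the pipeline."""
    psw_queue = deque(range(num_processes))  # PSWs of processes ready to issue
    in_flight = deque()                      # (completion_cycle, pid), oldest first
    issued = 0
    for cycle in range(cycles):
        # PSWs whose instruction has left the pipeline rejoin the queue.
        while in_flight and in_flight[0][0] <= cycle:
            psw_queue.append(in_flight.popleft()[1])
        if psw_queue:
            pid = psw_queue.popleft()        # scheduler takes the next PSW
            in_flight.append((cycle + PIPELINE_DEPTH, pid))
            issued += 1
    return issued / cycles


for n in (1, 4, 8, 16):
    print(n, pipeline_utilization(n))  # 0.125, 0.5, 1.0, 1.0
```

The design point this illustrates is that no interlocks are needed between successive instructions of one process: latency is hidden by interleaving many processes rather than by forwarding hardware.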

Control Flow & Data Flow models
• Control flow (von Neumann)
  – control flows through a sequence of instructions; branches can alter the flow
  – instructions get data from or put data in memory
  – explicit parallelism through control operators (fork/join)
• Data flow
  – instructions are triggered by availability of data
  – data flows from instruction to instruction
  – explicit parallelism

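Dataflow Model
[Figure: dataflow graph for R = (A-B)*(B+1). A and B feed a "-" node producing A-B; B and the constant 1 feed a "+" node producing B+1; a "*" node combines the two results to give R.]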

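Dataflow Program
Each instruction sends its result to destinations written label/operand (e.g. L4/1 is the first operand of L4):
  L1: A        → L2/1
  compute B    → L2/2, L3/1
  L2: -        → L4/1   (A-B)
  L3: + 1      → L4/2   (B+1)
  L4: *        → L6/1   (R = (A-B)*(B+1))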
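The firing rule can be sketched as a small interpreter over this program. The encoding is hypothetical: each instruction is stored with its operation, its number of operands and its destination (label, operand) pairs; A is assumed to be delivered to L2/1; the constant 1 of L3 is folded into its operation; and L6, which lies outside the fragment, is treated as the output.

```python
import operator

# Hypothetical encoding: label -> (operation, number of operands, destinations)
PROGRAM = {
    "L2": (operator.sub, 2, [("L4", 1)]),     # A - B
    "L3": (lambda b: b + 1, 1, [("L4", 2)]),  # B + 1
    "L4": (operator.mul, 2, [("L6", 1)]),     # (A-B) * (B+1)
}


def run_dataflow(a, b):
    """Fire each instruction as soon as all of its operands have arrived."""
    slots = {label: {} for label in PROGRAM}  # operands received per instruction
    ready = []                                # instructions whose operands are complete
    outputs = {}                              # tokens leaving this fragment (L6)

    def send(value, dest):
        label, slot = dest
        if label not in PROGRAM:
            outputs[dest] = value
            return
        slots[label][slot] = value
        if len(slots[label]) == PROGRAM[label][1]:
            ready.append(label)

    send(a, ("L2", 1))                        # L1 produces A
    for dest in (("L2", 2), ("L3", 1)):       # B goes to both consumers
        send(b, dest)

    while ready:                              # no program counter: any ready instruction may fire
        label = ready.pop(0)
        op, arity, dests = PROGRAM[label]
        result = op(*(slots[label][i] for i in range(1, arity + 1)))
        for dest in dests:
            send(result, dest)
    return outputs[("L6", 1)]


print(run_dataflow(5, 2))  # (5-2)*(2+1) = 9
```

L2 and L3 become ready at the same time and could fire in parallel; the interpreter serializes them only because it is a single Python loop.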

Static Dataflow Architecture
[Figure: static dataflow processing element: activity store, fetch unit, instruction queue, function units FU 1, FU 2, ..., FU n, and update unit, with a path to/from other PEs.]

Tagged-token dataflow architecture
[Figure: tagged-token dataflow processing element: matching unit with matching store, fetch unit with instruction/data memory, function units FU 1, FU 2, ..., FU n, form-token unit, and token queue, with a path to/from other PEs.]
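The matching unit's job can be sketched as a dictionary keyed by (tag, destination instruction). This is a simplified sketch, not a specific machine's design: the `Token` layout and function names are hypothetical, all instructions are assumed to take two operands, and the result is only recorded rather than passed on to a form-token unit. Because tokens carry tags, two activations of the same instruction (for example two loop iterations) can be in progress at once, which a static dataflow machine does not allow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    tag: int    # activation/iteration identifier
    dest: str   # destination instruction
    port: int   # operand slot (1 or 2)
    value: int


def match_and_fire(tokens, instructions):
    """Process tokens in arrival order; return {(tag, dest): result}."""
    matching_store = {}  # (tag, dest) -> token waiting for its partner
    results = {}
    for tok in tokens:
        key = (tok.tag, tok.dest)
        partner = matching_store.pop(key, None)
        if partner is None:
            matching_store[key] = tok  # first operand: wait in the matching store
            continue
        left, right = sorted((tok, partner), key=lambda t: t.port)
        results[key] = instructions[tok.dest](left.value, right.value)
    return results


tokens = [Token(0, "sub", 1, 10), Token(1, "sub", 1, 20),
          Token(1, "sub", 2, 5), Token(0, "sub", 2, 3)]
print(match_and_fire(tokens, {"sub": lambda x, y: x - y}))  # iteration 1 fires before iteration 0
```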

UMA Examples
• Earlier approach: large number of processors (e.g. Denelcor HEP, NYU Ultracomputer)
• Now realized: good only for a small number of processors (e.g. Encore Multimax, 1980s; SGI Power Challenge, 1990s)

SGI Power Challenge
• 18 MIPS R8000 processors
• 16 GB RAM, 8-way interleaved
• 4 Power Channel-2 I/O buses, each 320 MB/s
• Power Path-2: split-transaction shared bus (256-bit data, 40-bit address)
• Snoopy cache coherence protocol
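The slide does not say which snoopy protocol is used, so the sketch below shows a generic three-state MSI invalidation protocol for a single memory block, not the Power Challenge's actual protocol. Every cache watches (snoops) transactions on the shared bus: a read miss issues `BusRd`, a write without ownership issues `BusRdX`, and a cache holding a modified copy writes it back and supplies the data. Class and transaction names are illustrative.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


class Bus:
    """Shared bus snooped by every cache (one memory block for brevity)."""
    def __init__(self, num_caches):
        self.memory = 0
        self.caches = [Cache(i, self) for i in range(num_caches)]

    def broadcast(self, sender, op):
        """Deliver a transaction to all other caches; return data supplied by one of them."""
        data = None
        for cache in self.caches:
            if cache is not sender:
                supplied = cache.snoop(op)
                if supplied is not None:
                    data = supplied
        return data


class Cache:
    def __init__(self, cid, bus):
        self.cid, self.bus = cid, bus
        self.state, self.value = State.INVALID, None

    def read(self):
        if self.state is State.INVALID:                # read miss -> BusRd
            supplied = self.bus.broadcast(self, "BusRd")
            self.value = supplied if supplied is not None else self.bus.memory
            self.state = State.SHARED
        return self.value

    def write(self, value):
        if self.state is not State.MODIFIED:           # need ownership -> BusRdX
            self.bus.broadcast(self, "BusRdX")
        self.state, self.value = State.MODIFIED, value

    def snoop(self, op):
        """React to another cache's transaction; return data if we must supply it."""
        supplied = None
        if self.state is State.MODIFIED:
            self.bus.memory = self.value               # write back the dirty copy
            supplied = self.value
        if op == "BusRd" and self.state is not State.INVALID:
            self.state = State.SHARED
        elif op == "BusRdX":
            self.state = State.INVALID
        return supplied


bus = Bus(3)
c0, c1, c2 = bus.caches
c0.write(42)       # c0: M
print(c1.read())   # c0 supplies 42; c0 and c1 end in S
c2.write(7)        # c0 and c1 are invalidated; c2: M
```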

NUMA Examples
• BBN TC2000
• IBM RP3
• Hector
• Cray T3D

Hector
• Hierarchical structure
  – global ring
  – local rings
  – stations
  – processor modules (processor + cache + memory) and I/O modules

[Figure: Hector organization: a global ring connects local rings; each local ring connects stations; each station has a station controller and a station bus linking processor modules and an I/O module.]

Cray T3D
• Alpha 21064 processors
• Cray Y-MP host
• Up to 128 GB memory
• 3D torus network: 4 x 4 x 4, configurable up to 8 x 8 x 8
• 2 PEs in each node
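The torus topology can be sketched with node coordinates (x, y, z): every node has six neighbours because links wrap around in each dimension, and the minimal distance along one axis is the shorter way round that ring. The function names are hypothetical, and the sketch says nothing about the T3D's actual routing algorithm.

```python
from itertools import product


def torus_neighbors(node, dims):
    """Six neighbours of `node` (x, y, z) in a 3D torus with wraparound links."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


def torus_hops(a, b, dims):
    """Minimal hop count: on each axis, go the shorter way round the ring."""
    return sum(min(abs(p - q), k - abs(p - q)) for p, q, k in zip(a, b, dims))


def diameter(dims):
    """Largest minimal distance; by symmetry it suffices to start from the origin."""
    return max(torus_hops((0, 0, 0), n, dims) for n in product(*(range(k) for k in dims)))


print(torus_hops((0, 0, 0), (3, 0, 0), (4, 4, 4)))  # 1: the wraparound link is used
print(diameter((4, 4, 4)), diameter((8, 8, 8)))     # 6 12
```

Growing from 4 x 4 x 4 to 8 x 8 x 8 multiplies the node count by 8 but only doubles the diameter.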

CC-NUMA examples
Machine               Nodes                           Network                               Memory          Cache
Wisconsin Multicube   single proc                     bus grid                              per column bus  snoopy
Aquarius Multimulti   single proc                     bus grid                              per node        snoopy + directory
Stanford Dash         cluster of 4 R3000+FPU on bus   pair of meshes                        per cluster     directory
Stanford Flash        single proc, T5 + MAGIC chip    2D mesh                               per node        directory
Convex Exemplar       hypernode of 8 PA-RISC          crossbar (in hypernode), multi rings  per hypernode   SCI
MAGIC chip: memory + I/O + network controller
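Directory-based coherence can be sketched with a generic full-map directory kept at each block's home node: a set of sharers plus a dirty bit. This is a textbook-style sketch, not the actual Dash, Flash or SCI protocol, and the message names (`invalidate`, `fetch_and_downgrade`, `data_reply`, ...) are hypothetical. Unlike snooping, the home node sends point-to-point messages only to the nodes that hold copies.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    """Full-map entry for one memory block at its home node."""
    sharers: set = field(default_factory=set)  # nodes holding a copy
    dirty: bool = False                        # True if one owner holds a modified copy


class HomeDirectory:
    def __init__(self):
        self.entries = {}

    def _entry(self, block):
        return self.entries.setdefault(block, DirectoryEntry())

    def read_miss(self, block, requester):
        """Return the messages the home node sends for a read miss."""
        e = self._entry(block)
        msgs = []
        if e.dirty:
            (owner,) = e.sharers
            msgs.append(("fetch_and_downgrade", owner))  # owner writes back, keeps a shared copy
            e.dirty = False
        e.sharers.add(requester)
        msgs.append(("data_reply", requester))
        return msgs

    def write_miss(self, block, requester):
        """Return the messages the home node sends for a write miss."""
        e = self._entry(block)
        kind = "fetch_and_invalidate" if e.dirty else "invalidate"
        msgs = [(kind, node) for node in sorted(e.sharers - {requester})]
        e.sharers, e.dirty = {requester}, True
        msgs.append(("data_reply", requester))
        return msgs


d = HomeDirectory()
d.read_miss(0x40, 1)
d.read_miss(0x40, 2)
print(d.write_miss(0x40, 3))  # invalidations go only to nodes 1 and 2
```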

COMA examples
• DDM (Data Diffusion Machine)
  – single bus (split transaction)
  – can be made hierarchical
• KSR1
  – hierarchical rings
  – distributed directory is a matrix: rows for pages, columns for caches
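The matrix directory can be sketched directly: entry [page][cache] holds that page's state in that cache. In a COMA machine there is no fixed home memory, so a read miss copies the page from whichever cache holds it, and a write makes the writer's cache the only holder (the page migrates). This is a generic sketch with hypothetical names and a simple I/S/E state set, not the KSR1's actual protocol; the round-robin initial placement is also an assumption.

```python
INVALID, SHARED, EXCLUSIVE = "I", "S", "E"


class MatrixDirectory:
    """Directory as a matrix: one row per page, one column per cache."""

    def __init__(self, num_pages, num_caches):
        self.state = [[INVALID] * num_caches for _ in range(num_pages)]
        for page in range(num_pages):  # every page starts in exactly one cache
            self.state[page][page % num_caches] = EXCLUSIVE

    def holders(self, page):
        return [c for c, s in enumerate(self.state[page]) if s != INVALID]

    def read(self, page, cache):
        row = self.state[page]
        if row[cache] == INVALID:        # miss: replicate from any current holder
            for c in self.holders(page):
                row[c] = SHARED
            row[cache] = SHARED
        return self.holders(page)

    def write(self, page, cache):
        row = self.state[page]
        for c in range(len(row)):        # invalidate every other column of the row
            row[c] = INVALID
        row[cache] = EXCLUSIVE           # the page migrates to the writer
        return self.holders(page)


d = MatrixDirectory(num_pages=4, num_caches=4)
print(d.read(1, 3))   # page 1 now in caches 1 and 3
print(d.write(1, 0))  # only cache 0 holds page 1
```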

Distributed Memory Architecture Examples
Machine         Topology
nCUBE 2         hypercube
iPSC/2          hypercube
Intel Paragon   2D mesh
Genesis         2-level crossbar
Manna           16 x 16 crossbar, hierarchical
Parsytec        3D mesh
Transtech       variable
Parsytec nodes: PowerPC 601 (computation), T805 (communication), C004 (switch).
Transtech nodes: i860 (computation), T805 (communication), C004 (switch).
Other node components listed for these machines: custom processors and switches, i386, i860, i870, and vector processors on some machines.
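For the hypercube machines in the table, node addresses are d-bit numbers and two nodes are linked when their addresses differ in exactly one bit. The sketch below shows that neighbour rule and e-cube routing, a standard deterministic hypercube routing scheme that corrects differing address bits from the lowest dimension upward; the slide does not say which routing each machine used, and the function names are hypothetical.

```python
def hypercube_neighbors(node: int, dim: int) -> list[int]:
    """Nodes adjacent to `node` in a dim-dimensional hypercube (differ in one bit)."""
    return [node ^ (1 << k) for k in range(dim)]


def ecube_route(src: int, dst: int, dim: int) -> list[int]:
    """Dimension-ordered (e-cube) route: fix differing address bits from bit 0 upward."""
    path = [src]
    node = src
    for k in range(dim):
        if (node ^ dst) & (1 << k):
            node ^= 1 << k
            path.append(node)
    return path


print(hypercube_neighbors(0, 3))     # [1, 2, 4]
print(ecube_route(0b000, 0b101, 3))  # [0, 1, 5]
```

The number of hops equals the number of bits in which source and destination differ, so the diameter of a d-dimensional hypercube is d.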
