Vector IRAM Overview
C. E. Kozyrakis, 8/2000
• A processor architecture for embedded/portable systems running media applications
  – Based on vector processing and embedded DRAM
  – Simple, scalable, and efficient
  – Good compiler target
• Microprocessor prototype with
  – 256-bit vector processor, 16 MBytes DRAM
  – 150 million transistors, 290 mm²
  – 3.2 Gops, 2 W at 200 MHz
  – Industrial strength vectorizing compiler
  – Implemented by 6 graduate students

The IRAM Team
• Hardware:
  – Joe Gebis, Christoforos Kozyrakis, Ioannis Mavroidis, Iakovos Mavroidis, Steve Pope, Sam Williams
• Software:
  – Alan Janin, David Judd, David Martin, Randi Thomas
• Advisors:
  – David Patterson, Katherine Yelick
• Help from:
  – IBM Microelectronics, MIPS Technologies, Cray

Outline
• Motivation and goals
• Vector instruction set
• Vector IRAM prototype
  – Microarchitecture and design
• Vectorizing compiler
• Performance
  – Comparison with SIMD
• Future work
  – On vector processors for media applications

PostPC processor applications
• Multimedia processing
  – image/video processing, voice/pattern recognition, 3D graphics, animation, digital music, encryption
  – narrow data types, streaming data, real-time response
• Embedded and portable systems
  – notebooks, PDAs, digital cameras, cellular phones, pagers, game consoles, set-top boxes
  – limited chip count, limited power/energy budget
• Significantly different environment from that of workstations and servers

Motivation and Goals
• Processor features for PostPC systems:
  – High performance on demand for multimedia without continuous high power consumption
  – Tolerance to memory latency
  – Scalable
  – Mature, HLL-based software model
• Design a prototype processor chip
  – Complete proof of concept
  – Explore detailed architecture and design issues
  – Motivation for software development

Key Technologies
• Vector processing
  – High performance on demand for media processing
  – Low power for issue and control logic
  – Low design complexity
  – Well understood compiler technology
• Embedded DRAM
  – High bandwidth for vector processing
  – Low power/energy for memory accesses
  – “System on a chip”

Vector Instruction Set
• Complete load-store vector instruction set
  – Uses the MIPS64™ ISA coprocessor 2 opcode space
  – Architecture state
    • 32 general-purpose vector registers
    • 32 vector flag registers
  – Data types supported in vectors:
    • 64b, 32b, 16b (and 8b)
  – 91 arithmetic and memory instructions
• Not specified by the ISA
  – Maximum vector register length
  – Functional unit datapath width

Vector Architecture State
[Figure: vector architecture state]
• Virtual processors ($vlr): VP0, VP1, …, VP$vlr-1
• General-purpose registers (32): vr0, vr1, …, vr31, with width labeled $vpw
• Flag registers (32): vf0, vf1, …, vf31, 1b each
• Scalar registers: vs0, …, vs15, 64b each
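Because the ISA leaves the maximum vector register length unspecified, a loop over a long array is typically processed in strips no longer than what the hardware supports. The Python sketch below models that idea only; `strip_mine_add` and its `mvl` parameter are hypothetical names, and the value 32 assumes 64b elements in the prototype's registers (32 64b elements per register, per the architecture details later in the deck).

```python
def strip_mine_add(a, b, mvl):
    """Add two equal-length lists in strips of at most `mvl` elements.

    `mvl` stands in for the hardware's maximum vector length, which the
    ISA does not fix. Each loop iteration models one group of vector
    instructions executed with the vector length set to `vl`.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = []
    strips = []  # vector length used by each strip, kept for inspection
    i = 0
    while i < len(a):
        vl = min(mvl, len(a) - i)  # models writing the vector length register
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        strips.append(vl)
        i += vl
    return out, strips


# Usage: 100 elements on hardware with (assumed) 32-element registers.
result, strips = strip_mine_add(list(range(100)), [1] * 100, mvl=32)
```

The same source loop runs unchanged if `mvl` doubles; only the number of strips changes, which is the portability argument the slide makes.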

Vector IRAM ISA Summary
[Figure: ISA summary]
• Scalar: MIPS64 scalar instruction set
• Vector ALU: alu op; s.int, u.int, s.fp, d.fp; operand forms .v, .vs, .sv
• Vector memory: load, store; s.int, u.int; unit stride, constant stride, indexed
• Data widths: 8, 16, 32, 64
• ALU operations: integer, floating-point, convert, logical, vector processing, flag processing
• 91 instructions
• 660 opcodes

Support for DSP
[Figure: fixed-point multiply-add datapath with inputs x (n/2 bits), y (n/2 bits), and z (n bits), operators *, +, Round, and sat, output w (n bits), and label a]
• Support for fixed-point numbers, saturation, rounding modes
• Simple instructions for intra-register permutations for reductions and butterfly operations
  – High performance for dot-products and FFT without the complexity of a random permutation
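As a rough illustration of what saturation and rounding mean for a fixed-point multiply-add, here is a Python sketch for one element. It is one plausible reading of the datapath figure, not the documented VIRAM semantics: the order of operations, the round-half-up mode, and the names `saturate`, `fixed_mul_add`, and `shift` are all assumptions.

```python
def saturate(value, bits):
    """Clamp a signed integer to the range of a `bits`-wide two's-complement word."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def fixed_mul_add(x, y, z, shift, bits=16):
    """Compute sat(z + round((x * y) >> shift)) for signed fixed-point inputs.

    Assumed behavior: the product is rounded to nearest (half up) while
    being shifted right, added to z, and the sum is saturated to `bits` bits
    instead of wrapping around.
    """
    product = x * y
    if shift > 0:
        product = (product + (1 << (shift - 1))) >> shift
    return saturate(z + product, bits)


# Usage: an accumulation that would overflow 16 bits saturates instead.
clipped = fixed_mul_add(255, 255, 32000, shift=0)
```

Saturating rather than wrapping matters for media data: an overflowing pixel or audio sample clips to the extreme value instead of flipping sign.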

Compiler/OS Enhancements
• Compiler support
  – Conditional execution of vector instructions
    • Using the vector flag registers
  – Support for software speculation of load operations
• Operating system support
  – MMU-based virtual memory
  – Restartable arithmetic exceptions
  – Valid and dirty bits for vector registers
  – Tracking of maximum vector length used
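Conditional execution under flag registers is what lets a compiler vectorize a loop whose body contains an `if`. The Python sketch below models the idea with one flag bit per element; `compare_gt` and `masked_add` are hypothetical helper names, not VIRAM instructions.

```python
def compare_gt(a, threshold):
    """Model a vector compare: produce one flag bit per element."""
    return [x > threshold for x in a]


def masked_add(dst, a, b, flags):
    """Model a vector add executed under a flag mask.

    Elements whose flag is set receive a[i] + b[i]; all other elements
    keep their previous value from dst.
    """
    return [x + y if f else d for d, x, y, f in zip(dst, a, b, flags)]


# Scalar loop being vectorized:
#     for i in range(n):
#         if a[i] > 0:
#             c[i] = a[i] + b[i]
a = [1, -2, 3, -4]
b = [10, 10, 10, 10]
c = masked_add([0, 0, 0, 0], a, b, compare_gt(a, 0))
```

The branch disappears: the compare and the add each run once over the whole vector, and the mask decides which results are written.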

VIRAM Prototype Architecture
[Figure: block diagram of the prototype]
• MIPS64™ 5Kc core with FPU, instruction cache (8 KB), data cache (8 KB), CP IF, and SysAD IF (64b)
• Vector unit: flag units 0 and 1, flag register file (512 B), arithmetic units 0 and 1, vector register file (8 KB), memory unit with TLB; 256b datapaths
• JTAG IF and DMA
• Memory crossbar connecting to DRAM0, DRAM1, …, DRAM7 (2 MB each)

Vector Unit Pipeline
• Single-issue, in-order pipeline
• Efficient for short vectors
  – Pipelined instruction start-up
  – Full support for instruction chaining, the vector equivalent of result forwarding
• Hides long DRAM access latency
  – Random access latency could lead to stalls due to long load→use RAW hazards
  – Simple solution: “delayed” vector pipeline

Delayed Vector Pipeline
[Figure: pipeline diagram. Scalar stages F D R E M W; the vector load (VLD) passes through stages A and T and the DRAM access (latency >25 ns) before VW; VADD and VST pass through DELAY stages before VR, VX, and VW, which shortens the load→add RAW hazard. Instruction sequence shown: vld, vadd, vst]
• Random access latency included in the vector unit pipeline
• Arithmetic operations and stores are delayed to shorten RAW hazards
• Long hazards eliminated for the common loop cases
• Vector pipeline length: 15 stages

Handling Memory Conflicts
• Single sub-bank DRAM macro can lead to memory conflicts for non-sequential access patterns
• Solution 1: address interleaving
  – Selects between 3 address interleaving modes for each virtual page
• Solution 2: address decoupling buffer (128 slots)
  – Allows scheduling of long indexed accesses without stalling the arithmetic operations executing in parallel
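To see why a per-page choice of interleaving can help, the Python sketch below counts how many accesses of a strided stream land in each DRAM macro under two address-to-bank mappings. The slides do not describe the three actual modes, so `bank_low_order` and `bank_high_order` are purely illustrative assumptions; only the 8 macros, the 2 MByte macro size, and the 256b interface width come from the deck.

```python
from collections import Counter

NUM_BANKS = 8                  # 8 DRAM macros
LINE_BYTES = 32                # 256b interface width
BANK_BYTES = 2 * 1024 * 1024   # 2 MBytes per macro


def bank_low_order(addr):
    """Hypothetical mode: consecutive 32-byte lines rotate across banks."""
    return (addr // LINE_BYTES) % NUM_BANKS


def bank_high_order(addr):
    """Hypothetical mode: each bank holds one contiguous 2 MByte region."""
    return (addr // BANK_BYTES) % NUM_BANKS


def bank_histogram(mapping, base, stride, count):
    """Count accesses per bank for `count` addresses spaced `stride` bytes apart."""
    return Counter(mapping(base + i * stride) for i in range(count))


# Unit-stride lines spread evenly under low-order interleaving...
even = bank_histogram(bank_low_order, 0, LINE_BYTES, 64)
# ...but a 2 MByte stride hits a single bank under the same mapping,
clash = bank_histogram(bank_low_order, 0, BANK_BYTES, 64)
# while the high-order mapping spreads that stride across all banks.
spread = bank_histogram(bank_high_order, 0, BANK_BYTES, 64)
```

With a single sub-bank per macro, a histogram concentrated in one bank means every access waits for the previous one, so the mapping that flattens the histogram for a given page's access pattern is the one to pick.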

Modular Vector Unit Design
[Figure: one 64b lane containing Integer Datapath 0, FP Datapath, Vector Reg. Elements, Flag Reg. Elements & Datapaths, Integer Datapath 1, and Xbar IF; four lanes under a shared Control block form the 256b vector unit]
• Single 64b “lane” design replicated 4 times
  – Reduces design and testing time
  – Provides a simple scaling model (up or down) without major control or datapath redesign
• Most instructions require only intra-lane interconnect
  – Tolerance to interconnect delay scaling

Floorplan
[Figure: chip floorplan, 14.5 mm × 20.0 mm]
• Technology: IBM SA-27E
  – 0.18 µm CMOS
  – 6 metal layers (copper)
• 290 mm² die area
  – 225 mm² for memory/logic
  – DRAM: 161 mm²
  – Vector lanes: 51 mm²
• Transistor count: ~150M
• Power supply
  – 1.2 V for logic, 1.8 V for DRAM
• Peak vector performance
  – 1.6/3.2/6.4 Gops without multiply-add (64b/32b/16b operations)
  – 3.2/6.4/12.8 Gops with multiply-add
  – 1.6 Gflops (single-precision)
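The peak figures can be reproduced from parameters given elsewhere in the deck (4 lanes of 64b, 2 arithmetic units of which 1 handles FP, 200 MHz). The Python sketch below is a back-of-the-envelope model; `peak_gops` is a hypothetical helper, and counting a multiply-add as two operations is an assumption that happens to match the slide's numbers.

```python
def peak_gops(lanes=4, lane_bits=64, arith_units=2, clock_mhz=200,
              element_bits=64, multiply_add=False):
    """Peak operations per second, in units of 10^9, for the vector unit.

    Each arithmetic unit processes lane_bits // element_bits elements per
    lane per cycle; a multiply-add is counted as two operations (assumed).
    """
    elements_per_lane = lane_bits // element_bits
    ops_per_element = 2 if multiply_add else 1
    ops_per_cycle = lanes * arith_units * elements_per_lane * ops_per_element
    return ops_per_cycle * clock_mhz / 1000.0


# Integer peaks for 64b/32b/16b elements, without and with multiply-add.
plain = [peak_gops(element_bits=w) for w in (64, 32, 16)]
fused = [peak_gops(element_bits=w, multiply_add=True) for w in (64, 32, 16)]
# Single-precision FP uses only one of the two arithmetic units.
fp_single = peak_gops(arith_units=1, element_bits=32)
```

Changing `lanes` to 2 or 1 with 32b elements gives 1.6 and 0.8, which matches the Gops quoted for the smaller floorplans in the deck.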

Alternative Floorplans (1)
• “VIRAM-8MB”: 4 lanes, 8 MBytes, 190 mm², 3.2 Gops at 200 MHz
• “VIRAM-2Lanes”: 2 lanes, 4 MBytes, 120 mm², 1.6 Gops at 200 MHz
• “VIRAM-Lite”: 1 lane, 2 MBytes, 60 mm², 0.8 Gops at 200 MHz

Alternative Floorplans (2)
• “RAMless” VIRAM
  – 2 lanes, 55 mm², 1.6 Gops at 200 MHz
  – 2 high-bandwidth DRAM interfaces and decoupling buffers
  – Vector processors need high bandwidth, but they can tolerate latency

Power Consumption
• Power saving techniques
  – Low power supply for logic (1.2 V)
    • Possible because of the low clock rate (200 MHz)
    • Wide vector datapaths provide high performance
  – Extensive clock gating and datapath disabling
    • Utilizing the explicit parallelism information of vector instructions and conditional execution
  – Simple, single-issue, in-order pipeline
• Typical power consumption: 2.0 W
  – MIPS core: 0.5 W
  – Vector unit: 1.0 W (min ~0 W)
  – DRAM: 0.2 W (min ~0 W)
  – Misc.: 0.3 W (min ~0 W)
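A quick arithmetic check of the breakdown, plus a derived energy-per-operation figure, can be sketched in Python. The helper names are hypothetical, and dividing typical power by the 3.2 Gops peak is an assumption made for illustration; it mixes a typical and a peak number, so the result is only a rough lower bound on energy per operation.

```python
# Typical power breakdown from the slide, in watts.
POWER_W = {"MIPS core": 0.5, "Vector unit": 1.0, "DRAM": 0.2, "Misc.": 0.3}


def total_power(breakdown):
    """Sum the per-component power figures (watts)."""
    return sum(breakdown.values())


def energy_per_op_nj(power_w, gops):
    """Watts divided by 10^9 operations per second gives nanojoules per operation."""
    return power_w / gops


total = total_power(POWER_W)            # should match the 2.0 W headline
nj_per_op = energy_per_op_nj(total, 3.2)  # about 0.6 nJ per 32b operation
```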

VIRAM Compiler
[Figure: compiler structure. Frontends: C, C++, Fortran 95. Optimizer: Cray’s PDGCS. Code generators: T3D/T3E, C90/T90/SV1, SV2/VIRAM]
• Based on Cray’s PDGCS production environment for vector supercomputers
• Extensive vectorization and optimization capabilities including outer loop vectorization
• No need to use special libraries or variable types for vectorization

Compiler Performance
64x64 matrix-matrix multiply, single precision

                                          Performance
Theoretical peak                          1.60 GFLOPS
Handcoded assembly                        1.58 GFLOPS
Compiler                                  0.85 GFLOPS
Compiler with outer loop vectorization    1.51 GFLOPS

  – Performance tuning is currently in progress

Compiler Challenges
• Generate code for variable data type width
  – Vectorizer starts with largest width (64b)
  – At the end, vectorization is discarded and restarted if the greatest width encountered is smaller
  – For simplicity, a single loop will use the largest width present in it
• Consistency between scalar cache and DRAM
  – Problem when vector unit writes cached data
  – Vector unit invalidates cache entries on writes
  – Compiler generates synchronization instructions
    • Vector after scalar, scalar after vector
    • Read after write, write after read, write after write

Performance: Efficiency

                       Peak          Sustained      % of Peak
Image Composition      6.4 GOPS      6.40 GOPS      100%
iDCT                   6.4 GOPS      3.10 GOPS      48.4%
Color Conversion       3.2 GOPS      3.07 GOPS      96.0%
Image Convolution      3.2 GOPS      3.16 GOPS      98.7%
Integer VM Multiply    3.2 GOPS      3.00 GOPS      93.7%
FP VM Multiply         1.6 GFLOPS    1.59 GFLOPS    99.6%
Average                                             89.4%

Performance: Comparison

                     VIRAM    MMX
iDCT                 0.75     3.75 (5.0x)
Color Conversion     0.78     8.00 (10.2x)
Image Convolution    1.23     5.49 (4.5x)
QCIF (176x144)       7.1M     33M (4.6x)
CIF (352x288)        28M      140M (5.0x)

• QCIF and CIF numbers are in clock cycles per frame
• All other numbers are in clock cycles per pixel
• MMX results assume no first level cache misses

Vector Vs. SIMD
• Vector: One instruction keeps multiple datapaths busy for many cycles
  SIMD: One instruction keeps one datapath busy for one cycle
• Vector: Wide datapaths can be used without changes in ISA or issue logic redesign
  SIMD: Wide datapaths can be used either after changing the ISA or after changing the issue width
• Vector: Strided and indexed vector load and store instructions
  SIMD: Simple scalar loads; multiple instructions needed to load a vector
• Vector: No alignment restriction for vectors; only individual elements must be aligned to their width
  SIMD: Short vectors must be aligned in memory; otherwise multiple instructions needed to load them

Vector Vs. SIMD: Example
• Simple example: conversion from RGB to YUV
    Y = [( 9798*R + 19235*G +  3736*B) / 32768]
    U = [(-4784*R -  9437*G +  4221*B) / 32768] + 128
    V = [(20218*R - 16941*G -  3277*B) / 32768] + 128
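For reference, the three formulas can be written as a scalar Python function for one pixel. The coefficients are copied exactly as printed on the slide. Two details are assumptions: the division by 32768 is done as an arithmetic right shift by 15 (which rounds toward negative infinity), and results are clamped to 0..255; `rgb_to_yuv` and `clamp8` are hypothetical names.

```python
def clamp8(value):
    """Clamp to the unsigned 8-bit range (assumed output format)."""
    return max(0, min(255, value))


def rgb_to_yuv(r, g, b):
    """Convert one pixel using the slide's fixed-point coefficients.

    Dividing by 32768 is implemented as >> 15; Python's >> on negative
    integers is an arithmetic shift, so it floors the quotient.
    """
    y = (9798 * r + 19235 * g + 3736 * b) >> 15
    u = ((-4784 * r - 9437 * g + 4221 * b) >> 15) + 128
    v = ((20218 * r - 16941 * g - 3277 * b) >> 15) + 128
    return clamp8(y), clamp8(u), clamp8(v)


# Usage: convert a small list of (R, G, B) pixels.
pixels = [(0, 0, 0), (100, 150, 200), (255, 0, 0)]
converted = [rgb_to_yuv(*p) for p in pixels]
```

Every pixel runs the same nine multiplies with no data-dependent control flow, which is why this kernel maps directly onto vector multiply and multiply-add instructions.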

VIRAM Code
RGBtoYUV:
    vlds.u.b     r_v, r_addr, stride3, addr_inc    # load R
    vlds.u.b     g_v, g_addr, stride3, addr_inc    # load G
    vlds.u.b     b_v, b_addr, stride3, addr_inc    # load B
    xlmul.u.sv   o1_v, t0_s, r_v                   # calculate Y
    xlmadd.u.sv  o1_v, t1_s, g_v
    xlmadd.u.sv  o1_v, t2_s, b_v
    vsra.vs      o1_v, s_s
    xlmul.u.sv   o2_v, t3_s, r_v                   # calculate U
    xlmadd.u.sv  o2_v, t4_s, g_v
    xlmadd.u.sv  o2_v, t5_s, b_v
    vsra.vs      o2_v, s_s
    vadd.sv      o2_v, a_s, o2_v
    xlmul.u.sv   o3_v, t6_s, r_v                   # calculate V
    xlmadd.u.sv  o3_v, t7_s, g_v
    xlmadd.u.sv  o3_v, t8_s, b_v
    vsra.vs      o3_v, s_s
    vadd.sv      o3_v, a_s, o3_v
    vsts.b       o1_v, y_addr, stride3, addr_inc   # store Y
    vsts.b       o2_v, u_addr, stride3, addr_inc   # store U
    vsts.b       o3_v, v_addr, stride3, addr_inc   # store V
    subu         pix_s, len_s
    bnez         pix_s, RGBtoYUV

MMX Code (1)
[Listing flattened in extraction: each instruction/destination column is followed by its source-operand column as a separate run.]
RGBtoYUV:
movq mm1, pxor mm6, movq mm0, psrlq mm1, punpcklbw movq mm7, punpcklbw movq mm2, pmaddwd mm0, movq mm3, pmaddwd mm1, movq mm4, pmaddwd mm2, movq mm5, pmaddwd mm3, punpckhbw pmaddwd mm4, paddd mm0, pmaddwd mm5, movq mm1, paddd mm2, movq mm6,
[eax] mm6 mm1 16 mm0, mm1, mm0 YR0GR mm1 YBG0B mm2 UR0GR mm3 UBG0B mm7, VR0GR mm1 VBG0B 8[eax] mm3 mm1 ZEROS mm6;
paddd mm4, movq mm5, psllq mm1, paddd mm1, punpckhbw movq mm3, pmaddwd mm1, movq mm7, pmaddwd mm5, psrad mm0, movq TEMP0, movq mm6, pmaddwd mm6, psrad mm2, paddd mm1, movq mm5, pmaddwd mm7, psrad mm1, pmaddwd mm3, packssdw pmaddwd mm5, psrad mm4, movq mm1,
mm5 mm1 32 mm7 mm6, ZEROS mm1 YR0GR mm5 YBG0B 15 mm6 mm3 UR0GR 15 mm7 UBG0B 15 VR0GR mm0, mm1 VBG0B 15 16[eax]

MMX Code (2)
[Listing flattened in extraction: each instruction/destination column is followed by its source-operand column as a separate run.]
paddd mm6, movq mm7, psrad mm6, paddd mm3, psllq mm7, movq mm5, psrad mm3, movq TEMPY, packssdw movq mm0, punpcklbw movq mm6, movq TEMPU, psrlq mm0, paddw mm7, movq mm2, pmaddwd mm2, movq mm0, pmaddwd mm7, packssdw add eax, add edx, movq TEMPV,
mm7 mm1 15 mm5 16 mm7 15 mm0 mm2, TEMP0 mm7, mm0 mm2 32 mm0 mm6 YR0GR mm7 YBG0B mm4, 24 8 mm4 mm6 ZEROS mm3
movq mm4, pmaddwd mm6, movq mm3, pmaddwd mm0, paddd mm2, pmaddwd pxor mm7, pmaddwd mm3, punpckhbw paddd mm0, movq mm6, pmaddwd mm6, punpckhbw movq mm7, paddd mm3, pmaddwd mm5, movq mm4, pmaddwd mm4, psrad mm0, paddd mm0, psrad mm2, paddd mm6, movq mm5,
mm6 UR0GR mm0 UBG0B mm7 mm4, mm7 VBG0B mm1, mm6 mm1 YBG0B mm5, mm5 mm4 YR0GR mm1 UBG0B 15 OFFSETW 15 mm7

MMX Code (3)
[Listing flattened in extraction: each instruction/destination column is followed by its source-operand column as a separate run.]
pmaddwd mm7, psrad mm3, pmaddwd mm1, psrad mm6, paddd mm4, packssdw pmaddwd mm5, paddd mm7, psrad mm7, movq mm6, packssdw movq mm4, packuswb movq mm7, paddd mm1, paddw mm4, psrad mm1, movq [ebx], packuswb movq mm5, packssdw paddw mm5, paddw mm3,
UR0GR 15 VBG0B 15 OFFSETD mm2, VR0GR mm4 15 TEMPY mm0, TEMPU mm6, OFFSETB mm5 mm7 15 mm6 mm4, TEMPV mm3, mm7 mm6
movq [ecx], mm4 packuswb mm5, add ebx, 8 add ecx, 8 movq [edx], mm5 dec edi jnz RGBtoYUV
mm3 mm7 mm2 mm4

Performance: FFT (1)
[Figure: FFT performance results]

Performance: FFT (2)
[Figure: FFT performance results]

Future Work
• A platform for ultra-scalable vector coprocessors
• Goals
  – Balance data level and random ILP in the vector design
  – Add another scaling dimension to vector processors
    • Work around the scaling problems of a large register file
  – Allow the generation of numerous configurations for different performance, area (cost), and power requirements
• Approach
  – Cluster-based architecture within lanes
  – Local register files for datapaths
  – Decoupled everything

Ultra-scalable Architecture
[Figure: ultra-scalable architecture diagram]

Benefits
• Two scaling models
  – More lanes: when data level parallelism is plentiful
  – More clusters: when random ILP is available
• Performance, power, cost on demand
  – Simple to derive tens of configurations optimized for specific applications
• Simpler design
  – Simple clusters, simpler register files, trivial chaining control
  – No need for strictly synchronous clusters

Questions to Answer
• Cluster organization
  – How many local registers
• Assignment of instructions to clusters
• Frequency of inter-cluster communication
  – Dependence on the number of clusters, registers per cluster etc.
• Balancing the two scaling methods
  – Scaling the number of lanes vs. scaling the number of clusters
• Special ISA support for the clustered architecture
• Compiler support for the clustered architecture

Conclusions
• Vector IRAM
  – An integrated architecture for media processing
  – Based on vector processing and embedded DRAM
  – Simple, scalable, and efficient
• One thing to keep in mind
  – Use the most efficient solution to exploit each level of parallelism
  – Make the best solutions for each level work together
  – Vector processing is very efficient for data level parallelism

Levels of Parallelism          Efficient Solution
Multi-programming, Thread      MPP? NOW? MT? SMT? CMP?
Irregular ILP                  VLIW? Superscalar?
Data                           VECTOR

Backup slides

Architecture Details (1)
• MIPS64™ 5Kc core (200 MHz)
  – Single-issue core with 6 stage pipeline
  – 8 KByte, direct-mapped instruction and data caches
  – Single-precision scalar FPU
• Vector unit (200 MHz)
  – 8 KByte register file (32 64b elements per register)
  – 4 functional units:
    • 2 arithmetic (1 FP), 2 flag processing
    • 256b datapaths per functional unit
  – Memory unit
    • 4 address generators for strided/indexed accesses
    • 2-level TLB structure: 4-ported, 4-entry microTLB and single-ported, 32-entry main TLB
    • Pipelined to sustain up to 64 pending memory accesses

Architecture Details (2)
• Main memory system
  – No SRAM cache for the vector unit
  – Eight 2-MByte DRAM macros
    • Single bank per macro, 2 Kb page size
    • 256b synchronous, non-multiplexed I/O interface
    • 25 ns random access time, 7.5 ns page access time
  – Crossbar interconnect
    • 12.8 GBytes/s peak bandwidth per direction (load/store)
    • Up to 5 independent addresses transmitted per cycle
• Off-chip interface
  – 64b SysAD bus to external chip-set (100 MHz)
  – 2 channel DMA engine

Hardware Exposed to Software
[Figure: Pentium® III die]
• <25% of area for registers and datapaths
• The rest is still useful, but not visible to software
  – Cannot be turned off if not needed