Vector Processing


















Vector Processing • What is a Vector Processor? • History • Vector Processing Applications • Vector Programming Example, Instructions and Instruction Format • Advantages & Disadvantages • Pipelining: Example & Speedup

What is a Vector Processor? • Also called an Array Processor. • Runs multiple mathematical operations on multiple data elements simultaneously. • Common in supercomputers of the 1970s, 1980s and 1990s. • Today most CPU designs contain at least some vector processing instructions, typically referred to as SIMD.

Vector Processor (computer) • Able to process vectors, and related data structures such as matrices and multi-dimensional arrays, much faster than conventional computers. • Vector processors may also be pipelined. • Typically operates on a few vector elements per clock cycle in a pipeline.

History • (1962) University of Illinois Illiac IV, completed in 1972 with 64 ALUs, 100-150 MFlops (massively parallel computer) • (1973) TI’s Advanced Scientific Computer (ASC), 20-80 MFlops • (1975) Cray-1, first to have vector registers instead of keeping data in memory (8 registers, each holding 64 64-bit words) • Cray-1 had separate pipelines for different instruction types, allowing vector chaining. 80-240 MFlops

Vector Processing Applications • Problems that can be efficiently formulated in terms of vectors – Long-range weather forecasting – Petroleum explorations – Seismic data analysis – Medical diagnosis – Aerodynamics and space flight simulations – Artificial intelligence and expert systems – Mapping the human genome – Image processing

VECTOR PROGRAMMING • Problem: Add two arrays of size 100 • DO 20 I = 1, 100 • 20 C(I) = B(I) + A(I) • Conventional computer • Initialize I = 1 • 20 Read A(I) • Read B(I) • Store C(I) = A(I) + B(I) • Increment I = I + 1 • If I ≤ 100 goto 20 • Vector computer • C(1:100) = A(1:100) + B(1:100)
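The contrast between the two machines can be sketched in Python. This is an illustration only: Python has no vector hardware, so `vector_add` is a hypothetical stand-in that models the programmer's view of a single vector instruction, and both function names are assumptions not taken from the slides.

```python
def scalar_add(a, b):
    """Conventional machine: one element per iteration, with an explicit
    index update and loop test, mirroring the DO 20 loop."""
    n = len(a)
    c = [0] * n
    i = 0                      # Python lists are 0-based; the Fortran arrays start at 1
    while i < n:               # loop test executed once per element
        c[i] = a[i] + b[i]     # read A(I), read B(I), store C(I)
        i += 1                 # increment the index
    return c


def vector_add(a, b):
    """Vector machine (modelled): the whole operation is stated once,
    like C(1:100) = A(1:100) + B(1:100), with no visible loop control."""
    return [x + y for x, y in zip(a, b)]


a = list(range(1, 101))        # 100 elements, as in the slide's problem
b = [10 * x for x in a]
assert scalar_add(a, b) == vector_add(a, b)
```

Both functions compute the same result; the difference the slide points at is that the scalar version fetches and executes loop-control instructions for every element, while the vector version expresses the work as one operation.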

VECTOR INSTRUCTIONS
Type  Mnemonic  Description            (I = 1, ..., n)
f1    VSQR      Vector square root     B(I) ← SQR(A(I))
f1    VSIN      Vector sine            B(I) ← sin(A(I))
f1    VCOM      Vector complement      A(I) ← NOT A(I)
f2    VSUM      Vector summation       S ← Σ A(I)
f2    VMAX      Vector maximum         S ← max{A(I)}
f3    VADD      Vector add             C(I) ← A(I) + B(I)
f3    VMPY      Vector multiply        C(I) ← A(I) * B(I)
f3    VAND      Vector AND             C(I) ← A(I) . B(I)
f3    VLAR      Vector larger          C(I) ← max(A(I), B(I))
f3    VTGE      Vector test >          C(I) ← 0 if A(I) < B(I), C(I) ← 1 if A(I) > B(I)
f4    SADD      Vector-scalar add      B(I) ← S + A(I)
f4    SDIV      Vector-scalar divide   B(I) ← A(I) / S
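A few of these instructions can be modelled as plain Python functions to make the four types concrete. This is a sketch under assumptions: the function names simply reuse the mnemonics, the list-based implementations are illustrative rather than how any real machine executes them, and reading f1 to f4 as "vector to vector", "vector to scalar", "two vectors to vector" and "vector and scalar to vector" is an interpretation of the descriptions in the table.

```python
import math


def vsqr(a):
    """f1 (one vector in, one vector out): B(I) <- SQR(A(I))."""
    return [math.sqrt(x) for x in a]


def vsum(a):
    """f2 (one vector in, one scalar out): S <- sum of A(I)."""
    return sum(a)


def vmax(a):
    """f2: S <- max{A(I)}."""
    return max(a)


def vadd(a, b):
    """f3 (two vectors in, one vector out): C(I) <- A(I) + B(I)."""
    return [x + y for x, y in zip(a, b)]


def vlar(a, b):
    """f3: C(I) <- max(A(I), B(I))."""
    return [max(x, y) for x, y in zip(a, b)]


def sadd(s, a):
    """f4 (scalar and vector in, vector out): B(I) <- S + A(I)."""
    return [s + x for x in a]


def sdiv(a, s):
    """f4: B(I) <- A(I) / S."""
    return [x / s for x in a]
```

For example, `vadd([1, 2, 3], [10, 20, 30])` gives `[11, 22, 33]`, while `vsum([1, 2, 3])` reduces the vector to the scalar `6`.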

VECTOR INSTRUCTION FORMAT

Advantages • Each result is independent of previous results, allowing deep pipelines and high clock rates. • A single vector instruction performs a great deal of work, meaning fewer instruction fetches and fewer branches (and in turn fewer mispredictions). • Vector instructions access memory a block at a time, which allows memory latency to be amortized over many elements. • Vector instructions access memory with known patterns, which allows multiple memory banks to supply operands simultaneously. • Fewer memory accesses mean faster processing.

Disadvantages • Difficulties implementing precise exceptions • High price of on-chip vector memory systems • Increased code complexity

PIPELINING • A technique of decomposing a sequential process into sub-operations, with each sub-operation being executed in a special dedicated segment that operates concurrently with all other segments. • Example – Ai * Bi + Ci for i = 1, 2, 3, ..., 7 • Implementation – R1 ← Ai, R2 ← Bi (load Ai and Bi) – R3 ← R1 * R2, R4 ← Ci (multiply and load Ci) – R5 ← R3 + R4 (add)

PIPELINING with three Stages S1, S2, S3 for performing Ai * Bi + Ci

OPERATIONS IN EACH PIPELINE STAGE
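The three-segment pipeline for Ai * Bi + Ci can be simulated clock by clock. The following is a minimal sketch, assuming all registers R1 to R5 latch together on each clock pulse; the function name `run_pipeline` and the use of `None` for an empty register are illustrative choices, not part of the slides.

```python
def run_pipeline(a, b, c):
    """Simulate the three-segment pipeline computing a[i] * b[i] + c[i].

    Returns the list of results in order and the number of clock pulses used.
    """
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None          # None marks an empty register
    results = []
    clocks = n + 2                         # k + n - 1 pulses with k = 3 segments
    for clock in range(1, clocks + 1):
        # Every segment computes its next register values from the
        # current ones, so all three segments work concurrently.
        next_r5 = r3 + r4 if r3 is not None else None          # segment 3: add
        if r1 is not None:                                     # segment 2: multiply, load Ci
            next_r3, next_r4 = r1 * r2, c[clock - 2]
        else:
            next_r3 = next_r4 = None
        if clock <= n:                                         # segment 1: load Ai and Bi
            next_r1, next_r2 = a[clock - 1], b[clock - 1]
        else:
            next_r1 = next_r2 = None
        r1, r2, r3, r4, r5 = next_r1, next_r2, next_r3, next_r4, next_r5
        if r5 is not None:
            results.append(r5)             # one finished result leaves the pipe
    return results, clocks
```

With seven input triples, as in the example, the first result appears in R5 on the third clock pulse and one more result follows on every pulse after that, for 3 + 7 - 1 = 9 pulses in total.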

GENERAL PIPELINE

PIPELINE SPEEDUP • n: Number of tasks to be performed • Conventional Machine (Non-Pipelined) – tn: Clock cycle (time to complete one task) – t1: Time required to complete the n tasks – t1 = n * tn • Pipelined Machine (k stages) – tp: Clock cycle (time to complete each sub-operation) – tk: Time required to complete the n tasks – tk = (k + n - 1) * tp • Speedup – Sk: Speedup – Sk = (n * tn) / ((k + n - 1) * tp)

PIPELINE SPEEDUP • Example – 4-stage pipeline – Sub-operation in each stage: tp = 20 ns – 100 tasks to be executed – 1 task in the non-pipelined system: tn = 4 * 20 = 80 ns • Pipelined System – (k + n - 1) * tp = (4 + 99) * 20 = 2060 ns • Non-Pipelined System – n * k * tp = 100 * 80 = 8000 ns • Speedup – Sk = 8000 / 2060 = 3.88
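The speedup formula and the worked example can be checked with a few lines of Python. This is a small sketch: the function name `pipeline_speedup` is an assumption, and defaulting tn to k * tp follows the example's choice of 4 * 20 = 80 ns for one non-pipelined task.

```python
def pipeline_speedup(n, k, tp, tn=None):
    """Speedup Sk of a k-stage pipeline over a non-pipelined machine.

    n  : number of tasks
    k  : number of pipeline stages
    tp : pipeline clock cycle (time per sub-operation)
    tn : time per task without pipelining; assumed to be k * tp if omitted
    """
    if tn is None:
        tn = k * tp
    t1 = n * tn                  # non-pipelined time for n tasks
    tk = (k + n - 1) * tp        # pipelined time for n tasks
    return t1 / tk


s = pipeline_speedup(n=100, k=4, tp=20)
print(round(s, 2))               # 8000 / 2060, prints 3.88
```

As n grows, the k - 1 cycles needed to fill the pipeline matter less and the speedup approaches tn / tp, which equals k = 4 under the assumption tn = k * tp; with only 100 tasks the example reaches 3.88.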

Levels of Parallel Processing • Job or Program level • Task or Procedure level • Inter-Instruction level • Intra-Instruction level