COMP 4211 Advanced Computer Architecture: Vector Processor (25/10/2021)

COMP 4211: Advanced Computer Architecture
Vector Processor
25/10/2021 COMP 4211 - Advanced Computer Architecture Yian Sun

Overview
• Introduction: What and Why?
• Basic Vector Architecture
• Example: MIPS vs. VMIPS
• Parallelism using convoys
• Vector Memory Systems
• Real World Issues:
  - Vector Length
  - Stride
• Introduction to the Cray-1

Introduction: What is a Vector Processor?
• Consider an operation D = A + C.
• A vector processor provides high-level operations that work on vectors.
• A typical instruction might add two 64-element FP vectors.
• Commercialized long before ILP machines.
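To make the D = A + C example concrete, here is a minimal C sketch contrasting the scalar view (one add per loop iteration) with the vector view (one high-level operation on 64 elements). The names `add_scalar_loop`, `add_vector64`, and `VLEN` are hypothetical; C has no vector instruction, so the second function only emulates what the hardware would do for a single instruction.

```c
#include <stddef.h>

#define VLEN 64  /* assumed vector length: 64 elements, as in the example above */

/* Scalar view: one add (plus loop overhead) is fetched and decoded per element. */
void add_scalar_loop(const double *a, const double *c, double *d, size_t n)
{
    for (size_t i = 0; i < n; i++)
        d[i] = a[i] + c[i];
}

/* Vector view: conceptually ONE instruction that produces all VLEN results.
 * The call below only emulates in C what the vector unit does in hardware. */
void add_vector64(const double a[VLEN], const double c[VLEN], double d[VLEN])
{
    add_scalar_loop(a, c, d, VLEN);
}
```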

Introduction cont.: Why Vector Processors?
• A single vector instruction is equivalent to executing an entire loop.
  - Reduces instruction fetch and decode bandwidth.
• Each instruction guarantees that every result is independent of the other results in the same vector.
  - No data hazard check is needed within an instruction.
  - It can be executed using an array of parallel functional units or a deep pipeline.

Introduction cont.
• Hardware need only check for data hazards between two vector instructions, once per vector operand.
  - Each check covers a whole vector of operations.
• Memory is accessed for an entire vector, not a single word.
  - Reduced latency: the memory latency is paid once per vector rather than once per word.
• Multiple vector instructions can be in progress at once.
  - Further parallelism.

Basic Vector Architecture
• Ordinary scalar pipeline unit + vector unit.
• Two types:
  - Vector-register: all operations except load and store work on registers.
  - Memory-memory: all operations are memory to memory.
• We concentrate on vector-register machines, particularly the VMIPS architecture.

BVA: the components
• Vector registers
  - Fixed length; each holds a single vector.
  - In VMIPS:
    - 2 read ports and 1 write port.
    - 8 vector registers, 64 elements each.
• Vector functional units
  - Fully pipelined; can start a new operation every cycle.
  - Might contain a scalar functional unit.
• Control unit
  - Detects structural and data hazards.

BVA: the components cont.
• Vector load-store unit
  - Loads and stores vectors to and from memory.
• Special-purpose registers
  - Vector-length register
  - Vector-mask register
• Set of scalar registers
  - Provide data as input to the vector functional units.
  - Compute addresses to pass to the load-store unit.
  - In VMIPS: 32 general-purpose and 32 floating-point registers.
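The register state listed above can be pictured as a C structure. This is only an illustrative model: the type `VmipsState`, its field names, and the function `vmips_add` are hypothetical, and the sketch assumes that a vector operation touches only the first `vlr` elements whose mask bit is set.

```c
#include <stdint.h>

enum { NUM_VREGS = 8, MVL = 64, NUM_GPRS = 32, NUM_FPRS = 32 };

typedef struct {
    double   v[NUM_VREGS][MVL]; /* 8 vector registers, 64 elements each */
    int      vlr;               /* vector-length register (0..MVL) */
    uint64_t vm;                /* vector-mask register, one bit per element */
    int64_t  r[NUM_GPRS];       /* 32 general-purpose scalar registers */
    double   f[NUM_FPRS];       /* 32 floating-point scalar registers */
} VmipsState;

/* Element-wise add of two vector registers into a third.
 * Assumption: only elements below vlr with their mask bit set are written. */
void vmips_add(VmipsState *s, int vd, int vs1, int vs2)
{
    for (int i = 0; i < s->vlr; i++)
        if ((s->vm >> i) & 1u)
            s->v[vd][i] = s->v[vs1][i] + s->v[vs2][i];
}
```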

Example: MIPS vs. VMIPS
• Greatly reduced instruction bandwidth.
  - Six instructions instead of 600.
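The comparison can be illustrated with the DAXPY kernel (y = a*x + y) that this deck uses later for strip mining. The sketch below is an assumption about the kind of loop being compared, not the listing from the original slide. The cost model is also an assumption: a scalar loop with 2 set-up instructions and 9 instructions per iteration, which for 64 elements gives 578 instructions, i.e. on the order of 600. A vector version needs roughly one scalar load, two vector loads, one vector-scalar multiply, one vector add, and one vector store.

```c
#include <stddef.h>

/* DAXPY: y = a*x + y in double precision. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Assumed scalar cost model: `setup` instructions executed once,
 * plus `per_iter` instructions for every element. */
unsigned long scalar_instruction_count(unsigned long n,
                                       unsigned long setup,
                                       unsigned long per_iter)
{
    return setup + n * per_iter;
}
```

Under these assumptions, `scalar_instruction_count(64, 2, 9)` evaluates to 578, against six instructions for the vector version.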

Parallelism using convoys
• Convoys
  - A convoy is a set of instructions that could begin execution together.
  - Consider this sequence of code. [Figure: code sequence]
• Using convoys results in: [Figure: resulting convoys]
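One way to see how convoys form is a small greedy partitioner. This is a sketch under stated assumptions, not the method from the slide: instructions are taken in program order, an instruction starts a new convoy if its functional unit is already used in the current convoy (structural hazard) or if it reads a vector register written in the current convoy (no chaining), and other hazards are ignored. The types `Unit` and `VecInstr` and the function `form_convoys` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { UNIT_LOAD_STORE, UNIT_MULTIPLY, UNIT_ADD, UNIT_COUNT } Unit;

typedef struct {
    Unit unit;    /* functional unit the instruction occupies */
    int  dst;     /* vector register written, or -1 if none */
    int  src[2];  /* vector registers read, -1 for unused slots */
} VecInstr;

/* Greedy in-order partitioning into convoys. Stores the convoy index of
 * each instruction in convoy_of[] and returns the number of convoys. */
size_t form_convoys(const VecInstr *prog, size_t n, size_t *convoy_of)
{
    size_t convoys = 0;
    size_t start = 0;                    /* first instruction of current convoy */
    bool unit_busy[UNIT_COUNT] = { false };

    for (size_t i = 0; i < n; i++) {
        /* The first instruction always opens a convoy; otherwise check the
         * structural hazard on the functional unit. */
        bool conflict = (convoys == 0) || unit_busy[prog[i].unit];

        /* Read-after-write dependence on a result produced in this convoy. */
        for (size_t j = start; j < i && !conflict; j++)
            for (int k = 0; k < 2; k++)
                if (prog[j].dst >= 0 && prog[j].dst == prog[i].src[k])
                    conflict = true;

        if (conflict) {                  /* open a new convoy */
            convoys++;
            start = i;
            for (int u = 0; u < UNIT_COUNT; u++)
                unit_busy[u] = false;
        }
        unit_busy[prog[i].unit] = true;
        convoy_of[i] = convoys - 1;
    }
    return convoys;
}
```

For an assumed DAXPY-style sequence (load V1; multiply V1 by a scalar into V2; load V3; add V2 and V3 into V4; store V4), this yields four convoys: the first load, the multiply together with the second load, the add, and the store.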

Vector Memory Systems
• Problem
  - The memory system needs to be able to produce and accept large amounts of data.
  - But how do we achieve this when access time is poor?
• Solution
  - Create multiple memory banks.
    - Useful for fragmented (non-sequential) accesses.
    - Support multiple loads per clock cycle.
    - Allow multiple processors to share the memory system.
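A short sketch of how banking hides a slow access time. It assumes low-order interleaving (consecutive words map to consecutive banks) and a simple sizing rule: to sustain a given number of accesses per cycle when each bank stays busy for several cycles, at least their product in banks is needed. The function names `bank_of` and `min_banks` are hypothetical.

```c
#include <stdint.h>

/* Low-order interleaving: consecutive word addresses fall in consecutive banks. */
unsigned bank_of(uint64_t word_addr, unsigned num_banks)
{
    return (unsigned)(word_addr % num_banks);
}

/* Assumed sizing rule: each access ties up a bank for bank_busy_cycles, so
 * sustaining accesses_per_cycle needs at least this many independent banks. */
unsigned min_banks(unsigned accesses_per_cycle, unsigned bank_busy_cycles)
{
    return accesses_per_cycle * bank_busy_cycles;
}
```

For example, one access per cycle with a 6-cycle bank busy time needs at least 6 banks under this rule.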

Vector Memory System Example
[Figure: vector memory system example]

Real World Issues (1): Vector-Length Control
• Problem
  - How do we support operations where the length is unknown at compile time, or is not equal to the hardware vector length?
• Solution
  - Provide a vector-length register (VLR). This alone solves the problem only if the real length is no greater than the Maximum Vector Length (MVL).
  - For longer vectors, use a technique called strip mining.

Strip mining
• Generate code in which each vector operation is done for a size no greater than MVL.
• Create 2 loops:
  - One that handles any number of iterations that is a multiple of MVL.
  - Another that handles the remaining iterations (fewer than MVL).
• The code becomes vectorizable.
• Careful handling of the VLR is needed.

Example: Strip Mining
• For the DAXPY loop, we can generate C code as below.

    low = 1;                                      /* assume start element at 1 */
    vL = n % mvL;                                 /* find the odd-size piece */
    for (j = 0; j <= n / mvL; j++) {              /* outer loop */
        for (i = low; i <= low + vL - 1; i++) {   /* inner loop: runs for length vL */
            y[i] = a * x[i] + y[i];               /* main operation */
        }
        low = low + vL;                           /* find start of next vector */
        vL = mvL;                                 /* reset length to max */
    }
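A self-contained, 0-indexed variant of the same idea may be easier to compile and test. It is a sketch only: `MVL` is fixed at 64 as an assumption, and the helper `daxpy_strip` stands in for a single vector operation of length `vl`, which on real hardware would be a few vector instructions executed with the VLR set to `vl`.

```c
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length */

/* Stand-in for one vector DAXPY of length vl, where vl <= MVL. */
static void daxpy_strip(size_t vl, double a, const double *x, double *y)
{
    for (size_t i = 0; i < vl; i++)
        y[i] = a * x[i] + y[i];
}

/* Strip-mined DAXPY: the first strip handles the odd-size piece (n % MVL,
 * possibly empty); every later strip has exactly MVL elements. */
void daxpy_strip_mined(size_t n, double a, const double *x, double *y)
{
    size_t low = 0;        /* index of the first element of the current strip */
    size_t vl = n % MVL;   /* length of the odd-size piece */

    for (size_t j = 0; j <= n / MVL; j++) {
        daxpy_strip(vl, a, x + low, y + low);
        low += vl;         /* start of the next strip */
        vl = MVL;          /* all remaining strips are full length */
    }
}
```

With n = 130, the strips have lengths 2, 64, and 64; with n = 64, the first strip is empty and a single full strip follows.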

Real World Issues (2): Vector Stride
• Problem
  - Adjacent elements of a vector may not be sequential in memory, so the set-up time could be enormous.
  - E.g. matrix multiplication.
• Solution
  - The distance separating elements is called the stride.
  - Store the stride in a register, so only a single vector load or store is required.
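A strided load can be emulated in a few lines of C. This is an illustrative sketch: the function `load_strided` is hypothetical, and the stride is expressed in elements here, whereas hardware would typically hold a byte stride in a register.

```c
#include <stddef.h>

/* Emulates a strided vector load: element i of the vector register comes
 * from base[i * stride]. The stride is measured in elements (assumption). */
void load_strided(double *vreg, const double *base, size_t stride, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        vreg[i] = base[i * stride];
}
```

For an N x N matrix stored row by row in a flat array `b`, column j is fetched with `load_strided(v, b + j, N, N)`: the elements of a column are N words apart, which is exactly the non-unit stride that arises in matrix multiplication.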

Vector Stride
• Access time
  - Vector processors use interleaved memory banks; non-unit strides can cause stalls.
  - A strided access stream returns to the same bank every No. of banks / GCD(Stride, No. of banks) accesses, so a stall will occur if No. of banks / GCD(Stride, No. of banks) < Bank busy time.
  - No conflicts occur if the stride and the number of banks are relatively prime and there are enough banks.
  - Increasing the number of banks beyond the minimum reduces the chance of stalls.
  - Most vector supercomputers have at least 64 banks, with some having up to 1024.
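The stall condition can be checked numerically. The sketch below rests on two assumptions: a strided stream revisits the same bank every num_banks / gcd(stride, num_banks) accesses, and the memory system issues at most one access per cycle, waiting whenever the target bank is still busy. The function names are hypothetical, and `issue_cycles` supports at most 1024 banks.

```c
#include <stdbool.h>

static unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* True if a strided stream returns to a bank before that bank is free again. */
bool stride_causes_stall(unsigned num_banks, unsigned stride, unsigned busy_cycles)
{
    return num_banks / gcd(stride, num_banks) < busy_cycles;
}

/* Cycles needed to issue n strided accesses, at most one per cycle,
 * stalling while the target bank is busy. Requires 1 <= num_banks <= 1024. */
unsigned long issue_cycles(unsigned num_banks, unsigned stride,
                           unsigned busy_cycles, unsigned n)
{
    unsigned long free_at[1024] = { 0 };  /* cycle at which each bank is free */
    unsigned long t = 0;                  /* current issue cycle */

    for (unsigned i = 0; i < n; i++) {
        unsigned b = (unsigned)(((unsigned long)i * stride) % num_banks);
        if (free_at[b] > t)
            t = free_at[b];               /* stall until the bank is ready */
        free_at[b] = t + busy_cycles;
        t += 1;                           /* next access issues a cycle later at best */
    }
    return t;
}
```

With 8 banks, a bank busy time of 6 cycles, and 64 elements, a stride of 1 issues in 64 cycles with no stall, while a stride of 32 sends every access to bank 0 and takes 379 cycles.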

Example: Vector Stride
[Figure: vector stride example]

Cray-1
• The best-known vector processor, released in 1976.
• Fastest supercomputer of the late 1970s.
• 32-bit instruction length.
• The architecture consists of 3 sections:
  - The main memory
  - The scalar subsystem
  - The vector subsystem

Cray-1: Main Memory
• 16 banks, each consisting of 72 64K, 64-bit words.
• Cycle time of 50 ns, which is equivalent to 4 clock cycles.
• Can transfer 1-4 words per clock period, depending on the register or buffer.
• 4 words per clock cycle for the instruction buffer, resulting in a bandwidth of 1280 MB/sec.

Cray-1: Scalar subsystem
• Consists of:
  - Instruction buffers
  - 2 scalar register files
  - 2 address functional registers
  - Scalar functional unit
  - Shared floating-point functional unit

Cray-1: Vector subsystem
• Consists of:
  - 8 vector registers
  - A set of 3 vector functional units
  - A shared set of 3 floating-point functional units

Cray-1: Instruction Format
• Instruction formats:
  - Binary arithmetic and logic instructions (a)
  - Unary shift and mask instructions (b)
  - Memory read and store instructions (c)
• Branch instructions use the lower 24 bits for the branch address.

References
• Computer Architecture: A Quantitative Approach, Hennessy and Patterson, Appendix G, sections 1-3.
• Computer Architecture: A Modern Synthesis, Subrata Dasgupta, Chapter 7, pp. 246-249.
• http://www.crhc.uiuc.edu/IMPACT/ece412/public_html/Notes/412_lec20/
• The Cray-1 Computer System, Richard M. Russell, Cray Research Inc.
• http://csep1.phy.ornl.gov/ca/node24.html