Chapter 14 Superscalar Processors

What is Superscalar?
A superscalar machine executes multiple independent instructions in parallel.
• “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.
• Equally applicable to RISC and CISC, but more straightforward in RISC machines.
• The order of execution is usually shaped by the compiler, although hardware with out-of-order issue can reorder instructions at run time.

Example Superscalar Organization
• 2 integer ALU pipelines
• 2 FP ALU pipelines
• 1 memory pipeline (?)

Superpipelined Machines
Superpipelined machines overlap pipe stages: they rely on a stage being able to begin its operation before the previous operation is complete.
Superscalar machines have multiple instruction pipelines: they process multiple instructions in parallel.

Superscalar v Superpipeline

Limitations of Superscalar
Dependent upon:
• Instruction-level parallelism
• Compiler-based optimization
• Hardware support
Limited by:
  - True data dependency
  - Procedural dependency
  - Resource conflicts
  - Output dependency or antidependency (another form of data dependency)

True Data Dependency
ADD r1, r2     (r1 + r2 → r1)
MOVE r3, r1    (r1 → r3)
• Can fetch and decode the second instruction in parallel with the first
• Can NOT execute the second instruction until the first is finished
Compare with the following:
LOAD r1, X     (X → r1)
MOVE r3, r1    (r1 → r3)
• What additional problem do we have here?

Procedural Dependency
• Can’t execute instructions after a branch in parallel with instructions before the branch. Why?
Note: if instruction length is not fixed, instructions also have to be decoded to find out how many fetches are needed.

Resource Conflict
• Two or more instructions require access to the same resource at the same time
  - e.g. two arithmetic instructions
• Solution: can possibly duplicate resources
  - e.g. have two arithmetic units

Antidependency
ADD R4, R3, 1    (R3 + 1 → R4)
ADD R3, R5, 1    (R5 + 1 → R3)
• Cannot complete the second instruction before the first has read R3. Why?

True Data Dependency & Antidependency
• True data dependency: the result of the 1st instruction is used in the 2nd instruction (the 2nd cannot execute until the 1st has produced its result)
• Antidependency: out-of-order completion of the 2nd instruction can write over a value still to be used in the 1st instruction (the 1st must read its operand before the 2nd changes the value)
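The distinction between the dependency kinds can be sketched as a comparison of the registers each instruction reads and writes. This is a minimal illustration, not a description of any real decoder: the `Instr` record and the `hazards` function are hypothetical names, and each instruction is assumed to have at most one destination register.

```python
from typing import NamedTuple, Optional, Set


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, a set of source registers."""
    text: str
    dest: Optional[str]
    srcs: frozenset


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Classify the register dependencies of `second` on the earlier `first`."""
    found = set()
    # True dependency (read after write): second reads what first writes.
    if first.dest is not None and first.dest in second.srcs:
        found.add("true (RAW)")
    # Antidependency (write after read): second overwrites what first reads.
    if second.dest is not None and second.dest in first.srcs:
        found.add("anti (WAR)")
    # Output dependency (write after write): both write the same register.
    if first.dest is not None and first.dest == second.dest:
        found.add("output (WAW)")
    return found


# The antidependency pair from the slides.
i1 = Instr("ADD R4, R3, 1", "R4", frozenset({"R3"}))
i2 = Instr("ADD R3, R5, 1", "R3", frozenset({"R5"}))
print(hazards(i1, i2))  # {'anti (WAR)'}
```

Swapping in the ADD r1, r2 / MOVE r3, r1 pair would report a true dependency instead, since the MOVE reads the register the ADD writes.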

Effect of Dependencies

Instruction-level Parallelism
• Consider:
LOAD R1, R2
ADD R3, 1
ADD R4, R2
These can be handled in parallel.
• Consider:
ADD R3, 1
ADD R4, R3
STO (R4), R0
These cannot. Why?
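One way to answer the question is to list, for each sequence, which instructions read a register that an earlier instruction writes. The sketch below does that under stated assumptions: the slides do not spell out operand semantics, so it assumes a two-operand form in which the first operand of ADD is both a source and the destination, LOAD R1, R2 reads R2 and writes R1, and STO (R4), R0 reads R4 and R0 and writes no register. The function name `true_dependencies` is illustrative.

```python
def true_dependencies(program):
    """Return (producer_index, consumer_index, register) for every true dependency.

    `program` is a list of (dest, srcs) pairs in program order, where dest is a
    register name or None and srcs is a set of register names.
    """
    deps = []
    last_writer = {}  # register -> index of the most recent instruction writing it
    for idx, (dest, srcs) in enumerate(program):
        for reg in sorted(srcs):
            if reg in last_writer:
                deps.append((last_writer[reg], idx, reg))
        if dest is not None:
            last_writer[dest] = idx
    return deps


# LOAD R1, R2 / ADD R3, 1 / ADD R4, R2  (assumed semantics, see above)
independent = [("R1", {"R2"}), ("R3", {"R3"}), ("R4", {"R4", "R2"})]
# ADD R3, 1 / ADD R4, R3 / STO (R4), R0
chained = [("R3", {"R3"}), ("R4", {"R4", "R3"}), (None, {"R4", "R0"})]

print(true_dependencies(independent))  # []
print(true_dependencies(chained))      # [(0, 1, 'R3'), (1, 2, 'R4')]
```

The first sequence has no producer/consumer pairs, so all three instructions could be issued together; the second forms a chain in which each instruction waits for the one before it.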

Instruction Issue Policies
• Order in which instructions are fetched
• Order in which instructions are executed
• Order in which instructions update registers and memory values
Note: there is also the issue of instruction completion policy.

In-Order Issue -- In-Order Completion
Issue instructions in the order they occur:
• Not very efficient
• Instructions must stall if necessary

In-Order Issue -- In-Order Completion (Example)
Assume:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit

In-Order Issue -- Out-of-Order Completion (Example)
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit
How does this affect interrupts?

Out-of-Order Issue -- Out-of-Order Completion
• Decouple the decode pipeline from the execution pipeline
• Can continue to fetch and decode until this pipeline is full
• When a functional unit becomes available, an instruction can be executed
• Since instructions have been decoded, the processor can look ahead

Out-of-Order Issue -- Out-of-Order Completion (Example)
Again:
• I1 requires 2 cycles to execute
• I3 & I4 conflict for the same functional unit
• I5 depends upon the value produced by I4
• I5 & I6 conflict for a functional unit
Note: I5 depends upon I4, but I6 does not.

Register Renaming
• Output dependencies and antidependencies occur because register contents may not reflect the correct ordering from the program
  - Can result in a pipeline stall
• One solution: allocate registers dynamically (renaming registers)

Register Renaming Example
R3b := R3a + R5a    (I1)
R4b := R3b + 1      (I2)
R3c := R5a + 1      (I3)
R7b := R3c + R4b    (I4)
• Without subscript: refers to the logical register in the instruction
• With subscript: the hardware register allocated (R3a, R3b, R3c)
Note: R3c avoids the antidependency on I2 and the output dependency on I1.
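The subscripting scheme in this example can be reproduced with a small renaming pass. The sketch below is only an illustration of the idea: the `rename` function is a hypothetical name, every logical register is assumed to start at version "a", and each write simply allocates the next letter (real hardware draws from a finite pool of physical registers).

```python
def rename(program):
    """Rename registers so that every write gets a fresh version letter.

    `program` is a list of (dest, srcs) pairs in program order. Operands that
    do not start with "R" are treated as immediates and left untouched.
    """
    version = {}  # logical register -> current version index (0 means 'a')

    def current(reg):
        return f"{reg}{chr(ord('a') + version.setdefault(reg, 0))}"

    renamed = []
    for dest, srcs in program:
        # Sources are read first, so they see the versions written so far.
        new_srcs = [current(s) if s.startswith("R") else s for s in srcs]
        # The destination then gets a brand-new version.
        version[dest] = version.get(dest, 0) + 1
        renamed.append((current(dest), new_srcs))
    return renamed


# The four instructions of the example, before renaming.
program = [
    ("R3", ["R3", "R5"]),  # I1
    ("R4", ["R3", "1"]),   # I2
    ("R3", ["R5", "1"]),   # I3
    ("R7", ["R3", "R4"]),  # I4
]

for dest, srcs in rename(program):
    print(f"{dest} := {' + '.join(srcs)}")
# R3b := R3a + R5a
# R4b := R3b + 1
# R3c := R5a + 1
# R7b := R3c + R4b
```

Because I3 writes R3c rather than reusing R3b, it no longer has to wait for I2 to read R3b or for I1 to finish writing it.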

Machine Parallelism Support
• Duplication of resources
• Out-of-order issue
• Renaming
• Windowing

Speedups of Machine Organizations (Without Procedural Dependencies)
• Not worth duplicating functional units without register renaming
• Need an instruction window large enough (more than 8, probably not more than 32)

Branch Prediction in Superscalar Machines
• Delayed branch is not used much. Why? Multiple instructions need to execute in the delay slot, which leads to much complexity in recovery.
• Branch prediction may be used; branch history MAY still be useful.
• Are there any alternatives?
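As one illustration of how branch history can drive prediction, the sketch below models a two-bit saturating counter per branch address. The slides do not name a specific scheme, so this choice is an assumption; the class and function names, the initial counter state, and the example address are all hypothetical.

```python
class TwoBitPredictor:
    """Per-branch two-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter state in 0..3

    def predict(self, addr):
        # Assumed initial state: 1 (weakly not taken).
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        state = self.counters.get(addr, 1)
        self.counters[addr] = min(3, state + 1) if taken else max(0, state - 1)


def correct_predictions(outcomes, addr=0x40):
    """Count how many outcomes of a single branch the predictor gets right."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict(addr) == taken:
            correct += 1
        predictor.update(addr, taken)
    return correct


# A loop branch taken 9 times and then falling through, executed 3 times over.
loop = ([True] * 9 + [False]) * 3
print(correct_predictions(loop), "of", len(loop))  # 26 of 30
```

After the first pass through the loop, the single not-taken outcome at loop exit only moves the counter from 3 to 2, so the predictor still predicts "taken" when the loop is entered again.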

Superscalar Execution

Committing or Retiring Instructions
Results need to be put into order (commit or retire).
• Results sometimes must be held in temporary storage until it is certain they can be placed in “permanent” storage.
• Temporary storage requires regular clean-up, which is overhead.
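The idea of holding results in temporary storage and committing them in program order can be sketched with a simple queue, often called a reorder buffer. This is an assumed, simplified model rather than any particular processor's design; the `ReorderBuffer` class and its method names are hypothetical.

```python
from collections import deque


class ReorderBuffer:
    """Holds results in program order until they can be committed."""

    def __init__(self):
        self.entries = deque()  # temporary storage, oldest instruction first
        self.registers = {}     # "permanent" (architectural) register state

    def issue(self, name, dest):
        """Reserve a slot in program order for an issued instruction."""
        entry = {"name": name, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Record a result; completion may happen in any order."""
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire finished instructions from the head only, preserving program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            committed.append(entry["name"])
        return committed


rob = ReorderBuffer()
i1 = rob.issue("I1", "R1")
i2 = rob.issue("I2", "R2")

rob.complete(i2, 7)      # I2 finishes first
print(rob.commit())      # [] because I1 is still pending
rob.complete(i1, 3)
print(rob.commit())      # ['I1', 'I2']
print(rob.registers)     # {'R1': 3, 'R2': 7}
```

The clean-up overhead mentioned above shows up here as the work of scanning and draining the queue on every commit.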

Superscalar Hardware Support
• Facilities to simultaneously fetch multiple instructions
• Logic to determine true dependencies involving register values, and mechanisms to communicate these values
• Mechanisms to initiate multiple instructions in parallel
• Resources for parallel execution of multiple instructions
• Mechanisms for committing process state in correct order

Conclusions
What are the relative benefits of:
• Superscalar
• Superpipelining

Superscalar CISC Machines
• Can superscalar design be applied to CISC machines?

javax.comm
Basically, javax.comm is no longer supported on Windows (hasn't been since 2002), so we switched to RxTx, which is nearly identical.
According to http://en.wikibooks.org/wiki/Serial_Programming:Serial_Java#RxTx, "Converting a JavaComm Application to RxTx": "all that is required to convert a javacomm application to an RxTx application is simply changing the import statement: import javax.comm.*; to import gnu.io.*; Everything else in the program can remain exactly the same because the package gnu.io apparently encompasses the same classes as javax.comm. Indeed, rxtx version of SimpleWrite is identical to the javacomm version of SimpleWrite except that it imports gnu.io.* rather than javax.comm.*."

Basic Concepts of the IA-64 Architecture
Instruction-level parallelism
  - Explicit in the machine instruction rather than determined at run time by the processor
Long or very long instruction words (LIW/VLIW)
  - Fetch bigger chunks, already “preprocessed”
Branch predication (not the same as branch prediction)
  - Go ahead and fetch and decode instructions, but keep track of them so the decision to “issue” them, or not, can be practically made later
Speculative loading
  - Go ahead and load data so it is ready when needed, and have a practical way to recover if the speculation proved wrong
Software pipelining
  - Allows multiple iterations of a loop to execute in parallel