SUPERSCALAR EXECUTION

two-way superscalar • The DLW-2 has two ALUs, so it’s able to execute two arithmetic instructions in parallel (hence the term two-way superscalar). • These two ALUs share a single register file, a situation that in terms of our file clerk analogy would correspond to the file clerk sharing his personal filing cabinet with a second file clerk.

The superscalar DLW-2 Superscalar processing adds a bit of complexity to the DLW-2’s design, because it needs new circuitry that enables it to reorder the linear instruction stream so that some of the stream’s instructions can execute in parallel. This circuitry has to ensure that it’s “safe” to dispatch two instructions in parallel to the two execution units. The second pipeline stage is named decode/dispatch. This is because attached to the latter part of the decode stage is a bit of dispatch circuitry whose job is to determine whether or not two instructions can be executed in parallel, in other words, on the same clock cycle.

The superscalar DLW-2 • If they can be executed in parallel, the dispatch unit sends one instruction to the first integer ALU and one to the second integer ALU. If they can’t be dispatched in parallel, the dispatch unit sends them in program order to the first of the two ALUs. • There are a few reasons why the dispatcher might decide that two instructions can’t be executed in parallel, and we’ll cover those in the following sections.
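The dispatch decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not list the dispatcher's actual rules, so the check used here (the second instruction must not read or overwrite the first instruction's destination register) is an assumed example of one such reason, and the names `Instr`, `can_dual_issue`, and `dispatch` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    """A hypothetical three-operand arithmetic instruction."""
    op: str    # mnemonic, e.g. "add"
    dest: str  # destination register
    src1: str  # first source register
    src2: str  # second source register


def can_dual_issue(first: Instr, second: Instr) -> bool:
    """Assumed safety rule: the pair may share a clock cycle only if the
    second instruction neither reads nor overwrites the first one's result."""
    reads_result = first.dest in (second.src1, second.src2)
    same_dest = first.dest == second.dest
    return not (reads_result or same_dest)


def dispatch(first: Instr, second: Instr) -> List[List[Tuple[str, Instr]]]:
    """Return one list per clock cycle of (unit, instruction) assignments."""
    if can_dual_issue(first, second):
        # Both instructions go out on the same cycle, one per ALU.
        return [[("ALU1", first), ("ALU2", second)]]
    # Otherwise they go to the first ALU one after the other, in program order.
    return [[("ALU1", first)], [("ALU1", second)]]


independent = dispatch(Instr("add", "C", "A", "B"), Instr("sub", "F", "D", "E"))
dependent = dispatch(Instr("add", "C", "A", "B"), Instr("add", "G", "C", "A"))
print(len(independent), "cycle for the independent pair")
print(len(dependent), "cycles for the dependent pair")
```

In this sketch the independent pair occupies one cycle across both ALUs, while the dependent pair is serialized on the first ALU, matching the two cases on the slide.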

The superscalar DLW-2 • The important thing to remember is that main memory still sees one sequential code stream, one data stream, and one results stream, even though the code and data streams are carved up inside the computer and pushed through the two ALUs in parallel.

The pipeline of the superscalar DLW-2 If the processor is to execute multiple instructions at once, it must be able to fetch and decode multiple instructions at once. A two-way superscalar processor like the DLW-2 can fetch two instructions from memory on each clock cycle, and it can also decode and dispatch two instructions on each clock cycle.

Superscalar Computing and IPC Superscalar execution and pipelining combined

Superscalar execution and pipelining combined • With instructions in multiple write stages on each clock cycle, the superscalar machine can complete multiple instructions per cycle. • Two instructions are added to the Completed Instructions box on each cycle once the pipeline is full. The more ALU pipelines that a processor has operating in parallel, the more instructions it can add to that box on each cycle. Thus superscalar computing allows you to increase a processor’s instructions per clock (IPC) by adding more hardware. There are some practical limits to how many instructions can be executed in parallel, and we’ll discuss those later.
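The effect on IPC can be worked through with a small idealized model. The sketch below assumes a pipeline that never stalls, a four-stage pipeline depth, and function names (`completed_instructions`, `average_ipc`) chosen for illustration; none of these come from the slides, and real machines fall short of this ideal for the practical reasons mentioned above.

```python
def completed_instructions(cycles: int, width: int = 2, depth: int = 4) -> int:
    """Instructions finished after `cycles` clock cycles in an ideal pipeline.

    `width` is the number of parallel pipelines (2 for a two-way machine) and
    `depth` is the number of stages (4 is an assumption). The first results
    leave the write stage at the end of cycle `depth`; after that, `width`
    instructions complete on every cycle.
    """
    return width * max(0, cycles - depth + 1)


def average_ipc(cycles: int, width: int = 2, depth: int = 4) -> float:
    """Average instructions per clock over the first `cycles` cycles."""
    return completed_instructions(cycles, width, depth) / cycles


for n in (4, 10, 1000):
    print(n, completed_instructions(n), round(average_ipc(n), 3))
```

Under these assumptions the average IPC starts below the machine's width while the pipeline fills and approaches 2 for the two-way case as the number of cycles grows.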

Basic Number Formats and Computer Arithmetic Returning to the code/data distinction, we can say that the data stream consists of four types of numbers: scalar integers, scalar floating-point numbers, vector integers, and vector floating-point numbers. (Note that even memory addresses fall into one of these four categories: scalar integers.) The code stream, then, consists of instructions that operate on all four types of numbers.

Number formats and operation types • Arithmetic operations are operations like addition, subtraction, multiplication, and division, all of which can be performed on any type of number. • Logical operations are Boolean operations like AND, OR, NOT, and XOR, along with bit shifts and rotates. Such operations are performed on scalar and vector integers, as well as on the contents of special-purpose registers like the processor status word (PSW).
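The two operation types can be seen side by side on a pair of scalar integers. The sketch below is an illustration only: the 8-bit width, the sample values, and the helper `rotate_left` are assumptions made for the example, not anything specified for the DLW machines.

```python
MASK = 0xFF  # assume 8-bit scalar integers for this example


def rotate_left(value: int, amount: int, bits: int = 8) -> int:
    """Rotate `value` left by `amount` within a `bits`-wide word."""
    amount %= bits
    mask = (1 << bits) - 1
    return ((value << amount) | (value >> (bits - amount))) & mask


a, b = 0b11001010, 0b01010110

# Arithmetic operations (results wrapped to 8 bits).
arithmetic = {
    "add": (a + b) & MASK,
    "sub": (a - b) & MASK,
}

# Logical operations: Boolean operations plus shifts and rotates.
logical = {
    "and": a & b,
    "or": a | b,
    "xor": a ^ b,
    "not": ~a & MASK,
    "shift_left": (a << 1) & MASK,
    "rotate_left": rotate_left(a, 1),
}

for name, result in {**arithmetic, **logical}.items():
    print(f"{name:12s} {result:08b}")
```

Note how the shift discards the bit that falls off the top of the word, while the rotate brings it back in at the bottom.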

Arithmetic Logic Units On early microprocessors, as on the DLW-1 and DLW-2, all integer arithmetic and logical operations were handled by the ALU. Floating-point operations were executed by a companion chip, commonly called an arithmetic coprocessor, that was attached to the motherboard and designed to work in conjunction with the microprocessor.

The Intel Pentium Eventually, floating-point capabilities were integrated onto the CPU as a separate execution unit alongside the ALU. For instance, an integer execution unit (IU) is an ALU that executes integer arithmetic and logical instructions, a floating-point execution unit (FPU) is an ALU that executes floating-point arithmetic and logical instructions, and so on. The figure shows that the Pentium has two IUs, a simple integer unit (SIU) and a complex integer unit (CIU), and a single FPU.

Memory-Access Units • In almost all processors, there is a pair of execution units that execute memory-access and branch instructions:
1. The load-store unit (LSU), which is responsible for the execution of load and store instructions, as well as for address generation. LSUs have small, stripped-down integer addition hardware that can quickly perform the addition required to compute an address.
2. The branch execution unit (BEU), which is responsible for executing conditional and unconditional branch instructions. The BEU of the DLW series reads the processor status word and decides whether or not to replace the program counter with the branch target. The BEU also often has its own address generation unit for performing quick address calculations as needed.
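The work of the two units can be sketched as two small functions. This is a simplified model under stated assumptions: the base-plus-offset addressing form, the representation of the processor status word as a dictionary of named flags, and the names `effective_address` and `next_pc` are all hypothetical choices for illustration.

```python
from typing import Dict, Optional


def effective_address(base: int, offset: int) -> int:
    """LSU-style address generation: one integer addition.

    Base-plus-offset addressing is assumed here for illustration.
    """
    return base + offset


def next_pc(pc: int, target: int, psw: Dict[str, bool],
            condition: Optional[str] = None, step: int = 1) -> int:
    """BEU-style decision: keep counting or jump to the branch target.

    `psw` models the processor status word as named flags (an assumption).
    `condition` names the flag a conditional branch tests; None means the
    branch is unconditional. `step` is the assumed instruction size.
    """
    taken = condition is None or psw.get(condition, False)
    return target if taken else pc + step


print(hex(effective_address(0x1000, 8)))
print(next_pc(10, 40, {"zero": True}, condition="zero"))
print(next_pc(10, 40, {"zero": False}, condition="zero"))
print(next_pc(10, 40, {}))
```

A taken branch replaces the program counter with the target, while a conditional branch whose flag is clear simply falls through to the next instruction.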

Reference • INSIDE THE MACHINE Jon Stokes 2007.