Register Renaming & Value Prediction

Overview
► Need for Post-RISC
► Register Renaming vs. Allocation Strategies
► How to compile for Post-RISC machines
► Dynamic Register Renaming through Virtual-Physical Registers

Software Outlives Hardware
► How to make old software run faster?
  • Faster CPU clock and memory hierarchy
  • Adapt CPUs to actual software (profiling/tuning)
  • More instructions per cycle
► Today’s software will run on tomorrow’s CPUs
  • Need to keep software interface stable
  • More functional units and registers

Compile-time vs. Run-time
► Little is known about software at compile-time
► Space/time trade-offs
  • Memory speeds cannot keep up with CPU speeds
  • When to apply optimizations that increase code size

Solutions
► New scalable architecture (IA-64)
  • Decouple physical/virtual registers using register windows
  • More explicit parallelism allows for more functional units
  • Explicit speculative instructions
► Post-RISC architecture
  • Remove limits in super-scalar implementations of existing architectures
  • Extract even more parallelism out of existing software

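Anti- and Output Dependencies
► True dependencies are also called read-after-write (RAW) hazards
► An instruction may use a result produced by the previous instruction
  • The two instructions cannot execute simultaneously in multiple pipelines.
  • The second instruction must typically be stalled.
► Anti- (write-after-read, WAR) and output (write-after-write, WAW) dependencies involve only the reuse of a register name, not a flow of data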
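The hazard classes involved here (RAW for true dependencies, WAR for anti-dependencies, WAW for output dependencies) can be told apart from nothing more than the registers each instruction reads and writes. The following is a minimal sketch, not taken from the slides; the `Instr` type and the `hazards` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one optional destination, some sources."""
    dest: Optional[str]      # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]    # registers read


def hazards(first: Instr, second: Instr) -> set:
    """Classify register hazards between an earlier and a later instruction."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")     # true dependency: later reads what earlier writes
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")     # anti-dependency: later overwrites what earlier reads
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")     # output dependency: both write the same name
    return found


mul = Instr(dest="rA", srcs=("rA", "rA"))    # rA := rA * rA
load = Instr(dest="rA", srcs=())             # rA := Mem3
store = Instr(dest=None, srcs=("rA",))       # Mem2 := rA
```

Only the RAW case reflects data actually flowing from one instruction to the next; the other two disappear if the later write is given a different register.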

Structural Dependencies
► Stalls result in less than optimal performance
  • We may have single-issue cycles, which process only a single instruction.
  • Worse, we may have zero-issue cycles, which initiate no new instructions.
► Data dependencies can also limit performance for a scalar machine
  • Two-cycle memory load/write
  • Intra-instruction dependencies

Scheduling
► Scheduling can remove stalls
► Intra-instruction dependencies cannot be removed by scheduling (CISC)

Need for Post-RISC
► Super-scalar has diminishing returns in IPC (instructions per clock)
  • 2-way: 1.6 to 1.8 (85%)
  • 4-way: 2.6 (65%)
  • 8-way: ???
► More parallelism needed
► Look beyond a set of 4 instructions
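The percentages on this slide are consistent with dividing the sustained number of instructions per cycle by the issue width. A small sketch of that arithmetic follows; the function name is hypothetical, and using 1.7 as the midpoint of the 1.6 to 1.8 range is an assumption.

```python
def issue_utilization(instructions_per_cycle: float, issue_width: int) -> float:
    """Fraction of the available issue slots that are actually used."""
    return instructions_per_cycle / issue_width


# 1.7 is the assumed midpoint of the 1.6 to 1.8 range quoted for 2-way issue.
two_way = issue_utilization(1.7, 2)
four_way = issue_utilization(2.6, 4)
print(f"2-way: {two_way:.0%}, 4-way: {four_way:.0%}")
```

Doubling the issue width from 2 to 4 raises throughput by only about half in these figures, which is the diminishing return the slide points at.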

Post-RISC characteristics
► Out-of-order execution
  • (Existed 20 years ago on IBM and CDC)
  • Innovative for single chip
  • Branch history bits
► Precise interrupts
► Fetch/Flow Prediction
► More caching
  • Instruction cache becomes CPU scratch space
► Register renaming
  • First in IBM 360/91 FPU

SPECint92 Trends
► SPECint92 numbers are increasing
  • DEC has historically been the champ
► SPECint92 per clock rate (MHz)
  • DEC low (21164 @ 300 => 1.14, 10/95)
  • IBM strong early (580H @ 55 => 1.76, 9/93)
  • HP (PA-8000 @ 133 => 2.7, 10/95)

The Post-RISC Architecture

Post-RISC CPUs
► Traditional RISC
  • DEC Alpha 21164
  • Sun UltraSPARC 1
► (partially) Post-RISC
  • PowerPC 604
  • MIPS R10000
  • HP PA-8000
  • Intel Pentium Pro
  • DEC Alpha 21264
  • HAL SPARC64

Automatic Register Renaming
► Every register write allocates a new R
► The register name A is an alias for the last R allocated by a write to A
► An instruction that both reads and writes a register allocates a new R too
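The three rules above can be sketched in a few lines. This is an illustrative model only: the `Renamer` class is a hypothetical name, and it assumes an unbounded supply of R registers, which real hardware does not have.

```python
from itertools import count


class Renamer:
    """Toy model: every write gets a fresh R; a name aliases its latest R."""

    def __init__(self):
        self._fresh = count()    # assumption: R registers never run out
        self.alias = {}          # architectural name -> most recent R

    def rename(self, dest, srcs):
        # Sources are looked up BEFORE the destination is re-aliased, so an
        # instruction that reads and writes A reads the old R, writes a new one.
        # A source must have been written earlier, otherwise KeyError is raised.
        renamed_srcs = [self.alias[s] for s in srcs]
        new_r = f"R{next(self._fresh)}"
        self.alias[dest] = new_r
        return new_r, renamed_srcs


rn = Renamer()
first = rn.rename("A", [])            # A := Mem1
second = rn.rename("A", ["A", "A"])   # A := A * A
third = rn.rename("A", [])            # A := Mem3, independent of the above
```

After renaming, the third instruction shares no register with the first two, even though all three were written against the same name A.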

Advantages over More ISA Registers
► Smaller instructions
► Allow same software to run on range of implementations
  • Compare the same program running on Pentium or AMD Ath
► Less state to save
  • Faster function calls
  • Faster context switches
  • Lifetimes can be optimized

Renaming Implementation
► Rename Storage Locations
  • Reorder Buffer
  • Physical Register File
► Similarities:
  • Allocate at decode
  • Release at commit

Renaming using Reorder buffer
► Results are kept in reorder buffer
► Source operands are read either from
  • the register file, or
  • a reorder buffer entry
► Not-yet-ready results are forwarded to instruction queue
► Used by Intel Pentium III, PowerPC 604, SPARC64

Renaming on Pentium III
► All registers can be renamed (general-purpose, floating-point, status)
► Renaming uses a set of 40 reorder buffer entries
  • FPU control/status cannot be renamed
  • Max 2 renamings per instruction

Register Allocation Example
► Minimal number of named registers
► Scheduling is limited
► Strictly serial execution
Source:
    Mem2 := Mem1 * Mem1;
    Mem4 := Mem3 + 1;
Allocated code:
    rA := Mem1;
    rA := rA * rA;
    Mem2 := rA;
    rA := Mem3;
    rA := rA + 1;
    Mem4 := rA;

Renaming using Physical Register File
► Register file contains more registers than defined in the ISA (logical registers)
► Map logical registers to physical registers during decode
► Operands are always read from the physical register file
► Used by MIPS R10000 and DEC Alpha 21264
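A minimal sketch of this scheme, assuming a map table plus a free list of physical registers. The class and method names are hypothetical, and releasing the previous mapping when the renaming instruction commits is one common policy rather than something the slide spells out.

```python
from collections import deque


class PhysicalFileRenamer:
    """Toy map table + free list; registers are plain integers."""

    def __init__(self, num_logical: int, num_physical: int):
        assert num_physical > num_logical
        # Assumption: logical register i starts out mapped to physical register i.
        self.map = list(range(num_logical))
        self.free = deque(range(num_logical, num_physical))

    def decode(self, dest: int, srcs):
        """Rename one instruction at decode.

        Returns (new_phys, src_phys, old_phys), or None when no physical
        register is free and decode would have to stall.
        """
        if not self.free:
            return None
        src_phys = [self.map[s] for s in srcs]   # sources use current mappings
        old_phys = self.map[dest]                # kept until commit
        new_phys = self.free.popleft()           # allocate at decode
        self.map[dest] = new_phys
        return new_phys, src_phys, old_phys

    def commit(self, old_phys: int) -> None:
        """Release the previous mapping once the instruction commits."""
        self.free.append(old_phys)
```

With 4 logical and 6 physical registers, two back-to-back writes exhaust the free list and a third decode returns None until an earlier instruction commits.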

Virtual-Physical Registers
► Motivation: better utilization of physical registers
  • Important in the presence of long-latency instructions
► The conventional scheme “wastes” a register for each:
  • Decoded instruction that has not finished execution
  • Committed instruction whose result is dead (can be eliminated by maintaining a reference counter)
► Example:
    load  f2, 0(r6)
    fdiv  f2, f10
    fmul  f2, f12
    fadd  f2, 1

Virtual-Physical Register Renaming
► General Map Table (GMT)
  • Indexed by logical register L
  • VP register: last virtual-physical register that L has been mapped to
  • P register: last physical register that L and VP have been mapped to
  • V bit: indicates whether P is valid
► Physical Map Table (PMT)
  • Has an entry for each VP
  • Contains the last physical register that VP has been mapped to

Functional Description
► For each logical source register S, do a GMT lookup
  • If the V bit is set, rename S to P
  • Otherwise, rename S to VP
► Rename the logical destination register to a new VP
► Update GMT: set VP to the new mapping and reset V
► Save the previous VP in the reorder buffer to be able to roll back

Functional Description
► Instruction Queue Fields:
  • Operation code
  • Destination VP
  • Source operands
  • Ready bits for source operands: when ready, the source operand contains a physical register number
► Reorder Buffer Entry
  • Destination logical register
  • Completion bit
  • VP mapping of last instruction with same logical destination

Functional Description
► When source operands are ready, the instruction is issued
► When the instruction completes:
  • A new physical register R is allocated for the result
  • PMT is updated to reflect the new mapping
  • The VP number of the destination is broadcast, together with the physical register identifier, to all entries in the instruction queue
  • GMT is updated: the entry corresponding to the logical destination is checked for a match with the VP and, if so, the physical register number is copied to the P register field and the V flag is set
  • As a result, a new instruction using the same logical register will find the corresponding physical register in GMT
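The decode and completion steps described on the Functional Description slides can be sketched as follows. This is a simplified model under stated assumptions: the class and method names are hypothetical, every logical register starts with a valid physical register, and neither the instruction queue, rollback, nor the release of physical registers is modeled (so `complete` fails if the free list is empty).

```python
from collections import deque
from itertools import count


class VPRenamer:
    """Toy GMT/PMT model of virtual-physical register renaming."""

    def __init__(self, num_logical: int, num_physical: int):
        assert num_physical >= num_logical
        self._vp_ids = count()
        self.free_phys = deque(range(num_physical))
        self.gmt = {}    # logical -> {"vp": int, "p": int or None, "v": bool}
        self.pmt = {}    # vp -> physical register
        for logical in range(num_logical):
            vp, p = next(self._vp_ids), self.free_phys.popleft()
            self.gmt[logical] = {"vp": vp, "p": p, "v": True}
            self.pmt[vp] = p

    def decode(self, dest: int, srcs):
        """Rename at decode: no physical register is allocated yet."""
        renamed = []
        for s in srcs:
            entry = self.gmt[s]
            # V set: rename to P; otherwise rename to the VP tag.
            renamed.append(("P", entry["p"]) if entry["v"] else ("VP", entry["vp"]))
        prev_vp = self.gmt[dest]["vp"]     # would be saved in the reorder buffer
        new_vp = next(self._vp_ids)
        self.gmt[dest] = {"vp": new_vp, "p": None, "v": False}
        return new_vp, renamed, prev_vp

    def complete(self, dest: int, vp: int) -> int:
        """On completion, allocate the physical register and update the tables."""
        p = self.free_phys.popleft()       # allocated only now
        self.pmt[vp] = p
        entry = self.gmt[dest]
        if entry["vp"] == vp:              # still the latest mapping of dest?
            entry["p"], entry["v"] = p, True
        return p                           # (vp, p) would be broadcast to the queue
```

The point of the scheme shows up in `decode`: an instruction waiting on a long-latency operation holds only a VP tag, and a physical register is consumed only when `complete` runs.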

Register Allocation Example
► Uses more named registers
► Scheduling more effective
► 2-way super-scalar execution
Source:
    Mem2 := Mem1 * Mem1;
    Mem4 := Mem3 + 1;
Allocated code (two instructions per line, one per pipeline):
    rA := Mem1;       rB := Mem3;
    rA := rA * rA;    rB := rB + 1;
    Mem2 := rA;       Mem4 := rB;

Effect of Register Renaming
► Schedule uses 4 hardware registers
► 2-way super-scalar execution
    rA1 := Mem1;         rB1 := Mem3;
    rA2 := rA1 * rA1;    rB2 := rB1 + 1;
    Mem2 := rA2;         Mem4 := rB2;

Effect of Register Renaming
► Schedule uses 4 hardware registers
► Can hide memory-write latency
► Still no full use of multiple pipelines
    rA1 := Mem1;
    rA2 := rA1 * rA1;
    Mem2 := rA2;
    rA3 := Mem3;
    rA4 := rA3 + 1;
    Mem4 := rA4;

Renaming and O-O-O execution
► Instructions wait for:
  • Availability of execution unit
  • Input dependencies
  • Older instructions have priority
  • Load instructions have priority
► Instructions do NOT wait for:
  • Program order
  • Branch resolution
  • Output dependencies (use “rename register”)

Renaming and O-O-O execution
► Schedule uses 4 hardware registers
► Can hide memory-write latency
► “Bad” schedule uses both pipelines
► Only one register name used
    rA1 := Mem1;
    rA2 := rA1 * rA1;
    Mem2 := rA2;
    rA3 := Mem3;
    rA4 := rA3 + 1;
    Mem4 := rA4;
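To see why a serially written program that uses a single register name can still fill both pipelines, the sketch below renames the six-instruction example and then issues it greedily, oldest first. It is an illustration under simplifying assumptions, not a model of any real CPU: every instruction takes one cycle, memory locations are treated as independent, and the function names are hypothetical.

```python
def rename(program):
    """Give every register write a fresh instance (rA -> rA1, rA2, ...).

    program: list of (dest, srcs); dest is None for a store.
    A source must have been written by an earlier instruction.
    """
    version, renamed = {}, []
    for dest, srcs in program:
        new_srcs = tuple(f"{s}{version[s]}" for s in srcs)   # read old instances
        if dest is not None:
            version[dest] = version.get(dest, 0) + 1
            dest = f"{dest}{version[dest]}"
        renamed.append((dest, new_srcs))
    return renamed


def depends(early, late):
    """True if `late` must wait for `early` (RAW, WAR or WAW on a register)."""
    e_dest, e_srcs = early
    l_dest, l_srcs = late
    raw = e_dest is not None and e_dest in l_srcs
    war = l_dest is not None and l_dest in e_srcs
    waw = e_dest is not None and e_dest == l_dest
    return raw or war or waw


def schedule(program, width=2):
    """Oldest-first greedy issue; returns one list of instruction indices per cycle."""
    done_cycle, cycles = {}, []
    remaining = list(range(len(program)))
    while remaining:
        cycle_no, issued = len(cycles), []
        for i in remaining:
            if len(issued) == width:
                break
            # Ready when every earlier conflicting instruction finished in a prior cycle.
            if all(done_cycle.get(j, cycle_no) < cycle_no
                   for j in range(i) if depends(program[j], program[i])):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle_no
            remaining.remove(i)
        cycles.append(issued)
    return cycles


serial = [("rA", ()), ("rA", ("rA", "rA")), (None, ("rA",)),
          ("rA", ()), ("rA", ("rA",)), (None, ("rA",))]
without_renaming = schedule(serial)
with_renaming = schedule(rename(serial))
```

In this model the unrenamed program needs six cycles because every instruction conflicts with its predecessor over rA, while the renamed program issues as three pairs, matching the two-register schedule shown earlier in the deck.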

Renaming-aware scheduling?
► Use register renaming in the allocator
  • Minimal number of named registers
  • Maximal number of register instances
► Do not do scheduling that the CPU can do
  • Over-scheduling can be worse than no scheduling at all