Future Superscalar Processors Based on Instruction Compounding
Stamatis Vassiliadis Symposium, Sept. 28, 2007
J. E. Smith
Future Microprocessors
Instruction Compounding (Fusing)
  • Instruction compounding, or “fusing”, has become a key idea in high-performance microprocessors
  • “A compound instruction reflects the parallel issue of instructions; it comprises some number of independent instructions or interlocked instructions”
  • “Instructions composing a compound instruction need not be consecutive.”
    -- S. Vassiliadis et al., IBM Journal of R and D, Jan. 1994
The Future Processor: Three Key Aspects
  1. Instruction compounding or fusing
     – Based on S. Vassiliadis' work
     – Employs compounding and a 3-input ALU
  2. Co-designed VM for dynamic translation/fusing
     – Concealed from all software
     – Optimized (fused) instructions held in a code cache
  3. Dual decoder front-end for fast startup
     – Hardware front-end decoder for fast startup
     – Software translator for sustained high performance
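The split between the hardware decoder and the software translator can be pictured with a small model. This is only a sketch under stated assumptions: the execution counter, the `hot_threshold` value, and the names `DualModeFrontEnd`, `fetch`, and `translate` are hypothetical and do not come from the slides, which say only that cold code starts through the hardware decoder and that optimized (fused) code is held in a code cache.

```python
from typing import Callable, Dict, Tuple


class DualModeFrontEnd:
    """Toy model: cold code uses the hardware decoder, hot code the code cache."""

    def __init__(self, translate: Callable[[int], str], hot_threshold: int = 3):
        self.translate = translate          # stands in for the software translator
        self.hot_threshold = hot_threshold  # hypothetical hotspot threshold
        self.exec_counts: Dict[int, int] = {}
        self.code_cache: Dict[int, str] = {}

    def fetch(self, pc: int) -> Tuple[str, object]:
        # Already translated: run the fused code from the code cache.
        if pc in self.code_cache:
            return ("code-cache", self.code_cache[pc])
        # Otherwise decode in hardware and count the execution.
        count = self.exec_counts.get(pc, 0) + 1
        self.exec_counts[pc] = count
        if count >= self.hot_threshold:
            # The region just became hot: translate it for later executions.
            self.code_cache[pc] = self.translate(pc)
        return ("hw-decoder", pc)


front_end = DualModeFrontEnd(lambda pc: f"fused@{pc:#x}", hot_threshold=3)
paths = [front_end.fetch(0x400)[0] for _ in range(4)]
print(paths)
```

In this model the first three executions of the block go through the hardware decoder and the fourth is served from the code cache, which mirrors the "fast startup, then sustained performance" split on the slide.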
Processor Micro-architecture
Fusible Instruction Set
  • RISC-ops with unique features:
    – A fusible bit per instruction fuses two dependent instructions
    – Dense instruction encoding, 16/32-bit ISA design
  • Special features to support the x86 ISA:
    – Condition codes
    – Addressing modes
    – Aware of long immediate & displacement values
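One way to picture the fusible bit is a decoder that pairs an instruction with its successor whenever the bit is set. The sketch below is illustrative only: the bit position (`FUSIBLE_BIT = 0x8000`), the convention that the bit sits on the head and points at the next instruction, and the treatment of every instruction as one pre-delimited integer word are assumptions, not the actual encoding.

```python
from typing import List, Tuple

FUSIBLE_BIT = 0x8000  # hypothetical position of the fusible bit


def group_macro_ops(words: List[int]) -> List[Tuple[int, ...]]:
    """Group instruction words into macro-ops using the fusible bit."""
    macro_ops: List[Tuple[int, ...]] = []
    i = 0
    while i < len(words):
        if words[i] & FUSIBLE_BIT and i + 1 < len(words):
            # Head with the bit set: fuse it with the following instruction.
            macro_ops.append((words[i], words[i + 1]))
            i += 2
        else:
            # Bit clear (or no successor): the instruction issues on its own.
            macro_ops.append((words[i],))
            i += 1
    return macro_ops


grouped = group_macro_ops([0x8001, 0x0002, 0x0003])
print(grouped)
```

Here the first two words form one fused macro-op and the third stays a single instruction.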
Microarchitecture: Macro-op Execution
  • Enhanced OOO superscalar microarchitecture
    – Process & execute fused macro-ops as single instructions throughout the entire pipeline
Macro-op Fusing Algorithm
  • Objectives:
    – Maximize fused dependent pairs
    – Simple & fast
  • Heuristics:
    – Pipelined scheduler: only single-cycle ALU ops can be a head. Minimize non-fused single-cycle ALU ops.
    – Criticality: fuse instructions that are “close” in the original sequence. ALU-op criticality is easier to estimate.
    – Simplicity: 2 or fewer distinct register operands per fused pair
  • Solution: two-pass fusing algorithm:
    – The 1st pass, a forward scan, prioritizes ALU ops, i.e. for each ALU-op tail candidate, look backward in the scan for its head
    – The 2nd pass considers all kinds of RISC-ops as tail candidates
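The two-pass scan can be sketched in a few lines. This is a minimal illustration, not the translator's actual code: the `RiscOp` record, the `max_dist` window standing in for "close", and the reading of the simplicity rule as "at most two distinct source registers read from outside the pair" are all assumptions. The sketch only selects pairs; the register renaming and reordering legality checks a real translator needs are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RiscOp:
    text: str                      # printable form, for illustration only
    dst: Optional[str]             # destination register, None for stores
    srcs: Tuple[str, ...]          # source registers (immediates omitted)
    single_cycle_alu: bool         # only these ops may be heads
    partner: Optional[int] = None  # index of the op this one is fused with


def find_head(ops: List[RiscOp], t: int, max_dist: int) -> Optional[int]:
    """Look backward from tail candidate t for the nearest eligible head."""
    tail = ops[t]
    pending = set(tail.srcs)  # tail sources whose nearest writer is not yet seen
    for h in range(t - 1, max(0, t - max_dist) - 1, -1):
        head = ops[h]
        if head.dst in pending:
            pending.discard(head.dst)  # earlier writers of this register are shadowed
            if head.single_cycle_alu and head.partner is None:
                # Registers the fused pair reads from outside the pair.
                external = set(head.srcs) | (set(tail.srcs) - {head.dst})
                if len(external) <= 2:
                    return h
    return None


def fuse_two_pass(ops: List[RiscOp], max_dist: int = 8) -> List[Tuple[int, int]]:
    """Pass 1 takes only ALU ops as tails; pass 2 takes any RISC-op."""
    pairs = []
    for alu_tails_only in (True, False):
        for t, tail in enumerate(ops):
            if tail.partner is not None:
                continue
            if alu_tails_only and not tail.single_cycle_alu:
                continue
            h = find_head(ops, t, max_dist)
            if h is not None:
                ops[h].partner, tail.partner = t, h
                pairs.append((h, t))
    return sorted(pairs)


# The six RISC-ops of the example slide, 0-indexed.
ops = [
    RiscOp("ADD Reax, Redi, 1", "Reax", ("Redi",), True),
    RiscOp("ST Reax, mem[R22]", None, ("Reax", "R22"), False),
    RiscOp("LD.zx Rebx, mem[Rebp + Recx << 1]", "Rebx", ("Rebp", "Recx"), False),
    RiscOp("AND Reax, 0000007f", "Reax", ("Reax",), True),
    RiscOp("ADD R17, Reax, Resi", "R17", ("Reax", "Resi"), True),
    RiscOp("LD Redx, mem[R17 + 0x7c]", "Redx", ("R17",), False),
]
pairs = fuse_two_pass(ops)
for h, t in pairs:
    print(ops[h].text, "::", ops[t].text)
```

On this input the first pass pairs the ADD with the AND, and the second pass pairs the address-generating ADD with the final load, the same two pairs the example slide shows.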
Fusing Algorithm: Example
  x86 asm:
    1. lea   eax, DS:[edi + 01]
    2. mov   [DS:080b8658], eax
    3. movzx ebx, SS:[ebp + ecx << 1]
    4. and   eax, 0000007f
    5. mov   edx, DS:[eax + esi << 0 + 0x7c]
  RISC-ops:
    1. ADD   Reax, Redi, 1
    2. ST    Reax, mem[R22]
    3. LD.zx Rebx, mem[Rebp + Recx << 1]
    4. AND   Reax, 0000007f
    5. ADD   R17, Reax, Resi
    6. LD    Redx, mem[R17 + 0x7c]
  After fusing: macro-ops
    1. ADD   R18, Redi, 1 :: AND Reax, R18, 007f
    2. ST    R18, mem[R22]
    3. LD.zx Rebx, mem[Rebp + Recx << 1]
    4. ADD   R17, Reax, Resi :: LD Redx, mem[R17+0x7c]
Instruction Fusing Profile
  • 55+% of RISC-ops are fused, which increases effective ILP by a factor of 1.4
  • Only 6% single-cycle ALU ops left un-fused
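The 1.4 figure is consistent with a simple count, sketched below under two assumptions that are not stated on the slide: every fused RISC-op belongs to exactly one pair, and "effective ILP" means RISC-ops carried per issued entity. The function name `ilp_gain` is hypothetical.

```python
def ilp_gain(fused_fraction: float) -> float:
    """RISC-ops per issued entity when a given fraction of ops is fused in pairs."""
    # Each fused pair turns two RISC-ops into one entity, so a fraction f of
    # fused ops removes f / 2 entities per original RISC-op.
    entities_per_op = 1.0 - fused_fraction / 2.0
    return 1.0 / entities_per_op


gain = ilp_gain(0.55)
print(round(gain, 2))  # 1.38
```

With 55% of RISC-ops fused, 100 RISC-ops become 72.5 entities, which gives a ratio of about 1.38 and rounds to the 1.4 quoted above.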
Other DBT Software Profile
  • Of all fused macro-ops:
    – 50% are ALU-ALU pairs
    – 30% are fused condition test & conditional branch pairs
    – Others are mostly ALU-MEM op pairs
    – 70+% are inter-x86 instruction fusion
    – 46% access two distinct source registers; only 15% (6% of all instruction entities) write two distinct destination registers
  • Translation Overhead Profile
    – About 1000 instructions per translated hotspot instruction
Co-designed x86 Processor Performance
Dual Decoder Front-End
Evaluation: Startup Performance
Activity of HW x86 Decoder
Important Research Issues
  • Profiling
    – Probe insertion via software translator not feasible
  • Multi-core
    – Shared code cache
    – SMT designs
  • Memory consistency
    – Stores can be done in-order
    – Re-scheduled loads may be important for performance
  • Precise traps
    – Potential HW assist?