
  • Slides: 105
Intel® Processor Micro-architecture – Core® Intel® Software College



Objectives
After completion of this module you will be able to describe:
• Components of an IA processor
• Working flow of the instruction pipeline
• Notable features of the architecture
Intel® Processor Micro-architecture - Core® microarchitecture
Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.


Agenda
• Introduction
• Notable features
• Micro-architecture drill-down
• Advanced cache technology
• Coding considerations


Industrial Recognition
• PC Format, May 2006: “Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon 64s into burger meat is the game.”
• Real World Tech, “Intel's Next Generation Microarchitecture Unveiled”: “Just as important as the technical innovations in Core MPUs, this microarchitecture will have a profound impact on the industry.”
• GD Hardware.com, “Intel Dishes the Knockout Punch to AMD with Conroe”: “…the results were far more than we could hope for and it'll be amusing to see AMD's response to this beat-down session.”
• Anandtech, “Intel Regains Performance Crown”: “…At 2.8 or 3.0 GHz, a Conroe EE would offer even stronger performance than what we've seen here.”
• Extremetech, “Intel Reveals Conroe Architecture”: “…And not only was the Intel system running at 2.66 GHz, a slower clock rate than the top Pentium 4, it was outpacing an overclocked Athlon 64 FX-60. Wrap your brain around that idea for a bit…”
• Hot Hardware.com, “Conroe Benchmarks - Intel Showing Big Strength”: “…Intel is poised to change the face of the desktop computing landscape…”


Performance Summary
Intel® Core™ Microarchitecture dramatically boosts Intel platform performance:
• Conroe and Woodcrest drive clear desktop/server performance leadership
• Merom extends Intel mobile performance leadership
Intel® Core™ Microarchitecture-based platforms set the bar in performance and energy efficiency for the multi-core era:
• Intel's 3rd-generation dual-core (while the competition is stuck on its 1st generation)
• New Intel high-performance “engine”: wider, smarter, faster, more efficient
Best Processor on the Planet: Energy-Efficient Performance¹
The “Core™ Effect”: Intel® Core™ Microarchitecture performance boosts of 20% (Merom), 40% (Conroe), and 80% (Woodcrest)¹
Ramp fuels roadmap accelerations
¹ Based on SPECint*_rate_base2000


Intel® Core™ Microarchitecture
Intel® NetBurst® Microarchitecture + Mobile Microarchitecture + New Innovations


Agenda
• Introduction
• Notable features
  • Wide Dynamic Execution
  • Smart Memory Access
  • Advanced Smart Cache
  • Advanced Digital Media Boost
  • Intelligent Power Capability
• Micro-architecture drill-down
• Advanced cache technology
• Coding considerations


Intel® Core™ Microarchitecture Notable Features
Intel® Wide Dynamic Execution
• 14-stage efficient pipeline
• Wider execution path
• Advanced branch prediction
• Macro-fusion
  • Roughly 15% of all instructions are conditional branches
  • Macro-fusion fuses a compare and the following conditional jump into one micro-op, reducing the number of micro-ops flowing down the pipeline
• Micro-fusion
  • Merges the load and operation micro-ops of an instruction into one fused micro-op
• Stack pointer tracker
  • Dedicated logic tracks ESP, so PUSH/POP sequences return the correct values without explicit ESP-update micro-ops
• 64-bit support
  • Merom, Conroe, and Woodcrest support EM64T


Intel® Core™ Microarchitecture Notable Features (cont.)
Intel® Smart Memory Access
• Improved prefetching
• Memory disambiguation
  • Loads can be issued ahead of older stores before it is known whether they depend on them (a possible pointer conflict)
  • Issuing loads earlier hides memory latency
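The loop below is a minimal C sketch of the access pattern memory disambiguation targets. Each iteration loads `src[i]` after storing to `dst[i-1]`, and since the compiler cannot assume the pointers are disjoint, the hardware cannot tell at issue time whether the load depends on that store. Disambiguation bets on the common no-overlap case and lets the load go early; if the bet is wrong, the conflict is detected and the load re-executed, so results stay correct. The function name `scale2` is hypothetical, and the hardware speculation itself is not observable from C; the example only shows the pattern and why correctness must be preserved when pointers do alias.

```c
#include <stddef.h>

/* dst[i] = 2 * src[i]. Without `restrict`, the compiler must allow for
   dst and src overlapping, so each load of src[i] may depend on an
   earlier store to dst[]. */
static void scale2(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = 2 * src[i];
}
```

With disjoint arrays, `scale2(b, a, 3)` on `a = {1, 2, 3}` gives `b = {2, 4, 6}`: no load depends on a store, which is the case the hardware speculates on. With `buf = {1, 0, 0, 0}`, `scale2(buf + 1, buf, 3)` makes every load read the value stored one iteration earlier and produces `{1, 2, 4, 8}`; here an early load would have to be caught and replayed.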


Intel® Core™ Microarchitecture Notable Features (cont.)
Intel® Advanced Smart Cache
• Multi-core optimization
  • Shared between the two cores
  • Advanced Transfer Cache architecture
  • Reduced bus traffic
  • Both cores have full access to the entire cache
  • Dynamic cache sizing
• Shared second-level (L2) instruction and data cache: 2 MB 8-way or 4 MB 16-way
• Higher bandwidth from the L2 cache to the core
  • ~14-clock latency and 2-clock throughput


Intel® Core™ Microarchitecture Notable Features (cont.)
Intel® Advanced Digital Media Boost
• Single-cycle SIMD operation
  • 8 single-precision flops/cycle
  • 4 double-precision flops/cycle
• Wide operations
  • 128-bit packed Add, Multiply, Load, Store
• Support for Intel® EM64T instructions
[Figure: a 128-bit SIMD operation (SSE/SSE2/SSE3/SSSE3) combines source lanes X4…X1 with Y4…Y1 (bits 127…0) into DEST. On the Core™ architecture all four results (X4 op Y4, X3 op Y3, X2 op Y2, X1 op Y1) complete in clock cycle 1; on the previous microarchitecture X2 op Y2 and X1 op Y1 complete in cycle 1, and X4 op Y4 and X3 op Y3 in cycle 2.]
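As we read the slide, the 8 single-precision flops/cycle figure comes from one 4-wide packed multiply plus one 4-wide packed add per cycle (2 + 2 lanes for double precision gives 4). The sketch below writes such a loop with the standard SSE intrinsics from `<xmmintrin.h>`: `y = a*x + y`, four floats per 128-bit operation. The function name `saxpy_sse` is hypothetical; whether the multiply and add actually issue in the same cycle depends on the compiler's scheduling and surrounding code, and a compiler may auto-vectorize the plain scalar loop just as well.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE: __m128, _mm_mul_ps, _mm_add_ps, ... */

/* y[i] = a * x[i] + y[i], processing four floats per 128-bit SSE op. */
static void saxpy_sse(float a, const float *x, float *y, size_t n) {
    __m128 va = _mm_set1_ps(a);                 /* broadcast a to all 4 lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);        /* 128-bit unaligned load */
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy); /* packed multiply, packed add */
        _mm_storeu_ps(y + i, vy);               /* 128-bit unaligned store */
    }
    for (; i < n; i++)                          /* scalar tail for n % 4 */
        y[i] = a * x[i] + y[i];
}
```

For example, with `x = {1, 2, 3, 4, 5, 6}`, `y` all zeros, and `a = 2`, the first four elements go through the packed path and the last two through the scalar tail, giving `y = {2, 4, 6, 8, 10, 12}`.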


Intel® Core™ Microarchitecture Notable Features (cont.)
Intelligent Power Capability
• Advanced power gating and dynamic power coordination
  • Multi-point demand-based switching
  • Voltage-frequency switching separation
  • Supports transitions to deeper sleep modes
  • Event blocking
  • Clock partitioning and recovery
  • Dynamic bus parking
• Even during periods of high-performance execution, many parts of the core can be shut off


Intel® Core™ Microarchitecture Notable Features (cont.)
Intelligent Power Capability: Split Buses (core power feature)
• Many buses are sized for worst-case data (an x86 instruction of up to 15 bytes; the ALU can write back 128 bits)
• Improved energy efficiency


Intel® Core™ Microarchitecture Notable Features (cont.)
Intelligent Power Capability: Split Buses (core power feature, cont.)
• By splitting buses to handle varying data widths, we gain the performance benefit of wide buses while keeping the dynamic capacitance (C_dynamic) close to that of narrower buses
• Improved energy efficiency
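Why a lower C_dynamic saves energy follows from the standard first-order CMOS dynamic-power relation (general background, not stated on the slide), where α is the switching activity factor, V the supply voltage, and f the clock frequency:

```latex
P_{\text{dyn}} \approx \alpha \, C_{\text{dyn}} \, V^{2} f
```

At fixed α, V, and f, the dynamic power of a bus scales linearly with the capacitance switched per transfer. So if, as we read the slide, a narrow transfer on a split bus drives only the segment it needs instead of the full worst-case width, the capacitance it switches, and therefore its dynamic power, falls roughly in proportion.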


Agenda
• Introduction
• Notable features
• Micro-architecture drill-down
  • Front End
  • Out-Of-Order Execution Core
  • Memory Sub-system
• Advanced cache technology
• Coding considerations


Key Terms
• CISC vs. RISC
• Superscalar
• Out-of-order vs. in-order
• Architecture vs. micro-architecture
  • Intel architectures
    • IA-32/x86
    • Intel® 64
    • IA-64
  • Historical micro-architectures
    • P6 (Pentium Pro, Pentium III)
    • NetBurst (Pentium 4)
    • Mobile (Centrino platforms)


Intel® Core™ Microarchitecture Overview
[Figure: block diagram. Front End: Instruction Fetch Unit, Decode/IQ, Branch Prediction Unit. Execution Core: Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement). Also shown: 1st Level Cache (Data), 2nd Level Cache, System Bus Unit.]


Intel® Core™ Microarchitecture Drill-down
[Figure: detailed block diagram showing icache, branch prediction, predecode unit, instruction queue, instruction decode, MS (Micro-Sequencer), register alias table, ALLOC, Re-Order Buffer, Reservation Station, execution units (integer/FP/SIMD ×3, load, store address, store data), memory order buffer, data cache unit, and page miss handler.]


Intel® Core™ Microarchitecture Front End
Prepares instructions before they are executed:
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit


Intel® Core™ Microarchitecture – Front End
Instruction Fetch Unit
• Prefetches instructions that are likely to be executed
• Caches frequently used instructions
• Predecodes and buffers instructions
[Figure: front-end block diagram (icache, predecode, instruction queue, instruction decode, MS, branch prediction unit/BTBs), highlighting the Instruction Fetch Unit within the overview diagram.]


Intel® Core™ Microarchitecture – Front End
Instruction Fetch Unit (cont.)
• I-Cache (Instruction Cache)
  • 32 KB / 8-way / 64-byte line
  • 16 aligned bytes fetched per cycle
• ITLB (Instruction Translation Lookaside Buffer)
  • 128 entries for 4 KB pages, 8 entries for 2 MB pages
• Instruction Prefetcher
  • 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers
• Instruction Pre-decoder
  • Instruction Length Decode (predecode)
  • Avoid Length-Changing Prefixes (LCPs); for example, avoid MOV dx, 1234h in a loop (the 66H operand-size prefix shrinks the immediate from 4 bytes to 2, changing the instruction length the pre-decoder expects)
  • The REX (EM64T) prefix (4xH) is not an LCP
[Figure: x86 instruction format: Instruction Prefixes (66H/67H), Opcode, ModR/M, SIB, Displacement, Immediate.]
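The I-cache parameters above fix its geometry: sets = size / (ways × line size) = 32 KB / (8 × 64 B) = 64 sets, so an address splits into 6 line-offset bits and 6 set-index bits. The helper below is a minimal sketch of that arithmetic for any power-of-two set-associative cache; the names `cache_geom`, `cache_geometry`, and `cache_set_index` are hypothetical, and applying it to the L2 assumes a 64-byte L2 line, which these slides do not state.

```c
#include <stddef.h>

/* Geometry of a set-associative cache (all sizes assumed powers of two). */
typedef struct {
    size_t sets;
    unsigned offset_bits;   /* bits selecting a byte within a line */
    unsigned index_bits;    /* bits selecting a set */
} cache_geom;

/* log2 of a power of two. */
static unsigned log2_pow2(size_t v) {
    unsigned bits = 0;
    while (v > 1) { v >>= 1; bits++; }
    return bits;
}

/* size = sets * ways * line_bytes, so sets = size / (ways * line_bytes). */
static cache_geom cache_geometry(size_t size_bytes, size_t ways, size_t line_bytes) {
    cache_geom g;
    g.sets = size_bytes / (ways * line_bytes);
    g.offset_bits = log2_pow2(line_bytes);
    g.index_bits = log2_pow2(g.sets);
    return g;
}

/* Set that an address maps to: drop the offset bits, keep the index bits. */
static size_t cache_set_index(size_t addr, cache_geom g) {
    return (addr >> g.offset_bits) & (g.sets - 1);
}
```

For the I-cache, `cache_geometry(32 * 1024, 8, 64)` gives 64 sets with 6 offset and 6 index bits, and address `0x1040` falls in set 1. For the 4 MB 16-way L2, under the 64-byte-line assumption, it gives 4096 sets (12 index bits).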


Intel® Core™ Microarchitecture – Front End
Instruction Queue
Buffer between the instruction pre-decode unit and the decoders:
• Up to six predecoded instructions written per cycle
• Holds 18 instructions
• Up to 5 instructions read per cycle
Potential loop cache: Loop Stream Detector (LSD) support
• Re-uses the predecoded instructions already in the queue, so fetch and predecode can be skipped for small loops
• Potential power saving
[Figure: front-end block diagram highlighting the instruction queue between predecode and instruction decode.]


Intel® Core™ Microarchitecture – Front End
Instruction Decode
• Decodes instructions into micro-ops
• The micro-ops are then ready for execution in the out-of-order (OOO) core
[Figure: front-end block diagram highlighting instruction decode and the MS.]


Intel® Core™ Microarchitecture – Front End
Instruction Decoder Features
• Macro-fusion
• Micro-fusion
• Stack Pointer Tracking


Intel® Core™ Microarchitecture – Front End
Instruction Decode / Decoders
• Instructions are converted to micro-ops (uops)
  • Single-uop instructions include load+op, stores, indirect jumps, RET, …
• 4 decoders: 1 “large” and 3 “small”
  • All decoders handle “simple” 1-uop instructions
  • The large decoder handles instructions of up to 4 uops
• All decoders work in parallel
  • Four (or more, with macro-fusion) instructions per cycle
• Instructions of 2 to 4 uops go to the large decoder; longer flows are taken over by the Micro-Sequencer, which supplies them from the microcode ROM (uCode ROM)
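The 1-large/3-small arrangement above means instruction order affects decode throughput. The function below is a simplified model under stated assumptions: decoding is in program order, decoder 0 accepts any instruction of 1 to 4 uops, decoders 1 to 3 accept only 1-uop instructions, and a multi-uop instruction that would land on a small decoder waits for decoder 0 in the next cycle. It ignores macro-fusion, instruction-queue refill limits, and Micro-Sequencer flows; the name `decode_cycles` is hypothetical.

```c
#include <stddef.h>

/* Simplified 4-1-1-1 decode model. uops[i] is the uop count (1..4) of
   instruction i in program order. Returns the number of decode cycles. */
static int decode_cycles(const int *uops, size_t n) {
    int cycles = 0;
    size_t i = 0;
    while (i < n) {
        cycles++;
        i++;  /* decoder 0 (large) takes the next instruction, whatever its size */
        /* decoders 1..3 (small) take following instructions only while they are 1-uop */
        for (int slot = 1; slot < 4 && i < n && uops[i] == 1; slot++)
            i++;
    }
    return cycles;
}
```

In this model `{2, 1, 1, 1}` decodes in 1 cycle, while `{1, 2, 1, 1}` needs 2, because the 2-uop instruction cannot use a small decoder and must wait for decoder 0. Four 2-uop instructions in a row take 4 cycles. Whether a real compiler schedules instructions with this in mind is not guaranteed.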


Intel® Core™ Microarchitecture – Front End
Code Sequence in Front End
Instruction queue (IQ) contents:
    cmp    EAX, [mem]
    jne    label
    mulps  xmm0, xmm0
    addps  xmm0, [EAX+16]
    movps  [EAX+240], xmm0
• These instructions took more than one fetch because together they are 22 bytes; the IQ buffers them together
• All of these instructions can be decoded by any of the decoders
• CMP and the adjacent JCC are “fused” into a single uop
• Up to 5 instructions are decoded per cycle
Decoder output, spread across dec0 (large) and dec1 to dec3 (small):
    cmpjne   EAX, [mem], label
    mulps    xmm0, xmm0
    load_add xmm0, [EAX+16]
    sta_std  [EAX+240], xmm0


Intel® Core™ Microarchitecture – Front End
Instruction Decode / Macro-Fusion
• Roughly 15% of all instructions are conditional branches
• Macro-fusion merges two instructions (a compare and the following conditional jump) into a single micro-op, as if they were one long instruction, e.g. cmpjae eax, [mem], label
• Enhanced Arithmetic Logic Unit (ALU) for macro-fusion: each macro-fused instruction executes with a single dispatch, evaluating the branch and writing back both flags and target
• Not supported in EM64T long mode


Intel® Core™ Microarchitecture – Front End
Instruction Decode / Macro-Fusion Absent
• Read four instructions from the Instruction Queue
• Each instruction is decoded into separate uops
Enabling example (signed loop counter):
    for (int i=0; i<100000; i++) { … }
Instruction Queue:
    add ecx, 1
    mov [mem1], ecx
    mov edx, [mem1]
    cmp eax, [mem2]
    jge label
Cycle 1: add, mov, mov, cmp on dec0 to dec3
Cycle 2: jge on dec0


Intel® Core™ Microarchitecture – Front End
Instruction Decode / Macro-Fusion Present
• Read five instructions from the Instruction Queue
• Send the fusable pair to a single decoder
• A single uop represents two instructions
• CMP followed by the unsigned condition JAE is fusable; CMP followed by the signed JGE, as in the previous example, is not
Enabling example (unsigned loop counter):
    for (unsigned int i=0; i<100000; i++) { … }
Instruction Queue:
    add ecx, 1
    mov [mem1], ecx
    mov edx, [mem1]
    cmp eax, [mem2]
    jae label
Cycle 1 (dec0 to dec3):
    add ecx, 1
    mov [mem1], ecx
    mov edx, [mem1]
    cmpjae eax, [mem2], label
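A minimal C sketch of the enabling example: the same summation written with a signed and with an unsigned loop counter. Per the two slides above, the signed loop test tends to compile to CMP + JGE/JL, which this microarchitecture does not fuse, while the unsigned test tends to compile to CMP + JAE/JB, which it can. Treat this as an experiment to confirm in the generated assembly rather than a guaranteed speedup: compilers may count down, use TEST, or vectorize instead, and on this generation macro-fusion does not apply in EM64T long mode, so it would have to be a 32-bit build. The function names are hypothetical.

```c
#include <stddef.h>

/* Signed counter: the loop test may compile to cmp + jl/jge (not fused here). */
static long sum_signed(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unsigned counter: the loop test may compile to cmp + jb/jae (fusable). */
static long sum_unsigned(const int *a, unsigned int n) {
    long s = 0;
    for (unsigned int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

For non-negative counts the two functions compute the same result, so switching a loop counter to `unsigned int` changes only the branch condition the compiler is likely to emit, not the program's meaning.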


Intel® Core™ Microarchitecture – Front End
Instruction Decode / Macro-Fusion (cont.)
Benefits:
• Reduced latency
• Increased renaming
• Increased retire bandwidth
• Increased virtual storage
• Power savings
Enabling greater performance and efficiency


Intel® Core™ Microarchitecture – Front End
Instruction Decode / Micro-Op Fusion
• Frequent pairs of micro-operations derived from the same macro-instruction can be fused into a single micro-operation
• Micro-op fusion effectively widens the pipeline


Intel® Core™ Microarchitecture – Front End
Instruction Decode / Micro-Fusion (cont.)
uops of a store, “mov [mem1], edx”:
• sta [mem1] (store-address uop)
• std edx (store-data uop)
• Fused: st [mem1], edx


Intel® Core™ Microarchitecture – Front End
Instruction Decode / Stack Pointer Tracker (Extended Stack Pointer folding)
• ESP is calculated by dedicated logic
  • No explicit micro-ops updating ESP
  • Micro-op savings
  • Power savings
[Figure: PUSH EAX, PUSH EDX, …, POP EBX flowing through Decoder 0 to Decoder N, with the tracker holding a running ESP delta (ESPd=8) and recovery information.]
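The idea behind the tracker can be sketched as a toy model, under assumptions of our own: a 32-bit stack where PUSH moves ESP by -4 and POP by +4. Instead of emitting an ESP-update uop for every PUSH/POP, the decoder keeps the accumulated offset in a `delta`, and only when an instruction reads ESP explicitly (for example `mov eax, [esp+4]`) with a nonzero pending delta does it insert one synchronizing uop. The op kinds and `esp_sync_uops` are hypothetical names; real hardware details (delta width, overflow handling, recovery on mispredicts) are not described on the slide.

```c
#include <stddef.h>

/* Hypothetical instruction classes for the toy model. */
typedef enum { OP_PUSH, OP_POP, OP_READS_ESP, OP_OTHER } op_kind;

/* Returns the number of ESP-sync uops inserted; *pending receives the
   offset still accumulated in the tracker at the end of the sequence. */
static int esp_sync_uops(const op_kind *ops, size_t n, int *pending) {
    int delta = 0;   /* bytes ESP has moved since the last sync */
    int syncs = 0;
    for (size_t i = 0; i < n; i++) {
        switch (ops[i]) {
        case OP_PUSH:
            delta -= 4;                 /* tracked, no ESP-update uop */
            break;
        case OP_POP:
            delta += 4;                 /* tracked, no ESP-update uop */
            break;
        case OP_READS_ESP:
            if (delta != 0) {           /* explicit ESP use: sync once */
                syncs++;
                delta = 0;
            }
            break;
        default:
            break;
        }
    }
    if (pending != NULL)
        *pending = delta;
    return syncs;
}
```

For PUSH, PUSH, POP followed by an explicit ESP read, the model inserts 1 sync uop where a design without tracking would issue 3 ESP-update uops. After just two pushes, nothing is inserted and 8 bytes are pending, which plays the role of the ESPd value in the figure (the sign convention here is our own).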

Core® Micro-architecture Front End
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit

Intel® Core™ Microarchitecture – Front End: Branch Prediction Unit
Allows instructions to execute long before the branch outcome is decided
• Superset of Prescott / Pentium M features
• One taken branch every other clock
• Branch predictions for 32 bytes at a time, twice the width of the fetch engine
[Figure: block diagram highlighting the BTBs / branch prediction unit next to the icache, predecode, instruction queue, instruction decode and MS, feeding the renamer/allocator, scheduler, execution units and retirement buffers.]

Intel® Core™ Microarchitecture – Front End: Branch Prediction Unit (cont.)
• 16-entry Return Stack Buffer (RSB)
• Front-end queuing of BPU lookups
• Types of predictions:
  • Direct calls and jumps
  • Indirect calls and jumps
  • Conditional branches
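A 16-entry RSB means return prediction can fail when calls nest deeper than 16 levels. The Python sketch below illustrates this with a circular buffer in which a CALL pushes its return address and a RET pops a predicted target. The wrap-around overwrite on overflow is an assumption for illustration. The real hardware's overflow and underflow handling is not described on the slide. `ReturnStackBuffer` and `mispredicted_returns` are hypothetical names.

```python
class ReturnStackBuffer:
    """Simplified circular return stack buffer (hypothetical model)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.slots = [None] * entries
        self.top = 0  # pushes minus pops; the slot index is top % entries

    def on_call(self, return_address):
        # A CALL pushes its return address, overwriting the oldest entry on overflow.
        self.slots[self.top % self.entries] = return_address
        self.top += 1

    def predict_return(self):
        # A RET pops the most recent entry as its predicted target.
        self.top -= 1
        return self.slots[self.top % self.entries]


def mispredicted_returns(call_depth, entries=16):
    """Make `call_depth` nested calls, then return from all of them."""
    rsb = ReturnStackBuffer(entries)
    for return_address in range(call_depth):
        rsb.on_call(return_address)
    misses = 0
    for return_address in reversed(range(call_depth)):
        if rsb.predict_return() != return_address:
            misses += 1
    return misses
```

In this model, recursion up to 16 levels deep predicts every return. Each extra level costs one mispredicted return on the way back out.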

Intel® Core™ Microarchitecture – Front End: Branch Prediction Improvements
Intel® Pentium® 4 processor branch prediction PLUS the following two improvements:
• Indirect branch predictor
• Loop detector
Branch mispredictions reduced by >20%
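The value of a loop detector is easiest to see with a loop that runs a fixed number of iterations. A predictor that always guesses "taken" for the loop-closing branch mispredicts once per loop exit. A predictor that learns the trip count can predict the exit as well. The Python sketch below models that idea with one learned trip count per branch. It is a hypothetical illustration, not Intel's loop detector design. `LoopDetector` and `loop_branch_mispredicts` are invented names.

```python
class LoopDetector:
    """Hypothetical loop predictor: learns how many times a loop branch runs."""

    def __init__(self):
        self.learned_trip = None  # iterations seen the last time the loop exited
        self.count = 0            # iterations seen in the current run

    def predict(self):
        # Predict "taken" (stay in the loop) until the learned trip count is reached.
        if self.learned_trip is None:
            return True
        return self.count + 1 < self.learned_trip

    def update(self, taken):
        self.count += 1
        if not taken:  # loop exit: remember the trip count
            self.learned_trip = self.count
            self.count = 0


def loop_branch_mispredicts(trip_count, runs):
    """Mispredictions of the loop-closing branch over `runs` executions of a loop."""
    detector = LoopDetector()
    misses = 0
    for _ in range(runs):
        for i in range(trip_count):
            taken = i < trip_count - 1  # the last iteration falls through
            if detector.predict() != taken:
                misses += 1
            detector.update(taken)
    return misses
```

Running a 10-iteration loop five times costs one misprediction (the first exit) in this model, against five for an always-taken guess.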

Agenda
Introduction
Notable features
Micro-architecture drill-down
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Advanced cache technology
Coding considerations

Core® Micro-architecture Execution Core
Accepts decoded u-ops, assigns resources, then executes and retires u-ops
• Renamer
• Reservation station (RS)
• Issue ports
• Execution units
[Figure: the register alias table and ALLOC feed the Reservation Station and Re-Order Buffer; the RS issues to the load, store address and store data ports and to three integer/FP/SIMD units, alongside the front end and the 1st- and 2nd-level caches.]

Intel® Core™ Microarchitecture – Execution Core: Building Blocks
• Renamer, RS and ROB feed the issue ports
• Ports 0, 1, 5: SIMD integer, SIMD/integer MUL, integer and floating-point execution units
• Port 2: load
• Ports 3, 4: store
• The load and store ports connect to the memory sub-system

Intel® Core™ Microarchitecture – Execution Core: Rename and Resources
• 4 uops renamed / retired per clock
  • One taken branch, any number of untaken branches
  • One FXCH per cycle
• Uops written to the RS and ROB
  • Decoded uops are renamed and allocated resources by the RAT and ALLOC, then sent to ROB read and the RS
  • The RS waits for sources to arrive, allowing out-of-order execution
  • Registers not “in flight” are read from the ROB during the RS write
[Figure: register alias table, ALLOC, Reservation Station, Re-Order Buffer]
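Register renaming is what lets the RS run independent work out of order. The Python sketch below is a generic register-alias-table illustration, not the Core RAT itself. Every write gets a fresh tag, and every read uses the latest tag for its register. As a result, a second write to EAX starts an independent chain instead of waiting on the first. The tag names `p0`, `p1`, … and the function `rename` are hypothetical. The 4-per-clock width comes from the slide, and the taken-branch and FXCH limits are ignored.

```python
def rename(uops, width=4):
    """Rename (dest, [sources]) micro-ops through a register alias table.

    Returns (renamed, clocks): renamed is a list of (physical_dest,
    [physical_sources]); clocks assumes `width` micro-ops renamed per clock.
    """
    rat = {}  # architectural register -> tag of its latest in-flight writer
    renamed = []
    for tag_number, (dest, sources) in enumerate(uops):
        # Sources are looked up before dest is remapped; a source with no
        # in-flight writer keeps its architectural name (committed state).
        physical_sources = [rat.get(src, src) for src in sources]
        tag = f"p{tag_number}"
        rat[dest] = tag
        renamed.append((tag, physical_sources))
    clocks = -(-len(uops) // width)  # ceiling division
    return renamed, clocks


# Two independent chains that both reuse EAX.
uops = [("eax", ["ebx"]), ("ecx", ["eax"]), ("eax", ["edx"]), ("esi", ["eax"])]
renamed, clocks = rename(uops)
```

After renaming, the chain p0 → p1 and the chain p2 → p3 share no tags, so both can execute in parallel.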

Intel® Core™ Microarchitecture – Execution Core: Issue Ports and Execution Units
• 6 dispatch ports from the RS
  • 3 execution ports (shared for integer / FP / SIMD)
  • load
  • store (address)
  • store (data)
• 128-bit SSE implementation
  • Port 0 has packed multiply (4 cycles SP, 5 DP, pipelined)
  • Port 1 has packed add (3 cycles, all precisions)
• FP data has one additional cycle of bypass latency
  • Do not mix SSE FP and SSE integer ops on the same register
Avoid:
  addps xmm0, xmm1
  pand  xmm0, xmm3
  addps xmm2, xmm0
Better (ANDPS keeps the masking in the FP domain and gives the same result):
  addps xmm0, xmm1
  andps xmm0, xmm3
  addps xmm2, xmm0
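The cost of mixing domains can be tallied by hand. The Python sketch below walks a dependency chain and adds one bypass cycle whenever a value crosses between the FP and integer domains, as the slide describes. It compares the slide's "Avoid" sequence with a variant that uses ANDPS. The 3-cycle ADDPS latency is from the slide. The 1-cycle latency for the logical ops, unlimited ports, and all inputs ready at cycle 0 are assumptions. `chain_latency` is a hypothetical helper.

```python
# Assumed latencies in cycles and execution domains. ADDPS = 3 comes from the
# slide; the 1-cycle logical ops are assumptions for illustration.
OPS = {"addps": (3, "fp"), "andps": (1, "fp"), "pand": (1, "int")}
BYPASS_PENALTY = 1  # extra cycle when a value crosses between domains


def chain_latency(sequence):
    """Cycle at which the last result is ready, for (op, dest, sources) tuples.

    Assumes unlimited execution ports and all registers ready at cycle 0.
    """
    ready = {}  # register -> (cycle available, producing domain)
    last = 0
    for op, dest, sources in sequence:
        latency, domain = OPS[op]
        start = 0
        for src in sources:
            if src in ready:
                available, src_domain = ready[src]
                if src_domain != domain:
                    available += BYPASS_PENALTY
                start = max(start, available)
        ready[dest] = (start + latency, domain)
        last = max(last, start + latency)
    return last


mixed = [("addps", "xmm0", ["xmm0", "xmm1"]),
         ("pand", "xmm0", ["xmm0", "xmm3"]),
         ("addps", "xmm2", ["xmm2", "xmm0"])]
fp_only = [("addps", "xmm0", ["xmm0", "xmm1"]),
           ("andps", "xmm0", ["xmm0", "xmm3"]),
           ("addps", "xmm2", ["xmm2", "xmm0"])]
print(chain_latency(mixed), chain_latency(fp_only))  # prints 9 7
```

Under these assumptions, the two domain crossings in the mixed sequence add two cycles to the critical path.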

Intel® Core™ Microarchitecture – Execution Core: The Out-of-Order Core
• Each uop takes only a single RS entry
• load + add dispatches twice (load, then add)
• mulps dispatches once, when load + add writes back
• sta + std dispatches twice
  • sta (address) can fire as early as possible
  • std must wait for mulps to write back
• cmpjne dispatches only once (functionality is truly fused)
  • No dependency, so it can fire as early as it wants
RS contents:
  load_add xmm0, [EAX+16]
  mulps    xmm0, xmm0
  sta_std  [EAX+240], xmm0
  cmpjne   EAX, #2000, TOP
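The dispatch rules above can be replayed as a small dependency calculation. In the Python sketch below, four fused RS entries expand into six dispatches, and each dispatch fires as soon as its inputs are ready. The load (3), packed add (3) and packed SP multiply (4) latencies come from these slides. One cycle for sta, std and cmpjne is an assumption. Port conflicts and any xmm0 dependency carried from a previous iteration are ignored. `LOOP_BODY` and `earliest_dispatch` are hypothetical names.

```python
# Each fused RS entry -> its dispatches as (name, latency, names it waits for).
# Latencies: load 3 (L1D), add 3 and mul 4 (packed SP) are taken from these
# slides; 1 cycle for sta, std and cmpjne is an assumption.
LOOP_BODY = {
    "load_add xmm0, [EAX+16]": [("load", 3, []), ("add", 3, ["load"])],
    "mulps xmm0, xmm0": [("mul", 4, ["add"])],
    "sta_std [EAX+240], xmm0": [("sta", 1, []), ("std", 1, ["mul"])],
    "cmpjne EAX, #2000, TOP": [("cmpjne", 1, [])],
}


def earliest_dispatch(body):
    """Earliest cycle each dispatch can fire, given only data dependencies."""
    ready = {}  # dispatch name -> cycle its result is available
    start = {}
    for dispatches in body.values():
        for name, latency, deps in dispatches:
            start[name] = max((ready[d] for d in deps), default=0)
            ready[name] = start[name] + latency
    return start


rs_entries = len(LOOP_BODY)                               # one entry per fused uop
dispatch_count = sum(len(d) for d in LOOP_BODY.values())  # total dispatches
schedule = earliest_dispatch(LOOP_BODY)
```

As on the slide, sta and cmpjne can fire at cycle 0, while std has to wait until cycle 10 for the multiply.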

Intel® Core™ Microarchitecture – Execution Core: Dispatching to OOO Execution
RS contents, four iterations of the loop body:
  load_add xmm0, [EAX+16]
  mulps    xmm0, xmm0
  sta_std  [EAX+240], xmm0
  cmpjne   EAX, [mem], label
(repeated with sta_std addresses [EAX+244], [EAX+248], [EAX+24C] and cmpjne operands [mem+4], [mem+8], [mem+C])
Dispatch ports:
• Port 0: GP (incl. FP mul)
• Port 1: GP (incl. FP add)
• Port 2: Load
• Port 3: STA
• Port 4: STD
• Port 5: GP (incl. jmp)
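With the port legend above, you can estimate a throughput lower bound by counting dispatches per port and fused uops per rename clock. The Python sketch below does this for the loop body with a compare against an immediate (cmpjne EAX, #2000, TOP). That assumption avoids an extra load for a memory operand. The six dispatches land on six different ports, and four fused uops fill one rename clock. This gives at best one iteration per clock. The bound ignores latency: if xmm0 is carried from one iteration to the next, the add/multiply chain would set the pace instead. All names are hypothetical.

```python
from collections import Counter

# Port used by each dispatch of one iteration, per the slide's legend:
# 0 = GP incl. FP mul, 1 = GP incl. FP add, 2 = load, 3 = STA, 4 = STD,
# 5 = GP incl. jmp. Assumes cmpjne compares against an immediate.
PORT_OF = {"load": 2, "add": 1, "mul": 0, "sta": 3, "std": 4, "cmpjne": 5}
ITERATION = ["load", "add", "mul", "sta", "std", "cmpjne"]
FUSED_UOPS_PER_ITERATION = 4  # load_add, mulps, sta_std, cmpjne
RENAME_WIDTH = 4              # fused uops renamed per clock (from the slides)


def port_pressure(iterations):
    """Number of dispatches each port must handle."""
    return Counter(PORT_OF[d] for d in ITERATION * iterations)


def cycles_lower_bound(iterations):
    """Throughput-only bound: busiest port vs. rename bandwidth."""
    busiest_port = max(port_pressure(iterations).values())
    rename_cycles = -(-FUSED_UOPS_PER_ITERATION * iterations // RENAME_WIDTH)
    return max(busiest_port, rename_cycles)
```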

Intel® Core™ Microarchitecture – Execution Core: Retirement Unit
Re-Order Buffer (ROB)
• Holds micro-ops in various stages of completion
• Buffers completed micro-ops
• Updates the architectural state in order
• Manages ordering of exceptions
[Figure: register alias table, ALLOC, Reservation Station, Re-Order Buffer]
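In-order retirement means a slow older micro-op holds back younger micro-ops that have already finished. The Python sketch below computes retire cycles from completion cycles, retiring in program order and at most 4 per clock (the rename/retire width from the earlier slide). The rule that a micro-op retires no earlier than the cycle after it completes is an assumption. `retire_schedule` is a hypothetical name.

```python
def retire_schedule(complete_cycle, width=4):
    """Retire cycle of each micro-op, given completion cycles in program order.

    Retires in order, at most `width` per cycle, and (an assumption) no
    earlier than the cycle after the micro-op completes.
    """
    retire = []
    cycle = 0
    retired_this_cycle = 0
    for done in complete_cycle:
        earliest = done + 1
        if earliest > cycle:           # wait for this micro-op to complete
            cycle = earliest
            retired_this_cycle = 0
        elif retired_this_cycle == width:
            cycle += 1                 # retirement bandwidth exhausted
            retired_this_cycle = 0
        retire.append(cycle)
        retired_this_cycle += 1
    return retire


# The oldest micro-op finishes at cycle 5; five younger ones finish at cycle 1.
print(retire_schedule([5, 1, 1, 1, 1, 1]))  # prints [6, 6, 6, 6, 7, 7]
```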

Agenda
Introduction
Notable features
Micro-architecture drill-down
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Advanced cache technology
Coding considerations

Core® Micro-architecture Memory Sub-system
• 32 KB D-cache (8-way, 64-byte line size)
• Loads & stores
  • One 128-bit load and one 128-bit store per cycle to different memory locations
  • Out-of-order memory operations
• Data prefetching
• Memory disambiguation
• Store forwarding
• Shared cache
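The cache parameters determine how an address is split. With 32 KB, 8 ways and 64-byte lines, there are 32768 / (8 × 64) = 64 sets, so 6 offset bits and 6 index bits. The Python sketch below computes the split, assuming the common power-of-two, physically indexed tag/index/offset layout. `cache_geometry` and `split_address` are hypothetical helper names.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a power-of-two cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, size_bytes=32 * 1024, ways=8, line_bytes=64):
    """Split an address into (tag, set index, byte offset)."""
    sets, offset_bits, index_bits = cache_geometry(size_bytes, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


print(split_address(0x12345))  # prints (18, 13, 5)
```

One consequence of this layout is that, in this L1D geometry, addresses exactly 4096 bytes apart map to the same set and compete for its 8 ways. The same helper gives 4096 sets for both the 2 MB 8-way and 4 MB 16-way shared L2 configurations.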

Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access
• L1D: 3-clock latency and 1-clock throughput; L2: 14 and 2
• Miss latencies
  • L1 miss that hits L2: ~10 cycles
  • L2 miss, access to memory: ~300 cycles (server / FB-DIMM)
  • L2 miss, access to memory: ~165 cycles (desktop / DDR2)
  • C-step Broadwater is reported to have ~50 ns latency
• Cache bandwidth
  • Bandwidth to cache: ~8.5 bytes/cycle
• Memory bandwidth
  • Desktop: ~6 GB/s per socket (Linux)
  • Server: ~3.5 GB/s per socket
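These latencies combine into an average memory access time (AMAT) once you know the miss rates. The Python sketch below uses the standard two-level AMAT formula with the slide's cycle counts. The L1 and L2 miss rates (5% and 20%) are purely hypothetical and are chosen only to show how strongly the memory penalty dominates. `amat` is a hypothetical helper.

```python
def amat(l1_hit, l1_miss_rate, l2_penalty, l2_miss_rate, mem_penalty):
    """Average memory access time in cycles for a two-level cache hierarchy.

    l2_penalty: extra cycles for an L1 miss that hits L2.
    mem_penalty: extra cycles for an L2 miss served from memory.
    """
    return l1_hit + l1_miss_rate * (l2_penalty + l2_miss_rate * mem_penalty)


# Slide latencies with hypothetical 5% L1 and 20% L2 miss rates.
desktop = amat(3, 0.05, 10, 0.2, 165)  # ~165-cycle DDR2 memory
server = amat(3, 0.05, 10, 0.2, 300)   # ~300-cycle FB-DIMM memory
```

With these assumed rates, the result is about 5.15 cycles on the desktop figures and 6.5 cycles on the server figures. Memory misses contribute most of the gap above the 3-cycle L1 hit.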

Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Enhanced Data Pre-fetch Logic
Speculates on the next needed data and loads it into the cache, by hardware and/or software
[Figure: valet-parking analogy: door, valet parking area (L1 cache), (L2 cache), main parking lot (external memory).]

Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Enhanced Data Pre-fetch Logic (cont.)
• L1D cache prefetching
  • Data Cache Unit (DCU) prefetcher
    • Known as the streaming prefetcher
    • Recognizes ascending access patterns in recently loaded data
    • Prefetches the next line into the processor's cache
  • Instruction-based stride prefetcher
    • Prefetches based upon a load having a regular stride
    • Can prefetch forward or backward, with strides up to 2 KB (1/2 the default page size)
• L2 cache prefetching: Data Prefetch Logic (DPL)
  • Prefetches data to the 2nd-level cache before the DCU requests the data
  • Maintains 2 tables for tracking loads
    • Upstream: 16 entries
    • Downstream: 4 entries
  • Every load is either found in the DPL or generates a new entry
  • Upon recognition of the 2nd load of a “stream”, the DPL will prefetch the next load
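The instruction-based stride prefetcher can be pictured as a small table keyed by the load's instruction address. The Python sketch below is a hypothetical model of that idea. It remembers each load's last address and stride, and it issues a prefetch once the same stride is seen twice in a row. The 2 KB limit, in either direction, comes from the slide. The "two equal strides" confirmation rule and the unbounded table are assumptions. `StridePrefetcher` is an invented name.

```python
MAX_STRIDE = 2048  # bytes, forward or backward (the slide's 2 KB limit)


class StridePrefetcher:
    """Hypothetical per-instruction (IP-indexed) stride detector."""

    def __init__(self):
        self.table = {}  # load IP -> (last address, last stride or None)

    def access(self, ip, addr):
        """Record one execution of load `ip`; return an address to prefetch or None."""
        prefetch = None
        if ip in self.table:
            last_addr, last_stride = self.table[ip]
            stride = addr - last_addr
            # Prefetch only once the same non-zero stride repeats within range.
            if stride != 0 and stride == last_stride and abs(stride) <= MAX_STRIDE:
                prefetch = addr + stride
            self.table[ip] = (addr, stride)
        else:
            self.table[ip] = (addr, None)
        return prefetch


forward = StridePrefetcher()
hints = [forward.access(0x400, a) for a in (1000, 1256, 1512, 1768)]
```

In this model, the third and fourth accesses produce prefetches for 1768 and 2024. A 4 KB stride is never prefetched because it exceeds the 2 KB limit.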

Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Memory Disambiguation
• Predictor
  • Loads that are predicted NOT to forward from a preceding store are allowed to schedule as early as possible
  • This increases the performance of OOO memory pipelines
• Disambiguated loads are checked at retirement
  • Extension to the existing coherency mechanism
  • Invisible to software and system

Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Memory Disambiguation Absent
Load 4 must WAIT until previous stores complete
[Figure: program order Store 1 (Y), Load 2 (Y), Store 3 (W), Load 4 (X); memory holds data W, X, Y and Z.]

Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Memory Disambiguation Present
Loads can decouple from stores: Load 4 can get its data WITHOUT waiting for the stores
[Figure: Load 4 (X) moves ahead of Store 1 (Y), Load 2 (Y) and Store 3 (W); memory holds data W, X, Y and Z.]
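The Python sketch below models the predict, hoist and check-at-retirement flow with a single "unsafe" bit per load instruction. A load predicted safe issues ahead of older unresolved stores. If the retirement check finds that one of those stores wrote the same address, the model reports a flush and marks the load so it waits next time. This is a deliberately simplified, hypothetical model. The slides do not describe the real predictor's state or how it is reset. `DisambiguationPredictor` and `run_load` are invented names.

```python
class DisambiguationPredictor:
    """Hypothetical model: one 'may alias' bit per load instruction."""

    def __init__(self):
        self.unsafe = set()  # load IPs that previously collided with a store

    def may_hoist(self, load_ip):
        return load_ip not in self.unsafe

    def train(self, load_ip):
        self.unsafe.add(load_ip)


def run_load(predictor, load_ip, load_addr, older_store_addrs):
    """Outcome of one dynamic load: 'early', 'waited' or 'flush'.

    older_store_addrs: addresses written by older stores that had not yet
    executed when the load was ready to go.
    """
    if not predictor.may_hoist(load_ip):
        return "waited"               # predicted to alias: wait for the stores
    if load_addr in older_store_addrs:
        predictor.train(load_ip)      # retirement check caught a collision
        return "flush"
    return "early"                    # hoisted safely past the stores


p = DisambiguationPredictor()
load4 = run_load(p, 4, "X", {"Y", "W"})  # Load 4 (X) vs. Store 1 (Y), Store 3 (W)
```

With the slide's example, Load 4 never collides and always issues early. A load of Y that runs ahead of Store 1 (Y) causes one flush and then waits on later executions.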

Intel® Core™ Microarchitecture – Memory Sub-system: Advanced Memory Access / Store Forwarding
If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load
[Figure: Store 1 (Y) followed by Load 2 (Y); data Y passes through internal buffers instead of memory.]

Advanced Memory Access / Store Forwarding: Aligned Store Cases
[Figure: aligned 16-, 32-, 64- and 128-bit stores, each shown with the smaller aligned loads (64-, 32-, 16- and 8-bit) that read data within the store.]

Advanced Memory Access / Store Forwarding: Unaligned Cases
Note that unaligned store forwarding does not occur when the load crosses a cache line boundary.
[Figure: 16-, 32- and 64-bit stores with unaligned 8-, 16-, 32- and 64-bit loads, each marked “store forwarded to load” or “no forwarding”.]
‡: No forwarding if the load crosses a cache line boundary
Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding.
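Two of the rules on these slides can be written down directly. First, a load that crosses a 64-byte cache line is not forwarded. Second, an unaligned 128-bit store behaves as two 64-bit stores. The Python sketch below adds the basic containment requirement: the load's bytes must lie within a single store piece. The result is a set of necessary conditions only. The figures show further unaligned cases that do not forward even when the load is contained, and this model does not capture them. `can_forward` and its helpers are hypothetical, and sizes are in bytes.

```python
LINE_BYTES = 64


def crosses_line(addr, size):
    """True if bytes [addr, addr + size) span two cache lines."""
    return addr // LINE_BYTES != (addr + size - 1) // LINE_BYTES


def store_pieces(addr, size):
    """Per the slide, an unaligned 128-bit store is issued as two 64-bit stores."""
    if size == 16 and addr % 16 != 0:
        return [(addr, 8), (addr + 8, 8)]
    return [(addr, size)]


def can_forward(store_addr, store_size, load_addr, load_size):
    """Simplified necessary conditions for forwarding a store to a load."""
    if crosses_line(load_addr, load_size):
        return False
    for piece_addr, piece_size in store_pieces(store_addr, store_size):
        if piece_addr <= load_addr and load_addr + load_size <= piece_addr + piece_size:
            return True
    return False
```

For example, an unaligned 128-bit store at address 4 can supply an 8-byte load at address 4 (its first half) but not one at address 8, which would need bytes from both halves.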

Agenda
Introduction
Notable features
Micro-architecture drill-down
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Advanced cache technology
Coding considerations

Advanced Smart Cache® Technology: Advantages of Shared Cache
[Figure: with separate L2 caches, a cache line needed by both CPU 1 and CPU 2 must be shipped over the Front Side Bus (FSB), costing roughly half an access to memory.]

Advanced Smart Cache® Technology: Advantages of Shared Cache (cont.)
[Figure: the L2 is shared by CPU 1 and CPU 2, so there is no need to ship the cache line over the Front Side Bus (FSB).]

Advanced Smart Cache® Technology (cont.)
Load & store access order:
1. L1 cache of the immediate core
2. L1 cache of the other core
3. L2 cache
4. Memory
[Figure: Core 1 and Core 2 sharing a 2 MB L2 cache attached to the bus.]

Advanced Smart Cache® Technology (cont.)
• Shared second-level (L2) cache: 2 MB 8-way or 4 MB 16-way, instruction and data
• Cache-to-cache transfer
  • Improves producer / consumer style MP
• Wider interface to L2
  • Reduced interference
  • Processor line fill is 2 cycles
• Higher bandwidth from the L2 cache to the core
  • ~14-clock latency and 2-clock throughput

Agenda
Introduction
Notable features
Micro-architecture drill-down
Advanced cache technology
Coding considerations

Optimizing for Instruction Fetch and PreDecode
• Avoid “Length Changing Prefixes” (LCPs)
  • Affects instructions with immediate data or offset
  • Operand Size Override (66H)
  • Address Size Override (67H) [obsolete]
• LCPs change the length-decoding algorithm, increasing the processing time from one cycle to six cycles (or eleven cycles when the instruction spans a 16-byte boundary)
• The REX (EM64T) prefix (4xH) is not an LCP
  • The REX prefix does lengthen the instruction by one byte, so use of the first eight general registers in EM64T is preferred
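An LCP arises when 66H shrinks an instruction's immediate from 4 bytes to 2. The two encodings below illustrate this. `add dx, 1234h` is 66 81 C2 34 12, with an LCP. `add edx, 1234h` is 81 C2 34 12 00 00, which has no LCP and is one byte longer. The 32-bit form is only a drop-in replacement when the upper 16 bits of EDX do not matter. The Python function applies the slide's cycle figures to one instruction at a given offset. It is a rough timing sketch, not a predecoder model, and `predecode_cycles` is a hypothetical name.

```python
FETCH_BLOCK = 16  # bytes per predecode window

# add dx, 1234h: the 66H prefix shrinks the immediate from 4 bytes to 2 (an LCP).
ADD_DX_IMM16 = bytes([0x66, 0x81, 0xC2, 0x34, 0x12])
# add edx, 1234h: no prefix and no LCP, but one byte longer.
ADD_EDX_IMM32 = bytes([0x81, 0xC2, 0x34, 0x12, 0x00, 0x00])


def predecode_cycles(offset, length, has_lcp):
    """Length-decode cost per the slide: 1 cycle normally, 6 with an LCP,
    11 with an LCP when the instruction spans a 16-byte boundary."""
    if not has_lcp:
        return 1
    spans = offset // FETCH_BLOCK != (offset + length - 1) // FETCH_BLOCK
    return 11 if spans else 6
```

An alternative that keeps 16-bit semantics is an 8-bit sign-extended immediate (opcode 83). In that form, the 66H prefix no longer changes the immediate size.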

Optimizing for Instruction Queue
• Includes a “Loop Stream Detector” (LSD)
  • Potentially very high-bandwidth instruction streaming
  • A number of requirements to make use of the LSD:
    • Maximum of 18 instructions in up to four 16-byte packets
    • No RET instructions (hence, little practical use for CALLs)
    • Up to four taken branches allowed
    • Most effective at 70+ iterations
• The LSD is after PreDecode, so there is no added cost for LCPs
• Trade off the LSD against conventional loop unrolling
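The LSD requirements can be checked mechanically for a given loop body. The Python sketch below applies the slide's static limits: at most 18 instructions, no RET, at most four taken branches, and a body that fits in four 16-byte packets when laid out from its start address. It does not model the iteration-count guideline or any other condition the hardware may impose. `lsd_eligible` and its tuple format are hypothetical.

```python
def lsd_eligible(instructions, start_addr):
    """Apply the slide's LSD limits to one loop body.

    instructions: list of (mnemonic, length_in_bytes, is_taken_branch),
    laid out contiguously from start_addr.
    """
    if len(instructions) > 18:
        return False
    if any(mnemonic.lower() == "ret" for mnemonic, _, _ in instructions):
        return False
    if sum(1 for _, _, taken in instructions if taken) > 4:
        return False
    size = sum(length for _, length, _ in instructions)
    first_packet = start_addr // 16
    last_packet = (start_addr + size - 1) // 16
    return last_packet - first_packet + 1 <= 4


body = [("movaps", 4, False), ("mulps", 3, False), ("movaps", 4, False),
        ("add", 3, False), ("cmp", 6, False), ("jne", 2, True)]
```

Note that alignment matters in this model. A 60-byte body fits in four packets when it starts on a 16-byte boundary, but needs five when it starts at offset 8.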

Optimizing for Decoder
• Issues up to 4 uOps for renaming/allocation per clock
  • This creates a trade-off between complex multi-uOp instructions and multiple simple single-uOp instructions
  • For example, a single four-uOp instruction is all that can be renamed/allocated in a single clock
  • In some cases, multiple simple instructions may be a better choice than a single complex instruction
• Single-uOp instructions allow more decoder flexibility
  • For example, 4-1-1-1 can be decoded in one clock
  • However, 2-2-2-1 takes three clocks to decode
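The 4-1-1-1 versus 2-2-2-1 example follows from a simple decoder model. Decoder 0 accepts any instruction of up to four uOps, and the three remaining decoders accept only single-uOp instructions that immediately follow, in program order. The Python sketch below implements that model. Instructions longer than four uOps (handled by the microcode sequencer) and fetch or predecode limits are out of scope. `decode_cycles` is a hypothetical name.

```python
def decode_cycles(uop_counts):
    """Clocks to decode instructions with the given uOp counts, in order.

    Simplified 4-1-1-1 model: each clock, decoder 0 takes the next instruction
    (assumed to be at most 4 uOps); decoders 1-3 take following instructions
    only while those are single-uOp.
    """
    cycles, i = 0, 0
    while i < len(uop_counts):
        cycles += 1
        i += 1  # decoder 0 handles any instruction
        simple_slots = 3
        while simple_slots and i < len(uop_counts) and uop_counts[i] == 1:
            i += 1
            simple_slots -= 1
    return cycles


print(decode_cycles([4, 1, 1, 1]), decode_cycles([2, 2, 2, 1]))  # prints 1 3
```

The model also shows why order matters. Placing a multi-uOp instruction first in a group of single-uOp instructions lets the whole group decode in one clock.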

Optimizing for Execution
• Up to six uOps can be dispatched per clock
  • The “Store Data” and “Store Address” dispatch ports are combined on the block diagram
• Up to four results can be written back per clock
• Single-clock-latency operations are best
  • Differing-latency operations can create writeback conflicts
  • Separate multiple-clock uOps with several single-uOp instructions
  • Typical instructions here: ADC/SBB, RMW, CMOVcc
  • In some cases, separating an RMW instruction into its component load, operate and store instructions might be faster (decode and scheduling flexibility)
• When equivalent, PS forms are preferred to PD forms (the PD encodings carry a 66H prefix and are one byte longer)
  • For example, MOVAPS over MOVAPD, XORPS over XORPD

Optimizing for Execution (cont.)
• Bypass register “access” is preferred to register reads
• Partial register accesses often lead to stalls
  • A register access whose size ‘conflicts’ with a recent previous write to that register
  • Partial XMM updates are subject to dependency delays
  • Partial flag stalls can occur too, at a much higher cost
    • Use a TEST instruction between a shift and the conditional to prevent them
  • Common zeroing instructions (e.g., XOR reg, reg) don't stall
• Avoid bypass between execution domains
  • For example: FP (ADDPS) and logical ops (PAND) on XMMn
• Vectorization: use a careful packing/unpacking sequence
  • Use MXCSR's FZ and DAZ controls as appropriate

Optimizing for Memory
• Software prefetch instructions
  • Can reach beyond a page boundary (including a page walk)
  • Prefetch only when they complete without an exception
• General techniques to help these prefetchers
  • Organize data in consecutive lines
  • In general, increasing addresses are more easily prefetched
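"Organize data in consecutive lines" can be quantified by counting the distinct cache lines a loop touches. The Python sketch below compares reading one 4-byte field from 1024 records. In the first layout, the field sits inside 64-byte records (array of structures). In the second, the field is stored as its own contiguous array (structure of arrays). The record size and field offset are hypothetical, and the helper names are invented for illustration.

```python
LINE_BYTES = 64


def lines_touched(addresses):
    """Number of distinct cache lines covered by a list of byte addresses."""
    return len({addr // LINE_BYTES for addr in addresses})


def aos_field_addresses(count, record_bytes, field_offset):
    """Addresses of one field inside an array of structures."""
    return [i * record_bytes + field_offset for i in range(count)]


def soa_field_addresses(count, field_bytes):
    """Addresses of the same field stored as its own contiguous array."""
    return [i * field_bytes for i in range(count)]


aos_lines = lines_touched(aos_field_addresses(1024, 64, 8))
soa_lines = lines_touched(soa_field_addresses(1024, 4))
```

Here the structure-of-arrays layout touches 64 lines instead of 1024, and it walks them in increasing address order, which is the pattern the prefetchers handle best.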

Summary
What has been covered:
• Notable features of Core® Micro-architecture
  • Wide Dynamic Execution
  • Advanced Memory Access
  • Advanced Smart Cache
  • Advanced Digital Media Boost
  • Power Efficient Support
• Core® Micro-architecture components
  • Front End
  • OOO execution core
  • Memory sub-system
  • Advanced cache technology

Reasons Why Software Prefetches Can Negatively Impact Performance on Core 2 Duo Architecture
• Software prefetches are rarely ignored on the Merom architecture
  • On P4, if you had a DTLB miss, the prefetch could be ignored
  • On Merom, they are not ignored, and the prefetch can hurt performance since it cannot retire until after the page walk
• The critical chunk is not utilized on a software prefetch
  • A prefetch can hurt performance if it is too close to the load
  • When the data comes in due to a prefetch, the actual load needs the entire cache line, instead of just the critical chunk, before it can use the data
• Software prefetches can trigger hardware prefetching mechanisms
  • They trigger the hardware prefetchers just like a regular load
  • False patterns can be found if you prefetch the wrong data
• Software prefetches can saturate the bus
• Not guaranteed to behave the same across architectures
  • Performance can vary between architectures
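A common rule of thumb for avoiding "too close" prefetches is to issue each prefetch far enough ahead to cover the memory latency. The distance in iterations is the latency divided by the cycles spent per loop iteration, rounded up. The Python sketch below applies this to the earlier ~165-cycle desktop memory latency. The 10 cycles and 16 bytes per iteration are hypothetical, and this heuristic is a general technique, not an Intel-specified formula. Both helper names are invented.

```python
import math


def prefetch_distance(memory_latency_cycles, cycles_per_iteration):
    """Iterations ahead to issue a prefetch so the line can arrive in time."""
    return math.ceil(memory_latency_cycles / cycles_per_iteration)


def prefetch_offset_bytes(memory_latency_cycles, cycles_per_iteration,
                          bytes_per_iteration, line_bytes=64):
    """Byte offset to prefetch ahead, rounded up to whole cache lines."""
    distance = prefetch_distance(memory_latency_cycles, cycles_per_iteration)
    lines = math.ceil(distance * bytes_per_iteration / line_bytes)
    return lines * line_bytes


# ~165-cycle memory latency (desktop figure), hypothetical 10-cycle,
# 16-byte loop iterations.
distance = prefetch_distance(165, 10)
offset = prefetch_offset_bytes(165, 10, 16)
```

Under these assumptions, the loop would prefetch 17 iterations (320 bytes, five lines) ahead. A much larger offset risks evicting data before use, and the bus-saturation caveat above still applies.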

Advanced Digital Media Boost
Let's scale a vector: B[i] := A[i] * C
(Diagram: A = existing processor; B = Intel® Core™ uarch with Advanced Digital Media Boost)
2x compute throughput / clock
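
Here is a sketch of the B[i] := A[i] * C loop written with SSE intrinsics. Each iteration does one 128-bit load, one 128-bit multiply of four single-precision elements, and one 128-bit store. Core executes each 128-bit SSE operation in a single pass instead of two 64-bit halves, and that is what the 2x figure on the slide reflects. The function name `scale_vector` is hypothetical. A scalar loop handles the tail, and it also serves as the fallback when SSE is unavailable.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* B[i] := A[i] * C */
void scale_vector(float *b, const float *a, float c, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    __m128 vc = _mm_set1_ps(c);                  /* broadcast C into all 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);         /* one 128-bit load (4 floats) */
        _mm_storeu_ps(&b[i], _mm_mul_ps(va, vc)); /* one 128-bit multiply, one store */
    }
#endif
    for (; i < n; i++)                           /* remaining elements, or non-SSE fallback */
        b[i] = a[i] * c;
}
```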

Advanced Digital Media Boost
Assume both microarchitectures have a 128-bit path from L1 to the processor
(Diagram: A = existing processor; B = Intel® Core™ uarch with Advanced Digital Media Boost)
2x compute throughput / clock

Advanced Digital Media Boost
... handles all the memory data
• A (existing processor): multiply can't keep up with load bandwidth
• B (Intel® Core™ uarch with Advanced Digital Media Boost): multiplier operates on all data
2x compute throughput / clock

Advanced Digital Media Boost
Existing implementations eventually stall the load pipe waiting for the multiplier
• A (existing processor): load eventually stalls waiting for the multiplier
• B (Intel® Core™ uarch with Advanced Digital Media Boost): load pipe is free to advance
2x compute throughput / clock

Advanced Digital Media Boost
... keeps the pipeline free for computations

Advanced Digital Media Boost
... maintains 2x throughput compared to prior implementations

Advanced Digital Media Boost
8 single-precision flops/cycle

Advanced Digital Media Boost
4 double-precision flops/cycle
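
The 8 and 4 flops/cycle figures follow from the 128-bit operand width. This assumes that one 128-bit SSE add and one 128-bit SSE multiply can issue every cycle, on separate execution units, as the block diagram's separate FP add and FP div/mul units suggest.

```latex
\begin{aligned}
\text{SP lanes} &= \frac{128\ \text{bits}}{32\ \text{bits}} = 4, &
\text{SP peak} &= \underbrace{4}_{\text{add}} + \underbrace{4}_{\text{mul}} = 8\ \text{flops/cycle},\\
\text{DP lanes} &= \frac{128\ \text{bits}}{64\ \text{bits}} = 2, &
\text{DP peak} &= \underbrace{2}_{\text{add}} + \underbrace{2}_{\text{mul}} = 4\ \text{flops/cycle}.
\end{aligned}
```

The vector-scale loop B[i] := A[i] * C uses only the multiplier, so by itself it can reach at most half of this peak.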

Advanced Digital Media Boost
Leading compute density: 2x compute throughput / clock

Intel® Core® Micro-architecture Notable Features (cont.)
New Instructions (instruction name: description)
• psignb/w/d mm, mm/m64; psignb/w/d xmm, xmm/m128: Per element, if the source operand is negative, multiply the destination operand by -1. If the source element is zero, the destination element is set to zero.
• pabsb/w/d mm, mm/m64; pabsb/w/d xmm, xmm/m128: Per element, overwrite the destination with the absolute value of the source.
• phaddw/d/sw mm, mm/m64; phaddw/d/sw xmm, xmm/m128: Pairwise integer horizontal addition + pack (the "sw" form saturates).
• phsubw/d/sw mm, mm/m64; phsubw/d/sw xmm, xmm/m128: Pairwise integer horizontal subtraction + pack (the "sw" form saturates).
• PMADDUBSW mm, mm/m64; PMADDUBSW xmm, xmm/m128: Multiply unsigned bytes (destination) by signed bytes (source) and add adjacent products into signed words with saturation (multiply-accumulate).
• PMULHRSW mm, mm/m64; PMULHRSW xmm, xmm/m128: Signed 16-bit multiply; returns the rounded, scaled high bits of the product.
• PSHUFB mm, mm/m64; PSHUFB xmm, xmm/m128: A complete byte-granularity permutation, including a force-to-zero flag.
• PALIGNR mm, mm/m64, imm8; PALIGNR xmm, xmm/m128, imm8: Extract any contiguous 16 bytes (8 in the 64-bit case) from the pair [dst, src] and store them to the dst register.
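
Below are scalar reference models of several of these (SSSE3) instructions, for the 128-bit xmm forms. They operate on plain arrays, so their semantics can be checked on any machine. The `ref_*` function names are hypothetical. In real code, use the intrinsics from `<tmmintrin.h>`, such as `_mm_sign_epi16`, `_mm_hadd_epi16`, `_mm_mulhrs_epi16`, `_mm_shuffle_epi8` and `_mm_alignr_epi8`. The models assume two's-complement wraparound on narrowing casts and arithmetic right shift of negative values, which mainstream compilers provide.

```c
#include <stdint.h>
#include <string.h>

/* PSIGNW: negate, zero, or keep each dst word by the sign of the src word. */
void ref_psignw(int16_t dst[8], const int16_t src[8])
{
    for (int i = 0; i < 8; i++) {
        if (src[i] < 0)       dst[i] = (int16_t)-dst[i];
        else if (src[i] == 0) dst[i] = 0;
    }
}

/* PHADDW: sums of adjacent dst pairs fill the low half, src pairs the high half. */
void ref_phaddw(int16_t dst[8], const int16_t src[8])
{
    int16_t r[8];
    for (int i = 0; i < 4; i++) {
        r[i]     = (int16_t)(dst[2 * i] + dst[2 * i + 1]);
        r[i + 4] = (int16_t)(src[2 * i] + src[2 * i + 1]);
    }
    memcpy(dst, r, sizeof r);
}

/* PMULHRSW: Q15 fixed-point multiply with rounding: ((a*b >> 14) + 1) >> 1. */
void ref_pmulhrsw(int16_t dst[8], const int16_t src[8])
{
    for (int i = 0; i < 8; i++) {
        int32_t t = ((int32_t)dst[i] * src[i]) >> 14;
        dst[i] = (int16_t)((t + 1) >> 1);
    }
}

/* PSHUFB: each control byte selects a dst byte (low 4 bits); bit 7 forces zero. */
void ref_pshufb(uint8_t dst[16], const uint8_t ctl[16])
{
    uint8_t r[16];
    for (int i = 0; i < 16; i++)
        r[i] = (ctl[i] & 0x80) ? 0 : dst[ctl[i] & 0x0F];
    memcpy(dst, r, sizeof r);
}

/* PALIGNR: concatenate dst (high) : src (low), shift right by imm8 bytes,
   keep the low 16 bytes; bytes shifted in from beyond the pair are zero. */
void ref_palignr(uint8_t dst[16], const uint8_t src[16], unsigned imm8)
{
    uint8_t cat[32], r[16];
    memcpy(cat, src, 16);
    memcpy(cat + 16, dst, 16);
    for (unsigned i = 0; i < 16; i++)
        r[i] = (imm8 + i < 32) ? cat[imm8 + i] : 0;
    memcpy(dst, r, sizeof r);
}
```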

Using New Instructions with Intel Compiler
Architecture-tuning compiler switch: -QxT
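
Here is a minimal sketch of an ordinary loop that an auto-vectorizer can map to one of the new instructions (PABSW, eight 16-bit elements per 128-bit operation) once the architecture switch is given. Whether the compiler actually emits PABSW depends on its version and options, so confirm with the compiler's vectorization report or the generated assembly. With GCC or Clang, `-mssse3` enables the same instruction set. The function name `abs_i16` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* dst[i] = |src[i]|. No loop-carried dependences, and restrict promises the
   arrays don't overlap, so the loop is a candidate for vectorization. */
void abs_i16(int16_t *restrict dst, const int16_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int v = src[i];
        dst[i] = (int16_t)(v < 0 ? -v : v);  /* -32768 wraps to -32768, matching PABSW's 0x8000 */
    }
}
```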

Power Status Indicator (Mobile)
Processor communicates power consumption to external platform components
• Optimization of voltage regulator efficiency
• Load line and power delivery efficiency
(Diagram: PSI-2 / VID signals from the processor to the voltage regulator (VR))

Enabling Efficient Processor and Platform Thermal Control: DTS (Digital Thermal Sensor)
• Several thermal sensors are located within the processor to cover all possible hot spots
• Dedicated logic scans the thermal sensors and measures the maximum temperature on the die at any given time
• Accurately reporting processor temperature enables advanced thermal control schemes
(Diagram: Core 1 and Core 2, each with DTS logic and an LPF, feeding DTS control and status)
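
Software sees the DTS reading through the IA32_THERM_STATUS model-specific register (MSR 0x19C). Bits 22:16 hold the digital readout, which is expressed in degrees Celsius *below* TjMax, and bit 31 flags whether the reading is valid. On Linux the raw value can be read as root from `/dev/cpu/N/msr` at offset 0x19C. Here is a sketch of only the decoding step. `dts_decode` is a hypothetical helper, and TjMax must be supplied by the caller because it varies by part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode a raw IA32_THERM_STATUS value into a temperature in degrees C.
   Returns false if the Reading Valid bit (31) is clear. */
bool dts_decode(uint64_t therm_status, int tjmax_c, int *temp_c)
{
    if (!((therm_status >> 31) & 1))
        return false;
    int below_tjmax = (int)((therm_status >> 16) & 0x7F);  /* bits 22:16 */
    *temp_c = tjmax_c - below_tjmax;
    return true;
}
```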

Platform Environment Control Interface (PECI)
Processor provides its temperature reading over a multi-drop, single-wire bus, allowing efficient platform thermal control
(Diagram: PROC #1, PROC #2 and PROC #3 on the PECI bus to a fan manager driving the processor fan, auxiliary fan, chassis fan 1 and chassis fan 2)

Intel® Core™ Microarchitecture Other Features: Platform Power Management (cont.)
Front side bus with the following low-power improvements:
• Lower voltage
• DPWR# and BPRI# signals
  • Data and address bus input sense amplifiers and control signals (~120 pins) are enabled only when there is FSB traffic
• Eliminated higher address and dual-processor-capable pins

Intel® Core™ Microarchitecture Other Features: Enhanced SpeedStep® Technology
• Voltage-frequency switching separation
• Clock partitioning and recovery
• Event blocking
Even during periods of high-performance execution, many parts of the chip core can be shut off

Intel® Core® Micro-architecture Blocks
(Block diagram. Fetch / Decode: Next IP, Branch Target Buffer, 32 KB Instruction Cache, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT). Execute: Reservation Stations (RS), 32 entry; Scheduler / Dispatch Ports; execution units for integer arithmetic, integer shift/rotate, SIMD, FP add, FP div/mul, load, store address and store data; Memory Order Buffer (MOB); 32 KB Data Cache; Bus Unit, to L2 cache. Retire: Re-Order Buffer (ROB), 96 entry; IA Register Set.)
Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity. Some resources (Bus Unit, L2 Cache, etc.) are shared between cores.

Intel® Core® Micro-architecture Notable Features
Enhanced Pipeline
In order:
• Instruction fetch
• Instruction decode
• Micro-op rename
• Micro-op allocate
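
Here is a toy model of the micro-op rename step, in the P6 style where an in-flight result lives in a ROB entry. The Register Allocation Table records, for each architectural register, whether its newest value is still in flight and, if so, which ROB entry will produce it. Sources read the current mapping, and the destination is redirected to the renaming micro-op's own entry. This removes false (write-after-read and write-after-write) dependences. The names, the 8-register size and the retire hook are illustrative assumptions, not Core's actual implementation.

```c
#include <stdbool.h>

#define NUM_ARCH_REGS 8   /* toy size, e.g. EAX..EDI */

typedef struct {
    bool in_rob;   /* true: value will come from an in-flight micro-op */
    int  rob_idx;  /* producing ROB entry (valid only if in_rob) */
} rat_entry_t;

typedef struct { rat_entry_t e[NUM_ARCH_REGS]; } rat_t;

void rat_init(rat_t *rat)
{
    for (int a = 0; a < NUM_ARCH_REGS; a++)
        rat->e[a] = (rat_entry_t){ false, -1 };  /* all values in the IA register set */
}

/* Rename "dst := op(src1, src2)" for the micro-op allocated to ROB entry rob_idx.
   Sources are read before dst is remapped, so "r0 := r0 op r1" sees the old r0. */
void rat_rename(rat_t *rat, int rob_idx, int dst, int src1, int src2,
                rat_entry_t *tag1, rat_entry_t *tag2)
{
    *tag1 = rat->e[src1];
    *tag2 = rat->e[src2];
    rat->e[dst] = (rat_entry_t){ true, rob_idx };
}

/* On retirement, drop the mapping only if no younger micro-op renamed dst since. */
void rat_retire(rat_t *rat, int rob_idx, int dst)
{
    if (rat->e[dst].in_rob && rat->e[dst].rob_idx == rob_idx)
        rat->e[dst] = (rat_entry_t){ false, -1 };
}
```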

Intel® Core® Micro-architecture Notable Features
Enhanced Pipeline (cont.)
Out of order:
• Micro-op schedule
• Micro-op execute
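
Here is a toy model of one out-of-order scheduling cycle. Among the 32 reservation-station entries, each dispatch port picks a micro-op whose source operands are ready, regardless of program order, and sends it to execute. The oldest-first tie-break and the one-port-per-micro-op simplification are assumptions for illustration. Core 2 has six dispatch ports, and real micro-ops may be eligible for more than one of them.

```c
#include <stdbool.h>

#define RS_SIZE   32   /* reservation-station entries, per the block diagram */
#define NUM_PORTS 6    /* dispatch ports */

typedef struct {
    bool     valid;
    bool     src_ready[2];
    int      port;   /* the single port this toy micro-op can use */
    unsigned age;    /* allocation order: smaller = older */
} rs_entry_t;

/* For each port, dispatch the oldest ready micro-op (chosen[port] = its RS
   index, or -1 if none is ready) and free its entry. Returns the count. */
int rs_dispatch(rs_entry_t rs[RS_SIZE], int chosen[NUM_PORTS])
{
    int n = 0;
    for (int p = 0; p < NUM_PORTS; p++) {
        chosen[p] = -1;
        for (int i = 0; i < RS_SIZE; i++) {
            const rs_entry_t *e = &rs[i];
            if (!e->valid || e->port != p || !e->src_ready[0] || !e->src_ready[1])
                continue;                       /* not eligible this cycle */
            if (chosen[p] < 0 || e->age < rs[chosen[p]].age)
                chosen[p] = i;                  /* prefer the oldest ready one */
        }
        if (chosen[p] >= 0) {
            rs[chosen[p]].valid = false;        /* dispatched: leaves the RS */
            n++;
        }
    }
    return n;
}
```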

Intel® Core® Micro-architecture Notable Features
Enhanced Pipeline (cont.)
Out of order:
• Memory pipelines
Memory order unit maintains architectural ordering requirements
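
Here is a toy model of one ordering check the memory order unit must make when loads run ahead of older stores. Once an older store's address becomes known, any younger load that has already executed to the same address has read stale data. That load and everything after it must be re-executed. The exact-address comparison is a simplification, since real hardware compares overlapping byte ranges. The struct, the function name and the "return the oldest violating load" policy are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int       age;       /* program order: smaller = older */
    bool      executed;  /* load has already read memory */
    uintptr_t addr;
} load_entry_t;

/* Called when the store with age store_age resolves to store_addr.
   Returns the age of the oldest younger load that already read that
   address (execution restarts there), or -1 if no ordering was violated. */
int mob_check_store(const load_entry_t *loads, int n, int store_age, uintptr_t store_addr)
{
    int victim = -1;
    for (int i = 0; i < n; i++) {
        const load_entry_t *l = &loads[i];
        bool younger = l->age > store_age;
        if (l->executed && younger && l->addr == store_addr &&
            (victim < 0 || l->age < victim))
            victim = l->age;
    }
    return victim;
}
```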

Intel® Core® Micro-architecture Notable Features
Enhanced Pipeline (cont.)
In order:
• Micro-op retirement
• Fault handling
Retirement Unit maintains the illusion of in-order instruction retirement
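
Here is a toy model of how a re-order buffer preserves the illusion of in-order retirement. Micro-ops are allocated at the tail in program order and may finish in any order, but they retire only from the head. A faulting micro-op is reported instead of retired, so the fault is handled precisely after every older micro-op has retired. The 96-entry size comes from the block diagram. The retire width of 4 per cycle and all names are assumptions for illustration.

```c
#include <stdbool.h>

#define ROB_SIZE     96  /* per the block diagram */
#define RETIRE_WIDTH 4   /* assumed micro-ops retired per cycle */

typedef struct {
    bool done[ROB_SIZE];   /* finished executing (possibly out of order) */
    bool fault[ROB_SIZE];  /* raised an exception */
    int  head, count;      /* oldest entry; number of entries in flight */
} rob_t;

/* Allocate the next entry in program order; -1 if the ROB is full. */
int rob_alloc(rob_t *r)
{
    if (r->count == ROB_SIZE)
        return -1;
    int idx = (r->head + r->count) % ROB_SIZE;
    r->done[idx] = false;
    r->fault[idx] = false;
    r->count++;
    return idx;
}

/* Retire up to RETIRE_WIDTH completed micro-ops from the head, in order.
   Stops at the first unfinished entry. A faulting head entry is not retired;
   its index is returned in *faulted (otherwise -1). Returns the number retired. */
int rob_retire(rob_t *r, int *faulted)
{
    int n = 0;
    *faulted = -1;
    while (n < RETIRE_WIDTH && r->count > 0 && r->done[r->head]) {
        if (r->fault[r->head]) {
            *faulted = r->head;
            break;
        }
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        n++;
    }
    return n;
}
```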