ELEC 669 Low Power Design Techniques Lecture 1

Amirali Baniasadi (amirali@ece.uvic.ca)

ELEC 669: Low Power Design Techniques
Instructor: Amirali Baniasadi
Office: EOW 441, by appointment only. Call or email with your schedule.
Email: amirali@ece.uvic.ca
Office Tel: 721-8613
Web page for this class: http://www.ece.uvic.ca/~amirali/courses/ELEC669/elec669.html
The course will use paper reprints. Lecture notes will be posted on the course web page.

Course Structure
• Lectures: 1-2 weeks on processor review; 5 weeks on low power techniques; 6 weeks of discussion, presentations, and meetings.
• A reading paper is posted on the web for each week.
• You need to bring a 1-page review of the papers.
• Presentations: each student gives two presentations in class.

Course Philosophy
• Papers are used as a supplement to the lectures. If a topic or a detail is not covered in class, I expect you to read on your own to learn it.
• One project (50%).
• Presentation (30%), announced in advance.
• Final exam: take home (20%).
• IMPORTANT NOTE: you must get a passing grade in all components to pass the course. Failing any of the three components will result in failing the course.

Project
• More on the project later.

Topics
• High Performance Processors?
• Low-Power Design
• Low-Power Branch Prediction
• Low-Power Register Renaming
• Low-Power SRAMs
• Low-Power Front-End
• Low-Power Back-End
• Low-Power Issue Logic
• Low-Power Commit
• AND more…

A Modern Processor
1. What does each stage do?
2. Possible power optimizations?
Fetch → Decode (front-end) → Issue → Complete → Commit (back-end)

Power Breakdown
[Charts: power breakdown for the Alpha 21464 and the Pentium Pro]

Instruction Set Architecture (ISA)
Instruction execution cycle:
• Fetch instruction from memory
• Decode instruction: determine its size and action
• Fetch operand data
• Execute instruction and compute results or status
• Store result in memory
• Determine next instruction's address

What Should We Know?
• A specific ISA (MIPS)
• Performance issues: vocabulary and motivation
• Instruction-Level Parallelism
• How to use pipelining to improve performance
• Exploiting instruction-level parallelism with a dynamic approach
• Memory: caches and virtual memory

What Is Expected From You?
• Read papers!
• Be up-to-date!
• Come back with your input and questions for discussion!

Power?
• Everything is done by tiny switches; their charge represents logic values.
• Changing charge takes energy; power is energy over time.
• Devices are non-ideal: power turns into heat.
• Excess heat makes circuits break down.
• We need to keep power within acceptable limits.
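The physics behind this slide is usually summarized by the standard first-order model for CMOS dynamic (switching) power, P ≈ α·C·V²·f. A minimal sketch; the model is textbook-standard, but the numeric values below are illustrative assumptions, not from the lecture:

```python
# First-order CMOS dynamic (switching) power: P = alpha * C * V^2 * f
#   alpha: activity factor (fraction of capacitance switched per cycle)
#   c_farads: switched capacitance, v_volts: supply voltage, f_hertz: clock
def dynamic_power(alpha, c_farads, v_volts, f_hertz):
    return alpha * c_farads * v_volts ** 2 * f_hertz

# Illustrative numbers only (not from the slides): the quadratic V term
# is why lowering the supply voltage is the biggest power lever.
p_full = dynamic_power(0.2, 1e-9, 1.2, 2e9)   # 0.576 W
p_half = dynamic_power(0.2, 1e-9, 0.6, 2e9)   # 0.144 W, 4x less
```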

POWER in the real world

Power as a Performance Limiter
Conventional performance scaling:
• Goal: maximum performance with minimum cost/complexity.
• How: more and faster transistors; more complex structures.
• Power: "don't fix it if it ain't broken."
Not true anymore: power has increased rapidly. Power-aware architecture is a necessity.

Power-Aware Architecture
Conventional architecture:
• Goal: maximum performance.
• How: do as much as you can.
Power-aware architecture (this work):
• Goal: minimize power and maintain performance.
• How: do as little as you can, while maintaining performance.
A challenging and new area.

Why Is This Challenging?
• Identify actions that can be delayed or eliminated.
• Don't touch those that boost performance.
• The cost/power of doing so must not outweigh the benefits.

Definitions
• Performance is in units of things-per-second; bigger is better.
• If we are primarily concerned with response time:
    performance(x) = 1 / execution_time(x)
• "X is n times faster than Y" means:
    Performance(X) / Performance(Y) = n

Amdahl's Law
Speedup due to enhancement E:
    Speedup(E) = ExTime(without E) / ExTime(with E)
               = Performance(with E) / Performance(without E)
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
    ExTime(with E) = ((1 - F) + F/S) × ExTime(without E)
    Speedup(with E) = 1 / ((1 - F) + F/S)

Amdahl's Law: Example
A new CPU makes web serving 10 times faster. The old CPU spent 40% of the time on computation and 60% on waiting for I/O. What is the overall enhancement?
    Fraction enhanced = 0.4
    Speedup enhanced = 10
    Speedup overall = 1 / (0.6 + 0.4/10) ≈ 1.56
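The formula and the worked example are easy to check directly; a small sketch (the function name is my own):

```python
# Amdahl's Law: Speedup = 1 / ((1 - F) + F / S)
#   F: fraction of execution time that is enhanced
#   S: speedup of the enhanced fraction
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# The slide's example: 40% of the time sped up 10x.
overall = amdahl_speedup(0.4, 10)   # 1.5625, i.e. ~1.56
```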

Why Do Benchmarks?
• How we evaluate differences: between different systems, or across changes to a single system.
• Provide a target: benchmarks should represent a large class of important programs, so improving benchmark performance should help many programs.
• For better or worse, benchmarks shape a field.
• Good benchmarks accelerate progress: a good target for development.
• Bad benchmarks hurt progress: do they help real programs, or just sell machines/papers? Inventions that help real programs don't help the benchmark.

SPEC: First Round
• First round 1989: 10 programs, a single number to summarize performance.
• One program spent 99% of its time in a single line of code.
• A new front-end compiler could improve it dramatically.

SPEC 95
• Eighteen application benchmarks (with inputs) reflecting a technical computing workload.
• Eight integer: go, m88ksim, gcc, compress, li, ijpeg, perl, vortex.
• Ten floating-point intensive: tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5.
• Must run with standard compiler flags: eliminate special undocumented incantations that may not even generate working code for real programs.

Summary
    CPU time = Seconds / Program
             = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
• Time is the measure of computer performance!
• Remember Amdahl's Law: improvement is limited by the unimproved part of the program.
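The CPU-time equation above is straightforward to encode; a minimal sketch with illustrative numbers (not from the slides):

```python
# CPU time = (Instructions/Program) * (Cycles/Instruction) * (Seconds/Cycle)
def cpu_time(instructions, cpi, clock_hz):
    return instructions * cpi / clock_hz   # seconds/cycle = 1 / clock_hz

# Illustrative numbers (not from the slides): one million instructions
# at CPI 1.5 on a 1 GHz clock take 1.5 ms.
t = cpu_time(instructions=1_000_000, cpi=1.5, clock_hz=1e9)   # 0.0015 s
```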

Execution Cycle
• Instruction Fetch: obtain instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine successor instruction

What Must Be Specified?
• Instruction Fetch / Instruction Decode:
  - Instruction format or encoding: how is it decoded?
  - Location of operands and result: where, other than memory?
• Operand Fetch:
  - How many explicit operands?
  - How are memory operands located?
  - Which can or cannot be in memory?
• Execute:
  - Data type and size
  - Operations: what is supported?
• Result Store / Next Instruction:
  - Successor instruction: jumps, conditions, branches

What Is ILP?
• Principle: many instructions in the code do not depend on each other.
• Result: it is possible to execute them in parallel.
• ILP: the potential overlap among instructions (so they can be evaluated in parallel).
• Issues: building compilers to analyze the code; building special/smarter hardware to handle the code.
• Goal: increase the amount of parallelism exploited among instructions, to get good results out of pipelining.

What Is ILP? (Example)
CODE A:
    LD  R1, (R2)100
    ADD R4, R1
    SUB R5, R1
    CMP R1, R2
    ADD R3, R1
CODE B:
    LD  R1, (R2)100
    ADD R4, R1
    SUB R5, R4
    SW  R5, (R2)100
    LD  R1, (R2)100
• Code A: possible to execute 4 instructions in parallel.
• Code B: can't execute more than one instruction per cycle.
Code A has higher ILP.
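One way to see why Code A has higher ILP is a toy read-after-write (RAW) dependence check. This is a simplified model of my own, not part of the lecture: two-operand ops are modeled as also reading their destination, CMP writes a pseudo-register FLAGS, SW writes MEM, and WAR/WAW hazards are ignored.

```python
# Each instruction is modeled as (dest, sources).
def raw_depends(later, earlier):
    """True if `later` reads the register `earlier` writes (RAW)."""
    return earlier[0] in later[1]

def mutually_independent(instrs):
    """True if no instruction in the group reads another's result."""
    return all(not raw_depends(b, a)
               for i, a in enumerate(instrs) for b in instrs[i + 1:])

code_a = [("R1", ("R2",)),          # LD  R1, (R2)100
          ("R4", ("R4", "R1")),     # ADD R4, R1
          ("R5", ("R5", "R1")),     # SUB R5, R1
          ("FLAGS", ("R1", "R2")),  # CMP R1, R2
          ("R3", ("R3", "R1"))]     # ADD R3, R1

code_b = [("R1", ("R2",)),          # LD  R1, (R2)100
          ("R4", ("R4", "R1")),     # ADD R4, R1
          ("R5", ("R5", "R4")),     # SUB R5, R4
          ("MEM", ("R5", "R2")),    # SW  R5, (R2)100
          ("R1", ("R2",))]          # LD  R1, (R2)100

# The four instructions after the load in Code A are mutually
# independent, while Code B forms a serial dependence chain.
```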

Out-of-Order Execution
• Programmer: instructions execute in order.
• Processor: instructions may execute in any order, if the results remain the same at the end.
In-order:
    A: LD  R1, (R2)
    B: ADD R3, R4
    C: ADD R3, R5
    D: CMP R3, R1
Out-of-order:
    B: ADD R3, R4
    C: ADD R3, R5
    A: LD  R1, (R2)
    D: CMP R3, R1

Assumptions
• Five-stage integer pipeline.
• Branches have a delay of one clock cycle: in the ID stage, comparisons are done, decisions are made, and the PC is loaded.
• No structural hazards: functional units are fully pipelined or replicated (as many times as the pipeline depth).
• Integer load latency: 1; integer ALU operation latency: 0.
• FP latencies:
    Source instruction | Dependent instruction | Latency (clock cycles)
    FP ALU op          | Another FP ALU op     | 3
    FP ALU op          | Store double          | 2
    Load double        | FP ALU op             | 1
    Load double        | Store double          | 0

Simple Loop & Assembler Equivalent
    for (i = 1000; i > 0; i--) x[i] = x[i] + s;

    Loop: LD   F0, 0(R1)      ; F0 = array element
          ADDD F4, F0, F2     ; add scalar in F2
          SD   F4, 0(R1)      ; store result
          SUBI R1, R1, #8     ; decrement pointer 8 bytes (DW)
          BNE  R1, R2, Loop   ; branch if R1 != R2
• x[i] and s are double/floating-point type.
• R1 initially holds the address of the array element with the highest address.
• F2 contains the scalar value s.
• Register R2 is pre-computed so that 8(R2) is the last element to operate on.

Where Are the Stalls?
Unscheduled (10 clock cycles; can we minimize?):
    Loop: LD   F0, 0(R1)
          stall
          ADDD F4, F0, F2
          stall
          stall
          SD   F4, 0(R1)
          SUBI R1, R1, #8
          stall
          BNE  R1, R2, Loop
          stall
Scheduled (6 clock cycles: 3 cycles of actual work, 3 of overhead; can we minimize further?):
    Loop: LD   F0, 0(R1)
          SUBI R1, R1, #8
          ADDD F4, F0, F2
          stall
          BNE  R1, R2, Loop
          SD   F4, 8(R1)
(Latencies as on the Assumptions slide: FP ALU op → FP ALU op: 3; FP ALU op → store double: 2; load double → FP ALU op: 1; load double → store double: 0.)

Loop Unrolling
Make four copies of the loop body (at offsets 0, -8, -16, -24), then eliminate the intermediate decrements and branches. Four-iteration code:
    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   F4, 0(R1)
          LD   F6, -8(R1)
          ADDD F8, F6, F2
          SD   F8, -8(R1)
          LD   F10, -16(R1)
          ADDD F12, F10, F2
          SD   F12, -16(R1)
          LD   F14, -24(R1)
          ADDD F16, F14, F2
          SD   F16, -24(R1)
          SUBI R1, R1, #32
          BNE  R1, R2, Loop
Assumption: R1 is initially a multiple of 32, or the number of loop iterations is a multiple of 4.

Loop Unroll & Schedule
Unrolled but unscheduled, each copy still stalls as before: 28 clock cycles, or 7 per iteration. Can we minimize? Scheduled:
    Loop: LD   F0, 0(R1)
          LD   F6, -8(R1)
          LD   F10, -16(R1)
          LD   F14, -24(R1)
          ADDD F4, F0, F2
          ADDD F8, F6, F2
          ADDD F12, F10, F2
          ADDD F16, F14, F2
          SD   F4, 0(R1)
          SD   F8, -8(R1)
          SD   F12, -16(R1)
          SUBI R1, R1, #32
          BNE  R1, R2, Loop
          SD   F16, 8(R1)
No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?

Summary
    Original iteration:       10 cycles
    Scheduling:                6 cycles
    Unrolling:                 7 cycles per iteration
    Unrolling + scheduling:  3.5 cycles per iteration (no stalls)
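The per-iteration numbers in this summary are simple arithmetic over the cycle counts from the previous slides:

```python
# Cycles per iteration for each version of the loop.
original_loop = 10.0      # unscheduled, with stalls
scheduled     = 6.0       # after instruction scheduling
unrolled      = 28 / 4    # 4 iterations in 28 cycles -> 7.0
unroll_sched  = 14 / 4    # 4 iterations in 14 cycles -> 3.5 (no stalls)

speedup = original_loop / unroll_sched   # ~2.86x from unroll + schedule
```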

Multiple Issue
• Multiple issue is the ability of the processor to start more than one instruction in a given cycle.
• Superscalar processors
• Very Long Instruction Word (VLIW) processors

A Modern Processor: Multiple Issue
Fetch → Decode (front-end) → Issue → Complete → Commit (back-end)

1990's: Superscalar Processors
• Bottleneck: CPI >= 1
  - Limit on scalar (single-instruction-issue) performance: hazards.
  - Superpipelining? Diminishing returns (hazards + overhead).
• How can we make CPI = 0.5?
  - Multiple instructions in every pipeline stage (superscalar):
    Cycle:    1   2   3   4   5   6   7
    Inst 0:  IF  ID  EX  MEM WB
    Inst 1:  IF  ID  EX  MEM WB
    Inst 2:      IF  ID  EX  MEM WB
    Inst 3:      IF  ID  EX  MEM WB
    Inst 4:          IF  ID  EX  MEM WB
    Inst 5:          IF  ID  EX  MEM WB

Elements of Advanced Superscalars
• High performance instruction fetching
  - Good dynamic branch and jump prediction
  - Multiple instructions per cycle; multiple branches per cycle?
• Scheduling and hazard elimination
  - Dynamic scheduling (not necessarily: the Alpha 21064 and the Pentium were statically scheduled)
  - Register renaming to eliminate WAR and WAW hazards
• Parallel functional units; paths/buses/multiple register ports
• High performance memory systems
• Speculative execution

SS + DS + Speculation
Superscalar + dynamic scheduling + speculation: three great tastes that taste great together.
• CPI >= 1? Overcome with superscalar issue.
• Superscalar increases hazards? Overcome with dynamic scheduling.
• RAW dependences still a problem? Overcome with a large window.
• Branches a problem for filling a large window? Overcome with speculation.

The Big Picture
Static program → fetch & branch predict → issue → execution → reorder & commit

Superscalar Microarchitecture
[Block diagram: predecode → instruction cache → instruction buffer → decode/rename/dispatch → floating-point and integer/address instruction buffers → floating-point and integer register files with their functional units (plus data cache and memory interface) → reorder and commit]

Register Renaming Methods
• First method: a physical register file separate from the logical (architectural) register file.
  - A mapping table associates a physical register with the current value of each logical register.
  - Uses a free list of physical registers.
  - The physical register file is bigger than the logical register file.
• Second method: the physical register file is the same size as the logical one.
  - Also uses a buffer with one entry per instruction: the reorder buffer.

Register Renaming Example
The unrolled loop again. Unscheduled: 28 clock cycles, or 7 per iteration. Can we minimize? Scheduled:
    Loop: LD   F0, 0(R1)
          LD   F6, -8(R1)
          LD   F10, -16(R1)
          LD   F14, -24(R1)
          ADDD F4, F0, F2
          ADDD F8, F6, F2
          ADDD F12, F10, F2
          ADDD F16, F14, F2
          SD   F4, 0(R1)
          SD   F8, -8(R1)
          SD   F12, -16(R1)
          SUBI R1, R1, #32
          BNE  R1, R2, Loop
          SD   F16, 8(R1)
No stalls! 14 clock cycles, or 3.5 per iteration. Can we minimize further?

Register Renaming: First Method
Mapping table before "Add r3, 4":
    r0 → R8, r1 → R7, r2 → R5, r3 → R1, r4 → R9
    Free list: R2, R6, R13
After "Add r3, 4" (the destination r3 is given a new physical register from the free list):
    r0 → R8, r1 → R7, r2 → R5, r3 → R2, r4 → R9
    Free list: R6, R13
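The mapping-table update on this slide can be sketched in a few lines. This is an illustrative model of my own, not a hardware description, and freeing the old physical register (which happens at commit) is omitted:

```python
# First renaming method: logical -> physical map plus a free list.
class RenameMap:
    def __init__(self, mapping, free_list):
        self.map = dict(mapping)     # logical reg -> physical reg
        self.free = list(free_list)  # physical regs available

    def rename_dest(self, logical):
        """Allocate a fresh physical register for a new definition."""
        phys = self.free.pop(0)      # take from the free list
        self.map[logical] = phys     # later readers see the new copy
        return phys

# The slide's example state, before "Add r3, 4":
rm = RenameMap({"r0": "R8", "r1": "R7", "r2": "R5",
                "r3": "R1", "r4": "R9"},
               ["R2", "R6", "R13"])
new_phys = rm.rename_dest("r3")      # the add's result gets R2
```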

Superscalar Processors
• Issue a varying number of instructions per clock.
• Scheduling: static (by the compiler) or dynamic (by the hardware).
• A superscalar has a varying number of instructions per cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo).
• Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000.

More Realistic HW: Register Impact
Effect of limiting the number of renaming registers:
[Chart: instructions per clock vs. number of rename registers; roughly 11-45 IPC for FP programs and 5-15 for integer programs]

Reorder Buffer
• Reserve an entry at the tail when the instruction is dispatched.
• Place data in the entry when execution finishes.
• Remove from the head when the instruction completes.
• Bypass results to other instructions when needed.
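The rules above can be sketched as a toy reorder buffer. This is an illustrative model of my own (result bypassing is omitted), not the actual hardware design:

```python
from collections import deque

# Toy reorder buffer: reserve at the tail on dispatch, fill on
# writeback, and retire from the head only when the head entry is
# finished -- which is what enforces in-order commit.
class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # each entry: [tag, value, done]

    def dispatch(self, tag):
        self.entries.append([tag, None, False])

    def writeback(self, tag, value):
        for e in self.entries:
            if e[0] == tag:
                e[1], e[2] = value, True

    def commit(self):
        """Retire the head entry if it has finished, else None."""
        if self.entries and self.entries[0][2]:
            return self.entries.popleft()[:2]
        return None

rob = ReorderBuffer()
rob.dispatch("A"); rob.dispatch("B")
rob.writeback("B", 7)        # B finishes out of order...
assert rob.commit() is None  # ...but cannot retire past unfinished A
rob.writeback("A", 3)        # now A can commit, then B
```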

Register Renaming: Reorder Buffer
Before "add r3, 4", the mapping table reads:
    r0 → R8, r1 → R7, r2 → R5, r3 → rob6, r4 → R9
The add is renamed to "add rob8, rob6, 4": its source is the reorder-buffer entry currently holding r3 (rob6), and its destination is a newly allocated entry (rob8), so r3 now maps to rob8. When an entry reaches the head of the reorder buffer, its value is written into the architectural register (here, R3).

Instruction Buffers
[Same block diagram as the Superscalar Microarchitecture slide, highlighting the floating-point and integer/address instruction buffers between decode/rename/dispatch and the functional units]

Issue Buffer Organization
• (a) Single, shared queue: no out-of-order issue, no renaming.
• (b) Multiple queues, one per instruction type: no out-of-order issue inside a queue, but the queues issue out of order with respect to each other.

Issue Buffer Organization (cont.)
• (c) Multiple reservation stations (one per instruction type, or one big pool): no FIFO ordering; when an instruction's operands are ready and hardware is available, execution starts.
• Proposed by Tomasulo.
• Instructions arrive from instruction dispatch.

Typical Reservation Station
Fields: operation | source 1 | data 1 | valid 1 | source 2 | data 2 | valid 2 | destination
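The fields on this slide map naturally onto a small data structure. A sketch of Tomasulo-style operand capture; the class and method names are my own, not from the lecture:

```python
# One reservation-station entry: operation, two (source tag, data,
# valid) operand slots, and a destination tag.
class RSEntry:
    def __init__(self, op, src1, src2, dest):
        self.op = op
        self.src = {1: [src1, None, False],   # [tag, data, valid]
                    2: [src2, None, False]}
        self.dest = dest

    def capture(self, tag, value):
        """Snoop a broadcast result: fill any matching, still-invalid
        source operand and mark it valid."""
        for s in self.src.values():
            if s[0] == tag and not s[2]:
                s[1], s[2] = value, True

    def ready(self):
        """The entry can issue once both operands are valid."""
        return all(s[2] for s in self.src.values())

e = RSEntry("ADD", src1="rob6", src2="rob7", dest="rob8")
e.capture("rob6", 5)
e.capture("rob7", 9)
# e.ready() is now True: the functional unit can start executing.
```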

Memory Hazard Detection Logic
[Block diagram: instruction issue sends loads and stores through address add & translation on the way to memory; load addresses are kept in a load address buffer and store addresses in a store address buffer; an address comparator between the two buffers feeds hazard control]