ELEC 669 Low Power Design Techniques Lecture 1
ELEC 669 Low Power Design Techniques Lecture 1 Amirali Baniasadi amirali@ece.uvic.ca
ELEC 669: Low Power Design Techniques
Instructor: Amirali Baniasadi
Office: EOW 441, only by appointment. Call or email with your schedule.
Email: amirali@ece.uvic.ca
Office Tel: 721-8613
Web page for this class: http://www.ece.uvic.ca/~amirali/courses/ELEC669/elec669.html
Will use paper reprints. Lecture notes will be posted on the course web page. 2
Course Structure
- Lectures:
  - 1-2 weeks on processor review
  - 5 weeks on low power techniques
  - 6 weeks: discussion, presentation, meetings
- Reading paper posted on the web for each week.
- Need to bring a 1-page review of the papers.
- Presentations: Each student should give two presentations in class. 3
Course Philosophy
- Papers to be used as a supplement for lectures (if a topic or detail is not covered in class, I expect you to read on your own to learn it)
- One Project (50%)
- Presentation (30%) - will be announced in advance.
- Final Exam: take home (20%)
- IMPORTANT NOTE: Must get a passing grade in all components to pass the course. Failing any of the three components will result in failing the course. 4
Project
- More on the project later 5
Topics
- High Performance Processors?
- Low-Power Design
- Low-Power Branch Prediction
- Low-Power Register Renaming
- Low-Power SRAMs
- Low-Power Front-End
- Low-Power Back-End
- Low-Power Issue Logic
- Low-Power Commit
- AND more… 6
A Modern Processor
1- What does each stage do?
2- Possible power optimizations?
Pipeline stages: Fetch, Decode, Issue (front-end); Complete, Commit (back-end) 7
Power Breakdown: Alpha 21464, Pentium Pro 8
Instruction Set Architecture (ISA)
- Instruction Execution Cycle:
  - Fetch instruction from memory
  - Decode instruction: determine its size & action
  - Fetch operand data
  - Execute instruction & compute results or status
  - Store result in memory
  - Determine next instruction's address 9
What Should We Know?
- A specific ISA (MIPS)
- Performance issues - vocabulary and motivation
- Instruction-Level Parallelism
- How to use pipelining to improve performance
- Exploiting instruction-level parallelism with a dynamic approach
- Memory: caches and virtual memory 10
What Is Expected From You?
- Read papers!
- Be up-to-date!
- Come back with your input & questions for discussion! 11
Power?
- Everything is done by tiny switches
- Their charge represents logic values
- Changing charge takes energy; power is energy over time
- Devices are non-ideal: power becomes heat
- Excess heat breaks circuits down
- Need to keep power within acceptable limits 12
POWER in the real world 13
Power as a Performance Limiter
Conventional performance scaling:
- Goal: maximum performance with minimum cost/complexity
- How: more and faster transistors; more complex structures.
- Power: don't fix it if it ain't broken
Not true anymore: power has increased rapidly. A power-aware architecture is a necessity. 14
Power-Aware Architecture
Conventional architecture:
- Goal: maximum performance
- How: do as much as you can.
This work - power-aware architecture:
- Goal: minimize power and maintain performance
- How: do as little as you can, while maintaining performance
A challenging and new area 15
Why Is This Challenging?
- Identify actions that can be delayed/eliminated
- Don't touch those that boost performance
- Cost/power of doing so must not outweigh the benefits 16
Definitions
- Performance is in units of things-per-second
  - bigger is better
- If we are primarily concerned with response time:
    performance(X) = 1 / execution_time(X)
- "X is n times faster than Y" means:
    n = Performance(X) / Performance(Y) 17
Amdahl's Law
Speedup due to enhancement E:
  Speedup(E) = Ex.Time(without E) / Ex.Time(with E) = Performance(with E) / Performance(without E)
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
  Ex.Time(with E) = ((1-F) + F/S) × Ex.Time(without E)
  Speedup(with E) = Ex.Time(without E) / Ex.Time(with E) = 1 / ((1-F) + F/S)
Amdahl's Law - Example
A new CPU makes web serving 10 times faster. The old CPU spent 40% of the time on computation and 60% on waiting for I/O. What is the overall enhancement?
  Fraction enhanced = 0.4
  Speedup enhanced = 10
  Speedup overall = 1 / (0.6 + 0.4/10) = 1.56
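As a quick check, the example above can be computed with a short Python helper (a sketch; the function name is ours, not from the slides):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction F of the task is sped up by S (Amdahl's Law)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Web-serving example from the slide: 40% of the time enhanced by a factor of 10.
print(round(amdahl_speedup(0.4, 10), 2))  # → 1.56
```

Note that even with a 10x faster CPU, the 60% spent waiting for I/O caps the overall speedup well below 2.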
Why Do Benchmarks?
- How we evaluate differences
  - Different systems
  - Changes to a single system
- Provide a target
  - Benchmarks should represent a large class of important programs
  - Improving benchmark performance should help many programs
- For better or worse, benchmarks shape a field
- Good benchmarks accelerate progress
  - good target for development
- Bad benchmarks hurt progress
  - help real programs vs. sell machines/papers?
  - Inventions that help real programs don't help the benchmark
SPEC First Round
- First round 1989: 10 programs, single number to summarize performance
- One program: 99% of its time in a single line of code
- A new front-end compiler could improve it dramatically
SPEC 95
- Eighteen application benchmarks (with inputs) reflecting a technical computing workload
- Eight integer
  - go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
- Ten floating-point intensive
  - tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
- Must run with standard compiler flags
  - eliminate special undocumented incantations that may not even generate working code for real programs 23
Summary
  CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
- Time is the measure of computer performance!
- Remember Amdahl's Law: improvement is limited by the unimproved part of the program
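The performance equation above can be sketched in Python (the numbers are illustrative and not from the slides):

```python
def cpu_time(instructions, cycles_per_instruction, clock_hz):
    """Seconds/Program = Instructions × Cycles/Instruction × Seconds/Cycle.
    Seconds per cycle is 1/clock_hz, so we divide by the clock rate."""
    return instructions * cycles_per_instruction / clock_hz

# Hypothetical program: 1 billion instructions at CPI 1.5 on a 2 GHz clock.
print(cpu_time(1_000_000_000, 1.5, 2_000_000_000))  # → 0.75 (seconds)
```

Halving the CPI or doubling the clock rate halves CPU time, which is why both are targets for architects.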
Execution Cycle
- Instruction Fetch: obtain instruction from program storage
- Instruction Decode: determine required actions and instruction size
- Operand Fetch: locate and obtain operand data
- Execute: compute result value or status
- Result Store: deposit results in storage for later use
- Next Instruction: determine successor instruction 25
What Must Be Specified?
- Instruction Fetch / Instruction Decode:
  - Instruction format or encoding – how is it decoded?
- Operand Fetch:
  - Location of operands and result – where other than memory?
  - How many explicit operands?
  - How are memory operands located?
  - Which can or cannot be in memory?
- Execute:
  - Data type and size
  - Operations – what are supported
- Result Store / Next Instruction:
  - Successor instruction – jumps, conditions, branches 26
What Is ILP?
- Principle: many instructions in the code do not depend on each other
- Result: possible to execute them in parallel
- ILP: potential overlap among instructions (so they can be evaluated in parallel)
- Issues:
  - Building compilers to analyze the code
  - Building special/smarter hardware to handle the code
- ILP: increase the amount of parallelism exploited among instructions
- Seeks good results out of pipelining 27
What Is ILP?
Code A:
  LD  R1, (R2)100
  ADD R4, R1
  SUB R5, R1
  CMP R1, R2
  ADD R3, R1
Code B:
  LD  R1, (R2)100
  ADD R4, R1
  SUB R5, R4
  SW  R5, (R2)100
  LD  R1, (R2)100
- Code A: possible to execute 4 instructions in parallel.
- Code B: can't execute more than one instruction per cycle.
Code A has higher ILP 28
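The difference between the two fragments above can be made concrete with a small Python sketch (ours, not from the slides) that finds read-after-write dependences. Each instruction is modeled as a destination register plus the registers it reads (two-operand ADD/SUB are assumed to read their destination too):

```python
# Each instruction: (destination register or None, list of source registers).
code_a = [
    ("R1", ["R2"]),        # LD  R1, (R2)100
    ("R4", ["R4", "R1"]),  # ADD R4, R1
    ("R5", ["R5", "R1"]),  # SUB R5, R1
    (None, ["R1", "R2"]),  # CMP R1, R2
    ("R3", ["R3", "R1"]),  # ADD R3, R1
]
code_b = [
    ("R1", ["R2"]),        # LD  R1, (R2)100
    ("R4", ["R4", "R1"]),  # ADD R4, R1
    ("R5", ["R5", "R4"]),  # SUB R5, R4
    (None, ["R5", "R2"]),  # SW  R5, (R2)100
    ("R1", ["R2"]),        # LD  R1, (R2)100
]

def raw_dependences(code):
    """(consumer, producer) pairs: the consumer reads a register
    written by the nearest earlier producer of that register."""
    deps = set()
    for i, (_, sources) in enumerate(code):
        for reg in sources:
            for j in range(i - 1, -1, -1):
                if code[j][0] == reg:
                    deps.add((i, j))
                    break  # nearest earlier writer of reg
    return sorted(deps)

print(raw_dependences(code_a))  # → [(1, 0), (2, 0), (3, 0), (4, 0)]
print(raw_dependences(code_b))  # → [(1, 0), (2, 1), (3, 2)]
```

In Code A every later instruction depends only on the initial load, so all four can run in parallel once it completes; in Code B each instruction depends on the one just before it, forcing serial execution.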
Out-of-Order Execution
- Programmer: instructions execute in-order
- Processor: instructions may execute in any order, if results remain the same at the end
In-order:
  A: LD  R1, (R2)
  B: ADD R3, R4
  C: ADD R3, R5
  D: CMP R3, R1
Out-of-order:
  B: ADD R3, R4
  C: ADD R3, R5
  A: LD  R1, (R2)
  D: CMP R3, R1 29
Assumptions
- Five-stage integer pipeline
- Branches have a delay of one clock cycle
  - ID stage: comparisons done, decisions made and PC loaded
- No structural hazards
  - Functional units are fully pipelined or replicated (as many times as the pipeline depth)
- FP latencies below; integer load latency: 1; integer ALU operation latency: 0

  Source instruction | Dependent instruction | Latency (clock cycles)
  FP ALU op          | Another FP ALU op     | 3
  FP ALU op          | Store double          | 2
  Load double        | FP ALU op             | 1
  Load double        | Store double          | 0
  30
Simple Loop & Assembler Equivalent
  for (i=1000; i>0; i--)
    x[i] = x[i] + s;

  Loop: LD   F0, 0(R1)     ; F0 = array element
        ADDD F4, F0, F2    ; add scalar in F2
        SD   F4, 0(R1)     ; store result
        SUBI R1, R1, #8    ; decrement pointer 8 bytes (DW)
        BNE  R1, R2, Loop  ; branch if R1 != R2

- x[i] & s are double/floating point type
- R1 initially holds the address of the array element with the highest address
- F2 contains the scalar value s
- Register R2 is pre-computed so that 8(R2) is the last element to operate on 31
Where Are the Stalls?
Unscheduled (10 clock cycles; can we minimize?):
  Loop: LD   F0, 0(R1)
        stall
        ADDD F4, F0, F2
        stall
        stall
        SD   F4, 0(R1)
        SUBI R1, R1, #8
        stall
        BNE  R1, R2, Loop
        stall

Scheduled (6 clock cycles; 3 cycles actual work, 3 cycles overhead; can we minimize further?):
  Loop: LD   F0, 0(R1)
        SUBI R1, R1, #8
        ADDD F4, F0, F2
        stall
        BNE  R1, R2, Loop
        SD   F4, 8(R1)
  32
Loop Unrolling
Make four copies of the loop body, then eliminate the intermediate increments and branches, adjusting the displacements to 0, -8, -16, -24 and the final decrement to #32.

Four-iteration code:
  Loop: LD   F0, 0(R1)
        ADDD F4, F0, F2
        SD   F4, 0(R1)
        LD   F6, -8(R1)
        ADDD F8, F6, F2
        SD   F8, -8(R1)
        LD   F10, -16(R1)
        ADDD F12, F10, F2
        SD   F12, -16(R1)
        LD   F14, -24(R1)
        ADDD F16, F14, F2
        SD   F16, -24(R1)
        SUBI R1, R1, #32
        BNE  R1, R2, Loop

Assumption: R1 is initially a multiple of 32, or the number of loop iterations is a multiple of 4 33
Loop Unroll & Schedule
Unrolled, unscheduled (28 clock cycles, or 7 per iteration; can we minimize?):
  Loop: LD   F0, 0(R1)
        stall
        ADDD F4, F0, F2
        stall
        stall
        SD   F4, 0(R1)
        LD   F6, -8(R1)
        stall
        ADDD F8, F6, F2
        stall
        stall
        SD   F8, -8(R1)
        (likewise for F10/F12 and F14/F16)
        SUBI R1, R1, #32
        stall
        BNE  R1, R2, Loop
        stall

Unrolled and scheduled (no stalls! 14 clock cycles, or 3.5 per iteration; can we minimize further?):
  Loop: LD   F0, 0(R1)
        LD   F6, -8(R1)
        LD   F10, -16(R1)
        LD   F14, -24(R1)
        ADDD F4, F0, F2
        ADDD F8, F6, F2
        ADDD F12, F10, F2
        ADDD F16, F14, F2
        SD   F4, 0(R1)
        SD   F8, -8(R1)
        SD   F12, -16(R1)
        SUBI R1, R1, #32
        BNE  R1, R2, Loop
        SD   F16, 8(R1)
  34
Summary
- Original iteration: 10 cycles
- Scheduled: 6 cycles
- Unrolled: 7 cycles per iteration
- Unrolled & scheduled: 3.5 cycles per iteration (no stalls) 35
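In source form, the unrolling transformation above corresponds to the following Python sketch (ours; it mirrors the slide's `x[i] = x[i] + s` loop and, like the slide, assumes the element count is a multiple of 4):

```python
def add_scalar(x, s):
    """Rolled loop from the slide: x[i] = x[i] + s for every element."""
    for i in range(len(x)):
        x[i] = x[i] + s

def add_scalar_unrolled(x, s):
    """Same computation unrolled by 4: one loop-overhead check per four
    elements, and four independent bodies a scheduler can interleave."""
    for i in range(0, len(x), 4):
        x[i]     = x[i]     + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s

data = [1.0, 2.0, 3.0, 4.0]
add_scalar_unrolled(data, 0.5)
print(data)  # → [1.5, 2.5, 3.5, 4.5]
```

The payoff in hardware is exactly what the cycle counts show: the loop overhead (decrement, branch) is amortized over four bodies, and the four independent bodies give the scheduler enough work to hide the load and ALU latencies.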
Multiple Issue
- Multiple issue is the ability of the processor to start more than one instruction in a given cycle.
- Superscalar processors
- Very Long Instruction Word (VLIW) processors 36
A Modern Processor: Multiple Issue
Pipeline stages: Fetch, Decode, Issue (front-end); Complete, Commit (back-end) 37
1990's: Superscalar Processors
- Bottleneck: CPI >= 1
  - Limit on scalar performance (single instruction issue)
    - Hazards
    - Superpipelining? Diminishing returns (hazards + overhead)
- How can we make the CPI = 0.5?
  - Multiple instructions in every pipeline stage (superscalar):

  Cycle:   1   2   3   4   5   6   7
  Inst 0:  IF  ID  EX  MEM WB
  Inst 1:  IF  ID  EX  MEM WB
  Inst 2:      IF  ID  EX  MEM WB
  Inst 3:      IF  ID  EX  MEM WB
  Inst 4:          IF  ID  EX  MEM WB
  Inst 5:          IF  ID  EX  MEM WB
  38
Elements of Advanced Superscalars
- High performance instruction fetching
  - Good dynamic branch and jump prediction
  - Multiple instructions per cycle, multiple branches per cycle?
- Scheduling and hazard elimination
  - Dynamic scheduling
  - Not necessarily: the Alpha 21064 & Pentium were statically scheduled
  - Register renaming to eliminate WAR and WAW
- Parallel functional units, paths/buses/multiple register ports
- High performance memory systems
- Speculative execution 39
SS + DS + Speculation
Superscalar + Dynamic scheduling + Speculation: three great tastes that taste great together
- CPI >= 1?
  - Overcome with superscalar
- Superscalar increases hazards
  - Overcome with dynamic scheduling
- RAW dependences still a problem?
  - Overcome with a large window
- Branches a problem for filling a large window?
  - Overcome with speculation 40
The Big Picture
Static program → fetch & branch predict → issue → execution → reorder & commit 41
Superscalar Microarchitecture
Predecode → instruction cache → instruction buffer → decode/rename/dispatch → instruction buffers (floating point, integer, address) → functional units and data cache → reorder and commit. Floating point and integer register files feed the functional units; the memory interface sits behind the data cache. 42
Register Renaming Methods
- First method: physical register file vs. logical (architectural) register file.
  - A mapping table associates a physical register with the current value of each logical register
  - Use a free list of physical registers
  - The physical register file is bigger than the logical register file
- Second method:
  - The physical register file is the same size as the logical one
  - Also, use a buffer with one entry per instruction: the reorder buffer. 43
Register Renaming Example
(The loop unroll & schedule example repeated from the earlier slide.)
- Unscheduled: 28 clock cycles, or 7 per iteration.
- Scheduled: no stalls! 14 clock cycles, or 3.5 per iteration.
Note that the unrolled code needs the extra registers F6-F16 beyond the original F0/F4. 44
Register Renaming: First Method
Mapping table (before):         Mapping table (after 'Add r3, 4'):
  r0 → R8                         r0 → R8
  r1 → R7                         r1 → R7
  r2 → R5                         r2 → R5
  r3 → R1                         r3 → R2
  r4 → R9                         r4 → R9
Free list: R2, R6, R13          Free list: R6, R13 45
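The first method above can be sketched in Python (a simplified model; class and method names are ours): a map from logical to physical registers plus a free list, where every write to a logical register allocates a fresh physical register.

```python
class Renamer:
    """First renaming method: logical → physical map plus a free list."""
    def __init__(self, mapping, free_list):
        self.mapping = dict(mapping)   # logical reg -> physical reg
        self.free = list(free_list)    # unallocated physical regs

    def rename(self, dest, sources):
        """Rename one instruction: read sources through the current map,
        then give the destination a fresh physical register so later
        writers of the same logical register cannot clash (no WAR/WAW)."""
        phys_sources = [self.mapping[s] for s in sources]
        phys_dest = self.free.pop(0)
        self.mapping[dest] = phys_dest
        return phys_dest, phys_sources

# The slide's example: 'Add r3, 4' reads r3 (currently R1) and remaps r3 to R2.
r = Renamer({"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R1", "r4": "R9"},
            ["R2", "R6", "R13"])
dest, srcs = r.rename("r3", ["r3"])
print(dest, srcs)  # → R2 ['R1']
```

After the call, the mapping table and free list match the slide's "after" state: r3 → R2, with R6 and R13 still free.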
Superscalar Processors
- Issue varying numbers of instructions per clock
- Scheduling: static (by the compiler) or dynamic (by the hardware)
- A superscalar has a varying number of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo).
- IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000 46
More Realistic HW: Register Impact
- Effect of limiting the number of renaming registers
- FP: 11-45 IPC; Integer: 5-15 IPC 47
Reorder Buffer
- Reserve entry at tail when dispatched
- Place data in entry when execution finishes
- Remove from head when complete
- Bypass to other instructions when needed 48
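The four bullets above can be modeled with a short Python sketch (ours, not from the slides): a queue whose entries are reserved at the tail on dispatch, filled on completion, and retired from the head strictly in order.

```python
from collections import deque

class ReorderBuffer:
    """Simplified ROB: reserve at tail on dispatch, fill on completion,
    retire from the head only when the head entry has its result."""
    def __init__(self):
        self.entries = deque()  # each entry: {"dest": reg, "value": result or None}

    def dispatch(self, dest):
        entry = {"dest": dest, "value": None}
        self.entries.append(entry)       # reserve entry at tail
        return entry

    def complete(self, entry, value):
        entry["value"] = value           # place data when execution finishes

    def retire(self):
        retired = []
        while self.entries and self.entries[0]["value"] is not None:
            retired.append(self.entries.popleft())  # remove from head, in order
        return retired

rob = ReorderBuffer()
a = rob.dispatch("r1")
b = rob.dispatch("r3")
rob.complete(b, 7)        # b finishes first, but cannot retire past a
print(len(rob.retire()))  # → 0
rob.complete(a, 5)
print(len(rob.retire()))  # → 2
```

The example shows the key property: an instruction that finishes out of order still cannot retire until everything ahead of it has completed. (Bypassing completed-but-unretired values to consumers is omitted for brevity.)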
Register Renaming: Reorder Buffer
Mapping table (before): r0 → R8, r1 → R7, r2 → R5, r3 → rob6, r4 → R9
The instruction 'add r3, 4' first has its source renamed ('add r3, rob6, 4'), then its destination is allocated a new reorder-buffer entry: 'add rob8, rob6, 4'. Afterwards r3 maps to rob8.
(Reorder buffer: entry 6 holds the pending value of r3 before; entry 8 is reserved for r3 after.) 49
Instruction Buffers
(The superscalar microarchitecture diagram repeated: the floating point, integer, and address instruction buffers sit between decode/rename/dispatch and the functional units.) 50
Issue Buffer Organization
a) Single, shared queue: no out-of-order, no renaming
b) Multiple queues, one per instruction type: no out-of-order inside queues; queues issue out of order 51
Issue Buffer Organization
c) Multiple reservation stations (one per instruction type, or one big pool): no FIFO ordering; when operands are ready and hardware is available, execution starts. Proposed by Tomasulo. (Fed from instruction dispatch.) 52
Typical Reservation Station
  operation | source1 | data1 | valid1 | source2 | data2 | valid2 | destination 53
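A reservation-station entry like the one above can be modeled as a small record (a sketch; the field names follow the slide, the `ready` helper is ours):

```python
from dataclasses import dataclass

@dataclass
class ReservationStation:
    """One entry with the fields from the slide: an operation, two sources
    (each with a data value and a valid bit), and a destination tag."""
    operation: str
    source1: str
    source2: str
    destination: str
    data1: float = 0.0
    valid1: bool = False
    data2: float = 0.0
    valid2: bool = False

    def ready(self):
        """The entry may begin execution once both operands are valid."""
        return self.valid1 and self.valid2

rs = ReservationStation("ADDD", "F0", "F2", "F4")
rs.data1, rs.valid1 = 1.0, True     # first operand arrives
print(rs.ready())  # → False
rs.data2, rs.valid2 = 2.0, True     # second operand arrives
print(rs.ready())  # → True
```

This is the mechanism behind organization (c) above: entries wake up in whatever order their operands arrive, with no FIFO constraint.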
Memory Hazard Detection Logic
Instruction issue feeds loads into a load address buffer and stores into a store address buffer; addresses are added & translated, the two buffers' addresses are compared, and the hazard control logic decides what goes to memory. 54