CS 35101 Computer Architecture Week 9 Understanding Performance

CS 35101 – Computer Architecture
Week 9: Understanding Performance
Paul Durand (www.cs.kent.edu/~durand)
[Adapted from M Irwin (www.cse.psu.edu/~mji)]
[Adapted from COD, Patterson & Hennessy, © 2005, UCB]

"Indeed, the cost-performance ratio of the product will depend most heavily on the implementer, just as ease of use depends most heavily on the architect."
The Mythical Man-Month, Brooks, pg 46

Performance Metrics
• Purchasing perspective: given a collection of machines, which has the
  - best performance?
  - least cost?
  - best cost/performance?
• Design perspective: faced with design options, which has the
  - best performance improvement?
  - least cost?
  - best cost/performance?
• Both require
  - a basis for comparison
  - a metric for evaluation
• Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

Defining (Speed) Performance
• Normally interested in reducing
  - Response time (aka execution time) – the time between the start and the completion of a task
    - Important to individual users
  - Thus, to maximize performance, need to minimize execution time

    $$\text{performance}_X = \frac{1}{\text{execution\_time}_X}$$

    If X is n times faster than Y, then

    $$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution\_time}_Y}{\text{execution\_time}_X} = n$$

  - Throughput – the total amount of work done in a given time
    - Important to data center managers
  - Decreasing response time almost always improves throughput

Performance Problem 1
Consider two machines, A and B
• A runs a program in 20 seconds
• B runs the same program in 25 seconds
All other things being equal
• Which machine has the better performance?
  Perf(A) / Perf(B) = Time(B) / Time(A) = 25 / 20 = 1.25
  Perf(B) / Perf(A) = Time(A) / Time(B) = 20 / 25 = 0.80
  Machine A has better performance
• By how much?
  25%: Machine A is 25% faster than machine B or, equivalently, machine A is 1.25 times as fast as machine B
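The ratio in Problem 1 can be checked with a few lines of Python. This is only a sketch; the function name `relative_performance` is made up for illustration and does not come from the slides.

```python
def relative_performance(time_x: float, time_y: float) -> float:
    """Return Perf(X) / Perf(Y), which equals Time(Y) / Time(X)."""
    if time_x <= 0 or time_y <= 0:
        raise ValueError("execution times must be positive")
    return time_y / time_x


# Problem 1: machine A takes 20 s, machine B takes 25 s.
a_over_b = relative_performance(20.0, 25.0)
b_over_a = relative_performance(25.0, 20.0)
print(f"Perf(A)/Perf(B) = {a_over_b:.2f}")
print(f"Perf(B)/Perf(A) = {b_over_a:.2f}")
```

A ratio above 1 means the first machine is the faster one; swapping the arguments gives the reciprocal.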

Performance Factors
• Want to distinguish elapsed time from the time spent on our task
• CPU execution time (CPU time) – the time the CPU spends working on a task
  - Does not include time waiting for I/O or running other programs

  $$\text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time}$$

  or

  $$\text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}}$$

• Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

Review: Machine Clock Rate
• Clock rate (MHz, GHz) is the inverse of clock cycle time (clock period)

  $$CC = \frac{1}{CR}$$

  [Figure: one clock period]

   10 nsec clock cycle => 100 MHz clock rate
    5 nsec clock cycle => 200 MHz clock rate
    2 nsec clock cycle => 500 MHz clock rate
    1 nsec clock cycle =>   1 GHz clock rate
  500 psec clock cycle =>   2 GHz clock rate
  250 psec clock cycle =>   4 GHz clock rate
  200 psec clock cycle =>   5 GHz clock rate
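The period-to-rate conversions on this slide follow directly from CC = 1 / CR. The sketch below works in picoseconds and GHz to keep the numbers small; the helper name `clock_rate_ghz` is an assumption made for this example.

```python
def clock_rate_ghz(period_ps: float) -> float:
    """Clock rate in GHz for a clock period given in picoseconds (CR = 1 / CC)."""
    if period_ps <= 0:
        raise ValueError("clock period must be positive")
    # 1 psec = 1e-12 s, so 1 / (period_ps * 1e-12) Hz = 1000 / period_ps GHz.
    return 1000.0 / period_ps


# Clock periods from the slide, converted to picoseconds (1 nsec = 1000 psec).
for period_ps in (10_000, 5_000, 2_000, 1_000, 500, 250, 200):
    print(f"{period_ps:>6} psec clock cycle => {clock_rate_ghz(period_ps):g} GHz clock rate")
```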

Performance Problem 2
Our favorite program runs in 10 seconds on computer A, which has a 4 GHz clock. We are trying to help a computer designer build a new machine, B, that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate. However, this will affect the rest of the CPU design, causing B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target?

Remember that CPU execution time = # CPU clock cycles / clock rate
For machine A: 10 sec = n CPU clock cycles / 4 GHz
For machine B: 6 sec = 1.2 n CPU clock cycles / target GHz
Dividing equation A by equation B => (10/6) = target / (4 x 1.2)
Or, target = 4.8 x 1.6667 = 8 GHz
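The same answer can be reached by computing the cycle count on A first and then solving for B's clock rate. This is a sketch under the problem's assumptions; `target_clock_rate_ghz` and its parameter names are invented for illustration.

```python
def target_clock_rate_ghz(time_a_s: float, rate_a_ghz: float,
                          time_b_s: float, cycle_ratio: float) -> float:
    """Clock rate (GHz) machine B needs to finish in time_b_s seconds."""
    # Cycles executed on A, in units of 1e9 cycles: time x rate.
    cycles_a = time_a_s * rate_a_ghz
    # B needs cycle_ratio times as many cycles for the same program.
    cycles_b = cycle_ratio * cycles_a
    # clock rate = cycles / time
    return cycles_b / time_b_s


# Problem 2: 10 s at 4 GHz on A; B must finish in 6 s with 1.2x the cycles.
target = target_clock_rate_ghz(10.0, 4.0, 6.0, 1.2)
print(f"target clock rate: {target:.1f} GHz")
```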

Performance Problem 3 - CPI
Suppose we have two implementations of the same instruction set architecture (ISA). For some program with n instructions:
• Machine A has a clock cycle time of 250 ps and a CPI of 2.0
• Machine B has a clock cycle time of 500 ps and a CPI of 1.2
Which machine is faster for this program and by how much?
Machine A: CPU time = n instruc x 2.0 clocks/instruc x 250 ps/clock = 500 n ps
Machine B: CPU time = n instruc x 1.2 clocks/instruc x 500 ps/clock = 600 n ps
Perf(A) / Perf(B) = Time(B) / Time(A) = 600 / 500 = 6/5 = 1.2
So machine A is 1.2 times as fast as machine B
Or, machine A is 20% faster than machine B
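Because both machines run the same n instructions, n cancels and it is enough to compare the average time per instruction. A small sketch (the function name is hypothetical):

```python
def time_per_instruction_ps(cpi: float, cycle_time_ps: float) -> float:
    """Average time per instruction in picoseconds: CPI x clock cycle time."""
    return cpi * cycle_time_ps


time_a = time_per_instruction_ps(2.0, 250.0)  # machine A
time_b = time_per_instruction_ps(1.2, 500.0)  # machine B
speedup = time_b / time_a                     # Perf(A) / Perf(B)
print(f"A: {time_a:.0f} ps/instruc, B: {time_b:.0f} ps/instruc")
print(f"A is {speedup:.2f} times as fast as B")
```

Note that A wins despite its higher CPI, because its shorter clock cycle more than makes up for it.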

Instructions Have Different Execution Times
• Multiplication takes more time than addition
• Floating point operations take longer than integer ones
• Accessing memory takes more time than accessing registers
• Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)

Clock Cycles per Instruction
• Since not all instructions take the same amount of time to execute
  - One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

  $$\text{\# CPU clock cycles for a program} = \text{\# Instructions for a program} \times \text{Average clock cycles per instruction}$$

• Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute
  - A way to compare two different implementations of the same ISA

                                   A   B   C
  CPI for this instruction class   1   2   3

Effective CPI
• Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

  $$\text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$$

  - Where IC_i is the count (percentage) of the number of instructions of class i executed
  - CPI_i is the (average) number of clock cycles per instruction for that instruction class
  - n is the number of instruction classes
• The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs
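The weighted sum can be written as a short function. In this sketch IC_i is taken as a fraction of all executed instructions, so the fractions must sum to 1. The mix below is hypothetical; only the class CPIs of 1, 2, and 3 come from the A/B/C example.

```python
def effective_cpi(mix: dict[str, tuple[float, float]]) -> float:
    """Overall effective CPI = sum over classes of (CPI_i x IC_i).

    mix maps an instruction class name to (fraction_of_instructions, class_cpi).
    """
    total_fraction = sum(fraction for fraction, _ in mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cpi for fraction, cpi in mix.values())


# Hypothetical instruction mix (assumed for illustration only).
mix = {"A": (0.5, 1), "B": (0.3, 2), "C": (0.2, 3)}
print(f"overall effective CPI = {effective_cpi(mix):.2f}")
```

Changing the fractions while keeping the class CPIs fixed shows how the effective CPI depends on the instruction mix.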

THE Performance Equation
• Our basic performance equation is then

  $$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$

  or

  $$\text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}}$$

• These equations separate the three key factors that affect performance
  - Can measure the CPU execution time by running the program
  - The clock rate is usually given
  - Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
  - CPI varies by instruction type and ISA implementation, for which we must know the implementation details
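One practical consequence of the equation: if CPU time, instruction count, and clock rate are all known, the overall CPI can be solved for without knowing implementation details. The sketch below uses invented measurements (10 s, 20 billion instructions, 4 GHz), and the function names are assumptions.

```python
def cpu_time_s(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = Instruction_count x CPI / clock_rate."""
    return instruction_count * cpi / clock_rate_hz


def measured_cpi(cpu_time: float, instruction_count: float, clock_rate_hz: float) -> float:
    """Rearranged performance equation: CPI = CPU time x clock_rate / Instruction_count."""
    return cpu_time * clock_rate_hz / instruction_count


# Hypothetical measurements, not taken from the slides.
cpi = measured_cpi(cpu_time=10.0, instruction_count=20e9, clock_rate_hz=4e9)
print(f"overall CPI = {cpi:.2f}")
print(f"check: CPU time = {cpu_time_s(20e9, cpi, 4e9):.1f} s")
```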

Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle

                           Instruction_count   CPI   clock_cycle
  Algorithm                        X            X
  Programming language             X            X
  Compiler                         X            X
  ISA                              X            X         X
  Processor organization                        X         X
  Technology                                              X

Performance Problem 4 – Compiler Effect
• A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).
  - The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C
  - The second code sequence has 6 instructions: 4 of A, 1 of B, and 1 of C
• Which sequence will be faster? By how much? What is the CPI for each sequence?

                              Code Sequence 1             Code Sequence 2
  Instruction Class   CPI   # Instruc   clock cycles    # Instruc   clock cycles
  A                    1        2            2              4            4
  B                    2        1            2              1            2
  C                    3        2            6              1            3
  Total                         5        10 clocks          6         9 clocks
                                CPI = 10/5 = 2.0            CPI = 9/6 = 1.5

Sequence 2 is faster even though it executes one more instruction: it needs 9 clock cycles instead of 10, so it is 10/9 ≈ 1.11 times as fast.
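The cycle counts and CPIs for the two code sequences can be tallied programmatically. A minimal sketch, with the dictionary layout and function names chosen only for this example:

```python
# Cycles per instruction for each class, as given in the problem.
CLASS_CPI = {"A": 1, "B": 2, "C": 3}


def total_cycles(counts: dict[str, int]) -> int:
    """Total clock cycles for a code sequence given per-class instruction counts."""
    return sum(CLASS_CPI[cls] * n for cls, n in counts.items())


def sequence_cpi(counts: dict[str, int]) -> float:
    """Average CPI of a code sequence: total cycles / total instructions."""
    return total_cycles(counts) / sum(counts.values())


seq1 = {"A": 2, "B": 1, "C": 2}
seq2 = {"A": 4, "B": 1, "C": 1}
for name, seq in (("sequence 1", seq1), ("sequence 2", seq2)):
    print(f"{name}: {total_cycles(seq)} clocks, CPI = {sequence_cpi(seq):.2f}")
print(f"sequence 2 is {total_cycles(seq1) / total_cycles(seq2):.2f} times as fast")
```

Both sequences run on the same machine, so the clock cycle time is identical and the cycle counts alone decide which is faster.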

A Simple Example

  Op       Freq   CPI_i   Freq x CPI_i   Load CPI = 2   Branch CPI = 1   Two ALU at once
  ALU      50%      1         0.5            0.5             0.5              0.25
  Load     20%      5         1.0            0.4             1.0              1.0
  Store    10%      3         0.3            0.3             0.3              0.3
  Branch   20%      2         0.4            0.4             0.2              0.4
                      Σ =     2.2            1.6             2.0              1.95

• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
  CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster
• How does this compare with using branch prediction to shave a cycle off the branch time?
  CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster
• What if two ALU instructions could be executed at once?
  CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster
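The three what-if questions all change one class CPI and recompute the weighted sum. The sketch below does that in a loop. It assumes, as the slide does, that instruction count and cycle time stay fixed, so the speedup is just the ratio of effective CPIs. The helper names are hypothetical, and "two ALU instructions at once" is modeled as an ALU CPI of 0.5.

```python
# Instruction mix from the example: op -> (frequency, CPI).
BASE_MIX = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}


def weighted_cpi(mix: dict[str, tuple[float, float]]) -> float:
    """Effective CPI: sum of Freq x CPI_i over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


def with_cpi(mix: dict[str, tuple[float, float]], op: str, new_cpi: float):
    """Return a copy of mix in which one class has a different CPI."""
    changed = dict(mix)
    changed[op] = (mix[op][0], new_cpi)
    return changed


base = weighted_cpi(BASE_MIX)
scenarios = {
    "load takes 2 cycles": with_cpi(BASE_MIX, "Load", 2),
    "branch takes 1 cycle": with_cpi(BASE_MIX, "Branch", 1),
    "two ALU ops at once": with_cpi(BASE_MIX, "ALU", 0.5),
}
for label, mix in scenarios.items():
    new = weighted_cpi(mix)
    print(f"{label}: CPI {new:.2f}, {100 * (base / new - 1):.1f}% faster")
```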

Comparing and Summarizing Performance
• How do we summarize the performance for a benchmark set with a single number?
  - The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

    $$AM = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$

  - Where Time_i is the execution time for the ith program of a total of n programs in the workload
  - A smaller mean indicates a smaller average execution time and thus improved performance
• Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))
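A short sketch of the arithmetic mean as a summary number. The execution times below are invented for illustration and do not come from any real benchmark run.

```python
def arithmetic_mean(times_s: list[float]) -> float:
    """AM = (1/n) x sum of Time_i over the n programs in the workload."""
    if not times_s:
        raise ValueError("need at least one execution time")
    return sum(times_s) / len(times_s)


# Hypothetical execution times (seconds) for the same three programs.
machine_a = [2.0, 10.0, 6.0]
machine_b = [3.0, 7.0, 5.0]

am_a = arithmetic_mean(machine_a)
am_b = arithmetic_mean(machine_b)
print(f"AM(A) = {am_a:.1f} s, total = {sum(machine_a):.1f} s")
print(f"AM(B) = {am_b:.1f} s, total = {sum(machine_b):.1f} s")
```

Since both machines run the same n programs, the mean is just the total time divided by n, so the machine with the smaller mean also has the smaller total execution time.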

SPEC Benchmarks (www.spec.org)

Integer benchmarks
  gzip       compression
  vpr        FPGA place & route
  gcc        GNU C compiler
  mcf        Combinatorial optimization
  crafty     Chess program
  parser     Word processing program
  eon        Computer visualization
  perlbmk    perl application
  gap        Group theory interpreter
  vortex     Object oriented database
  bzip2      compression
  twolf      Circuit place & route

FP benchmarks
  wupwise    Quantum chromodynamics
  swim       Shallow water model
  mgrid      Multigrid solver in 3D fields
  applu      Parabolic/elliptic pde
  mesa       3D graphics library
  galgel     Computational fluid dynamics
  art        Image recognition (NN)
  equake     Seismic wave propagation simulation
  facerec    Facial image recognition
  ammp       Computational chemistry
  lucas      Primality testing
  fma3d      Crash simulation fem
  sixtrack   Nuclear physics accel
  apsi       Pollutant distribution

SPEC 2000
• Does doubling the clock rate double the performance?
• Can a machine with a slower clock rate have better performance?

Other Performance Metrics
• Power consumption – especially in the embedded market where battery life is important (and passive cooling)
  - For power-limited applications, the most important metric is energy efficiency

Summary: Evaluating ISAs
• Design-time metrics:
  - Can it be implemented, in how long, at what cost?
  - Can it be programmed? Ease of compilation?
• Static metrics:
  - How many bytes does the program occupy in memory?
• Dynamic metrics:
  - How many instructions are executed? How many bytes does the processor fetch to execute the program?
  - How many clocks are required per instruction?
  - How "lean" a clock is practical?
• Best metric: time to execute the program!
  - Depends on the instruction set, the processor organization, and compilation techniques
[Figure: Inst. Count, CPI, Cycle Time]