FAMU-FSU College of Engineering: Computer Architecture, EEL 4713/5764

FAMU-FSU College of Engineering Computer Architecture EEL 4713/5764, Spring 2006 Dr. Michael Frank Module #5 – Computer Performance 1

Part I Background and Motivation June 2005 Computer Architecture, Background and Motivation 2

I Background and Motivation
Provide motivation, paint the big picture, introduce tools:
• Review components used in building digital circuits
• Present an overview of computer technology
• Understand the meaning of computer performance (or why a 2 GHz processor isn’t 2× as fast as a 1 GHz model)
Topics in This Part
Chapter 1 Combinational Digital Circuits
Chapter 2 Digital Circuits with Memory
Chapter 3 Computer System Technology
Chapter 4 Computer Performance

4 Computer Performance
Performance is key in design decisions; also cost and power
• It has been a driving force for innovation
• Isn’t quite the same as speed (higher clock rate)
Topics in This Chapter
4.1 Cost, Performance, and Cost/Performance
4.2 Defining Computer Performance
4.3 Performance Enhancement and Amdahl’s Law
4.4 Performance Measurement vs. Modeling
4.5 Reporting Computer Performance
4.6 The Quest for Higher Performance

Course Instructional Objective #1
• As the syllabus says:
  – At the completion of this course, students should be able to…
    – CIO #1. (Metrics) Calculate and interpret different performance and cost metrics of computer systems.
• This CIO should also support the following Program Outcomes:
  – Students graduating from the BSEE and BSCpE degree programs will have:
    – PO (a). (Apply) An ability to apply knowledge of mathematics, science and engineering;
    – PO (e). (Solve) An ability to identify, formulate, and solve engineering problems;
    – PO (o). (Topics) EE: A knowledge of electrical engineering applications selected from the …digital systems… areas; CpE: A knowledge of computer science and computer engineering topics including …computer architecture.
• Under “assessment instruments,” the syllabus says:
  – 1. Metrics. Students will solve exam problems in which they must analyze descriptions of hypothetical processors to determine their performance, cost-performance, and power-performance.

Module Instructional Objectives
I break down the CIO as follows:
CIO #1. Metrics (aeo). Calculate and interpret different performance and cost metrics of computer systems.
1.1. Know & apply (a) the definitions of clock frequency, MIPS, execution time, performance, throughput, cost-performance, and power-performance.
1.2. Explain why a given metric is or is not appropriate to use in a given situation.
1.3. Identify (e.i) the specific figure(s) of merit that are most appropriate for choosing between alternative computer designs in a given scenario.
1.4. Formulate (e.ii) appropriate symbolic equations for calculating a desired figure of merit from the provided information about an architectural scenario.
1.5. Solve (e.iii) problems involving the determination of which of several computer designs would be preferable in a given scenario.
1.6. Apply Amdahl’s Law (and generalizations thereof) in characterizing the relationship between an improvement to a particular component of a system and the overall improvement of the whole system.
1.7. Apply (a) the CPU Performance Equation that relates performance and execution time to instruction count, CPI, and clock frequency.

FAMU-FSU College of Engineering Topic #1 Overview of Some Important Metrics for Computer Systems: Performance, Cost, and Power Consumption 7

Important Performance Metrics
• Some metrics that are often used, but that do not always accurately reflect true performance:
  – CPU clock frequency = number of CPU clock cycles per unit time
  – MIPS rating = how many millions of instructions per second
  – Benchmark ratings (e.g., SPECmarks); more on this later
• Metrics that are “true” measures of performance:
  – Total execution time of a work unit (on real applications)
    – Wall-clock time from beginning to end of the execution process
  – Performance = 1/(execution time)
    – For a single work unit
  – Throughput = (# work units)/(execution time)
    – A generalized kind of performance

Cost and Cost-Related Metrics
• In the real world, the performance of a system is not the only thing that is important…
  – For example, its cost may also matter a lot!
    – E.g., the IBM Blue Gene/L has really high performance, but you’re not likely to buy it as your next computer…
  – We almost always have budgetary constraints.
• The usual goal: Maximize the cost-performance (i.e., cost-efficiency) of the systems that you buy.
  – Cost-performance = (performance) / (cost).
  – In other words, you want to get the best value for your dollar.
• This strategy roughly maximizes total throughput within a fixed budget.
  – Whenever you can have many systems gathered together working in parallel.

Throughput and Cost-Performance
• When there is a fixed budget, the maximum throughput of a parallel system is (roughly) the budget multiplied by the cost-performance of the individual serial units.

The Vanishing Computer Cost

Cost/Performance
Figure 4.1 Performance improvement as a function of cost.

Importance of Power Consumption
• In the real world, a computer’s performance and manufacturing cost are not the only important concerns…
  – Operating costs, usability, and other factors may also be important!
• Today, power consumption is an increasingly important factor that impacts all of the following:
  – Manufacturing cost, operating cost, performance, and usability!
• In general, higher power consumption means…
  – More manufacturing cost
    – for more aggressive power-delivery & cooling systems: power supplies, heat sinks, fans, etc.
  – Higher operating cost
    – More electricity consumed, frequent changing/recharging of batteries, inconvenience to user
  – Lower performance
    – Higher performance would exceed limits of cooling system
  – Poor usability / poorer overall quality of product:
    – Annoyingly noisy cooling fans or data center A/C units, laptops that burn up your lap
• So in many design scenarios, we may wish to maximize performance within a fixed power budget, or minimize power consumption to reach a desired performance.

Throughput and Power-Performance
• When there is a fixed power budget, the maximum throughput of a parallel system is (roughly) the power budget multiplied by the power-performance of the individual serial units.
  – This is exactly analogous to the earlier cost-performance analysis.

Performance Maximization within Cost and Power Constraints
Notation: C = cost, P = power, T = throughput.
• Suppose we have both a cost budget and a power budget,
  – and we want to maximize system throughput.
  – With a given unit design, we must maximize the number of parallel units.
• Then we have the following constraints on $n_{\text{units}}$:
  – $n_{\text{units}} C_{\text{unit}} \le C_{\max}$, so $n_{\text{units}} \le C_{\max}/C_{\text{unit}}$
  – $n_{\text{units}} P_{\text{unit}} \le P_{\max}$, and so $n_{\text{units}} \le P_{\max}/P_{\text{unit}}$
• The largest value of $n_{\text{units}}$ within these constraints is:
  – $n_{\text{units}} = \min(C_{\max}/C_{\text{unit}},\ P_{\max}/P_{\text{unit}})$
• and so the maximum feasible throughput is:
  – $T_{\text{tot}} = T_{\text{unit}}\, n_{\text{units}} = T_{\text{unit}} \min(C_{\max}/C_{\text{unit}},\ P_{\max}/P_{\text{unit}})$
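
A minimal Python sketch of the bound above. The function name `max_throughput` and the sample figures are illustrative assumptions, not taken from the slides; the sketch also rounds each ratio down to whole units, which the slide formula leaves implicit.

```python
def max_throughput(t_unit, c_unit, p_unit, c_max, p_max):
    """Return (n_units, total throughput) under a cost budget and a power budget."""
    # Only whole units can be bought and powered, so floor each ratio
    # (an assumption; the slide formula uses the raw ratios).
    n_by_cost = int(c_max // c_unit)
    n_by_power = int(p_max // p_unit)
    n_units = min(n_by_cost, n_by_power)  # the tighter constraint wins
    return n_units, t_unit * n_units


# Hypothetical unit: 10 work units/s, $200, 50 W; budgets of $1,000 and 400 W.
n_units, t_tot = max_throughput(t_unit=10, c_unit=200, p_unit=50, c_max=1000, p_max=400)
print(n_units, t_tot)
```

Here the cost budget is the binding constraint (5 units by cost versus 8 by power), so extra power headroom buys no additional throughput.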

Power-Performance and Energy Efficiency
• Power-performance means performance (i.e., throughput) per unit of power consumption:
  – power-performance = (throughput)/(power).
• Of course, since
  – throughput = (work units)/(time) and
  – power = (energy consumed)/(time),
• The times cancel, and so power-performance is equal to:
  – (work units)/(energy consumed)
• In other words, system power-performance is the same thing as the energy efficiency of the underlying computing process.
  – To maximize power-performance, minimize the amount of energy that is consumed per unit of work that is performed.

System Optimization Example
• Suppose you have a budget of $1M to set up a new corporate data center that should have a total power consumption of no more than 100 kW while serving web transactions in a simple database application. If your goal is to maximize total performance (in transactions/second) while staying within your budget and meeting the power constraint, which of the following types of machines would be preferable as a basis for the design?
  – Sun servers, each $15,000, burning 100 W, processing 100 transactions/second
  – PlayStation 2s, each $100 from a flea market, 30 W, processing 30 transactions/second
• Solution:
  – A PS2-based design could attain roughly 15× higher throughput (about 100,000 vs. 6,600 transactions/second) and use only 1/3 of the budget while still meeting the power constraint!
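
The comparison can be checked with a short Python sketch. The helper `plan` and its field names are hypothetical; the prices, wattages, and transaction rates are the ones quoted in the example, and whole-unit rounding is an assumption. With these figures the PS2 design is limited by power and the Sun design by cost, and the throughput ratio works out to roughly 15×.

```python
BUDGET_DOLLARS = 1_000_000
POWER_CAP_WATTS = 100_000


def plan(unit_cost, unit_watts, unit_tps):
    """Size a cluster of identical machines under both budgets (whole units only)."""
    units = min(BUDGET_DOLLARS // unit_cost, POWER_CAP_WATTS // unit_watts)
    return {
        "units": units,
        "tps": units * unit_tps,      # total transactions/second
        "cost": units * unit_cost,    # dollars actually spent
        "watts": units * unit_watts,  # power actually drawn
    }


sun = plan(unit_cost=15_000, unit_watts=100, unit_tps=100)
ps2 = plan(unit_cost=100, unit_watts=30, unit_tps=30)
ratio = ps2["tps"] / sun["tps"]

print(sun)
print(ps2)
print(f"PS2/Sun throughput ratio: {ratio:.2f}")
```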

FAMU-FSU College of Engineering Topic #2 Measuring Computer Performance 18

4.2 Defining Computer Performance
Figure 4.2 Pipeline analogy shows that imbalance between processing power and I/O capabilities leads to a performance bottleneck.

Concepts of Performance and Speedup
Performance = 1 / Execution time is simplified to Performance = 1 / CPU execution time
(Performance of M1) / (Performance of M2) = Speedup of M1 over M2 = (Execution time on M2) / (Execution time on M1)
Terminology:
• M1 is x times as fast as M2 (e.g., 1.5 times as fast)
• M1 is 100(x – 1)% faster than M2 (e.g., 50% faster)
CPU performance equation:
CPU time = (Clock cycles executed) × (Time per cycle)
         = Instructions × (Cycles per instruction) × (Time per cycle)
         = Instructions × CPI / (Clock frequency)
Instruction count, CPI, and clock rate are not completely independent, so improving one by a given factor may not lead to overall execution time improvement by the same factor.
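
A small Python sketch of the CPU performance equation. The two machines and their CPI values are hypothetical, chosen only to illustrate the last point: doubling the clock rate while CPI rises gives less than a 2× speedup.

```python
def cpu_time(instructions, cpi, clock_hz):
    """CPU time in seconds = Instructions x CPI / Clock frequency."""
    return instructions * cpi / clock_hz


def speedup(time_old, time_new):
    """Speedup of the new machine over the old one (ratio of execution times)."""
    return time_old / time_new


# Hypothetical machines running the same 1e9-instruction program.
t_m2 = cpu_time(instructions=1e9, cpi=2.0, clock_hz=1e9)  # 1 GHz, CPI 2.0
t_m1 = cpu_time(instructions=1e9, cpi=3.0, clock_hz=2e9)  # 2 GHz, but CPI 3.0

x = speedup(t_m2, t_m1)
print(f"M1 is {x:.2f} times as fast as M2, i.e. {100 * (x - 1):.0f}% faster")
```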

Faster Clock ≠ Shorter Running Time
Figure 4.3 Faster steps do not necessarily mean shorter travel time.

4.3 Performance Enhancement: Amdahl’s Law
f = fraction unaffected
p = speedup of the rest
$s = \dfrac{1}{f + (1 - f)/p} \le \min(p,\ 1/f)$
Figure 4.4 Amdahl’s law: speedup achieved if a fraction f of a task is unaffected and the remaining (1 – f) part runs p times as fast.

Amdahl’s Law Used in Design
Example 4.1
A processor spends 30% of its time on flp addition, 25% on flp mult, and 10% on flp division. Evaluate the following enhancements, each costing the same to implement:
a. Redesign of the flp adder to make it twice as fast.
b. Redesign of the flp multiplier to make it three times as fast.
c. Redesign of the flp divider to make it 10 times as fast.
Solution
a. Adder redesign speedup = 1 / [0.7 + 0.3 / 2] = 1.18
b. Multiplier redesign speedup = 1 / [0.75 + 0.25 / 3] = 1.20
c. Divider redesign speedup = 1 / [0.9 + 0.1 / 10] = 1.10
What if both the adder and the multiplier are redesigned?
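
The three cases, and the closing question, can be worked with a short Python sketch. The function `amdahl_speedup` is an illustrative helper (not from the slides) that extends the single-enhancement formula to several enhanced parts by summing their shrunken time shares.

```python
def amdahl_speedup(enhanced):
    """Overall speedup when each (fraction, factor) pair runs `factor` times as fast.

    `enhanced` is a list of (fraction of original time, speedup factor) pairs;
    whatever fraction is not listed is left unaffected.
    """
    unaffected = 1.0 - sum(fraction for fraction, _ in enhanced)
    new_time = unaffected + sum(fraction / factor for fraction, factor in enhanced)
    return 1.0 / new_time


adder = amdahl_speedup([(0.30, 2)])
multiplier = amdahl_speedup([(0.25, 3)])
divider = amdahl_speedup([(0.10, 10)])
both = amdahl_speedup([(0.30, 2), (0.25, 3)])  # adder and multiplier together

print(f"{adder:.2f} {multiplier:.2f} {divider:.2f} {both:.2f}")
```

Redesigning both the adder and the multiplier gives 1 / [0.45 + 0.3 / 2 + 0.25 / 3], about 1.46, which is more than either alone but less than the product of the two individual speedups would suggest if they applied to the whole program.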

4.4 Performance Measurement vs. Modeling
Figure 4.5 Running times of six programs on three machines.

Performance Benchmarks
Example 4.3
You are an engineer at Outtel, a start-up aspiring to compete with Intel via its new processor design that outperforms the latest Intel processor by a factor of 2.5 on floating-point instructions. This level of performance was achieved by design compromises that led to a 20% increase in the execution time of all other instructions. You are in charge of choosing benchmarks that would showcase Outtel’s performance edge.
a. What is the minimum required fraction f of time spent on floating-point instructions in a program on the Intel processor to show a speedup of 2 or better for Outtel?
Solution
a. We use a generalized form of Amdahl’s formula in which a fraction f is speeded up by a given factor (2.5) and the rest is slowed down by another factor (1.2):
$\dfrac{1}{1.2(1 - f) + f/2.5} \ge 2 \;\Rightarrow\; f \ge 0.875$
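
The algebra behind the bound, written out step by step as a LaTeX sketch (using f/2.5 = 0.4f):

```latex
\begin{align*}
\frac{1}{1.2(1-f) + f/2.5} &\ge 2 \\
1.2(1-f) + 0.4f &\le 0.5 \\
1.2 - 0.8f &\le 0.5 \\
f &\ge \frac{0.7}{0.8} = 0.875
\end{align*}
```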

Performance Estimation
Average CPI = $\sum_{\text{all instruction classes}}$ (Class-i fraction) × (Class-i CPI)
Machine cycle time = 1 / Clock rate
CPU execution time = Instructions × (Average CPI) / (Clock rate)
Table 4.3 Usage frequency, in percentage, for various instruction classes in four representative applications.
Instr’n class     Data compression   C language compiler   Reactor simulation   Atomic motion modeling
A: Load/Store     25                 37                    32                   37
B: Integer        32                 28                    17                   5
C: Shift/Logic    16                 13                    2                    1
D: Float          0                  0                     34                   42
E: Branch         19                 13                    9                    10
F: All others     8                  9                     6                    4

MIPS Rating Can Be Misleading
Example 4.5
Two compilers produce machine code for a program on a machine with two classes of instructions. Here are the numbers of instructions:
Class   CPI   Compiler 1   Compiler 2
A       1     600M         400M
B       2     400M         400M
a. What are the run times of the two programs with a 1 GHz clock?
b. Which compiler produces faster code and by what factor?
c. Which compiler’s output runs at a higher MIPS rate?
Solution
a. Running time for compiler 1 (compiler 2) = (600M × 1 + 400M × 2) / $10^9$ = 1.4 s (1.2 s)
b. Compiler 2’s output runs 1.4 / 1.2 = 1.17 times as fast
c. MIPS rating for compiler 1, with CPI = 1.4 (compiler 2, with CPI = 1.5) = 1000 / 1.4 = 714 (1000 / 1.5 = 667)
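
A Python sketch of this example. The helper names are illustrative, and the class-B count of 400M for compiler 2 is an inferred value: it is the count that reproduces the stated 1.2 s run time and CPI of 1.5.

```python
CLOCK_HZ = 1e9               # 1 GHz clock
CPI = {"A": 1, "B": 2}       # cycles per instruction, by class

counts = {
    "compiler1": {"A": 600e6, "B": 400e6},
    "compiler2": {"A": 400e6, "B": 400e6},  # class B inferred from the 1.2 s run time
}


def run_time(ic):
    """Seconds = total clock cycles / clock rate."""
    return sum(CPI[cls] * n for cls, n in ic.items()) / CLOCK_HZ


def mips(ic):
    """Millions of instructions executed per second."""
    return sum(ic.values()) / run_time(ic) / 1e6


for name, ic in counts.items():
    print(f"{name}: {run_time(ic):.1f} s, {mips(ic):.0f} MIPS")
```

Compiler 2's code finishes sooner yet has the lower MIPS rating, because it executes fewer instructions overall but a larger share of them are the slow class-B kind.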

4.5 Reporting Computer Performance
Table 4.4 Measured or estimated execution times for three programs.
                Time on machine X   Time on machine Y   Speedup of Y over X
Program A       20                  200                 0.1
Program B       1000                100                 10.0
Program C       1500                150                 10.0
All 3 prog’s    2520                450                 5.6
Analogy: If a car is driven to a city 100 km away at 100 km/hr and returns at 50 km/hr, the average speed is not (100 + 50) / 2 but is obtained from the fact that it travels 200 km in 3 hours.

Comparing the Overall Performance
Table 4.4 Measured or estimated execution times for three programs.
                  Time on machine X   Time on machine Y   Speedup of Y over X   Speedup of X over Y
Program A         20                  200                 0.1                   10
Program B         1000                100                 10.0                  0.1
Program C         1500                150                 10.0                  0.1
Arithmetic mean                                           6.7                   3.4
Geometric mean                                            2.15                  0.46
Geometric mean does not yield a measure of overall speedup, but provides an indicator that at least moves in the right direction.
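
The summary rows can be reproduced with Python's standard `statistics` module. This is a sketch with illustrative variable names; Program B's time of 100 on machine Y is an inferred value, the one consistent with the 450 total and the 10.0 speedup.

```python
from statistics import geometric_mean, mean

times_x = {"A": 20, "B": 1000, "C": 1500}
times_y = {"A": 200, "B": 100, "C": 150}  # B = 100 inferred from the 450 total

# Per-program speedups in both directions.
y_over_x = [times_x[p] / times_y[p] for p in times_x]
x_over_y = [times_y[p] / times_x[p] for p in times_x]

am_yx, am_xy = mean(y_over_x), mean(x_over_y)
gm_yx, gm_xy = geometric_mean(y_over_x), geometric_mean(x_over_y)
overall = sum(times_x.values()) / sum(times_y.values())  # total-time speedup of Y over X

print(f"arithmetic: {am_yx:.1f} vs {am_xy:.1f}")
print(f"geometric:  {gm_yx:.2f} vs {gm_xy:.2f}")
print(f"total-time speedup of Y over X: {overall:.1f}")
```

The arithmetic means make each machine look faster than the other (both exceed 1), whereas the two geometric means are reciprocals of each other, so they at least agree on which machine comes out ahead.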

4.6 The Quest for Higher Performance
State of available computing power ca. the early 2000s:
• Gigaflops on the desktop
• Teraflops in the supercomputer center
• Petaflops on the drawing board
Note on terminology (see Table 3.1)
Prefixes for large units: Kilo = $10^3$, Mega = $10^6$, Giga = $10^9$, Tera = $10^{12}$, Peta = $10^{15}$
For memory: K = $2^{10}$ = 1024, M = $2^{20}$, G = $2^{30}$, T = $2^{40}$, P = $2^{50}$
Prefixes for small units: micro = $10^{-6}$, nano = $10^{-9}$, pico = $10^{-12}$, femto = $10^{-15}$

Supercomputers
Figure 4.7 Exponential growth of supercomputer performance.

The Most Powerful Computers
Figure 4.8 Milestones in the DOE’s Accelerated Strategic Computing Initiative (ASCI) program with extrapolation up to the PFLOPS level.

Performance is Important, But It Isn’t Everything
Figure 25.1 Trend in energy consumption per MIPS of computational power in general-purpose processors and DSPs.

Computer Architecture Lecture Notes Spring 2005 Dr. Michael P. Frank Competency Area 2: Performance Metrics Lecture 1 34

Performance Metrics
• Why is it necessary for us to study performance?
  – Performance is usually the key to the effectiveness of a system (hardware + software).
  – Performance is critical to customers (purchasers); thus, we as designers and architects must also make it a priority.
  – Performance must be assessed and understood in order for a system to communicate efficiently with peripheral devices.

Topic: Computer Performance Sub-Topic: Airplane Analogy 36

Performance Metrics
• How can we determine performance? Consider this example from the transportation industry:
Aircraft          Passenger Capacity   Fuel Capacity   Cruising Range   Speed   Throughput   Cost
Boeing 747-400    421                  216,847         10,734           920     387,320      0.048
Boeing 767-300    270                  91,380          10,548           853     230,310      0.032
Airbus 340-300    284                  139,681         12,493           869     246,796      0.039
Airbus 340-300    120                  23,859          4,442            837     100,440      0.045
BAE-146-200       77                   11,750          2,406            708     54,516       0.063
Concorde          132                  119,501         6,230            2,180   287,760      0.145
Dash-8            50                   3,202           1,389            531     26,550       0.046
Car               5                    60              700              100     500          0.017

Performance Example
• Fuel Capacity in liters
• Range in kilometers
• Speed in kilometers/hour
• Throughput is defined as (# of passengers) × (cruising speed)
• Cost is given as (fuel capacity) / (passengers × range)
Which mode of transportation has the “best” performance?
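
The two derived columns follow directly from these definitions. A minimal Python sketch for three of the rows; the dictionary layout and function names are illustrative, and the pairing of the Concorde and Car names with their figures is an assumption based on realigning the flattened table.

```python
# (passengers, fuel capacity in liters, range in km, cruising speed in km/h)
fleet = {
    "Boeing 747-400": (421, 216_847, 10_734, 920),
    "Concorde": (132, 119_501, 6_230, 2_180),
    "Car": (5, 60, 700, 100),
}


def throughput(passengers, speed_kmh):
    """Passenger-kilometers carried per hour."""
    return passengers * speed_kmh


def fuel_cost(passengers, fuel_liters, range_km):
    """Liters of fuel per passenger-kilometer."""
    return fuel_liters / (passengers * range_km)


for name, (passengers, fuel, rng, speed) in fleet.items():
    print(f"{name}: throughput {throughput(passengers, speed):,}, "
          f"cost {fuel_cost(passengers, fuel, rng):.3f}")
```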

Performance Example
• It depends on how we define performance.
• Consider raw speed:
  – Getting from one place to another quickly
  – Best: Concorde (2,180 km/h); worst: Car (100 km/h)

Performance Example
• What if we’re interested in the rate at which people are carried (throughput)?
  – Best: Boeing 747-400 (387,320); worst: Car (500)

Performance Example
• Oftentimes we relate performance and cost. Thus we can consider the amount of fuel used per passenger over the vehicle’s range (the Cost column):
  – Best plane: Boeing 767-300 (0.032); best overall: Car (0.017)

Topic: Computer Performance Sub-Topic: Basic Concepts: Performance, Throughput, and Execution Time 42

Performance Metrics
• Similar measures of performance are used for computers.
  – Number of computations done per unit of time
  – Cost of computations
    – Possibly several aspects of cost can be considered, including initial purchase price, operating cost, cost of training users of the system, etc.
• Common performance measures are
  1. RESPONSE TIME: the amount of time it takes a program to complete (a.k.a. execution time)
  2. THROUGHPUT: the total amount of work done in a given amount of time

Performance Metrics
Example: Given the following actions:
1. Replacing the processor with a faster version
2. Adding additional processors to perform separate tasks in a multiprocessor system
do they (a) increase throughput, (b) decrease response time, or (c) both?

Defining Performance
• Our focus will be primarily on execution time.
• To maximize performance implies a minimization in execution time:
  $\text{Performance}_X = \dfrac{1}{\text{ExecutionTime}_X}$
• For two machines X and Y:
  $\text{Performance}_Y > \text{Performance}_X \iff \text{ExecutionTime}_Y < \text{ExecutionTime}_X$
• We say that machine Y is faster than machine X.

Performance Metrics
Notes:
(1) If X is n times faster than Y, then $\dfrac{\text{Performance}_X}{\text{Performance}_Y} = \dfrac{\text{ExecutionTime}_Y}{\text{ExecutionTime}_X} = n$
(2) To avoid confusion, we’ll use the following terminology:
    We say                      We mean
    “improve performance”       increase performance
    “improve execution time”    decrease execution time

Performance Example If machine A runs a program in 10 seconds and machine B runs the same program in 15 seconds, how much faster is A than B?
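
A worked answer as a LaTeX sketch, using the ratio of execution times as the measure of relative performance:

```latex
\[
\frac{\text{Performance}_A}{\text{Performance}_B}
  = \frac{\text{ExecutionTime}_B}{\text{ExecutionTime}_A}
  = \frac{15\ \text{s}}{10\ \text{s}}
  = 1.5
\]
```

So A is 1.5 times as fast as B, which is the same as saying A is 50% faster.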

Topic: Computer Performance Sub-Topic: Measuring Performance 49

Measuring Performance
• Quite simply, TIME is the measure of computer performance!
• The most straightforward definition of time is wall-clock time (also called elapsed time or response time): the total time to complete a task, including system overhead activities such as input/output tasks, disk and memory accesses, etc.

Measuring Performance
• CPU time is the time it takes to complete a task, excluding the time it takes for I/O waits. It has two components:
  – USER CPU TIME: the time the CPU is busy executing the user’s code.
  – SYSTEM CPU TIME: the time the CPU spends performing operating system tasks.
• Note: Sometimes system and user CPU times are difficult to distinguish, since it is hard to assign responsibility for OS activities.

Measuring Performance
Example: To understand the concept of CPUTime, consider the UNIX command ‘time’. Once typed, it may return a response similar to
    90.7u 12.9s 2:39 65%
What do these numbers mean?
• 90.7u: user CPU time (seconds)
• 12.9s: system CPU time (seconds)
• 2:39: elapsed time (minutes:seconds)
• 65%: percentage of elapsed time that is CPU time
a. What is the total CPUTime?
b. What percentage of time is spent on I/O and other programs?
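
Both questions can be answered from the four fields of the report. A minimal Python sketch; `parse_time_report` is a hypothetical helper that handles only the exact layout quoted in the example, not the output of any particular shell.

```python
def parse_time_report(report):
    """Split a 'USERu SYSs M:SS PCT%' report into CPU and elapsed seconds."""
    user_field, system_field, elapsed_field, _percent = report.split()
    user = float(user_field.rstrip("u"))      # user CPU seconds
    system = float(system_field.rstrip("s"))  # system CPU seconds
    minutes, seconds = elapsed_field.split(":")
    elapsed = int(minutes) * 60 + float(seconds)
    cpu = user + system                       # total CPU time
    return cpu, elapsed, cpu / elapsed


cpu, elapsed, cpu_fraction = parse_time_report("90.7u 12.9s 2:39 65%")
other_fraction = 1 - cpu_fraction  # I/O waits and other programs

print(f"total CPU time: {cpu:.1f} s of {elapsed:.0f} s elapsed")
print(f"CPU share: {cpu_fraction:.0%}, I/O and other programs: {other_fraction:.0%}")
```

The total CPU time is 90.7 + 12.9 = 103.6 s out of 159 s elapsed, which matches the reported 65% and leaves about 35% for I/O and other programs.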

Measuring Performance
• Other notes:
  1. SYSTEM PERFORMANCE: reciprocal of elapsed time on an unloaded system (e.g., no user applications)
  2. CPU PERFORMANCE: reciprocal of user CPU time
  3. CLOCK CYCLES (CC): discrete time intervals measured by the processor clock running at a constant rate
  4. CLOCK PERIOD: time it takes to complete a clock cycle
  5. CLOCK RATE: inverse of clock period

Measuring Performance
• Consider CPU performance:
  $\text{CPUTime} = CC \times t_{CC}$
  Also, $\text{CPUTime} = CC / \text{ClockRate}$
  (CC = number of clock cycles for the program, $t_{CC}$ = clock period)

Measuring Performance • Since the execution time clearly depends on the number of instructions for a program, we must also define another performance metric: CPI = average number of clock cycles per instruction

Measuring Performance
• Now we have two more equations that we can define for CPUTime:
  $\text{CPUTime} = IC \times CPI \times t_{CC}$
  $\text{CPUTime} = IC \times CPI / \text{ClockRate}$

Measuring Performance
• In summary, performance metrics include:
Components of Performance   Units of Measure
CPUTime                     Seconds for program
IC                          # of instructions for a program
CPI                         Average # of clock cycles per instruction
$t_{CC}$                    Seconds per clock cycle

Measuring Performance Example, Suppose Machine A implements the same ISA as Machine B. Given and for some program, and for the same program, determine which machine is faster and by how much.

Breakdown by Instruction Category
• Recall CPI = Clock cycles (CC) per instruction
• But, CPI depends on many factors, including:
  – Memory system behavior
  – Processor structure
  – Availability of special processor features
    – E.g., floating point, graphics, etc.
• To characterize the effect of changing specific aspects of the architecture, we find it helpful to break down CC into components due to different classes (categories) of instructions:
  $CC = \sum_{i=1}^{n} CPI_i \times IC_i$
  – Where:
    – $IC_i$ = instruction count for class i
    – $CPI_i$ = avg. cycles for insts. in class i
    – n = the number of instruction classes

Example
• Suppose a processor has 3 categories of instructions A, B, C with the following CPIs:
    Instr. class   $CPI_i$
    A              1
    B              2
    C              3
• And, suppose a compiler designer is comparing two code sequences for a given program that have the following instruction counts:
    Code seq.   $IC_A$   $IC_B$   $IC_C$
    1           2        1        2
    2           4        1        1
• Determine:
  (i) Which code sequence executes the most instructions?
  (ii) Which will be faster?
  (iii) What is the average CPI for each code sequence?

Solution to Example
• Part (i):
  – $IC_{seq1}$ = 2 + 1 + 2 = 5 instructions
  – $IC_{seq2}$ = 4 + 1 + 1 = 6 instructions
  – Code sequence 2 executes more instructions
• Part (ii):
  – $CC_{seq1} = \sum_i (CPI_i \times IC_i)$ = 1 × 2 + 2 × 1 + 3 × 2 = 10 cycles
  – $CC_{seq2} = \sum_i (CPI_i \times IC_i)$ = 1 × 4 + 2 × 1 + 3 × 1 = 9 cycles
  – Code sequence 2 takes fewer cycles, so it is faster!
• Part (iii):
  – $CPI_{seq1} = \{CC/IC\}_{seq1}$ = 10 cyc. / 5 inst. = 2
  – $CPI_{seq2} = \{CC/IC\}_{seq2}$ = 9 cyc. / 6 inst. = 1.5
• Which part should we consult to tell us which code sequence has better performance?
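
The same calculation as a Python sketch; the `summarize` helper and the dictionary layout are illustrative, while the CPIs and instruction counts are those of the example.

```python
CPI = {"A": 1, "B": 2, "C": 3}  # cycles per instruction, by class

sequences = {
    1: {"A": 2, "B": 1, "C": 2},
    2: {"A": 4, "B": 1, "C": 1},
}


def summarize(ic):
    """Return (instruction count, clock cycles, average CPI) for one sequence."""
    instructions = sum(ic.values())
    cycles = sum(CPI[cls] * n for cls, n in ic.items())
    return instructions, cycles, cycles / instructions


for seq, ic in sequences.items():
    instructions, cycles, avg_cpi = summarize(ic)
    print(f"seq {seq}: {instructions} instructions, {cycles} cycles, CPI {avg_cpi}")
```

Since both sequences run on the same machine at the same clock rate, the cycle count from part (ii) is what decides execution time; instruction count and CPI taken alone point in opposite directions here.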

Topic: Computer Performance Subtopic: Benchmarks & Performance Summaries 65

Importance of Benchmarks
• How do we evaluate and compare the performance of different architectures?
  – We use benchmarks: programs that are specifically chosen to measure performance.
  – A workload is a set of programs.
  – Benchmarks consist of workloads that (the user hopes) will predict the performance of the actual workload.
• It is important that benchmarks consist of realistic workloads
  – Not simple toy programs or code fragments
• Manufacturers often try to fine-tune their machines to do well on popular benchmarks that were too simple
  – This does not always mean the machine will do well on real programs!

SPEC benchmark
• A popular source of benchmarks is SPEC
  – Standard Performance Evaluation Corporation
• General CPU benchmarks: CPU2000.
  – Includes programs such as:
    – gzip (compression), vpr (FPGA place & route), gcc (compiler), crafty (chess), vortex (database)
• SPEC also offers specialized benchmarks for:
  – Graphics, parallel computing, Java, mail servers, network fileservers, web servers
• They publish reports on benchmark results for various systems.
  – Main metric: SPECRatio
    – Proportional to average inverse execution time. The bigger, the better!
• Reproducibility of results is very important!

Summarizing Performance
• How do we summarize performance in a way that accurately compares different machines?
  – One common approach: Total Execution Time (TET)
    – Based on:
  – Or, if the workload includes n different programs, we can calculate the average or Arithmetic Mean (AM):
    $AM = \dfrac{1}{n} \sum_{i=1}^{n} \text{Time}_i$
    – Smaller AM ⇒ improved performance
  – Other methods are also used:
    – Weighted arithmetic mean, geometric mean ratio.

Topic: Computer Performance Subtopic: Performance Improvement and Amdahl’s Law 69

Performance Improvement
• Recall the formula: CPUTime = IC × CPI / $f_{cyc}$.
  – Thus, CPU performance is Perf = $f_{cyc}$ / (IC × CPI).
• Thus we can see 3 basic ways to improve CPU performance on a given task:
  – Increase clock frequency
  – Decrease CPI
    – By improved processor organization
  – Decrease instruction count
    – By compiler enhancement,
    – a change in ISA design (new instructions), or
    – a more efficient application algorithm.
• However, we have to be careful!
  – Sometimes, improving one of these can hurt others!

Generalized Cost Measures
• In this course, we will often focus on ways to minimize the execution time of programs.
  – Either CPU time, or number of clock cycles.
• Execution time is one example of what we may call a generalized cost measure (GCM).
  – A GCM is any property of a HW/SW design that tells us how much of some valued resource is used up when the system is manufactured or used.
• Other examples of important GCMs include:
  – Energy consumed by a computation
  – Silicon chip area used up by a circuit design
  – Dollar cost to manufacture a computer component
• We will study some general engineering principles that apply to the minimization of any GCM in any system.

Additive Cost Measures
• Suppose we have a GCM C for a system.
• Often, the total cost C can be represented as a sum of independent cost components:
  – E.g., $C = C_1 + C_2 + \dots + C_n$, or $C = \sum_{i=1}^{n} C_i$.
• These could correspond to the resources used by individual subsystems of the whole system.
  – Or to the resources used in doing particular categories of tasks.
• For example, execution time T can be broken down as the sum of the time $T_{\mathrm{fp}}$ taken by floating-point instructions and the time $T_{\mathrm{oth}}$ taken by all other instructions.
  – That is, $T = T_{\mathrm{fp}} + T_{\mathrm{oth}}$.

Improving Part of a System
• Suppose a GCM is broken down as C = A + B.
  – The total cost is the sum of two components, A and B.
• Now suppose you are considering an improvement to the system design that affects only cost component B.
  – Suppose you reduce it by a factor f, to B′ = B/f.
• The new total cost is then C′ = A + B′.
  – The cost of component A is unaffected.
• The overall (total) cost has therefore been reduced by the factor:
  $$f_{\mathrm{overall}} = \frac{C}{C'} = \frac{A + B}{A + B/f}$$

Diminishing Returns
• Suppose we continue improving (reducing) a cost component by larger and larger factors.
  – Does this mean the system’s total cost will be reduced by correspondingly large factors? NO!
• Even if we improved one cost component (B in our example) by a factor of f = ∞, note that:
  $$\lim_{f \to \infty} f_{\mathrm{overall}} = \lim_{f \to \infty} \frac{A + B}{A + B/f} = \frac{A + B}{A} = 1 + \frac{B}{A}$$
• Even here, the overall cost reduction factor $f_{\mathrm{overall}}$ would still be only the finite value 1 + B/A!
  – The system can be improved by at most this factor if we improve just the one component B.

Diminishing Returns Example
• Suppose a particular chip contains B = 1 cm² of logic circuits and A = 2 cm² of cache memory.
  – The total cost (in terms of area) is C = A + B = 3 cm².
• Now, let’s go crazy trying to simplify and shrink the design of just the logic circuit…
  – What is the maximum factor by which this tactic can reduce the area cost of the whole design (logic + memory)?
• Obviously, this can reduce the total area from 3 cm² to no less than 2 cm² (the area of the memory alone),
  – or shrink it by a factor of at most $f_{\mathrm{overall}} = 3/2 = 1.5$.
• Note that we could have obtained the same answer from the equation $f_{\mathrm{overall,max}} = 1 + B/A = 1 + 1/2 = 1.5$.
[Figure: chip layout with 1 cm² of logic and 2 cm² of memory.]
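The chip example can be checked numerically. The sketch below takes the overall factor to be C/C′ = (A + B)/(A + B/f), which follows from C = A + B and C′ = A + B/f. The function names and the list of trial values for f are hypothetical.

```python
import math

def overall_factor(a, b, f):
    """Overall cost reduction when only component b is reduced by a factor f."""
    return (a + b) / (a + b / f)

def max_overall_factor(a, b):
    """Upper bound on the overall factor as f grows without limit: 1 + b/a."""
    return 1.0 + b / a

# Areas from the slide: 2 cm^2 of memory (unaffected), 1 cm^2 of logic (shrunk).
memory_area = 2.0
logic_area = 1.0

for f in (2, 10, 100, math.inf):
    print(f"f = {f}: overall factor = {overall_factor(memory_area, logic_area, f):.4f}")

print("bound 1 + B/A =", max_overall_factor(memory_area, logic_area))
```

Shrinking the logic by a factor of 2 already gives an overall factor of 1.2, and no amount of further shrinking can push the overall factor past 1.5.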

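Graph Showing Diminishing Returns
[Figure: graph of diminishing returns; the labeled quantities are the initial part/rest ratio (B/A) and the improvement factor (f).]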

Important Lessons to Take from This
• It’s probably not worth spending significant design time extensively improving just a single component of a system,
  – unless that component accounts for a dominant part of the total cost (by some measure) to begin with (B/A >> 1).
• It’s only worth improving a given component up to the point where it is no longer dominant.
  – Reducing it further won’t make a lot of difference.
• Therefore, all components with significant costs must be improved together in order to significantly improve an entire design.
  – Well-engineered systems will tend to have roughly comparable costs in all of their major components.

Other Ways to Calculate $f_{\mathrm{overall}}$
• Earlier, we saw this formula for the overall improvement factor $f_{\mathrm{overall}}$ resulting from improving component B by the factor f:
  $$f_{\mathrm{overall}} = \frac{A + B}{A + B/f}$$
• But what if we don’t know the values of A and B?
  – What if we only know their relative sizes?
    – Fortunately, it turns out that we can still calculate $f_{\mathrm{overall}}$.
• Let us define $\mathrm{frac}_{\mathrm{enh}} = B/C = B/(A+B)$ to be the fraction of the original total system cost that is accounted for by the particular part B that is going to be enhanced.
  – Then, the fraction of cost accounted for by A (the rest of the system) is $1 - \mathrm{frac}_{\mathrm{enh}} = A/C = A/(A+B)$.
• Our equation for $f_{\mathrm{overall}}$ can then be re-expressed in terms of the quantities $\mathrm{frac}_{\mathrm{enh}}$ and $1 - \mathrm{frac}_{\mathrm{enh}}$, as follows…

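Calculating $f_{\mathrm{overall}}$ in terms of $\mathrm{frac}_{\mathrm{enh}}$
• Let’s re-express $f_{\mathrm{overall}}$ in terms of $\mathrm{frac}_{\mathrm{enh}} = B/C$, dividing numerator and denominator by C = A + B:
  $$f_{\mathrm{overall}} = \frac{C}{C'} = \frac{A + B}{A + B/f} = \frac{1}{A/C + (B/C)/f} = \frac{1}{(1 - \mathrm{frac}_{\mathrm{enh}}) + \mathrm{frac}_{\mathrm{enh}}/f}$$
• We will call this form for $f_{\mathrm{overall}}$ the Generalized Amdahl’s Law. (We’ll see why in a moment.)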

Amdahl’s Law Proper
• We saw that execution time is one valid cost measure.
  – In such a case, note that the factor by which a cost is reduced is the speedup, or the factor by which performance is improved.
• We thus rename the improvement factor f of B (the enhanced part) to $\mathrm{speedup}_{\mathrm{enh}}$, and the overall improvement factor $f_{\mathrm{overall}}$ becomes $\mathrm{speedup}_{\mathrm{overall}}$, and we get:
  $$\mathrm{speedup}_{\mathrm{overall}} = \frac{1}{(1 - \mathrm{frac}_{\mathrm{enh}}) + \mathrm{frac}_{\mathrm{enh}}/\mathrm{speedup}_{\mathrm{enh}}}$$
• This is called Amdahl’s Law, and it is one of the most widely hyped quantitative principles of processor design.
  – But as we can see, it is not a special law of CPU architecture, but just an application of the universal engineering principle of diminishing returns, which we discussed earlier.
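A minimal sketch of Amdahl's Law in its fraction form, speedup_overall = 1 / ((1 − frac_enh) + frac_enh / speedup_enh). The function names and the 40% / 10× example values are hypothetical. The last lines check that the fraction form agrees with the component form (A + B)/(A + B/f) for one arbitrary choice of A, B, and f.

```python
def amdahl_speedup(frac_enh, speedup_enh):
    """Overall speedup when a fraction frac_enh of the time is sped up by speedup_enh."""
    return 1.0 / ((1.0 - frac_enh) + frac_enh / speedup_enh)

def max_speedup(frac_enh):
    """Limit of the overall speedup as speedup_enh grows without bound."""
    return 1.0 / (1.0 - frac_enh)

# Hypothetical case: 40% of execution time is enhanced by a factor of 10.
print("overall speedup:", amdahl_speedup(0.4, 10.0))
print("upper bound:", max_speedup(0.4))

# Consistency check against the component form with A = 2, B = 1, f = 2.
a, b, f = 2.0, 1.0, 2.0
component_form = (a + b) / (a + b / f)
fraction_form = amdahl_speedup(b / (a + b), f)
print("component form:", component_form, "fraction form:", fraction_form)
```

Only the relative size of the enhanced part enters the fraction form, so it can be used when the absolute values of A and B are unknown.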

Key Points from This Module
• Throughput vs. Response Time
• Performance as Inverse Execution Time
• Speedup Factors
• Averaging Benchmark Results
• CPU Performance Equation:
  – Execution time = $\mathrm{IC} \times \mathrm{CPI} \times t_{\mathrm{cc}}$
  – Performance = $f_{\mathrm{cc}} / (\mathrm{IC} \times \mathrm{CPI})$
• Amdahl’s Law:
  – C′ = A + B/f, where:
    – C′ = Execution time after improvement
    – B = Part of execution time affected by improvement
    – f = Factor of improvement (speedup of enhanced part)
    – A = Part of execution time unaffected by improvement
  – Implies an overall speedup of C/C′ = (A + B)/(A + B/f).

Example Performance Calculation
• Suppose a program takes 10 seconds on computer A,
  – and suppose computer A has a 4 GHz clock.
• We want a new computer B to run the program in 6 seconds.
  – Suppose that increasing the clock speed is only possible with a substantial processor redesign,
    – which will result in 1.2× as many clock cycles being needed to execute the program.
• What clock rate is needed?
  – Answer: 4 GHz × (10/6) × 1.2 = 8 GHz
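The answer can be reproduced from the relation cycles = time × clock rate. The sketch below uses the slide's numbers; the function and parameter names are hypothetical.

```python
def required_clock_hz(old_clock_hz, old_time_s, target_time_s, cycle_factor):
    """Clock rate needed to hit target_time_s when the cycle count scales by cycle_factor."""
    old_cycles = old_time_s * old_clock_hz   # cycles used on the old machine
    new_cycles = old_cycles * cycle_factor   # cycles needed on the redesigned machine
    return new_cycles / target_time_s        # cycles per second required

clock_b = required_clock_hz(old_clock_hz=4e9, old_time_s=10.0,
                            target_time_s=6.0, cycle_factor=1.2)
print(f"required clock rate: {clock_b / 1e9:.2f} GHz")
```

Computer A spends 40 billion cycles on the program, the redesign needs 48 billion, and fitting those into 6 seconds takes 8 GHz.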

Another Example
• Consider two different implementations of a given ISA, running a given benchmark:
  – Processor A has a cycle time of 250 ps
    – and a CPI of 2.0.
  – Processor B has a cycle time of 500 ps
    – and a CPI of 1.2.
• Which computer is faster on this benchmark, and by what factor?
  – Processor A takes 250 ps × 2.0 = 500 ps per instruction.
  – Processor B takes 500 ps × 1.2 = 600 ps per instruction.
  – Both run the same instructions (same ISA, same benchmark), so A is faster by a factor of 600/500 = 6/5 = 1.2×.
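A short sketch of this comparison, assuming (as the slide does implicitly) that both processors execute the same number of instructions, so the ratio of times per instruction equals the ratio of execution times. The function name is hypothetical.

```python
def time_per_instruction_ps(cycle_time_ps, cpi):
    """Average time per instruction, in picoseconds: cycle time x CPI."""
    return cycle_time_ps * cpi

time_a = time_per_instruction_ps(250.0, 2.0)  # Processor A
time_b = time_per_instruction_ps(500.0, 1.2)  # Processor B

# Same ISA and same benchmark, so the instruction count cancels out of the ratio.
speedup_a_over_b = time_b / time_a
print(f"A: {time_a} ps/instr, B: {time_b} ps/instr, A is {speedup_a_over_b:.2f}x faster")
```

Processor B's lower CPI does not make up for its cycle time being twice as long.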

Another Example
• Suppose some Java application takes 15 seconds on a certain machine.
• A new Java compiler is released that requires only 0.6× as many dynamic instructions to run the application.
  – Unfortunately, it also increases the CPI by a factor of 1.1.
    – Presumably, it uses more multi-cycle instructions.
• How fast will the application run when compiled using the new compiler?
  – It will take 15 × 0.6 × 1.1 = 9.9 seconds to run.
  – It will be 15/9.9 = 50/33 = 1.515…× faster.
    – Only slightly more than 50% faster than before.
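Because execution time is IC × CPI × cycle time and the clock is unchanged, the new time is the old time scaled by the instruction-count factor and the CPI factor. A minimal sketch with the slide's numbers follows; the function and variable names are hypothetical.

```python
def scaled_time(old_time_s, ic_factor, cpi_factor, cycle_time_factor=1.0):
    """New execution time after scaling IC, CPI, and (optionally) cycle time."""
    return old_time_s * ic_factor * cpi_factor * cycle_time_factor

old_time = 15.0
new_time = scaled_time(old_time, ic_factor=0.6, cpi_factor=1.1)
speedup = old_time / new_time

print(f"new time: {new_time:.1f} s, speedup: {speedup:.3f}x")
```

The instruction count fell by 40%, but the higher CPI gives back part of that gain, leaving a net speedup of about 1.5×.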

Another Example
• Consider the following measurements of execution time:

  Program    Computer A    Computer B
  1          2 sec.        4 sec.
  2          5 sec.        2 sec.

• Which of the following statements are true?
  – A is faster than B for program 1.
  – A is faster than B for program 2.
  – A is faster than B for a workload with equal numbers of executions of programs 1 and 2.
  – A is faster than B for a workload with twice as many executions of program 1 as of program 2.
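One way to check each statement is to compare total execution times for the corresponding workload mix. The sketch below uses the times from the table; the data layout and function names are hypothetical, and a workload is described as a mapping from program number to number of executions.

```python
# Execution times in seconds, from the table: TIMES[machine][program].
TIMES = {
    "A": {1: 2.0, 2: 5.0},
    "B": {1: 4.0, 2: 2.0},
}

def workload_time(machine, runs):
    """Total time for a workload given as {program: number of executions}."""
    return sum(TIMES[machine][program] * count for program, count in runs.items())

def a_is_faster(runs):
    """True if computer A finishes the workload in less total time than B."""
    return workload_time("A", runs) < workload_time("B", runs)

workloads = {
    "program 1 only": {1: 1},
    "program 2 only": {2: 1},
    "equal mix of programs 1 and 2": {1: 1, 2: 1},
    "twice as many runs of program 1": {1: 2, 2: 1},
}

for label, runs in workloads.items():
    print(f"{label}: A = {workload_time('A', runs)} s, "
          f"B = {workload_time('B', runs)} s, A faster: {a_is_faster(runs)}")
```

Under this total-time comparison, A wins on program 1 alone (2 s vs. 4 s) and on the 2:1 mix (9 s vs. 10 s), while B wins on program 2 alone (2 s vs. 5 s) and on the equal mix (6 s vs. 7 s). Which machine is "faster" depends on the workload weights.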