CS 4100: Computer Architecture
Performance and Cost
Department of Computer Science, National Tsing Hua University
Academic Year 96 (2007), First Semester
Adapted from class notes of D. Patterson and W. Dally. Copyright 1998, 2000 UCB.

Which Mobile Phone Is the Better Buy?

At About the Same Price, How Do You Compare Them?
• Ericsson T68
• Nokia 8250
• Motorola V66

Performance
• Purchasing perspective: given a collection of machines, which has the
    • best performance?
    • least cost?
    • best performance/cost?
• Design perspective: faced with design options, which has the
    • best performance improvement?
    • least cost?
    • best performance/cost?
• Both require
    • a basis for comparison
    • a metric for evaluation
• Goal: understand the cost and performance implications of architectural choices

Tasks of a Computer Architect
[Figure: the computer architect works between applications and technology (historic background, trends). Labeled areas: interfaces (ISA, API, I/O channels, links; Chapter 2), machine organization (IR, registers; Chapters 3, 5-8), and measurement & evaluation (Chapter 4).]

Outline
• Performance
    • Definition
    • CPU performance formula
    • Measuring and evaluating performance
• Cost
    • Cost of chips

Which Airplane Has Better Performance?
Concorde:
• Capacity: 100 persons
• Range: 6667 km
• Cruising speed: 2160 kph (Mach 2) at 60,000 ft
747-400:
• Capacity: 400 persons
• Range: 11,485 km
• Cruising speed: 929 kph at 35,000 ft

Two Notions of Performance

Plane         DC to Paris   Speed               Passengers   Throughput (pmph)
Boeing 747    6.5 hours     610 mph             470          286,700
Concorde      3 hours       1350 mph (Mach 2)   132          178,200

(pmph = passenger-miles per hour)

• Which has higher performance?
    • Time to deliver 1 passenger?
    • Time to deliver 400 passengers?
• Time to do the task: execution time, response time, latency
• Tasks per unit time: throughput, bandwidth
• Response time and throughput are often in opposition

Which Is Better?
• Time of Concorde vs. Boeing 747:
    • Concorde is 6.5 hours / 3 hours = 2.2 times faster (equivalently, 1350 mph / 610 mph = 2.2)
• Throughput of Concorde vs. Boeing 747:
    • Boeing is 286,700 pmph / 178,200 pmph = 1.6 times better
• Boeing is 1.6 times (60%) faster in terms of throughput
• Concorde is 2.2 times (120%) faster in terms of flying time (response time)
• We will focus on execution time for a single job
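The two ratios on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the figures come from the table on the previous slide, while the dictionary layout and the function name `throughput_pmph` are invented here.

```python
# Figures from the "Two Notions of Performance" table (DC to Paris).
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def throughput_pmph(plane):
    """Passenger-miles per hour: passengers carried times cruising speed."""
    return plane["passengers"] * plane["mph"]


boeing, concorde = planes["Boeing 747"], planes["Concorde"]

# Response time is compared as a ratio of execution times (slower / faster).
latency_ratio = boeing["hours"] / concorde["hours"]

# Throughput is compared as a ratio of work done per unit time.
throughput_ratio = throughput_pmph(boeing) / throughput_pmph(concorde)

print(f"Concorde is {latency_ratio:.1f}x faster in flying time")
print(f"Boeing 747 is {throughput_ratio:.1f}x better in throughput")
```

The same machine can win on one metric and lose on the other, which is why the metric has to be fixed before machines are compared.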

Performance Definition
• Performance according to time: faster is better
\[ \text{Performance}(X) = \frac{1}{\text{Execution time}(X)} \]
• To compare two things, "X is n times faster than Y" means
\[ n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution time}(Y)}{\text{Execution time}(X)} \]

What Is Time?
• Straightforward definition of time:
    • Total time to complete a task, including disk and memory accesses, I/O activities, OS overhead, ...
    • May include the execution time of other programs in a multiprogramming environment
    • Too many factors involved
• Alternative: the time that the processor (CPU) is working only on your program (since multiple processes run at the same time)
    • "CPU execution time" or "CPU time"
    • Often divided into system CPU time (in the OS) and user CPU time (in the user program)
• CPU performance: user CPU time of a single program

Outline
• Performance
    • Definition
    • CPU performance formula (Sec. 4.2)
    • Measuring and evaluating performance
• Cost
    • Cost of chips

How Can Program Execution Time Be Expressed as a Formula?
• Hint: basic components of a program
• Instruction count
• Instruction execution time (average)

What Is the Instruction Count of a Program?
• How many C statements?
      for (i = 0; i < 100; i++)
          a[i] = b[i] * c[i];
• How many assembly-language instructions?
      [Assembly listing garbled in extraction. Recoverable pieces: labels "Loop:" and "End:"; opcodes sub, beq, addi, j; operands "$r1, $r2, $r3", "$r9, $r0, End", "$r8, $r10", "$r9, -1", "Loop".]
• Executing the loop 10 times => 41 instructions
• This is the dynamic instruction count

Instruction Execution Time
• Time unit from a user's perspective: time = seconds
• CPU time: computers are constructed using a clock that runs at a constant rate and determines when events take place in the hardware
    • These discrete time intervals are called clock cycles (or informally clocks or cycles)
    • The length of a clock period is the clock cycle time (e.g., 2 nanoseconds, or 2 ns); the clock rate (e.g., 500 megahertz, or 500 MHz) is the inverse of the clock period
• Instruction execution time is measured in cycles

Program Execution Time
\[ \text{CPU execution time} = \text{Clock cycles} \times \text{Clock cycle time} = \frac{\text{Clock cycles}}{\text{Clock rate}} \]
\[ \text{Clock cycles} = \text{Instruction count} \times \text{CPI} \]
• CPI is the average number of clock cycles per instruction
• CPI is one way to compare two machines with the same instruction set, since the instruction count is the same

Performance Calculation (1/2)
• CPU execution time for a program (designer's view):
\[ \text{CPU execution time} = \text{Clock cycles} \times \text{Clock cycle time} \]
• Substituting for clock cycles gives the CPU execution time for a program (user's view):
\[ \text{CPU execution time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time} \]
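A minimal sketch of this formula in Python follows. The function names and the sample program (one billion instructions, CPI of 2.2) are assumptions made for illustration; the 500 MHz clock is the example rate quoted on the "Instruction Execution Time" slide.

```python
def clock_cycle_time(clock_rate_hz):
    """Clock cycle time in seconds is the inverse of the clock rate in Hz."""
    return 1.0 / clock_rate_hz


def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count x CPI x clock cycle time."""
    clock_cycles = instruction_count * cpi
    return clock_cycles * clock_cycle_time(clock_rate_hz)


# Hypothetical program: 1e9 dynamic instructions, average CPI of 2.2,
# running on a 500 MHz clock (2 ns cycle time).
seconds = cpu_time(1_000_000_000, 2.2, 500e6)
print(f"cycle time = {clock_cycle_time(500e6) * 1e9:.1f} ns")
print(f"CPU time   = {seconds:.2f} s")
```

Halving any one of the three factors halves CPU time, but in practice the factors interact: a design change that raises the clock rate often raises CPI as well.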

How to Calculate the 3 Components?
• Clock cycle time: in the specification of the computer (clock rate in advertisements)
• Instruction count:
    • Count instructions in the loop of a small program
    • Use a simulator or emulator to count instructions
    • Use a debugger or tracing program
    • Execution-based monitoring: insert instrumentation code into the binary code, run it, and record information
    • Hardware counter in a special register (Pentium)
• CPI:
    • Calculate: CPI = (Execution time / Clock cycle time) / Instruction count
    • Hardware counter in a special register (Pentium)
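The CPI calculation in the last bullet group can be sketched as below. The function name and the measured values are hypothetical; on a real machine the execution time and instruction count would come from a timer and from a counter or simulator.

```python
def measured_cpi(execution_time_s, clock_cycle_time_s, instruction_count):
    """CPI = (execution time / clock cycle time) / instruction count."""
    clock_cycles = execution_time_s / clock_cycle_time_s
    return clock_cycles / instruction_count


# Hypothetical measurement: 10 s of user CPU time on a 2 ns clock,
# with 2e9 dynamic instructions counted by a simulator.
cpi = measured_cpi(10.0, 2e-9, 2_000_000_000)
print(f"CPI = {cpi:.2f}")
```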

Calculating CPI Another Way
• First calculate the CPI for each individual instruction class (add, sub, and, etc.)
• Next calculate the frequency of each instruction class in the workload
• Finally multiply these two for each class and add the products up to get the final CPI:
\[ \text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i, \qquad F_i = \frac{I_i}{\text{Instruction count}} \]
where \(I_i\) is the number of executed instructions of class \(i\) and \(F_i\) is the "instruction frequency".

Example (RISC processor)

Instruction mix:

Op       Freq_i   CPI_i   Freq_i x CPI_i   (% Time)
ALU      50%      1       0.5              (23%)
Load     20%      5       1.0              (45%)
Store    10%      3       0.3              (14%)
Branch   20%      2       0.4              (18%)
Total                     2.2

(The % Time column shows where the time is spent.)

• What if branch instructions were twice as fast?
• What if two ALU instructions could be executed at once?
• Must know the limit of an architectural enhancement
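The table and the two "what if" questions can be worked through in Python. The frequencies and per-class CPIs are the slide's values; the function names are invented here, and reading "two ALU instructions at once" as halving the ALU CPI is an assumption.

```python
# (frequency, CPI) per instruction class, taken from the table above.
mix = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}


def average_cpi(mix):
    """Weighted CPI: sum of frequency x CPI over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cpi / total for op, (freq, cpi) in mix.items()}


base_cpi = average_cpi(mix)
print(f"base CPI = {base_cpi:.2f}")
for op, share in time_share(mix).items():
    print(f"{op:7s} {share:.0%}")

# What if branches were twice as fast (branch CPI 2 -> 1)?
faster_branch = dict(mix, Branch=(0.2, 1))
branch_speedup = base_cpi / average_cpi(faster_branch)
print(f"speedup with faster branches = {branch_speedup:.2f}")

# What if two ALU instructions could issue at once?
# Assumption: this is modeled as halving the ALU CPI (1 -> 0.5).
dual_alu = dict(mix, ALU=(0.5, 0.5))
alu_speedup = base_cpi / average_cpi(dual_alu)
print(f"speedup with dual ALU issue  = {alu_speedup:.2f}")
```

Neither enhancement comes close to a 2x overall gain, because each touches only the part of the time spent in its own instruction class.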

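Summary: CPU Time Formula
\[ \text{CPU time} = \text{Instruction count} \times \text{CPI} \times \text{Clock cycle time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}} \]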

Amdahl's Law
• Speedup due to enhancement E: suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then
\[ \text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left( (1 - F) + \frac{F}{S} \right) \]
\[ \text{Speedup}(E) = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - F) + F / S} \]
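As a small sketch, Amdahl's Law can be applied to the earlier instruction-mix example, where branches account for 0.4 / 2.2 (about 18%) of execution time. The function name and argument checks are choices made for this illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction F of the task is sped up by factor S."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Branches take 0.4 of the 2.2 cycles per instruction in the RISC example.
branch_fraction = 0.4 / 2.2

# Making branches twice as fast gives only a modest overall gain.
print(f"S = 2:        {amdahl_speedup(branch_fraction, 2):.2f}")

# Even infinitely fast branches are bounded by 1 / (1 - F).
print(f"S = infinity: {amdahl_speedup(branch_fraction, float('inf')):.2f}")
```

The second figure is the limit of the enhancement: however fast branches become, the remaining 82% of the time is untouched.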

Outline
• Performance
    • Definition
    • CPU performance formula
    • Measuring and evaluating performance (Sec. 4.3, 4.4)
        • Benchmark programs
        • Summarizing performance
        • Reporting performance
• Cost
    • Cost of chips

What Programs for Comparison?
• What's wrong with this program as a workload?
      int A[100][100], B[100][100], C[100][100];
      int I, J, K;
      for (I = 0; I < 100; I++)
          for (J = 0; J < 100; J++)
              for (K = 0; K < 100; K++)
                  C[I][J] = C[I][J] + A[I][K] * B[K][J];
• What is measured? What is not measured? What is it good for?
• Ideally, run typical programs with typical input before purchase, or even before building the machine
    • Such a set of programs is called a "workload"
    • For example, an engineer uses a compiler and a spreadsheet
    • An author uses a word processor, a drawing program, and compression software

Choosing Benchmark Programs

Actual target workload
• Pros: representative
• Cons: not portable; hard to measure and to find the cause

Full application benchmarks
• Pros: portable; widely used; improvements useful in reality
• Cons: less representative

Small kernel benchmarks
• Pros: easy to run, early in design
• Cons: easy to fool

Microbenchmarks
• Pros: find potential bottlenecks and peak capability
• Cons: peak does not reflect application performance

Benchmarks
• The apparent speed of a processor depends on the code used to test it
• Industry standards are needed so that different processors can be compared fairly => benchmark programs
• Companies exist that create these benchmarks: "typical" code used to evaluate systems
• Tricks in benchmarking:
    • different system configurations
    • compiler and libraries optimized (perhaps manually) for the benchmarks
    • test specification biased towards one machine
    • very small benchmarks used
• Benchmarks need to be changed every 2 or 3 years, since designers could target these standard benchmarks

Example Standardized Workload Benchmarks
• Standard Performance Evaluation Corporation (SPEC): supported by a number of computer vendors to create a standard set of benchmarks
• Began in 1989, focusing on benchmarking workstations and servers using CPU-intensive benchmarks
• The latest release: SPEC 2000 benchmarks
    • CPU performance
    • Graphics
    • High-performance computing
    • Object-oriented computing
    • Java applications
    • Client-server models
    • Mail systems
    • File systems
    • Web servers

SPEC CPU2000 (CINT)

Benchmark      Language   Category
164.gzip       C          Compression
175.vpr        C          FPGA Placement/Route
176.gcc        C          C Compiler
181.mcf        C          Combinatorial Optimization
186.crafty     C          Chess
197.parser     C          Word Processing
252.eon        C++        Computer Visualization
253.perlbmk    C          PERL
254.gap        C          Group Theory, Interpreter
255.vortex     C          OO Database
256.bzip2      C          Compression
300.twolf      C          Place/Route Simulator

(http://www.spec.org/cpu2000)

SPEC CPU2000 (CFP)

Benchmark      Language   Category
168.wupwise    F77        Quantum Chromodynamics
171.swim       F77        Shallow Water Modeling
172.mgrid      F77        Multi-grid Solver
173.applu      F77        Parabolic/Elliptic PDE
177.mesa       C          3-D Graphics Library
178.galgel     F90        Computational Fluid Dynamics
179.art        C          Image Recognition/Neural Net
183.equake     C          Seismic Wave Propagation
187.facerec    F90        Image Processing
188.ammp       C          Computational Chemistry
189.lucas      F90        Number Theory
191.fma3d      F90        Finite-element Crash Simulation
200.sixtrack   F77        Nuclear Accelerator Designs
301.apsi       F77        Pollutant Distribution

Summarizing Performance

             Machine A   Machine B
Program 1    1 s         10 s
Program 2    1000 s      100 s
Total        1001 s      110 s

• A is 10 times faster than B for program 1
• B is 10 times faster than A for program 2
• What is the relative performance of A and B?
• Arithmetic mean (tracks total time), assuming programs 1 and 2 are run an equal number of times:
\[ \text{AM} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i \]
• Weighted arithmetic mean, for programs run with different frequencies:
\[ \text{WAM} = \sum_{i=1}^{n} w_i \times \text{Time}_i, \qquad \sum_{i=1}^{n} w_i = 1 \]
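The two means can be compared on the table's numbers with a short sketch. The execution times are the slide's; the function names and the 99%/1% weights are hypothetical, chosen only to show how strongly the weights affect the conclusion.

```python
# Execution times in seconds for programs 1 and 2, from the table above.
times = {"A": [1.0, 1000.0], "B": [10.0, 100.0]}


def arithmetic_mean(values):
    """Plain mean: every program is assumed to run equally often."""
    return sum(values) / len(values)


def weighted_arithmetic_mean(values, weights):
    """Mean with per-program weights that must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * v for w, v in zip(weights, values))


for machine, ts in times.items():
    print(f"{machine}: total = {sum(ts):.0f} s, mean = {arithmetic_mean(ts):.1f} s")

# Hypothetical workload in which program 1 makes up 99% of the runs.
weights = [0.99, 0.01]
for machine, ts in times.items():
    print(f"{machine}: weighted mean = {weighted_arithmetic_mean(ts, weights):.2f} s")
```

With equal weights B looks about 9 times faster than A; with the skewed weights the two machines come out nearly equal, so the weights have to reflect the actual workload.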

Reporting Performance
• Guiding principle: reproducibility
    • List everything another experimenter would need to duplicate the results (especially the input set)

Example system description (Fig. 4.3):

Hardware
    CPU               3.2-GHz Pentium 4 Extreme Edition
    L3 cache size     2048 KB (I+D) on chip
    Memory            4 x 512 MB
    Disk subsystem    1 x 80 GB ATA/100 7200 RPM
Software
    OS                Windows XP Professional SP1
    Compiler          Intel C++ Compiler 7.1

Two SPEC Benchmarks
• Compare the performance of recent Intel processors
• Performance with SPEC CPU benchmarks
    • CINT2000
    • CFP2000
    • Each SPEC ratio is a ratio of execution times:
        SPEC ratio(A, go) = time(Sun Ultra 5_10, go) / time(A, go)
• SPECweb99
    • A throughput benchmark for web servers
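A sketch of the SPEC ratio calculation follows. The reference and measured times are made up for illustration and are not real SPEC data. Combining the per-benchmark ratios with a geometric mean is SPEC's usual summary method; that detail is background added here, not something stated on the slide.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """Ratio of the reference machine's time to the measured machine's time."""
    return reference_time_s / measured_time_s


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical times in seconds for three benchmarks (not real SPEC data).
reference_times = [1400.0, 1900.0, 1100.0]  # reference machine
measured_times = [700.0, 475.0, 1100.0]     # machine under test

ratios = [spec_ratio(r, m) for r, m in zip(reference_times, measured_times)]
print("ratios:", ratios)
print(f"geometric mean = {geometric_mean(ratios):.2f}")
```

A ratio above 1 means the machine under test is faster than the reference machine on that benchmark.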

SPEC CINT2000 and CFP2000 Ratings for Pentium III and Pentium 4 at Different Clock Rates
[Figure: Fig. 4.6]

Performance per Clock Rate (MHz)
• Sources of CPU time improvement:
    • Advanced integrated circuit technology (clock rate)
    • Aggressive pipeline structure (clock rate, CPI)
    • Better ISA (CPI)
• Compute the performance per MHz of clock rate: the average value of the ratio across different clock rates

Rating/MHz      Pentium III   Pentium 4
CINT2000/MHz    0.47          0.36
CFP2000/MHz     0.34          0.39

• CINT2000:
    • A faster design sacrifices one aspect (e.g., CPI) to enhance another (clock rate)
    • Assuming identical code, the ratio of Pentium 4 CPI to Pentium III CPI is 1.3 (0.47 / 0.36)
• CFP2000:
    • Pentium 4 provides new instructions
    • Instruction count and CPI for Pentium 4 are different
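The CPI ratio of 1.3 follows from the CPU time formula: a rating per MHz is proportional to 1 / (instruction count x CPI), so with identical code the instruction counts cancel. The sketch below uses the table's values; the dictionary layout and function name are invented here.

```python
# Average rating per MHz from the table above.
rating_per_mhz = {
    "CINT2000": {"Pentium III": 0.47, "Pentium 4": 0.36},
    "CFP2000": {"Pentium III": 0.34, "Pentium 4": 0.39},
}


def cpi_ratio_p4_to_p3(suite):
    """CPI(Pentium 4) / CPI(Pentium III), valid only for identical code.

    Rating per MHz is proportional to 1 / (instruction count x CPI), so when
    the instruction counts are equal the CPI ratio is the inverse of the
    rating-per-MHz ratio.
    """
    row = rating_per_mhz[suite]
    return row["Pentium III"] / row["Pentium 4"]


print(f"CINT2000 CPI ratio (P4 / PIII) = {cpi_ratio_p4_to_p3('CINT2000'):.1f}")
```

The same division is not meaningful for CFP2000, because the slide notes that the Pentium 4 runs different instructions there, so the instruction counts do not cancel.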

SPECweb99: Throughput for Web Servers (Fig. 4.7)

System   Processor     # disk drives   # CPUs   # networks   Clock (GHz)   Result
2500     Pentium III   8               2        4            1.13          3435
2550     Pentium III   1               2        1            1.26          1454
6600     Pentium 4     [row garbled in source; surviving values: 8, 2]     6700
8450     Pentium III   7               8        8            0.7           8001

Performance, Power
[Figure: Fig. 4.8]

Performance, Energy
• Energy efficiency = performance / power
[Figure: Fig. 4.9]

Summary: Performance
• Latency vs. throughput
• CPU time: time spent executing a single program
    • Depends solely on the design of the processor (datapath, pipelining effectiveness, caches, etc.)
• Performance doesn't depend on any single factor: instruction count, clocks per instruction, and clock rate are all needed to get valid estimates
• Performance evaluation needs to consider:
    • Benchmark programs
    • Summarizing performance
    • Reporting performance results

Outline
• Performance
    • Definition
    • CPU performance formula
    • Measuring and evaluating performance
• Cost (Sec. 1.4)
    • Cost of chips

Chip Cost: Manufacturing Process
[Figure: Fig. 1.14]

Cost of a Chip Includes...
• Die cost: affected by wafer cost, number of dies per wafer, and die yield (#good dies / #total dies)
    • Goes roughly with the cube of the die area
    • An 8" wafer can contain 196 Pentium dies, but only 78 Pentium Pro dies
• Testing cost
• Packaging cost: depends on pins, heat dissipation, ...

Real World Examples

Chip          Metal    Line width   Wafer    Defects   Area    Dies/   Yield   Die
              layers   (µm)         cost     /cm2      (mm2)   wafer           cost
386DX         2        0.90         $900     1.0       43      360     71%     $4
486DX2        3        0.80         $1200    1.0       81      181     54%     $12
PowerPC 601   4        0.80         $1700    1.3       121     115     28%     $53
HP PA 7100    3        0.80         $1300    1.0       196     66      27%     $73
DEC Alpha     3        0.70         $1500    1.2       234     53      19%     $149
SuperSPARC    3        0.70         $1700    1.6       256     48      13%     $272
Pentium       3        0.80         $1500    1.5       296     40      9%      $417
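The die cost column can be reproduced from the other columns. The sketch assumes the standard relation die cost = wafer cost / (dies per wafer x die yield), which matches the three factors the slides name but is not spelled out on them; the tuple layout and function name are invented here.

```python
# (chip, wafer cost in $, dies per wafer, die yield, die cost listed in the table)
chips = [
    ("386DX", 900, 360, 0.71, 4),
    ("486DX2", 1200, 181, 0.54, 12),
    ("PowerPC 601", 1700, 115, 0.28, 53),
    ("HP PA 7100", 1300, 66, 0.27, 73),
    ("DEC Alpha", 1500, 53, 0.19, 149),
    ("SuperSPARC", 1700, 48, 0.13, 272),
    ("Pentium", 1500, 40, 0.09, 417),
]


def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Cost of one good die: the wafer cost is spread over the good dies only."""
    return wafer_cost / (dies_per_wafer * die_yield)


for name, wafer_cost, dies, die_yield, listed in chips:
    computed = die_cost(wafer_cost, dies, die_yield)
    print(f"{name:12s} computed ${computed:7.2f}   table ${listed}")
```

Larger dies lose twice: fewer of them fit on a wafer and a smaller share of them work, which is why die cost rises so much faster than die area.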

Summary: Cost
• Integrated circuits are driving the computer industry
• Die cost goes up with the cube of the die area
• Economics ($$$) is the ultimate driver for performance!