CS 4100 Performance and Cost Adapted from class
- Slides: 46
CS 4100: Computer Architecture
Performance and Cost
Department of Computer Science, National Tsing Hua University
First semester, academic year 96 (Fall 2007)
Adapted from class notes of D. Patterson and W. Dally, Copyright 1998, 2000 UCB
Which mobile phone is the better buy? Performance-1 Computer Architecture
At about the same price, how do you compare them?
- Ericsson T68
- Nokia 8250
- Motorola V66
Performance
- Purchasing perspective: given a collection of machines, which has the
  - best performance?
  - least cost?
  - best performance/cost?
- Design perspective: faced with design options, which has the
  - best performance improvement?
  - least cost?
  - best performance/cost?
- Both require
  - a basis for comparison
  - a metric for evaluation
- Goal: understand the cost and performance implications of architectural choices
Tasks of a Computer Architect
[Figure labels: Technology, Applications, Computer Architect; ISA, API, I/O Chan, Link; IR, Regs]
- Interfaces (ISA), historic background, trend: Chapter 2
- Machine organization: Chapters 3, 5-8
- Measurement and evaluation: Chapter 4
Outline
- Performance
  - Definition
  - CPU performance formula
  - Measuring and evaluating performance
- Cost
  - Cost of chips
Which airplane has better performance?
- Concorde
  - Capacity: 100 persons
  - Range: 6,667 km
  - Cruising speed: 2,160 kph (Mach 2) at 60,000 ft
- 747-400
  - Capacity: 400 persons
  - Range: 11,485 km
  - Cruising speed: 929 kph at 35,000 ft
Two Notions of Performance
Plane       DC to Paris  Speed              Passengers  Throughput (pmph)
Boeing 747  6.5 hours    610 mph            470         286,700
Concorde    3 hours      1350 mph (Mach 2)  132         178,200
(pmph = passengers x mph)
- Which has higher performance?
  - Time to deliver 1 passenger? To deliver 400 passengers?
- Time to do the task: execution time, response time, latency
- Tasks per unit time: throughput, bandwidth
- Response time and throughput are often in opposition
Which Is Better?
- Time of Concorde vs. Boeing 747:
  - Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
- Throughput of Concorde vs. Boeing 747:
  - Boeing is 286,700 pmph / 178,200 pmph = 1.6 times better
- Boeing is 1.6 times (60%) faster in terms of throughput
- Concorde is 2.2 times (120%) faster in terms of flying time (response time)
- We will focus on execution time for a single job
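The two ratios on this slide can be re-derived from the table figures. The following is a minimal sketch; the helper names `throughput_pmph` and `times_faster` are illustrative and do not come from the slides.

```python
# Figures come from the slide's table; the helper names are illustrative.
def throughput_pmph(passengers, speed_mph):
    """Passenger-miles per hour carried by one plane."""
    return passengers * speed_mph


def times_faster(slower_time, faster_time):
    """Ratio of two times: how many times faster the quicker option is."""
    return slower_time / faster_time


boeing_pmph = throughput_pmph(470, 610)
concorde_pmph = throughput_pmph(132, 1350)

# Response time: Concorde wins (6.5 hours vs. 3 hours).
latency_ratio = times_faster(6.5, 3.0)
# Throughput: Boeing wins.
throughput_ratio = boeing_pmph / concorde_pmph

print(f"Concorde is {latency_ratio:.1f}x faster in flying time")
print(f"Boeing is {throughput_ratio:.1f}x better in throughput")
```

The same two planes rank differently depending on the metric, which is why the metric has to be fixed before comparing machines.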
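Performance Definition
- Performance according to time: faster is better
  $$\text{Performance}(X) = \frac{1}{\text{Execution time}(X)}$$
- If interested in comparing two things, "X is n times faster than Y" means
  $$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution time}(Y)}{\text{Execution time}(X)}$$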
What is Time?
- Straightforward definition of time:
  - Total time to complete a task, including disk and memory accesses, I/O activities, OS overhead, ...
  - May include the execution time of other programs in a multiprogramming environment
  - Too many factors involved
- Alternative: the time the processor (CPU) spends working only on your program (since multiple processes run at the same time)
  - "CPU execution time" or "CPU time"
  - Often divided into system CPU time (in the OS) and user CPU time (in the user program)
- CPU performance: user CPU time of a single program
Outline
- Performance
  - Definition
  - CPU performance formula (Sec. 4.2)
  - Measuring and evaluating performance
- Cost
  - Cost of chips
How can a program's execution time be expressed as a formula?
- Hint: the basic components of a program
- Instruction count
- Instruction execution time (average)
What is a program's instruction count?
- How many C statements?
    for (i = 0; i < 100; i++)
        a[i] = b[i] * c[i];
- How many assembly-language instructions?
    Labels:   Loop, End
    Opcodes:  sub, beq, addi, j
    Operands: $r1, $r2, $r3 | $r9, $r0, End | $r8, $r10 | $r9, -1 | Loop
- 10 times => 41 instructions: the dynamic instruction count
Instruction Execution Time
- Time unit from a user's perspective: time = seconds
- CPU time: computers are constructed using a clock that runs at a constant rate and determines when events take place in the hardware
  - These discrete time intervals are called clock cycles (or informally clocks or cycles)
  - Length of a clock period: clock cycle time (e.g., 2 nanoseconds or 2 ns); the clock rate (e.g., 500 megahertz or 500 MHz) is the inverse of the clock period
- Instruction execution time is measured in cycles
Program Execution Time
$$\text{CPU execution time for program} = \text{Clock Cycles for program} \times \text{Clock Cycle Time} = \frac{\text{Clock Cycles for program}}{\text{Clock Rate}}$$
$$\text{Clock Cycles for program} = \text{Instruction Count} \times \text{CPI}$$
where Instruction Count is the number of instructions for the program and CPI is the average clock cycles per instruction
- CPI: one way to compare two machines with the same instruction set, since the Instruction Count is the same
Performance Calculation (1/2)
- CPU execution time for program (designer's view):
  $$\text{CPU execution time for program} = \text{Clock Cycles for program} \times \text{Clock Cycle Time}$$
- Substituting for clock cycles, CPU execution time for program (user's view):
  $$\text{CPU execution time for program} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}$$
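The formula can be evaluated directly. This is a minimal sketch: the function name and the sample workload (one billion instructions, CPI of 2.2, 500 MHz clock) are assumptions chosen for illustration, not measurements from the slides.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = Instruction Count x CPI x Clock Cycle Time.

    Clock Cycle Time is the inverse of the clock rate, so multiplying
    by it is the same as dividing the cycle count by the clock rate.
    """
    clock_cycles = instruction_count * cpi
    return clock_cycles / clock_rate_hz


# Hypothetical workload: 1e9 instructions, CPI 2.2, 500 MHz clock.
example_time = cpu_time_seconds(1_000_000_000, 2.2, 500e6)
print(f"CPU time: {example_time:.2f} s")
```

Halving any one of the three factors halves the CPU time, provided the other two stay unchanged.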
How to Calculate the 3 Components?
- Clock Cycle Time: in the specification of the computer (Clock Rate in advertisements)
- Instruction Count:
  - Count instructions in the loop of a small program
  - Use a simulator or emulator to count instructions
  - Debugger or tracing program
  - Execution-based monitoring: insert instrumentation code into the binary code, run, and record information
  - Hardware counter in a special register (Pentium)
- CPI:
  - Calculate: $$\text{CPI} = \frac{\text{Execution Time} / \text{Clock Cycle Time}}{\text{Instruction Count}}$$
  - Hardware counter in a special register (Pentium)
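The "calculate" route to CPI can be sketched as follows. The function name and the measured values (4.4 s, 2 ns cycle time, one billion instructions) are hypothetical and only illustrate the arithmetic.

```python
def measured_cpi(execution_time_s, clock_cycle_time_s, instruction_count):
    """CPI = (Execution Time / Clock Cycle Time) / Instruction Count."""
    clock_cycles = execution_time_s / clock_cycle_time_s
    return clock_cycles / instruction_count


# Hypothetical measurement: 4.4 s on a 2 ns clock, 1e9 instructions executed.
cpi = measured_cpi(4.4, 2e-9, 1_000_000_000)
print(f"CPI: {cpi:.2f}")
```

In practice the execution time would come from a timer and the instruction count from one of the counting methods listed above.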
Calculating CPI Another Way
- First calculate the CPI for each individual instruction (add, sub, and, etc.)
- Next calculate the frequency of each individual instruction in the workload
- Finally multiply these two for each instruction and add them up to get the final CPI
  $$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i, \qquad F_i = \frac{I_i}{\text{Instruction Count}}$$
  where $F_i$ is the "instruction frequency" of instruction class $i$
Example (RISC processor)
Instruction mix:
Op      Freq_i  CPI_i  Prod  (% Time)
ALU     50%     1      0.5   (23%)
Load    20%     5      1.0   (45%)
Store   10%     3      0.3   (14%)
Branch  20%     2      0.4   (18%)
Total                  2.2
(% Time shows where the time is spent)
- What if Branch instructions were twice as fast?
- What if two ALU instructions could be executed at once?
- Must know the limit of architectural enhancement
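The table and the two "what if" questions can be worked through numerically. This is a sketch under stated assumptions: "twice as fast" is read as halving the branch CPI from 2 to 1, and "two ALU instructions at once" as halving the effective ALU CPI from 1 to 0.5. The function and variable names are illustrative.

```python
def weighted_cpi(mix):
    """Sum of frequency x CPI over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


# (frequency, CPI) per class, taken from the slide's table.
base_mix = {
    "ALU": (0.5, 1.0),
    "Load": (0.2, 5.0),
    "Store": (0.1, 3.0),
    "Branch": (0.2, 2.0),
}
base_cpi = weighted_cpi(base_mix)

# Assumption: branches twice as fast means branch CPI drops from 2 to 1.
fast_branch_cpi = weighted_cpi({**base_mix, "Branch": (0.2, 1.0)})

# Assumption: dual-issue ALU means effective ALU CPI drops from 1 to 0.5.
dual_alu_cpi = weighted_cpi({**base_mix, "ALU": (0.5, 0.5)})

print(f"base CPI        = {base_cpi:.2f}")
print(f"faster branches = {fast_branch_cpi:.2f} "
      f"(speedup {base_cpi / fast_branch_cpi:.2f})")
print(f"dual-issue ALU  = {dual_alu_cpi:.2f} "
      f"(speedup {base_cpi / dual_alu_cpi:.2f})")
```

Under these assumptions both enhancements give modest speedups, because loads dominate the execution time.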
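Summary: CPU Time Formula
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$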
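Amdahl's Law
- Speedup due to enhancement E:
  $$\text{Speedup}(E) = \frac{\text{Execution time without } E}{\text{Execution time with } E}$$
- Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected; then
  $$\text{Execution time with } E = \text{Execution time without } E \times \left[(1 - F) + \frac{F}{S}\right]$$
  $$\text{Speedup}(E) = \frac{1}{(1 - F) + \frac{F}{S}}$$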
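Amdahl's Law is easy to check numerically. The sketch below reuses the RISC instruction-mix example, where branches account for 0.4 of the 2.2 cycles per instruction, and asks what happens if branches become twice as fast. The function name is illustrative.

```python
def amdahl_speedup(fraction_enhanced, factor):
    """Overall speedup when a fraction F of the time is sped up by a factor S."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / factor)


# Branches take 0.4 of the 2.2 cycles per instruction (about 18% of the time).
branch_fraction = 0.4 / 2.2

twice_as_fast = amdahl_speedup(branch_fraction, 2.0)
# Upper bound: even infinitely fast branches leave the other 82% untouched.
upper_bound = amdahl_speedup(branch_fraction, float("inf"))

print(f"branches 2x faster: overall speedup {twice_as_fast:.2f}")
print(f"branches free:      overall speedup {upper_bound:.2f}")
```

The upper bound is the "limit of architectural enhancement": the overall speedup can never exceed 1 / (1 - F), however large S becomes.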
Outline
- Performance
  - Definition
  - CPU performance formula
  - Measuring and evaluating performance (Sec. 4.3, 4.4)
    - Benchmark programs
    - Summarizing performance
    - Reporting performance
- Cost
  - Cost of chips
What Programs for Comparison?
- What's wrong with this program as a workload?
    integer A[][], B[][], C[][];
    for (I=0; I<100; I++)
      for (J=0; J<100; J++)
        for (K=0; K<100; K++)
          C[I][J] = C[I][J] + A[I][K]*B[K][J];
- What is measured? What is not measured? What is it good for?
- Ideally, run typical programs with typical input before purchase, or even before building the machine
  - Called a "workload"; for example:
  - An engineer uses a compiler and a spreadsheet
  - An author uses a word processor, a drawing program, and compression software
Choosing Benchmark Programs
- Actual target workload
  - Pros: representative
  - Cons: not portable; hard to measure and to find the cause
- Full application benchmarks
  - Pros: portable; widely used; improvements useful in reality
  - Cons: less representative
- Small kernel benchmarks
  - Pros: easy to run, early in design
  - Cons: easy to fool
- Microbenchmarks
  - Pros: find potential bottleneck and peak capability
  - Cons: peak does not reflect application performance
Benchmarks
- Obviously, the apparent speed of a processor depends on the code used to test it
- Need industry standards so that different processors can be fairly compared => benchmark programs
- Companies exist that create these benchmarks: "typical" code used to evaluate systems
- Tricks in benchmarking:
  - different system configurations
  - compiler and libraries optimized (perhaps manually) for benchmarks
  - test specification biased towards one machine
  - very small benchmarks used
- Need to be changed every 2 or 3 years since designers could target these standard benchmarks
Example Standardized Workload Benchmarks
- Standard Performance Evaluation Corporation (SPEC): supported by a number of computer vendors to create a standard set of benchmarks
- Began in 1989, focusing on benchmarking workstations and servers using CPU-intensive benchmarks
- The latest release: SPEC2000 benchmarks
  - CPU performance
  - Graphics
  - High-performance computing
  - Object-oriented computing
  - Java applications
  - Client-server models
  - Mail systems
  - File systems
  - Web servers
SPEC CPU2000 (CINT)
Benchmark    Language  Category
164.gzip     C         Compression
175.vpr      C         FPGA Placement/Route
176.gcc      C         C Compiler
181.mcf      C         Combinatorial Opt.
186.crafty   C         Chess
197.parser   C         Word Processing
252.eon      C++       Computer Visualization
253.perlbmk  C         PERL
254.gap      C         Group Theory, Interpreter
255.vortex   C         OO Database
256.bzip2    C         Compression
300.twolf    C         Place/Route Simulator
(http://www.spec.org/cpu2000)
SPEC CPU2000 (CFP)
Benchmark     Lang.  Category
168.wupwise   F77    Quantum Chromodynamics
171.swim      F77    Shallow Water Modeling
172.mgrid     F77    Multi-grid Solver
173.applu     F77    Parabolic/Elliptic PDE
177.mesa      C      3-D Graphics Library
178.galgel    F90    Computational Fluid Dynamics
179.art       C      Image Recognition/Neural Net
183.equake    C      Seismic Wave Propagation
187.facerec   F90    Image Processing
188.ammp      C      Computational Chemistry
189.lucas     F90    Number Theory
191.fma3d     F90    Finite-element Crash Simulation
200.sixtrack  F77    Nuclear Accelerator Designs
301.apsi      F77    Pollutant Distribution
Summarizing Performance
           Machine A  Machine B
Program 1  1 s        10 s
Program 2  1000 s     100 s
Total      1001 s     110 s
- A is 10 times faster than B for program 1
- B is 10 times faster than A for program 2
- What is the relative performance of A and B?
- Arithmetic mean (tracking total time), assuming programs 1 and 2 are run an equal number of times:
  $$\text{AM} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$
- Weighted arithmetic mean:
  $$\text{WAM} = \sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i$$
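Both summaries can be computed from the table. In this sketch the times are the slide's, while the weights (program 1 run 90% of the time, program 2 run 10%) are a hypothetical workload chosen only to show how the weighting changes the picture. The function names are illustrative.

```python
def arithmetic_mean(times):
    """Plain average; tracks total execution time for equal run counts."""
    return sum(times) / len(times)


def weighted_arithmetic_mean(times, weights):
    """Average in which each program counts according to its weight."""
    return sum(w * t for w, t in zip(weights, times))


times_a = [1.0, 1000.0]   # Machine A: program 1, program 2 (seconds)
times_b = [10.0, 100.0]   # Machine B

am_a = arithmetic_mean(times_a)
am_b = arithmetic_mean(times_b)

# Hypothetical weights: program 1 makes up 90% of the runs.
weights = [0.9, 0.1]
wam_a = weighted_arithmetic_mean(times_a, weights)
wam_b = weighted_arithmetic_mean(times_b, weights)

print(f"arithmetic mean: A = {am_a} s, B = {am_b} s, B is {am_a / am_b:.1f}x faster")
print(f"weighted mean:   A = {wam_a:.1f} s, B = {wam_b:.1f} s")
```

With equal weights B looks about 9.1 times faster, matching the ratio of the totals (1001 s / 110 s); a different workload mix changes that ratio.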
Reporting Performance
- Guiding principle: reproducible
  - List everything another experimenter would need to duplicate the results (especially the input set)
Fig. 4.3:
Hardware
  CPU             3.2-GHz Pentium 4 Extreme Edition
  L3 cache size   2048 KB (I+D) on chip
  Memory          4 x 512 MB
  Disk subsystem  1 x 80 GB ATA/100 7200 RPM
Software
  OS              Windows XP Professional SP1
  Compiler        Intel C++ Compiler 7.1
Two SPEC Benchmarks
- Compare the performance of recent Intel processors
- Performance with SPEC CPU benchmarks
  - CINT2000
  - CFP2000
  - Each SPEC ratio is a ratio of execution times
    - SPEC ratio(A, go) = time(Sun Ultra 5_10, go) / time(A, go)
- SPECweb99
  - A throughput benchmark for web servers
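A SPEC ratio is simply the reference machine's time divided by the measured machine's time. The sketch below also combines several ratios with a geometric mean, which is how SPEC CPU suites report an overall rating; that summary step is not stated on the slide. The function names and all times are hypothetical.

```python
import math


def spec_ratio(reference_time_s, measured_time_s):
    """Reference machine time divided by the time on the machine under test."""
    return reference_time_s / measured_time_s


def geometric_mean(ratios):
    """n-th root of the product of n ratios, computed via logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical (reference time, measured time) pairs for two benchmarks.
ratios = [spec_ratio(1400.0, 700.0), spec_ratio(1600.0, 200.0)]
rating = geometric_mean(ratios)

print(f"ratios: {ratios}, overall rating: {rating:.2f}")
```

A ratio above 1 means the machine under test is faster than the reference machine on that benchmark.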
[Fig. 4.6: SPEC CINT2000 and CFP2000 ratings for Pentium III and Pentium 4 at different clock rates]
Performance Per Clock Rate in MHz
- CPU time improvement
  - Advanced integrated circuit technology (clock rate)
  - Aggressive pipeline structure (clock rate, CPI)
  - Better ISA (CPI)
- Compute the performance per clock rate in MHz: the average value of the ratio across different clock rates
  Rate          Pentium III  Pentium 4
  CINT2000/MHz  0.47         0.36
  CFP2000/MHz   0.34         0.39
- CINT2000:
  - A fast version sacrifices one aspect (e.g., CPI) to enhance another (clock rate)
  - Assuming identical code, the CPI of the Pentium 4 relative to the Pentium III is 1.3
- CFP2000:
  - Pentium 4 provides new instructions
  - Instruction count and CPI for Pentium 4 are different
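The 1.3 figure follows from the CPU time formula. A SPEC rating is proportional to 1 / CPU time = clock rate / (Instruction Count x CPI), so the rating per MHz is proportional to 1 / (Instruction Count x CPI). With identical code the instruction counts cancel, leaving an inverse CPI ratio. The sketch below assumes exactly that model; the variable names are illustrative.

```python
# Rating per MHz from the slide's table.
cint_per_mhz = {"Pentium III": 0.47, "Pentium 4": 0.36}

# rating/MHz is proportional to 1 / (Instruction Count x CPI).
# Assumption: identical code, so the instruction counts cancel and
# CPI(P4) / CPI(PIII) = (rating/MHz of PIII) / (rating/MHz of P4).
cpi_ratio_p4_to_p3 = cint_per_mhz["Pentium III"] / cint_per_mhz["Pentium 4"]

print(f"CPI of Pentium 4 relative to Pentium III: {cpi_ratio_p4_to_p3:.1f}")
```

The same division is not meaningful for CFP2000, because there the Pentium 4 runs different instructions and the instruction counts no longer cancel.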
SPECweb99: Throughput for Web Servers
System  Processor    # disk drives  # CPUs  # networks  Clock (GHz)  Result
2500    Pentium III  8              2       4           1.13         3435
2550    Pentium III  1              2       1           1.26         1454
6600    Pentium 4    8              2                                6700
8450    Pentium III  7              8       8           0.7          8001
(Fig. 4.7)
Performance, Power (Fig. 4.8)
Performance, Energy: R E = performance / power (Fig. 4.9)
Summary: Performance
- Latency vs. throughput
- CPU time: time spent executing a single program; depends solely on the design of the processor (datapath, pipelining effectiveness, caches, etc.)
- Performance doesn't depend on any single factor: need to know Instruction Count, Clocks Per Instruction, and Clock Rate to get valid estimations
- Performance evaluation needs to consider:
  - Benchmark programs
  - Summarizing performance
  - Reporting performance results
Outline
- Performance
  - Definition
  - CPU performance formula
  - Measuring and evaluating performance
- Cost (Sec. 1.4)
  - Cost of chips
Chip Cost: Manufacturing Process (Fig. 1.14)
Cost of a Chip Includes...
- Die cost: affected by wafer cost, number of dies per wafer, and die yield (#good dies / #total dies)
  - goes roughly with the cube of the die area
  - An 8" wafer can contain 196 Pentium dies, but only 78 Pentium Pro dies
- Testing cost
- Packaging cost: depends on pins, heat dissipation, ...
Real World Examples
Chip         Metal layers  Line width  Wafer cost  Defect/cm2  Area (mm2)  Dies/wafer  Yield  Die cost
386DX        2             0.90        $900        1.0         43          360         71%    $4
486DX2       3             0.80        $1200       1.0         81          181         54%    $12
PowerPC 601  4             0.80        $1700       1.3         121         115         28%    $53
HP PA 7100   3             0.80        $1300       1.0         196         66          27%    $73
DEC Alpha    3             0.70        $1500       1.2         234         53          19%    $149
SuperSPARC   3             0.70        $1700       1.6         256         48          13%    $272
Pentium      3             0.80        $1500       1.5         296         40          9%     $417
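The die-cost column is consistent with the usual relation die cost = wafer cost / (dies per wafer x die yield), which uses the three factors named on the previous slide. The sketch below recomputes the column from the table's own figures; the function name and the data layout are illustrative.

```python
def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Cost per good die: wafer cost spread over the dies that work."""
    return wafer_cost / (dies_per_wafer * die_yield)


# (wafer cost in $, dies per wafer, yield, die cost in $ as listed)
chips = {
    "386DX": (900, 360, 0.71, 4),
    "486DX2": (1200, 181, 0.54, 12),
    "PowerPC 601": (1700, 115, 0.28, 53),
    "HP PA 7100": (1300, 66, 0.27, 73),
    "DEC Alpha": (1500, 53, 0.19, 149),
    "SuperSPARC": (1700, 48, 0.13, 272),
    "Pentium": (1500, 40, 0.09, 417),
}

computed = {
    name: die_cost(wafer, dies, yld)
    for name, (wafer, dies, yld, _listed) in chips.items()
}

for name, cost in computed.items():
    print(f"{name:12s} ${cost:7.2f} (listed: ${chips[name][3]})")
```

Larger dies are penalized twice: fewer of them fit on a wafer, and a smaller fraction of them work.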
Summary: Cost
- Integrated circuits are driving the computer industry
- Die cost goes up roughly with the cube of the die area
- Economics ($$$) is the ultimate driver for performance!