EEL 5708 High Performance Computer Architecture Lecture 1

EEL 5708 High Performance Computer Architecture
Lecture 1: Introduction
August 21, 2006
Lotzi Bölöni
Fall 2006 EEL 5708/Bölöni Lec 1.1

Acknowledgements
• All the lecture slides were adapted from the slides of David Patterson (1998, 2001) and David E. Culler (2001), Copyright 1998-2002, University of California, Berkeley

Case 1: VIA KT266 chipset for the Athlon processors

Take 1: April 4, 2001
• Tom’s Hardware (www.tomshardware.com), a web site for hardware enthusiasts.
• Review of the VIA Apollo KT266 chipset.
• http://www17.tomshardware.com/mainboard/01q2/010409/kt266-10.html
• The website’s conclusion: "KT266 is still way too slow to challenge or even replace AMD's 760 chipset. As a conclusion, I could maybe say the typical words always used in early reviews 'let's hope VIA will finally improve KT266'. However, I have my doubts if this will happen any time soon. My advice to you is to either forget about DDR altogether for the time being, or to go for Athlon plus AMD 760 and NOTHING ELSE."

Take 2: One week later…
• Article title: “VIA Apollo KT266 revisited: Much Ado About Nothing” (http://www17.tomshardware.com/mainboard/01q2/010416/index.html)
• Another website (www.anandtech.com) obtains different results.
• The difference: an additional resistor (!) mounted on the motherboard and a different BIOS.
• Tom’s Hardware concludes that there are indeed improvements, but they are not significant enough to change the conclusion.

Take 3: Five months later (September 2001)
• VIA KT266A is launched
• Tom’s Hardware: “‘A’ stands for vastly improved performance” (http://www17.tomshardware.com/mainboard/01q3/010902/index.html)
• Changes: “improvements” to the memory controller.
• Processor frequency, bus frequency, etc. stay the same. Pin-by-pin compatible with the predecessors!
• Conclusion: “The performance of Apollo KT266A is nothing short of impressive.”

Synthetic benchmarks: [chart]

Real world benchmarks: [chart]

Some conclusions
• “Architecture” matters.
• Real-world benchmarks show less improvement than synthetic ones: Amdahl’s Law
• Which benchmark do I care about? (this time at least, they were consistent…)
• …
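A minimal sketch of why Amdahl's Law explains the gap between synthetic and real-world results. The fractions and the 1.5x memory speedup below are hypothetical, chosen only for illustration; they are not measurements from the reviews.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the run time is accelerated."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # The unenhanced part keeps its time; the enhanced part shrinks.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical workloads: the share of run time spent waiting on memory.
workloads = [
    ("synthetic memory benchmark", 0.9),
    ("real-world application", 0.3),
]

for name, memory_fraction in workloads:
    # Assume (hypothetically) the new memory controller is 1.5x faster.
    overall = amdahl_speedup(memory_fraction, 1.5)
    print(f"{name}: overall speedup {overall:.2f}x")
```

The same hardware improvement yields a much smaller overall gain when the improved component accounts for a smaller share of the run time.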

Case 2: Video compression performance in Intel Pentium 4 vs. AMD Athlon

Take 1 (11/20/00): First impressions
• Intel Pentium 4 is launched.
• The initial measurements show that it greatly outperforms the AMD Athlon for MPEG-4 video compression.
• http://www6.tomshardware.com/cpu/00q4/001120/index.html

Take 1 (11/20/00): First impressions (cont’d) [chart]

Take 2: New results force new conclusions
• Concerns are raised about the fact that the measurement was done with a low quality setting (MMX arithmetic).
• When the measurements were repeated with floating point arithmetic, the relative performance was reversed.
• http://www6.tomshardware.com/cpu/00q4/001122/index.html

Take 2: New results force new conclusions (cont’d) [chart]

Take 3: Intel engineers create an optimized version of the software
• As a response, Intel engineers created a modified version of the software:
  – recompiled it with higher optimization levels
  – rewrote parts of the code to use the new instruction set extensions (SSE2)
• The higher optimization levels benefited both Intel and AMD processors (but Intel more).
• The SSE2 options reversed the performance ranking again.
• Note: AMD engineers created an AMD-optimized version, too, with significant improvements, but this did not change the rankings.

Take 3: Intel engineers create an optimized version of the software [chart]

Take 3 (cont’d) [chart]

Case 2: Conclusions
• Real world benchmark, huge differences
  – Why?
• Software solution to a hardware problem?
  – Optimizing for the architecture
  – So, what if it is not open source?
  – Software development cycles…
• Picking the right architecture + understanding the architecture we have

Review: Measuring performance

Performance measures
• Time to execute a given program
• Number of programs which can be run in parallel
• Responsiveness (user interfaces)
• Predictable execution time (for real time systems)
• Energy consumption (mostly for portables, but check the new Google and Microsoft data centers…)
• And so on…

Which is faster? (Latency vs throughput)

Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200

• Time to run the task (Ex. Time)
  – Execution time, response time, latency
• Tasks per day, hour, week, sec, ns … (Performance)
  – Throughput, bandwidth
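The throughput column can be reproduced as passengers times speed, reading "pmph" as passenger-miles per hour (an interpretation of the abbreviation, which the slide does not spell out). A small sketch using only the figures from the table; the dictionary layout and function name are illustrative choices:

```python
# Figures taken from the table above.
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}


def throughput_pmph(plane: dict) -> int:
    """Passenger-miles per hour: how much 'work' the plane does per hour."""
    return plane["mph"] * plane["passengers"]


for name, plane in planes.items():
    print(f"{name}: latency {plane['hours']} h, "
          f"throughput {throughput_pmph(plane):,} pmph")

# The Concorde has the lower latency; the 747 has the higher throughput.
lowest_latency = min(planes, key=lambda n: planes[n]["hours"])
highest_throughput = max(planes, key=lambda n: throughput_pmph(planes[n]))
print("Lowest latency:", lowest_latency)
print("Highest throughput:", highest_throughput)
```

Which plane is "faster" therefore depends on whether the metric is the time for one trip or the amount of work done per unit time.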

Definitions
• Performance is in units of things per sec
  – bigger is better
• If we are primarily concerned with response time
  – \[ \text{performance}(x) = \frac{1}{\text{execution\_time}(x)} \]

"X is n times faster than Y" means

\[ n = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} \]
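As a quick check of the definition, the sketch below computes n from two execution times. The function names are illustrative, and the sample values are the DC to Paris flight times from the earlier plane comparison (3 hours and 6.5 hours).

```python
def performance(execution_time: float) -> float:
    """Performance as the reciprocal of execution time."""
    if execution_time <= 0:
        raise ValueError("execution_time must be positive")
    return 1.0 / execution_time


def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that 'X is n times faster than Y'."""
    # Dividing performances is the same as dividing the execution
    # times in the opposite order.
    return performance(exec_time_x) / performance(exec_time_y)


# X = Concorde (3 hours), Y = Boeing 747 (6.5 hours).
n = times_faster(3.0, 6.5)
print(f"X is {n:.2f} times faster than Y")
```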

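Computer Performance

\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \underbrace{\frac{\text{Instructions}}{\text{Program}}}_{\text{inst count}} \times \underbrace{\frac{\text{Cycles}}{\text{Instruction}}}_{\text{CPI}} \times \underbrace{\frac{\text{Seconds}}{\text{Cycle}}}_{\text{cycle time}} \]

               Inst Count   CPI   Clock
Program        X
Compiler       X            (X)
Inst. Set      X            X
Organization                X     X
Technology                        X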
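A small sketch of the CPU time equation. All numbers are hypothetical (they do not describe any machine from the lecture) and serve only to show how the three factors combine.

```python
def cpu_time(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = instructions/program x cycles/instruction x seconds/cycle."""
    return instruction_count * cpi * cycle_time_s


# Hypothetical baseline: 1e9 instructions, CPI of 1.5, 1 GHz clock (1 ns cycle).
baseline = cpu_time(1e9, 1.5, 1e-9)

# Hypothetical compiler change: 10% fewer instructions, slightly higher CPI.
recompiled = cpu_time(0.9e9, 1.6, 1e-9)

print(f"baseline:   {baseline:.3f} s")
print(f"recompiled: {recompiled:.3f} s")
```

Because the factors multiply, a change that improves one of them (here, instruction count) only helps if it does not worsen another (here, CPI) by a larger ratio.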

Cycles Per Instruction (Throughput)
“Average Cycles per Instruction”
CPI = (CPU Time × Clock Rate) / Instruction Count = Cycles / Instruction Count
“Instruction Frequency”

Example: Calculating CPI bottom up

Base Machine (Reg / Reg)

Op       Freq   Cycles   CPI(i)   (% Time)
ALU      50%    1        .5       (33%)
Load     20%    2        .4       (27%)
Store    10%    2        .2       (13%)
Branch   20%    2        .4       (27%)
Total                    1.5

Typical mix of instruction types in program
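The table can be reproduced by weighting each instruction class's cycle count by its frequency and summing. A minimal sketch using the values from the table; the variable and function names are illustrative:

```python
# (operation, frequency in the instruction mix, cycles per instruction)
mix = [
    ("ALU", 0.50, 1),
    ("Load", 0.20, 2),
    ("Store", 0.10, 2),
    ("Branch", 0.20, 2),
]


def average_cpi(mix) -> float:
    """Average CPI: sum over classes of frequency x cycles."""
    return sum(freq * cycles for _, freq, cycles in mix)


def time_shares(mix) -> dict:
    """Fraction of total execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cycles / total for op, freq, cycles in mix}


print(f"CPI = {average_cpi(mix):.2f}")
for op, share in time_shares(mix).items():
    print(f"{op:<7} {share:.0%} of time")
```

Note that ALU operations make up half of the instructions but only about a third of the execution time, because the other classes take two cycles each.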