
Performance
CSCI 312 Computer Organization and Architecture, Fall 2019
Lecture note, Dr. Sajedul Talukder, 28 February

Basic Terminology
• Bits
  • The smallest unit of information in a computer
  • 0 or 1
• Bytes
  • 8 bits
  • E.g., 0100 1010₂ = 4A₁₆ = 74₁₀ = ‘J’
  • Each 4-bit group has place values 8 4 2 1: 0100 = 4 and 1010 = A, giving 4A
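The byte example can be checked with a short Python sketch. The variable names are illustrative only; the values come from the slide.

```python
# The slide's example byte, written three ways (binary, hex, decimal).
byte_binary = 0b0100_1010
byte_hex = 0x4A
byte_decimal = 74

# Split the byte into its two 4-bit nibbles; each nibble maps to one hex digit.
high_nibble = byte_binary >> 4       # 0100 -> 4
low_nibble = byte_binary & 0b1111    # 1010 -> 10, written 'A' in hex

# All three spellings denote the same value, whose ASCII character is 'J'.
same_value = byte_binary == byte_hex == byte_decimal
print(same_value, format(byte_binary, "08b"), format(byte_binary, "X"), chr(byte_binary))
```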

Basic Terminology (cont)
• KB
  • Kilobytes
  • E.g., 1 KB = 1,024 Bytes ≈ 1,000 Bytes
• MB
  • Megabytes
  • E.g., 1 MB = 1,048,576 Bytes ≈ 1,000,000 Bytes
• GB
  • Gigabytes
  • E.g., 1 GB = 1,073,741,824 Bytes ≈ 1,000,000,000 Bytes = 1 × 10⁹ Bytes
• TB
  • Terabytes
  • E.g., 1 TB = 1,024 GB ≈ 1,000 GB
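As a quick check of these sizes, the sketch below computes the exact power-of-two byte counts next to their decimal approximations. It assumes the binary convention used on the slide (1 KB = 2¹⁰ bytes); the dictionary names are illustrative.

```python
# Exponent of 2 for each unit under the binary convention (assumed here).
unit_exponents = {"KB": 10, "MB": 20, "GB": 30, "TB": 40}

# Exact byte counts and the decimal approximations (10^3, 10^6, 10^9, 10^12).
exact_bytes = {name: 2 ** exp for name, exp in unit_exponents.items()}
approx_bytes = {name: 10 ** (3 * exp // 10) for name, exp in unit_exponents.items()}

for name in unit_exponents:
    excess = (exact_bytes[name] - approx_bytes[name]) / approx_bytes[name]
    print(f"1 {name} = {exact_bytes[name]:,} bytes, approx {approx_bytes[name]:,} ({excess:.1%} larger)")
```

The gap between the exact and approximate values grows with each unit, which is why the approximation is looser for TB than for KB.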

Understanding Performance
• Algorithm
  • Determines number of operations executed
• Programming language, compiler, architecture
  • Determine number of machine instructions executed per operation
• Processor and memory system
  • Determine how fast instructions are executed
• I/O system (including OS)
  • Determines how fast I/O operations are executed

Defining Performance
• Let’s suppose we define performance in terms of speed.

Defining Performance
• Which airplane has the best performance?

CPU Execution


Response Time and Throughput
• Response time
  • How long it takes to do a task
• Throughput
  • Total work done per unit time
  • e.g., tasks/transactions/… per hour
• How are response time and throughput affected by
  • Replacing the processor with a faster version?
  • Adding more processors?
• We’ll focus on response time for now…

Quick Question
• Decreasing response time almost always improves throughput.
• Hence, in case 1 (replacing the processor with a faster version), both response time and throughput are improved.
• In case 2 (adding more processors), no one task gets work done faster, so only throughput increases.
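A toy model can make the two cases concrete. This is only a sketch under strong assumptions that are not in the slides: tasks are independent, every processor always has a task to run, and the baseline numbers (10 s per task, one processor) are made up for illustration.

```python
def response_time(task_seconds: float) -> float:
    """Time for a single task to finish once it starts running."""
    return task_seconds

def throughput(task_seconds: float, processors: int) -> float:
    """Tasks completed per second when every processor stays busy."""
    return processors / task_seconds

# Hypothetical baseline: one processor, 10 s per task.
baseline = (response_time(10.0), throughput(10.0, 1))
# Case 1: a processor twice as fast (5 s per task).
faster_cpu = (response_time(5.0), throughput(5.0, 1))
# Case 2: two of the original processors.
more_cpus = (response_time(10.0), throughput(10.0, 2))

for label, (resp, thr) in [("baseline", baseline), ("case 1", faster_cpu), ("case 2", more_cpus)]:
    print(f"{label}: response time = {resp:.1f} s, throughput = {thr:.2f} tasks/s")
```

In this model both cases double throughput, but only case 1 shortens the response time of an individual task.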

Relative Performance
• Define Performance = 1/Execution Time
• “X is n times faster than Y” means
  • Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n

Relative Performance
• Solution: time taken to run a program
  • 10 s on A, 15 s on B
  • Execution Time_B / Execution Time_A = 15 s / 10 s = 1.5
  • So A is 1.5 times faster than B
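The same calculation as a minimal Python sketch. The function name `times_faster` is hypothetical; it simply applies Performance = 1/Execution Time to both machines.

```python
def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that X is n times faster than Y.

    Performance = 1 / execution time, so
    n = performance_x / performance_y = exec_time_y / exec_time_x.
    """
    return exec_time_y / exec_time_x

# Slide example: 10 s on A, 15 s on B.
n = times_faster(exec_time_x=10.0, exec_time_y=15.0)
print(f"A is {n} times faster than B")
```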

Measuring Execution Time
• Elapsed time
  • Total response time, including all aspects
  • Processing, I/O, OS overhead, idle time
  • Determines system performance
• CPU time
  • Time spent processing a given job
  • Discounts I/O time, other jobs’ shares
  • Comprises user CPU time and system CPU time
  • Different programs are affected differently by CPU and system performance
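One way to observe the difference between elapsed time and CPU time is with Python's standard `time` module. This is a sketch, not a benchmarking method: `measure`, `mostly_waiting`, and `mostly_computing` are illustrative names, and the exact numbers printed depend on the machine.

```python
import time

def measure(task):
    """Run task() once and return (elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()    # wall-clock (elapsed) time
    cpu_start = time.process_time()     # user + system CPU time of this process
    task()
    cpu_seconds = time.process_time() - cpu_start
    elapsed_seconds = time.perf_counter() - wall_start
    return elapsed_seconds, cpu_seconds

def mostly_waiting():
    time.sleep(0.2)                     # idle time: counts as elapsed, not CPU

def mostly_computing():
    sum(i * i for i in range(200_000))  # CPU-bound work

for task in (mostly_waiting, mostly_computing):
    elapsed, cpu = measure(task)
    print(f"{task.__name__}: elapsed = {elapsed:.3f} s, CPU = {cpu:.3f} s")
```

The sleeping task should show an elapsed time near 0.2 s with very little CPU time, while the compute-bound task should show the two values close together.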

CPU Clock
• A crystal oscillator is an electronic oscillator circuit that uses the mechanical resonance of a vibrating crystal of piezoelectric material to create an electrical signal with a precise frequency, providing a stable clock signal for digital integrated circuits.
• Operation of digital hardware is governed by a constant-rate clock
• For example, a 200 MHz CPU receives 200 million pulses per second
[Figure: crystal oscillator]

CPU Clocking
• Operation of CPU is governed by a constant-rate clock
[Figure: clock timing diagram showing the clock period, clock cycles, data transfer and computation, and state update]
• Clock period: duration of a clock cycle
  • e.g., 250 ps = 0.25 ns = 250 × 10⁻¹² s
• Clock frequency (rate): cycles per second
  • e.g., 4.0 GHz = 4000 MHz = 4.0 × 10⁹ Hz
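Clock period and clock rate are reciprocals, so the two examples on this slide describe the same clock. A minimal sketch (function names are illustrative):

```python
def clock_rate_hz(period_seconds: float) -> float:
    """Clock frequency in Hz for a given clock period in seconds."""
    return 1.0 / period_seconds

def clock_period_seconds(rate_hz: float) -> float:
    """Clock period in seconds for a given clock frequency in Hz."""
    return 1.0 / rate_hz

# 250 ps period corresponds to a 4.0 GHz clock.
print(f"{clock_rate_hz(250e-12) / 1e9:.1f} GHz")
# A 200 MHz clock has a 5 ns period.
print(f"{clock_period_seconds(200e6) * 1e9:.1f} ns")
```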

CPU Time
• Performance improved by
  • Reducing number of clock cycles
  • Increasing clock rate
• Hardware designer must often trade off clock rate against cycle count
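The slide's equation did not survive extraction. The standard relation between CPU time, clock cycles, and clock rate, which is consistent with the two bullets above, is assumed to be what it showed:

```latex
\[
\text{CPU Time}
  = \text{CPU Clock Cycles} \times \text{Clock Cycle Time}
  = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}
\]
```

Fewer clock cycles or a higher clock rate both reduce CPU time, which is why the two often have to be traded against each other.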

CPU Time Example


CPU Time Example
• Computer A: 2 GHz clock, 10 s CPU time
• Designing Computer B
  • Aim for 6 s CPU time
  • Can use a faster clock, but this causes 1.2× as many clock cycles
• How fast must Computer B’s clock be?
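The worked solution is not in the extracted text. The sketch below solves the problem using CPU Time = Clock Cycles / Clock Rate; the variable names are illustrative and the inputs are the slide's numbers.

```python
clock_rate_a = 2e9          # Hz (2 GHz)
cpu_time_a = 10.0           # seconds on Computer A
cpu_time_b_target = 6.0     # desired seconds on Computer B
cycle_factor_b = 1.2        # B needs 1.2x as many clock cycles as A

# Clock cycles the program takes on A: time x rate.
clock_cycles_a = cpu_time_a * clock_rate_a
# B executes 1.2x as many cycles for the same program.
clock_cycles_b = cycle_factor_b * clock_cycles_a
# Rate B needs to finish those cycles in the target time.
clock_rate_b = clock_cycles_b / cpu_time_b_target

print(f"Computer B needs a {clock_rate_b / 1e9:.1f} GHz clock")
```

Under these numbers Computer B needs a clock twice as fast as A's to gain a 10 s to 6 s improvement, because part of the faster clock is spent on the extra cycles.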

Instruction Count and CPI: clock cycles per instruction
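The equations for this slide were lost in extraction. The standard relations between instruction count (IC), CPI, and CPU time, which match the symbols used elsewhere in these notes, are assumed to be what it showed:

```latex
\[
\text{Clock Cycles} = \text{Instruction Count} \times \text{CPI}
\]
\[
\text{CPU Time}
  = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}
  = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}
\]
```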


CPI Example


CPI Example
• Computer A: Cycle Time = 250 ps, CPI = 2.0
• Computer B: Cycle Time = 500 ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?
  • CPU Time_A = IC × 2.0 × 250 ps = IC × 500 ps
  • CPU Time_B = IC × 1.2 × 500 ps = IC × 600 ps
  • A is faster, by a factor of CPU Time_B / CPU Time_A = 600/500 = 1.2

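CPI in More Detail
• If different instruction classes take different numbers of cycles, total clock cycles are the sum over all classes of (CPI of the class × instruction count of the class)
• Weighted average CPI: total clock cycles divided by total instruction count, which weights each class’s CPI by its relative frequency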
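In symbols, with n instruction classes and the subscript i indexing a class (notation assumed here, since the slide's equations were not extracted):

```latex
\[
\text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{IC}_i \right)
\]
\[
\text{CPI}
  = \frac{\text{Clock Cycles}}{\text{IC}}
  = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{IC}_i}{\text{IC}} \right)
\]
```

The ratio IC_i / IC is the relative frequency of class i.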

CPI Example
• Alternative compiled code sequences using instructions in classes A, B, C

  Class               A    B    C    Total
  CPI for class       1    2    3
  IC in sequence 1    2    1    2    2+1+2 = 5 inst.
  IC in sequence 2    4    1    1    4+1+1 = 6 inst.

• Sequence 1: IC = 5
  • Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  • Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
  • Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  • Avg. CPI = 9/6 = 1.5
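The two sequences can be recomputed with a short sketch. The dictionary layout and the `summarize` helper are illustrative choices; the per-class CPI values and instruction counts are the slide's.

```python
# Cycles per instruction for each instruction class (from the table).
cpi_per_class = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the two compiled sequences.
sequence_1 = {"A": 2, "B": 1, "C": 2}
sequence_2 = {"A": 4, "B": 1, "C": 1}

def summarize(instruction_counts):
    """Return (instruction count, clock cycles, average CPI) for a sequence."""
    total_ic = sum(instruction_counts.values())
    clock_cycles = sum(cpi_per_class[cls] * count
                       for cls, count in instruction_counts.items())
    return total_ic, clock_cycles, clock_cycles / total_ic

for name, seq in [("Sequence 1", sequence_1), ("Sequence 2", sequence_2)]:
    ic, cycles, avg_cpi = summarize(seq)
    print(f"{name}: IC = {ic}, clock cycles = {cycles}, avg CPI = {avg_cpi}")
```

Sequence 2 executes more instructions but fewer clock cycles, so instruction count alone does not decide which sequence is faster.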

Performance Summary
The BIG Picture
• Performance depends on
  • Algorithm: affects IC, possibly CPI
  • Programming language: affects IC, CPI
  • Compiler: affects IC, CPI
  • Instruction set architecture: affects IC, CPI, Tc
• Here IC is the instruction count, CPI is clock cycles per instruction, and Tc is the clock cycle time

Power Trends (§1.7 The Power Wall)
[Figure: power trends; labels: more complex pipeline, simpler pipeline, Core 2]
• In CMOS IC technology, the primary energy consumption is dynamic energy: transistors switching on→off and off→on, controlled by the clock frequency
• Trend annotations: dynamic power ×30, voltage 5 V → 1 V, clock frequency ×1000
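The power equation itself is missing from the extracted slide. The usual proportionality for dynamic power in CMOS, assumed here to be what the slide showed, is:

```latex
\[
\text{Power} \propto \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}
\]
```

Because voltage enters squared, lowering the supply voltage is the most effective of the three levers.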

Relative Power


Reducing Power
• Suppose a new CPU has
  • 85% of capacitive load of old CPU
  • 15% voltage and 15% frequency reduction
• The power wall
  • We can’t reduce voltage further
  • We can’t remove more heat
• How to improve overall performance?
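The power ratio for the new CPU can be sketched as follows. It assumes the usual CMOS relation that dynamic power is proportional to capacitive load × voltage² × frequency, which is not stated in the extracted text; the variable names are illustrative.

```python
# Scaling factors of the new CPU relative to the old one (from the slide).
capacitive_load_ratio = 0.85   # 85% of the old capacitive load
voltage_ratio = 0.85           # 15% voltage reduction
frequency_ratio = 0.85         # 15% frequency reduction

# Assumed model: dynamic power ~ capacitive load x voltage^2 x frequency.
power_ratio = capacitive_load_ratio * voltage_ratio ** 2 * frequency_ratio

print(f"New CPU uses about {power_ratio:.2f}x the power of the old CPU")
```

Under this model the new CPU draws roughly half the power, since the ratio is 0.85 raised to the fourth power.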

Problem


Pitfall: Amdahl’s Law
• Improving an aspect of a computer and expecting a proportional improvement in overall performance
• Example: multiply accounts for 80 s/100 s
  • How much improvement in multiply performance to get 5× overall?
  • Can’t be done!
• Corollary: make the common case fast
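The multiply example can be checked numerically. The sketch uses the standard form of Amdahl's Law, improved time = affected time / improvement factor + unaffected time, which is assumed to be the equation missing from the slide; both function names are hypothetical.

```python
def improved_time(t_affected: float, t_unaffected: float, factor: float) -> float:
    """Execution time after speeding up the affected part by `factor`."""
    return t_affected / factor + t_unaffected

def required_factor(t_affected: float, t_unaffected: float, target_time: float):
    """Improvement factor needed to reach target_time, or None if impossible."""
    remaining = target_time - t_unaffected  # time budget left for the affected part
    if remaining <= 0:
        return None                         # unaffected part alone uses the whole budget
    return t_affected / remaining

total, multiply = 100.0, 80.0
other = total - multiply                    # 20 s not affected by faster multiply

# 5x overall means finishing in 20 s, but the unaffected part already takes 20 s.
print(required_factor(multiply, other, total / 5))   # None: can't be done
# 4x overall (25 s) is still reachable, with a 16x faster multiply.
print(required_factor(multiply, other, total / 4))   # 16.0
```

Even an infinitely fast multiply leaves the 20 s of other work, so the overall speedup can approach 5× but never reach it.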

Concluding Remarks
• Cost/performance is improving
  • Due to underlying technology development
• Hierarchical layers of abstraction
  • In both hardware and software
• Instruction set architecture
  • The hardware/software interface
• Execution time: the best performance measure
• Power is a limiting factor
  • Use parallelism to improve performance

Questions?