Execution time
• Execution Time (processor-related) = IC x CPI x T
  IC = instruction count
  CPI = average number of system clock periods to execute an instruction
  T = clock period

Review

CS 501 Advanced Computer Architecture Lecture 03 Dr. Noor Muhammad Sheikh

Example
Consider two SRC programs having three types of instructions, given as follows.

Number of...                  Program 1   Program 2
Data transfer instructions        2           1
Control instructions              2           5
ALSU instructions                 2           1

Compare both the programs for the following parameters:
1. Instruction count
2. Speed of execution

Example contd.
1. Instruction count (IC)
   IC for program 1 = 2 + 2 + 2 = 6
   IC for program 2 = 1 + 5 + 1 = 7
2. Speed of execution
   For execution time we can use the following SRC specifications.

   Instruction Type    CPI
   Control              2
   ALSU                 3
   Data Transfer        4

   ET = IC x CPI x T
   ET1 = (2 x 2) + (2 x 3) + (2 x 4) = 18
   ET2 = (5 x 2) + (1 x 3) + (1 x 4) = 17
   Program 2 has the larger instruction count but the shorter execution time.
   Note: Since both programs are executing on the same machine, the T factor can be ignored while calculating ET.
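
The arithmetic in this example can be checked with a short Python sketch. It is only an illustration: the dictionary keys and function names are assumptions made here, while the CPI values and instruction mixes are the ones used in the example. The result is in clock cycles, so ET is that number multiplied by T.

```python
# CPI per instruction type, as given in the SRC specification table.
CPI = {"control": 2, "alsu": 3, "data_transfer": 4}


def instruction_count(mix):
    """Total number of instructions in an instruction mix."""
    return sum(mix.values())


def cycles(mix, cpi=CPI):
    """Total clock cycles: sum over types of (count x CPI). ET = cycles x T."""
    return sum(count * cpi[kind] for kind, count in mix.items())


# Instruction mixes of the two programs in the example.
program_1 = {"data_transfer": 2, "control": 2, "alsu": 2}
program_2 = {"data_transfer": 1, "control": 5, "alsu": 1}

for name, mix in (("Program 1", program_1), ("Program 2", program_2)):
    print(name, "IC =", instruction_count(mix), "ET =", cycles(mix), "T")
```

Because the instruction types have different CPI values, the program with more instructions is not necessarily the slower one.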

Problem: Consider the following SRC code segments for implementing the operation a = b + 5c. Find which one is more efficient in terms of instruction count and execution time.

Program 1: Multiplication by using repeated addition in a for loop

        .org 0
a:      .dw 1
b:      .dw 1
c:      .dw 1
        .org 80
        la   r5, 5       ; load value of loop
        lar  r6, mpy     ; load address of mpy
        lar  r7, next    ; load address of next
        ld   r2, b       ; load contents of b
        ld   r3, c       ; load contents of c
        la   r4, 0       ; load 0 in r4
mpy:    brzr r7, r5      ; jump to next after 5 iterations
        add  r4, r3      ; r4 contains r4+c
        addi r5, -1      ; decrement index
        br   r6          ; loop again
next:   add  r4, r2      ; r4 contains sum of b and 5c
        st   r4, a       ; store at address a
        stop

Problem: Consider the following two SRC code segments for implementing the operation a = b + 5c. Find which one is more efficient in terms of instruction count and execution time.

Program 2: Multiplication using subroutine call

        .org 0
a:      .dw 1
b:      .dw 1
c:      .dw 1
        .org 80
        lar  r1, mpy     ; load address of mpy in r1
        ld   r2, b       ; load contents of b in r2
        la   r3, 5       ; load index in r3
        ld   r4, c       ; load contents of c in r4
        brl  r5, r1      ; r5 contains PC
        add  r2, r7      ; r2 contains sum b+5c
        st   r2, a
        stop
mpy:    la   r7, 0       ; r7 contains zero
        lar  r8, again   ; r8 contains address of again
again:  brzr r5, r3      ; exit loop when index is 0
        add  r7, r4      ; r7 contains r7+c
        addi r3, -1      ; decrement index
        br   r8

Solution
The instructions in both programs can be divided into 3 types, and the respective count of each type is

Number of...                  Program 1   Program 2
Data transfer instructions        7           7
Control instructions              3           4
ALSU instructions                 3           3

IC for program 1 = 7 + 3 + 3 = 13
IC for program 2 = 7 + 4 + 3 = 14

Solution contd.
For execution time, consider the following SRC specifications.

Instruction Type    CPI
Control              2
ALSU                 3
Data Transfer        4

ET = IC x CPI x T
ET1 = (7 x 4) + (3 x 2) + (3 x 3) = 43 T
ET2 = (7 x 4) + (4 x 2) + (3 x 3) = 45 T

Conclusion: Program 1 runs faster than program 2, as is obvious from the execution times of both.
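
The tally behind these figures can be reproduced with a small Python sketch. The mapping from mnemonic to instruction type is an assumption inferred from the counts in the solution (loads, stores and address loads as data transfer; branches and stop as control; add and addi as ALSU), and the names `TYPE_OF`, `tally` and `cycles` are hypothetical. Like the solution, it counts each instruction once as it appears in the listing, not once per loop iteration.

```python
from collections import Counter

# Assumed classification of SRC mnemonics, inferred from the solution's counts.
TYPE_OF = {
    "la": "data_transfer", "lar": "data_transfer",
    "ld": "data_transfer", "st": "data_transfer",
    "br": "control", "brzr": "control", "brl": "control", "stop": "control",
    "add": "alsu", "addi": "alsu",
}

# CPI per instruction type, from the SRC specification table.
CPI = {"control": 2, "alsu": 3, "data_transfer": 4}

# Mnemonics of the two programs, in listing order.
program_1 = "la lar lar ld ld la brzr add addi br add st stop".split()
program_2 = "lar ld la ld brl add st stop la lar brzr add addi br".split()


def tally(mnemonics):
    """Count the instructions of each type in a listing."""
    return Counter(TYPE_OF[m] for m in mnemonics)


def cycles(mnemonics):
    """Clock cycles for the listing: sum of (count x CPI). ET = cycles x T."""
    return sum(n * CPI[kind] for kind, n in tally(mnemonics).items())


for name, prog in (("Program 1", program_1), ("Program 2", program_2)):
    print(name, dict(tally(prog)), "IC =", len(prog), "ET =", cycles(prog), "T")
```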

MIPS
• Millions of Instructions Per Second = IC / (ET x 10^6)
• Capability of different instructions varies from machine to machine, e.g. RISC machines have simpler instructions, so the same job will require more instructions
• Was popular when the VAX 11/780 was treated as a reference, in the late 70s and early 80s

MIPS as a performance metric
• MIPS is inversely proportional to execution time: ET = IC / (MIPS x 10^6)

Example
Consider a machine having a 100 MHz clock and three instruction types with the following parameters.

Instruction Type    CPI
Control              2
ALSU                 3
Data Transfer        4

Now suppose that two different compilers generate code for the same program. The instruction count for each is given as follows.

IC in millions      Code from compiler 1   Code from compiler 2
Control                      5                     10
ALSU                         1                      1
Data Transfer                1                      1

Compare the two codes according to MIPS and according to execution time.

Solution: First we find the CPI for both code sequences.
Since CPI = (total clock cycles over all instruction types) / IC
CPI1 = (5 x 2 + 1 x 3 + 1 x 4) / 7 = 2.43
CPI2 = (10 x 2 + 1 x 3 + 1 x 4) / 12 = 2.25

As MIPS = IC / (ET x 10^6)
        = (IC x clock rate) / (IC x CPI x 10^6)
        = Clock rate / (CPI x 10^6)
MIPS1 = 100 x 10^6 / (2.43 x 10^6) = 41.15
MIPS2 = 100 x 10^6 / (2.25 x 10^6) = 44.44

Hence the code generated by compiler 2 has the higher MIPS rating.

Solution contd.
Since ET = IC / (MIPS x 10^6)
ET1 = (7 x 10^6) / (41.15 x 10^6) = 0.17 seconds
ET2 = (12 x 10^6) / (44.44 x 10^6) = 0.27 seconds

Hence code sequence 1 is much more efficient in terms of execution time, despite its lower MIPS rating.
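
The whole comparison can be replayed in a few lines of Python. This is a sketch using the numbers from the example; the function and variable names are assumptions made here. It works with the unrounded CPI, so MIPS for compiler 1 comes out near 41.18 rather than the 41.15 obtained from the rounded value 2.43; the execution times are unaffected.

```python
CLOCK_RATE_HZ = 100_000_000  # 100 MHz clock

# CPI per instruction type, from the example.
CPI = {"control": 2, "alsu": 3, "data_transfer": 4}

# Instruction counts produced by each compiler (the example gives them in millions).
compiler_1 = {"control": 5_000_000, "alsu": 1_000_000, "data_transfer": 1_000_000}
compiler_2 = {"control": 10_000_000, "alsu": 1_000_000, "data_transfer": 1_000_000}


def average_cpi(mix):
    """Average CPI = total clock cycles / instruction count."""
    total_cycles = sum(n * CPI[kind] for kind, n in mix.items())
    return total_cycles / sum(mix.values())


def mips(mix):
    """MIPS = clock rate / (CPI x 10^6)."""
    return CLOCK_RATE_HZ / (average_cpi(mix) * 1e6)


def execution_time(mix):
    """ET = IC x CPI x T, with T = 1 / clock rate, in seconds."""
    return sum(mix.values()) * average_cpi(mix) / CLOCK_RATE_HZ


for name, mix in (("compiler 1", compiler_1), ("compiler 2", compiler_2)):
    print(f"{name}: CPI = {average_cpi(mix):.2f}, "
          f"MIPS = {mips(mix):.2f}, ET = {execution_time(mix):.2f} s")
```

Compiler 2 wins on MIPS only because it adds cheap control instructions, which lowers the average CPI while raising the total cycle count.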

MFLOPS
• Millions of FLoating point Operations Per Second
• Using FP operations makes more sense to some compared to using just any instructions
• Results vary from FP op to FP op
• Better compared to MIPS because of two reasons:

2 reasons
1. FP ops are complex and therefore provide a better picture of the capabilities of the hardware on which they are run
2. Overheads (get operands, store results, etc.) are effectively lumped with the FP ops they support

Dhrystones ***
• Dhrystone is a general “integer performance” benchmark test originally developed by Reinhold Weicker in 1984.
• Small program; less than 100 HLL statements
• Compiles to about 1 to 1.5 KB of code
*** The name is a play on the word Whetstone

Disadvantages of using Whetstones and Dhrystones
Both Whetstones and Dhrystones are now considered obsolete for the following reasons.
§ Small, fit in cache
§ Obsolete instruction mix
§ Prone to compiler tricks
§ Difficult to reproduce results
§ Uncontrolled source code

SPEC
• System Performance Evaluation Cooperative (SPEC) was founded in October 1988 by Apollo, Hewlett-Packard, MIPS Computer Systems and SUN Microsystems
• Latest version is SPEC CPU 2000

SPEC
• The standard SPEC benchmark suite includes:
  § A compiler
  § A Boolean minimization program
  § A spreadsheet program
  § A number of other programs that stress arithmetic processing speed
• It uses a simple metric, elapsed time, to measure performance of competing machines
• Machine independent code is used for fair comparison

Advantages
• It provides for ease of publication.
• Each benchmark carries the same weight.
• SPECratio is dimensionless.
• It is not unduly influenced by long running programs.
• It is relatively immune to performance variation on individual benchmarks.
• It provides a consistent and fair metric.
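
Several of these properties follow from how SPECratios are usually combined. The slides do not spell that out, so the Python sketch below assumes the common SPEC CPU convention: each SPECratio is the reference machine's time divided by the measured time, and the overall figure is the geometric mean of the ratios. All timings are hypothetical values chosen for illustration, and the function names are assumptions made here.

```python
import math


def spec_ratio(reference_time, measured_time):
    """Dimensionless ratio: how many times faster than the reference machine."""
    return reference_time / measured_time


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (reference time, measured time) pairs in seconds, one per benchmark.
timings = [(1400.0, 700.0), (1800.0, 450.0), (1100.0, 1100.0)]

ratios = [spec_ratio(ref, measured) for ref, measured in timings]
overall = geometric_mean(ratios)

print("SPECratios:", ratios)
print("Overall (geometric mean):", round(overall, 3))
```

Each benchmark contributes only through its ratio, so a program that runs ten times longer on both machines has exactly the same influence on the result as a short one.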