18-447 Computer Architecture
Lecture 4: ISA Tradeoffs (Continued) and MIPS ISA

Prof. Onur Mutlu
Kevin Chang
Carnegie Mellon University
Spring 2015, 1/21/2015

Agenda for Today
• Finish off ISA tradeoffs
• A quick tutorial on MIPS ISA
• Upcoming schedule:
  ◦ Lab 1.5 & 2 are out today
  ◦ Friday (1/23): Lab 1 due
  ◦ Friday (1/23): Recitation
  ◦ Wednesday (1/28): HW 1 due

Upcoming Readings
• Next week (Microarchitecture):
  ◦ P&H, Chapter 4, Sections 4.1-4.4
  ◦ P&P, revised Appendix C: LC-3b datapath and microprogrammed operation

Last Lecture Recap
• Instruction processing style
  ◦ 0, 1, 2, 3 address machines
• Elements of an ISA
  ◦ Instructions, data types, memory organizations, registers, etc.
• Addressing modes
• Complex (CISC) vs. simple (RISC) instructions
• Semantic gap
• ISA translation

ISA-level Tradeoffs: Instruction Length
• Fixed length: Length of all instructions the same
  + Easier to decode single instruction in hardware
  + Easier to decode multiple instructions concurrently
  -- Wasted bits in instructions (Why is this bad?)
  -- Harder-to-extend ISA (how to add new instructions?)
• Variable length: Length of instructions different (determined by opcode and sub-opcode)
  + Compact encoding (Why is this good?)
    Intel 432: Huffman encoding (sort of). 6 to 321 bit instructions. How?
  -- More logic to decode a single instruction
  -- Harder to decode multiple instructions concurrently
• Tradeoffs
  ◦ Code size (memory space, bandwidth, latency) vs. hardware complexity
  ◦ ISA extensibility and expressiveness vs. hardware complexity
  ◦ Performance? Energy? Smaller code vs. ease of decode
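
To make the "harder to decode multiple instructions concurrently" point concrete, here is a minimal sketch of how instruction boundaries are found under each scheme. The variable-length encoding is a made-up toy (the low 2 bits of the first byte give the length minus one); it is not the format of any real ISA.

```python
def fixed_length_starts(code, width=4):
    """Fixed length: every start address is known without decoding anything."""
    return list(range(0, len(code), width))


def variable_length_starts(code):
    """Variable length (toy encoding, an assumption for illustration):
    the low 2 bits of an instruction's first byte hold (length - 1)."""
    starts, pos = [], 0
    while pos < len(code):
        starts.append(pos)
        # The next start is unknown until this instruction is decoded.
        pos += (code[pos] & 0b11) + 1
    return starts


toy_code = bytes([0x00, 0x01, 0xAA, 0x03, 0x11, 0x22, 0x33, 0x02, 0x44, 0x55])
print(fixed_length_starts(bytes(8)))     # starts are independent of content
print(variable_length_starts(toy_code))  # starts depend on earlier decodes
```

With fixed-length instructions the start addresses can all be computed in parallel; in the toy variable-length stream each start depends on decoding the previous instruction, which is the serial dependence the slide alludes to.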

ISA-level Tradeoffs: Uniform Decode
• Uniform decode: Same bits in each instruction correspond to the same meaning
  ◦ Opcode is always in the same location
  ◦ Ditto operand specifiers, immediate values, …
  ◦ Many "RISC" ISAs: Alpha, MIPS, SPARC
  + Easier decode, simpler hardware
  + Enables parallelism: generate target address before knowing the instruction is a branch
  -- Restricts instruction format (fewer instructions?) or wastes space
• Non-uniform decode
  ◦ E.g., opcode can be the 1st-7th byte in x86
  + More compact and powerful instruction format
  -- More complex decode logic

x86 vs. Alpha Instruction Formats
[Figure: x86 and Alpha instruction format diagrams]

MIPS Instruction Format
• R-type, 3 register operands
    | 0 (6-bit) | rs (5-bit) | rt (5-bit) | rd (5-bit) | shamt (5-bit) | funct (6-bit) |
• I-type, 2 register operands and 16-bit immediate operand
    | opcode (6-bit) | rs (5-bit) | rt (5-bit) | immediate (16-bit) |
• J-type, 26-bit immediate operand
    | opcode (6-bit) | immediate (26-bit) |
• Simple Decoding
  ◦ 4 bytes per instruction, regardless of format
  ◦ must be 4-byte aligned (2 lsb of PC must be 2b'00)
  ◦ format and fields easy to extract in hardware
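
As an illustration of "format and fields easy to extract", the field layouts above can be decoded with fixed shifts and masks. This is a software sketch only; the function names are hypothetical and hardware would simply wire out the bit ranges.

```python
def decode_r_type(word):
    """Split a 32-bit word into the R-type fields (6/5/5/5/5/6 bits)."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


def decode_i_type(word):
    """Split a 32-bit word into the I-type fields (6/5/5/16 bits)."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "immediate": word & 0xFFFF,
    }


def decode_j_type(word):
    """Split a 32-bit word into the J-type fields (6/26 bits)."""
    return {"opcode": (word >> 26) & 0x3F, "immediate": word & 0x3FFFFFF}


# 0x02324020 encodes an ADD with rs=17, rt=18, rd=8 (funct 0x20).
print(decode_r_type(0x02324020))
```

Note that opcode, rs, and rt sit at the same bit positions in the R and I formats, which is what lets the register file be read before the format is fully known.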

ARM
[Figure]

A Note on Length and Uniformity
• Uniform decode usually goes with fixed length
• In a variable length ISA, uniform decode can be a property of instructions of the same length
  ◦ It is hard to think of it as a property of instructions of different lengths

A Note on RISC vs. CISC
• Usually, …
• RISC
  ◦ Simple instructions
  ◦ Fixed length
  ◦ Uniform decode
  ◦ Few addressing modes
• CISC
  ◦ Complex instructions
  ◦ Variable length
  ◦ Non-uniform decode
  ◦ Many addressing modes

ISA-level Tradeoffs: Number of Registers
• Affects:
  ◦ Number of bits used for encoding register address
  ◦ Number of values kept in fast storage (register file)
  ◦ (uarch) Size, access time, power consumption of register file
• Large number of registers:
  + Enables better register allocation (and optimizations) by compiler, hence fewer saves/restores
  -- Larger instruction size
  -- Larger register file size
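
A quick worked calculation of the encoding cost: a register specifier needs ceil(log2(number of registers)) bits, and a three-operand instruction pays that three times. The helper name below is hypothetical.

```python
def register_field_bits(num_registers):
    """Bits needed to name one of num_registers registers: ceil(log2(n))."""
    return (num_registers - 1).bit_length()


for n in (16, 32, 64):
    bits = register_field_bits(n)
    print(f"{n} registers: {bits} bits per operand, {3 * bits} bits for 3 operands")
```

With 32 registers (as in MIPS), three operands take 15 of the 32 instruction bits; doubling the register count to 64 would take 18.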

ISA-level Tradeoffs: Addressing Modes
• Addressing mode specifies how to obtain an operand of an instruction
  ◦ Register
  ◦ Immediate
  ◦ Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, …)
• More modes:
  + help better support programming constructs (arrays, pointer-based accesses)
  -- make it harder for the architect to design
  -- too many choices for the compiler?
     • Many ways to do the same thing complicates compiler design
     • Wulf, "Compilers and Computer Architecture," IEEE Computer 1981
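
The memory modes listed above differ only in how the effective address is formed. The sketch below spells that out for a few of them; the mode names, the dictionary-based register file and memory, and the function signature are all assumptions made for illustration.

```python
def effective_address(mode, regs, mem, base=None, index=None, disp=0):
    """Return the address a memory operand refers to under a given mode."""
    if mode == "absolute":           # address is encoded in the instruction
        return disp
    if mode == "register_indirect":  # address is held in a register
        return regs[base]
    if mode == "displacement":       # register + constant offset
        return regs[base] + disp
    if mode == "indexed":            # register + register
        return regs[base] + regs[index]
    if mode == "memory_indirect":    # address is itself fetched from memory
        return mem[regs[base]]
    raise ValueError(f"unknown addressing mode: {mode}")


regs = {"r1": 100, "r2": 8}
mem = {100: 500}
print(effective_address("displacement", regs, mem, base="r1", disp=12))
print(effective_address("memory_indirect", regs, mem, base="r1"))
```

Each extra mode is another branch the hardware must implement and another option the compiler must choose among, which is the tradeoff the slide names.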

x86 vs. Alpha Instruction Formats
[Figure: x86 and Alpha instruction format diagrams]

x86
[Figure annotations: Memory operands (register indirect, absolute, SIB + displacement, register); Register operands]

x86
[Figure annotations: indexed (base + index); scaled (base + index*4)]

X86 SIB-D Addressing Mode
[Figure: from the x86 Manual Vol. 1, page 3-22; see course resources on website]
Also, see Sections 3.7.3 and 3.7.5
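
For reference, the scale-index-base-plus-displacement (SIB-D) mode computes base + index * scale + displacement, with the scale limited to 1, 2, 4, or 8. A minimal sketch for a 32-bit address space follows; the function name and the example addresses are hypothetical.

```python
def sib_d_address(base, index, scale, displacement=0):
    """Effective address for x86 SIB + displacement (32-bit wraparound)."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF


# Element 3 of an array of 4-byte items at 0x1000, plus a field offset of 8.
print(hex(sib_d_address(base=0x1000, index=3, scale=4, displacement=8)))
```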

X86 Manual: Suggested Uses of Addressing Modes
[Figure annotations: static address; dynamic storage; arrays; records]
x86 Manual Vol. 1, page 3-22; see course resources on website
Also, see Sections 3.7.3 and 3.7.5

X86 Manual: Suggested Uses of Addressing Modes
[Figure annotations: static arrays w/ fixed-size elements; 2D arrays]
x86 Manual Vol. 1, page 3-22; see course resources on website
Also, see Sections 3.7.3 and 3.7.5

Other Example ISA-level Tradeoffs
• Condition codes vs. not
• VLIW vs. single instruction
• Precise vs. imprecise exceptions
• Virtual memory vs. not
• Unaligned access vs. not
• Hardware interlocks vs. software-guaranteed interlocking
• Software vs. hardware managed page fault handling
• Cache coherence (hardware vs. software)
• …

Back to Programmer vs. (Micro)architect
• Many ISA features designed to aid programmers
• But, complicate the hardware designer's job
• Virtual memory
  ◦ vs. overlay programming
  ◦ Should the programmer be concerned about the size of code blocks fitting physical memory?
• Addressing modes
• Unaligned memory access
  ◦ Compiler/programmer needs to align data

MIPS: Aligned Access
    MSB | byte 3 | byte 2 | byte 1 | byte 0 | LSB
        | byte 7 | byte 6 | byte 5 | byte 4 |
• LW/SW alignment restriction: 4-byte word-alignment
  ◦ not designed to fetch memory bytes not within a word boundary
  ◦ not designed to rotate unaligned bytes into registers
• Provide separate opcodes for the "infrequent" case
    rd initially:     |   A    |   B    |   C    |   D    |
    LWL rd 6(r0):     | byte 6 | byte 5 | byte 4 |   D    |
    LWR rd 3(r0):     | byte 6 | byte 5 | byte 4 | byte 3 |
  ◦ LWL/LWR is slower
  ◦ Note LWL and LWR still fetch within word boundary
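
The LWL/LWR pair in the diagram can be modeled in a few lines. This sketch assumes the little-endian byte numbering drawn on the slide (byte 0 is the least significant byte of word 0); the function names and the byte-string memory are illustrative, not taken from any MIPS manual.

```python
MASK32 = 0xFFFFFFFF


def aligned_word(mem, addr):
    """Fetch the aligned 32-bit word containing addr (little-endian)."""
    base = addr & ~3
    return int.from_bytes(mem[base:base + 4], "little")


def lwl(mem, reg, addr):
    """Load Word Left: fill the upper bytes of reg from addr down to the word start."""
    shift = 8 * (3 - (addr & 3))
    keep = (1 << shift) - 1            # low bytes of reg that stay untouched
    return ((aligned_word(mem, addr) << shift) & MASK32) | (reg & keep)


def lwr(mem, reg, addr):
    """Load Word Right: fill the lower bytes of reg from addr up to the word end."""
    shift = 8 * (addr & 3)
    keep = MASK32 & ~(MASK32 >> shift)  # high bytes of reg that stay untouched
    return (reg & keep) | (aligned_word(mem, addr) >> shift)


mem = bytes(range(0x10, 0x18))          # byte i holds the value 0x10 + i
rd = 0xAABBCCDD                         # A B C D
rd = lwl(mem, rd, 6)                    # byte 6, byte 5, byte 4, D
rd = lwr(mem, rd, 3)                    # byte 6, byte 5, byte 4, byte 3
print(hex(rd))
```

Each call touches only one aligned word, matching the note that LWL and LWR still fetch within a word boundary; the unaligned word at address 3 is assembled from two such accesses.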

X86: Unaligned Access
• LD/ST instructions automatically align data that spans a "word" boundary
• Programmer/compiler does not need to worry about where data is stored (whether or not in a word-aligned location)

What About ARM?
• https://www.scss.tcd.ie/~waldroj/3d1/arm_arm.pdf
  ◦ Section A2.8

Aligned vs. Unaligned Access
• Pros of having no restrictions on alignment
• Cons of having no restrictions on alignment
• Filling in the above: an exercise for you…

18-447 MIPS ISA
James C. Hoe
Dept of ECE, CMU
CMU 18-447 S'13 © 2011 J. C. Hoe

MIPS R2000 Program Visible State
• Program Counter
    32-bit memory address of the current instruction
• General Purpose Register File
    32 32-bit words named r0...r31
    **Note** r0=0
• Memory
    2^32 by 8-bit locations (4 Giga Bytes), M[0] ... M[N-1]
    32-bit address (there is some magic going on)

Data Format
• Most things are 32 bits
  ◦ instruction and data addresses
  ◦ signed and unsigned integers
  ◦ just bits
• Also 16-bit word and 8-bit word (aka byte)
• Floating-point numbers
  ◦ IEEE standard 754
  ◦ float: 8-bit exponent, 23-bit significand
  ◦ double: 11-bit exponent, 52-bit significand

Big Endian vs. Little Endian
(Part I, Chapter 4, Gulliver's Travels)
• 32-bit signed or unsigned integer comprises 4 bytes
    MSB (most significant) | 8-bit | 8-bit | 8-bit | 8-bit | LSB (least significant)
• On a byte-addressable machine . . .
    Big Endian (pointer points to the big end)
      MSB                                   LSB
      byte 0    byte 1    byte 2    byte 3
      byte 4    byte 5    byte 6    byte 7
      byte 8    byte 9    byte 10   byte 11
      byte 12   byte 13   byte 14   byte 15
      byte 16   byte 17   byte 18   byte 19
    Little Endian (pointer points to the little end)
      MSB                                   LSB
      byte 3    byte 2    byte 1    byte 0
      byte 7    byte 6    byte 5    byte 4
      byte 11   byte 10   byte 9    byte 8
      byte 15   byte 14   byte 13   byte 12
      byte 19   byte 18   byte 17   byte 16
• What difference does it make?
    check out htonl(), ntohl() in in.h
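
One place the difference shows up is whenever a multi-byte integer is written out byte by byte, for example onto a network. The sketch below lays the same 32-bit value out both ways using only the Python standard library; `struct`'s `"!"` prefix selects network byte order, which is big endian and is what the C `htonl()` mentioned above converts to.

```python
import struct

value = 0x0A0B0C0D

big = value.to_bytes(4, "big")        # most significant byte at the lowest address
little = value.to_bytes(4, "little")  # least significant byte at the lowest address
network = struct.pack("!I", value)    # network byte order

print(big.hex(), little.hex(), network.hex())

# Reading big-endian bytes as if they were little-endian scrambles the value.
misread = int.from_bytes(big, "little")
print(hex(misread))
```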

Instruction Formats
• 3 simple formats
  ◦ R-type, 3 register operands
      | 0 (6-bit) | rs (5-bit) | rt (5-bit) | rd (5-bit) | shamt (5-bit) | funct (6-bit) |
  ◦ I-type, 2 register operands and 16-bit immediate operand
      | opcode (6-bit) | rs (5-bit) | rt (5-bit) | immediate (16-bit) |
  ◦ J-type, 26-bit immediate operand
      | opcode (6-bit) | immediate (26-bit) |
• Simple Decoding
  ◦ 4 bytes per instruction, regardless of format
  ◦ must be 4-byte aligned (2 lsb of PC must be 2b'00)
  ◦ format and fields readily extractable

ALU Instructions
• Assembly (e.g., register-register signed addition)
    ADD rd_reg rs_reg rt_reg
• Machine encoding (R-type)
    | 0 (6-bit) | rs (5-bit) | rt (5-bit) | rd (5-bit) | 0 (5-bit) | ADD (6-bit) |
• Semantics
  ◦ GPR[rd] ← GPR[rs] + GPR[rt]
  ◦ PC ← PC + 4
• Exception on "overflow"
• Variations
  ◦ Arithmetic: {signed, unsigned} x {ADD, SUB}
  ◦ Logical: {AND, OR, XOR, NOR}
  ◦ Shift: {Left, Right-Logical, Right-Arithmetic}
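
The signed/unsigned split in the variations is about the overflow exception, not about the bits produced. A minimal model, assuming 32-bit registers held as unsigned Python integers and a hypothetical `IntegerOverflow` exception standing in for the hardware trap:

```python
MASK32 = 0xFFFFFFFF


class IntegerOverflow(Exception):
    """Stands in for the overflow exception raised by signed ADD."""


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def add(rs_value, rt_value):
    """Signed ADD: traps if the true sum does not fit in 32 signed bits."""
    result = to_signed32(rs_value) + to_signed32(rt_value)
    if not -(1 << 31) <= result <= (1 << 31) - 1:
        raise IntegerOverflow(f"{rs_value:#x} + {rt_value:#x}")
    return result & MASK32


def addu(rs_value, rt_value):
    """Unsigned ADD: same bit pattern, but wraps silently."""
    return (rs_value + rt_value) & MASK32


print(hex(addu(0x7FFFFFFF, 1)))   # wraps to 0x80000000
print(hex(add(0xFFFFFFFF, 1)))    # -1 + 1 = 0, no overflow
```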

Reg-Reg Instruction Encoding
[Figure: encoding table from the MIPS R4000 Microprocessor User's Manual]
What patterns do you see? Why are they there?

ALU Instructions
• Assembly (e.g., register-immediate signed addition)
    ADDI rt_reg rs_reg immediate_16
• Machine encoding (I-type)
    | ADDI (6-bit) | rs (5-bit) | rt (5-bit) | immediate (16-bit) |
• Semantics
  ◦ GPR[rt] ← GPR[rs] + sign-extend(immediate)
  ◦ PC ← PC + 4
• Exception on "overflow"
• Variations
  ◦ Arithmetic: {signed, unsigned} x {ADD, SUB}
  ◦ Logical: {AND, OR, XOR, LUI}
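
The sign-extend step is what lets a 16-bit field express negative constants. A small sketch follows; the function names are hypothetical, and the overflow exception is deliberately left out, so the add wraps.

```python
def sign_extend16(imm16):
    """Widen a 16-bit field to a signed integer by replicating bit 15."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def add_immediate(rs_value, imm16):
    """GPR[rs] + sign-extend(immediate), wrapped to 32 bits (no overflow trap)."""
    return (rs_value + sign_extend16(imm16)) & 0xFFFFFFFF


print(sign_extend16(0xFFFF))       # the field 0xFFFF means -1
print(add_immediate(10, 0xFFFF))   # 10 + (-1)
```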

Reg-Immed Instruction Encoding
[Figure: encoding table from the MIPS R4000 Microprocessor User's Manual]

Assembly Programming 101
• Break down high-level program constructs into a sequence of elemental operations
• E.g. High-level Code
    f = ( g + h ) - ( i + j )
• Assembly Code
  ◦ suppose f, g, h, i, j are in rf, rg, rh, ri, rj
  ◦ suppose rtemp is a free register
    add rtemp rg rh     # rtemp = g+h
    add rf ri rj        # rf = i+j
    sub rf rtemp rf     # f = rtemp - rf

Load Instructions
• Assembly (e.g., load 4-byte word)
    LW rt_reg offset_16 (base_reg)
• Machine encoding (I-type)
    | LW (6-bit) | base (5-bit) | rt (5-bit) | offset (16-bit) |
• Semantics
  ◦ effective_address = sign-extend(offset) + GPR[base]
  ◦ GPR[rt] ← MEM[ translate(effective_address) ]
  ◦ PC ← PC + 4
• Exceptions
  ◦ address must be "word-aligned"
      What if you want to load an unaligned word?
  ◦ MMU exceptions

Store Instructions
• Assembly (e.g., store 4-byte word)
    SW rt_reg offset_16 (base_reg)
• Machine encoding (I-type)
    | SW (6-bit) | base (5-bit) | rt (5-bit) | offset (16-bit) |
• Semantics
  ◦ effective_address = sign-extend(offset) + GPR[base]
  ◦ MEM[ translate(effective_address) ] ← GPR[rt]
  ◦ PC ← PC + 4
• Exceptions
  ◦ address must be "word-aligned"
  ◦ MMU exceptions
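
The load and store semantics can be sketched as two small functions over a byte array. Several things here are assumptions for illustration: `translate()` is treated as the identity (no MMU), memory is big endian, registers live in a dictionary, and `AddressError` is a hypothetical stand-in for the alignment exception.

```python
class AddressError(Exception):
    """Stands in for the exception raised on a non-word-aligned LW/SW."""


def effective_address(base_value, offset16):
    """sign-extend(offset) + GPR[base], wrapped to 32 bits."""
    offset16 &= 0xFFFF
    offset = offset16 - 0x10000 if offset16 & 0x8000 else offset16
    return (base_value + offset) & 0xFFFFFFFF


def lw(mem, regs, rt, offset16, base):
    """GPR[rt] <- MEM[effective_address]"""
    ea = effective_address(regs[base], offset16)
    if ea % 4 != 0:
        raise AddressError(f"unaligned load at {ea:#x}")
    regs[rt] = int.from_bytes(mem[ea:ea + 4], "big")


def sw(mem, regs, rt, offset16, base):
    """MEM[effective_address] <- GPR[rt]"""
    ea = effective_address(regs[base], offset16)
    if ea % 4 != 0:
        raise AddressError(f"unaligned store at {ea:#x}")
    mem[ea:ea + 4] = (regs[rt] & 0xFFFFFFFF).to_bytes(4, "big")


mem = bytearray(64)
regs = {"rbase": 16, "rt": 0xDEADBEEF}
sw(mem, regs, "rt", 0xFFFC, "rbase")   # offset -4: stores at address 12
regs["rt"] = 0
lw(mem, regs, "rt", 0xFFFC, "rbase")   # loads the same word back
print(hex(regs["rt"]))
```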

Assembly Programming 201
• E.g. High-level Code
    A[ 8 ] = h + A[ 0 ]
  where A is an array of integers (4 bytes each)
• Assembly Code
  ◦ suppose &A, h are in rA, rh
  ◦ suppose rtemp is a free register
    LW rtemp 0(rA)        # rtemp = A[0]
    add rtemp rh rtemp    # rtemp = h + A[0]
    SW rtemp 32(rA)       # A[8] = rtemp
                          # note A[8] is 32 bytes
                          # from A[0]

Load Delay Slots
    LW   ra ---
    addi r- ra r-
• R2000 load has an architectural latency of 1 inst*.
  ◦ the instruction immediately following a load (in the "delay slot") still sees the old register value
  ◦ the load instruction no longer has an atomic semantics
  Why would you do it this way?
• Is this a good idea? (hint: R4000 redefined LW to complete atomically)
*BTW, notice that latency is defined in "instructions" not cyc. or sec.
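
A toy interpreter makes the "still sees the old register value" behavior visible. This is a sketch of the architectural effect only, not of the R2000 pipeline; the tuple-based instruction format and the dictionary memory are assumptions, and the corner case of a delay-slot instruction writing the register being loaded is not modeled carefully.

```python
def run_with_load_delay(program, regs, mem):
    """Toy model: a loaded value becomes visible one instruction late."""
    pending = None                       # (register, value) from the previous LW
    for op, *args in program:
        arriving, pending = pending, None
        if op == "lw":                   # ("lw", rt, offset, base)
            rt, offset, base = args
            pending = (rt, mem[regs[base] + offset])
        elif op == "addi":               # ("addi", rt, rs, imm)
            rt, rs, imm = args
            regs[rt] = regs[rs] + imm    # reads rs before the load lands
        if arriving is not None:
            regs[arriving[0]] = arriving[1]
    if pending is not None:              # a load issued by the last instruction
        regs[pending[0]] = pending[1]
    return regs


program = [
    ("lw", "ra", 0, "r1"),
    ("addi", "rb", "ra", 1),   # delay slot: sees the OLD ra
    ("addi", "rc", "ra", 1),   # sees the loaded ra
]
regs = run_with_load_delay(program, {"r1": 100, "ra": 111, "rb": 0, "rc": 0}, {100: 42})
print(regs["rb"], regs["rc"])
```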

Control Flow Instructions
• C Code
    { code A }
    if X==Y then
      { code B }
    else
      { code C }
    { code D }
• Control Flow Graph
    [Figure: code A ends in the test "if X==Y"; the True edge leads to code B and the False edge to code C; both lead to code D]
    these things are called basic blocks
• Assembly Code (linearized)
    code A
    if X==Y goto code B
    code C
    goto code D
    code B
    code D

(Conditional) Branch Instructions
• Assembly (e.g., branch if equal)
    BEQ rs_reg rt_reg immediate_16
• Machine encoding (I-type)
    | BEQ (6-bit) | rs (5-bit) | rt (5-bit) | immediate (16-bit) |
• Semantics
  ◦ target = PC + sign-extend(immediate) x 4
      (PC here is PC + 4 w/ branch delay slot)
  ◦ if GPR[rs]==GPR[rt] then PC ← target
    else PC ← PC + 4
• How far can you jump?
• Variations
  ◦ BEQ, BNE, BLEZ, BGTZ
  Why isn't there a BLE or BGT instruction?
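
"How far can you jump?" follows directly from the target formula: a signed 16-bit word offset reaches 2^15 instructions backward and 2^15 - 1 forward. The sketch below computes a target and the byte range, taking the base to be PC + 4 as in the delay-slot note; the function names are hypothetical.

```python
def sign_extend16(imm16):
    """Widen a 16-bit field to a signed integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def branch_target(pc, imm16):
    """target = (PC + 4) + sign-extend(immediate) x 4, wrapped to 32 bits."""
    return (pc + 4 + sign_extend16(imm16) * 4) & 0xFFFFFFFF


# Byte offsets reachable relative to the delay-slot address (PC + 4).
min_offset = sign_extend16(0x8000) * 4
max_offset = sign_extend16(0x7FFF) * 4
print(min_offset, max_offset)
print(hex(branch_target(0x1000, 1)))
```

So a conditional branch spans roughly 128 KB in either direction; anything farther needs a jump.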

Jump Instructions
• Assembly
    J immediate_26
• Machine encoding (J-type)
    | J (6-bit) | immediate (26-bit) |
• Semantics
  ◦ target = PC[31:28] x 2^28 |bitwise-or zero-extend(immediate) x 4
      (PC here is PC + 4 w/ branch delay slot)
  ◦ PC ← target
• How far can you jump?
• Variations
  ◦ Jump and Link
  ◦ Jump Registers

Assembly Programming 301
• E.g. High-level Code
    if (i == j) then      (fork)
      e = g               (then)
    else
      e = h               (else)
    f = e                 (join)
• Assembly Code
  ◦ suppose e, f, g, h, i, j are in re, rf, rg, rh, ri, rj
        bne ri rj L1      # L1 and L2 are addr labels
                          # assembler computes offset
        add re rg r0      # e = g
        j L2
    L1: add re rh r0      # e = h
    L2: add rf re r0      # f = e
        . . .

Branch Delay Slots
• R2000 branch instructions also have an architectural latency of 1 instruction
  ◦ the instruction immediately after a branch is always executed (in fact PC-offset is computed from the delay slot instruction)
  ◦ branch target takes effect on the 2nd instruction

    Without delay slots:
        bne ri rj L1
        add re rg r0
        j L2
    L1: add re rh r0
    L2: add rf re r0
        . . .

    With delay slots filled by nop:
        bne ri rj L1
        nop
        add re rg r0
        j L2
        nop
    L1: add re rh r0
    L2: add rf re r0
        . . .
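
To see why the nops are needed, the two listings can be run on a toy interpreter in which the instruction after a branch or jump always executes. This is a sketch under stated assumptions: instructions are tuples, labels are resolved to list indices by hand, and only the four operations used above are modeled.

```python
def run_delayed(program, labels, regs):
    """Execute a toy program where every branch/jump has one delay slot."""
    pc, next_pc = 0, 1
    while pc < len(program):
        op, *args = program[pc]
        following = next_pc + 1              # default: keep falling through
        if op == "add":                      # ("add", rd, rs, rt)
            rd, rs, rt = args
            regs[rd] = regs[rs] + regs[rt]
        elif op == "bne":                    # ("bne", rs, rt, label)
            rs, rt, label = args
            if regs[rs] != regs[rt]:
                following = labels[label]    # takes effect after the delay slot
        elif op == "j":                      # ("j", label)
            following = labels[args[0]]
        pc, next_pc = next_pc, following     # "nop" falls through unchanged
    return regs


def initial_regs():
    return {"r0": 0, "ri": 1, "rj": 1, "rg": 10, "rh": 20, "re": 0, "rf": 0}


no_nops = [("bne", "ri", "rj", "L1"), ("add", "re", "rg", "r0"), ("j", "L2"),
           ("add", "re", "rh", "r0"), ("add", "rf", "re", "r0")]
with_nops = [("bne", "ri", "rj", "L1"), ("nop",), ("add", "re", "rg", "r0"),
             ("j", "L2"), ("nop",), ("add", "re", "rh", "r0"),
             ("add", "rf", "re", "r0")]

wrong = run_delayed(no_nops, {"L1": 3, "L2": 4}, initial_regs())
right = run_delayed(with_nops, {"L1": 5, "L2": 6}, initial_regs())
print(wrong["rf"], right["rf"])   # i == j, so f should equal g
```

In the first listing the `add re rh r0` sitting after `j L2` executes in the jump's delay slot and overwrites e, so f ends up as h even though i == j; the nop-padded version gives g.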

Strangeness in the Semantics
Where do you think you will end up?
    _s: j L1
        j L2
        j L3
    L1: j L4
    L2: j L5
    L3: foo
    L4: bar
    L5: baz

Function Call and Return
• Jump and Link: JAL offset_26
  ◦ return address = PC + 8
  ◦ target = PC[31:28] x 2^28 |bitwise-or zero-extend(immediate) x 4
  ◦ PC ← target
  ◦ GPR[r31] ← return address
  On a function call, the callee needs to know where to go back to afterwards
• Jump Indirect: JR rs_reg
  ◦ target = GPR[rs]
  ◦ PC ← target
  PC-offset jumps and branches always jump to the same target every time the same instruction is executed
  Jump Indirect allows the same instruction to jump to any location specified by rs (usually r31)
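
The JAL/JR pairing can be sketched as two functions that return the next PC. The register file is a plain dictionary keyed by register number, and the upper four target bits are taken from PC + 4 (the delay-slot address), following the "PC + 4 w/ branch delay slot" note on the jump slide; both choices are assumptions of this sketch.

```python
MASK32 = 0xFFFFFFFF


def jal(pc, imm26, regs):
    """Jump and Link: save the return address in r31 and return the target."""
    regs[31] = (pc + 8) & MASK32                   # skip the JAL and its delay slot
    region = (pc + 4) & 0xF0000000                 # PC[31:28] x 2^28
    return region | ((imm26 & 0x3FFFFFF) << 2)     # zero-extend(immediate) x 4


def jr(regs, rs):
    """Jump Indirect: the target comes from a register."""
    return regs[rs] & MASK32


regs = {31: 0}
target = jal(0x00400000, 0x100040, regs)   # call site at 0x00400000
print(hex(target), hex(regs[31]))
print(hex(jr(regs, 31)))                   # callee returns via r31
```

Because the return address lives in a register rather than in the instruction, the same `JR r31` can return to a different caller each time it runs.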

Assembly Programming 401
    Caller                        Callee
      ... code A ...              _myfxn: ... code B ...
      JAL _myfxn                          JR r31
      ... code C ...
      JAL _myfxn
      ... code D ...
• ..... A -call-> B -return-> C -call-> B -return-> D .....
• How do you pass argument between caller and callee?
• If A set r10 to 1, what is the value of r10 when B returns to C?
• What registers can B use?
• What happens to r31 if B calls another function?

Caller and Callee Saved Registers
• Callee-Saved Registers
  ◦ Caller says to callee, "The values of these registers should not change when you return to me."
  ◦ Callee says, "If I need to use these registers, I promise to save the old values to memory first and restore them before I return to you."
• Caller-Saved Registers
  ◦ Caller says to callee, "If there is anything I care about in these registers, I already saved it myself."
  ◦ Callee says to caller, "Don't count on them staying the same values after I am done."

R2000 Register Usage Convention
    r0:         always 0
    r1:         reserved for the assembler
    r2, r3:     function return values
    r4~r7:      function call arguments
    r8~r15:     "caller saved" temporaries
    r16~r23:    "callee saved" temporaries
    r24~r25:    "caller saved" temporaries
    r26, r27:   reserved for the operating system
    r28:        global pointer
    r29:        stack pointer
    r30:        callee saved temporaries
    r31:        return address

R2000 Memory Usage Convention
    high address
      stack space      (grows down; stack pointer is GPR[r29])
      free space
      dynamic data     (grows up)
      static data      } binary executable
      text             } binary executable
      reserved
    low address

Calling Convention
    1. caller saves caller-saved registers
    2. caller loads arguments into r4~r7
    3. caller jumps to callee using JAL
    4. callee allocates space on the stack (dec. stack pointer)                     (prologue)
    5. callee saves callee-saved registers to stack (also r4~r7, old r29, r31)     (prologue)
       ....... body of callee (can "nest" additional calls) .......
    6. callee loads results to r2, r3                                               (epilogue)
    7. callee restores saved register values                                        (epilogue)
    8. JR r31                                                                       (epilogue)
    9. caller continues with return values in r2, r3

To Summarize: MIPS RISC
• Simple operations
  ◦ 2-input, 1-output arithmetic and logical operations
  ◦ few alternatives for accomplishing the same thing
• Simple data movements
  ◦ ALU ops are register-to-register (need a large register file)
  ◦ "Load-store" architecture
• Simple branches
  ◦ limited varieties of branch conditions and targets
• Simple instruction encoding
  ◦ all instructions encoded in the same number of bits
  ◦ only a few formats
Loosely speaking, an ISA intended for compilers rather than assembly programmers