CDA 5155 and 4150 Computer Architecture Week 2: 2 September 2014

Goals of the course
• Advanced coverage of computer architecture
  – General purpose processors, embedded processors, historically significant processors, design tools
  – Instruction set architecture
  – Processor microarchitecture
  – Systems architecture
    • Memory systems
    • I/O systems

Teaching Staff
• Professor Gary Tyson
  – Ph.D.: University of California – Davis
  – Faculty jobs:
    • California State University Sacramento: 1987-1990
    • University of California – Davis: 1993-1994
    • University of California – Riverside: 1995-1996
    • University of Michigan: 1997-2003
    • Florida State University: 2003-present

Grading in 5155 (Fall '14)
• Programming assignments
• Exams (2 @ 25% each)
  – In class, 75 minutes
• Team Project (20%)
  – In-order pipeline simulation (10%)
  – Out-of-order pipeline simulation (10%)
  – 3 or 4 students per team
• Class Participation (10%)

Time Management
• 3 hours/week lecture
  – This is probably the most important time
• 2 hours/week reading
  – Hennessy/Patterson: Computer Architecture: A Quantitative Approach
• 3-5 hours/week exam prep
• 5+ hours/week project (1/3 semester)
• Total: ~10-15 hours per week

Tentative Course Timeline
  Week  Date     Topic
  1     Aug 28   Performance, ISA, Pipelining
  2     Sept 2   Pipelining, Branch Prediction
  3     Sept 9   Superscalar, Exceptions
  4     Sept 16  Compilers, VLIW
  5     Sept 23  Dynamic Scheduling
  6     Sept 30  Dynamic Scheduling
  7     Oct 7    Advanced pipelines
  8     Oct 14   Advanced pipelines
  9     Oct 21   Cache design
  10    Oct 28   Cache design, VM
  11    Nov 4    Multiprocessor, Multithreading
  12    Nov 11   Embedded processors
  13    Nov 18   Embedded processors
  14    Nov 25   Research Topics
  15    Dec 2    Research Topics
Remaining columns: Holidays, Due Dates, Notes (entries: Exam, project)

Web Resources
• Course Web Page: http://www.cs.fsu.edu/~tyson/courses/CDA5155
• Wikipedia: http://en.wikipedia.org/wiki/Microprocessor
• Wisconsin Architecture Page: http://arch-www.cs.wisc.edu/home

Levels of Abstraction
• Problem/Idea (English?)
• Algorithm (pseudo-code)
• High-Level languages (C, Verilog)
• Assembly instructions (OS calls)
• Machine instructions (I/O interfaces)
• Microarchitecture/organization (block diagrams)
• Logic level: gates, flip-flops (schematic, HDL)
• Circuit level: transistors, sizing (schematic, HDL)
• Physical: VLSI layout, feature size, cabling, PC boards
What are the abstractions at each level?

Role of Architecture
• Responsible for hardware specification:
  – Instruction set design
• Also responsible for structuring the overall implementation
  – Microarchitectural design
• Interacts with everyone
  – Mainly compiler and logic level designers
• Cannot do a good job without knowledge of both sides

Design Issues: Performance
• Get acceptable performance out of system.
  – Scientific: floating point throughput, memory and disk intensive, predictable
  – Commercial: string handling, disk (databases), predictable
  – Multimedia: specific data types (pixels), network? Predictable?
  – Embedded: what do you mean by performance?
  – Workstation: maybe all of the above, maybe not

Calculating Performance
• Execution time is often the best metric
• Throughput (tasks/sec) vs. latency (sec/task)
• Benchmarks: what are the tasks?
  – What I care about!
  – Representative programs (SPEC, Linpack)
  – Kernels: representative code fragments
  – Toy programs: useful for testing end-conditions
  – Synthetic programs: do nothing useful, but have a representative instruction mix

Design Issues: Cost
• Processor
  – Die size, packaging, heat sink? Gold connectors?
  – Support: fan, connectors, motherboard specifications, etc.
• Calculating processor cost:
  – Cost of device = (die + package + testing) / yield
  – Die cost = wafer cost / good die yield
    • Good die yield is related to die size and defect density
  – Support costs: direct costs (components, labor), indirect costs (sales, service, R&D)
  – Total costs are amortized over the number of systems sold (PC vs. NASA)
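
The two cost formulas above can be tried out with a few lines of Python. This is only a sketch: it assumes that "good die yield" means the number of working dies per wafer (dies per wafer times the fraction that work), and every number and function name in it is a made-up illustration, not course data.

```python
def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Die cost = wafer cost / number of good dies (assumed reading of 'good die yield')."""
    good_dies = dies_per_wafer * die_yield
    return wafer_cost / good_dies


def device_cost(die, package, testing, final_yield):
    """Cost of device = (die + package + testing) / yield."""
    return (die + package + testing) / final_yield


# Hypothetical inputs: a $5000 wafer, 100 dies per wafer, half of them good.
example_die = die_cost(5000.0, 100, 0.5)
example_device = device_cost(example_die, 10.0, 10.0, 0.8)
print(f"die: ${example_die:.2f}, device: ${example_device:.2f}")
```

Halving the die yield doubles the die cost in this model, which is why die size and defect density matter so much.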

Other design issues
• Some applications care about other design issues.
• NASA deep space mission
  – Reliability: software and hardware (radiation hardening)
• AMD
  – Code compatibility
• ARM
  – Power

A Quantitative Approach
• Hardware systems performance is generally easy to quantify
  – Machine A is 10% faster than Machine B
  – Of course Machine B's advertising will show the opposite conclusion
• Many software systems tend to have much more subjective performance evaluations.

Measuring Performance
• Total Execution Time:
  – A is 3 times faster than B for programs P1, P2
  – $\frac{1}{n} \sum_{i=1}^{n} \mathrm{Time}_i$
  – Issue: emphasizes long-running programs

Measuring Performance
• Weighted Execution Time:
  – $\sum_{i=1}^{n} \mathrm{Weight}_i \times \mathrm{Time}_i$
  – What if P1 is executed far more frequently?

Measuring Performance
• Normalized Execution Time:
  – Compare machine performance to a reference machine and report a ratio.
    • SPEC ratings measure relative performance to a reference machine.
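
The three measures on these slides (mean of execution times, weighted execution time, and time normalized to a reference machine) can be compared side by side. The sketch below uses invented times for two hypothetical machines A and B on programs P1 and P2; the weights and the choice of B as the reference machine are assumptions made only for illustration.

```python
def mean_time(times):
    """Arithmetic mean of execution times: (1/n) * sum(Time_i)."""
    return sum(times) / len(times)


def weighted_time(times, weights):
    """Weighted execution time: sum(Weight_i * Time_i)."""
    return sum(w * t for w, t in zip(weights, times))


def normalized_times(times, reference_times):
    """Per-program ratio of this machine's time to the reference machine's time."""
    return [t / r for t, r in zip(times, reference_times)]


# Hypothetical execution times in seconds for programs P1 and P2.
machine_a = [1.0, 1000.0]
machine_b = [10.0, 100.0]

# The plain mean is dominated by the long-running P2, so B looks better.
print(mean_time(machine_a), mean_time(machine_b))

# If P1 runs far more often (assumed weights), A looks better instead.
weights = [0.999, 0.001]
print(weighted_time(machine_a, weights), weighted_time(machine_b, weights))

# Ratios relative to B as the reference machine.
print(normalized_times(machine_a, machine_b))
```

The same raw numbers support opposite conclusions depending on the weighting, which is the point of the "what if P1 is executed far more frequently?" question.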

Amdahl's Law
• Rule of Thumb: make the common case faster
• http://en.wikipedia.org/wiki/Amdahl's_law
• Attack the longest-running part until it is no longer the longest, then repeat
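
The rule of thumb follows from the usual statement of Amdahl's Law: if a fraction f of execution time is sped up by a factor s, the overall speedup is 1 / ((1 - f) + f / s). A small sketch, with illustrative fractions and factors chosen here rather than taken from the slides:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when `fraction_enhanced` of the time runs `speedup_enhanced` times faster."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Speeding up the common case (90% of the time) by 10x helps a lot...
print(round(amdahl_speedup(0.9, 10.0), 2))
# ...while the same 10x on a rare case (10% of the time) barely matters.
print(round(amdahl_speedup(0.1, 10.0), 2))
```

Even an unbounded speedup of the enhanced part cannot beat 1 / (1 - f), so once a part stops dominating, attention should move to whatever is now the longest.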

Instruction Set Design
• Software systems: named variables; complex semantics
• Hardware systems: tight timing requirements; small storage structures; simple semantics
• Instruction set: the interface between very different software and hardware systems

Design decisions
• How much "state" is in the microarchitecture?
  – Registers; Flags; IP/PC
• How is that state accessed/manipulated?
  – Operand encoding
• What commands are supported?
  – Opcode; opcode encoding

Design Challenges: or why is architecture still relevant?
• Clock frequency is increasing
  – This changes the number of levels of gates that can be completed each cycle, so old designs don't work.
  – It also tends to increase the fraction of time spent on wires (signal speed is bounded by the speed of light)
• Power
  – Faster chips are hotter; bigger chips are hotter

Design Challenges (cont)
• Design Complexity
  – More complex designs to fix frequency/power issues lead to increased development/testing costs
  – Failures (design or transient) can be difficult to understand (and fix)
• We seem far less willing to live with hardware errors (e.g. FDIV) than software errors, which are often dealt with through upgrades (that we pay for!)

Techniques for Encoding Operands
• Explicit operands:
  – Includes a field to specify which state data is referenced
  – Example: register specifier
• Implicit operands:
  – All state data can be inferred from the opcode
  – Example: function return (CISC-style)

Accumulator
• Architectures with one implicit register
  – Acts as source and/or destination
  – One other source explicit
• Example: C = A + B
    Load A     // (Acc)umulator = A
    Add B      // Acc = Acc + B
    Store C    // C = Acc
Ref: "Instruction Level Distributed Processing: Adapting to Shifting Technology"

Stack
• Architectures with implicit "stack"
  – Acts as source(s) and/or destination
  – Push and Pop operations have 1 explicit operand
• Example: C = A + B
    Push A     // Stack = {A}
    Push B     // Stack = {A, B}
    Add        // Stack = {A+B}
    Pop C      // C = A+B ; Stack = {}
• Compact encoding; may require more instructions though

Registers
• Most general (and common) approach
  – Small array of storage
  – Explicit operands (register file index)
• Two variants: register-memory and load/store
• Example: C = A + B (load/store sequence)
    Load R1, A
    Load R2, B
    Add R3, R1, R2
    Store R3, C
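
To make the difference between the accumulator, stack, and register examples concrete, here is a rough Python sketch that interprets each C = A + B sequence against a small memory. The tuple-based instruction format and the function names are invented for illustration; they do not correspond to any real ISA.

```python
def run_accumulator(program, memory):
    """One implicit register (acc); each instruction names one memory operand."""
    acc = 0
    for op, name in program:
        if op == "load":
            acc = memory[name]
        elif op == "add":
            acc += memory[name]
        elif op == "store":
            memory[name] = acc
        else:
            raise ValueError(f"unknown op {op}")
    return memory


def run_stack(program, memory):
    """Implicit operand stack; only push and pop carry an explicit operand."""
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":
            stack.append(memory[instr[1]])
        elif op == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "pop":
            memory[instr[1]] = stack.pop()
        else:
            raise ValueError(f"unknown op {op}")
    return memory


def run_load_store(program, memory):
    """Explicit register operands; only load and store touch memory."""
    regs = {}
    for instr in program:
        op = instr[0]
        if op == "load":
            regs[instr[1]] = memory[instr[2]]
        elif op == "add":
            regs[instr[1]] = regs[instr[2]] + regs[instr[3]]
        elif op == "store":
            memory[instr[2]] = regs[instr[1]]
        else:
            raise ValueError(f"unknown op {op}")
    return memory


acc_prog = [("load", "A"), ("add", "B"), ("store", "C")]
stack_prog = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
reg_prog = [("load", "R1", "A"), ("load", "R2", "B"),
            ("add", "R3", "R1", "R2"), ("store", "R3", "C")]

for runner, prog in ((run_accumulator, acc_prog), (run_stack, stack_prog),
                     (run_load_store, reg_prog)):
    print(runner.__name__, runner(prog, {"A": 2, "B": 3})["C"])
```

All three compute the same result; what changes is how many operands each instruction has to encode and how many instructions the sequence needs.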

Memory
• Big array of storage
  – More complex ways of indexing than registers
• Build addressing modes to support efficient translation of software abstractions
• Uses less space in the instruction than a 32-bit immediate field
  – A[i]: use base (i) + displacement (A) (scaled?)
  – a.ptr: use base (a) + displacement (ptr)

Addressing modes
  Mode                Example
  Register            Add R4, R3
  Immediate           Add R4, #3
  Base/Displacement   Add R4, 100(R1)
  Register Indirect   (not shown)
  Indexed             Add R4, (R1+R2)
  Direct              Add R4, (1001)
  Memory Indirect     Add R4, @(R3)
  Autoincrement       Add R4, (R2)+
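
Each mode differs only in how the second operand of the Add is located. The sketch below models that lookup in Python using the conventional textbook meaning of each mode; the register and memory contents, the `operand` helper, and the autoincrement step of 4 are all assumptions for illustration.

```python
def operand(mode, regs, mem, reg=None, reg2=None, const=0, step=4):
    """Return the source operand value for one addressing mode (illustrative model)."""
    if mode == "register":            # Add R4, R3
        return regs[reg]
    if mode == "immediate":           # Add R4, #3
        return const
    if mode == "displacement":        # Add R4, 100(R1)
        return mem[const + regs[reg]]
    if mode == "register_indirect":   # operand is at the address held in a register
        return mem[regs[reg]]
    if mode == "indexed":             # Add R4, (R1+R2)
        return mem[regs[reg] + regs[reg2]]
    if mode == "direct":              # Add R4, (1001)
        return mem[const]
    if mode == "memory_indirect":     # Add R4, @(R3)
        return mem[mem[regs[reg]]]
    if mode == "autoincrement":       # Add R4, (R2)+
        value = mem[regs[reg]]
        regs[reg] += step             # assumed element size of 4 bytes
        return value
    raise ValueError(f"unknown mode {mode}")


# Hypothetical machine state.
regs = {"R1": 8, "R2": 4, "R3": 12, "R4": 0}
mem = {4: 2, 8: 1, 12: 108, 108: 7, 1001: 9}

print(operand("displacement", regs, mem, reg="R1", const=100))  # reads mem[108]
print(operand("memory_indirect", regs, mem, reg="R3"))          # reads mem[mem[12]]
```

Memory indirect costs two memory reads, and autoincrement changes a register as a side effect, which is part of why the richer modes complicate a pipeline.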

Other Memory Issues
What is the size of each element in memory?
• Byte at 0x000: holds 0 to 255
• Half word at 0x000: holds 0 to 65535
• Word at 0x000: holds 0 to ~4 billion

Other Memory Issues
Big-endian or Little-endian? Store 0x114488FF at address 0x000:
• Big-endian: the address points to the most significant byte
    0x000: 11 44 88 FF
• Little-endian: the address points to the least significant byte
    0x000: FF 88 44 11
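
The two byte orders for 0x114488FF can be checked directly with Python's built-in `int.to_bytes`, which lays out an integer in either order. This is just a quick illustration of the slide's example value.

```python
value = 0x114488FF

# Byte at the lowest address comes first in each sequence.
big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

print("big-endian   :", " ".join(f"{b:02X}" for b in big))
print("little-endian:", " ".join(f"{b:02X}" for b in little))

# Reading the bytes back with the wrong order gives a different word.
misread = int.from_bytes(big, byteorder="little")
print(f"misread: 0x{misread:08X}")
```

The misread value shows why byte order matters whenever data moves between machines or across a network.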

Other Memory Issues
Non-word loads? Memory from 0x000: 11 44 88 FF
• ldb R3, (000)
    R3 = 00 00 00 11

Other Memory Issues
Non-word loads? Memory from 0x000: 11 44 88 FF
• ldb R3, (003)
    R3 = FF FF FF FF (sign extended)

Other Memory Issues
Non-word loads? Memory from 0x000: 11 44 88 FF
• ldbu R3, (003)
    R3 = 00 00 00 FF (zero filled)
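
The difference between the signed byte load (ldb) and the unsigned one (ldbu) comes down to what fills the upper 24 bits of the register. A small sketch, assuming a 32-bit register and the same four memory bytes as on the slides; the Python function names simply mirror the mnemonics.

```python
MEMORY = bytes([0x11, 0x44, 0x88, 0xFF])  # contents of addresses 0x000 to 0x003


def ldb(address):
    """Load a byte and sign-extend it to a 32-bit register value."""
    byte = MEMORY[address]
    if byte & 0x80:                  # top bit set: the byte is negative
        return byte | 0xFFFFFF00     # copy the sign bit into the upper 24 bits
    return byte


def ldbu(address):
    """Load a byte and zero-fill the upper 24 bits."""
    return MEMORY[address]


print(f"ldb  (000) -> {ldb(0):08X}")
print(f"ldb  (003) -> {ldb(3):08X}")
print(f"ldbu (003) -> {ldbu(3):08X}")
```

With sign extension the byte 0xFF still means -1 after loading; with zero fill it means 255.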

Other Memory Issues
Alignment?
• Word accesses: only addresses ending in binary 00
• Half-word accesses: only addresses ending in binary 0
• Byte accesses: any address
• Example: with memory bytes 11 44 88 FF, ldw R3, (002) is illegal!
• Why is it important to be aligned? How can it be enforced?
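
The alignment rule amounts to checking the low address bits against the access size. A minimal sketch, assuming 4-byte words and 2-byte half words; the helper name is hypothetical.

```python
def is_aligned(address, size_bytes):
    """True if `address` is a multiple of the access size (a power of two)."""
    return address & (size_bytes - 1) == 0


print(is_aligned(0x002, 4))  # word access at 0x002: low bits are 10, not aligned
print(is_aligned(0x002, 2))  # half-word access at 0x002: low bit is 0, aligned
print(is_aligned(0x003, 1))  # byte access: every address is aligned
```

Hardware can enforce this by trapping whenever the masked low bits are nonzero, which is the same test as above.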

Techniques for Encoding Operators
• Opcode is translated to control signals that
  – Direct data (MUX control)
  – Select operation for ALU
  – Set read/write selects for register/memory/PC
• Tradeoff between how flexible the control is and how compact the opcode encoding is.
  – Microcode: direct control of signals (Improv)
  – Opcode: compact representation of a set of control signals
• You can make decode easier with careful opcode selection

Handling Control Flow
• Conditional branches (short range)
• Unconditional branches (jumps)
• Function calls
• Returns
• Traps (OS calls and exceptions)
• Predicates (conditional retirement)

Encoding branch targets
• PC-relative addressing
  – Makes linking code easier
• Indirect addressing
  – Jumps into shared libraries, virtual functions, case/switch statements
• Some unusual modes to simplify target address calculation
  – (segment offset) or (trap number)

Condition codes
• Flags
  – Implicit: flag(s) specified in opcode (bgt)
  – Flag(s) set by earlier instructions (compare, add, etc.)
• Register
  – Uses a register; requires explicit specifier
• Comparison operation
  – Two registers with compare operation specified in opcode

Higher Level Semantics: Functions
• Function call semantics
  – Save PC + 1 instruction for return
  – Manage parameters
  – Allocate space on stack
  – Jump to function
• Simple approach:
  – Use a jump instruction + other instructions
• Complex approach:
  – Build implicit operations into new "call" instruction

Role of the Compiler
• Compilers make the complexity of the ISA (from the programmer's point of view) less relevant.
  – Non-orthogonal ISAs are more challenging.
  – State allocation (register allocation) is better left to compiler heuristics
  – Complex semantics lead to more global optimization, which is easier for a machine to do.
• People are good at optimizing 10 lines of code. Compilers are good at optimizing 10M lines.

LC processor
• Little Computer Fall 2011
  – For programming projects
• Instruction Set Design
    opcode | regA | regB | destReg

LC processor: R-type instructions
  Field     Bits
  opcode    24-22
  regA      21-19
  regB      18-16
  unused    15-3
  destReg   2-0
• add: destReg = regA + regB
• nand: destReg = ~(regA & regB)

LC processor: I-type instructions
  Field        Bits
  opcode       24-22
  regA         21-19
  regB         18-16
  offsetField  15-0
• lw: regB = Memory[regA + offsetField]
• sw: Memory[regA + offsetField] = regB
• beq: if (regA == regB) PC = PC + 1 + offsetField

LC processor: O-type instructions
  Field     Bits
  opcode    24-22
  unused    21-0
• noop: do nothing
• halt: halt the simulation

LC assembly example
          lw     0  1  five     load reg1 with 5 (uses symbolic address)
          lw     1  2  3        load reg2 with -1 (uses numeric address)
  start   add    1  2  1        decrement reg1
          beq    0  1  2        goto end of program when reg1==0
          beq    0  0  start    go back to the beginning of the loop
          noop
  done    halt                  end of program
  five    .fill  5
  neg1    .fill  -1
  stAddr  .fill  start          will contain the address of start (2)
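
The assembly example can be encoded by hand from the R-, I-, and O-type field layouts. The Python sketch below does that packing. The opcode numbers (add=0, lw=2, beq=4, halt=6, noop=7) are an assumption inferred from the machine code listed for this program, not something stated on the format slides, and the labels are resolved by hand (five is address 7, start is address 2).

```python
# Opcode values inferred from the machine-code listing (assumption).
OPCODES = {"add": 0, "lw": 2, "beq": 4, "halt": 6, "noop": 7}


def encode_r(op, reg_a, reg_b, dest_reg):
    """R-type: opcode[24-22] regA[21-19] regB[18-16] unused[15-3] destReg[2-0]."""
    return (OPCODES[op] << 22) | (reg_a << 19) | (reg_b << 16) | dest_reg


def encode_i(op, reg_a, reg_b, offset):
    """I-type: opcode[24-22] regA[21-19] regB[18-16] offsetField[15-0]."""
    # The offset is stored as a 16-bit two's-complement value.
    return (OPCODES[op] << 22) | (reg_a << 19) | (reg_b << 16) | (offset & 0xFFFF)


def encode_o(op):
    """O-type: opcode[24-22], remaining bits unused."""
    return OPCODES[op] << 22


START, FIVE = 2, 7  # label addresses, resolved by hand

program = [
    encode_i("lw", 0, 1, FIVE),             # address 0
    encode_i("lw", 1, 2, 3),                # address 1
    encode_r("add", 1, 2, 1),               # address 2 (start)
    encode_i("beq", 0, 1, 2),               # address 3
    encode_i("beq", 0, 0, START - (4 + 1)), # address 4: target minus (PC + 1)
    encode_o("noop"),                       # address 5
    encode_o("halt"),                       # address 6 (done)
    5,                                      # address 7 (five)
    -1,                                     # address 8 (neg1)
    START,                                  # address 9 (stAddr)
]

for address, word in enumerate(program):
    print(f"(address {address}): {word}")
```

The backward branch at address 4 shows why beq offsets are relative to PC + 1: jumping from address 4 to address 2 needs an offset of -3, stored as 0xfffd.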

LC machine code example
  (address 0): 8454151 (hex 0x810007)
  (address 1): 9043971 (hex 0x8a0003)
  (address 2): 655361 (hex 0xa0001)
  (address 3): 16842754 (hex 0x1010002)
  (address 4): 16842749 (hex 0x100fffd)
  (address 5): 29360128 (hex 0x1c00000)
  (address 6): 25165824 (hex 0x1800000)
  (address 7): 5 (hex 0x5)
  (address 8): -1 (hex 0xffffffff)
  (address 9): 2 (hex 0x2)
Input for simulator:
  8454151
  9043971
  655361
  16842754
  16842749
  29360128
  25165824
  5
  -1
  2
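
To see what the simulator input does, here is a very small interpreter sketch for the machine code above. It is not the course's simulator. It assumes 8 registers (3-bit register fields), a 16-bit sign-extended offsetField, and opcode numbers add=0, lw=2, beq=4, halt=6, noop=7 inferred from the listing; nand=1 and sw=3 are additional guesses. It also uses unbounded Python integers instead of 32-bit words.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a two's-complement number."""
    return value - 0x10000 if value & 0x8000 else value


def run(memory, max_steps=1000):
    """Execute LC machine code until halt; returns the final register list."""
    mem = list(memory)
    regs = [0] * 8
    pc = 0
    for _ in range(max_steps):
        word = mem[pc]
        opcode = (word >> 22) & 0x7
        reg_a = (word >> 19) & 0x7
        reg_b = (word >> 16) & 0x7
        offset = sign_extend16(word & 0xFFFF)
        dest = word & 0x7
        pc += 1                                   # default: next instruction
        if opcode == 0:                           # add
            regs[dest] = regs[reg_a] + regs[reg_b]
        elif opcode == 1:                         # nand = NOT(AND); opcode assumed
            regs[dest] = ~(regs[reg_a] & regs[reg_b])
        elif opcode == 2:                         # lw
            regs[reg_b] = mem[regs[reg_a] + offset]
        elif opcode == 3:                         # sw; opcode assumed
            mem[regs[reg_a] + offset] = regs[reg_b]
        elif opcode == 4:                         # beq: PC = PC + 1 + offsetField
            if regs[reg_a] == regs[reg_b]:
                pc += offset
        elif opcode == 6:                         # halt
            return regs
        elif opcode == 7:                         # noop
            pass
        else:
            raise ValueError(f"unsupported opcode {opcode}")
    raise RuntimeError("program did not halt")


PROGRAM = [8454151, 9043971, 655361, 16842754, 16842749,
           29360128, 25165824, 5, -1, 2]

final_regs = run(PROGRAM)
print("reg1 =", final_regs[1], "reg2 =", final_regs[2])
```

The program loads 5 into reg1 and -1 into reg2, then adds reg2 to reg1 each time around the loop until reg1 reaches 0 and the first beq branches to halt.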