
Advanced Computer Architecture 5MD00 / 5Z033
Instruction Set Design
Henk Corporaal
www.ics.ele.tue.nl/~heco/courses/aca
TU Eindhoven, November 2010
ACA H. Corporaal

Lecture overview
• ISA and Evolution
• Architecture classes
• Addressing
• Operands
• Operations
• Encoding
• RISC
• SIMD extensions
9/17/2020 ACA H. Corporaal

Instruction Set Architecture
• The instruction set architecture serves as the interface between software and hardware
• It provides the mechanism by which the software tells the hardware what should be done
• Architecture definition: “the architecture of a system/processor is (a minimal description of) its behavior as observed by its immediate users”
• Layers (top to bottom): software, instruction set architecture, hardware

Instruction Set Design Issues
• Where are operands stored?
  – registers, memory, stack, accumulator
• How many explicit operands are there?
  – 0, 1, 2, or 3
• How is the operand location specified?
  – register, immediate, indirect, ...
• What type & size of operands are supported?
  – byte, int, float, double, string, vector, ...
• What operations are supported?
  – basic operations: add, sub, mul, move, compare, ...
  – or also very complex operations?

Operands
• How are operands designated?
  – fixed: always in the same place
  – by opcode: always the same for groups of instructions
  – by a field in the instruction: requires decode first
• What is the format of the data?
  – binary
  – character
  – decimal (packed and unpacked)
  – floating-point: IEEE 754 (others used less and less)
  – size: 8-, 16-, 32-, 64-, 128-bit, or vectors of the above types and sizes
• What is the influence on the ISA (= Instruction-Set Architecture)?

Operand Locations
(figure: operand locations)

Classifying ISAs
Class (era)                                         Addresses   Instruction       Action
Accumulator (before 1960)                           1           add A             acc <- acc + mem[A]
Stack (1960s to 1970s)                              0           add               tos <- tos + next
Memory-Memory (1970s to 1980s)                      2           add A, B          mem[A] <- mem[A] + mem[B]
                                                    3           add A, B, C       mem[A] <- mem[B] + mem[C]
Register-Memory (1970s to present)                  2           add R1, A         R1 <- R1 + mem[A]
                                                                load R1, A        R1 <- mem[A]
Register-Register (Load/Store) (1960s to present)   3           add R1, R2, R3    R1 <- R2 + R3
                                                                load R1, R2       R1 <- mem[R2]
                                                                store R1, R2      mem[R1] <- R2
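To make the address counts concrete, here is a minimal Python sketch that computes C = A + B in four of these styles and counts the instructions each one needs. The variable names A, B, C, the function names, and the use of symbolic (direct) addresses in the load/store version are assumptions for illustration; they do not come from the slide.

```python
# Toy model: memory is a dict of named cells; every function computes C = A + B
# and returns how many instructions the chosen ISA style needs for it.

def accumulator_machine(mem):
    """1-address code: load A; add B; store C."""
    acc = mem["A"]              # load A  : acc <- mem[A]
    acc = acc + mem["B"]        # add B   : acc <- acc + mem[B]
    mem["C"] = acc              # store C : mem[C] <- acc
    return 3

def stack_machine(mem):
    """0-address code: push A; push B; add; pop C."""
    stack = []
    stack.append(mem["A"])      # push A
    stack.append(mem["B"])      # push B
    tos = stack.pop()
    nxt = stack.pop()
    stack.append(tos + nxt)     # add     : tos <- tos + next
    mem["C"] = stack.pop()      # pop C
    return 4

def memory_memory_machine(mem):
    """3-address memory-memory code: add C, A, B."""
    mem["C"] = mem["A"] + mem["B"]
    return 1

def load_store_machine(mem):
    """Register-register code: load R1, A; load R2, B; add R3, R1, R2; store R3, C."""
    regs = {}
    regs["R1"] = mem["A"]
    regs["R2"] = mem["B"]
    regs["R3"] = regs["R1"] + regs["R2"]
    mem["C"] = regs["R3"]
    return 4
```

Fewer instructions does not imply faster execution: the single memory-memory instruction still performs three memory accesses, while the load/store sequence keeps its arithmetic in registers.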

Evolution of Architectures
• Single Accumulator (EDSAC 1950)
• Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
• Separation of Programming Model from Implementation
  – High-level Language Based (B5000 1963)
  – Concept of a Processor Family (IBM 360 1964)
• General Purpose Register Machines
  – Complex Instruction Sets (VAX, Intel 8086 1977-80)
  – Load/Store Architecture (CDC 6600, Cray 1 1963-76)
• RISC (MIPS, SPARC, 88000, IBM RS6000, ... 1987+)

Addressing Modes
• Types
  – Register: data in a register
  – Immediate: data in the instruction
  – Memory: data in memory
• Calculation of Effective Address
  – Direct: address in instruction
  – Indirect: address in register
  – Displacement: address = register or PC + offset
  – Indexed: address = register + register
  – Memory Indirect: address at address in register
• Question: What is the influence on ISA?

Types of Addressing Mode (VAX)
    Addressing mode      Example                Action
 1. Register direct      Add R4, R3             R4 <- R4 + R3
 2. Immediate            Add R4, #3             R4 <- R4 + 3
 3. Displacement         Add R4, 100(R1)        R4 <- R4 + M[100 + R1]
 4. Register indirect    Add R4, (R1)           R4 <- R4 + M[R1]
 5. Indexed              Add R4, (R1 + R2)      R4 <- R4 + M[R1 + R2]
 6. Direct               Add R4, (1000)         R4 <- R4 + M[1000]
 7. Memory indirect      Add R4, @(R3)          R4 <- R4 + M[M[R3]]
 8. Autoincrement        Add R4, (R2)+          R4 <- R4 + M[R2]; R2 <- R2 + d
 9. Autodecrement        Add R4, -(R2)          R2 <- R2 - d; R4 <- R4 + M[R2]
10. Scaled               Add R4, 100(R2)[R3]    R4 <- R4 + M[100 + R2 + R3*d]
• d is the size of the operand in bytes
• Studies by [Clark and Emer] indicate that modes 1-4 account for 93% of all operands on the VAX
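The Action column of the table can be sketched as one Python function that fetches the source operand for each mode. This is an illustrative model, not VAX-accurate code: the function name `fetch_operand`, the mode strings, the dict-based register file and memory, and the default operand size `d=4` are all assumptions.

```python
def fetch_operand(mode, regs, mem, base=None, index=None, const=0, d=4):
    """Return the source operand for one addressing mode.

    regs maps register names to integers, mem maps addresses to integers,
    const is the immediate, displacement or absolute address in the instruction,
    and d is the operand size in bytes (assumed to be 4 here).
    """
    if mode == "register":
        return regs[base]
    if mode == "immediate":
        return const
    if mode == "displacement":
        return mem[const + regs[base]]
    if mode == "register_indirect":
        return mem[regs[base]]
    if mode == "indexed":
        return mem[regs[base] + regs[index]]
    if mode == "direct":
        return mem[const]
    if mode == "memory_indirect":
        return mem[mem[regs[base]]]
    if mode == "autoincrement":
        value = mem[regs[base]]      # use the address first ...
        regs[base] += d              # ... then step the register forward
        return value
    if mode == "autodecrement":
        regs[base] -= d              # step the register back first ...
        return mem[regs[base]]       # ... then use the new address
    if mode == "scaled":
        return mem[const + regs[base] + regs[index] * d]
    raise ValueError(f"unknown addressing mode: {mode}")
```

For example, with `regs = {"R1": 100}` and `mem = {200: 7}`, the call `fetch_operand("displacement", regs, mem, base="R1", const=100)` reads `M[100 + R1]`, that is, address 200.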

Operations
• Types
  – ALU: integer arithmetic and logical functions
  – Data transfer: loads/stores
  – Control: branch, jump, call, return, traps, interrupts
  – System: O/S calls, virtual memory management
  – Floating point arithmetic
  – Decimal arithmetic (BCD: binary coded decimal)
  – String: moves, compares, search, etc.
  – Graphics: pixel/vertex operations
  – Vector (SIMD) functions: more complex ones
• Addressing
  – Which addressing modes for which operands are supported?

80x86 Instruction Frequency
(figure: 80x86 instruction frequency)

Relative Frequency of Control Instructions
(figure: relative frequency of control instructions)
• Design hardware to handle branches quickly, since these occur most frequently

Frequency of Operand Sizes on 32-bit Load-Store Machines
(figure: frequency of operand sizes)
• For floating-point operations, we want good performance for 64-bit operands
• For integer operations, we want good performance for 32-bit operands
• Recent architectures also support 64-bit integers

Instruction Encoding
• Variable
  – Instruction length varies based on opcode and address specifiers
  – For example, VAX instructions vary between 1 and 53 bytes, while x86 instructions vary between 1 and 17 bytes
  – Good code density, but difficult to decode and pipeline
• Fixed
  – Only a single size for all instructions
  – For example, MIPS, PowerPC, and SPARC all have 32-bit instructions
  – Not as good code density, but easier to decode and pipeline
• Hybrid
  – Have multiple format lengths specified by the opcode
  – For example, IBM 360/370
  – Compromise between code density and ease of decode

Instruction Encoding
(figure: instruction encoding)

Example: MIPS
(figure: MIPS instruction formats)
• Operands mostly at fixed positions
• Fixed instruction size; few formats
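As a sketch of what "operands at fixed positions" means, the following Python snippet packs and unpacks the standard MIPS32 R-type layout (opcode 6 bits, rs 5, rt 5, rd 5, shamt 5, funct 6). The field layout is the standard MIPS one; the helper names `encode_r_type` and `decode_r_type` are hypothetical and not taken from the slides.

```python
# MIPS32 R-type fields, listed from the most significant bits to the least.
FIELDS = (("opcode", 6), ("rs", 5), ("rt", 5), ("rd", 5), ("shamt", 5), ("funct", 6))

def encode_r_type(rs, rt, rd, shamt=0, funct=0, opcode=0):
    """Pack the six R-type fields into one 32-bit instruction word."""
    values = {"opcode": opcode, "rs": rs, "rt": rt,
              "rd": rd, "shamt": shamt, "funct": funct}
    word = 0
    for name, bits in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
        word = (word << bits) | value
    return word

def decode_r_type(word):
    """Unpack a 32-bit word into its R-type fields (works from the low bits up)."""
    fields = {}
    for name, bits in reversed(FIELDS):
        fields[name] = word & ((1 << bits) - 1)
        word >>= bits
    return fields

# add $t0, $s1, $s2  ->  rd = 8 ($t0), rs = 17 ($s1), rt = 18 ($s2), funct = 0x20
word = encode_r_type(rs=17, rt=18, rd=8, funct=0x20)
print(hex(word))  # 0x2324020
```

Because every field sits at the same bit position in every R-type instruction, a decoder can read the register numbers before it knows which operation the word encodes.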

Compilers and ISA
• Compiler Goals
  – All correct programs compile correctly
  – Most compiled programs execute quickly
  – Most programs compile quickly
  – Achieve small code size
  – Provide debugging support
• Multiple Source Compilers
  – Same compiler can compile different languages
• Multiple Target Compilers
  – Same compiler can generate code for different machines
  – 'cross-compiler'

Compiler basics: trajectory
• Flow: Source program -> Preprocessor -> Compiler -> Assembler -> Loader/Linker -> Object program
• Also shown in the figure: Library code, Error messages

Compiler basics: structure / passes
Source code
• Lexical analyzer
  – token generation
• Parsing
  – check syntax
  – check semantic
  – parse tree generation
Intermediate code
• Code optimization
  – data flow analysis
  – local optimizations
  – global optimizations
• Code generation
  – code selection
  – peephole optimizations
• Register allocation
  – making interference graph
  – graph coloring
  – spill code insertion
  – caller / callee save and restore code
Sequential code
• Scheduling and allocation
  – exploiting ILP
Object code

Compiler basics: structure
Simple compilation example
    position := initial + rate * 60
• Lexical analyzer
    id1 := id2 + id3 * 60
• Syntax analyzer
    (figure: parse tree for id1 := id2 + id3 * 60)
• Intermediate code generator
    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
• Code optimizer
    temp1 := id3 * 60.0
    id1 := id2 + temp1
• Code generator
    movf id3, r2
    mulf #60.0, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1
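A quick way to see that the code optimizer preserved the meaning of the program is to run both versions of the intermediate code on the same inputs. The sketch below transcribes the two three-address sequences into Python; the function names and the sample values are assumptions chosen for illustration.

```python
def unoptimized(initial, rate):
    """Intermediate code as produced by the intermediate code generator."""
    temp1 = float(60)          # temp1 := inttoreal(60)
    temp2 = rate * temp1       # temp2 := id3 * temp1
    temp3 = initial + temp2    # temp3 := id2 + temp2
    return temp3               # id1   := temp3

def optimized(initial, rate):
    """After the optimizer: conversion folded into a constant, copy removed."""
    temp1 = rate * 60.0        # temp1 := id3 * 60.0
    return initial + temp1     # id1   := id2 + temp1

# Both sequences compute the same value for position.
print(unoptimized(10.0, 2.5), optimized(10.0, 2.5))  # 160.0 160.0
```

The optimizer saves two of the four operations: the int-to-real conversion is done at compile time, and the final copy through temp3 disappears.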

Designing ISA to Improve Compilation
• Provide enough general purpose registers to ease register allocation (at least 16)
• Provide regular instruction sets by keeping the operations, data types, and addressing modes largely orthogonal
• Provide primitive constructs rather than trying to map to a high-level language
• Allow compilers to help make the common case fast

A "Typical" RISC
• 32-bit fixed-length instructions
• Only a few instruction formats
• 32 32-bit GPRs (general purpose registers)
• 3-address, reg-reg / reg-imm-reg arithmetic instructions
• Single address mode for load/store: base + displacement
  – no indirection
• Simple branch conditions
• Pipelined implementation
• Separate instruction and data level-1 caches
• Delayed branch?

Comparison of MIPS with 80x86
• How would you expect the x86 and MIPS architectures to compare on the following?
  – CPI on SPEC benchmarks
  – Ease of design and implementation
  – Ease of writing assembly language & compilers
  – Code density
  – Overall performance
• What other advantages/disadvantages are there to the two architectures?

Instruction Set Extensions
Subword parallelism
• Support graphics and multimedia applications
  – Intel’s MMX Technology (introduced in 1997)
  – Intel’s Internet Streaming SIMD Extensions (SSE to SSE4)
  – AMD’s 3DNow! Technology
  – Sun’s Visual Instruction Set
  – Motorola’s and IBM’s AltiVec Technology
• These extensions improve the performance of
  – Computer-aided design
  – Internet applications
  – Computer visualization
  – Video games
  – Speech recognition

MMX Data Types
MMX Technology supports operations on the following 64-bit integer data types:
• Packed byte (eight 8-bit elements)
• Packed word (four 16-bit elements)
• Packed double word (two 32-bit elements)
• Packed quad word (one 64-bit element)

SIMD Operations
• MMX Technology allows a Single Instruction to work on Multiple pieces of Data (SIMD)
        A3       A2       A1       A0
    +   B3       B2       B1       B0
    = A3+B3    A2+B2    A1+B1    A0+B0
    PADD[W]: Packed add word
• In the above example, 4 parallel adds are performed on 16-bit elements
• Most MMX instructions only require a single cycle
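The packed add can be modelled in a few lines of Python. This is a behavioural sketch only: a 64-bit MMX register is represented as a list of four 16-bit lanes (lowest lane first), and the function name `paddw` simply mirrors the mnemonic.

```python
MASK16 = 0xFFFF  # each lane holds a 16-bit word

def paddw(a, b):
    """Lane-wise 16-bit add with wrap-around, one result lane per input pair."""
    return [(x + y) & MASK16 for x, y in zip(a, b)]

a = [0xFFFF, 1, 2, 3]   # A0..A3
b = [1, 1, 1, 1]        # B0..B3
print(paddw(a, b))      # [0, 2, 3, 4]
```

Note that the carry out of lane 0 is dropped rather than propagated into lane 1; that independence between lanes is what distinguishes a packed add from one ordinary 64-bit add.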

Saturating Arithmetic
• Both wrap-around and saturating adds are supported
• With saturating arithmetic, results that overflow/underflow are set to the largest/smallest representable value
• Figure, left: PADD[W]: Packed wrap-around add
• Figure, right: PADDUS[W]: Packed saturating add
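The difference between the two kinds of add shows up only when a lane overflows. The Python sketch below models wrap-around, unsigned saturating, and signed saturating 16-bit adds; registers are modelled as lists of lanes, and the function names follow the MMX mnemonics PADDW, PADDUSW and PADDSW.

```python
def paddw(a, b):
    """Wrap-around add: the result is taken modulo 2**16."""
    return [(x + y) & 0xFFFF for x, y in zip(a, b)]

def paddusw(a, b):
    """Unsigned saturating add: results above 0xFFFF are clamped to 0xFFFF."""
    return [min(x + y, 0xFFFF) for x, y in zip(a, b)]

def paddsw(a, b):
    """Signed saturating add: results are clamped to the range -32768..32767."""
    return [max(-32768, min(32767, x + y)) for x, y in zip(a, b)]

print(paddw([0xFFFF], [1]))                 # [0]
print(paddusw([0xFFFF], [1]))               # [65535]
print(paddsw([32767, -32768], [1, -1]))     # [32767, -32768]
```

Saturation is usually what image and audio code wants: a pixel that becomes too bright should stay white instead of wrapping around to black.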

Pack and Unpack Instructions
• Pack and unpack instructions provide conversion between standard data types and packed data types
• Figure: PACKSS[DW]: Pack signed, with saturation, doubleword to packed word
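A behavioural sketch of PACKSSDW in Python: two registers, each holding two signed 32-bit doublewords, are narrowed into one register of four signed 16-bit words, with out-of-range values clamped. Lanes are listed lowest first, and the helper name `saturate_to_int16` is an assumption for illustration.

```python
def saturate_to_int16(value):
    """Clamp a signed integer to the signed 16-bit range."""
    return max(-32768, min(32767, value))

def packssdw(dst, src):
    """Pack two doublewords from dst (low half) and two from src (high half)."""
    return [saturate_to_int16(v) for v in (dst[0], dst[1], src[0], src[1])]

print(packssdw([100000, -5], [-100000, 1234]))  # [32767, -5, -32768, 1234]
```

Values that already fit in 16 bits pass through unchanged, so the instruction is a lossless narrowing for well-scaled data.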

Multiply-Add Operations
• Many graphics applications require multiply-accumulate operations
  – Vector dot products: a • b
  – Matrix multiplies
  – Fast Fourier Transforms (FFTs)
  – Filter implementations
• Figure: PMADDWD: Packed multiply-add word to double

Vector Dot Product
• A dot product on an 8-element vector can be performed using 9 MMX instructions
  – Without MMX, 40 instructions are required
• Figure: two packed partial results, [0 | a0*c0 + .. + a3*c3] and [0 | a4*c4 + .. + a7*c7], are added to give a0*c0 + .. + a7*c7
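The data flow of the dot product can be sketched in Python by modelling PMADDWD (multiply four signed word pairs, add adjacent products into two doublewords) and then summing the partial results. This follows the idea in the figure but is not the slide's exact 9-instruction sequence; the function names `to_int32`, `pmaddwd` and `dot8` are assumptions for illustration.

```python
def to_int32(value):
    """Wrap an integer to the signed 32-bit range, as a 32-bit lane would."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def pmaddwd(a, b):
    """Four signed 16-bit lanes in, two signed 32-bit lanes out."""
    return [to_int32(a[0] * b[0] + a[1] * b[1]),
            to_int32(a[2] * b[2] + a[3] * b[3])]

def dot8(a, c):
    """Dot product of two 8-element vectors of signed 16-bit values."""
    low = pmaddwd(a[:4], c[:4])      # pair sums for elements 0..3
    high = pmaddwd(a[4:], c[4:])     # pair sums for elements 4..7
    partial = [to_int32(low[0] + high[0]),
               to_int32(low[1] + high[1])]   # packed doubleword add
    return to_int32(partial[0] + partial[1]) # final horizontal add

print(dot8([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]))  # 120
```

Each call to `pmaddwd` stands for one instruction that performs four multiplies and two adds, which is where most of the instruction-count saving comes from.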

Packed Compare Instructions
• Packed compare instructions allow a bit mask to be set or cleared
• This is useful when images with certain qualities need to be extracted
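One common use of such masks is branch-free selection, for example replacing every pixel that matches a key colour with a background pixel. The Python sketch below models a packed compare-equal on 16-bit lanes followed by an AND / AND-NOT / OR combination; the pixel values, the key colour, and the helper name `select` are made up for illustration.

```python
MASK16 = 0xFFFF

def pcmpeqw(a, b):
    """Per-lane compare: all ones where the lanes are equal, all zeros elsewhere."""
    return [MASK16 if x == y else 0 for x, y in zip(a, b)]

def select(mask, if_set, if_clear):
    """Per-lane (mask AND if_set) OR (NOT mask AND if_clear)."""
    return [(m & s) | (~m & MASK16 & c) for m, s, c in zip(mask, if_set, if_clear)]

foreground = [7, 0x1234, 9, 0x1234]   # 0x1234 plays the role of the key colour
key = [0x1234] * 4
background = [1, 2, 3, 4]

mask = pcmpeqw(foreground, key)
print(select(mask, background, foreground))  # [7, 2, 9, 4]
```

Because the mask is all ones or all zeros per lane, the selection needs no conditional branch, so all four pixels are handled by the same instruction sequence.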

MMX Instructions
• MMX Technology adds 57 new instructions to the x86 architecture
• Some of these instructions include (b = byte, 8-bit; w = word, 16-bit; d = doubleword, 32-bit; q = quadword, 64-bit)
  – PADD(b, w, d): packed addition
  – PSUB(b, w, d): packed subtraction
  – PCMPEQ(b, w, d): packed compare equal
  – PMULLw: packed word multiply low
  – PMULHw: packed word multiply high
  – PMADDwd: packed word multiply-add
  – PSRL(w, d, q): packed shift right logical
  – PACKSS(wb, dw): pack data
  – PUNPCK(bw, wd, dq): unpack data
  – PAND, POR, PXOR: packed logical operations

MMX Performance Comparison
Application         Without MMX   With MMX   Speedup
Video               155.52        268.70     1.72
Image Processing    159.03        743.90     4.67
3D geometry         161.52        166.44     1.03
Audio               149.80        318.90     2.13
Overall             156.00        255.43     1.64
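The Speedup column can be checked with one division per row, assuming the two score columns are benchmark scores where higher is better, so that speedup = (score with MMX) / (score without MMX). That reading is an assumption; the slide does not state the units. The sketch below recomputes the ratios from the table.

```python
# (without MMX, with MMX, speedup as listed in the table)
rows = {
    "Video": (155.52, 268.70, 1.72),
    "Image Processing": (159.03, 743.90, 4.67),
    "3D geometry": (161.52, 166.44, 1.03),
    "Audio": (149.80, 318.90, 2.13),
    "Overall": (156.00, 255.43, 1.64),
}

def speedup(without_mmx, with_mmx):
    """Ratio of the two scores, assuming a higher score means better performance."""
    return with_mmx / without_mmx

for name, (without_mmx, with_mmx, listed) in rows.items():
    print(f"{name}: computed {speedup(without_mmx, with_mmx):.3f}, listed {listed}")
```

Under that assumption, every recomputed ratio agrees with the listed value to within 0.01; the small differences look like a mix of rounding and truncation in the original table.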

MMX Technology Summary
• MMX technology extends the Intel x86 architecture to improve the performance of multimedia and graphics applications
• It provides a speedup of 1.5 to 2.0 for certain applications
• MMX instructions are hand-coded in assembly or implemented as libraries to achieve high performance
• MMX data types use the x86 floating point registers to avoid adding state to the processor
  – Makes it easy to handle context switches
  – Makes it hard to perform MMX and floating point instructions at the same time
• MMX increases the chip area by only about 5%

Questions on MMX
• What are the strengths and weaknesses of MMX Technology?
• How could MMX Technology potentially be improved?
• How did the developers of MMX preserve backward compatibility with the x86 architecture?
  – Why was this important?
  – What are the disadvantages of this approach?
• What restrictions/limitations are there on the use of MMX Technology?

Internet Streaming SIMD Extensions (SSE)
• Help improve the performance of video and 3D applications
• Are designed for streaming data, which is used once and then discarded
• Add 70 new instructions beyond MMX Technology
• Add 8 new 128-bit vector registers (XMM0 to XMM7)
• Provide the ability to perform multiple floating point operations
  – Four parallel operations on 32-bit numbers
  – Reciprocal and reciprocal square root instructions: normalization
  – Packed average instruction: motion compensation
• Provide data prefetch instructions
• Make certain applications 1.5 to 2.0 times faster

Beyond SSE
• SSE2: SIMD on any data type from 8-bit int to 64-bit double, using XMM vector registers
• SSE4: dot-product operation
• AVX (Advanced Vector Extensions): 2011
  – 16 256-bit vector registers, YMM0-YMM15
    • designed to be extensible to 512 and 1024 bits (512-bit registers later arrived with AVX-512)
  – 3-operand instructions (instead of 2 operands, where one source register is implicitly also the destination)
  – 2011: in Intel Sandy Bridge and AMD Bulldozer architectures