ARM Introduction & Instruction Set Architecture
Aleksandar Milenkovic
E-mail: milenka@ece.uah.edu
Web: http://www.ece.uah.edu/~milenka

Outline
- ARM Architecture
- ARM Organization and Implementation
- ARM Instruction Set
- Thumb Instruction Set
- Architectural Support for System Development
- ARM Processor Cores
- Memory Hierarchy
- Architectural Support for Operating Systems
- ARM CPU Cores
- Embedded ARM Applications

ARM History
- ARM: Acorn RISC Machine (1983-1985)
  - Acorn Computers Limited, Cambridge, England
- ARM: Advanced RISC Machine (1990)
  - ARM Limited, 1990
  - ARM has been licensed to many semiconductor manufacturers

ARM’s visible registers
- User level
  - 15 GPRs (r0-r14), PC (r15), CPSR (current program status register)
- Remaining registers are used for system-level programming and for handling exceptions

    Mode        Registers
    user        r0-r14, r15 (PC), CPSR (usable in user mode)
    fiq         r8_fiq-r14_fiq, SPSR_fiq
    svc         r13_svc, r14_svc, SPSR_svc
    abort       r13_abt, r14_abt, SPSR_abt
    irq         r13_irq, r14_irq, SPSR_irq
    undefined   r13_und, r14_und, SPSR_und

The banked registers and the SPSRs are available in system modes only.

ARM CPSR format
- N (Negative), Z (Zero), C (Carry), V (oVerflow)
- mode: controls the processor mode
- T: controls the instruction set
  - T = 1: instruction stream is 16-bit Thumb instructions
  - T = 0: instruction stream is 32-bit ARM instructions
- I, F: interrupt enable control (a set bit disables IRQ or FIQ respectively)
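The CPSR fields listed above can be pulled apart with a few shifts and masks. The following is a minimal Python sketch, not part of the original slides; the bit positions and mode encodings are the architecture-defined ones, while the function name `decode_cpsr` and the dictionary layout are illustrative assumptions.

```python
# Architecture-defined CPSR bit positions.
N_BIT, Z_BIT, C_BIT, V_BIT = 31, 30, 29, 28
I_BIT, F_BIT, T_BIT = 7, 6, 5
MODE_MASK = 0x1F

# Mode field encodings (bits 4:0).
MODES = {
    0b10000: "user", 0b10001: "fiq", 0b10010: "irq", 0b10011: "svc",
    0b10111: "abort", 0b11011: "undefined", 0b11111: "system",
}

def decode_cpsr(cpsr):
    """Split a 32-bit CPSR value into its named fields (hypothetical helper)."""
    def bit(n):
        return (cpsr >> n) & 1
    return {
        "N": bit(N_BIT), "Z": bit(Z_BIT), "C": bit(C_BIT), "V": bit(V_BIT),
        "I": bit(I_BIT),  # 1 = IRQ disabled
        "F": bit(F_BIT),  # 1 = FIQ disabled
        "T": bit(T_BIT),  # 1 = Thumb, 0 = ARM
        "mode": MODES.get(cpsr & MODE_MASK, "unknown"),
    }
```

For example, `decode_cpsr(0x600000D3)` describes a supervisor-mode ARM state with Z and C set and both interrupts masked.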

ARM memory organization
- Linear array of bytes numbered from 0 to 2^32 - 1
- Data items
  - bytes (8 bits)
  - half-words (16 bits): always aligned to 2-byte boundaries (start at an even byte address)
  - words (32 bits): always aligned to 4-byte boundaries (start at a byte address which is a multiple of 4)

ARM instruction set
- Load-store architecture
  - operands are in GPRs
  - load/store: the only instructions that operate on memory
- Instructions
  - Data Processing: use and change only register values
  - Data Transfer: copy memory values into registers (load) or copy register values into memory (store)
  - Control Flow
    - branch-and-link: save the return address to resume the original sequence
    - trapping into system code: supervisor calls

ARM instruction set (cont’d)
- Three-address data processing instructions
- Conditional execution of every instruction
- Powerful load/store multiple register instructions
- Ability to perform a general shift operation and a general ALU operation in a single instruction that executes in a single clock cycle
- Open instruction set extension through the coprocessor instruction set, including adding new registers and data types to the programmer’s model
- Very dense 16-bit compressed representation of the instruction set in the Thumb architecture

I/O system
- I/O is memory mapped
  - internal registers of peripherals (disk controllers, network interfaces, etc.) are addressable locations within the ARM’s memory map and may be read and written using the load-store instructions
- Peripherals may use either the normal interrupt (IRQ) or the fast interrupt (FIQ) input
  - normally most interrupt sources share the IRQ input, while just one or two time-critical sources are connected to the FIQ input
- Some systems may include external DMA hardware to handle high-bandwidth I/O traffic

ARM exceptions
- ARM supports a range of interrupts, traps, and supervisor calls; all are grouped under the general heading of exceptions
- Handling exceptions
  - the current state is saved by copying the PC into r14_exc and the CPSR into SPSR_exc (exc stands for the exception type)
  - the processor operating mode is changed to the appropriate exception mode
  - the PC is forced to a value between 0x00 and 0x1C, the particular value depending on the type of exception
  - the instruction at the location the PC is forced to (the vector address) usually contains a branch to the exception handler; the exception handler will use r13_exc, which is normally initialized to point to a dedicated stack in memory, to save some user registers
  - return: restore the user registers and then restore the PC and CPSR atomically
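The entry sequence above can be sketched as a small state transformation. This is a simplified Python model, not from the slides: the vector addresses and target modes are the standard ARM ones, but the `state` dictionary, the function name `enter_exception`, and the omission of the per-exception return-address offset and CPSR bit updates are assumptions made to keep the example short.

```python
# Standard ARM exception vectors: name -> (vector address, exception mode suffix).
VECTORS = {
    "reset":          (0x00, "svc"),
    "undefined":      (0x04, "und"),
    "swi":            (0x08, "svc"),
    "prefetch_abort": (0x0C, "abt"),
    "data_abort":     (0x10, "abt"),
    "irq":            (0x18, "irq"),
    "fiq":            (0x1C, "fiq"),
}

def enter_exception(state, kind):
    """Return the processor state after taking exception `kind` (simplified)."""
    vector, mode = VECTORS[kind]
    new = dict(state)
    new["r14_" + mode] = state["pc"]     # save return address in banked link register
    new["SPSR_" + mode] = state["cpsr"]  # save CPSR in banked SPSR
    new["mode"] = mode                   # switch to the exception mode
    new["pc"] = vector                   # force PC to the vector address
    return new
```

A return from the handler would reverse the last three steps together: copy `SPSR_exc` back to the CPSR and `r14_exc` (suitably adjusted) back to the PC.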

ARM cross-development toolkit
- Software development
  - tools developed by ARM Limited
  - public domain tools (ARM back end for the gcc C compiler)
- Cross-development
  - tools run on a different architecture from the one for which they produce code
[Figure: toolkit structure. C source and C libraries feed the C compiler, asm source feeds the assembler; the resulting .aof object files and object libraries go to the linker, which produces an .axf image; the debugger (ARMsd) runs the image on a system model (ARMulator) or a development board]

Outline
- ARM Architecture
- ARM Assembly Language Programming
- ARM Organization and Implementation
- ARM Instruction Set
- Architectural Support for High-level Languages
- Thumb Instruction Set
- Architectural Support for System Development
- ARM Processor Cores
- Memory Hierarchy
- Architectural Support for Operating Systems
- ARM CPU Cores
- Embedded ARM Applications

ARM Instruction Set
- Data Processing Instructions
- Data Transfer Instructions
- Control Flow Instructions

Data Processing Instructions
- Classes of data processing instructions
  - Arithmetic operations
  - Bit-wise logical operations
  - Register-movement operations
  - Comparison operations
- Operands: 32 bits wide; there are 3 ways to specify operands
  - they come from registers
  - the second operand may be a constant (immediate)
  - shifted register operand
- Result: 32 bits wide, placed in a register
  - long multiply produces a 64-bit result

Data Processing Instructions (cont’d)

Arithmetic operations
    ADD r0, r1, r2    ; r0 := r1 + r2
    ADC r0, r1, r2    ; r0 := r1 + r2 + C
    SUB r0, r1, r2    ; r0 := r1 - r2
    SBC r0, r1, r2    ; r0 := r1 - r2 + C - 1
    RSB r0, r1, r2    ; r0 := r2 - r1
    RSC r0, r1, r2    ; r0 := r2 - r1 + C - 1

Bit-wise logical operations
    AND r0, r1, r2    ; r0 := r1 and r2
    ORR r0, r1, r2    ; r0 := r1 or r2
    EOR r0, r1, r2    ; r0 := r1 xor r2
    BIC r0, r1, r2    ; r0 := r1 and not r2

Register movement
    MOV r0, r2        ; r0 := r2
    MVN r0, r2        ; r0 := not r2

Comparison operations
    CMP r1, r2        ; set cc on r1 - r2
    CMN r1, r2        ; set cc on r1 + r2
    TST r1, r2        ; set cc on r1 and r2
    TEQ r1, r2        ; set cc on r1 xor r2

Data Processing Instructions (cont’d)
- Immediate operands: immediate = (0 -> 255) x 2^(2n), 0 <= n <= 12
    ADD r3, r3, #3            ; r3 := r3 + 3
    AND r8, r7, #&ff          ; r8 := r7[7:0], & for hex
- Shifted register operands
  - the second operand is subject to a shift operation before it is combined with the first operand
    ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
    ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3
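Whether a constant fits the immediate form can be checked mechanically. The sketch below is an illustrative Python helper, not part of the slides: it uses the instruction-level view of the same rule, an 8-bit value rotated right by twice a 4-bit rotate field, which covers the 2^(2n) multiples in the formula above as well as the patterns that wrap around the top of the word. The names `ror32` and `encode_immediate` are assumptions.

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF

def encode_immediate(value):
    """Return (imm8, rot) such that value == ror32(imm8, 2 * rot), or None."""
    value &= 0xFFFFFFFF
    for rot in range(16):
        # Undo a right-rotation by 2*rot (i.e. rotate left by 2*rot).
        imm8 = ror32(value, 32 - 2 * rot)
        if imm8 < 256:
            return imm8, rot
    return None
```

A value such as 0x101 needs nine significant bits however it is rotated, so it cannot be an immediate and would have to be loaded from memory or built in two instructions.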

ARM shift operations
- LSL: Logical Shift Left
- LSR: Logical Shift Right
- ASR: Arithmetic Shift Right
- ROR: Rotate Right
- RRX: Rotate Right Extended by 1 place
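The five shift types differ only in what is shifted into the vacated bit positions. A minimal Python sketch follows, assuming 32-bit operands and shift amounts from 0 to 31; the function names mirror the mnemonics but are otherwise illustrative and not taken from the slides.

```python
MASK = 0xFFFFFFFF

def lsl(x, n):
    """Logical shift left: zeros enter from the right."""
    return (x << n) & MASK

def lsr(x, n):
    """Logical shift right: zeros enter from the left."""
    return (x & MASK) >> n

def asr(x, n):
    """Arithmetic shift right: copies of the sign bit enter from the left."""
    x &= MASK
    shifted = x >> n
    if x >> 31:
        shifted |= (MASK << (32 - n)) & MASK
    return shifted

def ror(x, n):
    """Rotate right: bits leaving on the right re-enter on the left."""
    n %= 32
    x &= MASK
    return ((x >> n) | (x << (32 - n))) & MASK

def rrx(x, carry):
    """Rotate right extended by one place through the carry flag.

    Returns (result, new_carry): the old carry enters bit 31 and
    bit 0 becomes the new carry.
    """
    x &= MASK
    return ((carry << 31) | (x >> 1)) & MASK, x & 1
```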

Setting the condition codes
- Any DPI can set the condition codes (N, Z, V, and C)
  - for all DPIs except the comparison operations a specific request must be made
  - at the assembly language level this request is indicated by adding an `S` to the opcode
  - Example: 64-bit add, [r3:r2] := [r1:r0] + [r3:r2]
        ADDS r2, r2, r0    ; carry out to C
        ADC  r3, r3, r1    ; ... add into high word
- Arithmetic operations set all the flags (N, Z, C, and V)
- Logical and move operations set N and Z
  - they preserve V, and either preserve C when there is no shift operation or set C according to the shift operation (the bit that falls off)
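The ADDS/ADC pair works because the carry flag links the two 32-bit halves. The following Python sketch models that under simple assumptions: `adds` is a hypothetical helper returning the result and the four flags, and `add64` chains two calls the way the slide's example does.

```python
MASK = 0xFFFFFFFF

def adds(a, b, carry_in=0):
    """32-bit add with carry-in; returns (result, flags) like ADDS/ADCS."""
    a &= MASK
    b &= MASK
    total = a + b + carry_in
    result = total & MASK
    flags = {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": total >> 32,  # carry out of bit 31
        # Signed overflow: both operands differ in sign from the result.
        "V": ((a ^ result) & (b ^ result)) >> 31,
    }
    return result, flags

def add64(lo_a, hi_a, lo_b, hi_b):
    """64-bit add built from a low-word ADDS and a high-word ADC."""
    lo, flags = adds(lo_a, lo_b)          # ADDS: carry out to C
    hi, _ = adds(hi_a, hi_b, flags["C"])  # ADC: add C into the high word
    return lo, hi
```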

Multiplies
- Example (Multiply, Multiply-Accumulate)
    MUL r4, r3, r2            ; r4 := [r3 x r2]<31:0>
    MLA r4, r3, r2, r1        ; r4 := [r3 x r2 + r1]<31:0>
- Note
  - the least significant 32 bits are placed in the result register, the rest are ignored
  - an immediate second operand is not supported
  - the result register must not be the same as the first source register
  - if the `S` bit is set, V is preserved and C is rendered meaningless
- Example (r0 := r0 x 35)
    ADD r0, r0, r0, LSL #2    ; r0' := r0 x 5
    RSB r0, r0, r0, LSL #3    ; r0'' := 7 x r0'
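Multiplying by a constant such as 35 can be checked by replaying the shift-and-add steps on 32-bit values. This is a small Python sketch under the assumption that the two instructions are the three-operand `ADD r0, r0, r0, LSL #2` and `RSB r0, r0, r0, LSL #3`; the function names are illustrative.

```python
MASK = 0xFFFFFFFF

def mul(rm, rs):
    """MUL: keep only the least significant 32 bits of the product."""
    return (rm * rs) & MASK

def mla(rm, rs, rn):
    """MLA: multiply-accumulate, truncated to 32 bits."""
    return (rm * rs + rn) & MASK

def times35(r0):
    """Multiply by 35 = 5 x 7 using one shifted ADD and one shifted RSB."""
    r0 = (r0 + (r0 << 2)) & MASK   # ADD r0, r0, r0, LSL #2  -> 5 x r0
    r0 = ((r0 << 3) - r0) & MASK   # RSB r0, r0, r0, LSL #3  -> 7 x (5 x r0)
    return r0
```

The shift-and-add form gives the same 32-bit result as `mul(x, 35)` for every input, since both are computed modulo 2^32.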

Data transfer instructions
- Single register load and store instructions
  - transfer of a data item (byte, half-word, word) between ARM registers and memory
- Multiple register load and store instructions
  - enable transfer of large quantities of data
  - used for procedure entry and exit, to save/restore workspace registers, and to copy blocks of data around memory
- Single register swap instructions
  - allow exchange between a register and memory in one instruction
  - used to implement semaphores to ensure mutual exclusion on accesses to shared data in multiprocessor systems

Data Transfer Instructions (cont’d)

Register-indirect addressing (single register load and store)
    LDR  r0, [r1]         ; r0 := mem32[r1]
    STR  r0, [r1]         ; mem32[r1] := r0
    Note: r1 keeps a word address (2 LSBs are 0)
    LDRB r0, [r1]         ; r0 := mem8[r1]
    Note: no restrictions for r1

Base+offset addressing (offset of up to 4 Kbytes)
    LDR  r0, [r1, #4]     ; r0 := mem32[r1 + 4]

Auto-indexing addressing
    LDR  r0, [r1, #4]!    ; r0 := mem32[r1 + 4]; r1 := r1 + 4

Post-indexed addressing
    LDR  r0, [r1], #4     ; r0 := mem32[r1]; r1 := r1 + 4
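The three indexed forms differ in which address is used and whether the base register is updated. A minimal Python sketch follows, assuming memory is a dictionary from byte address to word and registers are a dictionary from name to value; the function names are hypothetical.

```python
def ldr_offset(mem, regs, rd, rn, offset):
    """LDR rd, [rn, #offset]: base plus offset, base unchanged."""
    regs[rd] = mem[regs[rn] + offset]

def ldr_pre_index(mem, regs, rd, rn, offset):
    """LDR rd, [rn, #offset]!: base is updated first, then used (auto-indexing)."""
    regs[rn] += offset
    regs[rd] = mem[regs[rn]]

def ldr_post_index(mem, regs, rd, rn, offset):
    """LDR rd, [rn], #offset: base is used first, then updated."""
    regs[rd] = mem[regs[rn]]
    regs[rn] += offset
```

Post-indexing is what lets a copy loop drop its separate pointer increments: each load or store also advances its pointer.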

Data Transfer Instructions (cont’d)

Table copy with register-indirect addressing:
    COPY:   ADR r1, TABLE1      ; r1 points to TABLE1
            ADR r2, TABLE2      ; r2 points to TABLE2
    LOOP:   LDR r0, [r1]
            STR r0, [r2]
            ADD r1, r1, #4
            ADD r2, r2, #4
            ...
    TABLE1: ...
    TABLE2: ...

The same copy with post-indexed addressing:
    COPY:   ADR r1, TABLE1      ; r1 points to TABLE1
            ADR r2, TABLE2      ; r2 points to TABLE2
    LOOP:   LDR r0, [r1], #4
            STR r0, [r2], #4
            ...
    TABLE1: ...
    TABLE2: ...

Data Transfer Instructions

Multiple register data transfers
    LDMIA r1, {r0, r2, r5}    ; r0 := mem32[r1]
                              ; r2 := mem32[r1 + 4]
                              ; r5 := mem32[r1 + 8]
    Note: any subset (or all) of the registers may be transferred with a single instruction
    Note: the order of registers within the list is insignificant
    Note: including r15 in the list will cause a change in the control flow
- Stack organizations
  - FA: full ascending
  - EA: empty ascending
  - FD: full descending
  - ED: empty descending
- Block copy view
  - data is to be stored above or below the address held in the base register
  - address incrementing or decrementing begins before or after storing the first value

Multiple register transfer addressing modes
[Figure: memory contents after each store-multiple, with r9 = 0x100c before the instruction and r9' the updated base]

    Instruction                  r0 at    r1 at    r5 at    r9'
    STMIA r9!, {r0, r1, r5}      0x100c   0x1010   0x1014   0x1018
    STMIB r9!, {r0, r1, r5}      0x1010   0x1014   0x1018   0x1018
    STMDA r9!, {r0, r1, r5}      0x1004   0x1008   0x100c   0x1000
    STMDB r9!, {r0, r1, r5}      0x1000   0x1004   0x1008   0x1000
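The four block-copy modes can be summarised by where the lowest-numbered register lands and where the base ends up. The Python sketch below is illustrative and not from the slides; it relies on the rule that the lowest-numbered register always goes to the lowest address, and the alias tables give the standard correspondence between the stack view and the block copy view. `stm_addresses` is a hypothetical helper name.

```python
def stm_addresses(mode, base, count):
    """Return (addresses, new_base) for a load/store multiple of `count` words.

    addresses[i] is where the i-th lowest-numbered register is transferred.
    """
    size = 4 * count
    if mode == "IA":      # increment after
        start, new_base = base, base + size
    elif mode == "IB":    # increment before
        start, new_base = base + 4, base + size
    elif mode == "DA":    # decrement after
        start, new_base = base - size + 4, base - size
    elif mode == "DB":    # decrement before
        start, new_base = base - size, base - size
    else:
        raise ValueError("unknown mode: " + mode)
    return [start + 4 * i for i in range(count)], new_base

# Stack view -> block copy view, for stores (push) and loads (pop).
STM_ALIAS = {"FD": "DB", "ED": "DA", "FA": "IB", "EA": "IA"}
LDM_ALIAS = {"FD": "IA", "ED": "IB", "FA": "DA", "EA": "DB"}
```

For a full descending stack, `STMFD` therefore behaves as `STMDB` and the matching `LDMFD` as `LDMIA`, so a push followed by a pop returns the base register to its starting value.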

[Figure: The mapping between the stack and block copy views]

Control flow instructions

Conditional execution
- Conditional execution avoids branch instructions that are used only to skip a small number of non-branch instructions
- Example
            CMP   r0, #5
            BEQ   BYPASS          ; if (r0 != 5) {
            ADD   r1, r1, r0      ;   r1 := r1 + r0 - r2
            SUB   r1, r1, r2      ; }
    BYPASS: ...
- With conditional execution
            CMP   r0, #5
            ADDNE r1, r1, r0
            SUBNE r1, r1, r2
            ...
    Note: add the 2-letter condition after the 3-letter opcode
- Compound condition
            ; if ((a==b) && (c==d)) e++;
            CMP   r0, r1
            CMPEQ r2, r3
            ADDEQ r4, r4, #1
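Each condition suffix is a predicate on the N, Z, C, V flags, which is why a conditional compare can chain tests. The sketch below is an illustrative Python model, not from the slides: `CONDITIONS` lists the standard ARM condition codes, `cmp_flags` is a hypothetical helper computing the flags of `a - b`, and `compound` replays the three-instruction sequence for `if ((a==b) && (c==d)) e++`.

```python
MASK = 0xFFFFFFFF

CONDITIONS = {
    "EQ": lambda f: f["Z"] == 1,
    "NE": lambda f: f["Z"] == 0,
    "CS": lambda f: f["C"] == 1,
    "CC": lambda f: f["C"] == 0,
    "MI": lambda f: f["N"] == 1,
    "PL": lambda f: f["N"] == 0,
    "VS": lambda f: f["V"] == 1,
    "VC": lambda f: f["V"] == 0,
    "HI": lambda f: f["C"] == 1 and f["Z"] == 0,
    "LS": lambda f: f["C"] == 0 or f["Z"] == 1,
    "GE": lambda f: f["N"] == f["V"],
    "LT": lambda f: f["N"] != f["V"],
    "GT": lambda f: f["Z"] == 0 and f["N"] == f["V"],
    "LE": lambda f: f["Z"] == 1 or f["N"] != f["V"],
    "AL": lambda f: True,
}

def cmp_flags(a, b):
    """Flags produced by CMP a, b (a - b); C is set when there is no borrow."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    return {
        "N": result >> 31,
        "Z": int(result == 0),
        "C": int(a >= b),
        "V": ((a ^ b) & (a ^ result)) >> 31,
    }

def compound(a, b, c, d, e):
    """if ((a == b) && (c == d)) e++, without branches."""
    flags = cmp_flags(a, b)              # CMP   r0, r1
    if CONDITIONS["EQ"](flags):
        flags = cmp_flags(c, d)          # CMPEQ r2, r3
    if CONDITIONS["EQ"](flags):
        e = (e + 1) & MASK               # ADDEQ r4, r4, #1
    return e
```

If the first compare fails, the `CMPEQ` is skipped and the flags still say "not equal", so the `ADDEQ` is skipped as well.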

Branch and link instructions
- Branch to subroutine (r14 serves as a link register)
            BL    SUBR                 ; branch to SUBR
            ..                         ; return here
    SUBR:   ..                         ; SUBR entry point
            MOV   pc, r14              ; return
- Nested subroutines
            BL    SUB1
            ..
    SUB1:   STMFD r13!, {r0-r2, r14}   ; save work and link register
            BL    SUB2
            ..
            LDMFD r13!, {r0-r2, pc}
    SUB2:   ..
            MOV   pc, r14              ; copy r14 into r15

Supervisor calls
- The supervisor is a program which operates at a privileged level; it can do things that a user-level program cannot do directly
  - Example: send text to the display
- The ARM ISA includes SWI (SoftWare Interrupt)
    SWI SWI_WriteC    ; output r0[7:0]
    SWI SWI_Exit      ; return from a user program back to the monitor

Jump tables
- Call one of a set of subroutines depending on a value computed by the program
            BL    JTAB
            ...
    JTAB:   CMP   r0, #0
            BEQ   SUB0
            CMP   r0, #1
            BEQ   SUB1
            CMP   r0, #2
            BEQ   SUB2
    Note: slow when the list is long and all subroutines are equally frequent
- With a table of subroutine addresses
            BL    JTAB
            ...
    JTAB:   ADR   r1, SUBTAB
            CMP   r0, #SUBMAX          ; overrun?
            LDRLS pc, [r1, r0, LSL #2]
            B     ERROR
    SUBTAB: DCD   SUB0
            DCD   SUB1
            DCD   SUB2
            ...

Hello ARM World!
                AREA   HelloW, CODE, READONLY     ; declare code area
    SWI_WriteC  EQU    &0                         ; output character in r0
    SWI_Exit    EQU    &11                        ; finish program
                ENTRY                             ; code entry point
    START:      ADR    r1, TEXT                   ; r1 <- Hello ARM World!
    LOOP:       LDRB   r0, [r1], #1               ; get the next byte
                CMP    r0, #0                     ; check for text end
                SWINE  SWI_WriteC                 ; if not end of string, print
                BNE    LOOP
                SWI    SWI_Exit                   ; end of execution
    TEXT        =      "Hello ARM World!", &0a, &0d, 0
                END

ARM Organization and Implementation
Aleksandar Milenkovic
E-mail: milenka@ece.uah.edu
Web: http://www.ece.uah.edu/~milenka

Outline
- ARM Architecture
- ARM Organization and Implementation
- ARM Instruction Set
- Architectural Support for High-level Languages
- Thumb Instruction Set
- Architectural Support for System Development
- ARM Processor Cores
- Memory Hierarchy
- Architectural Support for Operating Systems
- ARM CPU Cores
- Embedded ARM Applications

ARM organization
- Register file
  - 2 read ports, 1 write port + 1 read, 1 write port reserved for r15 (pc)
- Barrel shifter: shifts or rotates one operand by any number of bits
- ALU: performs the arithmetic and logic functions required
- Memory address register + incrementer
- Memory data registers
- Instruction decoder and associated control logic
[Figure: ARM datapath organization, showing A[31:0], address register, incrementer, PC, register bank, multiply register, A bus, B bus, ALU bus, barrel shifter, ALU, data out register, data in register, D[31:0], and instruction decode & control]

Three-stage pipeline
- Fetch
  - the instruction is fetched from memory and placed in the instruction pipeline
- Decode
  - the instruction is decoded and the datapath control signals are prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
- Execute
  - the instruction owns the datapath; the register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register

ARM single-cycle instruction pipeline

    cycle                  1       2        3         4         5
    1  add r0, r1, #5      fetch   decode   execute
    2  sub r2, r3, r6              fetch    decode    execute
    3  cmp r2, #3                           fetch     decode    execute
                                                                 time ->
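For single-cycle instructions the pipeline timing follows a fixed pattern: instruction i is fetched in cycle i, decoded in cycle i + 1 and executed in cycle i + 2. A minimal Python sketch of that schedule follows; `pipeline_schedule` is a hypothetical helper and it deliberately ignores multi-cycle instructions and branches.

```python
STAGES = ("fetch", "decode", "execute")

def pipeline_schedule(instructions):
    """Return [(instruction, {stage: cycle})] for an ideal 3-stage pipeline.

    Cycles are numbered from 1; every instruction spends one cycle per stage.
    """
    schedule = []
    for i, instruction in enumerate(instructions):
        cycles = {stage: i + s + 1 for s, stage in enumerate(STAGES)}
        schedule.append((instruction, cycles))
    return schedule
```

With the three instructions from the slide, the last one executes in cycle 5, so n single-cycle instructions complete in n + 2 cycles and the steady-state throughput is one instruction per cycle.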

ARM multi-cycle instruction pipeline
- Decode logic is always generating the control signals for the datapath to use in the next cycle

ARM multi-cycle LDMIA (load multiple) instruction

    cycle                  1       2        3          4          5        6
    ldmia r0, {r2, r3}     fetch   decode   ex ld r2   ex ld r3
    sub r2, r3, r6                 fetch    (waits)    decode     ex sub
    cmp r2, #3                                         fetch      decode   ex cmp
                                                                            time ->

- The decode stage stays occupied, since ldmia must continue to remember the decoded instruction
- sub is fetched at the normal time but not decoded until LDMIA is finishing
- The fetch of cmp is delayed

Control stalls: due to branches
- Branches often introduce stalls (branch penalty)
  - stall time may depend on whether the branch is taken
- May have to squash instructions that already started executing
- Don’t know what to fetch until the condition is evaluated

ARM pipelined branch
- Decision not made until the third clock cycle

    cycle                     1       2        3         4       5        6
         bne foo              fetch   decode   ex bne
         sub r2, r3, r6               fetch    decode
    foo  add r0, r1, r2                                  fetch   decode   ex add
                                                                           time ->

- Two cycles of work are thrown away if the bne is taken

Pipeline: how it works
- All instructions occupy the datapath for one or more adjacent cycles
- For each cycle that an instruction occupies the datapath, it occupies the decode logic in the immediately preceding cycle
- During the first datapath cycle each instruction issues a fetch for the next instruction but one
- Branch instructions flush and refill the instruction pipeline

ARM9TDMI 5-stage pipeline
- Fetch
- Decode
  - instruction is decoded
  - register operands are read (3 read ports)
- Execute
  - an operand is shifted and the ALU result generated, or
  - an address is computed
- Buffer/data
  - data memory is accessed (load, store)
- Write-back
  - write to the register file
[Figure: ARM9TDMI 5-stage pipeline organization: next pc and I-cache (fetch); instruction decode and register read with r15 = pc + 8 (decode); mul, shift and ALU with forwarding paths (execute); D-cache with byte replication and rotate/sign extension (buffer/data); register write (write-back). The PC is updated by B, BL, MOV pc and SUBS pc from the execute stage and by LDR pc from the write-back stage]

ARM9TDMI Data Forwarding
- Result needed by the next instruction
    ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
    ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3
- Result needed one instruction later
    ADD r3, r2, r1, LSL #3    ; r3 := r2 + 8 x r1
    ADD r8, r9, r10           ; r8 := r9 + r10
    ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3
- Stall?
    LDR r3, [r2]              ; r3 := mem[r2]
    ADD r1, r2, r3            ; r1 := r2 + r3
[Figure: ARM9TDMI 5-stage pipeline organization with forwarding paths]

ARM9TDMI PC generation
- 3-stage pipeline
  - PC behavior: operands are read in the execution stage, so r15 = PC + 8
- 5-stage pipeline
  - operands are read in the decode stage, so r15 = PC + 4?
  - incompatibilities between 3-stage and 5-stage implementations => unacceptable
  - to avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs
[Figure: ARM9TDMI 5-stage pipeline organization, PC generation paths]

Data processing instruction datapath activity (Ex)
- Reg-Reg
  - Rd = Rn op Rm
  - r15 = AR + 4
  - AR = AR + 4
- Reg-Imm
  - Rd = Rn op Imm
  - r15 = AR + 4
  - AR = AR + 4
[Figure: datapath activity for (a) register-register operations and (b) register-immediate operations, where the immediate comes from instruction bits [7:0]]

STR (store register) datapath activity (Ex1, Ex2)
- Compute address (Ex1)
  - AR = Rn op Disp
  - r15 = AR + 4
- Store data (Ex2)
  - AR = PC
  - mem[AR] = Rd<x:y>
  - if auto-indexing => Rn = Rn +/- 4
[Figure: datapath activity for (a) 1st cycle, compute address from Rn and displacement bits [11:0], and (b) 2nd cycle, store data and auto-index]

The first two (of three) cycles of a branch instruction
- Compute target address
  - AR = PC + Disp, lsl #2
- Save return address (if required)
  - r14 = PC
  - AR = AR + 4
- Third cycle: do a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch
[Figure: datapath activity for (a) 1st cycle, compute branch target from PC and displacement bits [23:0] shifted lsl #2, and (b) 2nd cycle, save return address in r14]

ARM Implementation
- Datapath
  - RTL (Register Transfer Level)
- Control unit
  - FSM (Finite State Machine)

2-phase non-overlapping clock scheme
- Most ARMs do not operate on edge-sensitive registers
- Instead the design is based around 2-phase non-overlapping clocks which are generated internally from a single clock signal
- Data movement is controlled by passing the data alternately through latches which are open during phase 1 and latches which are open during phase 2

ARM datapath timing
- Register read
  - register read buses are dynamic and precharged during phase 2
  - during phase 1 the selected registers discharge the read buses, which become valid early in phase 1
- Shift operation
  - the second operand passes through the barrel shifter
- ALU operation
  - the ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
  - the ALU processes the operands during phase 2, producing the valid output towards the end of the phase
  - the result is latched in the destination register at the end of phase 2

ARM datapath timing (cont’d)
Minimum datapath delay =
      register read time
    + shifter delay
    + ALU delay
    + register write set-up time
    + phase 2 to phase 1 non-overlap time

The original ARM1 ripple-carry adder
- Carry logic: uses a CMOS AOI (And-Or-Invert) gate
- Even bits use the circuit shown in the figure
- Odd bits use the dual circuit with inverted inputs and outputs and with the AND and OR gates swapped around
- Worst case path: 32 gates long
[Figure: ARM1 ripple-carry adder carry circuit]

ARM2 4-bit carry look-ahead scheme
- Carry Generate (G), Carry Propagate (P)
- Cout[3] = Cin[0].P + G
- Use AOI and alternate AND/OR gates
- Worst case: 8 gates long
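The block carry equation Cout[3] = Cin[0].P + G can be checked with a bit-level model. The Python sketch below is illustrative only: it derives the group G and P signals from per-bit generate and propagate terms (using OR as the propagate function, one common choice) and then ripples the carry between eight 4-bit groups, which is consistent with the 8-gate worst case quoted above. The function names are assumptions.

```python
def cla4(a, b, cin):
    """Carry out of a 4-bit group: Cout = G | (P & Cin)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # bit generates a carry
    p = [((a >> i) | (b >> i)) & 1 for i in range(4)]  # bit propagates a carry
    group_g = (g[3]
               | (p[3] & g[2])
               | (p[3] & p[2] & g[1])
               | (p[3] & p[2] & p[1] & g[0]))
    group_p = p[3] & p[2] & p[1] & p[0]
    return group_g | (group_p & cin)

def add32_cla(a, b, cin=0):
    """32-bit add using eight 4-bit look-ahead groups; returns (sum, carry_out)."""
    result, carry = 0, cin
    for k in range(0, 32, 4):
        na, nb = (a >> k) & 0xF, (b >> k) & 0xF
        result |= ((na + nb + carry) & 0xF) << k
        carry = cla4(na, nb, carry)  # group carry skips the bit-by-bit ripple
    return result, carry
```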

The ARM2 ALU logic for one result bit
- ALU functions
  - data operations (add, sub, ...)
  - address computations for memory accesses
  - branch target computations
  - bit-wise logical operations
  - ...
[Figure: ARM2 ALU logic for one result bit]

[Figure: ARM2 ALU function codes]

The ARM6 carry-select adder scheme
- Compute sums of various fields of the word for a carry-in of zero and a carry-in of one
- The final result is selected by using the correct carry-in value to control a multiplexor
- Worst case: O(log2(word width)) gates long
- Note: be careful! Fan-out on some of these gates is high, so direct comparison with the previous schemes is not applicable.
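The carry-select idea is to compute each field's sum twice, once per possible carry-in, and let the real carry pick one. The Python sketch below is a simplified illustration: it uses uniform 8-bit fields and selects them in sequence, whereas the slide describes fields of various sizes arranged for logarithmic depth. `carry_select_add` and its parameters are hypothetical.

```python
def carry_select_add(a, b, cin=0, width=32, block=8):
    """Add two `width`-bit numbers with per-block speculative sums.

    Returns (sum, carry_out).
    """
    mask = (1 << block) - 1
    result, carry = 0, cin
    for k in range(0, width, block):
        fa, fb = (a >> k) & mask, (b >> k) & mask
        sum0 = fa + fb        # speculative sum assuming carry-in = 0
        sum1 = fa + fb + 1    # speculative sum assuming carry-in = 1
        chosen = sum1 if carry else sum0  # multiplexor controlled by the real carry
        result |= (chosen & mask) << k
        carry = chosen >> block
    return result, carry
```

In hardware both speculative sums of every field are produced in parallel, so only the multiplexor selection lies on the carry path.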

The ARM6 ALU organization
- It is not easy to merge the arithmetic and logic functions => a separate logic unit runs in parallel with the adder, and a multiplexor selects the output

ARM9 carry arbitration encoding
- Carry arbitration adder (u = carry not yet known)

    ai    bi    | Ci | vi, wi
    0     0     | 0  | 0, 0
    1     1     | 1  | 1, 1
    1     0     | u  | 1, 0
    0     1     | u  | 1, 0

    ai    bi    ai-1  bi-1  | Ci | vi, wi
    0     0     -     -     | 0  | 0, 0
    1     1     -     -     | 1  | 1, 1
    0(1)  1(0)  0     0     | 0  | 0, 0
    0(1)  1(0)  1     1     | 1  | 1, 1
    0(1)  1(0)  0(1)  1(0)  | u  | 1, 0

The cross-bar switch barrel shifter
- Shifter delay is critical since it contributes directly to the datapath cycle time
- Cross-bar switch matrix (32 x 32)
- Principle shown for a 4 x 4 matrix
[Figure: 4 x 4 cross-bar switch matrix with inputs in[0]-in[3], outputs out[0]-out[3], and diagonal control lines 'right 3', 'right 2', 'right 1', 'no shift', 'left 1', 'left 2', 'left 3']

The cross-bar switch barrel shifter (cont’d)
- Precharged logic is used => each switch is a single NMOS transistor
- Precharging sets all outputs to logic 0, so those which are not connected to any input during switching remain at 0, giving the zero filling required by the shift semantics
- For rotate right, the right shift diagonal is enabled together with the complementary left shift diagonal (e.g., ‘right 1’ + ‘left 3’)
- Arithmetic shift right: uses sign-extension => separate logic is used to decode the shift amount and discharge those outputs appropriately
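The cross-bar principle can be modelled as a matrix of switches in which enabling one diagonal connects every input to the output a fixed distance away. The Python sketch below is an illustrative model, not from the slides: bits are stored least significant first, a positive entry in `enabled` stands for a 'left s' diagonal and a negative entry for a 'right s' diagonal, and the helper names are assumptions.

```python
def crossbar_shift(bits, enabled):
    """Drive outputs through the enabled diagonals of an n x n switch matrix.

    bits[i] is input bit i (index 0 = LSB). `enabled` is a set of signed
    shift distances: +s enables 'left s', -s enables 'right s', 0 is 'no shift'.
    """
    n = len(bits)
    out = [0] * n                      # precharge: every output starts at 0
    for i in range(n):
        for j in range(n):
            # The switch at (input i, output j) lies on diagonal j - i.
            if (j - i) in enabled and bits[i]:
                out[j] = 1
    return out

def to_bits(value, n):
    """LSB-first list of the low n bits of value."""
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))
```

Enabling `{-1}` gives a logical shift right by one with zero fill from the precharge, while `{-1, 3}` gives the 4-bit rotate right by one described above ('right 1' + 'left 3').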

Multiplier design
- All ARMs apart from the first prototype have included support for integer multiplication
  - older ARM cores include low-cost multiplication hardware that supports only the 32-bit result multiply and multiply-accumulate
  - recent ARM cores have high-performance multiplication hardware and support 64-bit result multiply and multiply-accumulate
- Low-cost implementation
  - use the datapath iteratively, employing the barrel shifter and ALU to generate a 2-bit product in each clock cycle
  - use early termination to stop the iterations when there are no more ones in the multiply register

The 2-bit multiplication algorithm, Nth cycle
- Control settings for the Nth cycle of the multiplication
- Use the existing shifter and ALU + additional hardware
  - a dedicated two-bits-per-cycle shift register for the multiplier and a few gates for the Booth’s algorithm control logic (the overhead is a few per cent of the area of the ARM core)
[Figure: table of control settings for the Nth cycle]
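The two-bits-per-cycle scheme can be sketched as a loop that consumes two multiplier bits and a carry each cycle and issues one shifted add or subtract. The control table in this Python sketch is one arithmetically consistent set of settings chosen for illustration; the slide's own table is not reproduced in the text, so treat the exact entries, the names `CONTROL` and `multiply_2bit`, and the cycle count as assumptions.

```python
MASK = 0xFFFFFFFF

# (carry_in, multiplier bits) -> (extra shift, ALU sign, carry_out).
# Each cycle N adds sign * (B << (2N + extra shift)) to the accumulator A.
CONTROL = {
    (0, 0): (0, 0, 0),   # A + 0
    (0, 1): (0, +1, 0),  # A + B, LSL #2N
    (0, 2): (1, -1, 1),  # A - B, LSL #(2N+1), carry 4 into the next digit
    (0, 3): (0, -1, 1),  # A - B, LSL #2N, carry 4 into the next digit
    (1, 0): (0, +1, 0),  # A + B, LSL #2N
    (1, 1): (1, +1, 0),  # A + B, LSL #(2N+1)
    (1, 2): (0, -1, 1),  # A - B, LSL #2N
    (1, 3): (0, 0, 1),   # A + 0
}

def multiply_2bit(multiplicand, multiplier):
    """Return (low 32 bits of the product, number of cycles used)."""
    multiplier &= MASK
    acc, carry, cycles = 0, 0, 0
    for n in range(16):
        if multiplier == 0 and carry == 0:
            break                      # early termination: no more ones left
        extra, sign, carry = CONTROL[(carry, multiplier & 3)]
        acc = (acc + sign * (multiplicand << (2 * n + extra))) & MASK
        multiplier >>= 2               # two-bits-per-cycle shift register
        cycles += 1
    return acc, cycles
```

Small multipliers finish early: multiplying by 6 takes two cycles here, while a multiplier with ones in its top bits takes all sixteen.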

High speed multiplication
- Where multiplication performance is very important, more hardware resources must be dedicated
  - in some embedded systems the ARM core is used to perform real-time digital signal processing (DSP); DSP programs are typically multiplication intensive
- Use intermediate results which include partial sums and partial carries
  - carry-save adders are used for this
- These two binary results are added together at the end of the multiplication
  - the main ALU is used for this

Carry-propagate (a) and carry-save (b) adder structures
- A carry-propagate adder takes two conventional (irredundant) binary numbers as inputs and produces a binary sum
- A carry-save adder takes one binary and one redundant (partial sum and partial carry) input and produces a sum in redundant binary representation (sum and carry)
[Figure: (a) carry-propagate and (b) carry-save adder structures]
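A carry-save adder avoids carry propagation by keeping the carries as a second word. The Python sketch below is a minimal illustration with hypothetical function names: each bit position acts as an independent full adder, and a single conventional addition at the end resolves the redundant (sum, carry) pair.

```python
def carry_save_add(a, partial_sum, partial_carry):
    """Add binary `a` to a redundant (sum, carry) pair without propagating carries."""
    new_sum = a ^ partial_sum ^ partial_carry
    # Majority function gives the carry out of each bit, weighted one place up.
    new_carry = ((a & partial_sum) | (a & partial_carry) | (partial_sum & partial_carry)) << 1
    return new_sum, new_carry

def sum_with_csa(values):
    """Accumulate many operands in redundant form, then do one carry-propagate add."""
    partial_sum, partial_carry = 0, 0
    for value in values:
        partial_sum, partial_carry = carry_save_add(value, partial_sum, partial_carry)
    return partial_sum + partial_carry   # final carry-propagate addition
```

This mirrors the multiplier organization described on the neighbouring slides: partial products are accumulated in carry-save form and the main ALU performs the final addition.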

ARM high-speed multiplier organization
- The CSA has 4 layers of adders, each handling 2 multiplier bits => 8 multiplier bits are processed per clock cycle
- Partial sum and carry are cleared at the beginning, or initialized to accumulate a value
- The multiplier is shifted right 8 bits per cycle in the ‘Rs’ register
- The partial sum and carry are rotated right 8 bits per cycle
- Performance: up to 4 clock cycles (early termination is possible)
- Complexity: 160 bits in shift registers, 128 bits of carry-save adder logic (up to 10% of simpler cores)

[Figure: ARM high-speed multiplier organization]

[Figure: ARM2 register cell circuit]

[Figure: ARM register bank floorplan]

[Figure: ARM core datapath buses]

[Figure: ARM control logic structure]