ARM Processor Architecture (I)
Speaker: Lung-Hao Chang (張龍豪)
Advisor: Prof. Andy Wu (吳安宇)
Graduate Institute of Electronics Engineering, National Taiwan University
Modified from National Chiao-Tung University IP Core Design course

Outline
- Thumb instruction set
- ARM/Thumb interworking
- ARM organization
- Summary
ARM Platform Design, SOC Consortium Course Material, 09/21/2003

Thumb instruction set

Thumb-ARM Difference
- The Thumb instruction set is a subset of the ARM instruction set, and its instructions operate on a restricted view of the ARM registers
- Most Thumb instructions are executed unconditionally (all ARM instructions carry a condition field and can be executed conditionally)
- Many Thumb data processing instructions use a 2-address format, i.e. the destination register is the same as one of the source registers (ARM data processing instructions, with the exception of the 64-bit multiplies, use a 3-address format)
- Thumb instruction formats are less regular than ARM instruction formats, a consequence of the dense encoding

Register Access in Thumb
- Not all registers are directly accessible in Thumb
- Low registers r0-r7: fully accessible
- High registers r8-r12: only accessible with MOV, ADD, CMP
- SP (Stack Pointer), LR (Link Register) and PC (Program Counter): limited accessibility; certain instructions have implicit access to these
- CPSR: only indirect access
- SPSR: no access

Thumb Accessible Registers
[Figure: register set; shaded registers have restricted access]

Branches
- Thumb defines three PC-relative branch instructions, each of which has a different offset range
  - The offset range depends on the number of available bits
- Conditional branches
  - B<cond> label
  - 8-bit offset: range of -128 to 127 instructions (+/-256 bytes)
  - The only conditional Thumb instructions
- Unconditional branches
  - B label
  - 11-bit offset: range of -1024 to 1023 instructions (+/-2 KB)
- Long branches with link
  - BL subroutine
  - Implemented as a pair of instructions
  - 22-bit offset: range of -2097152 to 2097151 instructions (+/-4 MB)
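The ranges quoted above follow directly from the width of the signed offset field and the 2-byte Thumb instruction size. A minimal sketch that reproduces the arithmetic (the function name `branch_range` is illustrative, not part of any ARM tool):

```python
def branch_range(offset_bits, instr_bytes=2):
    """Return (min_instr, max_instr, reach_bytes) for a signed offset field.

    offset_bits: width of the two's-complement offset in the encoding.
    instr_bytes: size of one instruction (2 for Thumb).
    """
    lo = -(1 << (offset_bits - 1))
    hi = (1 << (offset_bits - 1)) - 1
    reach_bytes = (1 << (offset_bits - 1)) * instr_bytes
    return lo, hi, reach_bytes


# Offset widths taken from the slide: B<cond> = 8, B = 11, BL = 22 bits.
for name, bits in [("B<cond>", 8), ("B", 11), ("BL", 22)]:
    lo, hi, reach = branch_range(bits)
    print(f"{name}: {lo}..{hi} instructions, about +/-{reach} bytes")
```

For BL the reach comes out as 4194304 bytes, i.e. the +/-4 MB stated on the slide.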

Data Processing Instruction
- Subset of the ARM data processing instructions
- Separate shift instructions (e.g. LSL, ASR, LSR, ROR)
      LSL Rd, Rs, #Imm5    ; Rd := Rs <shift> #Imm5
      ASR Rd, Rs           ; Rd := Rd <shift> Rs
- Two operands for data processing instructions
  - Act on low registers
      BIC Rd, Rs           ; Rd := Rd AND NOT Rs
      ADD Rd, #Imm8        ; Rd := Rd + #Imm8
  - Also three-operand forms of add, subtract and shifts
      ADD Rd, Rs, #Imm3    ; Rd := Rs + #Imm3
- Condition codes are always set by low register operations

Load or Store Register
- Two pre-indexed addressing modes
  - Base register + offset register
  - Base register + 5-bit offset, where the offset is scaled by
    - 4 for word accesses (range of 0-124 bytes / 0-31 words): STR Rd, [Rb, #Imm7]
    - 2 for halfword accesses (range of 0-62 bytes / 0-31 halfwords): LDRH Rd, [Rb, #Imm6]
    - 1 for byte accesses (range of 0-31 bytes): LDRB Rd, [Rb, #Imm5]
- Special forms
  - Load with PC as base with 1 KB immediate offset (word aligned)
    - Used for loading a value from a literal pool
  - Load and store with SP as base with 1 KB immediate offset (word aligned)
    - Used for accessing local variables on the stack
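The byte ranges above are just the unsigned immediate field multiplied by the access size. A small sketch of that scaling follows; the helper name is hypothetical, and the 8-bit field width used for the PC- and SP-relative forms is an assumption that is consistent with the "1 KB, word aligned" figure on the slide.

```python
def max_offset_bytes(imm_bits, access_bytes):
    """Largest byte offset reachable with an unsigned immediate scaled by the access size."""
    return ((1 << imm_bits) - 1) * access_bytes


# 5-bit immediate, scaled by the access size (values from the slide).
print("word    :", max_offset_bytes(5, 4))   # STR  Rd, [Rb, #Imm7]
print("halfword:", max_offset_bytes(5, 2))   # LDRH Rd, [Rb, #Imm6]
print("byte    :", max_offset_bytes(5, 1))   # LDRB Rd, [Rb, #Imm5]

# Assumed 8-bit word-scaled immediate for the PC/SP-relative forms.
print("PC/SP   :", max_offset_bytes(8, 4))
```

The names #Imm7 and #Imm6 in the syntax refer to the width of the byte offset after scaling, which is why a 5-bit field can express them.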

Block Data Transfers
- Memory copy, incrementing base pointer after transfer
  - STMIA Rb!, {Low Reg list}
  - LDMIA Rb!, {Low Reg list}
- Full descending stack operations
  - PUSH {Low Reg list}
  - PUSH {Low Reg list, LR}
  - POP {Low Reg list}
  - POP {Low Reg list, PC}
- The optional addition of LR/PC to the register list provides support for subroutine entry/exit

Thumb Instruction Entry and Exit
- T bit, bit 5 of the CPSR
  - If T = 1, the processor interprets the instruction stream as 16-bit Thumb instructions
  - If T = 0, the processor interprets it as standard ARM instructions
- Thumb entry
  - ARM cores start up, after reset, executing ARM instructions
  - Thumb state is entered by executing a Branch and Exchange instruction (BX), which will
    - Set the T bit if the bottom bit of the specified register was set
    - Switch the PC to the address given in the remainder of the register
- Thumb exit
  - Executing a Thumb BX instruction
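The effect of BX on the T bit and the PC can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not a full processor model; the function name `bx` and the 32-bit masking are illustrative choices.

```python
T_BIT = 1 << 5  # the T bit is bit 5 of the CPSR


def bx(cpsr, rn):
    """Model BX Rn: return the new (cpsr, pc) pair.

    Bit 0 of rn selects the state (1 = Thumb, 0 = ARM);
    the remaining bits give the branch target.
    """
    if rn & 1:
        cpsr |= T_BIT
    else:
        cpsr &= ~T_BIT & 0xFFFFFFFF
    pc = rn & ~1 & 0xFFFFFFFF
    return cpsr, pc


cpsr, pc = bx(0x10, 0x8001)        # target has bit 0 set
print(hex(pc), "Thumb" if cpsr & T_BIT else "ARM")
cpsr, pc = bx(cpsr, 0x9000)        # word-aligned target, bit 0 clear
print(hex(pc), "Thumb" if cpsr & T_BIT else "ARM")
```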

Miscellaneous
- Thumb SWI instruction format
  - Same effect as ARM, but SWI number limited to 0-255
  - Syntax: SWI <SWI number>
  - Encoding: bits 15-8 = 1 1 0 1 1 1 1 1, bits 7-0 = SWI number
- Indirect access to CPSR and no access to SPSR, so no MRS or MSR instructions
- No coprocessor instruction space

ARM Thumb-2 core technology
- New instruction set for the ARM architecture
- Enhanced levels of performance, energy efficiency, and code density for a wide range of embedded applications

Thumb Instruction Set (1/3)

Thumb Instruction Set (2/3)

Thumb Instruction Set (3/3)

Thumb Instruction Format

ARM/Thumb interworking

The Need for Interworking
- The code density of Thumb and its performance from narrow memory make it ideal for the bulk of C code in many systems. However, there is still a need to change between ARM and Thumb state within most applications:
  - ARM code provides better performance from wide memory
    - Therefore ideal for speed-critical parts of an application
  - Some functions can only be performed with ARM instructions, e.g.
    - Access to CPSR (to enable/disable interrupts and to change mode)
    - Access to coprocessors
  - Exception handling
    - ARM state is automatically entered for exception handling, but the system specification may require usage of Thumb code for the main handler
  - Simple standalone Thumb programs will also need an ARM assembler header to change state and call the Thumb routine

ARM/Thumb Interworking
- Interworking can be carried out using the Branch Exchange instruction
      BX Rn               ; Thumb state Branch Exchange
      BX<condition> Rn    ; ARM state Branch Exchange
- Can also be used as an absolute branch without a state change

Example

        ; start off in ARM state
        CODE32
        ADR   r0, Into_Thumb+1   ; generate branch target address
                                 ; and set bit 0,
                                 ; hence arrive in Thumb state
        BX    r0                 ; branch exchange to Thumb
        …
        CODE16                   ; assemble subsequent code as Thumb
Into_Thumb
        …
        ADR   r5, Back_to_ARM    ; generate branch target to
                                 ; word-aligned address,
                                 ; hence bit 0 is cleared
        BX    r5                 ; branch exchange to ARM
        …
        CODE32                   ; assemble subsequent code as ARM
Back_to_ARM
        …

ARM organization

3-Stage Pipeline ARM Organization
- Register bank
  - 2 read ports and 1 write port, each able to access any register
  - 1 additional read port and 1 additional write port for r15 (PC)
- Barrel shifter
  - Shifts or rotates the operand by any number of bits
- ALU
- Address register and incrementer
- Data registers
  - Hold data passing to and from memory
- Instruction decoder and control

3-Stage Pipeline (1/2)
- Fetch
  - The instruction is fetched from memory and placed in the instruction pipeline
- Decode
  - The instruction is decoded and the datapath control signals are prepared for the next cycle
- Execute
  - The register bank is read, an operand is shifted, and the ALU result is generated and written back into the destination register

3-Stage Pipeline (2/2)
- In any time slice, 3 different instructions may occupy these stages, one per stage, so the hardware in each stage has to be capable of independent operation
- When the processor is executing data processing instructions, the latency is 3 cycles and the throughput is 1 instruction per cycle
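The latency and throughput figures can be checked with a short sketch. It assumes an ideal pipeline in which every instruction takes one cycle per stage and nothing stalls; the function names are illustrative.

```python
STAGES = ["Fetch", "Decode", "Execute"]


def pipeline_cycles(n_instructions, stages=3):
    """Cycles to finish n single-cycle instructions on an ideal pipeline."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


def stage_table(n_instructions):
    """One row per instruction; each column is a clock cycle (F, D, E or idle '.')."""
    total = pipeline_cycles(n_instructions, len(STAGES))
    rows = []
    for i in range(n_instructions):
        row = ["."] * total
        for s, name in enumerate(STAGES):
            row[i + s] = name[0]
        rows.append("".join(row))
    return rows


for row in stage_table(3):
    print(row)
print("cycles for 10 instructions:", pipeline_cycles(10))
```

The first instruction completes after 3 cycles (the latency); after that, one instruction completes every cycle, so 10 instructions need 12 cycles.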

Multi-cycle Instruction
- Memory access (fetch, data transfer) in every cycle
- Datapath used in every cycle (execute, address calculation, data transfer)
- Decode logic generates the control signals for the datapath to use in the next cycle (decode, address calculation)

Data Processing Instruction
- All operations take place in a single clock cycle

Data Transfer Instructions
- A data transfer instruction computes a memory address in a manner similar to a data processing instruction
- Load instructions follow a similar pattern, except that the data from memory only gets as far as the 'data in' register in the 2nd cycle, and a 3rd cycle is needed to transfer the data from there to the destination register

Branch Instructions
- The third cycle, which is required to complete the pipeline refilling, is also used to make a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch

Branch Pipeline Example
- Breaking the pipeline
- Note that the core is executing in the ARM state

5-Stage Pipeline ARM Organization
- Tprog = Ninst * CPI / fclk
  - Tprog: the time taken to execute a given program
  - Ninst: the number of ARM instructions executed in the program => compiler dependent
  - CPI: average number of clock cycles per instruction => hazards cause pipeline stalls
  - fclk: clock frequency
- Separate instruction and data memories => 5-stage pipeline
- Used in ARM9TDMI
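As a worked instance of the formula above, the sketch below evaluates Tprog for made-up numbers. The instruction count, CPI values and clock frequency are purely illustrative and do not describe any particular core.

```python
def program_time(n_inst, cpi, f_clk_hz):
    """Tprog = Ninst * CPI / fclk, in seconds."""
    return n_inst * cpi / f_clk_hz


# Hypothetical workload: 1 million instructions at 100 MHz.
base = program_time(1_000_000, 1.5, 100e6)
better = program_time(1_000_000, 1.2, 100e6)   # fewer stalls -> lower CPI
print(f"CPI 1.5: {base * 1e3:.1f} ms")
print(f"CPI 1.2: {better * 1e3:.1f} ms")
```

With Ninst and fclk held fixed, execution time scales linearly with CPI, which is why reducing pipeline stalls matters.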

5-Stage Pipeline Organization (1/2)
- Fetch
  - The instruction is fetched from memory and placed in the instruction pipeline
- Decode
  - The instruction is decoded and register operands are read from the register file. There are 3 operand read ports in the register file, so most ARM instructions can source all their operands in one cycle
- Execute
  - An operand is shifted and the ALU result is generated. If the instruction is a load or store, the memory address is computed in the ALU

5-Stage Pipeline Organization (2/2)
- Buffer/Data
  - Data memory is accessed if required. Otherwise the ALU result is simply buffered for one cycle
- Write back
  - The results generated by the instruction are written back to the register file, including any data loaded from memory

Pipeline Hazards
- There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining.
- There are three classes of hazards:
  - Structural hazards: they arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
  - Data hazards: they arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
  - Control hazards: they arise from the pipelining of branches and other instructions that change the PC.

Structural Hazards
- When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.
- If some combination of instructions cannot be accommodated because of a resource conflict, the machine is said to have a structural hazard.

Example
- A machine has a single memory shared by data and instructions. As a result, when an instruction contains a data-memory reference (load), it will conflict with the instruction fetch for a later instruction (Instr 3):

Clock cycle       1      2      3      4      5      6      7      8
load              IF     ID     EX     MEM    WB
Instr 1                  IF     ID     EX     MEM    WB
Instr 2                         IF     ID     EX     MEM    WB
Instr 3                                IF     ID     EX     MEM    WB

Solution (1/2)
- To resolve this, we stall the pipeline for one clock cycle when a data-memory access occurs. The effect of the stall is actually to occupy the resources for that instruction slot. The following table shows how the stalls are actually implemented.

Clock cycle       1      2      3      4      5      6      7      8      9
load              IF     ID     EX     MEM    WB
Instr 1                  IF     ID     EX     MEM    WB
Instr 2                         IF     ID     EX     MEM    WB
Instr 3                                stall  IF     ID     EX     MEM    WB

Solution (2/2)
- Another solution is to use separate instruction and data memories.
- The 5-stage ARM pipeline uses separate instruction and data memories (Harvard architecture), so it does not have this hazard

Data Hazards
- Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine.

Clock cycle       1      2      3      4      5      6      7      8      9
ADD R1, R2, R3    IF     ID     EX     MEM    WB
SUB R4, R5, R1           IF     IDsub  EX     MEM    WB
AND R6, R1, R7                  IF     IDand  EX     MEM    WB
OR R8, R1, R9                          IF     IDor   EX     MEM    WB
XOR R10, R11                                  IF     IDxor  EX     MEM    WB

Forwarding
- The problem with data hazards introduced by this sequence of instructions can be solved with a simple hardware technique called forwarding.

Clock cycle       1      2      3      4      5      6      7
ADD R1, R2, R3    IF     ID     EX     MEM    WB
SUB R4, R5, R1           IF     IDsub  EX     MEM    WB
AND R6, R1, R7                  IF     IDand  EX     MEM    WB

Forwarding Architecture
- Forwarding works as follows:
  - The ALU result from the EX/MEM register is always fed back to the ALU input latches.
  - If the forwarding hardware detects that the previous ALU operation has written the register corresponding to the source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
[Figure: forwarding paths]
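The selection rule described above amounts to a multiplexer in front of each ALU input. A minimal sketch, assuming a single forwarding source (the EX/MEM pipeline register) and illustrative names for the function and its parameters:

```python
def select_alu_input(src_reg, regfile, ex_mem_dest, ex_mem_value):
    """Pick the ALU operand for src_reg.

    If the previous ALU operation wrote src_reg (its result is still in the
    EX/MEM pipeline register), use the forwarded value; otherwise use the
    value read from the register file.
    """
    if ex_mem_dest is not None and ex_mem_dest == src_reg:
        return ex_mem_value
    return regfile[src_reg]


# ADD R1, R2, R3 has just produced 30, but R1 in the register file is stale.
regfile = {"R1": 0, "R2": 10, "R3": 20, "R5": 7}
print(select_alu_input("R1", regfile, "R1", 30))   # forwarded value
print(select_alu_input("R5", regfile, "R1", 30))   # register file value
```

A real forwarding unit also has to consider results further down the pipeline (for example from the MEM/WB register); that is left out here to keep the sketch short.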

Forward Data

Clock cycle       1      2      3      4      5      6      7
ADD R1, R2, R3    IF     ID     EXadd  MEMadd WB
SUB R4, R5, R1           IF     ID     EXsub  MEM    WB
AND R6, R1, R7                  IF     ID     EXand  MEM    WB

- The first forwarding is for the value of R1 from EXadd to EXsub. The second forwarding is also for the value of R1, from MEMadd to EXand. This code can now be executed without stalls.
- Forwarding can be generalized to include passing the result directly to the functional unit that requires it
- A result is forwarded from the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit.

Without Forward

Clock cycle       1      2      3      4      5      6      7      8      9
ADD R1, R2, R3    IF     ID     EX     MEM    WB
SUB R4, R5, R1           IF     stall  stall  IDsub  EX     MEM    WB
AND R6, R1, R7                                IF     IDand  EX     MEM    WB

Data Forwarding
- A data dependency arises when an instruction needs to use the result of one of its predecessors before the result has returned to the register file => pipeline hazards
- Forwarding paths allow results to be passed between stages as soon as they are available
- The 5-stage pipeline requires each of the three source operands to be forwarded from any of the intermediate result registers
- Still one load stall
      LDR rN, […]
      ADD r2, r1, rN    ; use rN immediately
  - One stall
  - Compiler rescheduling

Stalls are required

Clock cycle       1      2      3      4      5      6      7      8
LDR R1, @(R2)     IF     ID     EX     MEM    WB
SUB R4, R1, R5           IF     ID     EXsub  MEM    WB
AND R6, R1, R7                  IF     ID     EXand  MEM    WB
OR R8, R1, R9                          IF     ID     EX     MEM    WB

- The load instruction has a delay or latency that cannot be eliminated by forwarding alone.

The Pipeline with one Stall

Clock cycle       1      2      3      4      5      6      7      8      9
LDR R1, @(R2)     IF     ID     EX     MEM    WB
SUB R4, R1, R5           IF     ID     stall  EXsub  MEM    WB
AND R6, R1, R7                  IF     stall  ID     EX     MEM    WB
OR R8, R1, R9                          stall  IF     ID     EX     MEM    WB

- The only necessary forwarding is done for R1 from MEM to EXsub.
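The load-use stall shown above can be counted mechanically. The following sketch assumes a 5-stage pipeline with full forwarding, where the only stall is one cycle when an instruction uses the destination of the LDR immediately before it; the tuple format and function names are illustrative.

```python
def count_load_use_stalls(program):
    """program: list of (op, dest, sources) tuples in issue order.

    Counts one stall for each instruction that reads the register
    loaded by the LDR directly in front of it.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LDR" and prev_dest in cur_sources:
            stalls += 1
    return stalls


def total_cycles(program, stages=5):
    """Ideal pipeline fill time plus load-use stalls."""
    if not program:
        return 0
    return stages + len(program) - 1 + count_load_use_stalls(program)


slide_sequence = [
    ("LDR", "R1", ["R2"]),
    ("SUB", "R4", ["R1", "R5"]),
    ("AND", "R6", ["R1", "R7"]),
    ("OR",  "R8", ["R1", "R9"]),
]
print(count_load_use_stalls(slide_sequence), total_cycles(slide_sequence))
```

For the four-instruction sequence on the slide this gives one stall and 9 cycles, matching the table. Moving an independent instruction between the LDR and the SUB (compiler rescheduling) removes the stall.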

LDR Interlock
- In this example, it takes 7 clock cycles to execute 6 instructions, a CPI of about 1.2
- An LDR instruction immediately followed by a data operation using the same register causes an interlock

Optimal Pipelining
- In this example, it takes 6 clock cycles to execute 6 instructions, a CPI of 1
- The LDR instruction does not cause the pipeline to interlock

LDM Interlock (1/2)
- In this example, it takes 8 clock cycles to execute 5 instructions, a CPI of 1.6
- During the LDM there are parallel memory and write back cycles

LDM Interlock (2/2)
- In this example, it takes 9 clock cycles to execute 5 instructions, a CPI of 1.8
- The SUB incurs a further cycle of interlock because it uses the highest specified register in the LDM instruction
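The CPI figures quoted on the last four slides are simply cycles divided by instructions. A short sketch that recomputes them from the cycle and instruction counts given on the slides (the function name and labels are illustrative):

```python
def cpi(cycles, instructions):
    """Average clock cycles per instruction."""
    return cycles / instructions


examples = {
    "LDR interlock":       (7, 6),
    "Optimal pipelining":  (6, 6),
    "LDM interlock (1/2)": (8, 5),
    "LDM interlock (2/2)": (9, 5),
}
for name, (cycles, n) in examples.items():
    print(f"{name}: CPI = {cpi(cycles, n):.2f}")
```

Note that 7/6 is closer to 1.17; the slide rounds it to 1.2.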

Control hazards (1/2)
- Control hazards can cause a greater performance loss for the ARM pipeline than data hazards.
- When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4.
- The simplest method of dealing with branches is to stall the pipeline as soon as the branch is detected until we reach the EX stage

Branch        IF        ID        EXE       MEM       WB
successor               IF(stall) stall     IF        ID        EXE       MEM       WB
successor+1                                           IF        ID        EXE       MEM       WB

Control hazards (2/2)
- The number of stall cycles can be reduced in two steps
  - Find out whether the branch is taken or not taken earlier in the pipeline
  - Compute the taken PC (i.e., the address of the branch target) earlier
- We will discuss branch prediction schemes

Branch prediction
- The simplest branch prediction scheme is to predict the branch as not taken, simply allowing the hardware to continue as if the branch were not executed.
- Care must be taken not to change the machine state until the branch outcome is definitely known.

Predict Not Taken
- The pipeline with this scheme implemented behaves as shown below:

Untaken branch instr  IF     ID     EXE    MEM    WB
Instr i+1                    IF     ID     EXE    MEM    WB
Instr i+2                           IF     ID     EXE    MEM    WB

Taken branch instr    IF     ID     EXE    MEM    WB
Instr i+1                    IF     idle   idle   idle   idle
Branch target                       IF     ID     EXE    MEM    WB
Branch target+1                            IF     ID     EXE    MEM    WB

Predict Taken
- An alternative scheme is to predict the branch as taken.
- ARM employs a static branch prediction mechanism
  - Conditional branches that branch backwards are predicted to be taken
  - Conditional branches that branch forwards are predicted not to be taken
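The static rule above depends only on the direction of the branch. A minimal sketch, with illustrative function names and made-up addresses, that applies the rule and counts mispredictions for a typical loop-closing branch:

```python
def predict_taken(branch_addr, target_addr):
    """Static prediction: backward branches taken, forward branches not taken."""
    return target_addr < branch_addr


def mispredictions(branch_addr, target_addr, outcomes):
    """Count how often the static guess disagrees with the actual outcomes.

    outcomes: list of booleans, True when the branch was actually taken.
    """
    guess = predict_taken(branch_addr, target_addr)
    return sum(1 for taken in outcomes if taken != guess)


# Hypothetical loop: the branch at 0x120 jumps back to 0x100 nine times,
# then falls through on the last iteration.
loop_outcomes = [True] * 9 + [False]
print(predict_taken(0x120, 0x100), mispredictions(0x120, 0x100, loop_outcomes))
```

The rule works well for loops because a backward branch at the bottom of a loop is taken on every iteration except the last.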

Summary
- Instruction set
  - 32-bit ARM instructions
  - 16-bit Thumb instructions
- ARM/Thumb interworking
- ARM organization
  - 3-stage pipeline
    - Fetch/Decode/Execute
  - 5-stage pipeline
    - Fetch/Decode/Execute/Buffer/Write Back
    - Pipeline hazards
      - Structural hazard
      - Data hazard
      - Control hazard

References
[1] http://twins.ee.nctu.edu.tw/courses/ip_core_02/index.html
[2] S. Furber, ARM System-on-Chip Architecture, Second Edition, Addison Wesley Longman, ISBN 0-201-67519-6.
[3] D. Seal (ed.), Architecture Reference Manual, Second Edition, Addison Wesley Longman, ISBN 0-201-73719-1.
[4] www.arm.com