ARM Cortex-A9 MPCore™ processor Presented by Chris Cai (xiaocai2), Rehana Tabassum (tabassu2), Sam Mussmann (mussmnn2)

Background “The architectural simplicity of ARM processors leads to very small implementations, and small implementations mean devices can have very low power consumption. Implementation size, performance, and very low power consumption are key attributes of the ARM architecture.” ARM Architecture Reference Manual, ARMv7-A edition

Background (2) ARM is RISC • Uniform register file • Load/store architecture • Simple addressing

Background (3) • The ARM Cortex-A9 processor is the high-performance choice in a family of low-power, cost-sensitive devices. • The Cortex-A9 microarchitecture is delivered either as a Cortex-A9 single-core processor or as a scalable multicore processor: the Cortex-A9 MPCore™ processor

Where is it used? • Examples: - Apple A5 (iPhone 4S, iPad 2, iPad mini) http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore#Implementations http://en.wikipedia.org/wiki/Iphone_4s

Where is it used? (2) • Examples: - NVIDIA Tegra 2 (Motorola Xoom, Droid X2) http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore#Implementations http://en.wikipedia.org/wiki/Motorola_Xoom

Where is it used? (3) • Examples: - PlayStation Vita http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore#Implementations http://en.wikipedia.org/wiki/PlayStation_Vita

What are its specs? • The Cortex-A9 core: - Gives 2.50 DMIPS/MHz/core (Dhrystone MIPS) - Generally clocked between 800 MHz and 2 GHz - Can run at > 1 GHz while consuming < 250 mW http://arm.com/products/processors/cortex-a9.php?tab=Specifications http://www.linuxfordevices.com/c/a/News/ARM-spins-multicoreenabled-Cortex-core/
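
As a quick worked example of the DMIPS/MHz figure above, the sketch below multiplies it by a clock rate to get a peak Dhrystone score. The function name `dmips` and the assumption of ideal scaling across cores are illustrative, not taken from the ARM specification.

```python
# Figure quoted on the slide: Dhrystone MIPS per MHz per core.
DMIPS_PER_MHZ = 2.50


def dmips(clock_mhz, cores=1):
    """Peak Dhrystone MIPS, assuming (optimistically) perfect scaling across cores."""
    return DMIPS_PER_MHZ * clock_mhz * cores


# Clock rates spanning the 800 MHz to 2 GHz range mentioned on the slide.
for mhz in (800, 1000, 2000):
    print(f"{mhz} MHz: {dmips(mhz):.0f} DMIPS per core")
```

Real workloads rarely scale linearly with core count, so the multi-core value is best read as an upper bound.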

Presentation Overview • Micro-architecture • Memory System • Multi-core

Microarchitecture Overview • Variable-length, out-of-order, superscalar pipeline – Two instructions are fetched per cycle – Up to 4 instructions are issued per cycle into the primary data processing, secondary data processing, load-store, and compute engine (FPU/NEON) pipelines • Speculative execution – Supports virtual renaming of physical registers, removing pipeline stalls due to data dependencies

Cortex-A9 Microarchitecture [Diagram: pipeline stages Instruction Fetch, Decode, Rename, Issue, Execute, Writeback, and Memory] www.arm.com/files/pdf/armcortexa-9processors.pdf

Instruction Fetch • Instruction cache size: 16 KB, 32 KB, or 64 KB • Superscalar pipeline: fetches two instructions per cycle • Branch prediction: – Global History Buffer: 1K to 16K entries – Branch Target Address Cache: 512 to 4K entries – Return stack of 4 x 32 bits • Fast-loop mode: instruction loops smaller than 64 bytes often complete without additional instruction cache accesses
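
To make the idea of a global-history branch predictor more concrete, here is a minimal sketch of a gshare-style predictor with 2-bit saturating counters. The slide only states the table size (1K to 16K entries); the XOR indexing, the counter width, and the class name `GlobalHistoryPredictor` are assumptions for illustration and may differ from what the Cortex-A9 actually does.

```python
class GlobalHistoryPredictor:
    """Toy gshare-style predictor; not the actual Cortex-A9 design."""

    def __init__(self, entries=1024):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.mask = entries - 1
        self.counters = [1] * entries   # 2-bit counters, start weakly not-taken
        self.history = 0                # global history of recent branch outcomes

    def _index(self, pc):
        # Assumed hash: word-aligned PC XOR global history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = GlobalHistoryPredictor(entries=1024)
loop_branch = 0x8000                    # hypothetical address of a loop-closing branch
for _ in range(20):
    predictor.update(loop_branch, taken=True)
print("predict taken:", predictor.predict(loop_branch))
```

Once the history register fills with "taken" outcomes, the branch maps to a stable counter that quickly saturates, which is why tight loops predict well.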

Instruction Decode • Superscalar decoder - Capable of decoding two full instructions per cycle

Rename • Register renaming - Resolves data dependencies and lets the hardware unroll small loops
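
The sketch below illustrates the general renaming idea: every write gets a fresh physical register, so a later write to the same architectural register no longer conflicts with earlier readers or writers (write-after-read and write-after-write hazards), while true read-after-write dependencies are preserved. The function name `rename`, the tuple encoding, and the unbounded pool of physical registers are simplifying assumptions, not details of the Cortex-A9.

```python
def rename(instructions, num_arch_regs=4):
    """Rename (dest, src1, src2) tuples of architectural register numbers.

    Assumes an unlimited supply of physical registers (no free list).
    """
    mapping = {r: r for r in range(num_arch_regs)}  # architectural -> physical
    next_phys = num_arch_regs
    renamed = []
    for dest, src1, src2 in instructions:
        p1, p2 = mapping[src1], mapping[src2]  # sources read the current mapping
        mapping[dest] = next_phys              # each write gets a fresh register
        renamed.append((next_phys, p1, p2))
        next_phys += 1
    return renamed


program = [
    (0, 1, 2),  # r0 = r1 + r2
    (3, 0, 1),  # r3 = r0 + r1   (true dependency on r0)
    (0, 2, 2),  # r0 = r2 + r2   (reuses the name r0)
]
for before, after in zip(program, rename(program)):
    print(before, "->", after)
```

After renaming, the third instruction writes a different physical register than the first, so it can execute without waiting for the second instruction to read the old value.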

Issue • The issue stage accepts at most 2 instructions per cycle • It can dispatch up to 4 instructions per cycle • Instructions are selected from the queue out of order

Execute • Variable-length execute stage (1 to 3 cycles) - Most instructions finish within 1 cycle - Instructions that fold in shifts and rotates can take up to 3 cycles • ADD r0, r1, r2 (1 cycle) • ADD r0, r1, r2, LSL #2 (2 cycles) - corresponds to a = b + (c << 2); • ADD r0, r1, r2, LSL r3 (3 cycles) - corresponds to a = b + (c << d);
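
As a small illustration of what the folded shift computes, the sketch below models the three ADD forms on 32-bit registers. The helper name `add_shifted` and the cycle table are illustrative; the cycle counts are simply those quoted on the slide, and shift amounts are assumed to be in the range 0 to 31.

```python
MASK32 = 0xFFFFFFFF

# Execute-stage cycle counts as quoted on the slide.
EXECUTE_CYCLES = {"no shift": 1, "shift by immediate": 2, "shift by register": 3}


def add_shifted(b, c, shift=0):
    """Model of ADD Rd, Rn, Rm, LSL <shift> with 32-bit wraparound."""
    return (b + ((c << shift) & MASK32)) & MASK32


print(add_shifted(10, 3))        # ADD r0, r1, r2          -> a = b + c
print(add_shifted(10, 3, 2))     # ADD r0, r1, r2, LSL #2  -> a = b + (c << 2)
d = 4
print(add_shifted(10, 3, d))     # ADD r0, r1, r2, LSL r3  -> a = b + (c << d)
```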

Execute (2) • NEON Media Processing Engine - NEON technology supports instructions targeted primarily at audio, video, 3D graphics, image and speech processing. http://www.arm.com/files/pdf/AT_-_NEON_for_Multimedia_Applications.pdf

Execute (3) • What is NEON? – NEON is a wide SIMD data processing architecture • 32 registers, 64 bits wide, or 16 registers, 128 bits wide (two views of the same register bank) – NEON instructions perform “packed SIMD” processing • Registers can be considered as a “vector” of elements of the same data type • Instructions perform the same operation in all lanes http://www.arm.com/files/pdf/AT_-_NEON_for_Multimedia_Applications.pdf
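
The packed SIMD idea can be sketched in plain Python by treating a 64-bit integer as a register split into equal lanes. The helpers `pack`, `unpack`, and `packed_add` are hypothetical names for this illustration; they mimic the behaviour of a lane-wise integer add, where each lane wraps around independently and no carry crosses a lane boundary.

```python
def pack(lanes, lane_bits):
    """Pack a list of lane values into one integer, lane 0 in the low bits."""
    mask = (1 << lane_bits) - 1
    value = 0
    for i, x in enumerate(lanes):
        value |= (x & mask) << (i * lane_bits)
    return value


def unpack(value, lane_bits, reg_bits=64):
    """Split a register value into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask for i in range(reg_bits // lane_bits)]


def packed_add(a, b, lane_bits, reg_bits=64):
    """Add corresponding lanes of a and b; each lane wraps on overflow."""
    sums = [x + y for x, y in zip(unpack(a, lane_bits, reg_bits),
                                  unpack(b, lane_bits, reg_bits))]
    return pack(sums, lane_bits)


a = pack([1, 2, 3, 0xFFFF], 16)      # four 16-bit lanes in a 64-bit register
b = pack([10, 20, 30, 1], 16)
print(unpack(packed_add(a, b, 16), 16))
```

The last lane overflows and wraps to zero without disturbing its neighbours, which is the key difference from an ordinary 64-bit add.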

Execute (4) • The NEON Media Processing Engine supports computations on: - single-precision (32-bit) floating-point numbers, SIMD and scalar - double-precision (64-bit) floating-point numbers, scalar only - half-precision (16-bit) floating-point numbers, conversion only - 8, 16, 32 and 64-bit signed and unsigned integers • Supported operations include: - addition, subtraction, multiplication - maximum or minimum of a vector of operands - inverse square-root approximation (y = x^(-1/2)) - many more
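
An inverse square-root approximation is typically refined with Newton-Raphson steps; ARMv7 NEON pairs an estimate instruction (VRSQRTE) with a step instruction (VRSQRTS) for this purpose. The sketch below shows only the arithmetic of the refinement in Python. The function names and the hand-picked starting estimate are assumptions for illustration; the hardware derives its initial estimate differently.

```python
def rsqrt_step(d, x):
    """One Newton-Raphson refinement of x as an approximation to d ** -0.5."""
    return x * (3.0 - d * x * x) / 2.0


def rsqrt(d, estimate, steps=2):
    """Refine a rough estimate of 1/sqrt(d) with a fixed number of steps."""
    x = estimate
    for _ in range(steps):
        x = rsqrt_step(d, x)
    return x


# Starting from a deliberately rough guess for 1/sqrt(4) = 0.5.
for steps in range(5):
    print(steps, rsqrt(4.0, 0.4, steps))
```

Each step roughly squares the relative error, so a few iterations are enough once the initial estimate is reasonably close.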

Memory • Dependent load-store instructions are forwarded for resolution within the memory system • 2-level TLB structure – Micro TLB: 32 entries on the data side and 32 or 64 entries on the instruction side, to reduce the power consumed in translation and protection look-ups – Main TLB http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf
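
A two-level TLB lookup can be sketched as follows: try the small micro TLB first, fall back to the main TLB, and only walk the page table when both miss. The 32-entry micro TLB size comes from the slide; the 128-entry main TLB, the 4 KB page size, the LRU replacement, and all class and function names are assumptions made for this illustration.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumes 4 KB pages


class Tlb:
    """Small fully associative TLB with LRU replacement (assumed policy)."""

    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()

    def lookup(self, vpn):
        if vpn in self.map:
            self.map.move_to_end(vpn)       # mark as most recently used
            return self.map[vpn]
        return None

    def fill(self, vpn, pfn):
        self.map[vpn] = pfn
        self.map.move_to_end(vpn)
        if len(self.map) > self.entries:
            self.map.popitem(last=False)    # evict the least recently used entry


def translate(vaddr, micro, main, page_table):
    """Return (physical address, level that supplied the translation)."""
    vpn = vaddr >> PAGE_SHIFT
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    level = "micro"
    pfn = micro.lookup(vpn)
    if pfn is None:
        level = "main"
        pfn = main.lookup(vpn)
        if pfn is None:
            level = "walk"
            pfn = page_table[vpn]           # stands in for a page table walk
            main.fill(vpn, pfn)
        micro.fill(vpn, pfn)
    return (pfn << PAGE_SHIFT) | offset, level


micro, main = Tlb(32), Tlb(128)
page_table = {0x10: 0x80}
print(translate(0x10123, micro, main, page_table))  # first access walks the table
print(translate(0x10123, micro, main, page_table))  # second access hits the micro TLB
```

Most accesses are served by the micro TLB, so the larger main TLB is consulted only occasionally, which is consistent with the power-saving motivation on the slide.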

Memory (2) • Data prefetcher – Monitors the processor's cache line requests and cache misses to determine how much data to prefetch – Can prefetch up to 8 independent data streams – Prefetches and allocates data in the L1 data cache, as long as accesses keep hitting in the prefetched cache line – When does prefetching stop?

Memory Hierarchy [Diagram: Cortex-A9 MPCore with CPUs, each having an instruction cache and a data cache; Snoop Control Unit (SCU); Accelerator Coherence Port; L2 cache; main memory] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

L1 caches • Non-unified (separate I$ and D$) - 32-byte line length - can be disabled independently • 16, 32 or 64 KB • 4-way associative • Support for Security Extensions • I cache: VIPT • D cache: PIPT - reduces the number of cache flushes and refills, saving energy [Diagram: Cortex-A9 MPCore with CPUs, each having a D$ and an I$; SCU; ACP; AXI RW 64-bit buses; L2 cache; main memory] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf
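
The L1 parameters above (size, 4-way associativity, 32-byte lines) determine how an address splits into tag, set index, and line offset. The sketch below works through that arithmetic; the function names `l1_geometry` and `split_address` are illustrative.

```python
def l1_geometry(size_kb, ways=4, line_bytes=32):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    sets = size_kb * 1024 // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2 of the line size
    index_bits = sets.bit_length() - 1          # log2 of the set count
    return sets, offset_bits, index_bits


def split_address(addr, size_kb, ways=4, line_bytes=32):
    """Split an address into (tag, set index, byte offset within the line)."""
    sets, offset_bits, index_bits = l1_geometry(size_kb, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


for size_kb in (16, 32, 64):
    print(size_kb, "KB:", l1_geometry(size_kb))
print(split_address(0x12345, 32))
```

With 4 KB pages (an assumption here), the 16 KB configuration needs exactly 12 offset-plus-index bits, so the set index falls entirely within the page offset; the larger configurations need index bits above it.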

L2 cache • Shared, unified • Off-chip • 128 KB to 8 MB • 4- to 16-way associative [Diagram: Cortex-A9 MPCore with CPUs, each having a D$ and an I$; SCU; ACP; AXI RW 64-bit buses; L2 cache; main memory] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

Snoop Control Unit • Integral part of the cache memory system • Connects the processors to the memory system through AXI interfaces [Diagram: Cortex-A9 MPCore with CPUs, each having a D$ and an I$; SCU; ACP; AXI RW 64-bit buses; L2 cache; main memory] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

Snoop Control Unit (2) • SCU functions: - maintains data cache coherency - initiates L2 memory accesses - arbitrates between processors' simultaneous requests for L2 accesses - manages accesses from the ACP • Does not support instruction cache coherency http://infocenter.arm.com/help/topic/com.arm.doc.ddi0407i/DDI0407I_cortex_a9_mpcore_r4p1_trm.pdf

Accelerator Coherence Port • Optional 64-bit AXI slave port • Lets non-cached, system-mastering peripherals and accelerators connect to the processor, for example a DMA engine or cryptographic accelerator • The SCU enforces memory coherency http://www.arm.com/files/pdf/ARMCortexA-9Processors.pdf

Multi-Core http://www.arm.com/files/pdf/ARMCortexA-9Processors.pdf

Cache Coherence – MESI http://en.wikipedia.org/wiki/MESI_protocol

Cache Coherence – MESI (2) ARM MPCore has optimizations to MESI: • Duplicated tag RAMs • Cache-2-Cache transfer • Migratory lines All done in the Snoop Control Unit. System Level Benchmarking Analysis of the Cortex-A9 MPCore, Roberto Mijat, ARM Connected Community Technical Symposium, 2009
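
For reference, the baseline MESI protocol that these optimizations build on can be sketched as a tiny state machine tracking one cache line across several cores. This is textbook MESI only; it does not model the duplicated tag RAMs, cache-to-cache transfers, or migratory lines, and the function names `cpu_read` and `cpu_write` are illustrative.

```python
# States: M (Modified), E (Exclusive), S (Shared), I (Invalid).

def cpu_read(states, core):
    """Core reads the line; states holds one MESI state per core."""
    if states[core] != "I":
        return                      # read hit, no state change
    sharers = [i for i, s in enumerate(states) if i != core and s != "I"]
    for i in sharers:
        states[i] = "S"             # M or E holders downgrade (M writes back first)
    states[core] = "S" if sharers else "E"


def cpu_write(states, core):
    """Core writes the line; every other copy is invalidated."""
    for i in range(len(states)):
        if i != core:
            states[i] = "I"
    states[core] = "M"


line = ["I", "I"]
cpu_read(line, 0);  print(line)     # core 0 holds the only copy
cpu_read(line, 1);  print(line)     # both cores share the line
cpu_write(line, 1); print(line)     # core 1 modifies, core 0 is invalidated
cpu_read(line, 0);  print(line)     # core 1 supplies the data, both share again
```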

Generic Interrupt Controller (GIC) • Which core services interrupts? • The GIC gives the programmer control • Centralizes interrupts, then dispatches them to individual core(s) System Level Benchmarking Analysis of the Cortex-A9 MPCore, Roberto Mijat, ARM Connected Community Technical Symposium, 2009