Tegra Xavier: Introduction to ARMv8 (and Debugging)
Kristoffer Robin Stokke, Ph.D
Dolphin Interconnect Solutions

Goals of Lecture
• To give you
  – Something concrete to start on
  – Some examples from «real life» where you may encounter these topics
• Every year I try to include something new...
  – Which means more freebies for you!
• Simple introduction to the ARMv8 NEON programming environment
  – Register environment, instruction syntax
  – «Families» of instructions
  – Important for debugging, writing code and general understanding
• Programming examples
  – Intrinsics
  – Inline assembly
• Performance analysis using gprof
• Introduction to GDB debugging

Keep This Under Your Pillow
• ARM's overview and information on NEON instructions
  – https://developer.arm.com/documentation/dui0204/j/neon-and-vfp-programming
• GNU compiler intrinsics list:
  – https://gcc.gnu.org/onlinedocs/gcc-4.6.4/gcc/ARM-NEON-Intrinsics.html
• Some non-formal calling conventions and snacks
  – https://medium.com/mathieugarcia/introduction-to-arm64-neon-assembly-930c4a48bb2a
• This may also be useful
  – https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/armneon-programming-quick-reference

Modern Heterogeneous SoC Architectures (04.01.2022)

SoC | Manufacturer | CPU | Cache | RAM | GPU | DSP | Hardware Accelerators
Tegra X1 | Nvidia | 4 ARM Cortex-A57 + 4 A53 | 2 MB L2, 48 kB I$, 32 kB D$ (L1) | 4 GB | 256-core Maxwell | - | ISP
Tegra Xavier | Nvidia | 8 Carmel ARMv8 | 2 MB L3 (shared), 8 MB L2 (shared 2 cores), 128 kB I$, 64 kB D$ | 16 GB | 512-core Volta | - | CNN blocks, ISP
Myriad X | Intel Movidius | 2 SPARC | Kilobytes | 4 GB | - | 16 VLIW cores | ISP, CNN blocks, ++ more
SDA 845 | Qualcomm | 8 Kryo (ARM-based) | | 8 GB | Adreno GPU | Hexagon VLIW + SIMD | ISP, LTE
IMX6Q | Freescale (NXP) | 4 ARM Cortex-A9 | 1 MB L2 | Impl. dep. | «3D graphics» | |

Tegra Xavier CPU Cache Hierarchy
• ARM 8.2 cores (Carmel)
  – (Half-precision floating point!)
  – SIMD!
• 64 kB L1 cache
  – Per core
• 4 MB L2 cache
  – Per dual
• 2 MB L3 cache
  – Shared between all cores
• Least recently used (LRU) eviction strategy
Figure: Core -> 128 kB I$ / 64 kB D$ -> 4 MB L2 cache -> 2 MB L3 cache -> 16 GB RAM (faster towards the core)

CPU Hierarchies and Performance
• Let's do an experiment! ☺
• Reading or writing 800 MB
• Vary the size to read back-to-back
  – E.g. read 24 kB repeatedly from the same buffer, until 800 MB have been read
• Buffer size determines the location of data
• Under ideal conditions (no contention..)
  – Below 50 kB, all reads are cached in L1
  – Below 4 MB, all reads are cached in L2
  – Above 6 MB... nothing gets cached
Figure: CPU core running 20 k loops @ 40 kB (L1, 64 kB), 40 k loops @ 80 kB (L2, 4 MB) and 100 loops @ 8 MB (L2, 4 MB)
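The experiment above could be sketched roughly as follows. This is a minimal sketch, not the lecture's actual benchmark: the function names `read_repeatedly` and `time_read_ms` are made up for illustration, and `clock()` is only a coarse timer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Read a buffer of buf_bytes back-to-back until total_bytes have been read.
 * Returns the sum of all words read so the loads cannot be optimised away. */
uint64_t read_repeatedly(const uint64_t *buf, size_t buf_bytes, size_t total_bytes)
{
    assert(buf_bytes >= sizeof(uint64_t) && buf_bytes % sizeof(uint64_t) == 0);
    const volatile uint64_t *words = buf;   /* volatile: force every load */
    size_t n_words = buf_bytes / sizeof(uint64_t);
    size_t loops = total_bytes / buf_bytes;
    uint64_t sum = 0;
    for (size_t l = 0; l < loops; l++)
        for (size_t i = 0; i < n_words; i++)
            sum += words[i];
    return sum;
}

/* Time one buffer size and return the elapsed CPU time in milliseconds. */
double time_read_ms(size_t buf_bytes, size_t total_bytes)
{
    uint64_t *buf = malloc(buf_bytes);
    assert(buf != NULL);
    memset(buf, 1, buf_bytes);              /* touch the pages before timing */
    clock_t start = clock();
    (void)read_repeatedly(buf, buf_bytes, total_bytes);
    clock_t end = clock();
    free(buf);
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_read_ms(40 * 1024, 800u * 1024 * 1024)` and then repeating with 80 kB and 8 MB buffers would give one number per cache level; the absolute values depend on the machine and on contention.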

Code Example and Profiling
• Compile with -pg
• Run app: ./main
• Analyze
  – gprof ./main gmon.out
• NB: Prefetch op
  – prfm <type><target><policy>, [reg] | label
  – Type
    • pld (for load)
    • pst (for store)
    • pli (for instruction)
  – Target
    • L1 or L2 (or L3)
  – Policy
    • keep (normal)
    • strm (stream, use once)
  – Example: prfm pldl1keep, [x0] (address in x0)
Measured times:
  L1: read 20 ms, write 30 ms
  L2: 70 ms
  RAM: read 220 ms, write 180 ms
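A prefetch hint like the one above might be used as in the following sketch. The helper names and the prefetch distance of 16 elements are assumptions made for illustration; on non-AArch64 GCC/Clang targets the sketch falls back to `__builtin_prefetch`, and elsewhere to a no-op.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hint that addr will be loaded soon and should be kept in L1. */
static inline void prefetch_for_load(const void *addr)
{
#if defined(__aarch64__)
    __asm__ volatile("prfm pldl1keep, [%0]" : : "r"(addr));
#elif defined(__GNUC__)
    __builtin_prefetch(addr, 0, 3);   /* 0 = read, 3 = keep in all cache levels */
#else
    (void)addr;                       /* no prefetch available: do nothing */
#endif
}

/* Sum an array while prefetching a fixed distance ahead of the current element. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    const size_t ahead = 16;          /* assumed distance: 16 x 4 B = 64 B */
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            prefetch_for_load(&data[i + ahead]);
        sum += data[i];
    }
    return sum;
}
```

Whether this helps at all is something to measure with gprof; for a simple linear walk the hardware prefetcher often does the job already.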

ARMv8 Registers
• 31 x 64-bit general purpose registers: X0-X30
• 32 x 128-bit vector registers: V0-V31
• Zero registers: XZR, WZR
• Stack pointer: SP, WSP
• PC

The Vector Registers V0-V31: Packing
• Data in V0-V31 are packed into lanes, and you control how they are packed
  – Example: 16 bytes or 8 bytes
  – Example: 8 half-words or 4 half-words

Intrinsics, Inline Assembly or Assembly?
(listed in increasing level of difficulty)

Intrinsics (goes inside C functions):
    int32x4_t vector;
    // Do stuff on vector
    vector = vaddq_s32(vector, vector);

Inline assembly (goes inside C functions):
    int32x4_t vector;
    __asm__("add %[v].4s, %[v].4s, %[v].4s" : [v] "+w" (vector));

Assembly (goes in .s file):
    .text
    .global double_elements
    double_elements:
        add v0.4s, v0.4s, v0.4s
        ret

• You will need to understand assembly to
  – Debug your program
  – Understand how to use intrinsics correctly

Data Types

C World | Assembly World | Elements | Element type
uint8x8_t | v0.8b | 8 | 8-bit unsigned integer
uint8x16_t | v0.16b | 16 | 8-bit unsigned integer
float32x2_t | v0.2s | 2 | 32-bit floating point
float32x4_t | v0.4s | 4 | 32-bit floating point

• Vector registers are specified by the following: 8B/16B/4H/8H/2S/4S/2D
  – B: bytes
  – H: half-word (16-bit)
  – S: word (32-bit)
  – D: doubleword (64-bit)
• An S-vector can therefore hold signed or unsigned integers, or floating point values. The meaning is encoded in the assembly instruction or intrinsic, when needed.

Example: Loading or Storing Something

Load, inline assembly:
    void *inp = malloc(64);
    __asm__ volatile("ld1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[inp]]"
                     : : [inp] "r" (inp)
                     : "v0", "v1", "v2", "v3", "memory");

Load, intrinsics:
    uint8_t *inp = malloc(64);
    uint8x16_t vectors[4];
    for (i = 0; i < 4; i++)
        vectors[i] = vld1q_u8(inp + i*16);

Store, inline assembly:
    void *out = malloc(64);
    // Assume we have done something
    // intelligent with v0, v1, v2 and v3
    __asm__ volatile("st1 {v0.16b, v1.16b, v2.16b, v3.16b}, [%[out]]"
                     : : [out] "r" (out)
                     : "memory");

Store, intrinsics:
    uint8_t *out = malloc(64);
    uint8x16_t vectors[4];
    for (i = 0; i < 4; i++)
        vst1q_u8(out + i*16, vectors[i]);
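Put together, the intrinsics variant of the load and store above might look like this sketch. The function name `copy64` is made up for illustration, and the `memcpy` fallback is only there so the same file builds on machines without NEON.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Load 64 bytes into four 128-bit vectors, then store them to out. */
void copy64(const uint8_t *inp, uint8_t *out)
{
    assert(inp != NULL && out != NULL);
#if defined(__aarch64__)
    uint8x16_t vectors[4];
    for (int i = 0; i < 4; i++)
        vectors[i] = vld1q_u8(inp + i * 16);   /* 16 bytes per load */
    /* ... something intelligent with the vectors would go here ... */
    for (int i = 0; i < 4; i++)
        vst1q_u8(out + i * 16, vectors[i]);    /* 16 bytes per store */
#else
    memcpy(out, inp, 64);                      /* portable fallback, no NEON */
#endif
}
```

Disassembling this function with objdump is a quick way to see which ld1/st1 forms the compiler picks for the intrinsics.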

How Do Intrinsics Map to Assembly?
• If you want to write some piece of inline assembly
  – But the compiler spits out errors and you don't know the syntax
• Try to write it with intrinsics
  – Then objdump -D <your executable> | less
  – Type /<insert_your_function_name> + hit return to search
• Alternatively
  – gdb <your executable>
  – break <your_source_file>.c:<your_line_number>
  – Type run -> enter
  – layout asm -> inspect

Example: Loading or Storing Something
• There are vector types and intrinsics for clustered vectors
  – In this case, four 128-bit registers
• Be careful!
  – The contents end up rearranged in a way that can seem unintuitive: vld4q de-interleaves while loading, so byte i in memory lands in vectors.val[i % 4], lane i / 4

    uint8_t *inp = malloc(64);
    for (i = 0; i < 64; i++)
        inp[i] = i;
    // Contents are consecutive in memory..
    uint8x16x4_t vectors;
    vectors = vld4q_u8(inp);
    // Contents are not consecutive in vectors!!

Example: Vector Packing

Data types | Size
Bytes | 1 B
Half-words | 2 B
Words | 4 B
Double words | 8 B
Half precision | 2 B
Single precision | 4 B
Double precision | 8 B

Other Examples

Addition, subtraction and multiplication:
    int16x4_t v0, v1;
    int16x4_t result;
    result = vadd_s16(v0, v1);
    result = vsub_s16(v0, v1);
    result = vmul_s16(v0, v1);

Initialise all lanes:
    int8x16_t v0;
    int8_t init = 0;
    v0 = vdupq_n_s8(init);

Get and set a specific lane:
    float32x4_t v0;
    float32_t val;
    val = vgetq_lane_f32(v0, 0);
    val += 42.0f;
    v0 = vsetq_lane_f32(val, v0, 0);

Convert unsigned int to float:
    uint32x4_t v0;
    float32x4_t result;
    result = vcvtq_f32_u32(v0);
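The arithmetic intrinsics above can be wrapped into a small testable function, sketched below. The name `add_sub_mul_s16` is made up for illustration, and the scalar branch is an assumed stand-in so the code also builds without NEON.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Element-wise sum, difference and product of two 4-lane int16 vectors. */
void add_sub_mul_s16(const int16_t a[4], const int16_t b[4],
                     int16_t sum[4], int16_t diff[4], int16_t prod[4])
{
    assert(a && b && sum && diff && prod);
#if defined(__aarch64__)
    int16x4_t v0 = vld1_s16(a);                /* load four 16-bit lanes */
    int16x4_t v1 = vld1_s16(b);
    vst1_s16(sum,  vadd_s16(v0, v1));
    vst1_s16(diff, vsub_s16(v0, v1));
    vst1_s16(prod, vmul_s16(v0, v1));
#else
    for (int i = 0; i < 4; i++) {              /* scalar fallback, lane by lane */
        sum[i]  = (int16_t)(a[i] + b[i]);
        diff[i] = (int16_t)(a[i] - b[i]);
        prod[i] = (int16_t)(a[i] * b[i]);
    }
#endif
}
```

Comparing the output against values computed by hand, as in the test below the slide's debugging tips, is a cheap way to confirm that lanes are in the order you expect.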

Programming With Intrinsics More in a bit!

Programming Example: Intrinsics

Inline Assembly
• Mostly harder than using intrinsics
• However, gives more control (and better performance?)
• Not always straightforward to figure out what mnemonics to use
• Tip: disassemble intrinsics and look with objdump or gdb
• Operand constraints
  – «m»: memory address
  – «r»: general purpose register
  – «f»: floating point register (on AArch64, GCC uses «w» for floating point/SIMD registers)
  – «i»: immediate
  – ++ more
• Specify dirty registers (clobbers) and more

Programming Example: Inline Assembly

Lookup Tables (LUT)
• Powerful approximation
• Use LUTs to realise complex mathematics!
  – For example prime numbers..
• Some «index» points into a LUT offset that contains precomputed values
  – Output stored in a vector
Figure: index «vector» (3) -> LUT (four-element: 7 5 3 2) -> output «vector» (5)

Table Lookup in ARM Neon
• Vector table lookup: vtbl v0, {v1, v2, ..., vn}, vm
  – v0: destination vector
  – {v1, v2}: LUT (max 2 x 128-bit vectors!!)
  – vm: index vector
• Two flavours:
  – vtbl: any element out of range for the LUT returns 0
  – vtbx: any element out of range for the LUT leaves the destination unchanged
Figure: example lookup with destination v0, LUT registers v1 and v2, and index vector vm
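The difference between the two flavours can be modelled in plain C. This is a scalar sketch of the semantics described above, not the NEON instructions themselves; the names `tbl_u8` and `tbx_u8` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* tbl flavour: out-of-range indices produce 0. */
void tbl_u8(uint8_t *dst, const uint8_t *lut, size_t lut_len,
            const uint8_t *idx, size_t n)
{
    assert(dst != NULL && lut != NULL && idx != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (idx[i] < lut_len) ? lut[idx[i]] : 0;
}

/* tbx flavour: out-of-range indices leave the destination unchanged. */
void tbx_u8(uint8_t *dst, const uint8_t *lut, size_t lut_len,
            const uint8_t *idx, size_t n)
{
    assert(dst != NULL && lut != NULL && idx != NULL);
    for (size_t i = 0; i < n; i++)
        if (idx[i] < lut_len)
            dst[i] = lut[idx[i]];
}
```

The tbx behaviour is what makes it possible to chain several lookups over a table that is larger than one instruction can address: each pass fills in only the lanes whose index falls inside its part of the table.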

• Let's try to use LUTs to transpose matrices.
• Don't go thinking 4x4 or 8x8.
  – Start easy, then let's see if we can observe any patterns.

Matrix Transpose (Super Simple): 1x1 matrix, stride = 1
Matrix: [a] -> transposed: [a]
LUT (matrix): a
Index vector: 0
Destination vector: a

2x2 matrix, stride = 2
Matrix: [a b; c d] -> transposed: [a c; b d]
LUT (matrix): a b c d
Index vector: 0 2 1 3
Destination vector: a c b d

3x3 matrix, stride = 3
Matrix: [a b c; d e f; g h i] -> transposed: [a d g; b e h; c f i]
LUT (matrix): a b c d e f g h i
Index vector: 0 3 6 1 4 7 2 5 8
Destination vector: a d g b e h c f i

How to think?
• For the «first output row» = 0
  – Output element n is taken from n*stride in the «input matrix»
• For the «next row» = 1
  – Output element n is taken from n*stride + 1
• So generally, for output element n in output row i
  – Element is taken from n*stride + i
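The rule above (output element n of output row i comes from n*stride + i) can be turned into a small index-vector generator. This is a scalar sketch with made-up function names; on the target the lookup step would be done with the table instructions rather than a loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill index (stride*stride entries) so that output element n of output
 * row i is taken from input position n*stride + i. */
void build_transpose_index(uint8_t *index, size_t stride)
{
    assert(index != NULL && stride > 0 && stride * stride <= 256);
    for (size_t i = 0; i < stride; i++)          /* output row */
        for (size_t n = 0; n < stride; n++)      /* element within the row */
            index[i * stride + n] = (uint8_t)(n * stride + i);
}

/* Apply the index vector as a table lookup: dst[k] = matrix[index[k]]. */
void transpose_with_lut(uint8_t *dst, const uint8_t *matrix,
                        const uint8_t *index, size_t stride)
{
    assert(dst != NULL && matrix != NULL && index != NULL);
    for (size_t k = 0; k < stride * stride; k++)
        dst[k] = matrix[index[k]];
}
```

For stride 2 this produces 0 2 1 3 and for stride 3 it produces 0 3 6 1 4 7 2 5 8, matching the index vectors on the earlier slides. The index vector only depends on the stride, so it can be computed once and reused for every matrix.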

Matrix Transpose Example

The Gamma Transform
• The human eye is sensitive to variation in luminance
• The gamma transform..
  – «stretches» small variations in luminance
  – Can make it easier to see detail
• Gamma is also used to adjust for non-linearity in old monitors
  – Image data is actually transmitted to the display with gamma applied to it as a form of «backward compatibility»
  – Which is extremely confusing
  – Google «gamma correction explained» and watch the madness

The Gamma Transform
Figure: output luminance value versus input luminance value (both from 0 to 1); a small change in input luminance results in a larger change in output luminance. Source: https://wolfcrow.com/what-is-display-gamma-and-gamma-correction/

• Higher gamma compresses the low-lights, but extends the high-lights
• Lower gamma compresses the high-lights, but extends the low-lights
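A gamma transform on 8-bit pixels is a natural fit for a lookup table, since there are only 256 possible inputs. The sketch below assumes the common form out = 255 * (in/255)^gamma and fixes gamma = 2 so that everything stays in integer arithmetic; both the formula and the function names are assumptions for illustration, and a general gamma would typically be computed with pow() from math.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a 256-entry LUT for gamma = 2: lut[v] = round(255 * (v/255)^2). */
void build_gamma2_lut(uint8_t lut[256])
{
    assert(lut != NULL);
    for (unsigned v = 0; v < 256; v++)
        lut[v] = (uint8_t)((v * v + 127) / 255);   /* +127 rounds to nearest */
}

/* Apply a precomputed LUT to every pixel, in place. */
void apply_lut(uint8_t *pixels, size_t n, const uint8_t lut[256])
{
    assert(pixels != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        pixels[i] = lut[pixels[i]];
}
```

With this table, inputs 16 and 32 map to 1 and 4 (low-lights compressed), while inputs 239 to 255 spread over 224 to 255 (high-lights extended). The per-pixel step is a pure table lookup, which is the part the NEON table instructions can take over.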

Non-Temporal Loads and Stores
• Reading remote RAM over PCIe
  – Very slow, because the core stalls while the datapath fetches the response
• However, ARM provides special memory load & store instructions
  – Non-temporal loads and stores
  – Relaxed memory ordering
  – A hardware implementation may be able to improve performance, for example by not waiting for reads to arrive in destination registers
Figure: two Tegra Xavier boards (CPU + RAM each) connected over PCIe

Non-Temporal Loads and Stores
    ldnp q0, q1, [address]
    stnp q0, q1, [address]
NX reads remote RAM (x2 link):
• Temporal instructions: < 10 MBps
• Non-temporal instructions: 200-300 MBps
Figure: two Tegra Xavier boards (CPU + RAM each) connected over PCIe
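A copy loop built on the two instructions above might look like the following sketch. The function name `copy_nontemporal` and the 32-byte granularity are assumptions for illustration; the non-temporal forms are only hints to the hardware, so any speed-up depends on the implementation and the memory being accessed. On other targets the sketch falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy bytes (a multiple of 32) using non-temporal pair loads and stores. */
void copy_nontemporal(void *dst, const void *src, size_t bytes)
{
    assert(dst != NULL && src != NULL && bytes % 32 == 0);
    const uint8_t *s = src;
    uint8_t *d = dst;
    for (size_t off = 0; off < bytes; off += 32) {
#if defined(__aarch64__)
        __asm__ volatile(
            "ldnp q0, q1, [%[s]]\n\t"      /* load 2 x 16 bytes, non-temporal */
            "stnp q0, q1, [%[d]]\n\t"      /* store 2 x 16 bytes, non-temporal */
            :
            : [s] "r"(s + off), [d] "r"(d + off)
            : "v0", "v1", "memory");       /* dirty registers and memory */
#else
        memcpy(d + off, s + off, 32);      /* portable fallback */
#endif
    }
}
```

When the source is remote memory mapped over PCIe, the mapping type and alignment requirements of that memory also matter; those details are outside what this sketch covers.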

Example: Gamma Correction Using ARM Neon

When Things Go Wrong

GDB Example

Tips
• Build functions to print out macroblocks from vector registers and memory
• Start small: test out independent parts of the code that are easy to verify
• When in trouble, step through the code, display the relevant registers, and verify against output you know is working

Detecting Adidas Features

Good Luck!
• You'll be fine.

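Matrix Transpose
    tbl v0.4s, {v1.4s}, v2.4s
(illustrative four-element notation; the actual tbl instruction indexes bytes, e.g. v0.16b)
Matrix: [a b; c d] -> transposed: [a c; b d]
v1.4s (LUT): a b c d
v2.4s (index): 0 2 1 3
v0.4s (destination): a c b d
Think like this: for each output row, step through the LUT by the stride, selecting an increasing column.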

Code Profiling
• Compile with -pg
• Run application: ./main
• Run gprof ./main gmon.out
Figure: time to finish 100 M computations for matrix multiply (MM) and transpose operations. Bars: Transpose, lazy; Transpose, NEON assembly; MM, NEON intrinsics; MM, NEON assembly.