Machine-Level Programming I: Basics
Comp 21000: Introduction to Computer Organization & Systems
Instructor: John Barr
* Modified slides from the book “Computer Systems: A Programmer’s Perspective”, Randy Bryant & David O’Hallaron, 2011

Today: Machine Programming I: Basics
• History of Intel processors and architectures
• C, assembly, machine code
• Assembly Basics: Registers, operands, move
• Intro to x86-64

Turning C into Object Code
• Code in files p1.c p2.c
• Compile with command: gcc -O0 -no-pie p1.c p2.c -o p
  - Turns optimization off (-O0 means (almost) no optimization)
  - Puts resulting binary in file p
  - Compiles to 64-bit code
• Translation steps:
    text:    C program (p1.c p2.c)
               | Compiler (gcc -O0 -S; -S means stop after compiling)
    text:    asm program (p1.s p2.s)
               | Assembler (gcc or as)
    binary:  object program (p1.o p2.o), plus static libraries (.a)
               | Linker (gcc or ld)
    binary:  executable program (p)

Compiling Into Assembly
C Code (sum.c):
    long plus(long x, long y);

    void sumstore(long x, long y, long *dest)
    {
        long t = plus(x, y);
        *dest = t;
    }
Generated x86-64 Assembly:
    sumstore:
        pushq   %rbx
        movq    %rdx, %rbx
        call    plus
        movq    %rax, (%rbx)
        popq    %rbx
        ret
• Some compilers use the instruction “leave”.
• Obtain this on the server with the command: gcc -O0 -S sum.c
  - Produces file sum.s
• Use the -O0 flag (dash, uppercase ‘O’, number 0) to minimize optimization. Otherwise you’ll get strange register moving.
• On 32-bit machines you can force 64-bit code with the option -m64.

Compiling Into Assembly (with debugging information)
C Code (sum.c):
    long plus(long x, long y);

    void sumstore(long x, long y, long *dest)
    {
        long t = plus(x, y);
        *dest = t;
    }
Generated x86-64 Assembly:
    sumstore:
        pushq   %rbx
        movq    %rdx, %rbx
        call    plus
        movq    %rax, (%rbx)
        popq    %rbx
        ret
• Some compilers use the instruction “leave”.
• Obtain (on the server machine) with the command: gcc -O0 -g sum.c
  - -g puts in hooks for gdb, the GNU debugger
  - You can then examine the assembly code in gdb

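To try the commands above, sum.c needs a body for plus, which the slide only declares. A minimal sketch, assuming plus simply adds its two arguments (that body is an assumption, not something the slides show):

```c
/* sum.c: a complete version of the slide's example.
 * The body of plus() is assumed; the slide only declares it. */

long plus(long x, long y)
{
    return x + y;            /* assumed implementation */
}

void sumstore(long x, long y, long *dest)
{
    long t = plus(x, y);     /* result of the call comes back in %rax */
    *dest = t;               /* the slide's listing stores it with movq %rax, (%rbx) */
}
```

Running gcc -O0 -S sum.c on this file should produce sum.s. The exact instructions depend on the compiler version and flags, so the output will likely differ from the tidy listing on the slide.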
How it used to be done
• PDP programming
  - See https://youtu.be/XV-7J5y1TQc

gdb
• Examine source code:
  - list firstLineNum, secondLineNum
  - firstLineNum is the first line number of the code that you want to examine; secondLineNum is the ending line number.
• Breakpoints:
  - break lineNum
  - lineNum is the line number of the statement that you want to break at.
• Debugging:
  - step     // step one C instruction (steps into)
  - stepi    // step one assembly instruction
  - next     // step one C instruction (steps over)
  - where    // info about where execution is stopped

gdb
• Viewing data:
  - print x
    Prints the value currently stored in the variable x.
  - print &x
    Prints the memory address of the variable x.
  - display x
    Prints the value of the variable x every time the program stops.

gdb
• Getting low-level info about the program:
  - info line lineNum
    Provides some information about line number lineNum, including the memory address (in hex) where its code is stored in RAM.
  - disassemble memAddress1,memAddress2
    Prints the assembly language instructions located in memory between memAddress1 and memAddress2.
  - x memAddress
    Prints the contents of memAddress in hexadecimal notation. You can use this command to examine the contents of a piece of data.
• There are many more commands that you can use in gdb. While it is running, type help to get a list of commands.

Assembly Language
• An assembly language program has 3 major pieces:
• Labels
  - Used for control (loops, if statements)
• Instructions
  - Have a specific format
• Operands
  - May be 0, 1 or 2
  - Register or memory address
    sumstore:
        pushq   %rbx
        movq    %rdx, %rbx
        call    plus
        movq    %rax, (%rbx)
        popq    %rbx
        ret

Assembly Characteristics: Data Types
• “Integer” data of 1, 2, 4 or 8 bytes
  - Data values
  - Addresses (untyped pointers)
  - Data is stored in memory or a register
• Floating point data of 4, 8, or 10 bytes
• Code: byte sequences encoding a series of instructions
• No aggregate types such as arrays or structures
  - Just contiguously allocated bytes in memory

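The point about aggregates can be seen from C itself. The sketch below reads an array element using only a base address and a byte offset, which is roughly how the machine sees an array; the function names quad_offset and load_quad are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offset of element i in an array of 8-byte integers. */
size_t quad_offset(size_t i)
{
    return i * sizeof(int64_t);
}

/* Reads element i starting from an untyped base address.
 * There is no "array" here: only an address plus a byte offset. */
int64_t load_quad(const void *base, size_t i)
{
    const unsigned char *bytes = base;
    int64_t value;
    memcpy(&value, bytes + quad_offset(i), sizeof value);
    return value;
}
```

For an array int64_t a[3] = {10, 20, 30}, load_quad(a, 2) fetches the 8 bytes that start 16 bytes past the base address.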
Assembly Characteristics: Instructions
• Perform arithmetic function on register or memory data
• Transfer data between memory and register
  - Load data from memory into register
  - Store register data into memory
• Transfer control
  - Unconditional jumps to/from procedures
  - Conditional branches

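As a rough illustration, the short C function below is annotated with the instruction class each line would typically turn into. The function accumulate is hypothetical, and the actual instructions a compiler chooses depend on the compiler and optimization level.

```c
/* Adds the first n elements of src into *total and returns the new total. */
long accumulate(long *total, const long *src, long n)
{
    for (long i = 0; i < n; i++) {   /* transfer control: conditional branch */
        long v = src[i];             /* load: memory -> register */
        long sum = *total + v;       /* arithmetic on register/memory data */
        *total = sum;                /* store: register -> memory */
    }
    return *total;                   /* transfer control: return to the caller */
}
```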
Assembly Language Versions
• There is only one Intel x86-64 machine language (1’s and 0’s)
• There are two ways of writing assembly language on top of this:
  - Intel version
  - AT&T version
• UNIX/Linux in general and the gcc compiler in particular use the AT&T version. That’s also what the book uses and what we’ll use in these slides.
  - Why? UNIX was developed at AT&T Bell Labs!
• Most reference material is in the Intel version

Assembly Language Versions
• There are slight differences, the most glaring being the placement of source and destination in an instruction:
  - Intel: instruction dest, src
  - AT&T: instruction src, dest
• Intel code omits the size designation suffixes
  - mov instead of movl
• Intel code omits the % before a register
  - rax instead of %rax
• Intel code has a different way of describing locations in memory
  - [rbp+8] instead of 8(%rbp)
    sumstore:
        pushq   %rbx
        movq    %rdx, %rbx
        call    plus
        movq    %rax, (%rbx)
        popq    %rbx
        ret

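The memory-operand difference can be made concrete with a small sketch that prints the same base-plus-displacement operand in both notations. The helper names format_att and format_intel are invented for this example and cover only this one addressing form.

```c
#include <stdio.h>

/* Writes a base-plus-displacement operand in AT&T syntax, e.g. 8(%rbp). */
int format_att(char *buf, size_t size, long disp, const char *reg)
{
    return snprintf(buf, size, "%ld(%%%s)", disp, reg);
}

/* Writes the same operand in Intel syntax, e.g. [rbp+8]. */
int format_intel(char *buf, size_t size, long disp, const char *reg)
{
    return snprintf(buf, size, "[%s%+ld]", reg, disp);
}
```

Both functions return the number of characters snprintf would write, so a caller can check that the buffer was large enough.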
Object Code
Code for sumstore, starting at 0x0400595:
    0x53
    0x48 0x89 0xd3
    0xe8 0xf2 0xff 0xff 0xff
    0x48 0x89 0x03
    0x5b
    0xc3
  - Total of 14 bytes
  - Each instruction 1, 3, or 5 bytes
  - Starts at address 0x0400595
• Assembler
  - Translates .s into .o
  - Binary encoding of each instruction
  - Nearly-complete image of executable code
  - Missing linkages between code in different files
• Linker
  - Resolves references between files
  - Combines with static run-time libraries
    E.g., code for malloc, printf
  - Some libraries are dynamically linked
    Linking occurs when program begins execution

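The byte counts on this slide can be checked with a few lines of C. This sketch stores the 14 bytes of sumstore as they appear in the disassembly listing, records the length of each instruction, and computes where each instruction starts. The array and function names are made up for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* The 14 bytes of sumstore, grouped by instruction. */
static const uint8_t sumstore_code[] = {
    0x53,                          /* push %rbx         */
    0x48, 0x89, 0xd3,              /* mov  %rdx, %rbx   */
    0xe8, 0xf2, 0xff, 0xff, 0xff,  /* call plus         */
    0x48, 0x89, 0x03,              /* mov  %rax, (%rbx) */
    0x5b,                          /* pop  %rbx         */
    0xc3                           /* ret               */
};

/* Length in bytes of each of the six instructions, in order. */
static const size_t instr_len[] = { 1, 3, 5, 3, 1, 1 };

/* Total size of the function's machine code. */
size_t code_size(void)
{
    return sizeof sumstore_code;
}

/* Address of instruction k (0-based) when the code starts at `start`. */
uint64_t instr_address(uint64_t start, size_t k)
{
    uint64_t addr = start;
    for (size_t i = 0; i < k; i++)
        addr += instr_len[i];
    return addr;
}
```

With a start address of 0x400595, instruction 3 (the store) lands at 0x40059e, matching the address shown in the disassembly.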
Machine Instruction Example
• C Code
    *dest = t;
  - Store value t where designated by dest
• Assembly
    movq %rax, (%rbx)
  - Move 8-byte value to memory
    Quad words in x86-64 parlance
  - Operands:
    t:      Register %rax
    dest:   Register %rbx
    *dest:  Memory M[%rbx]
• Object Code
    0x40059e: 48 89 03
  - 3-byte instruction
  - Stored at address 0x40059e
Symbol table
  - The assembler must change labels into memory locations
  - To do this, it keeps a table that associates a memory location with every label
  - This is called a symbol table

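A symbol table can be pictured as a simple list of label/address pairs. The sketch below is a toy version, not how a real assembler stores it: the struct layout and the lookup_symbol function are invented for illustration, and the two addresses are the final ones shown in the slides' disassembly (a real assembler records offsets that the linker later turns into final addresses).

```c
#include <stdint.h>
#include <string.h>

/* One entry: a label and the memory location it stands for. */
struct symbol {
    const char *label;
    uint64_t    address;
};

/* Toy table using the two labels that appear in the disassembly. */
static const struct symbol symbol_table[] = {
    { "plus",     0x400590 },
    { "sumstore", 0x400595 },
};

/* Returns the address recorded for label, or 0 if the label is unknown. */
uint64_t lookup_symbol(const char *label)
{
    size_t n = sizeof symbol_table / sizeof symbol_table[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(symbol_table[i].label, label) == 0)
            return symbol_table[i].address;
    }
    return 0;
}
```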
Disassembling Object Code
Disassembled:
    0000000000400595 <sumstore>:
      400595:  53              push   %rbx
      400596:  48 89 d3        mov    %rdx,%rbx
      400599:  e8 f2 ff ff ff  callq  400590 <plus>
      40059e:  48 89 03        mov    %rax,(%rbx)
      4005a1:  5b              pop    %rbx
      4005a2:  c3              retq
• Disassembler
    objdump -d sum
    otool -tV sum      // in Mac OS
  - Useful tool for examining object code
  - -d means disassemble
  - sum is the file name (e.g., could be a.out)
  - Analyzes bit pattern of series of instructions
  - Produces approximate rendition of assembly code
  - Can be run on either a.out (complete executable) or .o file

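The call instruction in this listing shows how the disassembler turns raw bytes into the target 400590. The opcode e8 is followed by a 32-bit little-endian displacement that is added to the address of the next instruction. A minimal sketch of that arithmetic, with function names chosen for this example:

```c
#include <stdint.h>

/* Decodes a 32-bit little-endian signed displacement from 4 bytes. */
int64_t rel32(const uint8_t *p)
{
    uint32_t u = (uint32_t)p[0]
               | (uint32_t)p[1] << 8
               | (uint32_t)p[2] << 16
               | (uint32_t)p[3] << 24;
    int64_t d = (int64_t)u;
    if (d >= 0x80000000LL)        /* top bit set: value is negative */
        d -= 0x100000000LL;
    return d;
}

/* Target of a 5-byte "e8 rel32" call located at call_addr. */
uint64_t call_target(uint64_t call_addr, const uint8_t *instr)
{
    uint64_t next = call_addr + 5;              /* address of the next instruction */
    return next + (uint64_t)rel32(instr + 1);   /* displacement is relative to it */
}
```

For the bytes e8 f2 ff ff ff at 0x400599, the displacement is -14 and the next instruction is at 0x40059e, so the target is 0x400590, the address of plus.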
Alternate Disassembly
Object (starting at 0x0400595):
    0x53
    0x48 0x89 0xd3
    0xe8 0xf2 0xff 0xff 0xff
    0x48 0x89 0x03
    0x5b
    0xc3
Disassembled:
    Dump of assembler code for function sumstore:
      0x0000000000400595 <+0>:   push   %rbx
      0x0000000000400596 <+1>:   mov    %rdx,%rbx
      0x0000000000400599 <+4>:   callq  0x400590 <plus>
      0x000000000040059e <+9>:   mov    %rax,(%rbx)
      0x00000000004005a1 <+12>:  pop    %rbx
      0x00000000004005a2 <+13>:  retq
• Within the gdb debugger
    gdb sum
    disassemble sumstore
  - Disassemble procedure
    x/14xb sumstore
  - Examine the 14 bytes starting at sumstore

Direct creation of assembly programs
• Assembly program: in a .s file
• % gcc -S p.c
  - Result is an assembly program in the file p.s
• The file contains assembler directives, assembly commands, and symbols:
            .file   "assemTest.c"
            .text
            .type   sum.0, @function
    sum.0:
            pushl   %ebp
            movl    %esp, %ebp
            subl    $4, %esp
            movl    12(%ebp), %eax
            addl    8(%ebp), %eax
            leave
            ret
            .size   sum.0, .-sum.0
    .globl main
            .type   main, @function
    main:
            pushl   %ebp
            movl    %esp, %ebp
            subl    $8, %esp
            andl    $-16, %esp
            subl    ...
            movl    $0, %eax
            leave
            ret
            .size   main, .-main
            .section .note.GNU-stack,"",@progbits
            .ident  "GCC: (GNU) 3.4.6

Features of Disassemblers
• Disassemblers determine the assembly code based purely on the byte sequences in the object file.
  - Do not need the source file
• Disassemblers use a different naming convention than the GAS assembler.
  - Example: omits the “l” from the suffix of many instructions.
• Disassembler uses the nop instruction (no operation).
  - Does nothing; just fills space.
  - Necessary in some machines because of branch prediction
  - Necessary in some machines because of addressing restrictions
  - And sometimes the disassembler just can’t figure out what’s going on

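To make "based purely on the byte sequences" concrete, here is a toy decoder for a handful of one-byte x86-64 opcodes, including the three that appear in sumstore. It is only a sketch under the assumption of 64-bit mode: the function name is invented, and a real disassembler must also handle prefixes, multi-byte opcodes, and operand bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Maps a few one-byte opcodes to AT&T-style text.
 * Returns NULL for any byte this toy table does not know. */
const char *decode_one_byte(uint8_t opcode)
{
    switch (opcode) {
    case 0x53: return "push %rbx";
    case 0x5b: return "pop %rbx";
    case 0x90: return "nop";
    case 0xc3: return "ret";
    default:   return NULL;   /* needs more bytes or is not in the table */
    }
}
```

A byte such as 0x48 returns NULL here because it is a prefix that only makes sense together with the bytes that follow it.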
What Can be Disassembled?
    % objdump -d WINWORD.EXE

    WINWORD.EXE: file format pei-i386

    No symbols in "WINWORD.EXE".
    Disassembly of section .text:

    30001000 <.text>:
    30001000:  55              push   %ebp
    30001001:  8b ec           mov    %esp,%ebp
    30001003:  6a ff           push   $0xffffffff
    30001005:  68 90 10 00 30  push   $0x30001090
    3000100a:  68 91 dc 4c 30  push   $0x304cdc91
Note: Reverse engineering forbidden by Microsoft End User License Agreement
• Anything that can be interpreted as executable code
• Disassembler examines bytes and reconstructs assembly source

Machine Programming I: Summary
• History of Intel processors and architectures
  - Evolutionary design leads to many quirks and artifacts
• C, assembly, machine code
  - Compiler must transform statements, expressions, procedures into low-level instruction sequences
• Assembly Basics: Registers, operands, move
  - The x86 move instructions cover a wide range of data movement forms
• Intro to x86-64
  - A major departure from the style of code seen in IA32