=============[ Review: C semantics as compiled to the CPU instructions ]============

How are the basic items of C represented in actual CPU computation semantics?

Functions: an address in memory where the function's binary executable
code starts. The linker knows a few more things about a function: its
length in binary code, the locations inside its code that reference
global addresses, and a few other things. But fundamentally, the
function is the address of its first CPU instruction, the one to jump
to when the function is called. E.g.:

fact:                           // <<--- address of function
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $16, %rsp
        movl    %edi, -4(%rbp)
        cmpl    $0, -4(%rbp)
        jne     .L2
        movl    $1, %eax
        jmp     .L3
.L2:
        movl    -4(%rbp), %eax
        subl    $1, %eax
        movl    %eax, %edi
        call    fact            // <<---- jump to that address (after pushing the address
                                //        of the next instruction onto the stack)
        imull   -4(%rbp), %eax
.L3:
        leave
        ret

Local variables: offsets from the register used as the Base Pointer for
the function's stack frame. The compiler knows the type and the size
of the value, in bytes. E.g., in the above code:

        movl    %edi, -4(%rbp)  // <<--- local copy of the parameter "n" gets written
        movl    -4(%rbp), %eax  // <<--- local copy of the parameter "n" gets read (to make n-1)

Global variables: global addresses in the program's virtual space,
notionally. On 32-bit CPUs, these tended to be true global 4-byte
addresses. On 64-bit x86, they are typically encoded as offsets from
the address of the next instruction (RIP-relative addressing), so two
instructions referencing the same global variable will carry different
offsets in their encodings, and those offsets will differ by exactly
the distance between the instructions' RIP values (check this!)
E.g., for a string constant referenced from main():

        .file   "hello.c"
        .text
        .section        .rodata
.LC0:
        .string "Hello"         // <<--- C string, i.e., char array {'H', 'e', 'l', 'l', 'o', '\0'}
        .text
        .globl  main
        .type   main, @function
main:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $16, %rsp
        movl    $10, -4(%rbp)
.L2:
        subl    $1, -4(%rbp)
        leaq    .LC0(%rip), %rax        // <<---- offset from the current instruction to .LC0
        movq    %rax, %rdi
        call    puts@PLT

In the disassembly:

   0x000055555555514c <+19>:    48 8d 05 b1 0e 00 00    lea    0xeb1(%rip),%rax
        // offset 0xeb1 (encoded as b1 0e 00 00) is off of the next RIP,
        // i.e., the address of the following instruction:
   0x0000555555555153 <+26>:    48 89 c7                mov    %rax,%rdi

// Indeed:
(gdb) display/x 0x0000555555555153 + 0xeb1
3: /x 0x0000555555555153 + 0xeb1 = 0x555555556004
(gdb) x/s 0x555555556004
0x555555556004: "Hello"
(gdb) x/8xb 0x555555556004
0x555555556004: 0x48    0x65    0x6c    0x6c    0x6f    0x00    0x00    0x00

Pointers: an address where a binary byte representation of a value
starts in memory. The C compiler keeps track of what type that value
is and how long it is, unless the pointer has the special type "void*".
On the CPU level, a pointer is the contents of a register or an offset
off a value in a register.

Arrays X[n] (X is a type): the address where you expect to find the
start of n back-to-back copies of the byte representation of X. For
example, int[10] will be the start of a 40-byte area, 10 x 4 bytes
for an int.

C struct: the address where you expect to find the start of the byte
representations of the members of the struct, back to back, possibly
interspersed with a few bytes of "slack" for alignment (unless the
struct has __attribute__((packed)), in which case there will be no
slack). For example, expect an int to start on a 4-byte boundary even
though preceded by a char or a short---unless packed. The packed mode
is specifically useful for parsing network packets, which arrive
without slack/padding.
E.g., see our previous example of gets2.c:

struct inps {
    char c[10];   // c is an array; when passed to a function it decays to a pointer, char*
    int cnt;
} inp;

Dump of assembler code for function main:
   0x0000555555555159 <+0>:     push   %rbp
   0x000055555555515a <+1>:     mov    %rsp,%rbp
=> 0x000055555555515d <+4>:     sub    $0x30,%rsp
   0x0000555555555161 <+8>:     mov    %fs:0x28,%rax
   0x000055555555516a <+17>:    mov    %rax,-0x8(%rbp)
   0x000055555555516e <+21>:    xor    %eax,%eax
   0x0000555555555170 <+23>:    movl   $0xa,-0x14(%rbp)      // inp.cnt = 10
        // inp is stored locally, its fields are adjacent and addressed by offsets from RBP:
        // cnt is at -0x14 and the 10-char array at -0x20 off RBP
   0x0000555555555177 <+30>:    lea    -0x20(%rbp),%rax      // <<--- address of the start of char[10],
                                                             //       pointer c gets passed to gets()
   0x000055555555517b <+34>:    mov    %rax,%rdi             //       (in RDI, as per the x86-64 ABI)
   0x000055555555517e <+37>:    call   0x555555555050
   0x0000555555555183 <+42>:    mov    -0x14(%rbp),%eax      // inp.cnt is read
   0x0000555555555186 <+45>:    mov    %eax,-0x24(%rbp)      // ... into i, local, at -0x24 off RBP
   0x0000555555555189 <+48>:    jmp    0x55555555519b
   0x000055555555518b <+50>:    subl   $0x1,-0x24(%rbp)
   0x000055555555518f <+54>:    lea    -0x20(%rbp),%rax      // the char* c is passed to puts()
   0x0000555555555193 <+58>:    mov    %rax,%rdi
   0x0000555555555196 <+61>:    call   0x555555555030
   0x000055555555519b <+66>:    cmpl   $0x0,-0x24(%rbp)
   0x000055555555519f <+70>:    jns    0x55555555518b
   0x00005555555551a1 <+72>:    mov    $0x2a,%eax
   0x00005555555551a6 <+77>:    mov    -0x8(%rbp),%rdx
   0x00005555555551aa <+81>:    sub    %fs:0x28,%rdx
   0x00005555555551b3 <+90>:    je     0x5555555551ba
   0x00005555555551b5 <+92>:    call   0x555555555040 <__stack_chk_fail@plt>
   0x00005555555551ba <+97>:    leave
   0x00005555555551bb <+98>:    ret
End of assembler dump.

Suggested experiments:

1. Make the struct global, i.e., move its definition outside main().
   What happens? What does the disassembly look like now?

2. Add __attribute__((packed)) right after the closing brace of the
   struct definition. What happens?

struct inps {
    char c[10];
    int cnt;
} __attribute__((packed)) inp;


===================[ Undefined Behaviors in C ]=====================

The behaviors that we observed with gets2.c and other examples are
repeatable, at least on a given combination of CPU ISA and compiler.
However, they are not a part of the C standard: they are _undefined_
and can only be understood from the compiled assembly---which is why
we started looking at it! Nevertheless, many real programs depend on
what the specific compilers have historically done.

John Regehr, A Guide to Undefined Behavior in C and C++:
  https://blog.regehr.org/archives/213 -- Part 1
  https://blog.regehr.org/archives/226 -- Part 2
  https://blog.regehr.org/archives/232 -- Part 3
  https://blog.regehr.org/archives/1520 -- "Undefined Behavior in 2017" (also on Youtube)

Papers from MIT's systems group regarding "time bombs" in Undefined Behavior:
  https://people.csail.mit.edu/nickolai/papers/wang-stack.pdf
  https://people.csail.mit.edu/nickolai/papers/wang-undef-2012-08-21.pdf