Godbolt: Behind the Compiler – Programming by Design

(Updated July 26, 2024)

Overview

The C compiler produces platform-specific executable code, which can also be viewed as assembly language when the compiler is given the right options. This can be particularly useful when we want to learn a bit more about what is going on behind the scenes.

Maybe we want to:

Learn more about assembly language.
Learn more about how language statements get converted to executable code.
Learn more about how memory is handled, especially the stack.

Now, some students reading this may be thinking:

I have only had Java programming…
I have only had limited experience with Python…
I have never programmed in C…
I am not sure I know what assembly language is…

Of course, this is absolutely fine. These have been common statements made in the past. There is truly no worry here because this is a learning experience. We will take our time and begin the process of applying what we do know against the new material, and we will see, often quite swiftly, that we know more than we think, and all we need is a different perspective to bring it all to light.

The Beginning…

The following piece of C code can be followed by any first-year programming student regardless of their language of study.

int main (void) {
    int x, g = 5;
    int l = 10;

    for (x = 0; x < 10; x++)
        g = g + x;

    return 0;   // success
}

Example C program to sum a group of values.

We can see three variables declared, two of which are assigned values at the time of declaration (lines 2-3). Next is a simple counting for loop that iterates over the values 0-9 (line 5), adding each successive value to a running total held in g (line 6). Finally, a value of zero, indicating a successful end of the program, is returned (line 8).

Before we go any further, it must be stated that this code was chosen for its revelation rather than its results. In fact, we do not care about the value of g at the end. It is of no consequence. What we do care about is the resulting assembly language code produced by the compiler given the different options we provide.

Godbolt’s Compiler Explorer

There are many ways to get at the assembly language of C source code. We can install a compiler on a Linux box, run a command with a few specific options and then examine the files produced. Easy peasy.

Another option is to use the Compiler Explorer created by Matt Godbolt.

The image below is an annotated layout detailing the use of Compiler Explorer.

In your browser, once you arrive at Compiler Explorer:

Check the language is set to C.
Check the compiler is set to x86-64 gcc 11.2 or later version. Really any x86-64 gcc compiler will be fine.
Check that compiler options are empty (for now).

Paste the following code into the left side:

int main (void) {
    int x, g = 5;   // x is rbp-4, g is rbp-8
    int l = 10;     // l is rbp-12

    for (x = 0; x < 10; x++)
        g = g + x;

    return 0;   // success
}

Witness the assembly language produced on the right.

Important Note!

If you want to see assembly language for other CPUs, try some of the following:

ARM64 GCC 13.2.0 or later (ARM powers Apple, Broadcom, NVIDIA, etc. ARM was once known as Acorn RISC Machine, then Advanced RISC Machine.)
6502 cc65 2.19 or later (Old MOS65XX CPU from the 1970s)
RISC-V (64 bits) GCC 13.2.0 or later (open source RISC CPU)
POWER64 GCC 13.2.0 or later (IBM servers running AIX)

Explaining the Results

The next bit we will examine is a side-by-side presentation of the source and the code that was produced by Compiler Explorer. A copious amount of whitespace has been added to line up the C code with its roughly corresponding assembly language code.

Important Note!

In the Compiler Explorer, you can simply place your cursor in the assembly window and the corresponding lines of code within the source language will be highlighted.

First we need to review a few details.

Base Pointer and Stack Pointer

The base pointer (rbp) CPU register marks the starting point of the stack frame for the currently active function call (whether main or some other function). This is also known as the bottom of the currently active stack frame. It resides in higher memory and from this location the stack pointer (rsp) register grows toward lower memory. The stack pointer is also known as the current top of the stack.

The stack pointer grows and shrinks relative to the current base pointer. Since all local (automatic) variables are allocated on the stack, the first thing that is typically done upon arrival in any function is to push the current rbp and then copy the rsp to rbp. This sets up a new frame for the call and allows us to use copious stack space at will knowing we will restore the stack to its previous glory before returning to the calling function.
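To see the direction of growth for ourselves, here is a minimal sketch in C. The function names are hypothetical, and the noinline attribute is a GCC/Clang extension assumed here so that the callee really gets its own frame. It compares the address of a local variable in a caller with the address of a local variable in the function it calls.

```c
#include <stdint.h>

/* Hypothetical helper: reports whether a local variable in this (called)
 * function sits at a lower address than a local in the caller.
 * noinline (a GCC/Clang extension) keeps this a real call with its own frame. */
__attribute__((noinline))
static int callee_local_is_lower(uintptr_t caller_address) {
    volatile int callee_local = 0;
    return (uintptr_t)&callee_local < caller_address;
}

/* Returns 1 when the callee's frame lies below the caller's frame,
 * which is what we expect on x86-64 Linux. */
int stack_grows_down(void) {
    volatile int caller_local = 0;
    return callee_local_is_lower((uintptr_t)&caller_local);
}
```

On x86-64 Linux we would expect stack_grows_down() to return 1. Other platforms may lay out their stacks differently, so treat the result as an observation rather than a guarantee.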

This is a basic layout of memory for a given process running in memory:

        Highest Memory Location

            |-----------|
            |   argv    |  Command line arguments and
            |   env     |  system environment variables.
   rbp -->  |-----------|
            |   Stack   |  Functions and automatic variables
   rsp -->  |...........|
            |     |     | 
            |     |     |
            |     v     |
            |           |
            |           | 
            |     ^     |
            |     |     |
            |     |     |
            |...........|
            |    Heap   | Memory allocated at runtime (malloc)
            |-----------|
            |   .bss    | Uninitialized data - globals and static. (BSS)
            |-----------|
            |   .data   | Initialized data - globals and static. (DS)
            |-----------|
            |   .text   | Code (TEXT, Code Segment)
            |-----------|

         Lowest Memory Location

This memory abstraction applies to all running programs in Linux and is therefore very easy to apply to our discussion.
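The layout can be probed from C with a small sketch like the one below. The struct and function names are hypothetical, and casting a function pointer to an integer is not strictly portable ISO C (it is assumed to work here, as it does on Linux). The function records one sample address from each region in the diagram.

```c
#include <stdint.h>
#include <stdlib.h>

int initialized_global = 42;   /* expected to land in .data */
int uninitialized_global;      /* expected to land in .bss  */

/* Hypothetical record holding one sample address per region. */
struct segment_addresses {
    uintptr_t text;   /* address of a function        */
    uintptr_t data;   /* initialized global           */
    uintptr_t bss;    /* uninitialized global         */
    uintptr_t heap;   /* block obtained from malloc   */
    uintptr_t stack;  /* local (automatic) variable   */
};

/* Fills *out with sample addresses; returns 0 on success, -1 if malloc fails. */
int sample_segment_addresses(struct segment_addresses *out) {
    int stack_local = 0;
    int *heap_block = malloc(sizeof *heap_block);
    if (heap_block == NULL)
        return -1;

    out->text  = (uintptr_t)&sample_segment_addresses;
    out->data  = (uintptr_t)&initialized_global;
    out->bss   = (uintptr_t)&uninitialized_global;
    out->heap  = (uintptr_t)heap_block;
    out->stack = (uintptr_t)&stack_local;

    free(heap_block);
    return 0;
}
```

Printing the five values on a Linux machine should show the stack address far above the others. The exact numbers change from run to run because of address space layout randomization.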

In a 64-bit system we are using rbp and rsp (32-bit uses ebp, esp). The diagram shows the basics of where the rbp and rsp may be in relation to a running program. It is impossible to tell exactly where these registers are pointing since they change anytime we invoke another function during runtime. Therefore these registers are forever changing in a large application.

Back to the Code

As noted earlier, it is expected that any function that is invoked saves the current base pointer and makes the current stack pointer its base pointer. The function cleans up by restoring the base pointer to its original value before leaving, thereby fixing everything before returning to the caller.

            |-----------|
  rbp -->   |   ????    |
            |-----------|
            |   ????    | Various data values on the stack before invoking main.
            |-----------|
            |   ????    |
            |-----------|
  rsp -->   |   addr    | Return address is pushed on the stack as a result of calling main
            |-----------|
            |           | Next empty frame
            |-----------|

The side-by-side view shown below is the C code matched line for line with the assembly language produced. This helps to see how each line of C code often becomes multiple lines of assembly.

int main (void) {                               main:
                                                        push    rbp
                                                        mov     rbp, rsp
    int x, g = 5;   // x is rbp-4, g is rbp-8           mov     DWORD PTR [rbp-8], 5
    int l = 10;     // l is rbp-12                      mov     DWORD PTR [rbp-12], 10
                                                        mov     DWORD PTR [rbp-4], 0
    for (x = 0; x < 10; x++)                            jmp     .L2
                                                .L3:
        g = g + x;                                      mov     eax, DWORD PTR [rbp-4]
                                                        add     DWORD PTR [rbp-8], eax
                                                        add     DWORD PTR [rbp-4], 1
                                                .L2:
                                                        cmp     DWORD PTR [rbp-4], 9
                                                        jle     .L3
    return 0;   // success                              mov     eax, 0
                                                        pop     rbp
}                                                       ret

Finally, change the compiler options to “-O” to add baseline optimizations. Review the output.

It will look something like this:

main:
        mov     eax, 0
        ret

Why?

Most compilers have some form of optimization available. The GCC compiler has many different stages of optimization available.

Why did adding optimizations change the code produced by the compiler? Consider for a moment what the loop does. It loops while adding a value to g at each iteration.

The key is that nothing is done with g after that. The compiler knows this and then decides that maybe it is not needed. And if g is not needed, then maybe the loop is not needed either. Since there is no further use for x without g, the loop is tossed as well.

What about the variable initialization at the beginning? It’s not needed either. The variable l is never used at all, and if none of the others are needed, then there is no point in even using the stack space. All that is left is to return zero.

So, the final thing to note that is missing is the stack manipulation. We do not bother saving the rbp and setting up the stack for local vars – because now there are none to worry about.
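A minimal sketch of the same idea, using two hypothetical functions that are not part of the original program. The loop is identical in both; the only difference is whether the result ever leaves the function.

```c
/* The result never leaves the function, so an optimizing compiler
 * is free to discard the loop and the variables entirely. */
void loop_result_discarded(void) {
    int x, g = 5;

    for (x = 0; x < 10; x++)
        g = g + x;
}

/* The result is returned to the caller, so its value must survive. */
int loop_result_returned(void) {
    int x, g = 5;

    for (x = 0; x < 10; x++)
        g = g + x;

    return g;
}
```

Pasting both into Compiler Explorer with “-O” should show the first function reduced to little more than a ret, while the second still has to produce a value for its caller. The exact output depends on the compiler and version.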

Now consider a program that uses g in a meager printf.

#include <stdio.h>

int main (void) {
    int x, g = 5;   // x is rbp-4, g is rbp-8
    int l = 10;     // l is rbp-12

    for (x = 0; x < 10; x++)
        g = g + x;

    printf("%d\n", g);
    return 0;   // success
}

With no optimizations, you will see something like the following:

#include <stdio.h>                              .LC0:
                                                        .string "%d\n"
int main (void) {                               main:
                                                        push    rbp
                                                        mov     rbp, rsp
                                                        sub     rsp, 16
    int x, g = 5;   // x is rbp-4, g is rbp-8           mov     DWORD PTR [rbp-8], 5
    int l = 10;     // l is rbp-12                      mov     DWORD PTR [rbp-12], 10
                                                        mov     DWORD PTR [rbp-4], 0
    for (x = 0; x < 10; x++)                            jmp     .L2
                                                .L3:
        g = g + x;                                      mov     eax, DWORD PTR [rbp-4]
                                                        add     DWORD PTR [rbp-8], eax
                                                        add     DWORD PTR [rbp-4], 1
                                                .L2:
                                                        cmp     DWORD PTR [rbp-4], 9
                                                        jle     .L3
    printf("%d\n", g);                                  mov     eax, DWORD PTR [rbp-8]
                                                        mov     esi, eax
                                                        mov     edi, OFFSET FLAT:.LC0
                                                        mov     eax, 0
                                                        call    printf
    return 0;   // success                              mov     eax, 0
                                                        leave
}                                                       ret

This is a good example of seeing a library call invoked in assembly language. The majority of the code is the same as above. You will notice that two arguments are passed to printf in registers, following the x86-64 calling convention used on Linux (the System V AMD64 ABI). The first argument is in edi (the format string) and the second in esi (the value from g).

Important Note!

The order of the populating of esi then edi seems to be a carryover from the 32-bit days where the arguments were pushed onto the stack from right to left.

Remember that the format string is first and the value provided for the conversion is second. But the compiler produces code that still processes from right to left.

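The reason eax is set to zero is that printf is a variadic function. Under this calling convention, the low byte of eax (al) tells a variadic callee how many vector (floating point) registers carry arguments. Zero means none are in use (we are not using %f as a conversion).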
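To watch that register change, we can try a variadic call that does pass a floating point value. The sketch below uses a hypothetical helper built on snprintf (also variadic) so that the result lands in a buffer instead of on the screen.

```c
#include <stdio.h>

/* Hypothetical helper: formats an int and a double into buf.
 * Returns whatever snprintf returns (the length of the formatted text). */
int format_int_and_double(char *buf, size_t size, int whole, double fraction) {
    return snprintf(buf, size, "%d %.2f", whole, fraction);
}
```

With an x86-64 gcc and no options, we would expect to see mov eax, 1 just before the call, since one vector register (xmm0) now carries the double. Calling format_int_and_double(buf, sizeof buf, 50, 1.5) leaves the text 50 1.50 in buf.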

Call Model Complexity

So, why do we subtract 16 from rsp? The answer lies in the fact that now we will be using the stack. See, in the previous code, we never called anything. So, local (automatic) variables were just locations on the stack, and we only ever used offsets from rbp without moving rsp. (A function that calls nothing can get away with this because the System V AMD64 ABI reserves a 128-byte "red zone" below rsp for its use.) With this version, we have to protect the local variables from being clobbered when we call printf, because the return address will be pushed onto the stack. We have to move the stack pointer to protect the variables we have already used, and we have to keep the stack 16-byte aligned at the point of the call. This alignment is a requirement of the System V AMD64 ABI, not merely a performance preference.
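The arithmetic behind the 16 can be sketched in a few lines of C. The function name is hypothetical, and the compiler's real frame-size calculation takes more into account than this; the sketch only shows the rounding step.

```c
#include <stddef.h>

/* Rounds size up to the next multiple of 16.
 * Adding 15 and clearing the low four bits does the rounding. */
size_t round_up_to_16(size_t size) {
    return (size + 15u) & ~(size_t)15u;
}
```

Our three int variables need 12 bytes, and round_up_to_16(12) gives 16, which matches the sub rsp, 16 in the listing.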


Study this...it will take some time to see it all...

            |-----------|
Orig rbp--> |   ????    |
            |-----------|
            |   ????    | Various data values on the stack prior to invoking main.
            |-----------|
            |   ????    |
            |-----------|
Orig rsp--> |   addr    | Address to go to when returning from main
            |-----------|
  rbp-->    | Orig rbp  | This was rsp before the adjustment, but after push rbp; mov rbp, rsp
            |-----------|
  rbp-4     |           | x
            |-----------|
  rbp-8     |           | g
            |-----------|
  rbp-12    |           | l
            |-----------|
  rbp-16    |   ????    | unused space - this is rsp after the adjustment, but before the call to printf
            |-----------|
  rbp-24    |   addr    | return address (8 bytes) pushed by the call to printf and the new location of rsp.
            |-----------|
  rbp-32    |           | Next empty slot
            |-----------|

Oh, and the leave instruction is the equivalent of:

        mov     rsp, rbp
        pop     rbp

This resets rsp by copying back the value held in rbp and then pops the previously saved rbp from the stack. Poof! The stack is restored to its original state!

Now, set the compiler options to “-O” again and you will see something like:

.LC0:
        .string "%d\n"
main:
        sub     rsp, 8
        mov     esi, 50
        mov     edi, OFFSET FLAT:.LC0
        mov     eax, 0
        call    printf
        mov     eax, 0
        add     rsp, 8
        ret

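This results from the compiler evaluating the loop at compile time, a combination of optimizations usually described as constant propagation and constant folding. (It is often loosely called loop unrolling, which strictly means replicating the loop body to cut down on branching.) It is only possible when the compiler can determine the precise number of iterations and every value involved (which can be done with both of our examples).

The loop leaves a value of 50 in g: the starting 5 plus the sum of 0 through 9, which is 45. The compiler can work out this final value and simply have it ready. When we do the printing, the value is moved to esi.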
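The value 50 can be checked by hand. The sketch below uses two hypothetical functions that generalize the article's loop to any starting value and count: one adds the values one at a time, the other uses the closed form for the sum 0 + 1 + ... + (count - 1).

```c
/* Adds 0, 1, ..., count-1 to start, one iteration at a time. */
int sum_by_loop(int start, int count) {
    int x, g = start;

    for (x = 0; x < count; x++)
        g = g + x;

    return g;
}

/* Same result without a loop: 0 + 1 + ... + (count-1) = count*(count-1)/2. */
int sum_by_formula(int start, int count) {
    return start + count * (count - 1) / 2;
}
```

With start set to 5 and count set to 10, both give 5 + 45 = 50, the constant that appears in the optimized assembly.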

It is valuable to note that rsp is adjusted by 8 here before the call to printf to maintain 16-byte alignment before the return address is pushed.

Can you figure out why it is 8?