(Updated October 13, 2024)
Overview
This tutorial explains some of the basics of assembly language for x86_64. It focuses primarily on the 64-bit programming model and calling subroutines on the Linux platform. This is not the same as working in the 32-bit environment and is significantly different from working on the Windows platform. While the concepts presented here apply to some degree on macOS, the differences are enough that you cannot simply drop the code on a Mac and expect things to work as described here.
Assembly language is a set of mnemonic instructions – symbols that represent processor operation codes or opcodes. It is not the same as writing in Java, C, or Python. It is much more primitive. Higher-level languages like C are translated directly to this representation and can run directly on the processor. Other languages like Java are translated to bytecode, which is essentially a similar type of low-level representation but is interpreted above the level of the CPU by running the Java class in a Java Virtual Machine (JVM).
So what is the difference? The C program will run fast since it is built more closely to the CPU. However, the C program is not portable and will likely need to be recompiled on a new platform. The Java program is intended to be portable, hence the JVM, but will run a bit slower than the C equivalent. There is always a trade-off.
An assembly language program is assembled by a program called the assembler. The result will be an object file that can then be linked with other object files and libraries to produce a complete program.
The code examples are designed to be assembled with the NASM assembler in 64-bit mode on a Linux platform. They can be adjusted to run on Mac and Windows. However, the details of such adjustments are well beyond the scope of this introduction. You can assemble these programs using a few simple steps.
- Assemble: using the nasm command, convert the assembly language to an object file.
- Link: using the ld command with the object file, combine with system libraries to produce an executable.
- Run: invoke the program to view the results.
The following is a Linux shell transcript of the steps to assemble, link, and run the example presented in the next section. (This assumes some editor was previously used to create the hello.asm file.)
$ nasm -felf64 hello.asm
$ ls -la hello*
-rw-rw-r-- 1 student student 291 Sep 19 07:46 hello.asm
-rw-rw-r-- 1 student student 912 Sep 19 07:49 hello.o
$ ld hello.o -o hello
$ ls -la hello*
-rwxrwxr-x 1 student student 8952 Sep 19 07:49 hello
-rw-rw-r-- 1 student student 291 Sep 19 07:46 hello.asm
-rw-rw-r-- 1 student student 912 Sep 19 07:49 hello.o
$ ./hello
Hello, World!
$
The steps above, and many of the pieces hidden from view, are noted in the diagram below:
Illustration 1: Assembly language program flow to become executable.
This charts the flow of your assembly source program through the assembler to produce an object file (and an optional list file). This is often combined with other object files and libraries (static and shared) by the linker to produce an executable program. To run this program, the operating system has a loader program whose sole responsibility is to prepare and load the program into memory so that it is ready to run.
Hello World
The idea behind assembly language is to provide the programmer with instructions (opcodes) along with various addressing models to move data and perform a host of operations on the data. However, as mentioned earlier, it is primitive.
Assembly language source lines are generally made up of the following
label: instruction operand(s) ; comment
You may also see them as
label:
        instruction operand(s) ; comment
Consider this simple assembly language program that prints the obligatory "Hello, World!".
section .data
hello: db "Hello, World!",0xa
len: equ $ - hello
section .text
global _start
_start:
mov rax, 1 ; write syscall
mov rdi, 1 ; stdout
mov rsi, hello ; text
mov rdx, len ; length
syscall
exit:
mov rax, 60 ; exit
mov rdi, 0 ; return code
syscall
Name | Example | Purpose |
---|---|---|
Labels | hello: , _start: , exit: | A named location in the program. These are used instead of explicit addresses to represent positions in the code or data. This allows the programmer to not worry too much about memory locations. Note that labels alone on a line ought to have a colon. |
Instructions | mov , syscall | Assembly language instructions. These are the named actions the CPU is to perform with any provided operands. |
Operands | rdx, len and rsi, hello | Provides the instruction with the information to work with, where appropriate. (Not all instructions have operands, while some have 1 or 2, or even 3.) |
Directives | global , extern | Help to inform the assembler about some of the labels, sections, and external entities. |
Sections | .text , .data , .bss | Informs the assembler when a new section has begun so the information can be placed into the correct memory locations at runtime. .text – this is where your program instructions live. .bss – this is for uninitialized data (your variables). .data – this is for initialized, often constant, data. |
Syntax
The syntax of the source code listed here follows the Intel model. There is an alternative version known as the AT&T model. While both will produce the same code, the Intel version is arguably easier to understand and learn. The AT&T version is used by GNU’s gas and Mac’s as assemblers.
For example:
hello.s (AT&T – gas, as) | hello.asm (Intel – NASM) |
---|---|
|
|
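The differences between the two are mostly mechanical. As a rough sketch, here is the Hello World program in NASM syntax with an AT&T form beside each line. The AT&T forms in the comments are an illustrative rendering for gas and are not taken from the original listing.

```nasm
; Intel (NASM) syntax              ; AT&T (gas) equivalent
        section .data              ;         .data
hello:  db "Hello, World!", 0xa    ; hello:  .ascii "Hello, World!\n"
len:    equ $ - hello              ;         len = . - hello
        section .text              ;         .text
        global _start              ;         .globl _start
_start:                            ; _start:
        mov rax, 1                 ;         mov $1, %rax
        mov rdi, 1                 ;         mov $1, %rdi
        mov rsi, hello             ;         mov $hello, %rsi
        mov rdx, len               ;         mov $len, %rdx
        syscall                    ;         syscall
        mov rax, 60                ;         mov $60, %rax
        mov rdi, 0                 ;         mov $0, %rdi
        syscall                    ;         syscall
```

Note that AT&T places the source operand first and the destination last, prefixes registers with % and prefixes immediate values with $.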
Instructions
The work done by the CPU is strictly based on the instructions provided by the programmer. Essentially, each instruction performs one primitive operation, which includes moving data, performing arithmetic, making decisions, and branching to new locations in the program.
These are some general classifications of instruction. The most common are noted with a few instructions of that type.
- Binary arithmetic - signed and unsigned integer math, binary coded decimal along with logical and bit shift operations. (add, sub, imul)
- Logical and shift/rotate - used to manipulate bits. (and, xor, not, sal, shl, ror)
- Floating point - support for many forms of numeric presentation. (fsubr, fdivr)
- Data transfer - move information from place to place. (mov, xchg, push, pop)
- Control transfer - branching and subroutine calls. (cmp, jmp, jne, call, ret)
- String - move, compare and scan strings. (movs, cmps, scas, lods)
- Flag control - alter the state of the EFLAGS register. (stc, clc, sti, cli, pushf, popf)
There are many different categories of instructions, and there are so many instructions that they cannot be listed in a small tutorial such as this. So, we will present some standard instructions and links to additional documentation.
Instruction | Example | Outcome |
---|---|---|
mov | mov rax, rdx | Moves value in rdx into rax . |
add | add ebx, eax | Perform ebx + eax and store back into ebx . |
cmp | cmp ax, 0 | Compares ax to zero by subtracting and setting flags as appropriate. |
xchg | xchg eax, [data] | Exchanges 32-bit quantities contained in eax and data . |
sub | sub [sum], rax | Subtract 64-bit rax from value at sum placing result at sum . |
inc | inc rbx | Increment value in rbx by one. |
dec | dec dx | Decrement value in dx by one. |
xor | xor rax, rax | Performs exclusive-or of 1st operand with 2nd operand placing the results in 1st operand. This example zeroes the rax register. |
syscall | syscall | Invokes a privileged OS system call handler on behalf of the calling program. |
x86 and amd64 Instructions
Intel Developer Reference
Operands
Register
Registers are internal, named locations of the CPU that can hold values. The registers can receive constants, data from memory, or other registers.
In the beginning, when the 8086 was new, the traditional names for the registers were:
- AX - accumulator - this is where the majority of computation occurs.
- BX - base register - this is used as an offset into other memory locations.
- CX - counter register - can be combined with the loop instruction as a loop-control variable.
- DX - data register - can hold the overflow of arithmetic operations and be used in I/O operations.
- SP - stack pointer - indicates the current top of the stack (the most recently pushed item).
- BP - base pointer - also used with the stack pointer to access parameters and local variables.
- SI - source index - initially used for indirect addressing, can also be used with string operations.
- DI - destination index - initially used for indirect addressing, can also be used with string operations.
- SS - Stack Segment - Segment relative to the stack pointer (SP).
- ES - Extra Segment - Alternate segment for other data references - often for strings and used with SI and DI.
- DS - Data Segment - Segment relative to all unqualified memory references.
- CS - Code Segment - Segment relative to the instruction pointer (IP).
- IP - instruction pointer - indicates the location of the next instruction to be executed.
- FLAGS - flags register - point in time state of the CPU in a series of bit indicators.
As the iterations of the x86 progressed, the AX register (16 bits) became the EAX register (32 bits) up to today’s RAX register of 64 bits. The same is true for them all.
Eventually, more registers were added to complement the number of values that may be in flight in a given program allowing the CPU to keep more values on-chip rather than make many requests to/from memory. This allows for speed-up of programs as well since the fastest memory exists within the CPU itself.
The chart below lists some of the most common registers. You will notice that specific registers have more than one name.
64-bit | 32-bit | 16-bit | 8-bit (low-order) | 8-bit (high-order) |
---|---|---|---|---|
R0 / RAX | R0D / EAX | R0W / AX | R0B / AL | AH |
R1 / RCX | R1D / ECX | R1W / CX | R1B / CL | CH |
R2 / RDX | R2D / EDX | R2W / DX | R2B / DL | DH |
R3 / RBX | R3D / EBX | R3W / BX | R3B / BL | BH |
R4 / RSP | R4D / ESP | R4W / SP | R4B / SPL | |
R5 / RBP | R5D / EBP | R5W / BP | R5B / BPL | |
R6 / RSI | R6D / ESI | R6W / SI | R6B / SIL | |
R7 / RDI | R7D / EDI | R7W / DI | R7B / DIL | |
R8 | R8D | R8W | R8B | |
R9 | R9D | R9W | R9B | |
R10 | R10D | R10W | R10B | |
R11 | R11D | R11W | R11B | |
R12 | R12D | R12W | R12B | |
R13 | R13D | R13W | R13B | |
R14 | R14D | R14W | R14B | |
R15 | R15D | R15W | R15B | |
The first eight registers are dual named. AH, CH, DH, and BH are the high-order 8 bits of AX, CX, DX, and BX.
Register layout based on capacity.
|---------------------------- RAX & R0 -----------------------------| (same as dq and resq)
                                 |----------- EAX & R0D ------------| (same as dd and resd)
                                                  |----AX & R0W ----| (same as dw and resw)
---------------------------------------------------------------------
|            32 bits             |    16 bits     | 8 bits | 8 bits |
---------------------------------------------------------------------
                                                  |-- AH --|--AL &--| (each is the same size
                                                           |-- R0B--|  as db and resb)
Specific registers are available only in certain modes. The R registers are only available if the CPU is operating in 64-bit mode. In that mode, it also provides the remaining traditional names for 32- and 16-bit software.
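To make the overlapping names concrete, here is a minimal sketch that loads one 64-bit value and then reads a single byte of it through a smaller name. The file name and the constant are illustrative choices, not part of the original tutorial.

```nasm
; regs.asm (hypothetical file name)
; Build: nasm -felf64 regs.asm && ld regs.o -o regs
        section .text
        global  _start
_start:
        mov     rax, 0x1122334455667788 ; fill all 64 bits of rax
        ; The smaller names now read parts of that same value:
        ;   eax = 0x55667788   (low 32 bits)
        ;   ax  = 0x7788       (low 16 bits)
        ;   ah  = 0x77         (bits 8-15)
        ;   al  = 0x88         (bits 0-7)
        movzx   edi, ah                 ; rdi = 0x77, zero-extended
        mov     rax, 60                 ; exit syscall
        syscall                         ; exit status is 0x77 (119)
```

Running the program and then `echo $?` in the shell should show 119, the decimal value of the byte held in AH.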
Memory
To declare memory in your programs, you must first know if this will be in the .data or .bss segment. Items in .data can be initialized, but those in .bss cannot.
Size in Bytes (Bits) | .data | .bss | Equivalent |
---|---|---|---|
1 (8) | db | resb | AH, R0B |
2 (16) | dw | resw | AX, R0W |
4 (32) | dd | resd | EAX, R0D |
8 (64) | dq | resq | RAX, R0 |
section .data
hello: db "Hello!" ; a string
len: equ $ - hello ; the calculated length of the string
section .bss
result: resq 1 ; allocate 1 quadword for results.
section .text
; ...
mov rax, 32 ; rax is now 32
add rax, 5 ; add 5 to rax (37)
sub rax, 10 ; subtract 10 from rax (27)
mov [result], rax ; store rax into result
The last line is important. It is a form of direct addressing, where the address at which to store the data is encoded in the instruction itself. Without the brackets, however, the label would be taken literally as an address, as in
mov rax, result
which loads the address of result into rax. We have to place the square brackets around the address to indicate that we mean the value at that address, not the address itself.
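The difference between the address and the value at the address can be seen in a complete program. This is a sketch built from the fragment above; the file name and the choice to return the value as the exit status are assumptions made for illustration.

```nasm
; direct.asm (hypothetical file name)
; Build: nasm -felf64 direct.asm && ld direct.o -o direct
        section .bss
result: resq    1                       ; one quadword, zero-filled at load

        section .text
        global  _start
_start:
        mov     rax, 32
        add     rax, 5                  ; 37
        sub     rax, 10                 ; 27
        mov     [result], rax           ; store the VALUE 27 at address result

        mov     rbx, result             ; rbx = the ADDRESS of result
        mov     rdi, [result]           ; rdi = the VALUE stored there (27)

        mov     rax, 60                 ; exit syscall
        syscall                         ; exit status is 27
```

After running it, `echo $?` should print 27. The rbx register ends up holding a memory address, which would be a meaningless exit status.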
Addressing Modes
Implied
nop
Immediate
mov rax, 27
Register
mov rax, rdx
Direct/Displacement
mov rax, [result]
Register Indirect
mov rbx, result
mov rax, [rbx]
After these, they get a little more complex.
Based
The following examples are methods to indirectly manage data in a memory location. The square brackets indicate that the value contained inside is not the data but rather where the data lives. They are the equivalent of pointers to the data.
[ disp ]
[ reg ]
[ reg + reg * scale ]
[ reg + disp ]
[ reg + reg * scale + disp ]
Any general-purpose register can be used for reg. The value for disp is a constant displacement (offset) that is added to the computed address.
The scale value can be 1, 2, 4, or 8, representing the number of bytes. This is often used to step to the next position in an array of elements of a given size (for example, 4 for int, 8 for long).
Some examples are shown below:
mov dx, [rbx]
mov [intp], ecx
sub dword [rax + 100], 32768
xchg cx, [rsi + rax*4]
add rax, [rsi + rcx*8 + 100]
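A common use of the base plus scaled index form is walking an array. The following sketch sums four quadwords. The labels values, count, and next, along with the data itself, are made up for this example.

```nasm
; sum.asm (hypothetical file name)
; Build: nasm -felf64 sum.asm && ld sum.o -o sum
        section .data
values: dq      3, 5, 7, 11             ; four 8-byte elements
count:  equ     ($ - values) / 8        ; number of elements (4)

        section .text
        global  _start
_start:
        mov     rsi, values             ; base: address of the array
        xor     rcx, rcx                ; index = 0
        xor     rdi, rdi                ; sum = 0
next:
        add     rdi, [rsi + rcx*8]      ; sum += values[index]
        inc     rcx                     ; index++
        cmp     rcx, count
        jb      next                    ; repeat while index < count
        mov     rax, 60                 ; exit syscall
        syscall                         ; exit status is 3+5+7+11 = 26
```

The scale of 8 matches the element size declared with dq. An array declared with dd would use a scale of 4 and a 32-bit destination register instead.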
Immediate
Immediate operands are essentially constants. They are also the values of labels within the program. So these could be any data value as a constant expressed within the program or a label representing a location in the data section.
Some examples are shown below:
section .text
cmp rbx, 10
mov cx, 12
add edx, 100
mov esi, hello
; ...
ret
hello db "Hello!"
FLAGS
The flags register is the CPU’s way of keeping track of certain events due to executing instructions. Some are used to set the CPU state for specific operations – privileged and non-privileged.
Many of the flags present can be safely ignored by the casual assembly language programmer since they are used for specific purposes other than program control. Let us begin by examining the details of the RFLAGS register.
The visual below represents the 64 bits of the RFLAGS, which also contains the 32-bit EFLAGS and the traditional 16-bit FLAGS.
 6                              3 3322222222221111 111111
 3                              2 1098765432109876 5432109876543210
--------------------------------------------------------------------
|            RESERVED            |-RESERVED-IVVAVR|-NIOODITSZ-A-P-C|
--------------------------------------------------------------------
|------------------------------RFLAGS------------------------------|
                                 |--------------EFLAGS-------------|
                                                  |-----FLAGS------|
Bit | Name | Clear/Set | Purpose |
---|---|---|---|
0 | CF | nc/cy (no carry/carry) | Carry Flag. This is set when an unsigned arithmetic operation produces a carry out of the most significant bit (addition) or needs a borrow (subtraction), and is cleared otherwise. |
1 | RESERVED | | |
2 | PF | po/pe (parity odd/parity even) | Parity Flag. |
3 | RESERVED | | |
4 | AF | na/ac (no aux carry/aux carry) | Auxiliary Carry Flag. |
5 | RESERVED | | |
6 | ZF | nz/zr (not zero/zero) | Zero Flag. This flag is set when the result of an arithmetic, logical, or comparison operation is zero. Comparisons are generally performed as a subtraction. Data transfer instructions such as mov do not change it. This flag is integral to branching instructions. |
7 | SF | pl/ng (plus/negative) | Sign Flag. This is set if the result of an operation is negative. |
8 | TF | | Trap Flag. This is generally used for debuggers to step through code. |
9 | IF | di/ei (disable interrupts/enable interrupts) | Interrupt Enable Flag. This generally is not modified and indicates that hardware interrupts have been enabled. |
10 | DF | up/dn (up/down) | Direction Flag. This is typically used with the SI and DI registers and string instructions. Clear indicates movement up toward higher memory. |
11 | OF | nv/ov (normal value/overflow value) | Overflow Flag. This is set if a signed arithmetic operation results in a value that does not fit in the destination. |
12-13 | IOPL | | I/O Privilege Level. This is used to calculate the current privilege state of an executing program. |
14 | NT | | Nested Task. |
15 | RESERVED | | |
16 | RF | | Resume Flag. |
17 | VM | | Virtual Mode. Represents the Virtual 8086 mode. This is a compatibility mode. |
18 | AC | | Alignment Check / Access Control. This is generally set if alignment checking for memory references will be performed. |
19 | VIF | | Virtual Interrupt Flag. |
20 | VIP | | Virtual Interrupt Pending. |
21 | ID | | ID flag. The ability of a program to set or clear this flag indicates support for the CPUID instruction. |
22-31 | RESERVED | | |
32-63 | RESERVED | | |
So, why are the flags so important? Well, we have to consider how decisions are made. When we are comparing a value and then jumping to a new location based on the result, these make all the difference.
Here are two very simple tests:
cmp rax, 35
jle otherlocation
Which means
if rax <= 35 goto otherlocation
and
loop:
mov eax, 45
; ...
dec ebx
jnz loop
Which means
loop:
    eax = 45
    ;...
    ebx--
    if ebx != 0 goto loop
The branching instructions jle and jnz can do their task based on the instruction before. Those instructions set the appropriate flags based on the comparison and decrementing. Arithmetic, comparison, test, and increment/decrement instructions affect the flags.
Below is a list of branching instructions and the FLAG values necessary to make the branch.
unsigned comparisons
    ja       ( CF = 0 and ZF = 0 )
    jae      ( CF = 0 )
    jb       ( CF = 1 )
    jbe      ( CF = 1 or ZF = 1 )
equality (signed or unsigned)
    je, jz   ( ZF = 1 )
    jne, jnz ( ZF = 0 )
signed comparisons
    jg       ( SF = OF and ZF = 0 )
    jge      ( SF = OF )
    jl       ( SF != OF )
    jle      ( SF != OF or ZF = 1 )
    jo       ( OF = 1 )
    jno      ( OF = 0 )
    js       ( SF = 1 )
    jns      ( SF = 0 )
counter register
    jcxz     ( CX = 0, not available in 64-bit mode )
    jecxz    ( ECX = 0 )
    jrcxz    ( RCX = 0 )
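The choice between the signed and unsigned branches matters because the same bits can mean different numbers. The sketch below compares one value with 1 twice, first as a signed number and then as an unsigned one. The labels and the use of the exit status to report the outcome are illustrative assumptions.

```nasm
; compare.asm (hypothetical file name)
; Build: nasm -felf64 compare.asm && ld compare.o -o compare
        section .text
        global  _start
_start:
        mov     rax, -1                 ; all bits set: -1 signed, 2^64-1 unsigned
        xor     rdi, rdi                ; result starts at 0

        cmp     rax, 1
        jge     skip_signed             ; signed view: -1 < 1, branch not taken
        or      rdi, 1                  ; bit 0 records "less than" (signed)
skip_signed:
        cmp     rax, 1                  ; compare again, since or changed the flags
        jae     skip_unsigned           ; unsigned view: 2^64-1 >= 1, branch taken
        or      rdi, 2                  ; bit 1 would record "below" (skipped)
skip_unsigned:
        mov     rax, 60                 ; exit syscall
        syscall                         ; exit status is 1
```

Only the signed test sees the value as smaller than 1, so `echo $?` should print 1.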
System Calls
There are a few hundred system calls defined in Linux. There is a similar group of them in macOS (but with different values and call requirements). There also exists a defined mechanism for Windows x64 and IA32 to perform basic I/O.
A superb reference for syscalls is the Searchable Linux Syscall Table.
This section is focused on the x86-64 system call table – and only a few syscalls to provide some examples. This model also follows the AMD64 ABI call model, where all calls follow the details of the following table.
Register | Purpose |
---|---|
RAX | System Call Number and Return Value |
RDI | First argument |
RSI | Second argument |
RDX | Third argument |
R10 | Fourth argument |
R8 | Fifth argument |
R9 | Sixth argument |
This is a relatively simple convention to remember. It differs slightly from the AMD64 convention for ordinary function calls, which passes the fourth argument in RCX rather than R10.
Syscalls do not need additional arguments beyond 6. The AMD64 calling convention is further discussed in the next section.
You will notice that RAX is a dual-purpose register. It contains the system call number when the call is made, but it also will have a return value when the call is complete and returns to the caller.
A syscall works much like a function call: individual instructions load the arguments into registers, and the syscall instruction then transfers control to a kernel entry point that performs the requested task.
Write
For example, to write data to a stream, the syscall number is 1 (write), and the arguments are given in the order shown in the function call below:
ssize_t write(int fd, const void *buf, size_t count)
So, in assembly language form, this is:
section .data
hello: db "Hello!"
len: equ $ - hello
section .text
; ...
mov rax, 1 ; write syscall
mov rdi, 1 ; fd (stdout)
mov rsi, hello ; buf
mov rdx, len ; count
syscall
Upon return from the write syscall, the rax register will hold the value of the number of bytes actually written.
Read
Then there is syscall 0 (read) for reading from a data stream based on this:
ssize_t read(int fd, void *buf, size_t count)
Its basic assembly language form looks something like this:
section .bss
line: resb 80
section .text
; ...
mov rax, 0 ; read syscall
mov rdi, 0 ; fd (stdin)
mov rsi, line ; buf
mov rdx, 80 ; count
syscall
Upon return from the read syscall, the rax register will hold the value of the number of bytes read. This number may be less than what you provided in rdx since this is supplied to read as a maximum number.
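Because read reports how many bytes it actually delivered, that count can be handed straight to write. The following sketch echoes one chunk of input back to the terminal. The label finish and the decision to exit quietly on end of input or error are assumptions made for this example.

```nasm
; echo.asm (hypothetical file name)
; Build: nasm -felf64 echo.asm && ld echo.o -o echo
        section .bss
line:   resb    80                      ; input buffer

        section .text
        global  _start
_start:
        mov     rax, 0                  ; read syscall
        mov     rdi, 0                  ; fd (stdin)
        mov     rsi, line               ; buf
        mov     rdx, 80                 ; count (maximum)
        syscall                         ; rax = bytes read, 0 at end of input,
                                        ; negative on error
        cmp     rax, 0
        jle     finish                  ; nothing to echo

        mov     rdx, rax                ; count = bytes actually read
        mov     rax, 1                  ; write syscall
        mov     rdi, 1                  ; fd (stdout)
        mov     rsi, line               ; buf
        syscall
finish:
        mov     rax, 60                 ; exit syscall
        xor     rdi, rdi                ; status 0
        syscall
```

Note that rax must be copied into rdx before it is overwritten with the write syscall number.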
Exit
Of course, we have already seen the exit syscall, number 60:
void exit(int status)
Becomes this:
exit:
mov rax, 60 ; exit
xor rdi, rdi ; status is 0 (success)
syscall
Returning a status value of zero indicates no error when the program is completed. The programmer defines any other value to have whatever meaning they apply to it.
The xor instruction is a faster, smaller way of moving a zero into a register. (You can read the Intel Documentation on optimizing core execution, section 3.5.1.8, if you want to know why.)
Assembly Language and C
Imagine a simple for loop.
for ( x = 0; x < 10; x++ )
printf("%d\n", x);
It is sleek, elegant, and simple to write. Now consider what the same task looks like in the following assembly source.
global main
extern printf
section .text
main:
push rbx ; preserve the caller's rbx (also aligns the stack to 16 bytes)
mov rbx, 0 ; x = 0
loop:
cmp rbx, 10 ;
jge done ; leave the loop unless x < 10
mov rdi, format
mov rsi, rbx
mov rax, 0 ; indicate # of XMM regs
call printf ; printf("%d\n", x)
inc rbx ; x++
jmp loop
done:
pop rbx ; restore the caller's rbx
mov rax, 0 ; return 0
ret
section .data
format: db "%d",0xa,0
Indeed, the program is that long to do the same as the for loop in C. Because we are so close to the CPU, we must write the instructions that precisely describe the process of building and managing a loop. That is roughly half the code, and the other half is the setup and call to printf.
The AMD64 calling convention defines the interactions between the calling and called functions. How this works is the following:
- The calling function owns general purpose registers: rbp, rbx, and r12-r15. The called function must preserve them on the stack to be restored before returning to the caller.
- All other registers belong to the called function. This means the caller is responsible for preserving the contents of any registers they want to keep from being destroyed.
- User applications use rdi, rsi, rdx, rcx, r8 and r9 for the first 6 arguments.
- User applications must pass additional arguments beyond 6 on the stack in reverse order.
- Kernel syscalls use rdi, rsi, rdx, r10, r8 and r9 for up to 6 arguments. That is the limit.
- The kernel destroys rcx and r11 on syscalls.
- Syscalls return results in rax. Errors are indicated by a range of -4095 to -1. This is also -errno, the negated global error value in C.
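The error range can be checked with a single unsigned comparison. The sketch below deliberately calls write with an invalid file descriptor and turns the negative result back into an errno value. The labels failed and quit, the message text, and the use of fd -1 to force a failure are all illustrative choices.

```nasm
; syserr.asm (hypothetical file name)
; Build: nasm -felf64 syserr.asm && ld syserr.o -o syserr
        section .text
        global  _start
_start:
        mov     rax, 1                  ; write syscall
        mov     rdi, -1                 ; fd -1 is never a valid descriptor
        mov     rsi, msg                ; buf
        mov     rdx, len                ; count
        syscall                         ; fails: rax = -EBADF (-9)

        cmp     rax, -4095
        jae     failed                  ; unsigned test: -4095..-1 means error
        xor     rdi, rdi                ; success: exit status 0
        jmp     quit
failed:
        neg     rax                     ; rax = errno value (9 for EBADF)
        mov     rdi, rax
quit:
        mov     rax, 60                 ; exit syscall
        syscall

        section .data
msg:    db      "never printed", 0xa
len:    equ     $ - msg
```

On Linux the failed write returns -9, so `echo $?` after running the program should print 9, the value of EBADF.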
Of course, this is more complex than is noted here. These are simply the finer details that most programmers are interested in knowing.
The AMD64 calling model for Linux is well described in System V Application Binary Interface: AMD64 Architecture Processor Supplement.