Assembly Language With NASM – Programming by Design

(Updated October 13, 2024)

Assembly Language and C

Overview

This tutorial explains some of the basics of assembly language for the x86_64. It focuses primarily on the 64-bit programming model and on calling subroutines on the Linux platform. This is not the same as working in the 32-bit environment and is significantly different from working on the Windows platform. While the concepts presented here apply to some degree on macOS, the differences are enough that you cannot simply drop the code on a Mac and expect things to work as described here.

Additional reading

The full NASM documentation can be found here: https://www.nasm.us/doc/

Assembly language is a set of mnemonic instructions – symbols that represent processor operation codes, or opcodes. It is not the same as writing in Java, C, or Python. It is much more primitive. Languages like C are compiled to these processor instructions, and the result runs directly on the processor. Other languages like Java are translated to bytecode, a similar low-level representation that is not executed by the CPU itself but by a Java Virtual Machine (JVM) running the Java class.

Important Note!

This means that you will work at a level directly beneath the programming level that C provides and well below the level of the Java execution environment. Remember that this is a primitive expression of work to be done in its simplest form. This means everything is truly done step by step.

So what is the difference? The C program will run fast since it is built more closely to the CPU. However, the C program is not portable and will likely need to be recompiled on a new platform. The Java program is intended to be portable, hence the JVM, but will run a bit slower than the C equivalent. There is always a trade-off.

An assembly language program is assembled by a program called the assembler. The result will be an object file that can then be linked with other object files and libraries to produce a complete program.

The code examples are designed to be assembled with the NASM assembler in 64-bit mode on a Linux platform. They can be adjusted to run on Mac and Windows. However, the details of such adjustments are well beyond the scope of this introduction. You can assemble these programs using a few simple steps.

Assemble - use the nasm command to convert the assembly language source to an object file.
Link - use the ld command to combine the object file with any other object files and libraries to produce an executable.
Run - invoke the program to view the results.

The following is a Linux shell transcript of the steps to assemble, link, and run the example presented in the next section. (This assumes some editor was previously used to create the hello.asm file.)

$ nasm -felf64 hello.asm
$ ls -la hello*
-rw-rw-r-- 1 student student 291 Sep 19 07:46 hello.asm
-rw-rw-r-- 1 student student 912 Sep 19 07:49 hello.o
$ ld hello.o -o hello
$ ls -la hello*
-rwxrwxr-x 1 student student 8952 Sep 19 07:49 hello
-rw-rw-r-- 1 student student  291 Sep 19 07:46 hello.asm
-rw-rw-r-- 1 student student  912 Sep 19 07:49 hello.o
$ ./hello
Hello, World!
$

The steps above, and many of the pieces hidden from view, are noted in the diagram below:

Illustration 1: Assembly language program flow to become executable.

This charts the flow of your assembly source program through the assembler to produce an object file (and an optional list file). The linker often combines this with other object files and libraries (static and shared) to produce an executable program. To run this program, the operating system uses a loader, whose sole responsibility is to prepare and load the program into memory so that it is ready to run.

Hello World

The idea behind assembly language is to provide the programmer with instructions (opcodes) and various addressing models to move data and perform a host of operations on it. However, as mentioned earlier, it is primitive.

Assembly language source lines are generally made up of the following:

label:     instruction    operand(s)     ; comment

You may also see them as:

label:
  instruction    operand(s)     ; comment

Consider this simple assembly language program that prints the obligatory "Hello, World!".

  section .data

hello:   db "Hello, World!",0xa
len:     equ $ - hello

  section .text
  global _start

_start:
  mov rax, 1     ; write syscall
  mov rdi, 1     ; stdout
  mov rsi, hello ; text
  mov rdx, len   ; length
  syscall

exit:
  mov rax, 60    ; exit
  mov rdi, 0     ; return code
  syscall

Name	Example	Purpose
Labels	`hello:`, `_start:`, `exit:`	A named location in the program. These are used instead of explicit addresses to represent positions in the code or data. This allows the programmer to not worry too much about memory locations. Note that labels alone on a line ought to have a colon.
Instructions	`mov`, `syscall`	Assembly language instructions. These are the named actions the CPU is to perform with any provided operands.
Operands	`rdx, len` and `rsi, hello`	Provides the instruction with the information to work with, where appropriate. (Not all instructions have operands, while some have 1 or 2, or even 3.)
Directives	`global`, `extern`	Help to inform the assembler about some of the labels, sections, and external entities.
Sections	`.text`, `.data`, `.bss`	Informs the assembler when a new section has begun so the information can be placed into the correct memory locations at runtime. `.text` – this is where your program instructions live. `.bss` – this is for uninitialized data (your variables). `.data` – this is for initialized, often constant, data.

Syntax

The syntax of the source code listed here follows the Intel model. There is an alternative version known as the AT&T model. While both produce the same machine code, the Intel version is arguably easier to understand and learn. The AT&T version is the one used by GNU’s gas and Mac’s as assemblers.

For example:

hello.s (AT&T – gas, as)

.section .data

hello: .ascii "Hello, World!\n"
len = . - hello

.section .text
.global _start

_start:
  mov $1, %rax     # write syscall
  mov $1, %rdi     # stdout
  mov $hello, %rsi # text
  mov $len, %rdx   # length
  syscall

exit:
  mov $60, %rax    # exit
  xor %rdi, %rdi   # return code
  syscall

hello.asm (Intel – NASM)

section .data

hello: db "Hello, World!",0xa
len equ $ - hello

section .text
global _start

_start:
  mov rax, 1     ; write syscall
  mov rdi, 1     ; stdout
  mov rsi, hello ; text
  mov rdx, len   ; length
  syscall

exit:
  mov rax, 60    ; exit
  mov rdi, 0     ; return code
  syscall

Instructions

The work done by the CPU is strictly based on the instructions provided by the programmer. Essentially, each instruction performs one primitive operation, which includes moving data, performing arithmetic, making decisions, and branching to new locations in the program.

These are some general classifications of instruction. The most common are noted with a few instructions of that type.

Binary arithmetic - signed and unsigned integer math, binary coded decimal along with logical and bit shift operations. (add, sub, imul)
Logical and shift/rotate - used to manipulate bits. (and, xor, not, sal, shl, ror)
Floating point - support for many forms of numeric representation. (fsubr, fdivr)
Data transfer - move information from place to place. (mov, xchg, push, pop)
Control transfer - branching and subroutine calls. (cmp, jmp, jne, call, ret)
String - Move, compare and scan strings. (movs, cmps, scas, lods)
Flag control - alter the state of the EFLAGS register. (stc, clc, sti, cli, pushf, popf)

There are many different categories of instructions, and there are so many instructions that they cannot be listed in a small tutorial such as this. So, we will present some standard instructions and links to additional documentation.

Instruction	Example	Outcome
`mov`	`mov rax, rdx`	Moves value in `rdx` into `rax`.
`add`	`add ebx, eax`	Perform `ebx` + `eax` and store back into `ebx`.
`cmp`	`cmp ax, 0`	Compares `ax` to zero by subtracting and setting flags as appropriate.
`xchg`	`xchg eax, [data]`	Exchanges the 32-bit values held in `eax` and at memory location `data`.
`sub`	`sub [sum], rax`	Subtract 64-bit `rax` from value at `sum` placing result at `sum`.
`inc`	`inc rbx`	Increment value in `rbx` by one.
`dec`	`dec dx`	Decrement value in `dx` by one.
`xor`	`xor rax, rax`	Performs exclusive-or of 1st operand with 2nd operand placing the results in 1st operand. This example zeroes the `rax` register.
`syscall`	`syscall`	Invokes a privileged OS system call handler on behalf of the calling program.

x86 and amd64 Instructions
Intel Developer Reference
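
As a small illustration of the table above, the following sketch chains several of these instructions together and reports the result as the process exit status. The file name arith.asm and the labels sum, fail, and quit are made up for this example, and it assumes the same nasm -felf64 and ld steps shown earlier.

```nasm
; arith.asm - illustrative only; names are hypothetical.
; Build:  nasm -felf64 arith.asm && ld arith.o -o arith

  section .data
sum:     dq 100           ; a 64-bit value in memory

  section .text
  global _start

_start:
  mov  rax, 30            ; rax = 30
  mov  rdx, rax           ; rdx = 30
  add  rax, rdx           ; rax = 60
  sub  [sum], rax         ; sum = 100 - 60 = 40
  inc  rax                ; rax = 61
  dec  rdx                ; rdx = 29
  xchg rax, [sum]         ; rax = 40, sum = 61
  xor  rdx, rdx           ; rdx = 0
  add  rdx, rax           ; rdx = 40
  cmp  rdx, 40            ; 40 - 40 = 0, so the zero flag is set
  jne  fail               ; not taken

  mov  rdi, rdx           ; exit status = 40
  jmp  quit
fail:
  mov  rdi, 1             ; exit status = 1 if the check above failed
quit:
  mov  rax, 60            ; exit syscall
  syscall
```

After running ./arith, the shell command echo $? should print 40.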

Operands

Register

Registers are internal, named locations of the CPU that can hold values. A register can be loaded with a constant, with data from memory, or with the value of another register.

In the beginning, when the 8086 was new, the traditional names for the registers were:

AX - accumulator - this is where the majority of computation occurs.
BX - base register - this is used as an offset into other memory locations.
CX - counter register - can be combined with the loop instruction as a loop-control variable.
DX - data register - can hold the overflow of arithmetic operations and be used in I/O operations.
SP - stack pointer - holds the address of the top of the stack, that is, the most recently pushed value.
BP - base pointer - also used with the stack pointer to access parameters and local variables.
SI - source index - initially used for indirect addressing, can also be used with string operations.
DI - destination index - initially used for indirect addressing, can also be used with string operations.

SS - Stack Segment - Segment relative to the stack pointer (SP).
ES - Extra Segment - Alternate segment for other data references - often for strings and used with SI and DI.
DS - Data Segment - Segment relative to all unqualified memory references.
CS - Code Segment - Segment relative to the instruction pointer (IP).
IP - instruction pointer - indicates the location of the next instruction to be executed.
FLAGS - flags register - point in time state of the CPU in a series of bit indicators.

As the x86 evolved, the 16-bit AX register was extended to the 32-bit EAX register and then to today’s 64-bit RAX register. The same is true for the other general-purpose, pointer, and index registers.

Eventually, more registers were added to complement the number of values that may be in flight in a given program allowing the CPU to keep more values on-chip rather than make many requests to/from memory. This allows for speed-up of programs as well since the fastest memory exists within the CPU itself.

The chart below lists some of the most common registers. You will notice that specific registers have more than one name.

(64-bit registers)
|---- Dual Named Registers ---|
 R0  R1  R2  R3  R4  R5  R6  R7  R8  R9  R10  R11  R12  R13  R14  R15
RAX RCX RDX RBX RSP RBP RSI RDI

(32-bit registers)
|---- Dual Named Registers ---|
R0D R1D R2D R3D R4D R5D R6D R7D R8D R9D R10D R11D R12D R13D R14D R15D
EAX ECX EDX EBX ESP EBP ESI EDI

(16-bit registers)
|---- Dual Named Registers ---|
R0W R1W R2W R3W R4W R5W R6W R7W R8W R9W R10W R11W R12W R13W R14W R15W
AX  CX  DX  BX  SP  BP  SI  DI

(8-bit registers) low-order bits of AX, CX, DX, BX
|---- Dual Named Registers ---|
R0B R1B R2B R3B R4B R5B R6B R7B R8B R9B R10B R11B R12B R13B R14B R15B
AL  CL  DL  BL  SPL BPL SIL DIL

(8-bit registers) high-order bits of AX, CX, DX, BX
AH  CH  DH  BH

Register layout based on capacity.

|--------------------------- RAX & R0 ------------------------------|  (same as dq and resq)
                                 |----------- EAX & R0D ------------|  (same as dd and resd)
                                                  |----AX & R0W ----|  (same as dw and resw)
---------------------------------------------------------------------
|            32 bits             |     16 bits    | 8 bits | 8 bits |
---------------------------------------------------------------------
                                                  |-- AH --|--AL &--|  (each are the same 
                                                           |-- R0B--|   as db and resb)

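Specific registers are available only in certain modes. The R registers are only available if the CPU is operating in 64-bit mode. In that mode, it also provides the remaining traditional names for 32- and 16-bit software. Note that NASM accepts the numeric names for the first eight registers (R0 through R7) only after the altreg macro package is loaded with %use altreg; the traditional names and R8 through R15 need no extra setup.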
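
The layout above can be observed directly. This sketch (the file name regs.asm is hypothetical) writes to progressively wider names of the same register. It relies on one rule the chart does not show: on x86-64, writing a 32-bit register such as eax clears the upper 32 bits of the full 64-bit register, while 8- and 16-bit writes leave the remaining bits untouched.

```nasm
; regs.asm - illustrative only.
; Build:  nasm -felf64 regs.asm && ld regs.o -o regs

  section .text
  global _start

_start:
  mov  rax, -1            ; rax = 0xFFFFFFFFFFFFFFFF
  mov  al, 0x12           ; low 8 bits only:  rax = 0xFFFFFFFFFFFFFF12
  mov  ah, 0x34           ; bits 8-15 only:   rax = 0xFFFFFFFFFFFF3412
  mov  ax, 0x5678         ; low 16 bits only: rax = 0xFFFFFFFFFFFF5678
  mov  eax, 7             ; 32-bit write zeroes the upper half: rax = 7

  mov  rdi, rax           ; exit status = 7
  mov  rax, 60            ; exit syscall
  syscall
```

If the 32-bit write did not clear the upper bits, the exit status would not be 7; echo $? after running the program shows the result.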

Memory

To declare memory in your programs, you must first decide whether it belongs in the .data or the .bss section. Items in .data can be initialized, but those in .bss cannot; they only reserve space.

Size in Bytes (Bits)	.data	.bss	Equivalent
1 (8)	db	resb	AH, R0B
2 (16)	dw	resw	AX, R0W
4 (32)	dd	resd	EAX, R0D
8 (64)	dq	resq	RAX, R0

  section .data
hello:   db   "Hello!"      ; a string
len:     equ  $ - hello     ; the calculated length of the string

  section .bss
result:  resq 1   ; allocate 1 quadword for results.

  section .text

  ; ...
  mov  rax, 32        ; rax is now 32
  add  rax, 5         ; add 5 to rax (37)
  sub  rax, 10        ; subtract 10 from rax (27)
  mov  [result], rax  ; store rax into result

The last line is important. It is a form of direct addressing, where the address at which to store the data is encoded in the instruction. In NASM, a bare label stands for the address itself, as in

mov rax, result

which loads the address of result into rax. We place square brackets around the label to indicate that we mean the value at that address, not the address itself.
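
To make the distinction concrete, here is a minimal sketch that uses the same label with and without brackets. The labels value and result and the file name mem.asm are assumptions made for illustration.

```nasm
; mem.asm - illustrative only.
; Build:  nasm -felf64 mem.asm && ld mem.o -o mem

  section .data
value:   dq 42            ; initialized quadword

  section .bss
result:  resq 1           ; reserved, uninitialized quadword

  section .text
  global _start

_start:
  mov  rax, value         ; no brackets: rax = address of value
  mov  rbx, [value]       ; brackets: rbx = 42, the contents at that address
  mov  rcx, [rax]         ; rcx = 42 as well, read through the address in rax
  add  rbx, rcx           ; rbx = 84
  mov  [result], rbx      ; store 84 at result

  mov  rdi, [result]      ; exit status = 84
  mov  rax, 60            ; exit syscall
  syscall
```

Running the program and then echo $? should print 84.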

Addressing Modes

Implied

nop

Immediate

mov rax, 27

Register

mov rax, rdx

Direct/Displacement

mov rax, [result]

Register Indirect

mov rbx, result
mov rax, [rbx]

After these, they get a little more complex.

Based

The following examples are methods to indirectly manage data in a memory location. The square brackets indicate that the value contained inside is not the data but rather where the data lives. They are the equivalent of pointers to the data.

[ disp ]
[ reg ]
[ reg + reg * scale ]
[ reg + disp ]
[ reg + reg * scale + disp ]

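Any general-purpose register can be used for reg, except that rsp cannot be the scaled (index) register. The value for disp is a constant displacement, such as the address of a label or a fixed offset, that is added to the register values to form the final address.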

The scale value can be 1, 2, 4, or 8, representing the number of bytes. This is often used to move down to the next position of an array of elements of a given size (for example, 4 for int, 8 for long).

Some examples are shown below:

mov dx, [rbx]
mov [intp], ecx
sub dword [rax + 100], 32768    ; dword: the operand size must be stated
xchg cx, [rsi + rax*4]
add rax, [rsi + rcx*8 + 100]
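
The scaled form is easiest to see when walking an array. The following sketch sums four quadwords using [ reg + reg * scale ] with a scale of 8. The labels array, count, and next and the file name sum.asm are invented for this example.

```nasm
; sum.asm - illustrative only.
; Build:  nasm -felf64 sum.asm && ld sum.o -o sum

  section .data
array:   dq 3, 5, 7, 11           ; four 64-bit elements
count:   equ ($ - array) / 8      ; number of elements (4)

  section .text
  global _start

_start:
  mov  rsi, array         ; base register = address of the array
  xor  rcx, rcx           ; index = 0
  xor  rax, rax           ; running sum = 0

next:
  add  rax, [rsi + rcx*8] ; sum += array[index]; each element is 8 bytes
  inc  rcx                ; index++
  cmp  rcx, count
  jl   next               ; repeat while index < count

  mov  rdi, rax           ; exit status = 3 + 5 + 7 + 11 = 26
  mov  rax, 60            ; exit syscall
  syscall
```

For an array of 4-byte elements declared with dd, the same loop would use a scale of 4 and a 32-bit register such as eax for the sum.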

Immediate

Immediate operands are essentially constants. They are also the values of labels within the program. So these could be any data value as a constant expressed within the program or a label representing a location in the data section.

Some examples are shown below:

section .text
  cmp rbx, 10
  mov cx, 12
  add edx, 100
  mov esi, hello
  ; ...
  ret

hello db "Hello!"

FLAGS

The flags register is the CPU’s way of keeping track of certain events due to executing instructions. Some are used to set the CPU state for specific operations – privileged and non-privileged.

Many of the flags present can be safely ignored by the casual assembly language programmer since they are used for specific purposes other than program control. Let us begin by examining the details of the RFLAGS register.

The visual below represents the 64 bits of the RFLAGS, which also contains the 32-bit EFLAGS and the traditional 16-bit FLAGS.

 6                              3 3322222222221111 111111
 3                              2 1098765432109876 5432109876543210
--------------------------------------------------------------------
|           RESERVED             |-RESERVED-IVVAVR|-NIOODITSZ-A-P-C|
--------------------------------------------------------------------

|------------------------------RFLAGS------------------------------|
                                 |--------------EFLAGS-------------|
                                                  |-----FLAGS------|

Bit	Name	Clear/Set	Purpose
0	CF	nc/cy (no carry/carry)	Carry Flag. This is set when an unsigned arithmetic operation produces a carry out of the most significant bit (addition) or needs a borrow (subtraction), and is cleared otherwise.
1	RESERVED
2	PF	po/pe (parity odd/parity even)	Parity Flag. This is set when the low byte of the result contains an even number of 1 bits.
3	RESERVED
4	AF	na/ac (no aux carry/aux carry)	Auxiliary Carry Flag.
5	RESERVED
6	ZF	nz/zr (not zero/zero)	Zero Flag. This flag is set when the result of an arithmetic, logical, or comparison instruction is zero and cleared when it is not. (Comparisons are generally performed as subtractions.) Simply moving a zero into a register with mov does not change it. This flag is integral to branching instructions.
7	SF	pl/ng (plus/negative)	Sign Flag. This is set if the result of an operation is negative.
8	TF		Trap Flag. This is generally used for debuggers to step through code.
9	IF	di/ei (disable interrupts/enable interrupts)	Interrupt Enable Flag. When set, maskable hardware interrupts are enabled. User programs generally do not modify it.
10	DF	up/dn (up/down)	Direction Flag. This is typically used with the SI and DI registers and string instructions. Clear indicates movement up toward higher memory.
11	OF	nv/ov (no overflow/overflow)	Overflow Flag. This is set if a signed arithmetic operation produces a result too large or too small to fit in the destination.
12-13	IOPL		I/O Privilege Level. This is used to calculate the current privilege state of an executing program.
14	NT		Nested Task.
15	RESERVED		RESERVED
16	RF		Resume Flag.
17	VM		Virtual Mode. Represents the Virtual 8086 mode. This is a compatibility mode.
18	AC		Alignment Check / Access Control. This is generally set if alignment checking for memory references will be performed.
19	VIF		Virtual Interrupt Flag.
20	VIP		Virtual Interrupt Pending.
21	ID		ID flag. If a program is able to set and clear this flag, the CPU supports the CPUID instruction.
22-31	RESERVED		RESERVED
32-63	RESERVED		RESERVED

So, why are the flags so important? Well, we have to consider how decisions are made. When we are comparing a value and then jumping to a new location based on the result, these make all the difference.

Here are two very simple tests:

cmp rax, 35
jle otherlocation

Which means

if rax <= 35
  goto otherlocation

and

loop:
  mov eax, 45
  ; ...
  dec ebx
  jnz loop

Which means

loop:
  eax = 45
  ;...
  ebx--
  if ebx != 0
    goto loop

The branching instructions jle and jnz can do their task based on the instruction before. Those instructions set the appropriate flags based on the comparison and decrementing. Arithmetic, comparison, test, and increment/decrement instructions affect the flags.

Below is a list of branching instructions and the FLAG values necessary to make the branch.

unsigned comparisons
ja   ( CF = 0 and ZF = 0 )
jae  ( CF = 0 )
jb   ( CF = 1 )
jbe  ( CF = 1 or ZF = 1 )

equality (signed or unsigned)
je, jz   ( ZF = 1 )
jne, jnz ( ZF = 0 )

signed comparisons
jg   ( SF = OF and ZF = 0 )
jge  ( SF = OF )
jl   ( SF != OF )
jle  ( SF != OF or ZF = 1 )
jo   ( OF = 1 )
jno  ( OF = 0 )
js   ( SF = 1 )
jns  ( SF = 0 )

counter register
jcxz   ( CX = 0, not available in 64-bit mode )
jecxz  ( ECX = 0 )
jrcxz  ( RCX = 0 )
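
The difference between the signed and unsigned branches shows up when the same bit pattern is compared both ways. In this sketch, rax holds -1, which is also the largest unsigned 64-bit value. The labels and the file name flags.asm are hypothetical, and the exit status is used only as a convenient way to report which branches were taken.

```nasm
; flags.asm - illustrative only.
; Build:  nasm -felf64 flags.asm && ld flags.o -o flags

  section .text
  global _start

_start:
  xor  rdi, rdi           ; rdi collects the results, starting at 0
  mov  rax, -1            ; 0xFFFFFFFFFFFFFFFF

  cmp  rax, 1             ; computes rax - 1 and sets the flags
  jl   signed_less        ; taken: SF != OF, since -1 < 1 when signed
  jmp  unsigned_test
signed_less:
  or   rdi, 1             ; record "signed less than"

unsigned_test:
  cmp  rax, 1             ; compare again, because or changed the flags
  ja   unsigned_above     ; taken: CF = 0 and ZF = 0, the unsigned value is huge
  jmp  quit
unsigned_above:
  or   rdi, 2             ; record "unsigned above"

quit:
  mov  rax, 60            ; exit syscall, status = 1 + 2 = 3
  syscall
```

An exit status of 3 means both branches were taken: the value is less than 1 as a signed number and above 1 as an unsigned number.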

System Calls

There are a few hundred system calls defined in Linux. macOS has a similar set (but with different values and call requirements). Windows x64 and IA-32 also define their own mechanisms for performing basic I/O.

A superb reference for syscalls is the Searchable Linux Syscall Table.

This section focuses on the x86-64 system call table, using only a few syscalls as examples. Syscalls follow the register conventions of the AMD64 ABI, which are listed in the following table.

Register	Purpose
RAX	System Call Number and Return Value
RDI	First argument
RSI	Second argument
RDX	Third argument
R10	Fourth argument
R8	Fifth argument
R9	Sixth argument

This is relatively simple to remember. It differs slightly from the AMD64 convention for ordinary function calls, which passes the fourth argument in RCX rather than R10.

Important Note!

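When making a syscall, the kernel preserves every register except RAX (which receives the return value), RCX, and R11. If you have data in RCX or R11, you are responsible for saving it before making the call. When calling ordinary functions the rules are stricter: only RBX, RSP, RBP, and R12–R15 are preserved.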

Syscalls take at most 6 arguments. The AMD64 calling convention is further discussed in the next section.

You will notice that RAX is a dual-purpose register. It contains the system call number when the call is made, but it also will have a return value when the call is complete and returns to the caller.

A syscall works like a function call: you load the arguments into registers one instruction at a time, and the syscall instruction then transfers control to a kernel entry point that performs the requested task.

Write

For example, to write data to a stream, the syscall number is 1 (write), and the arguments are given in the order shown in the function call below:

ssize_t write(int fd, const void *buf, size_t count)

So, in assembly language form, this is:

  section .data
hello:   db   "Hello!"
len:     equ  $ - hello

  section .text

  ; ...
  mov rax, 1     ; write syscall
  mov rdi, 1     ; fd (stdout)
  mov rsi, hello ; buf
  mov rdx, len   ; count
  syscall

Upon return from the write syscall, the rax register will hold the value of the number of bytes actually written.
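
The return value also reports failure: a negative value in rax is a negated error number. The sketch below deliberately writes to file descriptor 99, which is assumed not to be open, and turns the error into the exit status. The labels ok and quit and the file name werr.asm are hypothetical.

```nasm
; werr.asm - illustrative only; assumes fd 99 is not open.
; Build:  nasm -felf64 werr.asm && ld werr.o -o werr

  section .data
hello:   db   "Hello!", 0xa
len:     equ  $ - hello

  section .text
  global _start

_start:
  mov  rax, 1             ; write syscall
  mov  rdi, 99            ; fd 99 (assumed closed, so the call should fail)
  mov  rsi, hello         ; buf
  mov  rdx, len           ; count
  syscall

  cmp  rax, 0
  jge  ok                 ; zero or positive: number of bytes written
  neg  rax                ; negative: rax was -errno, so rax is now errno
  mov  rdi, rax           ; exit status = errno
  jmp  quit
ok:
  xor  rdi, rdi           ; exit status = 0
quit:
  mov  rax, 60            ; exit syscall
  syscall
```

When the descriptor really is closed, the kernel reports EBADF (bad file descriptor), which is error number 9 on Linux, so echo $? prints 9.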

Read

Then there is syscall 0 (read) for reading from a data stream based on this:

ssize_t read(int fd, void *buf, size_t count)

Its basic assembly language form looks something like this

  section .bss
line:   resb  80

  section .text

  ; ...
  mov rax, 0     ; read syscall
  mov rdi, 0     ; fd (stdin)
  mov rsi, line  ; buf
  mov rdx, 80    ; count
  syscall

Upon return from the read syscall, the rax register will hold the value of the number of bytes read. This number may be less than what you provided in rdx since this is supplied to read as a maximum number.
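
Putting read and write together gives a one-line echo program. This is a minimal sketch, not a robust implementation: it reads at most 80 bytes once and writes back exactly the number of bytes that read reported. The label quit and the file name echo.asm are assumptions.

```nasm
; echo.asm - illustrative only.
; Build:  nasm -felf64 echo.asm && ld echo.o -o echo

  section .bss
line:    resb 80          ; input buffer

  section .text
  global _start

_start:
  mov  rax, 0             ; read syscall
  mov  rdi, 0             ; fd (stdin)
  mov  rsi, line          ; buf
  mov  rdx, 80            ; count (maximum bytes to read)
  syscall

  cmp  rax, 0
  jle  quit               ; 0 = end of input, negative = error

  mov  rdx, rax           ; count = bytes actually read
  mov  rax, 1             ; write syscall
  mov  rdi, 1             ; fd (stdout)
  mov  rsi, line          ; buf
  syscall

quit:
  mov  rax, 60            ; exit syscall
  xor  rdi, rdi           ; status 0
  syscall
```

Note that the value returned by read is copied into rdx before rax is reloaded with the write syscall number; otherwise the byte count would be lost.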

Exit

Of course, we have already seen the exit syscall, number 60:

void exit(int status)

Becomes this:

exit:
  mov rax, 60    ; exit
  xor rdi, rdi   ; status is 0 (success)
  syscall

Returning a status value of zero indicates no error when the program completes. Any other value has whatever meaning the programmer assigns to it.

The xor instruction is a faster, smaller way of moving a zero into a register. (You can read the Intel Documentation on optimizing core execution, section 3.5.1.8, if you want to know why.)

Assembly Language and C

Imagine a simple for loop.

for ( x = 0; x < 10; x++ )
  printf("%d\n", x);

It is sleek, elegant, and simple to write. Now consider what the same task looks like in the following assembly source.

  global main
  extern printf

  section .text

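main:
  push rbx        ; save the caller's rbx; also aligns the stack for the call
  mov rbx, 0      ; x = 0

loop:
  cmp rbx, 10     ; compare x with 10
  jge done        ; leave the loop unless x < 10
  mov rdi, format
  mov rsi, rbx
  mov rax, 0      ; indicate # of XMM regs
  call printf     ; printf("%d\n", x)
  inc rbx         ; x++
  jmp loop

done:
  pop rbx         ; restore the caller's rbx
  mov rax, 0      ; return 0
  ret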
  
  section .data

format:   db "%d",0xa,0

Indeed, the program is that long to do the same as the for loop in C. Because we are so close to the CPU, we must write the instructions that precisely describe the process of building and managing a loop. That is roughly half the code, and the other half is the setup and call to printf.

The AMD64 calling convention defines the interactions between the calling and called functions. It works as follows:

The calling function owns the general-purpose registers rbp, rbx, and r12 - r15. If the called function uses any of them, it must save them (typically on the stack) and restore them before returning to the caller.
All other registers belong to the called function. This means the caller is responsible for preserving the contents of any of those registers it wants to keep from being destroyed.
User applications use rdi, rsi, rdx, rcx, r8, and r9 for the first 6 arguments.
User applications must pass additional arguments beyond 6 on the stack in reverse order.
Kernel syscalls use rdi, rsi, rdx, r10, r8, and r9 for up to 6 arguments. That is the limit.
The kernel destroys rcx and r11 on syscalls.
Syscalls return results in rax. Errors are indicated by a value in the range -4095 to -1. This value is -errno, the negation of the global error value in C.

Of course, this is more complex than is noted here. These are simply the finer details that most programmers are interested in knowing.

The AMD64 calling model for Linux is well described in the System V Application Binary Interface: AMD64 Architecture Processor Supplement.
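
The register rules can be tried without the C library. In this sketch, a hypothetical function scale_add takes three arguments in rdi, rsi, and rdx, returns its result in rax, and saves and restores rbx because that register belongs to the caller. The function name, its behavior, and the file name call.asm are invented for illustration.

```nasm
; call.asm - illustrative only.
; Build:  nasm -felf64 call.asm && ld call.o -o call

  section .text
  global _start

; long scale_add(long a, long b, long c): returns a * b + c
; a arrives in rdi, b in rsi, c in rdx; the result is returned in rax.
scale_add:
  push rbx                ; rbx belongs to the caller, so save it first
  mov  rbx, rdi           ; rbx = a
  imul rbx, rsi           ; rbx = a * b
  lea  rax, [rbx + rdx]   ; rax = a * b + c
  pop  rbx                ; restore the caller's value before returning
  ret

_start:
  mov  rbx, 5             ; a value the caller wants to survive the call
  mov  rdi, 2             ; first argument
  mov  rsi, 3             ; second argument
  mov  rdx, 4             ; third argument
  call scale_add          ; rax = 2 * 3 + 4 = 10
  add  rax, rbx           ; rbx is still 5, so rax = 15

  mov  rdi, rax           ; exit status = 15
  mov  rax, 60            ; exit syscall
  syscall
```

If scale_add skipped the push and pop, rbx would come back as 6 and the exit status would be 16 instead of 15, which is exactly the kind of bug the convention is meant to prevent.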
