(Updated July 22, 2024)
Overview
Methods for programming computers have a long history. This document offers a historical introduction to how systems were programmed and to the complexity involved.
The details here concern the 6502 and 6510 processors; the 6510 was used in the Commodore 64. This processor family was chosen for its simplicity, which makes it possible to present the material meaningfully.
If you are unfamiliar with the various number systems of computing, you may want to read all about them.
Machine Code
Like programming itself, the term machine code may mean different things to different people. Some may imagine old-time punch cards, paper tape, tape drives, or sequences of binary digits arranged just so. These are all technically valid ways of representing machine code. Whatever the medium, though, the program must end up in memory before the CPU can do the work the code describes.
Since memory is typically a contiguous space (no breaks) and our programs are likely also to be contiguous, one possible representation of machine code could be just a stream of bits:
101011100011010001000000101011010011010101000000001000000010101101000000
If we arrange the bits into groups of 8 forming bytes, then we could see it more plainly as:
10101110 00110100 01000000 10101101 00110101 01000000 00100000 00101011 01000000
Each byte in our program has a purpose. Multiple bytes could be combined to make more complex instructions. Of course, binary is tricky to work with. Ben Eater has a great video showing how he programs his primitive breadboard computer using a set of switches to represent the bits in a given memory location. (This also gives away the rest of the show!)
Staring at all the ones and zeroes presents little variation, meaning data entry becomes very clumsy, and errors will creep in. So we could go with decimal:
174 52 64 173 53 64 32 43 64
Indeed, that provides some variation, and we could talk the numbers out to ourselves to put them into the system with potentially fewer errors. However, decimal hides the underlying bit pattern: nothing about 174 suggests 10101110. So hexadecimal (and, in some cases, octal) became the preferred notation, since every hexadecimal digit maps to exactly 4 bits and a byte is always two hexadecimal digits.
AE 34 40 AD 35 40 20 2B 40
Here the first byte, AE, is the digit A (1010) followed by the digit E (1110).
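The conversions above can be checked with a short sketch. It is written in Python rather than 6502 assembly purely for convenience, and the variable names are illustrative only; the bit string is the one shown earlier.

```python
# The 72-bit stream from the text, as one string.
bits = "101011100011010001000000101011010011010101000000001000000010101101000000"

# Split the stream into groups of 8 bits (one byte each).
groups = [bits[i:i + 8] for i in range(0, len(bits), 8)]

# Convert each group to an integer, then render it in decimal and hexadecimal.
values = [int(group, 2) for group in groups]
decimal_text = " ".join(str(v) for v in values)
hex_text = " ".join(f"{v:02X}" for v in values)

print(decimal_text)  # 174 52 64 173 53 64 32 43 64
print(hex_text)      # AE 34 40 AD 35 40 20 2B 40
```

Each hexadecimal digit in the second line corresponds to exactly four characters of the bit string, which is why the hexadecimal form is easier to map back to binary than the decimal one.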
Opcodes and Instructions
Now, we must ask ourselves why we needed all those bits, bytes, and hexadecimal digits. What was the point? Those values represent opcodes for the 6502 CPU. Well, opcodes and data. See, all CPUs only deal with two forms of information – opcodes and data. That’s it. Oh, opcode is short for operation code – the numeric value that represents the action to be performed by the CPU.
Some of those numbers represent opcodes, and the rest are data. So which ones are which? We usually start with an opcode – to begin with, data wouldn’t make any sense without a frame of reference. For the 6502, all opcodes are precisely one byte.
After the one-byte opcode, we can have zero, one, or two bytes of data. The number of bytes of data the opcode needs is baked into it.
So, consider:
AE 34 40
These bytes say to load the X register (byte AE) with the value in memory location $4034 (bytes 34, 40). The CPU knows that opcode AE will perform that task and will need two more bytes to see where the data will be fetched from. Let’s break this down:
- The CPU reads the AE opcode. The X register will be loaded with a value from a memory location (absolute addressing).
- It then reads 34 and 40 from memory. The 6502 stores addresses low byte first, so these two bytes represent the 16-bit address $4034.
- The 8-bit data value in memory location $4034 is then fetched.
- The fetched value is stored in the X register.
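The steps above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a 6502 emulator: the instruction is assumed to sit at $4000, the value $42 stored at $4034 is made up for illustration, and the status flags that a real LDX also updates are left out.

```python
# Hypothetical 64 KB memory; only the bytes of interest are filled in.
memory = bytearray(65536)
memory[0x4000:0x4003] = bytes([0xAE, 0x34, 0x40])  # LDX $4034
memory[0x4034] = 0x42                               # sample value (an assumption)

pc = 0x4000  # program counter
x = 0        # X register

opcode = memory[pc]            # read the opcode
if opcode != 0xAE:
    raise ValueError("this sketch only understands LDX absolute")

low = memory[pc + 1]           # low byte of the address ($34)
high = memory[pc + 2]          # high byte of the address ($40)
address = (high << 8) | low    # low byte comes first in memory
x = memory[address]            # fetch the value and store it in X
pc += 3                        # step past the three-byte instruction

print(f"address=${address:04X} X=${x:02X}")  # address=$4034 X=$42
```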
The opcode we’ve been working with, AE, represents an instruction called LDX. This is a mnemonic shorthand for LoaD X. (There is also a corresponding STX or STore X.) These mnemonic instructions represent the 56 unique operations of the 6502, and we tend to reference them by these names rather than the numeric values.
The three bytes we’ve been looking at are better written as:
LDX $4034
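The same translation can be applied to all nine bytes from earlier. The sketch below is a toy disassembler in Python; its opcode table is deliberately partial, covering only the three opcodes that appear in those bytes, and the function name `disassemble` is an illustrative choice.

```python
# Partial opcode table: opcode -> (mnemonic, number of operand bytes).
# Only the three opcodes used in the nine-byte example are listed.
OPCODES = {
    0xAE: ("LDX", 2),  # LDX absolute
    0xAD: ("LDA", 2),  # LDA absolute
    0x20: ("JSR", 2),  # JSR absolute
}

def disassemble(code):
    """Turn a byte sequence into a list of assembly language lines."""
    lines = []
    i = 0
    while i < len(code):
        mnemonic, size = OPCODES[code[i]]
        operand = code[i + 1:i + 1 + size]
        # Operand bytes are stored low byte first.
        value = int.from_bytes(operand, "little")
        lines.append(f"{mnemonic} ${value:04X}")
        i += 1 + size
    return lines

program = bytes.fromhex("AE 34 40 AD 35 40 20 2B 40")
for line in disassemble(program):
    print(line)
```

Because the operand count is looked up from the opcode, the loop always knows where the next opcode begins, which mirrors how the CPU tells opcodes from data.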
This is where we move into assembly language.
Assembly Language
To make the programmer’s job significantly easier, tools such as monitors and assemblers were created. These gave the programmer a much easier environment in which to write code for a given CPU. No more entering raw opcodes in binary or hexadecimal!
The monitor provided a text-based command-line interface. With it, you could type in pre-written assembly code one line at a time; the monitor would assemble each line, convert it to the proper byte sequence, store it in memory, and prompt for the next line. The monitor did not provide more advanced features such as memory abstraction (i.e., variables) or forward address calculation. The programmer was responsible for knowing the size of each instruction in order to calculate addresses and have the program run correctly.
Consider the following program.
LDX $4034
LDA $4035
JSR $402B
LDX $4036
LDA $4037
JSR $402B
CLC
LDA $4036
ADC $4034
STA $4038
LDA $4037
ADC $4035
STA $4039
LDX $4038
LDA $4039
JSR $BDCD
LDA #$0D
JMP $FFD2
RTS
While this is a valid assembly language program, it requires the programmer to know a great deal about where things are in memory. For example, the program starts at address $4000. There are also three two-byte memory locations set aside for the calculations:
$4034 - $4035 ==> first number
$4036 - $4037 ==> second number
$4038 - $4039 ==> sum of the first and second number
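Each number occupies two bytes, so the CLC/ADC sequence in the program adds them one byte at a time, carrying from the low byte into the high byte. A minimal Python sketch of that idea follows; the function name `add16` and the sample values are illustrative assumptions, not part of the original program.

```python
def add16(a, b):
    """Add two 16-bit numbers one byte at a time, as a CLC/ADC sequence does."""
    a_lo, a_hi = a & 0xFF, a >> 8
    b_lo, b_hi = b & 0xFF, b >> 8

    carry = 0                      # CLC: clear the carry before the low-byte add
    total = a_lo + b_lo + carry    # LDA low byte / ADC low byte
    sum_lo = total & 0xFF          # STA low byte of the sum
    carry = total >> 8             # carry out of the low byte

    total = a_hi + b_hi + carry    # LDA high byte / ADC high byte (carry added in)
    sum_hi = total & 0xFF          # STA high byte of the sum
    return (sum_hi << 8) | sum_lo  # a carry out of the high byte is dropped

print(add16(3650, 1217))  # 4867
```

The carry is cleared only once, before the low bytes are added. The second addition must see whatever carry the first one produced, which is why the program has no second CLC.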
The program is also making use of a subroutine written by the programmer and two subroutines provided by the system ROMs.
$402B ==> prints a number followed by a new line (user subroutine)
$BDCD ==> prints a number (BASIC subroutine)
$FFD2 ==> prints a character (KERNAL subroutine)
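The addresses above follow from the size of each instruction, which the programmer has to add up by hand. The Python sketch below does that bookkeeping for the listing. Its sizing rule is a simplification that happens to hold for this program (no operand is 1 byte, an immediate `#` operand is 2 bytes, anything else is treated as a 3-byte absolute instruction); it would not be correct for zero-page or branch instructions.

```python
LISTING = """\
LDX $4034
LDA $4035
JSR $402B
LDX $4036
LDA $4037
JSR $402B
CLC
LDA $4036
ADC $4034
STA $4038
LDA $4037
ADC $4035
STA $4039
LDX $4038
LDA $4039
JSR $BDCD
LDA #$0D
JMP $FFD2
RTS"""

def instruction_size(line):
    """Size in bytes, using the simplified rule described above."""
    parts = line.split()
    if len(parts) == 1:
        return 1          # implied: opcode only
    if parts[1].startswith("#"):
        return 2          # immediate: opcode + one data byte
    return 3              # absolute: opcode + two address bytes

def layout(listing, start):
    """Return (address, line) pairs and the first address after the program."""
    address = start
    rows = []
    for line in listing.splitlines():
        rows.append((address, line))
        address += instruction_size(line)
    return rows, address

rows, end = layout(LISTING, 0x4000)
for address, line in rows:
    print(f"${address:04X}  {line}")
print(f"first free byte: ${end:04X}")
```

Under these assumptions the `JSR $BDCD` line lands at $402B, the address the earlier `JSR $402B` instructions call, and the first free byte is $4034, where the first number is stored. Inserting or removing a single instruction would shift both, and every affected address would have to be corrected by hand.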
You can quickly see why tools that provide more abstract views of the program are so beneficial. The assembler program knows how to read and convert assembly language (written text) into machine code. There may be other steps involved with getting the program into memory, but the assembler is the first step in getting a program into proper working order.
As in higher-level languages, an assembly language development environment hides most of the bookkeeping discussed so far. Higher-level languages provide abstractions such as variables; assembly language programmers get similar functionality from names and labels. The program can then take a form that is easier to manage, like the following.
*= $4000 ; start at address $4000
define linprt $bdcd
define chrout $ffd2
ldx num1
lda num1+1
jsr printxa
ldx num2
lda num2+1
jsr printxa
adding: clc
lda num2
adc num1
sta sum
lda num2+1
adc num1+1
sta sum+1
; this is printing the sum
prtsum: ldx sum
lda sum+1
printxa: jsr linprt
lda #13
jmp chrout ; rts from call
rts
num1: dcw 3650
num2: dcw 1217
sum: dcw 0
With this form of assembly language, the programmer is freed from the bonds of knowing where everything lives in memory. We use names for everything, and the assembler is burdened with keeping track of where in memory everything will live.
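How an assembler keeps track of labels can be sketched as a first pass that walks the program and records the current address whenever it meets a label. The Python below is a rough illustration, not a real assembler: the sizes are filled in by hand (a real assembler derives them from each instruction and its addressing mode), and `dcw` is assumed to reserve one 16-bit word.

```python
# (label or None, size in bytes) for each line of the labelled listing.
LINES = [
    (None, 3), (None, 3), (None, 3),       # ldx num1 / lda num1+1 / jsr printxa
    (None, 3), (None, 3), (None, 3),       # ldx num2 / lda num2+1 / jsr printxa
    ("adding", 1),                         # clc
    (None, 3), (None, 3), (None, 3),       # lda num2 / adc num1 / sta sum
    (None, 3), (None, 3), (None, 3),       # lda num2+1 / adc num1+1 / sta sum+1
    ("prtsum", 3), (None, 3),              # ldx sum / lda sum+1
    ("printxa", 3),                        # jsr linprt
    (None, 2), (None, 3), (None, 1),       # lda #13 / jmp chrout / rts
    ("num1", 2), ("num2", 2), ("sum", 2),  # dcw: one 16-bit word each (assumed)
]

def assign_labels(lines, origin):
    """First pass: map each label to the address of the line it marks."""
    symbols = {}
    address = origin
    for label, size in lines:
        if label is not None:
            symbols[label] = address
        address += size
    return symbols

symbols = assign_labels(LINES, 0x4000)
for name, address in symbols.items():
    print(f"{name:8} = ${address:04X}")
```

With an origin of $4000, this yields $402B for `printxa` and $4034, $4036, and $4038 for `num1`, `num2`, and `sum`, the same addresses the hand-written version had to hard-code. A second pass would then substitute these values wherever the names are used.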
Hopefully, the reader now appreciates the advent of higher-level languages and how they’ve allowed us to develop programs more quickly and with fewer errors than writing in assembly language.