C64 BASIC Tokenizing – Programming by Design

(Updated November 20, 2024)

Overview

The process of tokenizing a BASIC program converts the text to a sequence of compressed codes mixed with ASCII (PETSCII) characters. The idea is to have the “compiling” of the code completed and work within an intermediate representation that:

Reduces the space occupied by the program.
Speeds interpretation of the program.
Removes the need for repeated tokenizing of the same lines as the program progresses.

This is not uncommon today. Java and Python convert to an intermediate form called Bytecode.

Memory

On the C64, memory was a limiting factor. After the system was up and running, only 38,911 bytes were free. For most users at the time, this was more than sufficient. It was brilliant to store the program in memory in tokenized form. As it turns out, the program was also written to disk in the same form. When loaded back into memory, it was ready to go.

Each BASIC program line was stored in memory and connected to the other lines via a linked list. This was critical to be able to find lines in the program quickly. As you can imagine, since the linked list is O(n), the more lines in the program, the slower the program could become if it frequently jumped around in the code.

Given a program like the following:

10 REM TEST
20 FOR X = 1 TO 10
30 PRINT X
40 NEXT X

The complete memory layout looked like this:

Each boxed little-endian value is the address of the next line in the chain. The two bytes following the address are the line number for that BASIC statement.

         BASIC Line                      BASIC Line                 EOF
+--------------------------+    +--------------------------+    +---------+
| addr | line # | ... | 00 |    | addr | line # | ... | 00 |    | 00 | 00 |
+--------------------------+    +--------------------------+    +---------+
    |                            ^   |                            ^
    |                            |   |                            |
    +----------------------------+   +----------------------------+

It’s important to note that the ellipses in the above diagram pertain to something like the following memory range. Using the previous memory dump as a guide:

0x0805 to 0x080B has bytes 0x8F (the tokenized form of REM) followed by 0x20 (space). The next 4 bytes are TEST, followed by 0x00, indicating the end of the line.

So the line

10 REM TEST

is encoded with the bytes

0x0A 0x00 0x8F 0x20 0x54 0x45 0x53 0x54 0x00

The Linked List

Now, let’s explain how this works. On the C64, all tokenized BASIC programs are loaded into memory location 0x0801. The first value is the (little-endian) address of the next line in the chain. Repeating the picture below:

We can now list the addresses and how they connect to form the linked list.

0x0801 has address 0x080C followed by 0x00A0 (line 10).
0x080C has address 0x081D followed by 0x0014 (line 20).
0x081D has address 0x0825 followed by 0x001E (line 30).
0x0825 has address 0x082D followed by 0x0028 (line 40).
0x082D has address 0x0000, the end of the linked list (the null pointer).

This is how the BASIC interpreter moves through the tokenized program in memory. When the user entered the RUN command, the interpreter started at 0x0801 and moved through the linked list as needed. Consider when it encountered a line like

40 GOTO 85

The interpreter would start at 0x0801 and look at the line numbers to find the one that was the target of the GOTO. In this case, it would move through the linked list until it found line 85 or throw an error that the line was not found.

While imagining how this would work for the interpretive phase was easy, remember that this environment was also the editor. This means that we could insert a line of code between two existing lines, and the tokenizer would have to adjust the in-memory program to accommodate that change.

So, if we introduced line 25:

10 REM TEST
20 FOR X = 1 TO 10
25 PRINT " X IS ";
30 PRINT X
40 NEXT X

Memory would be adjusted like so:

0x0801 has address 0x080C followed by 0x00A0 (line 10).
0x080C has address 0x081D followed by 0x0014 (line 20).
0x081D has address 0x082C followed by 0x0019 (line 25).
0x082C has address 0x0834 followed by 0x001E (line 30).
0x0834 has address 0x083C followed by 0x0028 (line 40).
0x083C has address 0x0000, the end of the linked list (the null pointer).

Re-numbering and Other Tools

Renumbering BASIC programs was often necessary when the programmer realized there was not enough space between existing lines to accommodate a change. We will not go into the details here, but there is a fantastic writeup from Mass:Werk that describes in excruciating detail what it was like on these older platforms. For the BASIC programmer, a renumbering tool was worth its weight in gold.

In other words, cartridges and assembly language programs would be written to sit in memory and be available to assist developers. When the PET, VIC-20, and C-64 were released, there was only so much that could be included in the original design. After they hit the market, developers realized that specific tools would be beneficial.

Some developers like Jim Butterfield wrote many of the popular tools of the time. These included better editors and word processors. Some system tools extended the BASIC language, enhanced the system command set, provided assemblers for the 6502/6510 CPU, and memory inspection tools. Micromon is one of those tools and was used to inspect the memory of the program examined in this writeup.