Group Project 2 (of 2) – Programming by Design

CISS-111 Group Project 2

Write a BASIC tokenizer for the Commodore 64 (C64). Read Crafting Interpreters, Chapter 4 sections 4.4 and later for a basic understanding of what you are building. Take as much code as you think you need from that model.

You don’t need it all, nor do you need it necessarily as written. But this gives you a good starting point for building the Tokenizer (called Scanner in his book) class noted later in the project.

Historical Notes

BASIC (Beginners All-purpose Symbolic Instruction Code) is very well known and very old. Here are a few historical videos to provide some background.

The Birth of BASIC
The Basics of BASIC
101 BASIC Computer Games (Creative Computing)
101 BASIC Computer Games (Digital Equipment Corporation)

Learning outcomes

Working with a translation table.
Working with existing data to build a new representation.
Working with data type manipulations.
Working with switch expressions.
Working with unsigned byte data.
Working with unique file formats.
Working with established practices.
Working with an emulator provided by your instructor or through VICE (VIC-20 and C-64 emulation).
Confirmation program produces desired results.

Setup

The Commodore 64 stored BASIC programs in tokenized form. This was a carryover from previous platforms (VIC-20, PET), which had an order of magnitude less memory. By tokenizing the lines as they were entered, the program occupied less memory overall and was interpreted faster.

Let’s look at an example. The following program was typed in by the user (or read in from tape/disk)

10 A=10:PRINTA
20 PRINT"ABC","DEF","GHI"
30 A$="STUFF"
40 PRINT"A IS";A,"A$ IS ";A$
50 INPUT "PLEASE ENTER YOUR NAME";N$
60 PRINT "YOUR NAME IS ";N$

If you want to paste it into a VICE c64sc emulator, it must be lower case.

10 a=10:printa
20 print"abc","def","ghi"
30 a$="stuff"
40 print"a is";a,"a$ is ";a$
50 input "please enter your name";n$
60 print "your name is ";n$

Historical Note

Once the system was up and running, the VIC-20 only had 3,583 bytes available to programs. By comparison, the Commodore 64 had 38,911 bytes.

Also, the 6502/6510 processors used were little-endian.

Here is the program after it’s typed in:

Once the program is typed in, you can run it:

As mentioned, the program is stored in memory in tokenized form. This is shown below:

Of particular interest in this hex dump is the order of the bytes.

Note

The program’s format on disk/tape has two additional bytes at the beginning. The value 0x0801 (as 0x01 0x08) indicates the location in memory where the program is to be loaded. All tokenized BASIC programs will have these two bytes. You will not see them in the hex dump since they are discarded when the program is loaded. What you see in the hex dump starts at byte 3 of the disk file.

Each line is tokenized and placed into memory using a linked list. This is useful to speed us line searching when the program has to jump to a new location (GOTO, GOSUB, THEN, etc). Here is a rough memory layout of a line.

-------------------------------------------------------------------------------------
| lo-byte next line | hi-byte next line | lo-byte line number | hi-byte line number ~
-------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------
~ tokenized commands and raw text | zero byte for end of line | next address or two zeros |
-------------------------------------------------------------------------------------------

This is further illustrated in the following picture:

The blue boxes represent the address of the next line. The orange boxes are the line numbers and the white box shows the zero byte marking the end of the line; there is always one zero byte before the next address in the hex dump. The red box shows the three zeroes indicating the end of the program.

Your team will produce a program that will tokenize a source program and write it to a file, properly formatted to run on a Commodore 64. You can use the instructor’s website emulator or VICE to drag and drop the PRG file anywhere in the display area.

It is recommended to do this in at least two files:

C64BASIC.java
Tokenizer.java

Below is some code to get you started.

Tokenizer.java

import java.util.HashMap;
import java.util.Map;
import java.util.Arrays;
import java.util.Collections;

public class Tokenizer {

    private final StringBuffer source;
    static Map<String,Byte> keywords = new HashMap<>();
    static Map<String,Byte> keycodes = new HashMap<>();
    static String[] commands;
    byte[] code = new byte[38913]; // 38911 basic byte free!
    private int start = 0;
    private int current = 0;
    private int cidx = 0; // code index
    private int nexti;    // where to put address of next tokenized line
    private boolean LINENO; // looking for a line number?

    public Tokenizer(String source) {
        this.source = new StringBuffer(source);
        // load the HashMaps
        populateKeywords();
        populateKeycodes();

        code[cidx++] = (byte)(0x01);
        code[cidx++] = (byte)(0x08);
        nexti = cidx;
        //cidx += 2; // leave 2 bytes for address link to next line
    }

    public byte[] scanTokens() {
        LINENO = true;
        while (!isAtEnd()) {
            // We are at the beginning of the next lexeme.
            start = current;
            scanToken();
        }
        addToken(0);
        // return only what we need.
        byte[] c = new byte[cidx];
        System.arraycopy(code, 0, c, 0, cidx);
        return c;
    }

    /* Code from Crafting Interpreters goes here-ish... 

    private void scanToken() {}
    private void command() {}
    private void number() {}
    private void string() {}
    private void keycode() {}
    private char peek() {}
    private static boolean isDigit(char c) {}
    private static boolean isAlpha(char c) {}
    private boolean isAtEnd() {}
    private char advance() {}
    private void addToken(byte b) {}
    private void addToken(int i) {}
    private void addToken(String s) {}

    */

    private char petscii(char c) {
        if ( c >= 'A' && c <= 'Z' ) return (char)(c + 32);
        if ( c >= 'a' && c <= 'z' ) return (char)(c - 32);
        return c;
    }

    private static void populateKeywords(){
        keywords.put("end",(byte)0x80);
        keywords.put("for",(byte)0x81);
        keywords.put("next",(byte)0x82);
        keywords.put("data",(byte)0x83);
        keywords.put("input#",(byte)0x84);
        keywords.put("input",(byte)0x85);
        keywords.put("dim",(byte)0x86);
        keywords.put("read",(byte)0x87);
        keywords.put("let",(byte)0x88);
        keywords.put("goto",(byte)0x89);
        keywords.put("run",(byte)0x8a);
        keywords.put("if",(byte)0x8b);
        keywords.put("restore",(byte)0x8c);
        keywords.put("gosub",(byte)0x8d);
        keywords.put("return",(byte)0x8e);
        keywords.put("rem",(byte)0x8f);
        keywords.put("stop",(byte)0x90);
        keywords.put("on",(byte)0x91);
        keywords.put("wait",(byte)0x92);
        keywords.put("load",(byte)0x93);
        keywords.put("save",(byte)0x94);
        keywords.put("verify",(byte)0x95);
        keywords.put("def",(byte)0x96);
        keywords.put("poke",(byte)0x97);
        keywords.put("print#",(byte)0x98);
        keywords.put("print",(byte)0x99);
        keywords.put("cont",(byte)0x9a);
        keywords.put("list",(byte)0x9b);
        keywords.put("clr",(byte)0x9c);
        keywords.put("cmd",(byte)0x9d);
        keywords.put("sys",(byte)0x9e);
        keywords.put("open",(byte)0x9f);
        keywords.put("close",(byte)0xa0);
        keywords.put("get",(byte)0xa1);
        keywords.put("new",(byte)0xa2);
        keywords.put("tab(",(byte)0xa3);
        keywords.put("to",(byte)0xa4);
        keywords.put("fn",(byte)0xa5);
        keywords.put("spc(",(byte)0xa6);
        keywords.put("then",(byte)0xa7);
        keywords.put("not",(byte)0xa8);
        keywords.put("step",(byte)0xa9);
        /*
        keywords.put("+",(byte)0xaa);
        keywords.put("-",(byte)0xab);
        keywords.put("*",(byte)0xac);
        keywords.put("/",(byte)0xad);
        keywords.put("↑",(byte)0xae);
        */
        keywords.put("and",(byte)0xaf);
        keywords.put("or",(byte)0xb0);
        /*
        keywords.put(">",(byte)0xb1);
        keywords.put("=",(byte)0xb2);
        keywords.put("<",(byte)0xb3);
        */
        keywords.put("sgn",(byte)0xb4);
        keywords.put("int",(byte)0xb5);
        keywords.put("abs",(byte)0xb6);
        keywords.put("usr",(byte)0xb7);
        keywords.put("fre",(byte)0xb7);
        keywords.put("pos",(byte)0xb9);
        keywords.put("sqr",(byte)0xba);
        keywords.put("rnd",(byte)0xbb);
        keywords.put("log",(byte)0xbc);
        keywords.put("exp",(byte)0xbd);
        keywords.put("cos",(byte)0xbe);
        keywords.put("sin",(byte)0xbf);
        keywords.put("tan",(byte)0xc0);
        keywords.put("atn",(byte)0xc1);
        keywords.put("peek",(byte)0xc2);
        keywords.put("len",(byte)0xc3);
        keywords.put("str$",(byte)0xc4);
        keywords.put("val",(byte)0xc5);
        keywords.put("asc",(byte)0xc6);
        keywords.put("chr$",(byte)0xc7);
        keywords.put("left$",(byte)0xc8);
        keywords.put("right$",(byte)0xc9);
        keywords.put("mid$",(byte)0xca);
        keywords.put("go",(byte)0xcb);

        commands = keywords.keySet().toArray(new String[1]);
        Arrays.sort(commands, Collections.reverseOrder());
    }

    private static void populateKeycodes() {
        keycodes.put("{white}", (byte)0x05);
        keycodes.put("{return}", (byte)0x0d);
        keycodes.put("{down}", (byte)0x11);
        keycodes.put("{reverse on}", (byte)0x12);
        keycodes.put("{home}", (byte)0x13);
        keycodes.put("{delete}", (byte)0x14);
        keycodes.put("{red}", (byte)0x1c);
        keycodes.put("{right}", (byte)0x1d);
        keycodes.put("{green}", (byte)0x1e);
        keycodes.put("{blue}", (byte)0x1f);
        keycodes.put("{pound}", (byte)0x5c);
        keycodes.put("{orange}", (byte)0x81);
        keycodes.put("{black}", (byte)0x90);
        keycodes.put("{up}", (byte)0x91);
        keycodes.put("{reverse off}", (byte)0x92);
        keycodes.put("{clear}", (byte)0x93);
        keycodes.put("{brown}", (byte)0x95);
        keycodes.put("{pink}", (byte)0x96);
        keycodes.put("{dark gray}", (byte)0x97);
        keycodes.put("{grey}", (byte)0x98);
        keycodes.put("{light green}", (byte)0x99);
        keycodes.put("{light blue}", (byte)0x9a);
        keycodes.put("{light gray}", (byte)0x9b);
        keycodes.put("{purple}", (byte)0x9c);
        keycodes.put("{left}", (byte)0x9d);
        keycodes.put("{yellow}", (byte)0x9e);
        keycodes.put("{cyan}", (byte)0x9f);
        keycodes.put("{pi}", (byte)0xff);
    }
}

Here is the main code to exercise the tokenizer.

Group2of2.java

import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.charset.Charset;
import java.io.IOException;
import java.nio.file.StandardOpenOption;

public class Group2of2 {

    private static void hexdump(byte[] bytes) {
        int start = (bytes[1] & 0xff) * 256 + (bytes[0] & 0xff);
        int pc = start;
        StringBuilder chars = new StringBuilder();
        int zeroes = 0;

        for (int x = 2; x < bytes.length; x++) {
            if ( (pc - start) % 8 == 0 ) {
                System.out.printf(" %s", chars);
                if ( zeroes >= 3 ) return;
                System.out.printf("\n%04X:", pc);
                chars.setLength(0);
            }
            char c = (char)(bytes[x] & 0xff);
            zeroes = c==0 ? zeroes + 1 : 0;
            if ( c >= 32 && c <= 127 )
                chars.append(c);
            else
                chars.append('.');
            System.out.printf(" %02X", bytes[x]);
            pc++;

        }
        System.out.println("\n");
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes;

        if (args.length != 2) {
            System.out.println("Must provide source and object.");
            System.exit(3);
        }

        bytes = Files.readAllBytes(Paths.get(args[0]));

        Tokenizer tzr = new Tokenizer(new String(bytes, Charset.defaultCharset()));
        byte[] program = tzr.scanTokens();

        hexdump(program);
        Files.write(Paths.get(args[1]), program, StandardOpenOption.CREATE);
    }
}

The operators noted above will be implemented in a switch but are kept in the code for reference purposes. (You'll need to know the codes.)

Keep in mind that you are not building an interpreter! The code from Crafting Interpreters will be used as a guide to simplify the tokenization process. The meat of this project is turning the source code into tokenized BASIC code and writing the file correctly so that it will run on an emulated Commodore 64.

Keep your eye on the ball!

Your success in this project is based on the following criteria:

A rough outline of the problem breakdown.
Division of work between team members.
Each member provides an equal contribution to the solution.
A final one-page writeup of your solution with a self-assessment of what you did well and where you could improve the code.

If you complete the recognition of keycodes in strings, you will earn an additional 5 points!

Submit as Team_Project2.java to the Learning Management System.