
Do your first research experiment

Create your own git repo wherever you want. Most people will host it in their own GitHub account, but GitLab is another good option, and there are plenty of other possible choices. Give your repo a name that includes comp560 and your login/username, e.g., comp560-jmac.

Create a README.md, a .gitignore suitable for Python projects, a Python virtual environment, and a LICENSE file. Update these whenever needed as your experiment progresses. Activate your virtual environment.

You have probably already cloned our fork of nanoGPT, comp560-nanoGPT. If not, clone it now. Either way, it should sit at the same directory level as your new repo (e.g., if your new repo is at ~/git/comp560-jmac, then nanoGPT should be at ~/git/comp560-nanoGPT).

Decide on the stream of data you would like your LLM to learn. Think of something that is easy to learn based only on character-by-character information. Three examples are provided here, but you should think of your own example. Seek help if necessary.

Create a subdirectory named after the experiment you would like to conduct, say insert-spaces for example 3 above. These instructions will continue to use example 3. Populate the new subdirectory with files and folders as in the following tree (a short script that creates this layout is sketched after the tree):

        insert-spaces
        ├── README.md
        ├── config
        ├── data
        └── out
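
If you prefer to script this step, here is a minimal Python sketch that creates the same layout. The experiment name is just the running example; run the script from the root of your new repo.

        # make_skeleton.py -- create the experiment layout shown above
        from pathlib import Path

        experiment = Path("insert-spaces")  # substitute your own experiment name
        for sub in ("config", "data", "out"):
            (experiment / sub).mkdir(parents=True, exist_ok=True)
        (experiment / "README.md").touch()  # fill this in as the experiment evolves

        print("created:")
        for path in sorted(experiment.rglob("*")):
            print(" ", path.as_posix())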

Continue to update the README file in this directory with descriptions of your experiment as it evolves.

Copy the provided prepare.py and alter it so that it produces a sample from your chosen data stream (a rough sketch of such a script follows the tree below). The version provided here produces data for the insert-spaces experiment in example 3 above. Run your program and make a reasonable attempt to verify that it is working correctly; a quick check is also sketched below. Note that it produces three output files: meta.pkl, train.bin, and val.bin. Move these and your Python script into a new subdirectory called basic, under data:

        data
        └── basic
            ├── meta.pkl
            ├── prepare.py
            ├── train.bin
            └── val.bin
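
For reference, here is a rough sketch of what an altered prepare.py could look like, following the meta.pkl / train.bin / val.bin conventions of nanoGPT's character-level datasets. The generate_stream function is a placeholder that just emits lines of random lowercase letters; replace it with code that generates your chosen data stream, and expect the provided prepare.py to differ in its details.

        # prepare.py -- sketch of a character-level data preparation script.
        # The uint16 token files and the meta.pkl keys (vocab_size, itos, stoi)
        # follow what nanoGPT's train.py and sample.py expect for char-level data.
        import pickle
        import random
        import string

        import numpy as np


        def generate_stream(n_lines=10_000):
            """Placeholder stream: lines of random lowercase letters.
            Replace this with code that generates your chosen data stream."""
            rng = random.Random(1234)
            lines = (
                "".join(rng.choice(string.ascii_lowercase) for _ in range(20))
                for _ in range(n_lines)
            )
            return "\n".join(lines) + "\n"


        data = generate_stream()

        # build a character-level vocabulary from the data itself
        chars = sorted(set(data))
        vocab_size = len(chars)
        stoi = {ch: i for i, ch in enumerate(chars)}
        itos = {i: ch for i, ch in enumerate(chars)}


        def encode(s):
            return [stoi[c] for c in s]


        # 90/10 split into training and validation data
        n = len(data)
        train_ids = np.array(encode(data[: int(0.9 * n)]), dtype=np.uint16)
        val_ids = np.array(encode(data[int(0.9 * n):]), dtype=np.uint16)

        # write the token ids as raw uint16, plus the metadata needed to decode them
        train_ids.tofile("train.bin")
        val_ids.tofile("val.bin")
        with open("meta.pkl", "wb") as f:
            pickle.dump({"vocab_size": vocab_size, "itos": itos, "stoi": stoi}, f)

        print(f"vocab_size={vocab_size}, "
              f"train tokens={len(train_ids):,}, val tokens={len(val_ids):,}")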

Why do we put these in a new subdirectory called “basic”? Because this is the “basic” version of the experiment. We may have fancier versions later, with different names, and they will have their own subdirectories under data.
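
One quick way to make the earlier verification concrete is to load meta.pkl and decode a slice of train.bin; the decoded text should look like a chunk of your intended data stream. Run something like this from the directory containing the files:

        # check_data.py -- decode a slice of train.bin to eyeball the prepared data
        import pickle

        import numpy as np

        with open("meta.pkl", "rb") as f:
            meta = pickle.load(f)
        itos = meta["itos"]

        ids = np.fromfile("train.bin", dtype=np.uint16)
        print(f"vocab_size={meta['vocab_size']}, train tokens={len(ids):,}")
        print("".join(itos[int(i)] for i in ids[:200]))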

Save the provided basic.py file into your config directory.
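
A nanoGPT config file is just a sequence of plain Python assignments that configurator.py applies on top of the defaults in train.py. The provided basic.py should already contain suitable settings; the sketch below only illustrates the kind of contents to expect, and its values are made up.

        # config/basic.py -- illustrative sketch only; the provided file will differ.
        # Each assignment overrides the corresponding default in train.py.

        out_dir = 'out'      # checkpoints land in the experiment's out/ directory
        dataset = 'basic'    # train.py looks for data/basic/train.bin and val.bin

        # a small character-level model
        n_layer = 4
        n_head = 4
        n_embd = 128
        block_size = 64
        batch_size = 64
        dropout = 0.0

        learning_rate = 1e-3
        max_iters = 2000     # the value you will lower to (say) 200 for a smoke test

        eval_interval = 250  # how often to evaluate on the validation split
        eval_iters = 200
        device = 'cpu'       # use 'cuda' or 'mps' if you have a suitable GPU
        compile = False      # torch.compile needs a recent PyTorch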

When running a new experiment for the first time, it is a good idea to do a complete run with a very small number of training iterations. This will flush out any problems or bugs before we commit to longer runs. So, in the basic.py config file, change the value of max_iters to (say) 200. Now let’s train the model. The working directory should be your experiment directory (insert-spaces in our running example). In a bash shell or similar, the command is

        NANOGPT_CONFIG=../../comp560-nanoGPT/configurator.py  python -u ../../comp560-nanoGPT/train.py config/basic.py

Explanation: the NANOGPT_CONFIG environment variable points the fork’s scripts at configurator.py, so they can be run from your experiment directory rather than from inside comp560-nanoGPT; python -u gives unbuffered output, so training progress appears immediately; the ../../ paths assume the two repos sit side by side, as described above; and config/basic.py is the config file you saved earlier.

If training completed successfully, try sampling from the trained model. We expect very bad results because of the small number of training iterations, but we are still trying to debug our workflow at this stage. The sampling command, from the same working directory, could be:

        NANOGPT_CONFIG=../../comp560-nanoGPT/configurator.py  python -u ../../comp560-nanoGPT/sample.py config/basic.py --num_samples=1 --max_new_tokens=100 --seed=2345

Change the value of max_iters back to (say) 2000. Retrain the model (it should take only about 3 minutes) and sample from it again.

The next step depends on your results:

Write a very short report describing all experiments, and you are done.