Possible research projects and mini-projects
In no particular order:
- Investigate migration to nanochat
- Replicate anything from Feng23 or Baeumel25
- The first research experiment page describes how to create a transformer model based on a continuous stream of input, then assess the ability of the model to generate a continuous stream of output similar to the input. For many of the proposed experiments this semester, it will be preferable train the transformer model to produce short responses in response to short inputs. That is, all inputs and outputs are relatively short rather than being a continuous stream. Create a framework for conducting experiments with this type of model.
- Get a robust addition model working and clearly characterize the points at which it fails
- then, train it with chain of thought – do we need more, less, or about the same resources?
- then, give it scratch space to develop its own chain of thought – Can we identify any resource savings from doing it this way?
- Figure out how to design certain GPTs by hand, then implement and test in PyTorch, e.g.:
- copy an n-character input (for, say, n=3)
- reverse an n-character input (for, say, n=3)
- n-digit binary/ternary/decimal addition
- markov chain and HMM
- For all models in the previous bullet point, experiment with training a model that is as similar as possible but created completely automatically from training.
- Mentor high school student.
- Mentor one or both of our first-year students.
- Experiment with training our simple character-level models on GPUs. Determine whether GPUs are a significant benefit for this type of model. If so, how can we best take advantage of them? Analyze the financial costs and benefits, then make a recommendation on how to proceed.
- This is a complex question that depends on the size of the models we are using. So the analysis needs to employ a range of different sizes for the models.
- [This is more of a general challenge than a project.] What is the most interesting or complex problem you can solve using a character-level LLM that can be trained in 3 minutes on a laptop?
- Make a poster describing some of your work in this course and present it at the Computer Science Symposium, Tuesday April 28, from 12:00-1:15 in the Tome Library.
- Propose a new paper to read and discuss.
- Investigate effect of context window on simple models.
- Suppose our data is generated by a source that, by design, requires a context window of only n tokens for optimal performance. Is it true that using a context window larger than n is a waste of resources? Or can we actually get better performance by using a larger context window? See AI chat for details: chatgpt.com/share/69714129…
- Do some kind of experiments on generalization ability of simple GPT models. For example, suppose we have trained a model that copies the input, which can consist of any alphabetic string. However, we deliberately suppress the occurrence of one or more tokens in the training data. For expample, suppose uppercase letters occur rarely. Now measure the accuracy of the model on inputs that do/do not contain uppercase letters. Are we able to measure some kind of transference of knowledge, that is, given the model knows how to copy lowercase letters will it very quickly learn how to copy uppercase letters? I guess this could be done as an example of fine tuning where we first train on only lower case letters and then do fine tuning with tiny amount data containing uppercase letters. Many other experiments are possible.
- create a simple model of tool use that can be incorporated into our elementary character level GPTs. example: suppose we permit our LLM to call a tool that knows how to do single digit-addition. can we train it to do multi digit edition efficiently using this tool? does chain of thought or scratch pad space help with this?
- investigate whether JAX can be used to speed up training for our class of small simple models
- can we develop in theory or in practice some kind of algebra of transformer architectures? along the following lines: Train A for task T, then B for task U. Add or concatenate A&B, fine-tune to solve U-follows-T or maybe T concat U. (example: reverse digits then add)
- can we think of a task that provably requires scratch pad, if we impose limit on depth or overall size of transformer?
- one possible example: Task: sum of two digits embedded in random string E.g. aaaassd3fddfvff5dddffd -> 8. presumably it is easier to write down the two digits first then add them. Simpler: use binary XOR and only a single clutter character: e.g. aaaaaaa1aaaaaa0aaaa -> 1
- extend and improve jmac’s toyAI experiments.
- Create, adapt, or adopt some tools that create beautiful visualizations of our simple GPT models
- experiment with initiating tool calls when there is high uncertainty over the next token, as in this Meta-Cognition Trigger article. Credit: Biruk Kebede
- experiment with repeating the prompt as in Prompt Repetition Improves Non-Reasoning LLMs. Credit: Aziz Muminov
- use Karpathy’s autoresearch to run any of the experiments in this project