Literature
Background on LLMs, GPTs, and transformers
- J. MacCormick (2026). Thinking AI: How Artificial Intelligence Emulates Human Understanding. Princeton University Press. Chapter 10 (link is restricted to members of the COMP560 MS Team), PUP, Amazon.
- This is a chapter from a general-audience book about modern AI systems. The chapter explains how a GPT works, using simple descriptions that require no computer science background.
Our original inspiration: “Revealing the mystery behind chain of thought”
- Feng, G., Zhang, B., Gu, Y., Ye, H., He, D., & Wang, L. (2023). Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36, 70757-70798. pdf at neurips.cc. Teams discussion channel.
- This is the original motivation for our research project. It demonstrates how certain simple tasks, such as arithmetic or solving linear equations, can be tackled with chain-of-thought reasoning, and how/why that is beneficial. We aim to replicate or extend these results in the specific domain of very small, simple tasks and small transformer models that can be trained on consumer laptops in only a few minutes; a minimal sketch of the kind of training data we have in mind follows below. (The paper contains some very interesting theoretical results, but we focus more on the practical experiments.)
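To make the setting concrete, here is a minimal sketch of chain-of-thought-style training examples for small arithmetic tasks. The format and function names are our own illustration, not taken from the paper.

```python
import random

def cot_addition_example(n_digits=3):
    """Return (prompt, target) where the target spells out a digit-by-digit trace."""
    a = random.randint(0, 10**n_digits - 1)
    b = random.randint(0, 10**n_digits - 1)
    prompt = f"{a}+{b}="
    steps, carry = [], 0
    # Walk the digits from least to most significant, tracking the carry.
    for da, db in zip(reversed(f"{a:0{n_digits}d}"), reversed(f"{b:0{n_digits}d}")):
        s = int(da) + int(db) + carry
        steps.append(f"{da}+{db}+{carry}={s}")
        carry = s // 10
    return prompt, " ; ".join(steps) + f" ; answer={a + b}"

random.seed(0)
for _ in range(3):
    print(*cot_addition_example())
```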
Other CoT analyses
- Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. arXiv:2506.06941.
- This paper analyses large reasoning models (LRMs), the successors to chain-of-thought reasoning in LLMs. It identifies certain properties of LRMs, such as a complete collapse in accuracy once a puzzle's difficulty passes a certain threshold, and the fact that exact algorithms are typically not employed.
suggested by Aziz
Elementary properties of the attention operator
- Elhage et al., “A Mathematical Framework for Transformer Circuits”, Transformer Circuits Thread, 2021. html at Anthropic
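For reference when reading the paper's decomposition, a minimal single-head scaled dot-product attention operator (NumPy, unmasked). This is the standard textbook formula, not code from the paper.

```python
import numpy as np

def attention(Q, K, V):
    """Q, K: (seq, d_k); V: (seq, d_v). Returns (seq, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # convex combination of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (5, 8); causal masking omitted for brevity
```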
Connections between LLMs and biology
- Lindsey et al., “On the Biology of a Large Language Model”, Transformer Circuits, 2025. html at Anthropic
suggested by Aziz
Tackling simple arithmetic using GPT architectures
- Baeumel, T., van Genabith, J., & Ostermann, S. (2025). The Lookahead Limitation: Why Multi-Operand Addition is Hard for LLMs. arXiv:2502.19981. pdf at arXiv.
- Lee, N., Sreenivasan, K., Lee, J. D., Lee, K., & Papailiopoulos, D. (2024). Teaching Arithmetic to Small Transformers. In International Conference on Learning Representations (ICLR 2024). pdf at arXiv. suggested by Hemanth
- Extremely relevant to our project. Can someone read it and provide a summary and/or suggest next steps? A hedged sketch of the kind of data formats we believe it studies follows below.
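A hedged sketch of the data-format comparison we understand the paper to study: "plain" addition samples versus samples with the output digits reversed (reversal lets a left-to-right model emit the carry-friendly low-order digits first). The exact format strings are our own guesses, not copied from the paper.

```python
import random

def make_sample(n_digits=3, reverse_output=False):
    a = random.randint(0, 10**n_digits - 1)
    b = random.randint(0, 10**n_digits - 1)
    out = str(a + b)
    if reverse_output:
        out = out[::-1]  # emit least significant digit first
    return f"{a}+{b}={out}"

random.seed(0)
print(make_sample())                     # plain: "a+b=sum"
print(make_sample(reverse_output=True))  # reversed: sum digits in reverse order
```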
Representation of concepts within LLMs
- Dong, Z., Zhou, Z., Liu, Z., Yang, C., & Lu, C. (2025). Emergent Response Planning in LLMs. arXiv:2502.06258 [cs.CL].
- This paper shows that large language models (LLMs) trained only to predict the next token nonetheless encode representations that reveal planning behavior across their entire output, suggesting latent capabilities for anticipating the structure, content, and overall attributes of a response beyond the next token. A sketch of the probing setup as we understand it follows below.
suggested by Aziz
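Purely as an illustration of the probing methodology, a minimal sketch of fitting a linear probe on prompt-final hidden states. The model choice (GPT-2 via Hugging Face transformers) and the placeholder labels are our own, not the paper's setup, which probes real attributes of the forthcoming response such as its eventual length.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

def prompt_final_state(prompt: str) -> torch.Tensor:
    """Hidden state of the last prompt token at the final layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

# Toy probe: can the prompt-final state predict something about the response
# the model is *about to* produce? Labels here are placeholders.
prompts = ["2+2=", "17*3=", "The capital of France is", "My favorite color is"]
labels = [1, 1, 0, 0]  # placeholder: arithmetic vs. free-form continuation
X = torch.stack([prompt_final_state(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.score(X, labels))
```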
The effect of prompts
- Leviathan, Y., Kalman, M., & Matias, Y. (2025). Prompt Repetition Improves Non-Reasoning LLMs. arXiv:2512.14982 [cs.LG].
- The authors demonstrate that simply repeating an input prompt (i.e., duplicating the text) can improve the performance of popular language models on non-reasoning benchmarks without increasing the number of generated tokens or inference latency. See also the summary of Leviathan2025. A trivial sketch of the intervention follows below.
suggested by Aziz
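A trivial sketch of the intervention itself; the separator and repeat count are our own choices, not necessarily the variants the paper evaluates.

```python
def repeat_prompt(prompt: str, times: int = 2, sep: str = "\n\n") -> str:
    """Duplicate the prompt text before sending it to the model."""
    return sep.join([prompt] * times)

print(repeat_prompt("Which is larger, 9.9 or 9.11?"))
```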
Augmenting transformers with memory
- Burtsev, M. S., Kuratov, Y., Peganov, A., & Sapunov, G. V. (2021). Memory Transformer. arXiv:2006.11527 [cs.CL].
- This paper introduces transformer architectures augmented with memory tokens that help capture both local and global sequence information, leading to improved performance on tasks like machine translation and language modeling by effectively storing and attending to non-local representations. A minimal sketch of the memory-token idea appears after this list.
suggested by Aziz
- Bietti, A., Cabannes, V., Bouchacourt, D., Jegou, H., & Bottou, L. (2023). Birth of a transformer: A memory viewpoint. Advances in Neural Information Processing Systems, 36, 1560-1588.
- Describes how to train a single-head, two-layer transformer using simulated data that is mostly random but contains hardwired pairs of adjacent tokens. The model is forced to learn a so-called induction head, allowing analysis of which parameters are learned first and other aspects of the learning process. It also enables comparison with an ideal induction head whose parameters are calculated explicitly by the authors. This paper is interesting to us not so much for its results, but for its clean approach to defining a very simple transformer model and comparing hand-crafted parameters to learned ones. A sketch of this style of synthetic data appears after this list (second sketch below).
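First, a minimal sketch of the memory-token idea from Burtsev et al.: learnable vectors prepended to the token embeddings so that every layer can read from and write to this global state via attention. The hyperparameters and module layout are our own minimal choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MemoryTransformer(nn.Module):
    def __init__(self, vocab=100, d=64, n_mem=4, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.memory = nn.Parameter(torch.randn(n_mem, d) * 0.02)  # learnable [mem] tokens
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.n_mem = n_mem

    def forward(self, ids):                       # ids: (batch, seq)
        x = self.embed(ids)                       # (batch, seq, d)
        mem = self.memory.expand(ids.size(0), -1, -1)
        x = torch.cat([mem, x], dim=1)            # prepend memory tokens
        y = self.encoder(x)
        return y[:, self.n_mem:]                  # drop the memory positions

out = MemoryTransformer()(torch.randint(0, 100, (2, 10)))
print(out.shape)  # torch.Size([2, 10, 64])
```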
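Second, a sketch of Bietti et al.-style synthetic data as described in the annotation above: mostly uniform random tokens, but a few "trigger" tokens are always followed by a fixed partner token, so the model benefits from learning an induction-style lookup. The vocabulary size and trigger pairs are arbitrary placeholders, not the paper's constants.

```python
import random

VOCAB = list(range(64))
PAIRS = {1: 11, 2: 22, 3: 33}   # hardwired trigger -> partner tokens

def make_sequence(length=32):
    seq = []
    while len(seq) < length:
        t = random.choice(VOCAB)
        seq.append(t)
        if t in PAIRS and len(seq) < length:
            seq.append(PAIRS[t])  # the partner always follows its trigger
    return seq[:length]

random.seed(0)
print(make_sequence())
```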
Tool use
- Li, W., Li, D., Dong, K., Zhang, C., Zhang, H., Liu, W., Wang, Y., Tang, R., & Liu, Y. (2025). Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger. arXiv:2502.12961 [cs.CL].
- Proposes a strategy for deciding when an LLM should invoke external tools. A generic illustration of threshold-triggered tool use follows below.
suggested by Biruk
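Purely as a generic illustration of threshold-triggered tool use; this is NOT the paper's meta-cognition score, which we have not summarized here, and all names and the confidence heuristic are hypothetical placeholders.

```python
def answer(question: str, model_confidence: float, threshold: float = 0.7) -> str:
    """Route to an external tool when the model's self-assessed confidence is low."""
    if model_confidence < threshold:
        return f"[calling external tool for: {question!r}]"
    return f"[answering directly: {question!r}]"

print(answer("What is 3+4?", model_confidence=0.95))
print(answer("What was yesterday's closing price of XYZ?", model_confidence=0.2))
```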
Alternatives to attention
- Dao, T., & Gu, A. (2024). Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv:2405.21060.
- Proposes an alternative to attention layers, called Mamba-2. The alternative has certain computational advantages while maintaining most of the accuracy of attention. The same authors introduced an earlier version in their Mamba-1 paper and, with several collaborators, released Mamba-3 in March 2026. A toy state-space recurrence, for contrast with attention, follows below.
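A toy scalar linear state-space recurrence, h_t = a·h_{t-1} + b·x_t with output y_t = c·h_t, to contrast the sequential-state view with attention's all-pairs interaction. This is a bare SSM scan, not Mamba-2 itself, which adds input-dependent ("selective") parameters and efficient kernels.

```python
import numpy as np

def ssm_scan(x, a=0.9, b=1.0, c=1.0):
    """x: (seq,) input; returns (seq,) output via a sequential state update."""
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt   # the state carries a decaying summary of the past
        ys.append(c * h)
    return np.array(ys)

print(ssm_scan(np.array([1.0, 0.0, 0.0, 1.0])))
```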
Super-weights
- Yu, M., Wang, D., Shan, Q., Reed, C. J., & Wan, A. (2024). The Super Weight in Large Language Models. arXiv:2411.07191.
- Reports that pruning a very small number of weights (sometimes only one weight) in a model with billions of parameters can drastically alter its performance. This has various consequences for model quantization and applications. Relevant question for DNU Lab: Do we see the same behavior in micro-LLMs? A sketch of the basic pruning experiment follows below.
Credit: Hemanth Kepa
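A sketch of the basic experiment as we understand it: zero out a single scalar weight and compare perplexity before and after. The model choice (GPT-2) and the weight coordinate are arbitrary placeholders, not the coordinates reported by Yu et al.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

text = "The quick brown fox jumps over the lazy dog."
before = perplexity(text)
w = model.transformer.h[2].mlp.c_proj.weight  # arbitrary layer/tensor choice
with torch.no_grad():
    w[0, 0] = 0.0                             # prune one scalar weight
print(f"ppl before={before:.2f} after={perplexity(text):.2f}")
```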