Summary of Prompt Repetition Improves Non-Reasoning LLMs
This document was created by John MacCormick and dictated into ChatGPT (GPT-5.3). The model reformatted and lightly edited the content for clarity, correcting minor errors and improving readability while preserving the original meaning.
This is a summary of “Prompt Repetition Improves Non-Reasoning LLMs” by Leviathan, Kelman, and Matias (Google Research, December 2025).
The core idea of the paper is that simply repeating the prompt sent to a large language model (LLM) can improve performance when explicit reasoning is disabled. The effect appears across a range of problem types and in all seven LLMs evaluated in the study.
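To make the technique concrete, here is a minimal sketch of prompt repetition in Python. The repetition template (plain concatenation separated by a blank line) and the `call_model` stand-in are assumptions for illustration, not necessarily the paper's exact setup.

```python
# Minimal sketch of prompt repetition: the full prompt is sent twice in a
# single request. The separator and the call_model() stand-in are assumptions;
# the paper may use a different repetition template.

def repeat_prompt(prompt: str, copies: int = 2) -> str:
    """Return the prompt concatenated with itself `copies` times."""
    return "\n\n".join([prompt] * copies)

question = (
    "Which of these would let the most heat through?\n"
    "A. A new pair of jeans\n"
    "B. A steel spoon from the cafeteria\n"
    "C. Cotton candy at a store\n"
    "D. A Calvin Klein cotton hat"
)

baseline = question                  # prompt sent once (baseline condition)
repeated = repeat_prompt(question)   # prompt sent twice (repetition condition)

# answer = call_model(repeated)      # hypothetical LLM API call
print(repeated)
```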
Variation in Performance Gains
The magnitude of improvement varies significantly depending on the task:
- In some cases, there is no improvement at all.
- In others, accuracy increases by a modest ~3%.
- In certain tasks, performance improves dramatically—by more than 40%.
Example Benchmark: OpenBookQA
One illustrative example is the OpenBookQA benchmark, introduced by Todor Mihaylov et al. (2018) in the paper “Can a Suit of Armor Conduct Electricity?”.
This benchmark consists of multiple-choice questions that combine:
- Common-sense reasoning, and
- Scientific facts (explicitly provided)
Human performance on this benchmark is typically around 92% accuracy.
Example Questions
- Can a suit of armor conduct electricity?
- Which of these would let the most heat through?
- A. A new pair of jeans
- B. A steel spoon from the cafeteria
- C. Cotton candy at a store
- D. A Calvin Klein cotton hat
Experimental Findings on OpenBookQA
The authors conducted two experiments using the OpenBookQA dataset (see the prompt-construction sketch after this list):
1. Reordered Prompts (Options First)
- The multiple-choice options were presented before the question.
- This makes the task harder for an LLM, since it cannot interpret the options in light of the question.
- Under this setup, prompt repetition led to large improvements across all seven models:
- Accuracy typically increased from around 80% to 90%.
2. Standard Ordering (Question First)
- The prompt followed the original OpenBookQA format:
- Question first, then answer choices.
- In this more natural setup:
- Only 3 of the 7 models showed significant improvement.
- Gains were much smaller, e.g., from 92% to 95%.
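The sketch below shows how the two orderings might be constructed, with the repetition step applied on top. The template wording is an assumption made for illustration; only the ordering of question and options, and the repetition of the full prompt, mirror the setup described above.

```python
# Sketch of the two OpenBookQA prompt orderings compared in the paper, each
# combined with prompt repetition. Template wording is an assumption.

def build_prompt(question: str, options: list[str], options_first: bool) -> str:
    """Assemble a multiple-choice prompt in either ordering."""
    opts = "\n".join(options)
    if options_first:
        return f"{opts}\n{question}"   # harder: options appear before the question
    return f"{question}\n{opts}"       # standard ordering: question first

def with_repetition(prompt: str) -> str:
    """Prompt repetition: the entire prompt is included twice."""
    return f"{prompt}\n\n{prompt}"

question = "Which of these would let the most heat through?"
options = [
    "A. A new pair of jeans",
    "B. A steel spoon from the cafeteria",
    "C. Cotton candy at a store",
    "D. A Calvin Klein cotton hat",
]

for options_first in (True, False):
    prompt = with_repetition(build_prompt(question, options, options_first))
    print(prompt)
    print("-" * 40)
```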
Custom Task: “NameIndex”
The authors also designed custom tasks to highlight cases where prompt repetition has especially strong effects. One such task is called NameIndex.
Task Description
A typical prompt takes the form:
“Here is a list of names…”
This is followed by a list (e.g., 50 full names, each with a given name and family name), and then a question such as:
“What is the 25th name?”
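A NameIndex-style prompt might be assembled as in the sketch below. The specific names and phrasing are invented for illustration; only the overall structure (a list of 50 full names followed by a positional question, with the full prompt repeated) follows the task description.

```python
# Sketch of a NameIndex-style prompt with repetition. The names below are
# invented; the paper's task uses lists of 50 full names.

import random

FIRST_NAMES = ["Alice", "Bob", "Carol", "David", "Emma", "Frank", "Grace", "Henry"]
FAMILY_NAMES = ["Smith", "Jones", "Taylor", "Brown", "Wilson", "Davis", "Clark", "Lewis"]

def make_name_index_prompt(n_names: int = 50, ordinal: str = "25th", seed: int = 0) -> str:
    """Build the name-list prompt and apply prompt repetition (send it twice)."""
    rng = random.Random(seed)
    names = [f"{rng.choice(FIRST_NAMES)} {rng.choice(FAMILY_NAMES)}"
             for _ in range(n_names)]
    prompt = (
        "Here is a list of names:\n"
        + "\n".join(names)
        + f"\nWhat is the {ordinal} name?"
    )
    return f"{prompt}\n\n{prompt}"   # the full prompt, repeated once

print(make_name_index_prompt())
```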
Results
Substantial improvements were observed across most models when the prompt was repeated:
- Anthropic’s Claude 3 Haiku model improved from 5% to 50% accuracy.
- The Claude 3.7 Sonnet model improved from 80% to 95%.
Effect of Reasoning
When reasoning capabilities were enabled, prompt repetition had little to no effect on most models and tasks.
The authors hypothesize that this is because reasoning-enabled models often implicitly repeat or restate the prompt as part of their internal or external reasoning process. As a result, explicitly repeating the prompt provides little additional benefit.