Reflexion Framework: Verbal Self-Critique Loops for Language Agents

Reflexion (Shinn et al., 2023) is a framework where a language model agent generates verbal self-critique after each attempt and uses that critique to improve subsequent attempts, without any parameter updates.

Reflexion is a framework introduced by Shinn et al. in 2023 (arXiv 2303.11366) for improving language model agent performance through self-generated verbal feedback. Unlike fine-tuning approaches that update model weights, Reflexion keeps the model frozen and instead has it generate text critiques of its own previous attempts, which are then included in the prompt for the next attempt. The core insight is that LLMs can often diagnose their own errors when shown the result, even when they could not avoid those errors on the first try.

The framework decomposes the agent into three roles, all played by the same underlying language model. The Actor attempts the task, producing an output (code, a plan, a caption, etc.). The Evaluator judges whether the output is correct or sufficient; sometimes this is an external check like running unit tests, and sometimes it is the language model judging its own output against criteria. The Self-Reflection module generates a verbal critique of what went wrong and what to try differently. On the next attempt, the Actor sees the prior output plus the critique in context and can adjust.

Reflexion produced notable gains on coding benchmarks like HumanEval, where the paper reports 91% pass@1 for GPT-4 with Reflexion versus 80% for the GPT-4 baseline, and on decision-making tasks like ALFWorld. The mechanism works best when an external evaluator can give a clear correctness signal: failed unit tests, wrong API outputs, simulator feedback. When the evaluator is the LM itself judging open-ended quality, gains are smaller and more variable because the model's self-judgment can be overconfident or systematically biased.

The framework has been applied to many task domains since 2023, including 3D captioning, mathematical reasoning, and tool use.
Direct application of Reflexion to a new task domain (different actor prompts, different evaluator) is straightforward and often works, but it does not constitute a new mechanism: the contribution is the engineering of a new application rather than a novel agentic capability. See "Chain-of-Thought Prompting: How Step-by-Step Reasoning Improves LLM Accuracy" for the closely related prompting technique that elicits step-by-step reasoning within a single attempt.
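
The Actor, Evaluator, and Self-Reflection roles described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the implementation from the paper: the names `reflexion_loop`, `ReflexionResult`, `stub_actor`, `unit_test_evaluator`, and `stub_reflector` are hypothetical, and the Actor and Self-Reflection roles are hard-coded stand-ins for what would in practice be calls to the same frozen language model. The Evaluator here is an external check (a tiny unit test), the setting in which the mechanism is said to work best.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical role signatures. In a real system each role would wrap a call
# to the same frozen language model; here they are plain Python callables.
Actor = Callable[[str, List[str]], str]        # (task, memory) -> attempt
Evaluator = Callable[[str], Tuple[bool, str]]  # attempt -> (passed, feedback)
Reflector = Callable[[str, str, str], str]     # (task, attempt, feedback) -> critique


@dataclass
class ReflexionResult:
    attempt: str
    passed: bool
    trials: int
    reflections: List[str] = field(default_factory=list)


def reflexion_loop(task: str, actor: Actor, evaluator: Evaluator,
                   reflector: Reflector, max_trials: int = 3) -> ReflexionResult:
    """Retry a task, feeding verbal critiques back in; no weights are updated."""
    memory: List[str] = []  # prior attempts plus critiques, shown to the Actor
    attempt = ""
    for trial in range(1, max_trials + 1):
        attempt = actor(task, memory)
        passed, feedback = evaluator(attempt)
        if passed:
            return ReflexionResult(attempt, True, trial, memory)
        critique = reflector(task, attempt, feedback)
        memory.append(f"Previous attempt:\n{attempt}\nCritique: {critique}")
    return ReflexionResult(attempt, False, max_trials, memory)


def stub_actor(task: str, memory: List[str]) -> str:
    # Stand-in for the model: buggy on the first try, fixed once a critique exists.
    operator = "+" if memory else "-"
    return f"def add(a, b):\n    return a {operator} b\n"


def unit_test_evaluator(attempt: str) -> Tuple[bool, str]:
    # External evaluator: run the candidate code against one unit test.
    namespace: dict = {}
    try:
        exec(attempt, namespace)  # demo only; sandbox real model output
        got = namespace["add"](2, 3)
    except Exception as exc:
        return False, f"raised {type(exc).__name__}: {exc}"
    if got != 5:
        return False, f"add(2, 3) returned {got}, expected 5"
    return True, "all tests passed"


def stub_reflector(task: str, attempt: str, feedback: str) -> str:
    # Stand-in for the Self-Reflection call: turn raw feedback into advice.
    return f"The test failed ({feedback}). Re-check the arithmetic operator."


if __name__ == "__main__":
    result = reflexion_loop("Write add(a, b).", stub_actor,
                            unit_test_evaluator, stub_reflector)
    print(result.passed, result.trials)
```

With these stubs the first attempt fails the unit test, the critique is appended to memory, and the second attempt passes. Swapping `unit_test_evaluator` for a model-based judge would keep the loop unchanged but, as noted above, would make the pass/fail signal less reliable.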


Related Knowledge

Why Asking an LLM to Check Its Own Answer Often Fails
