EGGROLL: Low-Rank Perturbations Make Evolution Strategies 100x Faster at Hyperscale

EGGROLL (Evolution Guided GeneRal Optimisation via Low-rank Learning), from an Oxford/MILA/NVIDIA collaboration in November 2025, structures each evolution strategies (ES) perturbation as a low-rank matrix so that thousands of perturbations can be computed in a single batched forward pass, yielding a claimed 100-fold training-speed increase over naive ES at billion-parameter scale.

"Evolution Strategies at the Hyperscale" (arXiv 2511.16652, November 20 2025) introduces EGGROLL — Evolution Guided GeneRal Optimisation via Low-rank Learning. The senior authors are well-credentialed: Jakob Foerster (Oxford, multi-agent RL), Aaron Courville (MILA, Bengio's group), and Shimon Whiteson (Oxford, RL). The 19-author paper is a serious Oxford / MILA / NVIDIA collaboration. The core technical move is to structure each perturbation in the evolution strategies population as a rank-r matrix in the LoRA style. Many such perturbations can be applied as swapped LoRA adapters and computed in a single batched forward pass, rather than running 30+ separate forward passes for separate full-rank perturbations. Although each individual perturbation is low-rank, averaging across many low-rank perturbations recovers a high-rank update — so expressivity is preserved. Headline numbers from the project page (eshyperscale.github.io): - 100-fold training-speed increase vs naive ES. - Up to 91% of pure batch inference throughput during training (vs 34% for PPO, 0.41% for OpenAI-style ES). - On RWKV-7 14B trained on DeepScaleR for 12 hours on 32 GPUs: AIME24 13% to 30% (+17 percentage points), AIME25 7% to 33% (+26 percentage points), and HMMT25 11% to 13% (+2 percentage points). - 8,192 parallel generations during training (vs 256 for GRPO on the same hardware) on GSM8K with RWKV-7 7B. The paper's own framing is "competitive with GRPO," not "beats GRPO." The strong AIME results contrast with the near-noise HMMT25 result on a harder benchmark, so the broader "EGGROLL outperforms GRPO" framing is more nuanced than some popular write-ups suggest. Note that EGGROLL is sometimes misheard or transcribed as "Agro" in video coverage — the actual method name is EGGROLL (a deliberate food pun: rank-r perturbations folded around a base model resemble an egg roll). 
Code is available at github.com/ESHyperscale/HyperscaleES (JAX) and github.com/ESHyperscale/nano-egg (single-file Int8 reference). EGGROLL's parallel-population trick depends heavily on RWKV, a recurrent architecture whose constant state size makes large-batch inference cheap. Whether the technique transfers to transformer architectures, with their expensive KV-cache scaling, remains an open question.
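
To make the low-rank perturbation idea described above more concrete, here is a minimal pure-Python sketch for a single linear layer. It is not the authors' JAX implementation: the helper names (`perturbed_outputs`, `es_update`, `matrix_rank`), the toy fitness function, the tiny dimensions, and the plain 1/(N·sigma) scaling are all assumptions made for illustration, and the paper's exact scaling and fitness normalisation may differ. The sketch shows two things: the base product `W x` is shared by every population member while each member adds only a cheap rank-r correction, and the fitness-weighted average of many rank-r perturbations generally has rank higher than r.

```python
import random

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def randn_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def low_rank_product(A, B):
    """Dense A @ B^T, where A is d_out x r and B is d_in x r."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]

def perturbed_outputs(W, factors, x, sigma):
    """y_i = (W + sigma * A_i B_i^T) x for every member, without forming A_i B_i^T."""
    base = matvec(W, x)                     # computed once, shared by all members
    outs = []
    for A, B in factors:
        z = matvec(list(zip(*B)), x)        # B^T x, a length-r vector
        corr = matvec(A, z)                 # A (B^T x), a length-d_out vector
        outs.append([b + sigma * c for b, c in zip(base, corr)])
    return outs

def es_update(factors, fitness, sigma):
    """Fitness-weighted average of the perturbations (illustrative scaling)."""
    n = len(factors)
    d_out, d_in = len(factors[0][0]), len(factors[0][1])
    G = [[0.0] * d_in for _ in range(d_out)]
    for (A, B), f in zip(factors, fitness):
        E = low_rank_product(A, B)
        for i in range(d_out):
            for j in range(d_in):
                G[i][j] += f * E[i][j] / (n * sigma)
    return G

def matrix_rank(M, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    rank, rows, cols = 0, len(M), len(M[0])
    for c in range(cols):
        pivot = max(range(rank, rows), key=lambda i: abs(M[i][c]), default=None)
        if pivot is None or abs(M[pivot][c]) < tol:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        for i in range(rank + 1, rows):
            f = M[i][c] / M[rank][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[rank])]
        rank += 1
    return rank

# Hypothetical toy setup: 8 rank-1 perturbations of a 4 x 6 weight matrix.
rng = random.Random(0)
d_out, d_in, r, n, sigma = 4, 6, 1, 8, 0.1
W = randn_matrix(rng, d_out, d_in)
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
factors = [(randn_matrix(rng, d_out, r), randn_matrix(rng, d_in, r)) for _ in range(n)]

outs = perturbed_outputs(W, factors, x, sigma)
fitness = [-sum(v * v for v in y) for y in outs]   # toy objective (assumption)
G = es_update(factors, fitness, sigma)

print("rank of one perturbation:", matrix_rank(low_rank_product(*factors[0])))
print("rank of averaged update:", matrix_rank(G))
```

In a real system the loop over population members would be a single batched tensor operation on the accelerator. The loop here only spells out the arithmetic that such a batch performs.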

Related Knowledge

RWKV: Recurrent Architecture with Constant State Size for Parallel Inference

Evolution Strategies for LLM Fine-Tuning: A Revival of a Pre-Deep-Learning Optimizer

Evolution Strategies at Scale (Cognizant 2025): First Full-Parameter ES on Billion-Parameter LLMs
