RWKV: Recurrent Architecture with Constant State Size for Parallel Inference

{{RWKV}} is a recurrent language-model architecture whose internal state has a fixed size independent of context length, making large-batch inference and parallel-perturbation training dramatically cheaper than for {{transformer}}-based models with their growing {{KV-cache}}.

RWKV (Receptance Weighted Key Value) is a recurrent neural-network architecture for language modeling that combines properties of transformers and classical RNNs. Its defining feature is a constant-size internal state: unlike a transformer, where the KV-cache grows linearly with context length and dominates memory for long inputs, RWKV maintains a fixed-size hidden state that the model updates token by token.

This property has two practical consequences. First, inference cost per token is constant in context length, which is attractive for very long contexts. Second, large-batch inference is much cheaper than for transformers because there is no per-sequence KV-cache to materialize: every additional sequence in a batch costs only the fixed state. This becomes critical for training algorithms that need to evaluate many parameter variants in parallel.

Recent evolution strategies (ES) work has demonstrated the practical benefit. EGGROLL: Low-Rank Perturbations Make Evolution Strategies 100x Faster at Hyperscale (Oxford/MILA/NVIDIA, November 2025) uses RWKV-7 7B and 14B for its largest experiments precisely because constant-state inference makes it feasible to run 8,192 parallel generations on a small GPU cluster, versus 256 for GRPO on the same hardware. The architecture's batch-friendliness is what lets ES's many-perturbations approach scale.

Whether techniques optimized for RWKV transfer to transformer architectures (Llama, Qwen, GPT, Claude) is an open question. Because a transformer's KV-cache grows with both batch size and context length, running large populations in parallel is much more expensive, so methods that depend on cheap massive batching may not transfer without architectural changes.

The RWKV project is open-source and maintained primarily by Bo Peng (BlinkDL on GitHub).
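The memory contrast described above, a KV-cache that grows with context versus a fixed recurrent state, can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the class names, layer counts, head sizes, and the per-layer state size are assumptions chosen for the example, not figures taken from any RWKV or transformer release, and the formulas ignore weights, activations, and any cache-compression tricks.

```python
from dataclasses import dataclass


@dataclass
class TransformerShape:
    """Hypothetical transformer dimensions (assumed, not a real model config)."""
    n_layers: int = 32
    n_kv_heads: int = 32
    head_dim: int = 128


@dataclass
class RecurrentShape:
    """Hypothetical fixed-state dimensions (assumed, not a real RWKV config)."""
    n_layers: int = 32
    # Assumption: 64 heads per layer, each holding a 64 x 64 state matrix.
    state_elems_per_layer: int = 64 * 64 * 64


def kv_cache_bytes(shape, batch, context_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every sequence in the batch."""
    # Factor 2: one key vector and one value vector per layer, head, and token.
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * bytes_per_elem
    return batch * context_len * per_token


def fixed_state_bytes(shape, batch, context_len, bytes_per_elem=2):
    """Bytes needed for a fixed-size recurrent state per sequence."""
    # context_len is accepted only to show that it does not affect the result.
    del context_len
    return batch * shape.n_layers * shape.state_elems_per_layer * bytes_per_elem


if __name__ == "__main__":
    transformer, recurrent = TransformerShape(), RecurrentShape()
    batch = 8192  # population size quoted in the text above
    for context_len in (1024, 4096, 16384):
        kv = kv_cache_bytes(transformer, batch, context_len)
        st = fixed_state_bytes(recurrent, batch, context_len)
        print(f"context={context_len:>6}  kv_cache={kv / 2**30:>10.1f} GiB  "
              f"fixed_state={st / 2**30:>8.1f} GiB")
```

Under these assumed shapes and 2-byte elements, the KV-cache total scales with both batch size and context length, while the fixed-state total scales with batch size alone. The absolute numbers depend entirely on the assumed dimensions and should not be read as measurements of any real system.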

Related Knowledge

KV Cache (Transformer Inference)

EGGROLL: Low-Rank Perturbations Make Evolution Strategies 100x Faster at Hyperscale
