Ever dreamt of running Reinforcement Learning (RL) on a whopping 32-billion-parameter large language model (LLM) in just 4 bits, on a single H100 GPU, with BF16-level accuracy and a 1.2–1.5× speed boost? NVIDIA researchers, together with collaborators from MIT, HKU, and Tsinghua, have made it a reality with QeRL, a training framework that pushes RL post-training into 4-bit FP4 (NVFP4) while keeping the gradient math in higher precision via LoRA. The team's work is open source; the paper link is at the end of this article.
So, what’s QeRL doing to the Reinforcement Learning loop?
RLHF/GRPO/DAPO pipelines spend the bulk of their time in rollouts (token generation). QeRL shifts the policy's weight path to NVFP4 (4-bit FP4 with dual-level scaling) and keeps logits and gradients in higher precision via LoRA. Backprop therefore remains stable while the sampling path hits hardware-efficient FP4×BF16 kernels (Marlin). The result: faster prefill and decoding during rollouts without maintaining a separate full-precision policy.
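To make the split concrete, here is a minimal PyTorch sketch of the idea, not QeRL's actual implementation: the base weight is fake-quantized to the FP4 (E2M1) grid with per-block scaling, frozen, and used for the forward/rollout path, while gradients flow only through the small LoRA adapters. Names such as `QeRLLinear` and `fake_quant_nvfp4` are our own; the real framework dispatches to fused FP4×BF16 Marlin kernels rather than simulating the quantization.

```python
import torch
import torch.nn as nn

# E2M1 (FP4) magnitudes; NVFP4 adds dual-level scaling (per-block FP8 scale + per-tensor scale).
FP4_MAGS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4(w: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Simulated weight-only FP4: per-block absmax scaling, then snap to the signed E2M1 grid."""
    out_f, in_f = w.shape
    assert in_f % block == 0, "illustrative sketch assumes in_features divisible by the block size"
    grid = torch.cat([-FP4_MAGS.flip(0)[:-1], FP4_MAGS]).to(device=w.device, dtype=w.dtype)
    wb = w.reshape(out_f, in_f // block, block)
    scale = wb.abs().amax(dim=-1, keepdim=True) / 6.0          # 6.0 = largest FP4 magnitude
    q = wb / scale.clamp(min=1e-8)
    idx = (q.unsqueeze(-1) - grid).abs().argmin(dim=-1)        # nearest grid point per value
    return (grid[idx] * scale).reshape(out_f, in_f)

class QeRLLinear(nn.Module):
    """Frozen (simulated) FP4 base weight + trainable LoRA adapters in higher precision."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: int = 32):
        super().__init__()
        self.register_buffer("w_q", fake_quant_nvfp4(base.weight.data))   # frozen, no gradient
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both rollout sampling and the training forward pass use the quantized weight path;
        # only the LoRA matrices receive gradients, which keeps backprop stable.
        return x @ self.w_q.T + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```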
Quantization as exploration, now schedulable
A fascinating finding: deterministic FP4 quantization raises policy entropy, flattening token distributions early in training and improving exploration. To control this effect over time, QeRL introduces Adaptive Quantization Noise (AQN) – channel-wise Gaussian perturbations mapped into LayerNorm scale parameters and annealed with an exponential schedule. This keeps kernel fusion intact (no extra weight tensors) while transitioning from exploration to exploitation.
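A rough sketch of how such a schedule might look in PyTorch (the constants, the multiplicative form of the noise, and the helper names are our assumptions, not values from the paper): channel-wise Gaussian noise is folded into the norm layer's scale vector, so no extra weight tensor is introduced and the fused quantized matmul stays untouched.

```python
import torch

def aqn_sigma(step: int, total_steps: int,
              sigma_start: float = 1e-2, sigma_end: float = 1e-4) -> float:
    """Exponentially anneal the noise scale from sigma_start down to sigma_end."""
    ratio = step / max(total_steps - 1, 1)
    return sigma_start * (sigma_end / sigma_start) ** ratio

@torch.no_grad()
def apply_aqn(norm_scale: torch.Tensor, clean_scale: torch.Tensor,
              step: int, total_steps: int) -> None:
    """Fold channel-wise Gaussian noise into a LayerNorm/RMSNorm scale vector in place.

    The norm's scale multiplies the activations feeding the next quantized linear layer,
    so perturbing it acts like injecting noise on that layer's effective weights without
    materializing a second weight tensor (kernel fusion stays intact).
    """
    sigma = aqn_sigma(step, total_steps)
    noise = 1.0 + sigma * torch.randn_like(clean_scale)   # one Gaussian sample per channel
    norm_scale.copy_(clean_scale * noise)
```

In use, one would keep a copy of the clean scales and refresh the perturbation each rollout step: the larger early sigma keeps policy entropy high for exploration, and the exponential decay hands control back to exploitation as training progresses.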
What do the results say?
On Qwen2.5 backbones, QeRL shows that NVFP4+LoRA outperforms vanilla LoRA and QLoRA in rollout throughput and overall training time, with more than 2× the rollout throughput of QLoRA on 14B/32B models and roughly 1.8× faster end-to-end training than QLoRA in a representative setup. The team also demonstrates training a 32B policy with GRPO on a single H100-80GB, thanks to the lower memory footprint of weight-only FP4.
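For intuition on why a 32B policy fits on one 80 GB GPU, here is our own back-of-envelope arithmetic (not figures from the paper):

```python
# Back-of-envelope memory estimate (our own arithmetic, not figures from the paper).
params = 32e9
bf16_weights_gb = params * 2 / 1e9          # ~64 GB for BF16 weights alone
fp4_weights_gb  = params * 0.5 / 1e9        # ~16 GB at 4 bits per weight
fp4_scales_gb   = (params / 16) * 1 / 1e9   # ~2 GB of per-block FP8 scales (block size 16)
print(f"BF16: ~{bf16_weights_gb:.0f} GB  vs  NVFP4: ~{fp4_weights_gb + fp4_scales_gb:.0f} GB")
# Roughly 64 GB vs 18 GB of weights, leaving headroom on an H100-80GB for the KV cache,
# LoRA parameters, and optimizer state.
```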
On accuracy, the 7B model reaches 90.8% on GSM8K and 77.4% on MATH500, surpassing 16-bit LoRA and QLoRA under the team's setup and matching full-parameter fine-tuning. Across broader math benchmarks, QeRL maintains parity or an advantage while converging faster thanks to the improved exploration.
What QeRL is – and isn’t
QeRL is weight-only FP4 with LoRA updates; it doesn’t claim FP4 precision for logits/gradients. The benefits focus on rollout/prefill throughput and memory footprint, with empirical evidence that quantization-induced entropy aids RL exploration when AQN modulates it over training. Generalization to other tasks depends on reward design and sequence lengths.
Key Takeaways
– QeRL combines NVFP4 4-bit weight quantization with LoRA to accelerate the rollout phase and cut memory, enabling RL for a 32B LLM on a single H100-80GB.
– Quantization acts as exploration: FP4 increases policy entropy, while Adaptive Quantization Noise (AQN) schedules channel-wise noise via LayerNorm scales.
– Reported efficiency: >1.5× rollout speedups vs 16-bit LoRA and ~1.8× end-to-end vs QLoRA; >2× rollout throughput vs QLoRA on 14B/32B setups.
– Accuracy holds: Qwen2.5-7B reaches 90.8% on GSM8K and 77.4% on MATH500, matching full-parameter fine-tuning under the paper’s setup.
Editorial Comments
QeRL speeds up the RL rollout stage by quantizing weights to NVFP4 and keeping updates and logits in higher precision using LoRA. It reports >1.5× rollout speedups and can train a 32B policy on a single H100-80GB GPU. It adds Adaptive Quantization Noise to control exploration during training. Results are shown mainly on math-reasoning tasks using GRPO and DAPO. The gains rely on NVFP4 kernel support such as Marlin.
Check out the paper [here](https://arxiv.org/pdf/2510.11696).