NVIDIA’s AI team has just dropped a bombshell with Reinforcement Learning Pretraining (RLP), a new training objective that blends reinforcement learning into the pretraining stage itself, rather than reserving it for post-training. The idea is simple yet powerful: treat a short chain-of-thought (CoT) as an action taken before predicting the next token, and reward it by how much information it provides about that token compared to a ‘no-think’ baseline. This yields a continuous, position-wise reward signal that can be applied to ordinary text streams at massive scale.
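
In rough symbols (the notation here is illustrative, not quoted from the paper), the per-position reward compares the model’s next-token likelihood with and without the sampled thought:

$$
r_t \;=\; \log p_\theta\!\left(x_t \mid x_{<t},\, c_t\right) \;-\; \log p_{\text{no-think}}\!\left(x_t \mid x_{<t}\right),
$$

so a thought $c_t$ earns positive reward exactly when it makes the observed next token $x_t$ more likely than predicting without it.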

How it Works: RLP uses a single network both to sample short chains of thought and to score the next token. An exponential-moving-average (EMA) teacher supplies the ‘no-think’ counterfactual, and the reward is the log-likelihood ratio between the CoT-conditioned next-token probability and that no-think baseline. Training updates only the thought tokens, using a clipped surrogate objective with per-token importance ratios and group-relative advantages.
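
To make the moving parts concrete, here is a minimal sketch (not the authors’ code) of how the information-gain reward, group-relative advantages, and a thought-token-only clipped surrogate could fit together; the tensor shapes, toy inputs, and the `eps` clip range are all illustrative assumptions.

```python
# Illustrative sketch of an RLP-style update step (assumed shapes and values).
import torch

def rlp_rewards(logp_with_cot, logp_no_think):
    """Per-position information gain: log p(x_t | x_<t, c_t) - log p(x_t | x_<t)."""
    return logp_with_cot - logp_no_think

def clipped_surrogate(logp_new, logp_old, advantages, thought_mask, eps=0.2):
    """Clipped surrogate loss applied only to chain-of-thought tokens."""
    ratio = torch.exp(logp_new - logp_old)                 # per-token importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    per_token = torch.minimum(unclipped, clipped)
    # Mask so only thought tokens contribute; other positions get no gradient.
    return -(per_token * thought_mask).sum() / thought_mask.sum().clamp(min=1)

# Toy example: G sampled thoughts of length T for one prediction position.
G, T = 4, 8
logp_with_cot = torch.randn(G)       # log-prob of the next token given each sampled CoT
logp_no_think = torch.randn(1)       # no-think baseline from the EMA teacher
rewards = rlp_rewards(logp_with_cot, logp_no_think)
advantages_per_thought = rewards - rewards.mean()   # group-relative advantage

# Broadcast each thought's advantage over its tokens and take one policy step.
logp_new = torch.randn(G, T, requires_grad=True)    # stand-in for policy log-probs
logp_old = logp_new.detach() + 0.01 * torch.randn(G, T)
thought_mask = torch.ones(G, T)                     # 1 where a token belongs to the CoT
advantages = advantages_per_thought.unsqueeze(1).expand(G, T)
loss = clipped_surrogate(logp_new, logp_old, advantages, thought_mask)
loss.backward()
```

In a real pipeline the log-probabilities would come from forward passes of the policy and the EMA teacher rather than random tensors; the point here is only how the dense reward turns into per-token credit for the thought.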

Why it Matters: Unlike previous ‘reinforcement pretraining’ methods that rely on sparse, binary correctness signals or proxy filters, RLP’s dense, verifier-free reward attaches position-wise credit wherever thinking improves prediction. This enables updates at every token position in general web-scale corpora without external verifiers or curated answer keys.

Results You Won’t Believe: Pretraining Qwen3-1.7B-Base with RLP improved the overall math+science average by ~19% over the base model and ~17% over compute-matched continuous pretraining (CPT). After identical post-training, the RLP-initialized model retained a ~7-8% relative advantage, with the largest gains on reasoning-heavy benchmarks. For Nemotron-Nano-12B v2, applying RLP lifted the overall average from 42.81% to 61.32%, including an absolute +23% gain on scientific reasoning, while using ~200B fewer tokens.

Outperforming the Competition: Under matched data and compute, RLP outperformed RPT (reinforcement pre-training) on math, science, and overall averages, thanks to its continuous information-gain reward versus RPT’s sparse binary signal and entropy-filtered tokens.

The Future is Here: RLP is orthogonal to post-training pipelines, and its improvements compound after standard alignment. It scales to domain-agnostic corpora and SFT-style reasoning corpora alike, avoiding the brittleness of narrow curated datasets. Even in compute-matched comparisons, RLP still led on overall averages, suggesting the improvements come from the design of the objective rather than extra budget.

Ready to Upgrade Your AI? RLP reframes pretraining to directly reward ‘think-before-predict’ behavior with a verifier-free, information-gain signal, yielding durable reasoning gains that persist through identical SFT and RLVR (reinforcement learning with verifiable rewards) post-training and extend across architectures. The objective integrates cleanly into large-scale pipelines without curated verifiers, making it a practical upgrade to next-token pretraining rather than a post-training add-on.

Join the Revolution: Check out the [Paper](https://github.com/NVlabs/RLP/blob/main/pdf/RLP_Reinforcement_as_a_Pretraining_Objective.pdf), [Code](https://github.com/NVlabs/RLP), and [Project Page](https://nvlabs.github.io/RLP/).
