A recent study from Apple has shed light on the intricacies of mid-training in reinforcement learning (RL), outlining what this phase should accomplish before post-training and introducing RA3, a novel method that enhances RL convergence. RA3, an Expectation-Maximization (EM)-style procedure, learns temporally consistent latent actions from expert traces and fine-tunes on these bootstrapped traces. The research underscores two key aspects of mid-training: pruning to a compact near-optimal action subspace and shortening the effective planning horizon, both of which accelerate RL convergence.

The study, published on arXiv, is the first to formally explore how mid-training shapes post-training RL. It breaks down the outcomes into two critical factors: pruning efficiency and RL convergence. Pruning efficiency refers to how effectively mid-training selects a compact near-optimal action subset that shapes the initial policy prior. RL convergence, on the other hand, denotes how swiftly post-training improves within that restricted set. The analysis suggests that mid-training is most effective when the decision space is compact and the effective horizon is short, favoring temporal abstractions over primitive next-token actions.
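As a rough intuition for why this matters (a generic illustration, not the paper's analysis), classical tabular RL guarantees degrade polynomially in the size of the action space and the horizon, with regret bounds of the form

$$\text{Regret}(T) \;=\; \tilde{O}\!\left(\operatorname{poly}(H)\,\sqrt{|\mathcal{S}|\,|\mathcal{A}|\,T}\right),$$

so pruning $|\mathcal{A}|$ down to a compact near-optimal subset and shortening the effective horizon $H$ through temporal abstractions both shrink exactly the quantities such bounds depend on.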

RA3, the algorithm proposed in the study, optimizes a sequential variational lower bound (a temporal ELBO) with an EM-like loop. In the E-step, RA3 uses RL to infer temporally consistent latent structures (abstractions) aligned with expert sequences. In the M-step, it performs next-token prediction on the bootstrapped, latent-annotated traces, integrating these abstractions into the model’s policy.
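To make the E-step/M-step alternation concrete, here is a minimal, heavily simplified sketch of what such a loop could look like. This is not the paper’s implementation: the helpers (`propose_latent_annotations`, `elbo_score`, `ntp_finetune`) are hypothetical placeholders, the E-step is reduced to a best-of-N selection rather than RL against the full temporal ELBO, and the M-step merely stands in for next-token-prediction fine-tuning.

```python
# Minimal sketch of an EM-style mid-training loop in the spirit of RA3.
# All helper names are hypothetical placeholders, not the paper's API.

import random


def propose_latent_annotations(policy, trace, num_samples=4):
    """E-step helper (hypothetical): sample candidate segmentations of an
    expert trace into latent-action segments. A real implementation would
    consult the policy; this toy version cuts at random positions."""
    candidates = []
    for _ in range(num_samples):
        k = min(2, len(trace) - 1)
        cuts = sorted(random.sample(range(1, len(trace)), k=max(k, 0)))
        segments = [trace[i:j] for i, j in zip([0] + cuts, cuts + [len(trace)])]
        candidates.append(segments)
    return candidates


def elbo_score(policy, segments):
    """Hypothetical stand-in for the temporal-ELBO term that rewards
    segmentations the policy finds consistent. Toy score: prefer fewer,
    longer segments as a crude proxy for temporal consistency."""
    return -len(segments)


def ntp_finetune(policy, annotated_traces):
    """M-step (hypothetical): next-token-prediction fine-tuning on the
    bootstrapped, latent-annotated traces. Here it only records them."""
    policy.setdefault("memory", []).extend(annotated_traces)
    return policy


def ra3_style_mid_training(policy, expert_traces, rounds=3):
    for _ in range(rounds):
        annotated = []
        # E-step: keep the highest-scoring latent annotation per trace.
        for trace in expert_traces:
            candidates = propose_latent_annotations(policy, trace)
            best = max(candidates, key=lambda segs: elbo_score(policy, segs))
            annotated.append(best)
        # M-step: update the policy on the bootstrapped, annotated traces.
        policy = ntp_finetune(policy, annotated)
    return policy


if __name__ == "__main__":
    traces = [list("print(1+1)"), list("def f(x): return x*2")]
    print(ra3_style_mid_training({}, traces))
```

The point of the sketch is the alternation itself: latent annotations are re-inferred under the current policy in the E-step, and the policy is then updated on the bootstrapped, annotated traces in the M-step.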

The study’s results, which focus on code generation tasks, are promising. Across multiple base models, RA3 improved average pass@k scores on HumanEval and MBPP by approximately 8 and 4 points over the base model and an NTP mid-training baseline. Moreover, when initialized from RA3, post-training with reinforcement learning with verifiable feedback (RLVF) converged faster and reached higher final performance on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
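For reference, pass@k is typically reported with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021). A minimal sketch follows; the sample counts in the example are illustrative, not values from the study.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per problem,
    c = samples that pass all unit tests, k = evaluation budget.
    Returns the probability that at least one of k samples passes."""
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Example (illustrative numbers): 200 samples, 43 correct, report pass@10.
print(round(pass_at_k(200, 43, 10), 4))
```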

In essence, the study formalizes mid-training via two key determinants: pruning efficiency and impact on RL convergence. It argues that mid-training is most effective when the decision space is compact, and the effective planning horizon is short. RA3, the algorithm introduced in the study, optimizes a sequential variational lower bound by iteratively discovering temporally consistent latent structures with RL and then fine-tuning on bootstrapped traces in an EM-style loop. On code generation tasks, RA3 demonstrated significant improvements in average pass@k scores and accelerated RLVF convergence, leading to improved asymptotic performance on various benchmarks.

The study’s contribution is concrete and focused. It formalizes mid-training around two key determinants and operationalizes them via a temporal ELBO optimized in an EM loop to learn persistent action abstractions before RLVF. The researchers reported average pass@k gains of approximately 8 and 4 points over base and NTP mid-training baselines on HumanEval and MBPP, respectively, and faster RLVF convergence on several benchmarks.

For those interested in delving deeper into the technical details, the full paper is available on arXiv, with accompanying tutorials, code, and notebooks on GitHub.
