Ever wondered how your AI agent’s performance would skyrocket if it could learn purely from its own actions, without any rewards or human demonstrations? Meta Superintelligence Labs has just made that a reality with ‘Early Experience’, a revolutionary reward-free training approach that’s leaving imitation learning in the dust across eight diverse benchmarks.

What’s Early Experience all about?

Traditional AI pipelines rely heavily on imitation learning (IL) from expert trajectories. IL is simple to set up, but expert data is expensive to scale and the resulting policies are unreliable when faced with new situations. Reinforcement learning (RL), on the other hand, promises learning from experience but requires verifiable rewards and stable infrastructure, which aren’t always available, especially in web and multi-tool settings.

Early Experience sits right in the middle, offering the best of both worlds. It’s reward-free like IL, but the supervision comes from the consequences of the agent’s own actions, not just expert actions. In simple terms, the agent proposes, acts, and learns from what actually happens next—no reward function required!

The team behind Early Experience has come up with two concrete strategies to make this happen:

1. Implicit World Modeling (IWM): The model learns to predict the next observation given the state and chosen action, helping the agent tighten its internal model of environment dynamics and reduce off-policy drift.

2. Self-Reflection (SR): The model explains why an expert action is better than alternatives using observed outcomes, then fine-tunes the policy from this contrastive signal.

Both strategies use the same budgets and decoding settings as IL, but with a twist—the data source is agent-generated branches, not more expert trajectories.
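
To make the recipe concrete, here’s a minimal sketch of how those two objectives could be turned into ordinary supervised fine-tuning examples. This is not the authors’ released code: the `env.step` and `agent.propose_actions` interfaces and the prompt templates are hypothetical stand-ins for whatever environment and model API you’re using.

```python
def build_early_experience_data(expert_trajectory, env, agent, k_alternatives=2):
    """Branch off each expert state with the agent's own actions and record the consequences."""
    iwm_examples, sr_examples = [], []
    for state, expert_action in expert_trajectory:
        for alt_action in agent.propose_actions(state, k=k_alternatives):
            next_obs = env.step(state, alt_action)  # observed consequence; no reward needed

            # Implicit World Modeling: predict the next observation from (state, action).
            iwm_examples.append({
                "prompt": f"State: {state}\nAction: {alt_action}\nPredict the next observation:",
                "target": str(next_obs),
            })

            # Self-Reflection: contrast the agent's alternative (and its outcome)
            # with the expert action, then supervise the policy on the expert action.
            sr_examples.append({
                "prompt": (
                    f"State: {state}\n"
                    f"Expert action: {expert_action}\n"
                    f"Alternative action: {alt_action}\n"
                    f"Observed outcome of the alternative: {next_obs}\n"
                    "Explain why the expert action is preferable, then choose an action:"
                ),
                "target": str(expert_action),  # a generated reflection would precede this in practice
            })
    return iwm_examples, sr_examples
```

In the Self-Reflection variant the explanation itself is generated by the model and folded into the fine-tuning target; the point of the sketch is simply that both datasets are built from agent-generated branches and can be trained with the same kind of supervised objective used for IL.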

Putting Early Experience to the test

The research team evaluated Early Experience on eight language-agent environments covering web navigation, long-horizon planning, scientific/embodied tasks, and multi-domain API workflows. The results? Early Experience delivered average absolute gains of +9.6 points in success rate and +9.4 points in out-of-domain (OOD) generalization over IL across all tasks and models. These gains persisted even when the same checkpoints were used to initialize RL (GRPO), improving post-RL ceilings by up to +6.4 points compared to RL started from an IL checkpoint.

Efficiency: Less expert data, same optimization budget

A key practical win is demo efficiency. With a fixed optimization budget, Early Experience matches or beats IL using a fraction of the expert data. On WebShop, Early Experience trained on just 1/8 of the demonstrations already exceeded IL trained on the full demo set; on ALFWorld, parity was reached with half the demos. The advantage grew as more demonstrations were added, indicating that agent-generated future states provide supervision signals that demonstrations alone can’t capture.
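
If you want to run this kind of ablation yourself, the control is straightforward to sketch: subsample the expert set while holding the number of gradient updates fixed, so any gain comes from the agent-generated data rather than from extra optimization. The helpers below are illustrative guesses at that setup, not code from the paper.

```python
import random

def subsample_demos(demos, fraction, seed=0):
    """Keep a fixed fraction of expert demonstrations (e.g. 1/8 for the WebShop setting)."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(demos) * fraction))
    return rng.sample(demos, n_keep)

def epochs_for_fixed_budget(total_grad_steps, n_examples, batch_size):
    """Hold the optimization budget constant: a smaller dataset is simply cycled for more epochs."""
    steps_per_epoch = max(1, n_examples // batch_size)
    return max(1, total_grad_steps // steps_per_epoch)
```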

Where does reinforcement learning fit in?

Early Experience isn’t “RL without rewards”. It’s a supervised recipe that uses agent-experienced outcomes as labels. In environments with verifiable rewards, the team simply adds RL after Early Experience. Because the initialization is better than IL, the same RL schedule climbs higher and faster, with up to +6.4 points of final success over IL-initialized RL across the tested domains. This positions Early Experience as a bridge: reward-free pre-training from consequences, followed (where possible) by standard RL.
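
Put together, the staging looks roughly like the sketch below. Here `to_sft_examples`, `train_supervised`, and `train_grpo` are hypothetical placeholders (as is reusing `build_early_experience_data` from the earlier sketch); a real training stack will look different.

```python
def train_agent(base_model, expert_demos, env, use_early_experience=True, has_rewards=False):
    """Reward-free Early Experience first, then optional RL where rewards are verifiable."""
    il_data = to_sft_examples(expert_demos)  # standard (state -> expert action) targets

    if use_early_experience:
        # Stage 1: augment the IL data with the agent's own branches
        # (IWM next-observation targets and SR contrastive targets).
        iwm_data, sr_data = [], []
        for traj in expert_demos:
            iwm, sr = build_early_experience_data(traj, env, base_model)
            iwm_data += iwm
            sr_data += sr
        policy = train_supervised(base_model, il_data + iwm_data + sr_data)
    else:
        # Baseline: plain imitation learning on the expert demonstrations only.
        policy = train_supervised(base_model, il_data)

    if has_rewards:
        # Stage 2 (optional): standard RL, e.g. GRPO, starting from the stronger checkpoint.
        policy = train_grpo(policy, env)
    return policy
```

The design point is that the only thing changing between the IL baseline and Early Experience is the supervision data fed to the same supervised trainer; the optional GRPO stage is identical in both cases, which is why a better initialization translates directly into a higher post-RL ceiling.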

Key Takeaways

– Reward-free training via agent-generated future states (not rewards) using Implicit World Modeling and Self-Reflection outperforms imitation learning across eight environments.
– Reported absolute gains over IL: +18.4 (WebShop), +15.0 (TravelPlanner), +13.3 (ScienceWorld) under matched budgets and settings.
– Demo efficiency: exceeds IL on WebShop with 1/8 of demonstrations; reaches ALFWorld parity with 1/2—at fixed optimization cost.
– As an initializer, Early Experience boosts subsequent RL (GRPO) endpoints by up to +6.4 versus RL started from IL.
– Validated on multiple backbone families (3B–8B) with consistent in-domain and out-of-domain improvements; positioned as a bridge between imitation learning (IL) and reinforcement learning (RL).

Editorial Comments

Early Experience is a practical and pragmatic contribution. It replaces brittle rationale-only augmentation with outcome-grounded supervision that an agent can generate at scale, without reward functions. The two variants—Implicit World Modeling and Self-Reflection—directly attack off-policy drift and long-horizon error accumulation, explaining the consistent gains over imitation learning across eight environments and the stronger RL ceilings when used as an initializer for GRPO. In web and tool-use settings where verifiable rewards are scarce, this reward-free supervision is the missing middle between IL and RL and is immediately actionable for production agent stacks.

Check out the [PAPER](https://arxiv.org/pdf/2510.08558) here. For tutorials, codes, and notebooks, head over to our [GitHub Page](https://github.com/meta-org/early-experience). You can also follow us on [Twitter](https://twitter.com/metaorg), join our [100k+ ML SubReddit](https://www.reddit.com/r/MachineLearning/), subscribe to our newsletter, and even join us on [Telegram](https://t.me/metaorg).
