
Sharing is Caring: How a 'Swarm' of Language Models Learns Faster by Sharing Experiences

Paper at a Glance
- Paper Title: Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
- Authors: Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, and Matthew J. Wright (Gensyn AI Team)
- Affiliation: Gensyn AI
- Published in: arXiv, 2025
- Link to Paper: https://arxiv.org/abs/2509.08721
The Gist of It: TL;DR
In one sentence: This paper introduces Swarm sAmpling Policy Optimization (SAPO), a decentralized reinforcement learning framework where a “swarm” of independent language models learns complex reasoning tasks significantly faster by sharing their plain-text experiences (rollouts), achieving up to a 94% improvement in reward without needing synchronized hardware or identical models.
Why It Matters: The Big Picture
Training large language models (LMs) is a massive undertaking. After the initial pre-training, a crucial step is post-training with reinforcement learning (RL), which helps models learn complex reasoning and align with human preferences. This is the magic behind the impressive capabilities of models like GPT-4 and DeepSeek-R1.
However, the standard approach to RL post-training is brutally expensive and complex. It typically requires orchestrating enormous, centralized clusters of high-end GPUs. These systems are fragile, suffer from communication bottlenecks when trying to keep all model weights in sync, and are financially out of reach for most researchers and smaller organizations. This creates a high barrier to entry, concentrating cutting-edge AI development in the hands of a few.
The researchers at Gensyn AI ask a powerful question: What if we could achieve the benefits of large-scale training without the centralized mothership? What if a fleet of smaller, independent, and diverse models could learn together, collaboratively, just by talking to each other? This paper presents a compelling answer, offering a path to democratize and decentralize advanced AI training.
The Core Idea: How It Works
1. The Problem They’re Solving
Conventional distributed RL requires all participating models (or “workers”) to constantly synchronize their parameters (weights). This is a heavy, high-bandwidth operation. Furthermore, it assumes all workers are running the same model on similar hardware, creating a rigid, homogeneous system. SAPO is designed to break free from these constraints.
2. The Key Innovation
The central idea of SAPO is elegant and simple: instead of sharing heavy model weights, agents in the network share their lightweight experiences. In the context of LMs, an experience, or “rollout,” is simply the plain text generated in response to a prompt. This simple mechanism has profound implications:
- It’s lightweight: Sharing text is vastly cheaper than synchronizing gigabytes of model parameters.
- It’s asynchronous: Agents don’t need to wait for each other. They can learn at their own pace.
- It’s heterogeneous: Since only text is exchanged, the swarm can consist of different model architectures running on wildly different hardware (from a MacBook to a high-end server).
This approach turns the training process into a multi-agent system where learning becomes a collective effort. When one agent has a breakthrough—an “Aha moment” on a difficult problem—it can share that successful experience, allowing the insight to propagate through the entire swarm and bootstrap the learning of others.
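To make the "lightweight" claim concrete, here is a rough sketch (not from the paper) of what a shared rollout payload might look like, and how its size compares to synchronizing the weights of a 0.5B-parameter model. The field names and the JSON wrapping are illustrative assumptions; the point is that only plain text plus a little metadata ever crosses the network.

```python
# Hypothetical sketch of a shared rollout payload (illustrative fields, not the paper's wire format).
import json

rollout = {
    "task_id": "reasoning_gym/arithmetic/0042",      # assumed identifier scheme
    "prompt": "What is 17 * 24?",
    "completion": "17 * 24 = 408. The answer is 408.",
    "reward": 1.0,                                    # scored by the sender's local reward model
}

payload_bytes = len(json.dumps(rollout).encode("utf-8"))

# Compare against syncing the weights of a 0.5B-parameter model in bf16 (2 bytes per parameter).
weight_sync_bytes = 0.5e9 * 2

print(f"rollout payload: ~{payload_bytes} bytes")
print(f"weight sync:     ~{weight_sync_bytes / 1e9:.1f} GB (~{weight_sync_bytes / payload_bytes:.0e}x larger)")
```

Even a rollout several thousand tokens long stays in the kilobyte range, several orders of magnitude smaller than a single weight synchronization.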
3. The Method, Step-by-Step
The SAPO algorithm, summarized in Algorithm 1 of the paper, works in a continuous loop for each agent in the swarm:
- Generate Local Experience: Each agent (a node in the network) receives a set of tasks or questions. It generates its own set of answers, which forms its “local rollouts.”
- Share and Sample: The agent broadcasts some of its rollouts to the swarm. Simultaneously, it listens for rollouts shared by other agents.
- Assemble a Training Set: The agent creates a training batch by combining its own local rollouts with a selection of “external rollouts” sampled from the swarm. A key part of the process is that agents can filter the external rollouts, for instance, by ignoring those that resulted in zero reward, focusing only on the most promising examples.
- Learn and Update: The agent uses its own local reward model to evaluate the combined training set and updates its own policy (its language model) using a standard RL algorithm like PPO or its variant, GRPO.
This cycle repeats, allowing each agent to benefit from the collective intelligence of the swarm while maintaining its own independent policy.
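A minimal sketch of one such round is shown below, in Python-flavored pseudocode. The helper callables (generate, local_reward, rl_update) and the swarm object's broadcast/sample methods are assumptions standing in for the paper's Algorithm 1, not its actual implementation; the zero-reward filter is one example of the filtering mentioned above.

```python
# Illustrative sketch of one SAPO agent's training round (assumed helpers, not the paper's code).
import random

def sapo_round(policy, tasks, swarm, generate, local_reward, rl_update,
               num_local=4, num_external=4):
    """Run one round of the loop described above; repeat for as many rounds as desired."""
    # 1. Generate local experience: answer a batch of tasks with the current policy.
    local = [generate(policy, task) for task in random.sample(tasks, num_local)]

    # 2. Share and sample: broadcast own rollouts, pull rollouts produced by other agents.
    swarm.broadcast(local)
    external = swarm.sample(num_external)

    # 3. Assemble the training set, dropping external rollouts that earn zero reward
    #    under this agent's own reward model (one possible filtering rule).
    batch = local + [r for r in external if local_reward(r) > 0]

    # 4. Score everything locally and apply a standard policy-gradient update (PPO/GRPO-style).
    rewards = [local_reward(r) for r in batch]
    return rl_update(policy, batch, rewards)
```

The 4 local / 4 external default mirrors the best-performing split in the experiments below; in practice each node can choose its own split, model, and RL update rule, since only text is exchanged.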
Key Experimental Results
The authors conducted controlled experiments using a swarm of eight Qwen2.5-0.5B models on the ReasoningGYM benchmark, a collection of procedural reasoning tasks. They tested four configurations, varying the ratio of local to external rollouts in the training batch.
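For reference, the setup can be summarized in a small configuration sketch. The dictionary below is purely illustrative (it is not the paper's configuration file), and only the three batch splits named in this post are listed.

```python
# Illustrative summary of the controlled experiment described above (not the paper's config).
experiment = {
    "num_agents": 8,
    "model": "Qwen2.5-0.5B",       # each agent trains its own copy
    "benchmark": "ReasoningGYM",   # procedurally generated reasoning tasks
    "batch_size": 8,               # rollouts per training batch
    # (local, external) splits tested, usable as num_local / num_external in the loop sketched earlier;
    # the three splits named in this post:
    "splits": [(8, 0), (4, 4), (2, 6)],
}
```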
- Finding 1: Balanced Sharing is Best: The configuration using an equal mix of local and external rollouts (4 local / 4 external) was the clear winner. As shown in Figure 2, it achieved the highest overall performance and a 94% improvement in cumulative reward compared to the baseline where agents trained in isolation (8 local / 0 external). This empirically validates the paper’s core premise: sharing is caring.
- Finding 2: Too Much Sharing Can Be Unstable: Relying too heavily on external experiences (the “2 local / 6 external” setup) led to highly oscillatory performance. Agents would learn very quickly from good examples in the swarm but would also be easily distracted by worse-performing agents, leading to “steep learning and forgetting” behavior. This highlights the need for a healthy balance between self-exploration and learning from others.
- Finding 3: It Works “In the Wild”: The team tested SAPO in a large-scale open-source demo with thousands of community members running diverse models on their own hardware. The results, shown in Figure 3, confirmed that models participating in the swarm consistently and significantly outperformed their isolated counterparts after about 175 training rounds.
A Critical Look: Strengths & Limitations
Strengths / Contributions
- Practical Decentralization: SAPO provides a genuinely practical and efficient framework for decentralized RL training. By sharing lightweight text rollouts, it sidesteps the primary communication and synchronization bottlenecks that plague traditional distributed systems.
- Support for Heterogeneity: The framework is agnostic to the model, hardware, or specific RL update rule used by each node. This opens the door for large, diverse communities of participants to contribute to a collective training effort, effectively democratizing access to powerful AI training paradigms.
- Improved Sample Efficiency: The experiments clearly show that collective experience sharing allows the swarm to learn much faster. The ability for successful “Aha moments” to propagate rapidly boosts the entire network’s performance.
Limitations / Open Questions
- Stability and Tuning: The paper shows that an imbalance in experience sharing can lead to instability. Finding the optimal ratio of local to external samples may require careful tuning and could be task-dependent, which adds a layer of complexity.
- Quality Control and Trust: In an open, untrusted network, what prevents malicious or simply low-quality agents from polluting the shared experience pool with useless or harmful rollouts? The paper mentions simple filtering (e.g., by reward), but more robust mechanisms would be needed for real-world, large-scale deployment.
- Benefit Ceiling: The open demo suggested SAPO’s benefits were most pronounced for “mid-capacity models,” while stronger models saw less improvement. This raises an interesting question: is there a performance ceiling where models become too advanced to benefit from the experiences of a less capable swarm?
Contribution Level: Significant Improvement. This paper does not invent a new RL algorithm from scratch, but it introduces a novel and highly practical framework for applying existing ones in a decentralized, multi-agent context. It addresses the critical real-world problem of scaling RL for LMs with an elegant solution that is more efficient, scalable, and accessible than traditional centralized approaches.
Conclusion: Potential Impact
SAPO presents a compelling vision for the future of AI training—one that is more collaborative, decentralized, and accessible. By enabling a heterogeneous swarm of models to learn from each other’s experiences, this work challenges the notion that cutting-edge AI development must be confined to massive, centralized data centers. It opens up exciting possibilities for community-driven AI development and even for new learning paradigms where unconventional agents, like humans, could contribute their experiences to a collective intelligence. While questions around stability and trust in large-scale systems remain, SAPO lays a strong foundation for a more open and efficient path to building smarter AI.
- Title: Sharing is Caring: How a 'Swarm' of Language Models Learns Faster by Sharing Experiences
- Author: Jellyfish
- Created at: 2025-10-05 21:02:48
- Updated at: 2025-10-06 09:24:26
- Link: https://makepaperseasy.com/posts/20251005210248.html
- License: This work is licensed under CC BY-NC-SA 4.0.