DeepSeek R1 has set the AI community abuzz, and today we’re diving deep into one of its secret weapons: Group Relative Policy Optimization (GRPO), introduced in the DeepSeekMath paper. Unlike traditional PPO (Proximal Policy Optimization), GRPO brings a fresh, battle-hardened approach to reinforcement learning, slashing memory and compute overhead by nearly 50% and significantly reducing training cost. In this post, we’ll unpack a Python code example that simulates GRPO on a virtual Battleship battlefield and draw parallels to how DeepSeek R1 uses the same principles to revolutionize LLM training. In less than three hundred lines of pure NumPy, we demonstrate how GRPO works and what its added value is.
LLM pre- and post-training
To understand why DeepSeek R1 is so cost-effective to train, let us recall how an LLM is trained from a high-level perspective. The training process of a Large Language Model (LLM) consists of two main phases: pre-training and post-training.
1. Pre-training: this is the initial phase where the model is exposed to vast amounts of high-quality data to learn next-token prediction. Essentially, it’s about feeding the model large-scale data and allowing it to develop a foundational understanding of language. Given the significant computational cost, this phase is typically carried out by large companies, while smaller companies often focus more on post-training.
2. Post-training: this phase focuses on refining the model’s reasoning abilities and is typically divided into two key steps:
Stage 1 – supervised fine-tuning: here, the model is fine-tuned using a smaller, high-quality dataset curated by experts. The goal is to teach the model to follow instructions, answer questions, and perform structured reasoning (e.g., chain-of-thought reasoning). Ideally, if unlimited expert data were available, this would be the best way to enhance the model. However, since high-quality data is scarce, an additional step is needed.
Stage 2 – Reinforcement Learning from Human Feedback (RLHF): Since expert reasoning data is limited, reinforcement learning (RL) is employed to bridge the gap. RLHF involves training a reward model based on human feedback, which then helps guide the LLM’s learning through reinforcement learning. This process ensures the model better aligns with human preferences, making it more effective and reliable in real-world applications.
RLHF is an expensive step in LLM training due to the high cost of human annotation, where expert annotators rank or critique model outputs. It also requires intensive compute resources for training a reward model and running Proximal Policy Optimization (PPO), which involves multiple inference steps and policy updates.
DeepSeek’s revolutionary idea is to skip the supervised fine-tuning step and apply reinforcement learning directly to their DeepSeek V3 foundation model, thereby significantly reducing cost and eliminating the human bias present in a curated dataset. Furthermore, they introduced the GRPO algorithm as a replacement for the PPO algorithm. To keep the discussion away from deep technical details, we will apply the GRPO algorithm to the game of Battleship to explain how it works.
The Battlefield: Battleship as a simulation environment
Imagine you’re commanding a fleet in a high-stakes Battleship game. Every move - each missile fired - is critical. In our simulation, the BattleshipGame class sets the stage with:
A 5×5 board where ships (“🚢”) are hidden beneath waves (“🌊”).
A dynamic environment where each missile can either miss (“💦”) or hit (“💥”), affecting your overall score.
This battlefield isn’t just for fun - it’s a metaphor for how GRPO treats each decision (or token generation) as a tactical move in an evolving campaign.
Setting Up the Game Environment
The simulation begins with a random board generator that places ships on the grid. The game rules and board setup mirror the uncertain, dynamic nature of real-world decision-making, where not every move leads to a hit.
Random Board Generation:
The random_board method continuously attempts to place all ships without overlapping until a valid board configuration is achieved. This randomness is akin to the initial uncertainty in model responses before optimization.
Game Mechanics:
The play and score methods simulate taking an action (firing a missile) and evaluating the outcome, respectively. A hit boosts your score, while a miss leaves a mark but no reward. A minimal sketch of both pieces follows below.
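To make this concrete, here is a minimal NumPy sketch of that game logic. It is not the post’s exact BattleshipGame class: the fleet defined in SHIP_SIZES, the flattened-index action format, and the hit-fraction scoring are illustrative assumptions.

```python
import numpy as np

BOARD_SIZE = 5
SHIP_SIZES = [3, 2]  # assumed fleet: one 3-cell and one 2-cell ship


def random_board(rng=None):
    """Retry random placements until every ship fits without overlapping."""
    if rng is None:
        rng = np.random.default_rng()
    while True:
        board = np.zeros((BOARD_SIZE, BOARD_SIZE), dtype=int)  # 0 = wave, 1 = ship
        ok = True
        for size in SHIP_SIZES:
            horizontal = rng.random() < 0.5
            h, w = (1, size) if horizontal else (size, 1)
            r = rng.integers(0, BOARD_SIZE - h + 1)
            c = rng.integers(0, BOARD_SIZE - w + 1)
            if board[r:r + h, c:c + w].any():
                ok = False  # overlap: discard this board and try again
                break
            board[r:r + h, c:c + w] = 1
        if ok:
            return board


def play(board, shots, action):
    """Fire at a flattened cell index; return 1 for a hit, 0 for a miss."""
    r, c = divmod(action, BOARD_SIZE)
    shots[r, c] = 1
    return int(board[r, c] == 1)


def score(board, shots):
    """Terminal reward: the fraction of ship cells that have been hit."""
    return float((board * shots).sum() / board.sum())
```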
The Environment and the Policy
Our BattleshipEnv class wraps the game logic into a reinforcement learning environment:
Observation Encoding:
The board is encoded into a numerical format (e.g., fog of war, hit, or miss), which serves as the “state” the policy observes.
Policy Model:
A simple neural network (neural_battleship_policy) maps these observations to action probabilities - essentially deciding where to fire next.
Notice how our simulation’s policy mirrors an LLM’s decision-making process. Just as the network picks the next token based on context, our policy selects the next move based on the board’s state.
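As a hedged sketch of that idea (the post’s neural_battleship_policy may well be deeper), the board can be encoded with one number per cell and mapped to firing probabilities by a single linear layer plus softmax. The FOG/MISS/HIT values and the parameter shapes are assumptions, and the snippet reuses BOARD_SIZE and NumPy from the previous sketch.

```python
N_CELLS = BOARD_SIZE * BOARD_SIZE
FOG, MISS, HIT = 0.0, -1.0, 1.0  # assumed numeric encoding of the three cell states


def encode(board, shots):
    """Turn the hidden board plus shot history into the observation the policy sees."""
    obs = np.full(board.shape, FOG)
    obs[(shots == 1) & (board == 0)] = MISS
    obs[(shots == 1) & (board == 1)] = HIT
    return obs.ravel()


def policy(obs, W, b):
    """Linear layer + softmax: a probability of firing at each of the 25 cells."""
    logits = obs @ W + b
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Illustrative parameter initialisation:
rng = np.random.default_rng(0)
params = (rng.normal(scale=0.1, size=(N_CELLS, N_CELLS)), np.zeros(N_CELLS))
```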
Introducing GRPO: the group tactics
Here’s where GRPO takes center stage. Traditional PPO relies on a critic network to evaluate each step, but GRPO does things differently by harnessing group sampling.
Collecting groups:
The collect_group function (sketched after this list) simulates multiple game trajectories (battles) from a fixed starting point. For each trajectory in a group, we record:
Observations: The state at each step.
Actions and their probabilities: What move was chosen and how confident the model was.
Rewards: Only the terminal reward matters, emphasizing end-game success.
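Building on the helpers sketched earlier, a bare-bones collect_group could look as follows; the group size, the shot budget, and the trajectory tuple layout are assumptions rather than the post’s exact implementation.

```python
def collect_group(board, params, group_size=8, max_shots=10, rng=None):
    """Roll out several independent games from the same board, keeping only what GRPO needs."""
    if rng is None:
        rng = np.random.default_rng()
    group = []
    for _ in range(group_size):
        shots = np.zeros_like(board)
        observations, actions, probs = [], [], []
        for _ in range(max_shots):
            obs = encode(board, shots)
            p = policy(obs, *params)
            a = rng.choice(len(p), p=p)  # sample where to fire next
            observations.append(obs)
            actions.append(a)
            probs.append(p[a])
            play(board, shots, a)
        # only the terminal reward is kept, as described above
        group.append((observations, actions, probs, score(board, shots)))
    return group
```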
Group-normalized advantage:
Instead of estimating the advantage of each action via a critic, GRPO normalizes the terminal rewards within a group:
Advantage = (Reward – Mean Reward) / (Standard Deviation + ε)
This strategy, much like comparing the performance of different battalions in a campaign, allows the model to gauge which trajectories performed best relative to the others.
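In NumPy the normalization is just two lines. Here group is the output of the collect_group sketch above, with each trajectory’s terminal reward stored as the last element of its tuple.

```python
rewards = np.array([trajectory[-1] for trajectory in group])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # epsilon avoids division by zero
```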
GRPO objective function:
The grpo_objective function applies a clipped surrogate loss – a safeguard against overzealous updates. It balances:
Policy improvement: Ensuring actions that yield high advantages get reinforced.
KL penalty: Keeping the new policy close to a trusted reference (our baseline battleship strategy).
The overall objective is then minimized via gradient descent, guided by the custom AdamW optimizer. In effect, the policy is “trained in the trenches” to make smarter moves in subsequent battles.
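One possible shape for that objective is sketched below. It assumes the per-action probabilities under the new, old, and reference policies, together with each action’s trajectory advantage, have already been flattened into aligned arrays; the clip range eps_clip and KL weight beta are illustrative, and the KL term uses the simple per-action estimator ratio - log(ratio) - 1 with ratio = ref_probs / new_probs.

```python
def grpo_objective(new_probs, old_probs, ref_probs, advantages,
                   eps_clip=0.2, beta=0.04):
    """Negative clipped-surrogate objective with a KL penalty (to be minimised)."""
    ratio = new_probs / old_probs
    clipped = np.clip(ratio, 1.0 - eps_clip, 1.0 + eps_clip)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    # per-action KL estimate keeping the new policy near the reference policy
    ref_ratio = ref_probs / new_probs
    kl = ref_ratio - np.log(ref_ratio) - 1.0
    return -(surrogate - beta * kl).mean()  # gradient descent minimises this
```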
Training the policy: the final push
The training loop (train_grpo) simulates many mini-batches of battles:
Group sampling: For each mini-batch, groups of trajectories are collected.
Gradient steps: Multiple updates are performed per mini-batch, refining the policy to maximize terminal rewards. A compressed sketch of this loop follows below.
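Putting the pieces together, the outer loop could look roughly like this. grad_of is a hypothetical stand-in for computing the gradient of grpo_objective with respect to the policy parameters, and a plain gradient-descent update replaces the post’s custom AdamW optimizer for brevity.

```python
def train_grpo(params, grad_of, n_batches=100, steps_per_batch=4, lr=1e-2):
    """Sample a group of battles per mini-batch, then take a few policy-gradient steps."""
    for _ in range(n_batches):
        board = random_board()
        group = collect_group(board, params)
        rewards = np.array([trajectory[-1] for trajectory in group])
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        for _ in range(steps_per_batch):
            # grad_of (hypothetical) re-evaluates grpo_objective under the current
            # params using the stored observations, actions and old probabilities
            grads = grad_of(params, group, advantages)
            params = tuple(p - lr * g for p, g in zip(params, grads))
    return params
```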
Parallels to DeepSeek R1’s efficiency
So how does all this tie back to DeepSeek R1? Here are the key takeaways:
Efficiency through simplicity:
GRPO’s use of group-normalized advantages eliminates the need for a bulky critic network. DeepSeek R1 leverages this by training its LLM directly with RL, bypassing the supervised fine-tuning stage and saving substantial compute time.
Robust exploration:
By sampling multiple trajectories per prompt (or battle), GRPO inherently encourages exploration. This mirrors DeepSeek R1’s strategy of letting the model “self-evolve” its reasoning capabilities through massive parallel sampling.
Stability and performance:
The clipped surrogate loss in GRPO ensures that policy updates are controlled and stable – crucial for maintaining performance in the high-stakes, compute-intensive world of LLM training.
DeepSeek R1’s adoption of GRPO is a testament to how borrowing ideas from other fields (like battlefield tactics) can lead to revolutionary improvements in AI. By drawing on the strategic, group-based sampling approach, DeepSeek R1 not only streamlines RLHF but also paves the way for more efficient and robust LLMs.
Conclusion
The battle-tested GRPO algorithm is more than just a clever trick – it’s a paradigm shift in how we optimize policies for large language models. With DeepSeek R1 harnessing GRPO’s simplicity and efficiency, the future of RL in LLM training looks both promising and thrilling. Whether you’re a seasoned RL expert or just stepping into the arena, the elegance of GRPO’s group strategy is a powerful reminder: sometimes, winning the war is all about how you fight the individual battles.
Dive into the code, explore the tactics, and join the revolution with DeepSeek R1 – where every move counts. Curious about how to easily and efficiently build a RAG pipeline using DeepSeek as a generator? Then have a look at our RAGLite series of blogposts: introduction and guided walkthrough.
Ready to explore more on the cutting edge of LLM training? Stay tuned for our next deep dive into the latest innovations in reinforcement learning and AI development!