Proximal Policy Optimization
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm developed at OpenAI (Schulman et al., 2017) that has become the dominant method for the final fine-tuning stage of reinforcement learning from human feedback (RLHF). PPO belongs to the family of policy gradient methods: it directly optimizes the policy (the function mapping observations to actions) by gradient ascent on expected reward, while enforcing a proximity constraint that prevents any single update from changing the policy too drastically. This constraint, implemented as a clipped surrogate objective, stabilizes training in environments where large policy updates would send the system into low-reward regions from which recovery is difficult.

In the RLHF setting, PPO optimizes a language model's output distribution against a learned reward model, with an additional KL-divergence penalty that keeps the policy near its supervised fine-tuning (SFT) baseline. The proximity constraint and the KL penalty together define the boundaries within which the model is allowed to improve. Everything the model learns is bounded by those constraints, which makes them not merely technical parameters but normative choices about how much behavioral change is permitted per training step. How to set these bounds for safety-relevant applications remains an open empirical question.
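The two mechanisms described above can be sketched concretely. The clipped surrogate loss below follows the form given in Schulman et al. (2017); the KL-penalized reward follows the standard RLHF shaping. This is a minimal NumPy sketch, not a production implementation: the function names, the `clip_eps` and `kl_coef` parameters, and the choice to approximate the KL term by a per-sample log-probability difference are illustrative assumptions, not details taken from the text.

```python
import numpy as np

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (negated, so lower is better).

    ratio r = pi_new(a|s) / pi_old(a|s). The objective is
    min(r * A, clip(r, 1 - eps, 1 + eps) * A), averaged over samples;
    clipping removes the incentive to move the ratio beyond [1-eps, 1+eps].
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

def rlhf_shaped_reward(reward, logp_policy, logp_sft, kl_coef=0.1):
    """Reward-model score minus a KL penalty toward the SFT baseline.

    The penalty is approximated per sample as
    kl_coef * (log pi_policy - log pi_sft), a common single-sample
    estimator of the KL divergence (an assumption of this sketch).
    """
    return reward - kl_coef * (logp_policy - logp_sft)
```

With a positive advantage and a probability ratio of 2, the clipped branch caps the contribution at `1 + clip_eps` times the advantage, so further increases to the ratio earn no additional objective; this is the "proximity constraint" the text describes. Both `clip_eps` and `kl_coef` are exactly the normative knobs the paragraph identifies: they decide how far each update may move the policy.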