Proximal Policy Optimization (PPO) is a reinforcement learning algorithm
developed by OpenAI in 2017. PPO is designed to optimize the policy of a
reinforcement learning agent using a surrogate objective function that
limits how much the policy can change in each update.
PPO uses a neural network to represent the policy, and it can handle both
discrete and continuous action spaces. PPO is known for
its robustness, and it has been shown to outperform other state-of-the-art
reinforcement learning algorithms in a variety of domains.
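The surrogate objective mentioned above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the implementation from any particular library; the function and variable names are my own:

```python
import numpy as np

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective, negated so that lower is better."""
    ratio = np.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    # Clipping the ratio to [1 - eps, 1 + eps] caps how far a single
    # update can move the policy away from the old one.
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (minimum) of the two, then average the batch.
    return -np.mean(np.minimum(unclipped, clipped))
```

When the new and old policies agree, the ratio is 1 and the loss is just the negated mean advantage; when the new policy drifts too far, the clip term caps its contribution.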
There are several hyperparameters that can be tuned to get better results:
Learning rate (α) - determines how much the
policy parameters are updated in each iteration
Clip parameter (ε) - controls how much the policy is allowed to change
in each iteration. A lower clip parameter restricts updates more tightly,
which can lead to more stable learning, but it can also limit the ability
of the policy to explore new actions. A higher clip parameter permits
larger policy changes and more exploration, but it can also lead to
instability. (default 0.2)
GAE lambda (λ) - a parameter used to compute the Generalized Advantage
Estimate (GAE), which trades off bias and variance in the advantage
estimate (default 0.95)
Number of epochs per update - determines how many times the data is used
to update the policy (default 10)
Batch size - determines how many samples are used to compute each update
(32, 64 or higher)
Value function coefficient - weights the value-function loss term in the
total loss (default 0.5)
Entropy coefficient - weights an entropy bonus that encourages
exploration (default 0)
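To make the GAE lambda parameter concrete, here is a minimal sketch of how GAE is typically computed over a single episode. The function name, argument shapes, and defaults are my own illustrative choices:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one episode.

    rewards: shape (T,)   -- reward received at each step
    values:  shape (T+1,) -- value estimates, including a bootstrap
             value for the state reached after the final step
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Accumulate discounted TD errors backwards through the episode.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With lam=0 this reduces to the one-step TD error (low variance, high bias); with lam=1 it becomes the full discounted return minus the value baseline (high variance, low bias).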
We can use the average reward, policy loss and value loss as metrics to
evaluate the performance of a PPO model.
Average reward measures the average reward per episode over a certain
number of episodes.
Increasing average reward is a sign that the model is getting better at the
task (better performance). A good range for average reward is
task-dependent, and can vary greatly depending on the complexity of the
task and the design of the reward function.
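In practice, average reward is usually tracked as a running mean over a fixed window of recent episodes. A minimal sketch (the class name and default window size are my own choices):

```python
from collections import deque

class RewardTracker:
    """Running mean of episode returns over the last `window` episodes."""

    def __init__(self, window=100):
        # deque with maxlen automatically discards the oldest episode
        # once the window is full.
        self.returns = deque(maxlen=window)

    def add(self, episode_return):
        self.returns.append(episode_return)

    def average(self):
        return sum(self.returns) / len(self.returns)
```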
Tips for average reward
Expected average reward is affected by various hyperparameters as well as
the reward function.
Here are some common issues with average reward and tips on how to fix them:
1. Average reward too low
Learning rate (alpha) might be too low. Increase alpha to make the model
learn faster.
Discount factor (gamma) might be too low. Increase gamma to make the
model account for more future reward.
The model might be stuck in a local minimum. Try changing the
hyperparameters or reward function to get the model out of the local
minimum.
2. Average reward unstable and fluctuates widely
Learning rate (alpha) might be too high. Decrease alpha to make the
model learn slower.
Discount factor (gamma) might be too high. Decrease gamma to make the
model account for less future reward.
Clip parameter might be too high. Decrease the clip parameter to prevent
the policy from changing too much at once.
Policy loss measures the difference between the old policy and the new
policy after an update.