AI Simulator

2023-03-29 01:40:45 +0000 - paradite

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm developed by OpenAI in 2017. PPO is designed to optimize the policy function of a reinforcement learning agent, using a surrogate objective function that places a limit on how much the policy can change in each iteration.

PPO uses a neural network to represent the policy function, and it can be used to learn both discrete and continuous action spaces. PPO is known for its robustness, and it has been shown to outperform other state-of-the-art reinforcement learning algorithms in a variety of domains.
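At the heart of PPO is the clipped surrogate objective mentioned above. A minimal NumPy sketch of that objective (the function name and arguments are illustrative, not from any particular library):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper.

    ratio:     pi_new(a|s) / pi_old(a|s) for each sampled action
    advantage: advantage estimate for each sampled action
    clip_eps:  how far the probability ratio may move per update
    """
    unclipped = ratio * advantage
    # Clipping the ratio limits how much a single update can change the policy.
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Elementwise minimum is pessimistic, then average over the batch.
    return np.minimum(unclipped, clipped).mean()
```

The `min` makes the objective a lower bound: the policy gets no extra credit for moving the probability ratio beyond the clip range, which is what keeps updates small.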


There are several hyperparameters that can be tuned to get better results with PPO.
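For reference, here are the hyperparameters most commonly tuned, with typical starting values (names follow common conventions such as Stable-Baselines3; the values are illustrative defaults, not recommendations for any specific task):

```python
# Common PPO hyperparameters and typical starting values (illustrative).
ppo_config = {
    "learning_rate": 3e-4,  # optimizer step size
    "n_steps": 2048,        # rollout length per environment before an update
    "batch_size": 64,       # minibatch size for each gradient step
    "n_epochs": 10,         # passes over the rollout data per update
    "gamma": 0.99,          # discount factor for future rewards
    "gae_lambda": 0.95,     # bias/variance trade-off in advantage estimation
    "clip_range": 0.2,      # epsilon in the clipped surrogate objective
    "ent_coef": 0.0,        # entropy bonus to encourage exploration
    "vf_coef": 0.5,         # weight of the value loss in the total loss
}
```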

Performance metrics

We can use the average reward, policy loss and value loss as metrics to evaluate the performance of a PPO model.

Average reward

Average reward measures the average reward per episode over a certain number of episodes.

Increasing average reward is a sign that the model is getting better at the task. A good range for average reward is task-dependent, and can vary greatly depending on the complexity of the task and the scale of the reward function.
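Computing this metric is straightforward; a minimal sketch, assuming `episode_rewards` is a list of per-episode total rewards collected during training:

```python
def average_reward(episode_rewards, window=100):
    """Mean total reward over the last `window` episodes."""
    recent = episode_rewards[-window:]
    return sum(recent) / len(recent)
```

Using a sliding window (here the last 100 episodes) smooths out per-episode noise, which is also how most training dashboards plot reward curves.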

Tips for average reward

Expected average reward is affected by various hyperparameters as well as the reward function.

Here are some common issues with average reward and tips on how to address them:

1. Average reward too low: check that the reward function actually rewards the desired behavior, train for more frames, and try tuning the learning rate or raising the entropy coefficient to encourage exploration.

2. Average reward unstable and fluctuates widely: try lowering the learning rate, increasing the batch size or rollout length, or reducing the clip range so each update changes the policy less.

Policy loss

Policy loss is the (negated) clipped surrogate objective: it reflects how much the probabilities of the sampled actions have shifted between the old and new policy, weighted by the advantage estimates.
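In practice the policy loss is computed from log-probabilities rather than raw probabilities, for numerical stability. A minimal sketch (names are illustrative):

```python
import numpy as np

def policy_loss(new_logp, old_logp, advantage, clip_eps=0.2):
    """PPO policy loss: the negative clipped surrogate objective.

    new_logp / old_logp: log-probabilities of the sampled actions
    under the new and old policy, respectively.
    """
    # Probability ratio recovered from the log-prob difference.
    ratio = np.exp(new_logp - old_logp)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Negated because optimizers minimize, while we maximize the objective.
    return -np.mean(np.minimum(ratio * advantage, clipped * advantage))
```

Right after an update, when the new policy still equals the old one, the ratio is 1 and the loss is simply the negative mean advantage.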

Value loss

Value loss measures the error between the value function's predicted returns and the returns actually observed during rollouts, typically as a mean-squared error.
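As a mean-squared error, the value loss can be sketched in a few lines (some implementations also clip this loss, which is omitted here for simplicity):

```python
import numpy as np

def value_loss(predicted_values, returns):
    """Mean-squared error between predicted state values and observed returns."""
    return np.mean((predicted_values - returns) ** 2)
```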


PPO can be trained to play many single-player games with either discrete or continuous actions. Some examples include Tetris, Snake, 2048 and Block Puzzle.

This is a screenshot of TensorBoard for training PPO to play AI Simulator: Block Puzzle over 3M frames:

Further reading

PPO paper

Detailed explanation of PPO

Interactive demos