CS885 Module 1: Trust region & proximal policy optimization

Name: CS885 Module 1: Trust region & proximal policy optimization
Uploaded: 2020-06-01T05:47:43.000Z
Duration: 44 min 18 s
Description: The slides associated with this video are accessible on the course web: https://cs.uwaterloo.ca/~ppoupart/teaching/cs885-spring20/schedule.html This video is part of a series of video lectures for CS885 offered by Pascal Poupart at the University of Waterloo in 2018 and 2020. Make sure to watch the previous video on trust region methods: https://youtu.be/qaOKZkeutqE

Trust Region and Proximal Policy Optimization Techniques

Introduction to Trust Region Methods

Pascal Poupart introduces the module on trust region and proximal policy optimization techniques in reinforcement learning, and surveys the gradient-based optimization methods they build on.

These optimization methods fall into two groups: those that compute a gradient to pick a direction and then choose a step length, and those that use trust regions. The module focuses on Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO).

Importance of Learning Rate

In policy gradient techniques, updates are made based on gradients, with alpha representing the learning rate that determines step length. A small alpha leads to slow but reliable convergence.

A high alpha may result in faster convergence but can be unreliable. An illustration using the A2C algorithm shows varying convergence rates with different alphas.
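The effect of the step size can be reproduced on a toy problem. The sketch below is not the A2C experiment from the lecture; it assumes a hypothetical one-dimensional objective f(theta) = -(theta - 3)^2 and plain gradient ascent, purely to show how alpha controls reliability and speed.

```python
def gradient_ascent(alpha, steps=50, theta=0.0):
    """Maximize f(theta) = -(theta - 3)^2 with a fixed learning rate alpha."""
    for _ in range(steps):
        grad = -2.0 * (theta - 3.0)    # derivative of f at theta
        theta = theta + alpha * grad   # alpha sets the step length
    return theta

small = gradient_ascent(alpha=0.01)  # reliable but still far from 3 after 50 steps
good = gradient_ascent(alpha=0.3)    # reaches the maximizer quickly
large = gradient_ascent(alpha=1.1)   # overshoots further on every step and diverges
```

On this objective the error is multiplied by (1 - 2 * alpha) at every step, so any alpha above 1 makes the iterates move away from the maximizer.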

Challenges in Choosing Learning Rate

Selecting an appropriate alpha is challenging; thus, exploring techniques that avoid this choice becomes essential.

Trust Region Method Overview

The trust region method approximates complex objectives with simpler surrogate objectives (e.g., quadratic functions), ensuring efficient optimization while limiting divergence from the original objective.

By restricting search to a trusted region where the surrogate accurately represents the original function, improvements estimated from the surrogate are likely valid.
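A minimal one-dimensional sketch of this idea follows. The objective f, the acceptance threshold of 0.5, and the rule of doubling or halving the radius are all assumptions made for illustration; they are not taken from the lecture.

```python
import math

def f(x):
    # Hypothetical objective, chosen only for illustration.
    return math.sin(x) + 0.1 * x

def surrogate_step(x, radius):
    """Maximize the quadratic (second-order Taylor) surrogate of f around x,
    restricted to steps with |step| <= radius."""
    g = math.cos(x) + 0.1   # f'(x)
    h = -math.sin(x)        # f''(x)
    if h < 0:
        # Concave surrogate: clip its unconstrained maximizer to the region.
        step = max(-radius, min(radius, -g / h))
    else:
        # No interior maximum: move uphill to the boundary of the region.
        step = radius if g > 0 else -radius
    predicted = g * step + 0.5 * h * step * step
    return step, predicted

def trust_region_maximize(x, radius=1.0, iters=30):
    for _ in range(iters):
        step, predicted = surrogate_step(x, radius)
        actual = f(x + step) - f(x)
        if predicted > 0 and actual >= 0.5 * predicted:
            x, radius = x + step, 2.0 * radius   # surrogate was accurate: accept, grow
        else:
            radius = 0.5 * radius                # surrogate was misleading: shrink
    return x

x_star = trust_region_maximize(0.0)
```

Starting from 0, the iterates settle at the local maximum where cos(x) = -0.1. A step is accepted only when the improvement predicted by the surrogate is largely realized by the true objective.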

Application of Trust Regions in Policy Optimization

When applying trust regions to policy optimization, the region can be defined either over the parameters θ or over the policies π. The ultimate goal is typically to maximize the value function, which reflects expected rewards.

Small changes in stochastic policies lead to gradual changes in value functions, promoting smoother transitions compared to working directly with parameters θ.

Distance Measures for Stochastic Policies

To define a trust region over stochastic policies, a distance measure between distributions is needed, and the KL divergence is proposed. It measures how much two distributions differ as the expected logarithm of the ratio of their probabilities.
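For discrete action distributions the KL divergence is a short sum. The sketch below uses made-up probabilities for a single state; the policy names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_a p(a) * log(p(a) / q(a)) for discrete distributions."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

old_policy = [0.5, 0.3, 0.2]      # action probabilities in one state
near_policy = [0.45, 0.35, 0.2]   # small change
far_policy = [0.05, 0.05, 0.9]    # large change

kl_same = kl_divergence(old_policy, old_policy)
kl_near = kl_divergence(old_policy, near_policy)
kl_far = kl_divergence(old_policy, far_policy)
```

The divergence is zero for identical distributions and grows as they move apart. It is not symmetric, so the order of the two arguments matters.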

Trust Region Policy Optimization Algorithm

With the KL divergence as the distance measure, one can define an iterative algorithm known as Trust Region Policy Optimization (TRPO). It starts from an initial policy πθ and, at each step, solves a constrained optimization problem to find an improved policy πθ tilde.

TRPO aims to maximize improvements in value functions subject to constraints defined by KL divergence thresholds across states.
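Written out, one TRPO step could be stated as follows. The notation is an assumption: s_0 is the start state, δ is the KL threshold, and the order of the two policies inside the KL divergence is one common convention that the summary does not confirm.

```latex
\begin{aligned}
\max_{\tilde\theta}\quad & V^{\pi_{\tilde\theta}}(s_0) - V^{\pi_\theta}(s_0) \\
\text{subject to}\quad & \max_{s}\; D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_{\tilde\theta}(\cdot \mid s)\big) \le \delta
\end{aligned}
```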

Understanding TRPO: Trust Region Policy Optimization

Deriving the Surrogate Function

The surrogate function is introduced to make the optimization tractable, together with a relaxation of the KL divergence constraint that is easier to compute.

The original theoretical constraint requires that the KL divergence between current and new policies remains below a delta for all states, which is complex to optimize simultaneously.

TRPO simplifies this by considering an expectation over states, allowing some states to exceed the KL divergence limit as long as others compensate.
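The difference between the two constraints can be seen on a small example. All numbers below (three states, two actions, visitation weights, and the threshold delta) are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

state_weights = [0.6, 0.3, 0.1]                  # how often each state is visited
old = [[0.5, 0.5], [0.7, 0.3], [0.2, 0.8]]       # old policy, one row per state
new = [[0.5, 0.5], [0.68, 0.32], [0.5, 0.5]]     # new policy, one row per state
delta = 0.05

per_state = [kl(o, n) for o, n in zip(old, new)]
max_kl = max(per_state)                           # constraint over all states
expected_kl = sum(w * k for w, k in zip(state_weights, per_state))  # relaxed version
```

Here the rarely visited third state changes a lot, so the per-state constraint is violated, while the expected KL divergence stays under delta because the frequently visited states barely change.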

The objective cannot be optimized directly because it is too expensive to compute, so it must be approximated; this approximation is the surrogate function.

States are sampled from the stationary distribution of the current policy and actions from the current policy itself; the value difference between the new and current policies is then estimated from these state-action pairs.

Steps in Derivation

The difference in value can be expressed as an expectation of the ratio between the action probabilities of the new and existing policies, multiplied by the advantage.

By explicitly writing expectations with sums, terms cancel out, simplifying calculations significantly.
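The cancellation can be checked numerically for a single state. The probabilities and advantages below are made up; the point is only that weighting by the ratio under the old policy gives the same result as averaging under the new policy.

```python
old_pi = [0.5, 0.3, 0.2]        # existing policy in one state
new_pi = [0.2, 0.3, 0.5]        # new policy in the same state
advantage = [1.0, -0.5, 2.0]    # advantage of each action

# Expectation under the old policy of (new / old) * advantage.
weighted = sum(po * (pn / po) * a for po, pn, a in zip(old_pi, new_pi, advantage))

# Expectation under the new policy of the advantage: old(a) has cancelled out.
direct = sum(pn * a for pn, a in zip(new_pi, advantage))
```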

The stationary distribution of the existing policy is then replaced by that of the new policy. This step is only an approximation: in practice, states can be sampled only under the existing policy.

The expectation over the stationary distribution is rewritten as an expectation over the states encountered at each time step until the planning horizon ends, with a discount factor applied to each advantage term.

The advantage function is defined as the difference between Q-function and value function; further expansions lead to simplified equations.

Finalizing Value Function Expression

Terms related to value functions cancel pairwise, except for the one associated with the initial state s_0.

Because the trajectory is sampled under the new policy, the expected discounted sum of rewards is, by definition, the value function of the new policy.

What remains after simplification is the improvement in value of the new policy over the existing one at s_0, with no expectation left to evaluate.
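The steps above can be sketched as one chain of equalities. The notation is assumed rather than copied from the slides: μ denotes a discounted state visitation distribution (unnormalized, so that sums over states match discounted sums over time), A is the advantage, r the reward, γ the discount factor, and the horizon is taken to be infinite with γ < 1.

```latex
\begin{aligned}
& \mathbb{E}_{s \sim \mu_\theta,\; a \sim \pi_\theta}
  \left[ \frac{\pi_{\tilde\theta}(a \mid s)}{\pi_\theta(a \mid s)}\, A_\theta(s,a) \right] \\
&= \sum_s \mu_\theta(s) \sum_a \pi_\theta(a \mid s)\,
   \frac{\pi_{\tilde\theta}(a \mid s)}{\pi_\theta(a \mid s)}\, A_\theta(s,a) \\
&= \sum_s \mu_\theta(s) \sum_a \pi_{\tilde\theta}(a \mid s)\, A_\theta(s,a) \\
&\approx \sum_s \mu_{\tilde\theta}(s) \sum_a \pi_{\tilde\theta}(a \mid s)\, A_\theta(s,a) \\
&= \mathbb{E}_{s_0, a_0, s_1, \ldots \sim \pi_{\tilde\theta}}
   \left[ \sum_{t=0}^{\infty} \gamma^t A_\theta(s_t, a_t) \right] \\
&= \mathbb{E}_{s_0, a_0, s_1, \ldots \sim \pi_{\tilde\theta}}
   \left[ \sum_{t=0}^{\infty} \gamma^t \big( r(s_t,a_t) + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t) \big) \right] \\
&= \mathbb{E}_{s_0, a_0, s_1, \ldots \sim \pi_{\tilde\theta}}
   \left[ -V^{\pi_\theta}(s_0) + \sum_{t=0}^{\infty} \gamma^t r(s_t,a_t) \right] \\
&= V^{\pi_{\tilde\theta}}(s_0) - V^{\pi_\theta}(s_0)
\end{aligned}
```

The only inexact step is the fourth line, where the state distribution is swapped; every other line is an identity under the stated assumptions.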

Pseudocode Overview for TRPO

TRPO operates as an actor-critic technique. Its pseudocode resembles that of other actor-critic methods, except that the actor update solves a constrained optimization problem.

Another key difference is that TRPO gathers many state-action pairs into a mini-batch before updating the actor, so that the surrogate is evaluated reliably.

Understanding TRPO and PPO Algorithms

Overview of TRPO (Trust Region Policy Optimization)

TRPO ensures that the surrogate function closely approximates the original function, leading to more reliable monotonic improvements in reward.

The optimization process in practice is often approximated using linear functions for objectives and quadratic functions for constraints, allowing faster solutions.
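With a linear objective and a quadratic constraint, the subproblem has a closed-form solution. The sketch below assumes the standard form: maximize g^T x subject to 0.5 * x^T F x <= delta, where g would be the gradient of the surrogate and F a symmetric positive definite matrix from the quadratic approximation of the KL constraint. The two-parameter values of g, F, and delta are hypothetical.

```python
import math

def solve_2x2(F, g):
    """Return F^{-1} g for a 2x2 matrix F using the explicit inverse."""
    (a, b), (c, d) = F
    det = a * d - b * c
    return [(d * g[0] - b * g[1]) / det, (-c * g[0] + a * g[1]) / det]

def trust_region_step(g, F, delta):
    """Maximize g^T x subject to 0.5 * x^T F x <= delta.
    The maximizer is sqrt(2 * delta / (g^T F^{-1} g)) * F^{-1} g."""
    direction = solve_2x2(F, g)                        # F^{-1} g
    g_Finv_g = g[0] * direction[0] + g[1] * direction[1]
    scale = math.sqrt(2.0 * delta / g_Finv_g)
    return [scale * d for d in direction]

g = [1.0, 0.5]                  # hypothetical gradient of the surrogate
F = [[2.0, 0.5], [0.5, 1.0]]    # hypothetical curvature of the KL constraint
delta = 0.01
step = trust_region_step(g, F, delta)
```

The resulting step lies exactly on the boundary of the quadratic constraint, which is where a linear objective is always maximized.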

Implementing TRPO is complex because it is a constrained optimization problem, which complicates the use of standard stochastic gradient descent methods.

Constraints in TRPO

The main constraint involves a bound on KL divergence, which limits how much the new policy can differ from the old one based on their probability ratios.

Because the KL divergence is an expectation of log probability ratios, bounding it indirectly keeps those ratios from becoming large.

Simplifying Objectives with Clipping

A simpler objective can be designed that directly constrains the ratio of probabilities between new and old policies using a clipping mechanism.

Clipping keeps this ratio within 1 - ε and 1 + ε, ensuring policies remain close while allowing for necessary adjustments.
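A per-sample version of the clipped objective can be sketched as below. Taking the minimum of the clipped and unclipped terms follows the commonly published PPO formulation and is an assumption here, since the summary only mentions the clipping itself; eps = 0.2 is likewise just an example value.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Clipped objective for one state-action pair.
    ratio = new_policy(a|s) / old_policy(a|s)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # The minimum removes any gain from pushing the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)

gain_capped = clipped_surrogate(1.5, 2.0)     # good action: gain capped at 1.2 * 2.0
loss_capped = clipped_surrogate(0.5, -1.0)    # bad action: capped at 0.8 * -1.0
loss_kept = clipped_surrogate(1.5, -1.0)      # bad action made more likely: full penalty
unchanged = clipped_surrogate(1.0, 3.0)       # identical policies: plain advantage
```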

Transition to Proximal Policy Optimization (PPO)

PPO replaces constrained optimization with an unconstrained approach using the simplified clipped objective, facilitating easier implementation via automatic differentiation tools like PyTorch or TensorFlow.
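To see why no explicit constraint is needed, the sketch below runs plain gradient ascent on the clipped objective for a hypothetical two-action policy with a single parameter. A finite-difference gradient stands in for automatic differentiation so that the example has no dependencies; the advantages, step size, and eps are made-up values.

```python
import math

def policy(theta):
    """Two-action policy: probability of action 0 is sigmoid(theta)."""
    p0 = 1.0 / (1.0 + math.exp(-theta))
    return [p0, 1.0 - p0]

OLD = policy(0.0)      # old policy: [0.5, 0.5]
ADV = [1.0, -1.0]      # hypothetical advantages: action 0 is better

def objective(theta, eps=None):
    """Expected surrogate under the old policy; clipped when eps is given."""
    total = 0.0
    for p_old, p_new, adv in zip(OLD, policy(theta), ADV):
        ratio = p_new / p_old
        term = ratio * adv
        if eps is not None:
            clipped = max(1.0 - eps, min(1.0 + eps, ratio))
            term = min(term, clipped * adv)
        total += p_old * term
    return total

def ascend(eps, steps=200, lr=0.1, h=1e-5):
    theta = 0.0
    for _ in range(steps):
        grad = (objective(theta + h, eps) - objective(theta - h, eps)) / (2 * h)
        theta += lr * grad     # unconstrained gradient ascent
    return policy(theta)[0]

p_clipped = ascend(eps=0.2)     # stalls just past 0.5 * (1 + eps) = 0.6
p_unclipped = ascend(eps=None)  # keeps drifting toward 1
```

Once the ratio leaves the clipping range the gradient of the clipped objective is zero, so the update stops by itself near the edge of the range, whereas the unclipped surrogate keeps moving the policy away from the old one.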

Both TRPO and PPO have become important baselines in reinforcement learning packages due to their effectiveness across various tasks.

Performance Evaluation

Evaluations show PPO performing well across robotics tasks, often outperforming other algorithms like TRPO and A2C in terms of reward over training steps.