Reinforcement Learning (QLS-RL) Lecture 12
Introduction
The lecturer introduces the topic of reinforcement learning and explains that they will be moving towards a more inductive, data-driven interpretation of reinforcement learning.
Recap of Previous Lecture
- The previous lecture covered the Bayesian setting for reinforcement learning.
- The focus was on methods that rely on data to perform prediction and control.
Partially Observable Markov Decision Processes
- Recap of where we stand in terms of our understanding of reinforcement learning.
- We are sitting in the upper right corner, having discussed Markov decision processes and Bellman's equation.
- Our level of knowledge is determined by features such as access to the model underlying our system, dynamics, structure of rewards, and observability.
- Partially observable Markov decision processes contain MDPs as a special case: when observability is perfect, a POMDP reduces to an MDP.
- In general, when observability is partial, planning methods must be combined with inference to predict future outcomes.
- This calls into play the notion of Bayesian updating.
Further Study
- There are exercises and applications related to partially observable Markov decision processes that can be explored further.
- Interested students can contact the lecturer for references or consider it as a small project for their exam.
Moving Towards Full Reinforcement Learning Problem
The lecturer explains how they will move horizontally towards the upper left part of the diagram to address difficulties arising from lack of knowledge about the model.
Planning vs Learning
- When moving horizontally towards the full reinforcement learning problem, we give up the notion that we know everything about the model and can therefore plan ahead.
- Instead, we replace this prior information with posterior information obtained through interaction with data.
- This leads to a focus on learning rather than planning.
- The decision maker must adopt the mindset of a chess player, trying to plan ahead based on observations yet to come.
Learning Reinforcement Learning
- The key concept when moving along this direction is that we start talking about learning.
- However, this does not mean that the notion of a model is dropped completely.
- There is no clear-cut frontier between all these methods and techniques.
- It's important to rationalize all these approaches using simplifications in order to have some clear landmarks in this rich landscape.
Next Steps
- Over the next six lectures, the lecturer will slowly build up the concepts and techniques that lead towards the upper left corner of the diagram, where the full reinforcement learning problem sits.
Model-Free Learning and Control
In this section, the speaker discusses the concept of model-free learning and control in reinforcement learning. They explain that this approach avoids making assumptions about transition probabilities and rewards in a system.
Understanding Model-Free Learning
- Model-free learning avoids assuming knowledge of transition probabilities and rewards in a system.
- In model-free learning, there is still some structural information about the state space and actions available.
- In grid world, model-free learning means having no prior knowledge of the map or reward locations. The only way to discover this is through trial and error.
Trial and Error
- Interacting with the environment through trial and error is necessary for discovering hidden information in model-free learning.
- The process involves attempting different actions, collecting information, and optimizing behavior as we learn.
Key Aspects: Predicting and Controlling
- The two key aspects of model-free learning are predicting what will happen in the future (learning to predict), followed by optimizing behavior based on that prediction (control).
- In an MDP, solving Bellman's equation simultaneously predicts what will happen in the future while optimizing behavior.
Model-Free Bandits
In this section, the speaker discusses what it means to be model-free with bandits and how it differs from other approaches.
Being Model-Free in Bandits
- In bandits, being model-free means that some information is known about the system, but the probability distributions of states given an action are totally unknown.
- The level of knowledge in a model-free approach is significantly less than in other approaches.
- There may also be more complex model-free problems where even basic information about the system is not provided.
Collecting Information for a Model
- To move from a situation where some information is known to one where nothing is known, data must be collected about the system.
- The flow of this process would be to collect data, build a model, and then solve Bellman's equation using this empirical model.
- In practice, data would be collected by interacting with the environment and keeping track of what has happened.
Constructing an Empirical Model
In this section, the speaker explains how to construct an empirical model by collecting data from a system with some policy. The empirical model is constructed using the collected data and can be used to calculate transition probabilities and rewards.
Collecting Data
- To construct an empirical model, we start by collecting data from a system with some policy.
- We construct n trajectories, where each trajectory is a sequence of state-action-new state observations.
- We collect data without trying to optimize or change the policy.
Constructing Empirical Probabilities
- Transition probabilities can be calculated using conditional expectations.
- The numerator of the conditional expectation is a function that counts the number of times a triplet (state, action, new state) occurs in one specific trajectory.
- The denominator of the conditional expectation is the total count of times the pair (s, a) was visited, i.e., how often we started from state s and took action a.
- Empirical probabilities are constructed by replacing expected values with empirical means.
Constructing Empirical Rewards
- Empirical rewards for triplets (state, action, new state) can also be constructed using collected data.
- The numerator of the reward calculation is the sum of the rewards experienced at time t+1 whenever the triplet (s, a, s') occurred.
- The denominator is again the count of how many times that triplet was visited.
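The counting procedure above can be sketched in a few lines of Python. This is a minimal tabular sketch; the tuple layout (s, a, r, s') for each logged step is an assumption, since the lecture only specifies state-action-new-state observations plus rewards.

```python
from collections import defaultdict

def empirical_model(trajectories):
    """Estimate transition probabilities and rewards from logged data.

    Each trajectory is assumed to be a list of (s, a, r, s2) tuples,
    where r is the reward observed at t+1 after taking action a in
    state s and landing in s2.
    """
    n_sas = defaultdict(int)      # counts of (s, a, s') triplets
    n_sa = defaultdict(int)       # counts of (s, a) pairs
    r_sum = defaultdict(float)    # summed rewards per (s, a, s')

    for traj in trajectories:
        for s, a, r, s2 in traj:
            n_sas[(s, a, s2)] += 1
            n_sa[(s, a)] += 1
            r_sum[(s, a, s2)] += r

    # Empirical means replace expectations:
    p_hat = {k: n / n_sa[(k[0], k[1])] for k, n in n_sas.items()}
    r_hat = {k: r_sum[k] / n_sas[k] for k in n_sas}
    return p_hat, r_hat
```

The resulting dictionaries p_hat and r_hat can then be plugged into the Bellman equation as if they were the true model.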
Building a Model and Solving the Bellman Equation
In this section, the speaker draws an analogy with how models are used to solve problems in physics and engineering. They also introduce the bandit problem and explain how to construct a model of empirical rewards.
Using Models to Solve Problems
- Collect data with an experiment
- Build up a model according to that experiment
- Use this model in order to infer how to improve performance
Constructing a Model for Empirical Rewards
- Simplified question: suppose the state never changes (the bandit setting), but the rewards are unknown.
- Construct a model of empirical rewards by summing the rewards collected for each action and dividing by the number of times that action was taken.
- Solve the Bellman equation with the empirical averages obtained so far.
- The best action is the arg max of the empirical averages over the options.
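The steps above amount to the explore-then-commit algorithm discussed next. Here is a minimal sketch, assuming each arm is represented by a reward-sampling callable (a hypothetical interface; the lecture does not fix one):

```python
import random

def explore_then_commit(arms, n_explore, n_total, rng=random.Random(0)):
    """Explore-then-commit for a bandit: pull each arm n_explore times,
    then commit the remaining budget to the arm with the best
    empirical mean."""
    means = []
    for pull in arms:
        rewards = [pull(rng) for _ in range(n_explore)]
        means.append(sum(rewards) / n_explore)  # empirical average per arm
    best = max(range(len(arms)), key=lambda i: means[i])  # arg max
    # Commit: spend the rest of the budget on the empirically best arm.
    remaining = n_total - n_explore * len(arms)
    total = sum(arms[best](rng) for _ in range(remaining))
    return best, means
```

With noisy arms and a small n_explore, `best` can easily be the wrong arm, which is exactly the reliability issue raised in the next section.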
Issues with Using Empirical Averages Alone
- The explore-then-commit algorithm may be unreliable due to bad luck or small sample sizes.
Optimization based on Empirical Knowledge
In this section, the speaker discusses the pitfalls of using a naive approach to optimization based on empirical knowledge. He explains that it is risky and doomed to fail even in simple situations.
The Pitfalls of Naive Approach
- If two arms' parameters are very close, many attempts are needed to tell them apart.
- This pitfall signals that something better is required here.
- More exploration must be accounted for when comparing one parameter against another.
- Sufficient statistics must be gathered before moving on to solving the Bellman equation.
Non-linear Equation
- The problem becomes particularly conspicuous when optimizing in the presence of a finite sample.
- With a finite sample, the max is taken before the averages have converged.
- Taking the maximum, a strongly non-linear operator, of an empirical average is possible but tricky.
Solving Bellman's Equation Without Knowing Tomorrow
- There is another way of doing things that bypasses this problem.
- Methods can find provably convergent solutions without knowing what the model is.
- It's possible and efficient with temporal difference methods.
Tackling Two Problems at Once
- Solving Bellman's equation is trying to tackle two problems at once: prediction and control.
Learning the Value Function of a Policy without a Model
In this section, the speaker discusses how to learn the value function of a policy without having a model at hand. The focus is on prediction and not optimization.
Monte Carlo Method
- Use Monte Carlo method to predict the value of a function.
- The value of a policy is just the expected value of the sum of discounted rewards conditioned on the initial state.
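The Monte Carlo idea above can be sketched as follows. The `sample_episode` oracle is a hypothetical interface, in the spirit of the thought experiment later in the lecture: it rolls out the policy from an initial state and returns the rewards it observed.

```python
def mc_value_estimate(sample_episode, s0, n_episodes, gamma=0.9):
    """Monte Carlo estimate of V^pi(s0): average the discounted
    return over independently simulated episodes."""
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode(s0)  # [r_1, r_2, ...] under the policy
        # Discounted return of one episode.
        g = sum(gamma ** t * r for t, r in enumerate(rewards))
        total += g
    return total / n_episodes  # empirical mean replaces the expectation
```

Each episode contributes one unbiased sample of the return, which is why the estimator is unbiased but may need many episodes to be precise.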
Learning Value Function Dynamically Online
- Continuously update knowledge while computing and increasingly do more of what you think is best but still keep some room for exploration.
- Do this dynamically online.
Precision Depends on Gap Between Actual Averages
- The number of experiences or episodes, capital N, needed to be sure the optimization is correct depends on the gap between the actual averages.
- If two coins have success probabilities 0.49 and 0.51, thousands of trials are needed to tell them apart; if they are split by only one part in a thousand, roughly a million trials are needed, because precision is inversely proportional to the square root of N.
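The scaling argument can be written out as follows (a standard concentration heuristic; the symbols are assumptions, as the lecture does not write the formula):

```latex
\hat{\mu}_N - \mu = O\!\left(\frac{1}{\sqrt{N}}\right)
\quad\Longrightarrow\quad
\frac{1}{\sqrt{N}} \lesssim \Delta
\quad\text{for a gap } \Delta = |\mu_1 - \mu_2|,
\quad\text{i.e.}\quad N \gtrsim \frac{1}{\Delta^2}.
```

For example, a gap of 0.02 requires on the order of 1/0.02² ≈ 2,500 trials, while a gap of 0.001 requires about a million.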
Conclusion
The speaker discussed how to learn the value function of a policy without having a model at hand. They suggested using Monte Carlo methods for prediction and continuously updating knowledge while computing. Additionally, they noted that the precision of learning depends on the gap between actual averages.
Monte Carlo Method
In this section, the speaker discusses the Monte Carlo method and its advantages.
Advantages of Monte Carlo Method
- The Monte Carlo method is unbiased, meaning that the expected value of the estimator is equal to the true expected value.
- It is simple and does not require complex formulas.
Issues with Monte Carlo Method
- The method can be computationally expensive, especially for problems with a large number of states or a long horizon.
- It requires a large number of trials to achieve good precision.
Thought Experiment on Grid World
In this section, the speaker presents a thought experiment on how one would use the Monte Carlo method in practice.
Thought Experiment
- The experiment involves starting from a state and producing a simulation by interrogating an oracle function that gives states and rewards.
- The problem becomes computationally expensive when dealing with problems that have many states or long horizons.
Calculation of Empirical Average in Monte Carlo
In this section, the speaker clarifies how empirical averages are calculated in Monte Carlo.
Calculation of Empirical Average
- Empirical averages are calculated using all previous observations up to that point in time.
- This process assumes that we are still dealing with Markov decision processes (MDPs).
Monte Carlo Methods
In this section, the speaker explains what Monte Carlo methods are and how they work. They discuss the advantages and disadvantages of using Monte Carlo methods and answer questions from the audience about their use.
What are Monte Carlo Methods?
- Monte Carlo methods involve simulating a system to produce trajectories with a generic model.
- This is done by calling a function that takes in an initial state and action, and produces the next state and reward.
- Monte Carlo methods do not care if the system is Markovian or not, but they are less efficient because they do not leverage knowledge of the system's structure.
Advantages and Disadvantages of Monte Carlo Methods
- Monte Carlo methods are unbiased, but can be computationally inefficient in terms of realizations.
- Using Monte Carlo methods can be expensive and does not fully leverage the structure of Markov problems.
Can Sampling Help Overcome These Problems?
- Importance sampling is one way to improve on these issues: data are produced with another distribution, and the difference between distributions is compensated for by reweighting.
- However, this is not what will be done in this case.
Recursion Relationship
In this section, the speaker discusses how to recast the problem of estimating value functions as a statistical estimation problem in terms of something that already includes Markovianity.
Using Recursion Relationship
- The recursion relationship states that the value function of a policy at a given state is a sum over actions (picked according to the policy) and new states (weighted by the transition probabilities) of the immediate reward plus the discounted value at the new state.
- Using this recursion relationship can be more efficient than using Monte Carlo methods because it leverages knowledge of the system's structure.
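The recursion relation described above is the Bellman expectation equation. In standard notation (the symbols are assumed, since the lecture's own notation is not reproduced here):

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,
\bigl[\, r(s, a, s') + \gamma\, V^{\pi}(s') \,\bigr].
```

Because the equation already encodes the Markov structure, an estimator built on it can reuse information across states instead of treating each trajectory as an independent sample of the full return.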
Linear Algebra and Temporal Difference Learning
In this section, the speaker discusses how to solve the linear equation for calculating the value function in Markov Decision Processes (MDPs) using linear algebra. The speaker also introduces the concept of temporal difference learning as a way to solve this problem stochastically.
Solving Linear Equation for Value Function
- The expected one-step reward from state s is obtained by summing the rewards over actions (according to the policy) and new states (according to the transition probabilities).
- Calculating the value function in MDP is a linear algebra problem that involves solving a linear equation with an unknown vector.
- The recursion relation can be rewritten in matrix notation, highlighting the linear algebra nature of the problem.
- The matrix on the right-hand side is known in MDP, while the vector on the left-hand side is unknown.
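When the model is known, this linear system can be solved directly. A minimal sketch with NumPy, using a made-up 2-state example (the lecture works with a generic MDP):

```python
import numpy as np

# Policy evaluation as linear algebra: V = r + gamma * P V,
# hence V = (I - gamma * P)^{-1} r, where P is the policy-averaged
# transition matrix and r the expected one-step reward per state.
gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])   # rows sum to 1
r = np.array([1.0, 0.0])

V = np.linalg.solve(np.eye(2) - gamma * P, r)
```

Temporal difference learning tackles the same equation when P and r are not known, using only samples.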
Stochastic Approximation and Temporal Difference Learning
- Can we solve this linear equation without knowing what's on either side but having samples of these matrices and vectors?
- This technique goes under the name of temporal difference learning, which involves solving this problem stochastically.
- Tomorrow's lecture will discuss how to prove convergence of stochastic approximation using a simplified version of this technique.
Reinforcement Learning: Temporal Difference Error
In this section, the speaker introduces the concept of temporal difference error in reinforcement learning and explains how it is used to eliminate unknown variables from a recursion relation.
The General Idea
- The speaker explains that they will rewrite an equation involving unknown variables in a way that involves sampling quantities.
- The unknown variables are the probabilities and rewards.
- By putting everything on the same side of the equation, they can eliminate some of the unknown variables by replacing them with sampling quantities.
- This new equation will be the cornerstone of their algorithm.
Expectation Value
- The recursion relation is also equal to an expectation value if actions are taken according to policy.
- This expectation value involves picking an action according to policy, taking a state according to transition probability, and receiving a reward.
- The object under the expectation value is called delta t plus one and represents the temporal difference error.
Temporal Difference Error
- Delta t plus one is a stochastic object that depends on current states, actions, and rewards observed.
- It stands for temporal difference error.
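In standard notation (the symbols are assumptions; the lecture only names the object delta t plus one), the temporal difference error and its defining property read:

```latex
\delta_{t+1} \;=\; r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t),
\qquad
\mathbb{E}_{\pi}\!\left[\delta_{t+1} \mid s_t = s\right] = 0
\;\text{ for all } s
\;\iff\;
V = V^{\pi}.
```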
Temporal Difference Error
In this section, the speaker explains why the temporal difference error is important and how it measures the difference between what you actually observe and what you estimate on the basis of the value function.
Importance of Temporal Difference Error
- The temporal difference error measures the difference between what is actually observed and what is estimated on the basis of the value function.
- The observed reward plus gamma times the value estimated at the next state is compared with the value estimated at the current state; the difference between the two is the error.
- It is called an error because it measures this mismatch between observation and estimate.
- At the true value function, the expectation of this error is zero.
Using Temporal Difference Error for Estimation
- Suppose we start with a guess for the value function and compute the temporal difference error for this guess. On average it will not be zero, because if it were zero on average, the guess would already be the exact value function.
- If the average temporal difference error is larger than zero, the first guess was pessimistic: it predicted a reward smaller than the reward the policy actually gives.
- This provides information on how to correct the estimate.
Simplified Version of Problem
- A simplified version of this problem can be thought about in one dimension.
Linear Equations and Value Functions
In this section, the speaker explains the geometrical interpretation of linear equations in one dimension and how it relates to value functions in MDPs.
Geometrical Interpretation of Linear Equations
- Geometrically, a linear equation describes a hyperplane, and its solution is the vector where the equation is satisfied.
- The speaker considers a sketch of this thing in one dimension.
Value Function and MDPs
- The space of all possible vectors is represented by V, while V_pi represents the only vector that truly is the value of my policy.
- Evaluating a policy in an MDP thus looks like finding the zero of an equation; in this case it is even a linear equation.
Stochastic Approximation
In this section, the speaker discusses stochastic approximation and how it can be used to find solutions without knowing what the blue line is.
Stochastic Approximation
- Stochastic approximation allows us to interrogate our system and get some samples without knowing what the blue line is.
- The goal is to find the solution (the green point in the sketch) without knowing the blue line, knowing only that on average the sampled red points lie around it.
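The one-dimensional picture can be made concrete with a Robbins-Monro-style sketch: find the root of an unknown line seen only through noisy samples. The target function below is a made-up example, not from the lecture.

```python
import random

def robbins_monro(noisy_f, x0, n_steps, rng):
    """One-dimensional stochastic approximation: find the root of an
    unknown increasing line f, observed only through noisy samples.
    Only E[noisy_f(x)] = f(x) is assumed; f itself (the "blue line")
    is never seen."""
    x = x0
    for n in range(1, n_steps + 1):
        # Step sizes with divergent sum but convergent sum of squares.
        alpha = 1.0 / n
        x = x - alpha * noisy_f(x, rng)
    return x

# Hypothetical target: f(x) = x - 2, root at x* = 2, seen with noise.
def noisy_f(x, rng):
    return (x - 2.0) + rng.gauss(0.0, 0.1)
```

Averaging through the decaying step sizes is what lets the iterates settle at the root even though no single sample is reliable.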
Convergence Conditions for Stochastic Gradient Descent
- The lecture will provide conditions for convergence of stochastic gradient descent tomorrow morning.
Temporal Difference Learning for Value Function
In this section, the speaker outlines their plan to formulate their strategy for temporal difference learning for the value function.
Formulating Strategy for Temporal Difference Learning
- The speaker plans to complete the simple one-dimensional problem using stochastic approximation and then go back to their full multi-dimensional problem.
- They will then formulate their strategy for temporal difference learning for the value function.
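The strategy to be formulated can be previewed as a tabular TD(0) update, which nudges the current guess in the direction of the temporal difference error. This is a standard sketch under assumed conventions (dict-based value table, fixed step size), not the lecture's final formulation:

```python
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step for policy evaluation on a tabular value
    function V (a dict mapping states to estimates)."""
    # Temporal difference error delta_{t+1} for the current guess.
    delta = r_next + gamma * V[s_next] - V[s]
    # Stochastic-approximation correction of the estimate.
    V[s] = V[s] + alpha * delta
    return V
```

Applied along observed transitions with suitably decaying step sizes, this is the sample-based way of solving the linear Bellman equation without ever knowing the model.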