Reinforcement Learning (QLS-RL) Lecture 10
QUANTITATIVE LIFE SCIENCE: Reinforcement Learning (QLS-RL), A. Celani (QLS-RL-L10-Celani-Zoom.mp4)
Introduction to Partially Observable Markov Decision Processes
In this section, the speaker introduces the concept of partially observable Markov decision processes and outlines the objectives for the tutorial.
Reinforcement Learning Problem
- Any reinforcement learning problem can be cast as an agent interacting with an environment.
- The interaction is a closed loop made up of actions and percepts.
- Percepts are split into rewards and observations.
Observations and Rewards
- Observations give contextual information, but in general, they are limited.
- At time t, the agent has performed a series of actions resulting in rewards and observations.
- Observations could be high dimensional objects such as images or lists of variables. Rewards are one-dimensional objects.
Choosing Actions
- The question is how to choose the next action given all previous observations (history).
- The problem is to find the best mappings from histories to decisions, given that we want to achieve a certain goal.
No Model Available
- If no model is available, we have to rely on long histories and process them by compressing them into memory states.
Goal of Reinforcement Learning
- The goal is to maximize over choices of policies the expected value of future rewards.
Two Approaches to Decision Making
In this section, the speaker discusses two approaches to decision making: model-free and model-based. The focus is on the second approach.
Model-Based Decision Making
- Model-based decision making involves using a Markov decision process (MDP) to make decisions.
- In a POMDP the states are hidden, so the agent must rely on observations instead.
- If the states, rewards, and transition probabilities of the underlying MDP are known, planning can still be done even if the state cannot be observed directly.
- Partial observability means that an agent cannot know which state a system is in at each step. Instead, they have an observation variable that can take on different values.
Example of Partial Observability
In this section, the speaker provides an example of partial observability using a two-state Markov decision process.
Two-State Markov Decision Process
- A two-state Markov decision process has two states and two actions.
- Each action has only two possible outcomes: it can send the system to itself or to the other state.
- Dynamic programming can be used to solve for the best actions from each state.
- The best action from state 2 is 1 and from state 1 is 2.
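The dynamic programming step can be sketched with value iteration; the transition probabilities, rewards, and discount below are illustrative placeholders (not the lecture's numbers), chosen so that the optimal policy matches the one quoted above:

```python
import numpy as np

# Hypothetical two-state, two-action MDP (illustrative numbers only).
# P[a][s, s'] = transition probability; R[s, a] = expected reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 1
    [[0.1, 0.9], [0.7, 0.3]],   # action 2
])
R = np.array([[0.0, 1.0],       # rewards in state 1 for actions 1, 2
              [1.0, 0.0]])      # rewards in state 2 for actions 1, 2
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):           # value iteration until convergence
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V = Q.max(axis=1)
policy = Q.argmax(axis=1)       # 0-based index of the best action per state
# With these numbers: action 2 from state 1, action 1 from state 2.
```

With these placeholder numbers both states collect one unit of reward per step under the optimal policy, so both converge to the value 1/(1-gamma) = 10.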
Introducing Partial Observability
- An observation variable y is introduced that takes on values of either 1 or 2.
- The probability of having an observation y given s is introduced as well as error probability in measurement devices.
- The error probability is used to account for the fact that the measurement device can make errors.
Introduction to Partially Observable Markov Decision Processes
In this section, the speaker introduces the concept of partially observable Markov decision processes (POMDPs) and explains how they differ from traditional Markov decision processes.
POMDPs and Memoryless Processes
- POMDPs are introduced as a way to deal with partial observability in decision-making.
- When we have partial observability, our decision-making will not be Markovian; we will have to rely on history.
- The example of the cart-pole is used to illustrate how relying on history is necessary when we only have access to certain observations.
Inferring States from Observations
- The speaker discusses how observations can help us infer which state we are in, even if not with certainty.
- If we can formalize the idea of inferring states from past observations, then we can make decisions based on that information.
Hidden Markov Chains
- The speaker introduces hidden Markov chains as the action-free case of a partially observable Markov process.
- Hidden Markov chains are also known as hidden Markov models.
Introduction to Hidden Markov Models
In this section, the speaker introduces the concept of hidden Markov models and explains how they can be used to identify the probability of being in a certain state at a given time.
The Model
- The system evolves over time according to some transition probabilities.
- At each time step, we receive some observation about the system.
- We want to identify the probability of being in a certain state at time t given the previous history of observations.
Bayesian Updating
- Bayesian updating is a technique that allows us to construct sequentially these probabilities as new observations come in.
- As we collect information about the system, we might be able to localize our probability around a certain state.
- If we have knowledge of where our system is sitting at a given time, we can formulate some sort of decision-making problem.
Deriving Rules for Sequential Probability Updates
- We can incorporate sequentially new observations as they come and update our current probability over states.
- The goal is to update our current probability over states as the system evolves in time.
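A minimal sketch of this sequential update for a two-state hidden Markov chain; the transition matrix, error probability, and observation sequence are made up for illustration:

```python
import numpy as np

P = np.array([[0.95, 0.05],    # P[s, s']: transition probabilities
              [0.05, 0.95]])
eps = 0.1                      # measurement error probability
F = np.array([[1 - eps, eps],  # F[s, y]: probability of observing y in state s
              [eps, 1 - eps]])

def update(belief, y):
    """One Bayesian filtering step: predict with P, correct with F[:, y]."""
    predicted = belief @ P            # predicted[s'] = sum_s p(s'|s) b(s)
    unnorm = F[:, y] * predicted      # multiply by the likelihood of y
    return unnorm / unnorm.sum()      # normalize to a probability

b = np.array([0.5, 0.5])              # flat prior over the two states
for y in [0, 0, 0, 1]:                # hypothetical observation sequence
    b = update(b, y)                  # belief localizes, then spreads on the 1
```

Three consistent observations concentrate the belief on state 1; the final discordant observation pulls it back toward uncertainty, which is exactly the competition described later between observations and dynamics.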
Probability of Sequence from 0 to t Given Observations from 0 to t
In this section, the speaker explains how they sum over all possible previous states of the probability of the sequence from 0 to t given the observations from zero to t.
Probability Distribution and Conditional Probability
- The speaker focuses only on the probability of the current state and does not care about what the previous states were given their information.
- Given history, they want to know where they are now or infer where they are now.
- They use the definition of a conditional probability which tells them that this object is also equal to the sum over all previous states of the joint probability.
Markov Property
- The speaker writes the joint probability distribution of states and observations as a long product using Markov property.
- They rewrite this object by isolating what has happened at the last step.
- The joint probability factorizes into a product of transition probabilities, and the observations are conditionally independent given the states.
Marginal Over States
- The denominator is just a marginal over states of what they have above.
- The probability of a sequence of observations up to time t is just a marginal over all sequences of states' joint probability.
Summing Over All Possible Occurrences
In this section, the speaker explains how they are summing over all possible occurrences and not erasing anything. They are using properties of probabilities and the Markov structure of the problem.
Notation Explanation
- The notation "p of s from 0 to t-2, times s at t-1" is just there to isolate one element of the sequence.
- The object "s from 0 to t-1" is the full sequence of states with its last element written separately.
Joint Probability Calculation
- The numerator (the green part) is the joint probability of the states and observations from 0 to t-1, multiplied by the transition probability from the state at t-1 to the state at t.
- The probability of the new observation y is not yet written on the blackboard at this step.
Conditional Probability Calculation
- The ratio of a joint probability to a marginal probability results in a conditional probability.
- The conditional probability formula involves the probability of the state given the history up to time t-1, the transition probability, and the observation probability, all divided by a normalization.
Final Formula
- The final formula involves summing over the states at time t-1 of the probability of those states given the past observations, propagated by the transition probabilities and reweighted by the new observation.
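The steps above can be collected into a single recursion; a plausible way to write it, assuming the lecture's notation p(s|s') for transitions and f(y|s) for the observation model, is:

```latex
p(s_t \mid y_{0:t})
  = \frac{f(y_t \mid s_t) \sum_{s_{t-1}} p(s_t \mid s_{t-1})\, p(s_{t-1} \mid y_{0:t-1})}
         {\sum_{s_t'} f(y_t \mid s_t') \sum_{s_{t-1}} p(s_t' \mid s_{t-1})\, p(s_{t-1} \mid y_{0:t-1})}
```

The denominator is exactly the marginal over states mentioned earlier, and plays the role of the evidence in Bayes' formula.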
Introduction to Sequential Probability
In this section, the speaker introduces sequential probability and explains how it can be used to update probabilities sequentially.
Updating Probabilities Sequentially
- The probability of being in state s_t given the observations up to time t is expressed in terms of the probability over states at time t-1 given the observations up to time t-1.
- This gives a sequential rule for updating probabilities. If you have a probability distribution over states at time t and combine it with your probability of transitions and observations, you can derive an updated probability distribution.
- At any time, if you have a probability of a state given the observations up to time t minus one, you can use your knowledge of the model p and f to derive the new updated probability given the new observations.
Bayes Formula
- The formula introduced is essentially Bayes formula.
- When transition probabilities are just identity, then this formula becomes similar to Bayes formula with priors, likelihoods and evidence.
Competing Beliefs
- Two parts compete when updating beliefs: the prior belief, which the dynamics moves around state space, and the likelihood update from new observations.
Belief Updating
In this section, the speaker explains how belief updating works in a hidden Markov process. The posterior follows any changes in states, and since the system is stochastic, it tends to spread out the posterior.
Posteriors and Stochastic Changes
- Posteriors follow changes in states.
- A change from one state to another is stochastic and tends to spread out the posterior.
- Even if the probability of being in a certain state at time t given previous history is high, making a stochastic change will erase a lot of information.
Competition Between Observations and Dynamics
- There is competition between observations that increase information and dynamics that tend to make the system forget.
Belief Updating as Key Ingredient
- Belief updating accounts for both underlying dynamics and observations.
- It is a key ingredient in constructing a theory for decision-making with partial observation.
Correcting Misprints
In this section, the speaker corrects misprints found by an attendee.
Correction of Misprints
- Attendee points out misprint in formula.
- Speaker thanks attendee for pointing it out.
- Speaker asks attendees to point out other misprints if they find them.
T+1 Confusion
In this section, the speaker addresses confusion regarding t+1 notation used earlier.
T+1 Confusion Explained
- Attendee asks about t+1 notation used earlier.
- Speaker explains that confusion arose from mixing steps from t to t+1 and t-1 to t.
- There should be no t+1 if everything is done correctly.
Improving Image Quality
In this section, an attendee asks about improving the image quality of files provided by the speaker.
Image Quality Improvement
- Attendee asks about improving image quality of files provided by speaker.
- Speaker says he tried but was not able to improve resolution.
- Speaker will try post-processing to make them better.
Pink Passage Explanation
In this section, an attendee asks for clarification on a pink passage in the transcript.
Pink Passage Clarification
- Attendee asks for clarification on a pink passage in the transcript.
- Speaker explains that he repeated the same process as before with a sum until time t-2 and a sum over t-1.
Introduction to Partially Observable Markov Decision Processes
In this section, the speaker introduces Partially Observable Markov Decision Processes (POMDPs) as an extension of the notion of a Markov Decision Process (MDP).
Definition of POMDPs
- POMDP is an extension of MDP that includes states, actions, and observations.
- Observations are new objects in POMDP that represent percepts or rewards.
- Rewards can be thought of as information about the state of the system itself.
- The model for the environment includes transition probabilities from states to states given actions and average rewards.
- A model for observations is also provided, which describes how rewards are distributed in probability.
Simplifying without Control
- At first, they start without control to simplify the process.
- They will add actions later on.
Splitting into Two Parts
- The reason why they didn't take into account actions at first was to avoid making it more complicated.
- Depending on how you formally define the system, rewards could be functions of observations or separate from them.
Marginalizing Over States Probabilities Before t
In this section, the speaker discusses marginalizing over state probabilities before time t.
Summing Over History Before t - 2
- The yellow part of the sum marginalizes over all the history up to and including time t-2.
- This leaves a dependence only on the state at time t-1.
Clarification on Time t
- The speaker clarifies that this marginalization runs strictly before time t, that is, up to time t-1.
Introduction to Posterior and Belief
In this section, the speaker introduces the concept of posterior and belief in POMDP. The posterior is a way of encoding the history of observations, while the belief is the probability distribution over states.
Posterior as Sufficient Statistic
- The posterior is a sufficient statistic for the history of observations.
- All information accumulated in a sequence of observations and actions is encoded into the belief.
- The posterior is also known as belief in POMDP.
Updating Beliefs
- Beliefs are updated using Bayesian updating rule.
- Updated beliefs are obtained by multiplying prior beliefs with likelihood and normalizing it.
- Bayesian updating rule for POMDP includes action taken and observation made.
Policies in POMDP
- Policies in POMDP are mappings from beliefs to actions.
- A policy picks an action based on a given prior or belief about where we are in state space.
- After performing an action, we update our belief about where we are in state space using Bayesian updating rule.
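The action-conditioned belief update described above can be sketched as follows; the transition and observation arrays are hypothetical placeholders:

```python
import numpy as np

P = np.array([                    # P[a][s, s']: transitions under each action
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.4, 0.6], [0.9, 0.1]],
])
F = np.array([[0.85, 0.15],       # F[s, y]: observation model
              [0.15, 0.85]])

def belief_update(b, a, y):
    """Bayesian update: predict through P[a], then correct with F[:, y]."""
    predicted = b @ P[a]              # predicted[s'] = sum_s p(s'|s, a) b(s)
    unnorm = F[:, y] * predicted      # times the likelihood f(y|s')
    return unnorm / unnorm.sum()      # normalize to a probability

b = np.array([0.5, 0.5])              # prior belief
b = belief_update(b, a=0, y=1)        # take action 1, observe y = 2 (index 1)
```

A policy would then map the updated belief `b` to the next action, closing the loop.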
Maximizing Expectation Value over Beliefs
In this section, the speaker explains how to maximize expectation value over beliefs to achieve our goal in POMDP.
Goal of POMDP
- The goal of POMDP is to maximize expectation value over policies for discounted rewards.
- Expectation value is calculated by averaging stochasticity of process and our beliefs about how we are distributed over states.
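One way to write this objective, under the convention that the per-step reward of a belief is its expected reward over states (the symbols r and the discount γ are assumed from the lecture):

```latex
r(b, a) = \sum_{s} b(s)\, r(s, a),
\qquad
\max_{\pi}\; \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, r(b_t, a_t)\Big]
```

The outer expectation runs over the stochasticity of transitions and observations, while b_t carries the agent's uncertainty about the state.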
Observations and Rewards
In this section, the speaker discusses how observations and rewards are represented in the literature. They explain that there may be slight differences in notation but the substance is the same.
Representing Observations and Rewards
- The literature on POMDPs has different ways of representing observations and rewards.
- Some describe them as functions of observations while others describe them as mixed rewards and contextual observations.
- Notation may vary from one approach to another, but the principles remain the same.
Averages over Beliefs
In this section, the speaker explains how to construct an average over beliefs using stochasticity in a process. They also discuss how beliefs come into play when constructing averages.
Constructing Averages Over Beliefs
- To construct an average over beliefs, we sum over our probability distribution over states given by our starting beliefs.
- We pick an action according to that prior and then average over it.
- There is stochasticity in the process itself which means we have to sum over all possible actions with their own probability distribution with all possible outcomes.
- We also have to sum over s because we do not know which state we are in, so the belief appears here too.
- At each step, we have to average over our updated belief which depends on the action taken at previous time.
Intricate Summation
In this section, the speaker explains that there is a dependence on history through beliefs of all things that will be summed up into the future.
Dependence on History
- The expressions shown are clearly intractable as they are.
- There is a dependence on history through beliefs of all things that will be summed up into the future.
How to Solve a Partially Observable Markov Decision Process
In this section, the speaker explains how to solve a partially observable Markov decision process by realizing one fundamental thing. The speaker introduces the concept of Bayesian updating and how it sends beliefs into new beliefs. The speaker also explains that working with partially observable Markov decision processes is just looking at an MDP in a much more complex space, the space of probability distributions over states.
Bayesian Updating Sends Beliefs into New Beliefs
- Bayesian updating sends beliefs into new beliefs.
- The new belief depends only on the previous belief, together with the action taken and, probabilistically, the possible observations, whose probabilities are known.
- Bayesian updating is a Markov process in the space of beliefs.
POMDP Is Equivalent to an MDP in Belief Space
- A POMDP lives in a new space which is the space of beliefs.
- By replacing states with beliefs, we lift the problem to a higher-dimensional, in fact continuous, space. This means that a POMDP is actually equivalent to an MDP in belief space.
- You could define a sort of probability of having a new belief given the previous belief and the action, and your average rewards will be average rewards with respect to beliefs.
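This can be made explicit; a plausible way to write the belief-MDP transition kernel, with τ denoting the Bayesian update that maps (b, a, y) to a new belief, is:

```latex
p(b' \mid b, a) = \sum_{y} \Pr(y \mid b, a)\, \mathbb{1}\big[\, b' = \tau(b, a, y) \,\big],
\qquad
\Pr(y \mid b, a) = \sum_{s'} f(y \mid s') \sum_{s} p(s' \mid s, a)\, b(s)
```

The randomness over the next belief comes entirely from which observation y arrives.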
Perfect Observations Collapse Back to an MDP
- Working with partially observable Markov decision processes is just looking at an MDP in a much more complex space, the space of probability distributions over states.
- The corners of this belief space correspond to certainty; if you observe the state directly with certainty, the problem collapses back to an MDP.
Calculations and Bellman Equation
In this section, the speaker discusses the calculations that prove their point and introduces the Bellman equation for POMDP.
The Upshot of Calculations
- The speaker mentions that there is a Bellman equation for POMDP just like there is one for MDP.
The Bellman Equation for POMDP
- The optimal value of any belief can be computed to determine the optimal gain in the future given that belief.
- The structure of the Bellman equation is similar to that of MDP, with some differences such as beliefs over which to average and observations over which transitions depend.
- Value iteration can be used to solve it, but it becomes a nonlinear functional equation rather than an equation over a finite vector of values.
- Solving this equation is generally a super hard problem except in some favorable situations where updating beliefs are simple.
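Under the notation above, the Bellman optimality equation for the belief MDP can be sketched as (τ is the Bayesian belief update):

```latex
V^{*}(b) = \max_{a} \Big[\, r(b, a)
  + \gamma \sum_{y} \Pr(y \mid b, a)\; V^{*}\big(\tau(b, a, y)\big) \Big]
```

Since b lives in a continuous simplex, V* is a function over that simplex, which is why value iteration here becomes a functional equation and is generally hard.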
Efficiently Combining Prior and Posterior
In this section, the speaker discusses how to efficiently combine prior and posterior probabilities. They explain that it is not always easy to do so, but there are specific cases where it is possible.
Using Conjugate Priors for Gaussian Likelihoods
- The speaker explains that when combining a Gaussian likelihood with a conjugate prior, the resulting posterior will also be Gaussian.
- This allows for dimensionality reduction by only looking at means and averages instead of the full space of probability distributions.
- The speaker notes that this method only works when the underlying model has certain properties.
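A minimal sketch of this dimensionality reduction: the Gaussian posterior is characterized by just a mean and a variance, because precisions add and the posterior mean is precision-weighted.

```python
def gaussian_posterior(mu0, var0, x, var_lik):
    """Combine a Gaussian prior N(mu0, var0) with one Gaussian
    observation x of likelihood variance var_lik; the posterior
    is again Gaussian, so two numbers summarize the whole belief."""
    var_post = 1.0 / (1.0 / var0 + 1.0 / var_lik)    # precisions add
    mu_post = var_post * (mu0 / var0 + x / var_lik)  # precision-weighted mean
    return mu_post, var_post

mu, var = gaussian_posterior(mu0=0.0, var0=1.0, x=2.0, var_lik=1.0)
```

Instead of tracking a full probability distribution over states, the belief update moves only this pair of sufficient statistics.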
Using Beta Distribution as Conjugate Prior for Bernoulli Likelihood
- If the likelihood is Bernoulli (i.e., zeros and ones), then using a Beta distribution as a conjugate prior results in a Beta posterior.
- The exponential family of distributions encompasses these examples and others, allowing for larger classes of probability distributions.
- However, this situation is non-generic, meaning it does not cover all probability distributions.
Solving Bayesian Bandit Problem with Two-Arm Bernoulli Problem
- The two-arm Bernoulli problem simplifies the problem by having simple transition probabilities and a constant state.
- By using value iteration in this discrete space of beliefs, we can solve exactly the Bayesian bandit problem.
- A good reference for more information on partially observable Markov decision processes (POMDPs) is a review paper by Spohn.
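The exact solution by value iteration over this discrete space of beliefs can be sketched as follows; the belief state is just the four Beta counts, and the finite-horizon, one-reward-per-success convention is an assumption for illustration:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(a1, b1, a2, b2, horizon):
    """Optimal expected number of successes over `horizon` remaining
    pulls of a two-arm Bernoulli bandit, with Beta(a1, b1) and
    Beta(a2, b2) beliefs over the two unknown success probabilities."""
    if horizon == 0:
        return 0.0
    p1 = a1 / (a1 + b1)   # posterior mean of arm 1
    p2 = a2 / (a2 + b2)   # posterior mean of arm 2
    # Pulling an arm yields a success (a + 1) or a failure (b + 1);
    # the belief update is exactly the Beta conjugate update.
    q1 = (p1 * (1 + value(a1 + 1, b1, a2, b2, horizon - 1))
          + (1 - p1) * value(a1, b1 + 1, a2, b2, horizon - 1))
    q2 = (p2 * (1 + value(a1, b1, a2 + 1, b2, horizon - 1))
          + (1 - p2) * value(a1, b1, a2, b2 + 1, horizon - 1))
    return max(q1, q2)
```

For uniform Beta(1, 1) priors on both arms and a horizon of 2, the optimal value is 13/12: the first pull is worth 1/2 on average, and its outcome informs the second pull.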