Reinforcement Learning (QLS-RL) Lecture 4 - Part 1
Introduction
The lecturer introduces the topic of the lecture and explains that they will be dealing with different settings of the Markov decision problem.
- The lecture will focus on a more approachable problem than in the previous lecture.
- The lecturer briefly recaps what was covered in the last lecture, which was the formal introduction of Markov decision processes.
Markov Decision Processes
This section covers Markov decision processes and their components.
- A Markov decision process is made up of states, actions, rewards, and transition probabilities.
- Policies are mappings of states to actions.
- For a given policy, we can generate an action from a state to produce a new state and reward signal.
- The objective function is constructed by accumulating rewards over time horizon T.
Value Function
This section covers value functions and their importance in deriving recursion relationships.
- Value functions represent expected rewards at a given time t for a certain state under a policy pi.
- Recursion relationships map values of all possible states at any given time to values of those same states at successive times.
- Backward iteration can be used to solve for values starting from final numerical value (0) and working backward through time steps.
Identifying Best Policy
This section covers how to identify the best policy for maximizing return over time.
- Manipulation is required to go from recursion relation to identifying best policy.
- New equation derived from recursion relation helps identify best sequence of decisions over time that guarantees maximum return.
Dynamic Programming
The lecture discusses dynamic programming, which is a method for solving optimization problems. It involves recursively computing optimal value functions at subsequent times and using them to compute optimal value functions at preceding times.
Recursive Relations
- Dynamic programming involves recursively computing optimal value functions at subsequent times and using them to compute optimal value functions at preceding times.
- The relationship between the subsequent and preceding time's optimal value functions is nonlinear due to the max function in the middle. However, this does not cause any problems.
Backward Computation
- To use dynamic programming, you start from the end with an optimal value function of zero because any decision taken at the end doesn't matter. You then use this equation to go backward in time in a process called dynamic programming.
Optimal Actions
- Dynamic programming eventually leads to a set of optimal actions to take for each state at each time, which means that we have found a way to solve our problem. If we are only interested in optimal decisions, we can erase all information about values, making it more memory-efficient.
Computational Complexity
- Dynamic programming becomes computationally heavy when there are large state spaces or action choices or long horizons because of the computational power required to solve it.
Procrastinator Exercise
This section introduces an exercise inspired by dynamic programming called "Procrastinator." It explores why people tend to procrastinate and whether it is an optimized behavior.
Motivation
- The motivation behind this exercise is understanding why people tend to procrastinate and whether it is an optimized behavior or not.
- People tend to procrastinate, and it is not just an individual problem but a common one.
Two Viewpoints
- There are two viewpoints on procrastination:
- The first viewpoint suggests that the way we reason is inadequate for the task at hand.
- The second viewpoint suggests that there's something optimal in what we're doing because we're all doing the same thing, which means we're optimizing over something.
Procrastinator Exercise
- The "Procrastinator" exercise is a Markov decision process that explores why people tend to procrastinate and whether it is an optimized behavior or not.
- This exercise tries to understand the psychology behind procrastination and whether it is an optimized behavior or not.
Introduction to the Exercise
The exercise involves two states, "done" and "not yet". At time zero, an assignment is given, and there is a probability distribution over these two states.
Initial State
- The initial state will be a probability distribution over the two states.
- If you are given an assignment, you typically start in the state "not yet done".
- If you have already completed it or found the solution online, you start in the state "done".
Actions and Rewards
There are two actions: procrastinate or complete the assignment. Completing the assignment has a reward of 0 if you are already in the "done" state. If not, there is a chance of completing it with a reward of 1 minus epsilon and a cost c.
Procrastination
- Procrastinating takes you back to the same state with probability 1 and reward 0.
Completing Assignment
- Completing the assignment has a chance of success (1 - epsilon) with a cost c.
- If successful, you stay in the "done" state for all remaining time steps.
- If unsuccessful due to external factors such as internet connection issues or distractions from friends, there is no penalty but also no reward.
End Game
At time t = capital T (the horizon), if you are in the "done" state then there is no reward or penalty. However, if you are still in the "not yet" state at this point then there is a big penalty of negative capital C.
Done State at Horizon
- No reward or penalty if in "done" state at horizon (time t = capital T).
Not Yet State at Horizon
- Big penalty of negative capital C if still in "not yet" state at horizon (time t = capital T).
Understanding the Best Strategy
In this section, the speaker discusses how to obtain the best strategy and how it depends on different parameters.
Obtaining the Best Strategy
- The exercise is to find out the best strategy and understand how it depends on parameters.
- You can try to do it analytically or numerically.
- Analytical approach gives immediate control over parameters while numerical approach requires scanning for different parameters.
Phase Space Diagram
- A phase space diagram can be created to determine where some decisions are bad or best.
Adapting Policies Based on Parameters
- The policy always depends implicitly on the parameters of the environment.
- Depending on the kind of prospective vision of what the environment is going to send them, agents might decide to do different things.
- The properties of the environment can be thought of as what kind of model an agent uses to describe it.
Final Time and Rewards
In this section, the speaker explains rewards at final time in more detail.
Final Time Rewards
- If it's the final time, there's no more time left to do it.
- Therefore, if you happen to be in a not-done state at that point, you get zero reward.
Introduction
In this section, the speaker introduces the topic of decision making and discusses how time is treated differently in certain situations.
Time Stationary Policies
- The speaker discusses a large class of problems in decision making where time has to be treated differently.
- There is no real clock ticking in these problems, and there is no need to consider different strategies depending on how far we are from the deadline or horizon of our problem.
- These problems deserve to be treated separately and will be discussed further.
Heuristics for Time Stationary Policies
In this section, the speaker provides some heuristics for dealing with time stationary policies.
Terminal States
- The speaker defines terminal states as states from which you do not exit. These states are also called coffin states.
- Terminal states must have an additional property: the average reward that you will get in the future is equal to zero if s is terminal.
- If an MDP ends up in a terminal state with probability one, then our objective function g can be written as the expectation of the sum over all times of gamma t the rewards.
Graphical Representation of Terminal States
In this section, the speaker provides a graphical representation of terminal states.
Basic Idea
- The graph represents different actions that may send you here or there.
- A state is considered terminal if it has arrows going into it but none going out.
- If an MDP ends up in a terminal state with probability one, then it gets stuck there and receives zero rewards from there to infinity.
I apologize, but I cannot see any transcript provided in this conversation. Please provide me with the transcript so that I can create a summary using the format you have described.
Introduction to Augmented Markov Decision Processes
In this section, the speaker introduces the concept of an augmented Markov decision process (MDP) and explains how it is used to modify a given MDP.
Adding a New State
- A new state called "coffin state" is added to the original MDP.
- The original MDP is then augmented with this new state called "void state".
- The new state is terminal, which means that you get back to it with probability one.
Modifying Transition Probabilities
- At every step, probabilities are modified by multiplying them by gamma.
- Every time you transition from one state to another, there is a possibility that you transition from here to a new terminal state with probability 1-gamma.
- Gamma can be thought of as a survivor probability.
Equivalence of Original and Augmented MDP
- The objective discounted objective of the original MDP is equal to the undiscounted objective in the augmented MDP.
- This allows us to say that we can expect that the policy will not be time-dependent.
Introduction to Discount Factor
In this section, the speaker introduces the concept of a discount factor and how it can be used to consider all situations in an augmented MDP with a terminal state. The speaker also explains how introducing discounting allows for a stationary value function and optimal solution.
Introducing Discount Factor
- A discount factor is introduced to consider all situations in an augmented MDP with a terminal state.
- Gamma equals one if there is a true terminal state in the original MDP.
- Introducing discounting allows for a stationary value function and optimal solution.
Average Survival Time
- One over one minus gamma represents the average survival time.
- Dynamic programming cannot be used since there is no end to start with when dealing with problems that have no time.
Deriving Formal Equation for Discounted MDP
In this section, the speaker discusses how to formally derive the equation for discounted MDP and how to solve it.
Solving Discounted MDP Equation
- The equation derived earlier is proven as exactly the solution of discounted MDP first and secondly, how to solve it is discussed.
- Since dynamic programming cannot be used due to lack of time, another way of solving this equation must be found.
Clarification on True Terminal State
In this section, the speaker clarifies what they mean by "true terminal state" when discussing gamma equaling one.
Understanding True Terminal State
- Gamma equals one if there is a true terminal state in the original MDP.
- The sum will converge anyway if there is a true terminal state, and eventually, these rewards become zero all the time.
Conclusion
In this section, the speaker concludes the lecture and announces a break.
End of Lecture
- The lecture concludes with an announcement for a break.