(page reused and constantly updated over multiple semesters; the course was previously taught under the number 598)
Last Updated: 10/08/2024
Approximate Dynamic Programming and Offline RL
Approximate Dynamic Programming (ADP) concerns obtaining approximate solutions to large planning problems, often with the help of sampling and function approximation. Many ADP methods can be viewed as prototypes of popular value-based RL algorithms used today, especially in the offline setting, so it is important to understand their behaviors and guarantees. A small fitted Q-iteration sketch follows the subtopics below.
Offline RL without exploratory data
Online + Offline (Hybrid)
Lower bounds
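As a concrete instance of the ADP template above, here is a minimal fitted Q-iteration sketch for offline data, assuming a dataset of (s, a, r, s') tuples, finitely many actions, and a linear Q-function; the function names and featurization are illustrative assumptions, not a prescribed implementation.

import numpy as np

def fitted_q_iteration(data, phi, num_actions, gamma=0.99, num_iters=50):
    """data: list of (s, a, r, s_next) tuples; phi(s, a) -> 1-D feature array (assumed interface)."""
    d = len(phi(data[0][0], data[0][1]))
    w = np.zeros(d)                          # linear Q-function: Q(s, a) = phi(s, a) @ w
    X = np.array([phi(s, a) for (s, a, r, s2) in data])
    for _ in range(num_iters):
        # Bellman backup targets computed from the current Q estimate.
        y = np.array([r + gamma * max(phi(s2, b) @ w for b in range(num_actions))
                      for (s, a, r, s2) in data])
        # "Fitting" step: least-squares regression of features onto targets.
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w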
Off-policy Evaluation
How to estimate the performance of a policy using data collected from a different policy? This question has important implications for safety and for real-world applications of RL. A sketch of the basic importance sampling estimator follows the subtopics below.
OPE in Bandits
Importance Sampling for RL
Marginalized Importance Sampling and Beyond
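As a starting point, a minimal sketch of the basic (per-trajectory) importance sampling estimator; the trajectory format and policy interfaces are illustrative assumptions.

import numpy as np

def is_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """trajectories: list of [(s, a, r), ...]; pi_e(s, a), pi_b(s, a) -> action probability."""
    values = []
    for traj in trajectories:
        rho, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(s, a) / pi_b(s, a)   # cumulative importance weight
            ret += (gamma ** t) * r          # discounted return of the trajectory
        values.append(rho * ret)             # unbiased if pi_b covers pi_e
    return np.mean(values)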
LSTD and LSPI
These methods directly minimize the projected Bellman error under linear value-function approximation, and have characteristics different from TD methods that learn from bootstrapped targets.
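A minimal LSTD(0) sketch for evaluating a policy with linear features; the data format is an illustrative assumption.

import numpy as np

def lstd(data, phi, gamma=0.99, reg=1e-6):
    """data: (s, r, s_next) transitions from the policy being evaluated; phi(s) -> 1-D features."""
    d = len(phi(data[0][0]))
    A, b = np.zeros((d, d)), np.zeros(d)
    for (s, r, s2) in data:
        f, f2 = phi(s), phi(s2)
        A += np.outer(f, f - gamma * f2)     # empirical A = E[phi(s)(phi(s) - gamma*phi(s'))^T]
        b += f * r                           # empirical b = E[phi(s) r]
    w = np.linalg.solve(A + reg * np.eye(d), b)   # small ridge term for numerical stability
    return w                                 # V(s) is approximated by phi(s) @ w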
Multi-agent RL
Online
Offline
POMDPs and PSRs
When the Markov assumption does not hold and the world is partially observable, how can one recover the state from observations and plan based on the belief? A sketch of the belief update follows the subtopics below.
Planning
Learning
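For reference, the standard belief update that turns the observation history into a sufficient statistic (generic notation; O is the observation model):

\[
b_{a,o}(s') \;\propto\; O(o \mid s') \sum_{s} P(s' \mid s, a)\, b(s),
\]

normalized so that the updated belief sums to one; planning can then be carried out on the belief-state MDP.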
Sampling-based Planning Algorithms
Computing a near-optimal policy in large MDPs by sampling from a generative model of the environment.
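The basic primitive behind these algorithms is Monte Carlo estimation of action values using the generative model; a minimal sketch, with the simulator interface and rollout policy as illustrative assumptions.

import numpy as np

def mc_q_estimate(simulator, rollout_policy, s, a, gamma=0.99, horizon=50, n=100):
    """simulator(s, a) -> (r, s_next); rollout_policy(s) -> a; both are assumed interfaces."""
    returns = []
    for _ in range(n):
        ret, disc, state, action = 0.0, 1.0, s, a
        for _ in range(horizon):
            r, state = simulator(state, action)   # one call to the generative model
            ret += disc * r
            disc *= gamma
            action = rollout_policy(state)
        returns.append(ret)
    return np.mean(returns)                       # truncated-horizon Monte Carlo estimate of Q(s, a)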
PAC-RL with Function Approximation
Structural Assumptions (Bellman rank, Eluder dimension, etc.)
Latent State Discovery
Low-rank/Linear MDPs
Thompson sampling
Lower bounds
PAC-RL (tabular)
Lower bounds
Exploiting side information or structures in easy problems
Concurrent/“Low-Switch”/“Deployment-efficient” PAC-RL
Multi-task & Life-long Learning and Learning across Population
PAC Hierarchical RL
Linear Programming for MDPs
Apart from value/policy iteration, Linear Programming (LP) is another standard method for solving MDPs. The LP formulation also reveals many interesting properties of MDPs (e.g., the dual formulation has occupancy measures as its decision variables); a sketch of the primal and dual LPs is given below.
Entropy Regularization
Decision-aware model-based RL
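A minimal sketch of the LP pair for a discounted MDP with initial distribution \(\mu\) (generic notation, not taken from a specific reference):

\[
\min_{V}\ (1-\gamma)\sum_s \mu(s)\,V(s) \quad \text{s.t.} \quad V(s) \ge R(s,a) + \gamma \sum_{s'} P(s'\mid s,a)\,V(s') \ \ \forall\, s,a,
\]

whose dual is

\[
\max_{d \ge 0}\ \sum_{s,a} d(s,a)\,R(s,a) \quad \text{s.t.} \quad \sum_a d(s,a) = (1-\gamma)\,\mu(s) + \gamma \sum_{s',a'} P(s\mid s',a')\,d(s',a') \ \ \forall\, s,
\]

where \(d(s,a)\) is the normalized discounted state-action occupancy measure.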
RL and Causality
In batch RL we often assume that the logged actions are a function of the states, which is not necessarily true in some application scenarios. If the actions in the data were determined based on hidden factors, standard algorithms may suffer from confounding effects, and we may seek tools from causal inference to mitigate them.
Robustness in MDPs
Conservative approaches to MDPs: compute a policy that is robust to uncertainty in the MDP parameters, or, when the MDP is fully known, compute a policy whose return is worst-case optimal with respect to the randomness of the trajectory.
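A sketch of the robust Bellman optimality equation under an (s,a)-rectangular uncertainty set \(\mathcal{P}(s,a)\) of transition models (generic notation):

\[
V(s) = \max_{a}\ \min_{P \in \mathcal{P}(s,a)} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big].
\]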
Policy Gradient and Policy Improvement Methods
How to design an RL algorithm that keeps improving the policy monotonically? This motivation inspired the original CPI algorithm and later variants such as TRPO.
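The core step in CPI is a conservative mixture update (sketched in generic notation):

\[
\pi_{t+1} = (1-\alpha)\,\pi_t + \alpha\,\pi',
\]

where \(\pi'\) is (approximately) greedy with respect to the advantage function of \(\pi_t\), and \(\alpha\) is chosen small enough that the return is guaranteed to improve monotonically.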
Interactive Imitation Learning
How to imitate an expert policy from demonstrations? A purely supervised learning approach (“behavior cloning”) may suffer from distribution shift; interactive methods, such as DAgger, are more robust to it.
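A minimal DAgger-style sketch; the expert, learner, and environment-rollout interfaces are illustrative assumptions.

def dagger(env_rollout, expert_action, fit_classifier, initial_policy, num_iters=10):
    """env_rollout(policy) -> list of visited states; expert_action(s) -> expert's action."""
    dataset = []                                  # aggregated (state, expert action) pairs
    policy = initial_policy
    for _ in range(num_iters):
        states = env_rollout(policy)              # roll out the *current* learner's policy
        dataset += [(s, expert_action(s)) for s in states]   # query the expert on visited states
        policy = fit_classifier(dataset)          # retrain on the aggregated dataset
    return policy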
RL & Control
In control theory, the most basic model of a dynamic environment is a linear system. How to do RL in continuous state spaces, and what guarantees can be obtained?
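The canonical example is the Linear Quadratic Regulator (LQR); a minimal sketch of the setup (generic notation):

\[
s_{t+1} = A s_t + B a_t + w_t, \qquad \text{cost} = \sum_t \big( s_t^\top Q\, s_t + a_t^\top R\, a_t \big),
\]

with noise \(w_t\); the optimal controller is linear state feedback \(a_t = -K s_t\), where \(K\) is obtained from the Riccati equation. The learning question is how to find \(K\) when \(A\) and \(B\) are unknown.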
Abstractions, Model-based RL, and Representation Selection
State abstraction is the simplest form of approximation: it collapses a high-dimensional state representation into a more manageable one. An understanding of abstractions forms the basis for understanding more complex function approximation schemes. However, designing a good abstraction is a difficult engineering task; can the algorithm automatically select among candidate representations during learning?
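As a small illustration of how an abstraction is used, the sketch below builds an empirical abstract model from data under a map phi from raw states to abstract states; the data format and map are illustrative assumptions.

from collections import defaultdict

def abstract_model(data, phi):
    """data: (s, a, r, s_next) tuples; phi maps a raw state to an abstract state."""
    counts, rewards, visits = defaultdict(float), defaultdict(float), defaultdict(float)
    for (s, a, r, s2) in data:
        z, z2 = phi(s), phi(s2)
        visits[(z, a)] += 1.0
        counts[(z, a, z2)] += 1.0
        rewards[(z, a)] += r
    P = {(z, a, z2): c / visits[(z, a)] for (z, a, z2), c in counts.items()}
    R = {(z, a): r_sum / visits[(z, a)] for (z, a), r_sum in rewards.items()}
    return P, R   # fidelity of this model depends on how well phi preserves rewards and dynamics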
Factored MDPs
Factored MDPs are a tractable formulation of large-scale MDPs: the state is represented by a collection of factors, and the next value of each factor depends only on the current values of a small number of parent factors.
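In symbols, a sketch of the factored transition model (generic notation):

\[
P(s' \mid s, a) \;=\; \prod_{i=1}^{m} P_i\big(s'[i] \,\big|\, s[\mathrm{pa}_i],\, a\big),
\]

where \(s[i]\) denotes the \(i\)-th factor and \(\mathrm{pa}_i\) indexes its small parent set.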
Stochastic Approximation
Convergence analyses of Q-learning, TD, etc.; a tabular Q-learning sketch is given below.
Concentration Inequalities in Dependent Processes
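As a concrete instance of the stochastic-approximation viewpoint above, a minimal tabular Q-learning sketch with Robbins-Monro step sizes (the step sizes for each state-action pair sum to infinity while their squares sum to a finite value); the environment interface is an illustrative assumption.

import numpy as np

def q_learning(env_reset, env_step, num_states, num_actions,
               gamma=0.99, episodes=1000, horizon=100, eps=0.1):
    """env_reset() -> s; env_step(s, a) -> (r, s_next); both are assumed interfaces."""
    Q = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env_reset()
        for _ in range(horizon):
            # epsilon-greedy exploration
            a = np.random.randint(num_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
            r, s2 = env_step(s, a)
            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]   # Robbins-Monro step size for this (s, a) pair
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2
    return Q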