(page reused and constantly updated over multiple semesters)
Approximate Dynamic Programming
Approximate Dynamic Programming (ADP) concerns obtaining approximate solutions to large planning problems, often with the help of sampling and function approximation. Many ADP methods can be viewed as prototypes of popular value-based RL algorithms used today, so it is important to understand their behaviors and guarantees.
Linear Programming for MDPs
Apart from value/policy iteration, Linear Programming (LP) is another standard method for solving MDPs. The LP formulation also reveals many interesting properties of MDPs (e.g., the dual formulation has occupancy measure as its decision variables).
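As a concrete illustration, here is a minimal sketch (with made-up transition and reward numbers) that solves a 2-state MDP via the primal LP, minimize sum_s V(s) subject to V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s'), using scipy:

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-state, 2-action MDP (hypothetical numbers for illustration).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                  # R[s, a]
              [0.0, 2.0]])
nS, nA, _ = P.shape

# Primal LP: min sum_s V(s) s.t. V(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) V(s').
# linprog wants A_ub @ V <= b_ub, so rewrite as (gamma*P - I) V <= -R.
A_ub = np.zeros((nS * nA, nS))
b_ub = np.zeros(nS * nA)
for s in range(nS):
    for a in range(nA):
        row = gamma * P[s, a].copy()
        row[s] -= 1.0
        A_ub[s * nA + a] = row
        b_ub[s * nA + a] = -R[s, a]

res = linprog(c=np.ones(nS), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * nS)
V_lp = res.x   # the LP optimum is exactly the optimal value function V*
```

The solution coincides with what value iteration converges to, which is a handy way to check the formulation.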
LSTD and LSPI
These methods directly minimize the projected Bellman error under linear value-function approximation, and exhibit different characteristics from TD methods that learn from bootstrapped targets.
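A minimal LSTD(0) sketch on a hypothetical 3-state chain with one-hot features (so the solution should coincide with V^pi up to sampling noise): it accumulates A = sum phi(s)(phi(s) - gamma*phi(s'))^T and b = sum phi(s) r(s) from sampled transitions and solves A w = b in one shot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Evaluate a fixed policy on a toy 3-state chain (hypothetical numbers).
# With one-hot features, LSTD should recover V^pi up to sampling noise.
gamma = 0.9
P = np.array([[0.9, 0.1, 0.0],     # P[s, s'] under the policy
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
r = np.array([0.0, 0.0, 1.0])      # state-based reward
phi = np.eye(3)                    # one-hot feature per state

# Accumulate the LSTD statistics from sampled transitions (s, r, s').
A = np.zeros((3, 3))
b = np.zeros(3)
for _ in range(50000):
    s = rng.integers(3)            # states sampled uniformly
    s2 = rng.choice(3, p=P[s])     # next state from the transition kernel
    A += np.outer(phi[s], phi[s] - gamma * phi[s2])
    b += phi[s] * r[s]

w = np.linalg.solve(A, b)          # LSTD(0) weights: V_hat(s) = phi(s) @ w
```

Unlike incremental TD, there is no step size to tune: all the data is summarized in (A, b) and solved at once.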
Off-Policy Evaluation (OPE)
How can we estimate the performance of a policy using data collected from a different policy? This question has important implications for safety and real-world applications of RL.
OPE in Bandits
Importance Sampling for RL
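For intuition, here is a minimal sketch of the basic trajectory-level importance sampling estimator, on a made-up single-state example where the true target value is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy single-state MDP, horizon 2, reward equal to the chosen action (0 or 1).
# Behavior policy mu is uniform; target policy pi_e picks action 1 w.p. 0.8.
H = 2
mu = np.array([0.5, 0.5])
pi_e = np.array([0.2, 0.8])

def is_estimate(n_traj):
    total = 0.0
    for _ in range(n_traj):
        rho, ret = 1.0, 0.0
        for _ in range(H):
            a = rng.choice(2, p=mu)      # act with the behavior policy
            rho *= pi_e[a] / mu[a]       # accumulate the importance ratio
            ret += a                     # reward = action taken
        total += rho * ret               # weight the full-trajectory return
    return total / n_traj

J_hat = is_estimate(20000)               # true value: J(pi_e) = H * 0.8 = 1.6
```

The estimator is unbiased, but the product of per-step ratios is exactly what makes its variance blow up exponentially in the horizon, which motivates the marginalized methods above.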
Marginalized Importance Sampling and Beyond
Connection between IS and PG
RL and Causality
In batch RL we often assume that logged actions are a function of the states, which is not necessarily true in some application scenarios. If actions in the data were determined based on hidden factors, standard algorithms may suffer from confounding effects, and we may seek tools from causal inference to mitigate them.
PAC-RL
PAC-MDP is a mathematical framework for studying sample complexity in RL, often with an emphasis on the challenge of exploration. PAC bounds for RL have been refined and simplified over time, and new topics beyond classical tabular RL are being explored.
Spectral methods for exploration in POMDPs
Exploiting side information or structures in easy problems
Concurrent & Continuous PAC-RL
Multi-task & Life-long Learning and Learning across Population
PAC Hierarchical RL
PAC-RL with Function Approximation
Most PAC-MDP literature focuses on the tabular case, while empirical RL algorithms all rely on function approximation. What is a mathematical framework for understanding exploration in RL with function approximation?
Inverse RL and Reward Shaping
In MDPs we start with a reward function. Where does it come from? In fact, it can be very difficult to specify a good reward function in practice. One way to resolve this problem is inverse RL, where we derive a reward function algorithmically from expert demonstrations.
Here is a hard trade-off: a reward function that faithfully describes the ultimate evaluation criteria is often sparse and difficult to work with; a non-sparse reward function that gives a lot of learning signals may result in undesired behaviors. How to transform a sparse reward function into a more learnable one without sacrificing fidelity to the ultimate goal?
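One classical answer is potential-based reward shaping (Ng, Harada & Russell, 1999): adding gamma*Phi(s') - Phi(s) for any potential function Phi leaves the optimal policy unchanged. A numerical sanity check on a toy MDP, with all numbers made up for illustration:

```python
import numpy as np

# Potential-based shaping: r'(s,a,s') = r(s,a,s') + gamma*Phi(s') - Phi(s)
# provably preserves the optimal policy. Check on a hypothetical 2-state MDP.
gamma = 0.9
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.4, 0.6], [0.9, 0.1]]])   # P[s, a, s']
R = np.array([[0.0, 0.1], [0.0, 1.0]])      # R[s, a]
Phi = np.array([3.0, -1.0])                 # an arbitrary potential function

def greedy_policy(R_sa):
    # Value iteration followed by one greedy step.
    V = np.zeros(2)
    for _ in range(2000):
        V = (R_sa + gamma * P @ V).max(axis=1)
    return (R_sa + gamma * P @ V).argmax(axis=1)

# Shaped reward in expectation over s': R(s,a) + gamma * E[Phi(s')] - Phi(s).
R_shaped = R + gamma * (P @ Phi) - Phi[:, None]

pi_orig = greedy_policy(R)
pi_shaped = greedy_policy(R_shaped)   # same optimal policy as pi_orig
```

The shaped optimal Q-function is just Q*(s,a) - Phi(s), so the argmax over actions, hence the optimal policy, is untouched, while a well-chosen Phi can densify the learning signal.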
On Planning Horizon
In practice, the horizon / discount factor used in an RL algorithm is usually smaller than that in the problem definition. When there is such a mismatch between the two notions of horizon, how lossy is the policy that we get?
Robustness in MDPs
Conservative approaches to MDPs: compute a policy that is robust to parameter uncertainty in the MDP, or, when the MDP is fully known, compute a policy whose return is worst-case optimal w.r.t. the randomness of the trajectory.
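A minimal sketch of robust value iteration, assuming (for simplicity) a finite uncertainty set of two hypothetical transition models, where nature picks the worst model at every step:

```python
import numpy as np

# Robust value iteration: the transition kernel is only known to lie in a
# small finite uncertainty set {P1, P2} (hypothetical numbers throughout).
gamma = 0.9
P1 = np.array([[[0.9, 0.1], [0.2, 0.8]],
               [[0.6, 0.4], [0.3, 0.7]]])   # P[s, a, s'], model 1
P2 = np.array([[[0.7, 0.3], [0.4, 0.6]],
               [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s'], model 2
R = np.array([[1.0, 0.0], [0.0, 2.0]])

V = np.zeros(2)
for _ in range(2000):
    # Robust Bellman operator: max over actions of the min over models.
    Q_worst = np.minimum(R + gamma * P1 @ V, R + gamma * P2 @ V)
    V = Q_worst.max(axis=1)
```

The robust operator is still a gamma-contraction, so the iteration converges, and the resulting value is a lower bound on the optimal value under each individual model in the set.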
Policy Gradient and Policy Improvement Methods
How to design an RL algorithm that keeps improving the policy monotonically? This motivation inspired the original CPI algorithm, and later its variants such as TRPO.
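A sketch of the CPI-style conservative update on a toy MDP (all numbers hypothetical): instead of jumping to the greedy policy, mix it in with a small step size, pi <- (1 - alpha) * pi + alpha * pi_greedy. With exact policy evaluation, every such step is a (weak) improvement:

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s'] (hypothetical)
R = np.array([[1.0, 0.0], [0.0, 2.0]])
nS, nA = R.shape

def policy_eval(pi):
    # Exact evaluation of a stochastic policy via the linear system.
    P_pi = np.einsum('sa,sab->sb', pi, P)  # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

pi = np.full((nS, nA), 0.5)                # start from the uniform policy
returns = []
for _ in range(50):
    V = policy_eval(pi)
    returns.append(V.mean())               # J(pi) under a uniform start state
    Q = R + gamma * P @ V
    greedy = np.eye(nA)[Q.argmax(axis=1)]  # deterministic greedy policy
    pi = 0.9 * pi + 0.1 * greedy           # conservative mixing, alpha = 0.1
```

The mixed policy always has nonnegative advantage w.r.t. the current one, so by the policy improvement theorem the returns are monotonically non-decreasing; CPI's actual contribution is making this work with approximate advantages.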
Interactive Imitation Learning
How to mimic a policy from expert demonstrations? A pure supervised learning approach (“behavior cloning”) may suffer from distribution shift. Interactive methods, such as DAgger, are more robust against it.
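A minimal DAgger-style sketch on a made-up 5-state chain: roll out the current learner, have the expert label the states the learner actually visits, aggregate the labels, and refit. The learner's own mistakes thus generate the training distribution:

```python
import numpy as np

# DAgger sketch on a toy 5-state chain (hypothetical environment).
nS, H = 5, 20
expert = lambda s: 1 if s < nS - 1 else 0   # expert: move right, then stay

def step(s, a):
    # Action 1 moves right, action 0 moves left, clipped to [0, nS-1].
    return min(s + 1, nS - 1) if a == 1 else max(s - 1, 0)

dataset = []                                # aggregated (state, expert action)
policy = np.zeros(nS, dtype=int)            # initial learner: always "left"
for _ in range(10):
    s = 0
    for _ in range(H):                      # roll out the *current learner*
        dataset.append((s, expert(s)))      # expert labels the visited state
        s = step(s, policy[s])
    # "Retrain": majority vote of the aggregated expert labels per state.
    for st in range(nS):
        labels = [a for (x, a) in dataset if x == st]
        if labels:
            policy[st] = round(np.mean(labels))
```

Behavior cloning would only see expert states and could derail on unvisited ones; here each iteration extends the labeled region one state further along the learner's own trajectory until the expert is matched everywhere it matters.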
Sampling-based Planning Algorithms
Solving large MDPs by sampling.
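A sketch in the spirit of sparse sampling (Kearns, Mansour & Ng): estimate Q(s, a) by recursively drawing C successor states from a generative model and truncating at a fixed depth, so the cost depends on C and the depth but not on the size of the state space (toy model, made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 2-state MDP used only through generative-model access.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s'] (hypothetical)
R = np.array([[1.0, 0.0], [0.0, 2.0]])

def sample_next(s, a):
    return rng.choice(2, p=P[s, a])        # one call to the generative model

def q_estimate(s, a, depth, C=100):
    # Sparse-sampling recursion: average over C sampled successors,
    # truncate the lookahead at the given depth.
    if depth == 0:
        return R[s, a]
    nexts = [sample_next(s, a) for _ in range(C)]
    v_next = np.mean([max(q_estimate(s2, b, depth - 1) for b in range(2))
                      for s2 in nexts])
    return R[s, a] + gamma * v_next

a_star = max(range(2), key=lambda a: q_estimate(0, a, depth=3))
```

The estimate concentrates around the depth-truncated finite-horizon Q-value; the price is a per-decision cost exponential in the depth, which is what later tree-search methods such as UCT try to spend more cleverly.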
RL & Control
In control theory, the most basic model of a dynamic environment is a linear system. How can we do RL in continuous state spaces, and what guarantees can be obtained?
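For the classic special case, the linear quadratic regulator, the optimal controller is linear and can be computed by iterating the discrete-time Riccati equation to a fixed point. A minimal sketch on a hypothetical double-integrator-like system:

```python
import numpy as np

# Infinite-horizon discrete-time LQR for x' = A x + B u,
# cost x'Qx + u'Ru; the optimal controller is linear: u = -K x.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])        # toy double-integrator-like dynamics
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

P = np.eye(2)
for _ in range(5000):             # Riccati value iteration
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)

K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal gain
```

Since (A, B) here is controllable and Q is positive definite, the iteration converges and the closed-loop matrix A - BK is stable (spectral radius below 1); model-based RL for LQR essentially estimates (A, B) from data and runs this computation on the estimates.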
Abstractions, Model-based RL, and Representation Selection
State abstraction is the simplest form of approximation, collapsing a high-dimensional state representation into a more manageable one. Understanding abstractions forms the basis for understanding more complex function approximation schemes. However, designing a good abstraction is a very difficult engineering task; can we have the algorithm automatically select among candidate representations during learning?
Factored MDPs are a tractable formulation of large-scale MDPs: the state is represented by factors, and the next value of each factor depends only on the current values of a small number of parent factors.
POMDPs and PSRs
When the Markov assumption does not hold and the world is partially observable, how can we recover a notion of state from observations and plan based on the belief?
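The standard belief update is Bayesian filtering: b'(s') is proportional to O(o | s') * sum_s T(s' | s, a) b(s). A minimal sketch on a 2-state POMDP with made-up transition and observation probabilities:

```python
import numpy as np

# Belief update in a toy 2-state POMDP (hypothetical numbers).
T = np.array([[0.9, 0.1],
              [0.1, 0.9]])        # T[s, s'] for the chosen action
O = np.array([[0.8, 0.2],
              [0.2, 0.8]])        # O[s', o]: observation likelihoods

def belief_update(b, o):
    pred = b @ T                  # predict: push the belief through T
    post = pred * O[:, o]         # correct: weight by observation likelihood
    return post / post.sum()      # renormalize to a distribution

b = np.array([0.5, 0.5])
b = belief_update(b, o=0)         # observing o=0 shifts mass toward state 0
```

The belief is a sufficient statistic of the history, so a POMDP can be recast as a fully observable MDP over beliefs; the catch is that this belief MDP has a continuous state space, which is what makes POMDP planning hard.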
Convergence analyses of Q-learning, TD, etc.
Concentration Inequalities in Dependent Processes