CS 598 Stat RL: Project Topics & References

(page reused and constantly updated over multiple semesters)

  • The references are to help you choose the course project topic, not to put constraints on you. Feel free to choose paper(s) not listed here as long as they are related to RL and have significant theoretical components.
  • The references are roughly organized into categories. You can choose 1 or more papers from the same category, or even across categories—as long as you see a strong connection and can write a coherent report based on the selected papers.
  • The lists may be incomplete and unrepresentative. You should use them as seed papers and follow their citations to read more.
  • We will do some of the topics in class. I will mark them with (*). Your report needs to be significantly different from what’s done in class. If you are unsure which parts/aspects of the papers will be covered, talk to me.
  • Some papers have been popular choices in past semesters. I will mark them with (-): you are encouraged to choose other papers. If you choose one of them anyway, you will be held to a higher standard. F20: You will need to include a novel extension of this work in your report, and your report will be mainly evaluated on that extension.

Approximate Dynamic Programming

Approximate Dynamic Programming (ADP) concerns obtaining approximate solutions to large planning problems, often with the help of sampling and function approximation. Many ADP methods can be viewed as prototypes of the popular value-based RL algorithms used today, so it is important to understand their behaviors and guarantees.

Linear Programming for MDPs

Apart from value/policy iteration, Linear Programming (LP) is another standard method for solving MDPs. The LP formulation also reveals many interesting properties of MDPs (e.g., the dual formulation has occupancy measure as its decision variables).
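
As a minimal numpy sketch of the occupancy-measure view behind the dual LP (the MDP, policy, and initial distribution below are randomly generated purely for illustration): the normalized discounted occupancy measure of a policy solves a linear flow equation, and the policy's return can be computed either from the value function (primal view) or from the occupancy measure paired with the reward (dual view).

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)  # P[a, s, s']
r = rng.random((nS, nA))
mu = np.full(nS, 1.0 / nS)                        # initial state distribution

pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)   # a random policy
P_pi = np.einsum('sa,ast->st', pi, P)             # state-to-state transitions under pi
r_pi = (pi * r).sum(axis=1)

# primal view: return computed from the value function
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
J_primal = mu @ V_pi

# dual view: normalized discounted occupancy measure solves a linear flow equation
d = (1 - gamma) * np.linalg.solve((np.eye(nS) - gamma * P_pi).T, mu)
J_dual = d @ r_pi / (1 - gamma)
```

The two views agree exactly, and d is a proper probability distribution over states, which is why the dual LP can treat occupancy measures as its decision variables.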


Projected Bellman Error Minimization

These methods directly minimize the projected Bellman error for linear value-function approximation, and have somewhat different characteristics than other TD methods, which learn from bootstrapped targets.
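
A small numpy sketch of the projected fixed point these methods target, using LSTD as the concrete solver (the features, the policy's transition matrix, and the rewards below are random, for illustration only): the LSTD weight vector makes the projected Bellman error exactly zero.

```python
import numpy as np

rng = np.random.default_rng(6)
nS, k, gamma = 6, 3, 0.9
Phi = rng.random((nS, k))                      # linear features, one row per state
P = rng.random((nS, nS)); P /= P.sum(axis=1, keepdims=True)  # P_pi of the evaluated policy
r = rng.random(nS)

d = np.full(nS, 1.0 / nS)                      # stationary distribution via power iteration
for _ in range(2000):
    d = d @ P
D = np.diag(d)

# LSTD: solve A w = b, the fixed point of the D-weighted projected Bellman operator
A = Phi.T @ D @ (Phi - gamma * P @ Phi)
b = Phi.T @ D @ r
w = np.linalg.solve(A, b)

# check: the projected Bellman error at V = Phi w is zero
Proj = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
V = Phi @ w
mspbe_resid = np.max(np.abs(Proj @ (r + gamma * P @ V) - V))
```

In the on-policy case (weighting by the stationary distribution), the matrix A is known to be invertible, which is what makes this direct solve well-posed.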

Off-policy Evaluation

How to estimate the performance of a policy using data collected from a different policy? This question has important implications for safety and real-world applications of RL.

OPE in Bandits
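
A minimal sketch of the simplest case: inverse-propensity-scoring (importance-weighted) OPE in a multi-armed bandit. The arm means and the two policies below are made-up numbers for illustration; each logged reward is reweighted by the ratio of target to behavior action probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.2, 0.5, 0.8])   # hypothetical arm means (ground truth, for checking)
beta = np.array([0.5, 0.3, 0.2])         # behavior (logging) policy
pi = np.array([0.1, 0.1, 0.8])           # target policy to be evaluated

n = 200000
a = rng.choice(3, size=n, p=beta)        # logged actions drawn from the behavior policy
rew = rng.binomial(1, true_means[a])     # logged Bernoulli rewards

# IPS estimate: reweight each logged reward by pi(a) / beta(a)
ips_estimate = np.mean(pi[a] / beta[a] * rew)
true_value = pi @ true_means             # = 0.71
```

The estimator is unbiased whenever the behavior policy puts positive probability on every action the target policy takes; its variance grows with the size of the importance weights.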

Importance Sampling for RL

Marginalized Importance Sampling and Beyond

Connection between IS and PG

RL and Causality

In batch RL we often assume that the logged actions are a function of the states, which is not necessarily true in some application scenarios. If the actions in the data were determined based on hidden factors, standard algorithms may suffer from confounding effects, and we may seek tools from causal inference to mitigate them.

PAC-RL (tabular)

PAC-MDP is a mathematical framework for studying sample complexity in RL, often with an emphasis on the challenge of exploration. PAC bounds for RL have been refined and simplified over time, and new topics beyond classical tabular RL are being explored.

Lower bounds

Spectral methods for exploration in POMDPs

Exploiting side information or structures in easy problems

Concurrent & Continuous PAC-RL

Multi-task & Life-long Learning and Learning across Population

PAC Hierarchical RL

PAC-RL with Function Approximation

Most PAC-MDP literature focuses on the tabular case, while empirical RL algorithms all rely on function approximation. What is a mathematical framework for understanding exploration in RL with function approximation?

Bellman rank

Eluder dimension

Low-rank/Linear MDPs

Lower bound

Inverse RL

In MDPs we start with a reward function. Where does it come from? In fact, it can be very difficult to specify a good reward function in practice. One way to resolve this problem is by inverse RL, where we derive a reward function algorithmically from expert demonstrations.

Reward Shaping

Here is a hard trade-off: a reward function that faithfully describes the ultimate evaluation criteria is often sparse and difficult to work with; a non-sparse reward function that gives a lot of learning signals may result in undesired behaviors. How to transform a sparse reward function into a more learnable one without sacrificing fidelity to the ultimate goal?
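
One classical answer is potential-based shaping (Ng et al., 1999): add γφ(s′) − φ(s) to the reward for an arbitrary potential function φ, which changes the learning signal but provably leaves the optimal policy unchanged. A quick numpy check on a randomly generated MDP (a sketch, not any specific paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 5, 3, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # P[s, a, s']
r = rng.random((nS, nA))
phi = rng.random(nS)                       # an arbitrary potential function

# shaped reward: r'(s, a) = r(s, a) + gamma * E[phi(s')] - phi(s)
r_shaped = r + gamma * (P @ phi) - phi[:, None]

def q_iteration(rew, iters=2000):
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        Q = rew + gamma * P @ Q.max(axis=1)
    return Q

Q_orig = q_iteration(r)
Q_shaped = q_iteration(r_shaped)
```

The greedy policies of the two solutions coincide, and the shaped optimal Q-values are exactly the original ones shifted by −φ(s).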

On Planning Horizon

In practice, the horizon / discount factor used in an RL algorithm is usually smaller than the one in the problem definition. When there is such a mismatch between the two notions of horizon, how lossy is the policy that we get?
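
A quick numerical illustration on a randomly generated MDP (sizes are made up, purely a sketch): plan greedily with a smaller discount γ_plan, then evaluate the resulting policy under the true γ_eval. One standard bound says the loss is at most (γ_eval − γ_plan) / ((1 − γ_eval)(1 − γ_plan)) · Rmax.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma_eval, gamma_plan = 6, 3, 0.99, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))            # rewards in [0, 1], so Rmax = 1

def greedy_policy(gamma, iters=3000):
    Q = np.zeros((nS, nA))
    for _ in range(iters):          # value iteration to (near) convergence
        Q = r + gamma * P @ Q.max(axis=1)
    return Q.argmax(axis=1)

def evaluate(pi, gamma):            # exact policy evaluation under the true discount
    P_pi = P[np.arange(nS), pi]
    r_pi = r[np.arange(nS), pi]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

pi_short = greedy_policy(gamma_plan)
pi_opt = greedy_policy(gamma_eval)
loss = np.max(evaluate(pi_opt, gamma_eval) - evaluate(pi_short, gamma_eval))
bound = (gamma_eval - gamma_plan) / ((1 - gamma_eval) * (1 - gamma_plan))
```

On easy random instances the realized loss is typically far below the worst-case bound, which is part of what this line of work tries to explain.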

Robustness in MDPs

Conservative approaches to MDPs: compute a policy that is robust to parameter uncertainty in the MDP, or, when the MDP is fully known, compute a policy whose return is worst-case optimal w.r.t. the randomness of the trajectory.
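
A minimal sketch of the first flavor: robust value iteration with an (s, a)-rectangular uncertainty set, where the set here is just K randomly generated candidate transition models (purely for illustration). Each backup lets an adversary pick, per state-action pair, the model that minimizes the next-state value.

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma, K = 4, 2, 0.9, 3
models = rng.random((K, nS, nA, nS))          # K candidate transition models
models /= models.sum(axis=3, keepdims=True)
r = rng.random((nS, nA))

def robust_q(ms):
    Q = np.zeros((nS, nA))
    for _ in range(1000):
        V = Q.max(axis=1)
        Q = r + gamma * (ms @ V).min(axis=0)  # adversary picks worst model per (s, a)
    return Q

V_robust = robust_q(models).max(axis=1)       # worst case over the whole set
V_nominal = robust_q(models[:1]).max(axis=1)  # a single fixed model (no uncertainty)
```

Since the adversary can always pick the nominal model, the robust value can never exceed the nominal optimal value; robustness is bought at the price of conservatism.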

Policy Gradient and Policy Improvement Methods

How to design an RL algorithm that keeps improving the policy monotonically? This motivation inspired the original CPI algorithm, and later its variants such as TRPO.

Interactive Imitation Learning

How to mimic a policy from expert demonstrations? A pure supervised-learning approach (“behavior cloning”) may suffer from distribution shift. Interactive methods, such as DAgger, are more robust to it.
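
A toy sketch of the DAgger loop on a hypothetical chain environment (the environment, expert, and table-lookup learner are all made up for illustration): roll out the current learner, query the expert on the states the learner actually visits, aggregate the labels, and refit.

```python
nS, H, iters = 6, 12, 8

def expert(s):                      # hypothetical expert on a chain: go right, stop at the end
    return 1 if s < nS - 1 else 0

def step(s, a):                     # chain dynamics: action 1 moves right, action 0 moves left
    return min(s + 1, nS - 1) if a == 1 else max(s - 1, 0)

dataset = {}                        # aggregated (state -> expert action) labels

def learner(s):
    return dataset.get(s, 0)        # table lookup; a fixed default action on unseen states

for _ in range(iters):
    # 1) roll out the CURRENT learner (not the expert), recording visited states
    traj, s = [], 0
    for _ in range(H):
        traj.append(s)
        s = step(s, learner(s))
    # 2) query the expert on exactly those states, and aggregate
    for sv in traj:
        dataset[sv] = expert(sv)
    # 3) "retrain": the table lookup reflects the aggregated dataset automatically

mismatches = sum(learner(s) != expert(s) for s in range(nS))
```

The point of the loop is that the expert is queried on the learner's own state distribution, so the learner's early mistakes determine where it gets corrected, rather than compounding as in behavior cloning.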

Sampling-based Planning Algorithms

Solving large MDPs by sampling.

RL & Control

In control theory, the most basic model of a dynamic environment is a linear system. How to do RL in continuous state space and what guarantees can be obtained?
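
The canonical solved instance is the linear-quadratic regulator (LQR). A small numpy sketch (system matrices are made-up numbers): iterate the discrete-time Riccati equation, which is exactly value iteration for quadratic cost, to obtain the optimal feedback gain, which stabilizes an open-loop-unstable system.

```python
import numpy as np

# a small open-loop-unstable linear system x' = A x + B u (made-up numbers)
A = np.array([[1.1, 0.2], [0.0, 0.95]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)                             # state cost
R = np.array([[1.0]])                     # control cost

P = np.eye(2)
for _ in range(500):                      # fixed-point iteration on the Riccati equation
    M = R + B.T @ P @ B
    P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(M, B.T @ P @ A)

K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal feedback: u = -K x
rho = max(abs(np.linalg.eigvals(A - B @ K)))        # closed-loop spectral radius
```

Here P converges to the cost-to-go matrix (the value function is x'ᵀPx'), and the closed-loop spectral radius dropping below 1 certifies stability; RL-for-control papers study what happens when A and B must instead be learned from data.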

Abstractions, Model-based RL, and Representation Selection

State abstraction is the simplest form of approximation: it collapses a high-dimensional state representation into a more manageable one. Understanding abstractions forms the basis for understanding more complex function-approximation schemes. However, designing a good abstraction is a very difficult engineering task; can the algorithm automatically select among candidate representations during learning?
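
A tiny numpy check of the cleanest case, a model-irrelevance (bisimulation-style) abstraction: ground states 1 and 2 below have identical rewards and identical transition probabilities into each abstract class (the numbers are hand-picked for illustration), so their optimal Q-values coincide and they can be safely aggregated.

```python
import numpy as np

gamma = 0.9
# ground states {0, 1, 2}; the abstraction maps states 1 and 2 to one abstract state
r = np.array([[1.0, 0.0], [0.0, 0.5], [0.0, 0.5]])
P = np.array([
    [[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]],   # from state 0
    [[0.2, 0.5, 0.3], [0.6, 0.1, 0.3]],   # from state 1
    [[0.2, 0.1, 0.7], [0.6, 0.2, 0.2]],   # from state 2: same class-level probabilities
])                                          # (mass into class {1, 2} matches state 1's)

Q = np.zeros((3, 2))
for _ in range(1000):                       # value iteration on the ground MDP
    Q = r + gamma * P @ Q.max(axis=1)
```

Coarser abstractions (e.g., preserving only Q* rather than the model) trade smaller representations for weaker guarantees, which is where representation selection gets interesting.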

Factored MDPs

A factored MDP is a tractable formulation of large-scale MDPs: the state is represented by factors, and the next value of each factor depends only on the current values of a small number of parent factors.
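
A minimal sketch of the factored transition model with three binary factors (the parent sets and conditional probability tables are made up, and the action is omitted for brevity): the joint next-state distribution is a product of small per-factor conditionals, yet it still normalizes correctly over the exponentially large state space.

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
# three binary state factors; each factor's next value depends only on its parents
parents = {0: (0,), 1: (0, 1), 2: (2,)}
cpt = {i: rng.random(2 ** len(pa)) for i, pa in parents.items()}  # P(factor_i = 1 | parents)

def joint_prob(s, s2):
    """Probability of next state s2 given state s, as a product of local factors."""
    p = 1.0
    for i, pa in parents.items():
        idx = int(''.join(str(s[j]) for j in pa), 2)  # parent assignment -> CPT index
        p1 = cpt[i][idx]
        p *= p1 if s2[i] == 1 else 1.0 - p1
    return p

states = list(itertools.product([0, 1], repeat=3))
total = sum(joint_prob((0, 1, 0), s2) for s2 in states)  # should normalize to 1
```

The representation costs only the sum of the CPT sizes rather than the full 2³ × 2³ transition matrix, which is the source of the tractability claims in this literature.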


POMDPs

When the Markov assumption does not hold and the world is partially observable, how can you recover the state from observations and plan based on the belief?
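
The core computational primitive here is the Bayes filter that maintains a belief (distribution) over the hidden state. A minimal numpy sketch, with randomly generated transition and observation models purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(8)
nS, nA, nO = 4, 2, 3
T = rng.random((nA, nS, nS)); T /= T.sum(axis=2, keepdims=True)  # T[a, s, s']
Z = rng.random((nS, nO)); Z /= Z.sum(axis=1, keepdims=True)      # Z[s', o] observation model

def belief_update(b, a, o):
    # Bayes filter: b'(s') proportional to Z(o | s') * sum_s T(s' | s, a) b(s)
    b2 = Z[:, o] * (b @ T[a])
    return b2 / b2.sum()

b = np.full(nS, 1.0 / nS)          # start from a uniform belief
b = belief_update(b, a=0, o=1)
```

Planning then happens in belief space, which is continuous even when the underlying state space is finite; that is what makes POMDP planning hard.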

Stochastic Approximation

Convergence analyses of Q-learning, TD, etc.
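
A quick empirical sanity check of the classical result (a sketch on a tiny, randomly generated MDP): asynchronous Q-learning with Robbins–Monro step sizes α = 1/n(s, a) converges to Q* under sufficient exploration, here uniform random actions on an ergodic chain.

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma = 3, 2, 0.5
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

# ground-truth Q* via value iteration
Q_star = np.zeros((nS, nA))
for _ in range(500):
    Q_star = r + gamma * P @ Q_star.max(axis=1)

# asynchronous Q-learning: uniform exploration, step size 1 / visit-count
Q = np.zeros((nS, nA))
count = np.zeros((nS, nA))
s = 0
for _ in range(200000):
    a = rng.integers(nA)
    s2 = rng.choice(nS, p=P[s, a])
    count[s, a] += 1
    td_error = r[s, a] + gamma * Q[s2].max() - Q[s, a]
    Q[s, a] += td_error / count[s, a]
    s = s2

max_err = np.max(np.abs(Q - Q_star))
```

The 1/n step sizes satisfy the Robbins–Monro conditions (squares summable, sums divergent), which is exactly the setting of the stochastic-approximation analyses; in practice they are known to converge quite slowly, which the analyses also quantify.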

Concentration Inequalities in Dependent Processes