CS 542 Stat RL: Project Topics & References

(This page is reused and continually updated across semesters; the course was previously taught under the number 598.)

Last Updated: 09/16/2023

  • The references are to help you choose the course project topic, not to put constraints on you. Feel free to choose paper(s) not listed here as long as they are related to RL and have significant theoretical components.
  • The references are roughly organized into categories. You can choose 1 or more papers from the same category, or even across categories—as long as you see a strong connection and can write a coherent report based on the selected papers.
  • The lists can be incomplete and not representative. You should use them as seed papers and track the citations to read more.
  • We will do some of the topics in class. I will mark them with (*). Your report needs to be significantly different from what’s done in class. If you are unsure which parts/aspects of the papers will be covered, talk to me.
  • Some papers have been popular choices in past semesters; I will mark them with (-). You are encouraged to choose other papers. If you do choose one of them, you will be held to a higher standard. F22: you will need to include a novel extension of the work in your report, based on which your report will be mainly evaluated.

Approximate Dynamic Programming and Offline RL

Approximate Dynamic Programming (ADP) concerns obtaining approximate solutions to large planning problems, often with the help of sampling and function approximation. Many ADP methods can be considered prototype algorithms for popular value-based RL algorithms used today, especially in the offline setting, so it is important to understand their behaviors and guarantees.
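As a concrete instance of such a prototype algorithm, here is a minimal sketch of fitted Q-iteration on a made-up 2-state, 2-action MDP with a fixed offline dataset; the "function approximator" is just a tabular lookup, so each least-squares fit is exact:

```python
import numpy as np

# Toy fitted Q-iteration (a prototypical ADP / offline RL method).
# The dataset of (s, a, r, s') tuples is fixed and assumed to cover
# every state-action pair; no further interaction is allowed.
gamma = 0.9
n_states, n_actions = 2, 2
data = [(0, 0, 0.0, 0), (0, 1, 1.0, 1),
        (1, 0, 0.0, 0), (1, 1, 2.0, 1)]

Q = np.zeros((n_states, n_actions))
for _ in range(100):
    # Regression targets: r + gamma * max_a' Q(s', a')
    targets = {(s, a): r + gamma * Q[s2].max() for s, a, r, s2 in data}
    Q_new = np.zeros_like(Q)
    for (s, a), y in targets.items():
        Q_new[s, a] = y          # exact "fit" in the tabular case
    Q = Q_new

print(np.round(Q, 2))
```

With function approximation the fit is no longer exact, and the behavior of the repeated regress-and-bootstrap loop is precisely what many of the papers in this category analyze.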

Offline RL without exploratory data

Online + Offline (Hybrid)

Lower bounds

Off-policy Evaluation

How to estimate the performance of a policy using data collected by a different policy? This question has important implications for safety and real-world applications of RL.
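The basic importance-sampling estimator is easy to state in the bandit case. Below is a minimal sketch with made-up behavior and target policies (no context, so it is a multi-armed bandit for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3

# Hypothetical behavior policy (which logged the data) and target
# policy (which we want to evaluate), plus the true mean rewards.
mu = np.array([0.5, 0.3, 0.2])           # behavior policy
pi = np.array([0.1, 0.1, 0.8])           # target policy
true_reward = np.array([1.0, 0.5, 0.2])  # mean reward per action

# Log data under mu.
n = 100_000
actions = rng.choice(n_actions, size=n, p=mu)
rewards = rng.binomial(1, true_reward[actions])

# Importance-sampling estimate of the target policy's value:
# V(pi) ~= mean over logged samples of (pi(a)/mu(a)) * r.
weights = pi[actions] / mu[actions]
v_is = np.mean(weights * rewards)

v_true = np.dot(pi, true_reward)  # ground truth, = 0.31 here
print(v_is, v_true)
```

The estimator is unbiased but its variance grows with the mismatch between pi and mu; in sequential RL the weights multiply over time steps, which is the starting point for the references below.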

OPE in Bandits

Importance Sampling for RL

Marginalized Importance Sampling and Beyond


Minimizing the Projected Bellman Error

These methods directly minimize the projected Bellman error for linear value-function approximation, and have somewhat different characteristics compared to other TD methods that learn from bootstrapped targets.
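To make "projected Bellman error" concrete (standard notation, not tied to any particular paper here):

```latex
% Linear value-function approximation V_\theta = \Phi\theta. With \Pi the
% projection onto \mathrm{span}(\Phi) weighted by the state distribution D,
% these methods minimize the (mean-squared) projected Bellman error:
\min_\theta \; \big\| \Pi\, T^{\pi} V_\theta - V_\theta \big\|_{D}^{2}
% The minimizer solves a linear system A\theta = b, which LSTD estimates
% directly from data:
A = \Phi^\top D\, (\Phi - \gamma P^{\pi} \Phi), \qquad b = \Phi^\top D\, r
```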

Multi-agent RL




Partially Observable MDPs (POMDPs)

When the Markov assumption does not hold and the world is partially observable, how can you recover the state from observations and plan based on a belief?
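A standard way to make this precise (generic POMDP notation, not from any specific paper below): given a belief b over states, after taking action a and observing o, Bayes' rule gives the updated belief:

```latex
% Belief update in a POMDP with transitions P(s'|s,a) and
% observation model O(o|s',a):
b_{a,o}(s') \;=\; \frac{O(o \mid s', a) \sum_{s} P(s' \mid s, a)\, b(s)}
                       {\sum_{s''} O(o \mid s'', a) \sum_{s} P(s'' \mid s, a)\, b(s)}
% Planning then treats the belief as the (continuous) state of a
% "belief MDP" and computes values over beliefs.
```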



Sampling-based Planning Algorithms

Computing a near-optimal policy in a large MDP by sampling from a generative model of the environment.
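Here is a minimal sketch in the spirit of sparse sampling: estimate Q-values at a state by drawing C next-state samples per action from a generative model and recursing to depth H. The toy generative model below is made up for illustration (and happens to be deterministic):

```python
import random

# Sparse-sampling-style planner: per-action sample budget C, horizon H.
gamma, C, H = 0.9, 20, 3
actions = [0, 1]

def model(s, a):
    """Generative model: sample (reward, next_state). Toy 2-state chain
    where action 1 earns reward 1.0 only from state 1."""
    if a == 1:
        return (1.0 if s == 1 else 0.0), 1
    return 0.0, 0

def q_estimate(s, a, h):
    if h == 0:
        return 0.0
    total = 0.0
    for _ in range(C):
        r, s2 = model(s, a)
        total += r + gamma * max(q_estimate(s2, b, h - 1) for b in actions)
    return total / C

random.seed(0)
print([round(q_estimate(1, a, H), 2) for a in actions])  # → [0.81, 2.71]
```

The key property analyzed in this line of work is that the sample complexity depends on C and H but not on the number of states, at the price of exponential-in-horizon running time.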

PAC-RL with Function Approximation

Structural Assumptions (Bellman rank, Eluder dimension, etc.)

Latent State Discovery

Low-rank/Linear MDPs

Thompson sampling

Lower bounds

PAC-RL (tabular)

Lower bounds

Exploiting side information or structures in easy problems

Concurrent/“Low-Switch”/“Deployment-efficient” PAC-RL

Multi-task & Life-long Learning and Learning across Population

PAC Hierarchical RL

Linear Programming for MDPs

Apart from value/policy iteration, Linear Programming (LP) is another standard method for solving MDPs. The LP formulation also reveals many interesting properties of MDPs (e.g., the dual formulation has occupancy measure as its decision variables).
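The two formulations, in standard notation (up to normalization conventions; μ is any positive weighting over states, often the initial distribution):

```latex
% Primal LP for a discounted MDP (decision variable V):
\min_{V} \;\sum_{s} \mu(s)\, V(s)
\quad \text{s.t.} \quad
V(s) \;\ge\; r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s')
\;\;\; \forall s, a
% Dual LP: the decision variable d(s,a) is the discounted occupancy
% measure, and the constraints enforce "flow conservation" over states:
\max_{d \ge 0} \;\sum_{s,a} d(s,a)\, r(s,a)
\quad \text{s.t.} \quad
\sum_{a} d(s',a) \;=\; \mu(s') + \gamma \sum_{s,a} P(s' \mid s, a)\, d(s,a)
\;\;\; \forall s'
```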

Entropy Regularization

Decision-aware model-based RL

RL and Causality

In batch RL we often assume that the logged actions are a function of the states, which is not necessarily true in some application scenarios. If the actions in the data were determined based on hidden factors, standard algorithms may suffer from confounding effects, and we may seek tools from causal inference to mitigate them.

Robustness in MDPs

Conservative approaches to MDPs: compute a policy that is robust to parameter uncertainty in the MDP, or, when the MDP is fully known, compute a policy whose return is worst-case optimal w.r.t. the randomness of the trajectory.
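The parameter-uncertainty version is usually phrased via a robust Bellman operator (standard formulation, assuming a rectangular uncertainty set so the inner minimization decomposes per state-action pair):

```latex
% Robust Bellman operator for an uncertainty set U(s,a) of transition
% models; the policy maximizes the worst case over U:
(T_U V)(s) \;=\; \max_{a} \; \min_{P \,\in\, U(s,a)}
\Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]
```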

Policy Gradient and Policy Improvement Methods

How to design an RL algorithm that keeps improving the policy monotonically? This motivation inspired the original CPI algorithm and, later, variants such as TRPO.
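The core ingredients, sketched in standard notation: CPI takes a small, conservative step toward a greedy policy, and the analysis rests on the performance difference lemma relating two policies' returns through the advantage function.

```latex
% CPI's conservative update: mix the current policy with an
% (approximately) greedy policy \pi', using a small step size \alpha:
\pi_{k+1} = (1-\alpha)\,\pi_k + \alpha\,\pi'
% Performance difference lemma:
J(\pi') - J(\pi) = \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi'}}\, \mathbb{E}_{a \sim \pi'(\cdot \mid s)}
  \big[ A^{\pi}(s,a) \big]
% Choosing \alpha small enough guarantees monotone improvement despite
% the mismatch between d^{\pi_{k+1}} and d^{\pi_k}.
```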

Interactive Imitation Learning

How to mimic a policy from expert demonstrations? A pure supervised learning approach (“behavior cloning”) may suffer from distribution shift. Interactive methods, such as DAgger, are more robust against it.
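A minimal sketch of the DAgger loop on a toy chain environment (all specifics here are made up for illustration): the learner rolls out its own policy, the expert labels the visited states, and the aggregated dataset is refit, so the training distribution tracks the states the learner actually reaches, unlike behavior cloning.

```python
import random
from collections import defaultdict, Counter

random.seed(0)
n_states = 10

def expert(s):            # hypothetical expert: always move right
    return 1

def step(s, a):           # chain dynamics: action 1 = right, 0 = left
    return min(n_states - 1, s + 1) if a == 1 else max(0, s - 1)

dataset = defaultdict(Counter)   # state -> counts of expert labels

def policy(s):
    # Predict the majority expert label; act randomly on unseen states.
    return dataset[s].most_common(1)[0][0] if dataset[s] else random.choice([0, 1])

for iteration in range(5):
    s = 0
    for t in range(20):              # roll out the current learner policy
        dataset[s][expert(s)] += 1   # expert labels the *visited* state
        s = step(s, policy(s))

# The learner now imitates the expert on its own state distribution.
print(all(policy(s) == expert(s) for s in dataset))  # → True
```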

RL & Control

In control theory, the most basic model of a dynamic environment is a linear system. How to do RL in continuous state spaces, and what guarantees can be obtained?
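The canonical instance is the linear-quadratic regulator (LQR), stated here in standard notation as a reference point for this line of work:

```latex
% LQR: linear dynamics and quadratic cost,
x_{t+1} = A x_t + B u_t + w_t, \qquad
\text{cost} = \sum_t \big( x_t^\top Q x_t + u_t^\top R u_t \big)
% The optimal controller is linear state feedback u_t = -K x_t, with
K = (R + B^\top P B)^{-1} B^\top P A,
% where P solves the discrete algebraic Riccati equation:
P = Q + A^\top P A - A^\top P B\, (R + B^\top P B)^{-1} B^\top P A
% RL-flavored questions then ask how well K can be learned when A, B
% are unknown, and at what sample cost.
```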

Abstractions, Model-based RL, and Representation Selection

State abstraction is one of the simplest forms of approximation: it collapses a high-dimensional state representation into a more manageable one. An understanding of abstractions forms the basis for understanding more complex function-approximation schemes. However, designing a good abstraction is a very difficult engineering task; can we have the algorithm automatically select among candidate representations during learning?
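One standard notion from this literature, a model-irrelevance (bisimulation-style) abstraction:

```latex
% An abstraction \phi maps states to abstract states. Model irrelevance
% requires that states mapped together agree on rewards and on
% aggregated transition probabilities:
\phi(s_1) = \phi(s_2) \;\Longrightarrow\;
R(s_1, a) = R(s_2, a)
\;\text{ and }\;
\sum_{s' \in \phi^{-1}(\bar{s}')} P(s' \mid s_1, a)
= \sum_{s' \in \phi^{-1}(\bar{s}')} P(s' \mid s_2, a)
\quad \forall a, \bar{s}'
```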

Factored MDPs

The factored MDP is a tractable formulation of large-scale MDPs: the state is represented by factors, and the future value of each factor depends only on the current values of a small number of parent factors.
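In standard notation, the transition model factorizes as:

```latex
% The state has n factors s = (s_1, \ldots, s_n); each factor's next
% value depends only on its parent factors \mathrm{Pa}_i and the action:
P(s' \mid s, a) \;=\; \prod_{i=1}^{n}
  P_i\big( s'_i \;\big|\; s[\mathrm{Pa}_i],\, a \big)
% The model thus needs only exponentially fewer parameters than a
% full tabular transition matrix when each |\mathrm{Pa}_i| is small.
```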

Stochastic Approximation

Convergence analyses of Q-learning, TD, etc.
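The prototypical result in this category concerns the Q-learning update viewed as a stochastic approximation scheme:

```latex
% Q-learning update on transition (s_t, a_t, r_t, s_{t+1}):
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t)
  + \alpha_t \Big[ r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \Big]
% Almost-sure convergence to Q^* holds under the Robbins-Monro
% step-size conditions (with every (s,a) pair updated infinitely often):
\sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty
```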

Concentration Inequalities in Dependent Processes