(page reused and constantly updated over multiple semesters; the course was previously taught under the number 598)

**Last Updated: 09/16/2023**

- The references are to help you choose the course project topic, not to put constraints on you. Feel free to choose paper(s) not listed here as long as they are related to RL and have significant theoretical components.
- The references are roughly organized into categories. You can choose 1 or more papers from the same category, or even across categories—as long as you see a strong connection and can write a coherent report based on the selected papers.
- The lists can be incomplete and not representative. You should use them as seed papers and track the citations to read more.
- **We will do some of the topics in class.** I will mark them with (*). Your report needs to be significantly different from what’s done in class. If you are unsure which parts/aspects of the papers will be covered, talk to me.
- **Some papers have been popular choices in past semesters.** I will mark them with (-): you are encouraged to choose other papers. If you have to choose one of them, you will be held to higher standards. **F22: You will need to include a novel extension of this work in your report, based on which your report will be mainly evaluated.**

**Approximate Dynamic Programming and Offline RL**

Approximate Dynamic Programming (ADP) concerns obtaining approximate solutions to large planning problems, often with the help of sampling and function approximation. Many ADP methods can be considered prototypes of the popular value-based RL algorithms used today, especially in the offline setting, so it is important to understand their behaviors and guarantees.
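
As a warm-up, here is a minimal sketch of fitted Q-iteration, the prototype behind many offline value-based methods: regress onto bootstrapped targets over a fixed dataset. The two-state MDP and all numbers are hypothetical, and the "regression" is just a per-(s, a) average (the tabular special case).

```python
import numpy as np

# Hypothetical toy MDP: 2 states, 2 actions; dynamics used only to generate data.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s']
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0], [0.5, 2.0]])      # R[s, a]
gamma = 0.9
rng = np.random.default_rng(0)

# Offline dataset of (s, a, r, s') tuples with uniform data distribution.
data = []
for _ in range(5000):
    s, a = rng.integers(2), rng.integers(2)
    s_next = rng.choice(2, p=P[s, a])
    data.append((s, a, R[s, a], s_next))

# Fitted Q-iteration: repeatedly regress onto targets r + gamma * max_a' Q(s', a').
Q = np.zeros((2, 2))
for _ in range(100):
    targets = np.zeros((2, 2))
    counts = np.zeros((2, 2))
    for s, a, r, s_next in data:
        targets[s, a] += r + gamma * Q[s_next].max()
        counts[s, a] += 1
    Q = targets / np.maximum(counts, 1)     # tabular "regression" = per-(s,a) mean
```

With function approximation in place of the per-cell average, the error-propagation questions studied in the papers below (concentrability, completeness, realizability) determine whether this loop is stable.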

- (*) Performance bounds in L_p norm for approximate value iteration
- Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
- Error propagation for approximate policy and value iteration
- Performance Loss Bounds for Approximate Value Iteration with State Aggregation
- (*) Information-Theoretic Considerations in Batch Reinforcement Learning
- Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison
- Batch Value-function Approximation with Only Realizability

*Offline RL without exploratory data*

- Provably good batch off-policy reinforcement learning without great exploration
- Is Pessimism Provably Efficient for Offline RL?
- Bellman-consistent Pessimism for Offline Reinforcement Learning
- Pessimistic Model-based Offline RL: PAC Bounds and Posterior Sampling under Partial Coverage
- Adversarially Trained Actor Critic for Offline Reinforcement Learning

*Online + Offline (Hybrid)*

- Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning
- Policy Finetuning in Reinforcement Learning via Design of Experiments using Offline Data
- Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient

*Lower bounds*

- Offline reinforcement learning: Fundamental barriers for value function approximation
- What are the Statistical Limits of Offline RL with Linear Function Approximation?
- Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL Can Be Exponentially Harder than Online RL

**Off-policy Evaluation**

How to estimate the performance of a policy using data collected from a different policy? This question has important implications for safety and for real-world applications of RL.
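
The simplest instance of the problem is the bandit case, where inverse-propensity (importance sampling) weighting gives an unbiased estimate. A minimal sketch with hypothetical reward means and policies:

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions = 3
true_rewards = np.array([0.2, 0.5, 0.8])   # hypothetical mean rewards
behavior = np.array([0.5, 0.3, 0.2])        # logging (behavior) policy
target = np.array([0.1, 0.2, 0.7])          # policy we want to evaluate

# Log bandit data under the behavior policy.
n = 20000
actions = rng.choice(n_actions, size=n, p=behavior)
rewards = rng.binomial(1, true_rewards[actions])

# Inverse-propensity estimate of the target policy's value:
# reweight each logged reward by target(a) / behavior(a).
weights = target[actions] / behavior[actions]
ips_estimate = np.mean(weights * rewards)

true_value = float(target @ true_rewards)   # ground truth for comparison
```

The RL versions of the problem below inherit this estimator but must contend with weights that multiply over the horizon, which motivates the marginalized and doubly robust approaches.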

*OPE in Bandits*

- Effective Evaluation Using Logged Bandit Feedback from Multiple Loggers
- Balanced Policy Evaluation and Learning
- Optimal and Adaptive Off-Policy Evaluation in Contextual Bandits
- Adaptive Estimator Selection for Off-Policy Evaluation

*Importance Sampling for RL*

- (*) Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
- (-) Data-efficient off-policy policy evaluation for reinforcement learning

*Marginalized Importance Sampling and Beyond*

- (*) Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
- Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
- (-) DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
- (*) Minimax Weight and Q-Function Learning for Off-Policy Evaluation
- Reinforcement Learning via Fenchel-Rockafellar Duality
- Accountable Off-Policy Evaluation With Kernel Bellman Statistics
- Minimax Value Interval for Off-Policy Evaluation and Policy Optimization
- Doubly Robust Bias Reduction in Infinite Horizon Off-Policy Estimation
- Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions
- Offline Reinforcement Learning with Realizability and Single-policy Concentrability

**LSTD and LSPI**

These methods directly minimize the projected Bellman error for linear value-function approximation, and have somewhat different characteristics from other TD methods that learn from bootstrapped targets.
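
A minimal sketch of LSTD for policy evaluation: accumulate the normal equations A w = b from transitions, with A = Σ φ(s)(φ(s) − γφ(s'))ᵀ and b = Σ φ(s) r, then solve once. The chain MDP and one-hot features are hypothetical (with one-hot features LSTD recovers the exact values).

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.9

# Hypothetical 4-state chain under a fixed policy: move right w.p. 0.8, wrap around.
n_states = 4
P = np.zeros((n_states, n_states))
for s in range(n_states):
    P[s, (s + 1) % n_states] = 0.8
    P[s, s] = 0.2
r = np.array([0.0, 0.0, 0.0, 1.0])

# Linear features; one-hot here, so the projection is onto the full space.
phi = np.eye(n_states)

# Accumulate the LSTD system A w = b from a single long trajectory.
A = np.zeros((n_states, n_states))
b = np.zeros(n_states)
s = 0
for _ in range(20000):
    s_next = rng.choice(n_states, p=P[s])
    A += np.outer(phi[s], phi[s] - gamma * phi[s_next])
    b += phi[s] * r[s]
    s = s_next
w = np.linalg.solve(A, b)   # fixed point of the projected Bellman equation
```

Unlike incremental TD, there is no step size to tune: the solution is the projected fixed point in one linear solve, which is why LSTD's behavior (and its off-policy fixed points) can be characterized so sharply.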

- Finite-sample analysis of least-squares policy iteration
- LSTD with Random Projections
- The fixed points of off-policy TD
- A Sharp Characterization of Linear Estimators for Offline Policy Evaluation
- The Optimal Approximation Ratios in Misspecified Off-Policy Value Function Estimation

**Multi-agent RL**

*Online*

*Offline*

- When is Offline Two-Player Zero-Sum Markov Game Solvable?
- Provably Efficient Offline Multi-agent Reinforcement Learning via Strategy-wise Bonus
- Offline Learning in Markov Games with General Function Approximation

**POMDPs and PSRs**

When the Markov assumption does not hold and the world is partially observable, how can you recover *state* from observations and plan based on the *belief*?
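
The basic primitive is the Bayesian belief update: given the current belief b, transition model T, and observation model O, the posterior is b'(s') ∝ O(o | s') Σ_s T(s' | s) b(s). A minimal sketch on a hypothetical 2-state problem (tiger-flavored numbers, not from any listed paper):

```python
import numpy as np

# Hypothetical 2-state POMDP under a "listen" action: state unchanged,
# observation reveals the true state with accuracy 0.85.
T = np.array([[1.0, 0.0], [0.0, 1.0]])      # T[s, s']
O = np.array([[0.85, 0.15], [0.15, 0.85]])  # O[s', o]

def belief_update(b, obs):
    """Bayes rule: b'(s') is proportional to O[s', obs] * sum_s T[s, s'] * b(s)."""
    predicted = b @ T
    unnormalized = predicted * O[:, obs]
    return unnormalized / unnormalized.sum()

b = np.array([0.5, 0.5])
for obs in [0, 0, 0]:        # observe o = 0 three times in a row
    b = belief_update(b, obs)
```

Planning methods like point-based value iteration operate on exactly this belief simplex, exploiting the piecewise-linear convex structure of the value function over beliefs.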

*Planning*

- Point-based value iteration: An anytime algorithm for POMDPs
- Perseus: Randomized Point-based Value Iteration for POMDPs
- Closing the learning–planning loop with predictive state representations
- Hilbert Space Embeddings of Hidden Markov Models
- Low-Rank Spectral Learning with Weighted Loss Functions

*Learning*

- Reinforcement Learning of POMDPs using Spectral Methods
- A PAC RL Algorithm for Episodic POMDPs
- Sample-Efficient Reinforcement Learning of Undercomplete POMDPs
- Future-Dependent Value-Based Off-Policy Evaluation in POMDPs
- Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems
- Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency
- When Is Partially Observable Reinforcement Learning Not Scary?
- PAC Reinforcement Learning for Predictive State Representations
- Learning in POMDPs is Sample-Efficient with Hindsight Observability
- Optimistic MLE: A Generic Model-Based Algorithm for Partially Observable Sequential Decision Making

**Sampling-based Planning Algorithms**

Computing a near-optimal policy in large MDPs by sampling from a generative model of the environment.
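
A minimal sketch of the sparse-sampling idea: estimate Q(s, a) by averaging C sampled next states from the generative model and recursing to a fixed depth H. The cost is exponential in H but independent of the size of the state space. The toy MDP and all constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma = 0.8

# Hypothetical generative model: 2 states, 2 actions.
def generative_model(s, a):
    P = [[[0.9, 0.1], [0.1, 0.9]], [[0.5, 0.5], [0.1, 0.9]]]
    R = [[0.0, 0.0], [0.0, 1.0]]
    return rng.choice(2, p=P[s][a]), R[s][a]

def estimate_q(s, a, depth, C=20):
    """Sparse-sampling-style recursion: average C sampled next states,
    back up a max over actions, and recurse to the given depth."""
    if depth == 0:
        return 0.0
    total = 0.0
    for _ in range(C):
        s_next, r = generative_model(s, a)
        v_next = max(estimate_q(s_next, b, depth - 1, C) for b in range(2))
        total += r + gamma * v_next
    return total / C

q = estimate_q(1, 1, depth=3)
```

Note the recursion touches only sampled states, which is what makes the per-call complexity independent of |S|; the papers below sharpen this tradeoff and extend it to linear function classes.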

- (-) A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes
- PEGASUS: A policy search method for large MDPs and POMDPs
- Sample complexity of policy search with known dynamics
- Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning
- Dual Policy Iteration
- Efficient Planning in Large MDPs with Weak Linear Function Approximation
- Learning with Good Feature Representations in Bandits and in RL with a Generative Model
- On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function
- Exponential Lower Bounds for Planning in MDPs with Linearly-Realizable Optimal Action-Value Functions
- TensorPlan and the Few Actions Lower Bound for Planning in MDPs under Linear Realizability of Optimal Value Functions

**PAC-RL with Function Approximation**

*Structural Assumptions (Bellman rank, Eluder dimension, etc.)*

- (*) Contextual Decision Processes with Low Bellman Rank are PAC-Learnable
- On Oracle-Efficient PAC Reinforcement Learning with Rich Observations
- Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches
- Bilinear Classes: A Structural Framework for Provable Generalization in RL
- Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms
- Non-Linear Reinforcement Learning in Large Action Spaces: Structural Conditions and Sample-efficiency of Posterior Sampling
- Model-based reinforcement learning and the eluder dimension
- Efficient exploration and value function generalization in deterministic systems
- The Role of Coverage in Online Reinforcement Learning

*Latent State Discovery*

- Provably Efficient RL with Rich Observations via Latent State Decoding
- Kinematic state abstraction and provably efficient rich-observation reinforcement learning
- Sample-Efficient Reinforcement Learning in the Presence of Exogenous Information
- Provable rich observation reinforcement learning with combinatorial latent states

*Low-rank/Linear MDPs*

- (*) Provably efficient reinforcement learning with linear function approximation
- Frequentist regret bounds for randomized least-squares value iteration
- PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning
- FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs
- Sample complexity of reinforcement learning using linearly combined model ensembles
- Model-free Representation Learning and Exploration in Low-Rank MDPs
- Representation Learning for Online and Offline RL in Low-Rank MDPs
- Efficient Model-Free Exploration in Low-Rank MDPs
- Reinforcement Learning in Low-Rank MDPs with Density Features

*Thompson sampling*

*Lower bounds*

- Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?
- The Statistical Complexity of Interactive Decision Making

**PAC-RL (tabular)**

- Near-optimal regret bounds for reinforcement learning
- Sample complexity of episodic fixed-horizon reinforcement learning
- Minimax regret bounds for reinforcement learning
- Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
- Policy Certificates: Towards Accountable Reinforcement Learning
- Optimistic posterior sampling for reinforcement learning: worst-case regret bounds
- Regret Bounds for Reinforcement Learning via Markov Chain Concentration
- Is Q-learning Provably Efficient?
- A Unifying View of Optimism in Episodic Reinforcement Learning
- Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

*Lower bounds*

*Exploiting side information or structures in easy problems*

- Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs
- When Simple Exploration is Sample Efficient: Identifying Sufficient Conditions for Random Exploration to Yield PAC RL Algorithms
- Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation
- Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

*Concurrent / "Low-Switch" / "Deployment-Efficient" PAC-RL*

- Concurrent PAC RL
- Provably Efficient Q-Learning with Low Switching Cost
- Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality

*Multi-task & Life-long Learning and Learning across Population*

- Sample complexity of multi-task reinforcement learning
- PAC-inspired Option Discovery in Lifelong Reinforcement Learning
- Markov Decision Processes with Continuous Side Information

*PAC Hierarchical RL*

**Linear Programming for MDPs**

Apart from value/policy iteration, Linear Programming (LP) is another standard method for solving MDPs. The LP formulation also reveals many interesting properties of MDPs (e.g., the dual formulation has occupancy measure as its decision variables).
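
A minimal sketch of the primal LP for a toy MDP (hypothetical numbers): minimize Σ_s V(s) subject to V(s) ≥ R(s, a) + γ P(·|s, a) · V for every (s, a). The optimal solution is exactly V*, and the dual variables of the binding constraints form the optimal occupancy measure.

```python
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])

# Each constraint V(s) >= R(s,a) + gamma * P(.|s,a) @ V is rewritten as
# (gamma * P(.|s,a) - e_s) @ V <= -R(s,a) for the solver.
A_ub, b_ub = [], []
for s in range(2):
    for a in range(2):
        row = gamma * P[s, a].copy()
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c=np.ones(2), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * 2)
V = res.x   # optimal state values V*
```

For large state spaces one restricts V to a linear architecture and samples constraints, which is where the approximate-LP literature below begins.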

- (-) The Linear Programming Approach to Approximate Dynamic Programming
- On constraint sampling in the linear programming approach to approximate dynamic programming
- A Linearly Relaxed Approximate Linear Program for Markov Decision Processes
- Scalable Bilinear π Learning Using State and Action Features

**Entropy Regularization**

- Bridging the Gap Between Value and Policy Based Reinforcement Learning
- Equivalence Between Policy Gradients and Soft Q-Learning
- A unified view of entropy-regularized markov decision processes
- Logistic Q-learning

**Decision-aware model-based RL**

- Value-aware loss function for model-based reinforcement learning
- Iterative Value-Aware Model Learning
- Minimax Model Learning

**RL and Causality**

In batch RL we often assume that the logged actions were chosen as a function of the observed states, which is not necessarily true in some application scenarios. If the actions in the data were determined based on hidden factors, standard algorithms may suffer from confounding effects, and we may seek tools from causal inference to mitigate them.
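
The bias is easy to exhibit in simulation. In this sketch (entirely hypothetical numbers), a hidden factor u drives both the logged action and the reward; a naive importance-sampling estimate that uses the marginal propensity P(a = 1) overestimates the value of "always play a = 1".

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100000

# Hypothetical confounded logging process: u influences both action and reward,
# but u is not recorded in the dataset.
u = rng.binomial(1, 0.5, size=n)            # hidden confounder
a = rng.binomial(1, 0.2 + 0.6 * u)          # behavior policy "peeks" at u
r = a + u + rng.normal(0, 0.1, size=n)      # reward depends on a and u

# Target policy: always play a = 1. Its true value is E[1 + u] = 1.5.
true_value = 1.5

# Naive IS with the marginal propensity P(a=1) = 0.5 as the weight:
# confounding makes this estimate biased upward (it converges to 1.8).
naive_ips = np.mean((a == 1) * r / 0.5)
```

The intuition: logged plays of a = 1 disproportionately come from the high-reward u = 1 world, so reweighting by the marginal propensity credits the action with reward that the confounder caused.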

- Confounding-Robust Policy Improvement
- Off-Policy Evaluation in Partially Observable Environments
- Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders

**Robustness in MDPs**

Conservative approaches to MDPs: compute a policy that is robust to uncertainty in the MDP parameters, or, when the MDP is fully known, compute a policy whose return is worst-case optimal w.r.t. the randomness of the trajectory.
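
A minimal sketch of robust value iteration under an L1 uncertainty set around a nominal model (all numbers hypothetical): each backup takes the worst case over transition rows within distance eps, which for L1 balls amounts to shifting eps/2 probability mass from the best next state to the worst.

```python
import numpy as np

gamma = 0.9
# Hypothetical nominal model plus an L1 uncertainty budget per row.
P_nominal = np.array([[[0.9, 0.1], [0.2, 0.8]],
                      [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])
eps = 0.1   # each transition row may move at most eps in L1 distance

def worst_case_expectation(p, v, eps):
    """min of q @ v over distributions q with ||q - p||_1 <= eps:
    shift eps/2 mass from the highest-value state to the lowest-value one
    (valid here since p[argmax] >= eps/2)."""
    q = p.copy()
    lo, hi = np.argmin(v), np.argmax(v)
    shift = min(eps / 2, q[hi])
    q[hi] -= shift
    q[lo] += shift
    return q @ v

# Robust value iteration: back up against the adversarial model each step.
V = np.zeros(2)
for _ in range(500):
    V = np.array([max(R[s, a] + gamma * worst_case_expectation(P_nominal[s, a], V, eps)
                      for a in range(2)) for s in range(2)])
```

The inner minimization has a closed form for simple ambiguity sets like this one; tighter, data-driven sets are the subject of the Bayesian ambiguity-set paper below.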

- Robust Control of Markov Decision Processes with Uncertain Transition Matrices
- RAAM: The Benefits of Robustness in Approximating Aggregated MDPs in Reinforcement Learning
- Risk-sensitive and robust decision-making: a CVaR optimization approach
- Tight Bayesian Ambiguity Sets for Robust MDPs

**Policy Gradient and Policy Improvement Methods**

How to design an RL algorithm that keeps improving the policy monotonically? This motivation inspired the original CPI algorithm, and later its variants such as TRPO.
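
A toy illustration of the conservative-update idea behind CPI (a sketch with hypothetical MDP numbers, not the algorithm from any listed paper): mix a small amount of the greedy policy into the current one. With exact evaluation, each mixed policy has nonnegative advantage everywhere, so the value improves monotonically.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])   # P[s, a, s']
R = np.array([[0.2, 0.0], [0.0, 1.0]])

def evaluate(pi):
    """Exact V^pi for a stochastic policy pi[s, a]."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, R)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Conservative policy iteration step: pi <- (1 - alpha) * pi + alpha * greedy.
pi = np.full((2, 2), 0.5)
values = [evaluate(pi).mean()]
for _ in range(50):
    V = evaluate(pi)
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    greedy = np.eye(2)[Q.argmax(axis=1)]    # deterministic greedy policy
    alpha = 0.1                             # small, conservative step size
    pi = (1 - alpha) * pi + alpha * greedy
    values.append(evaluate(pi).mean())
```

With approximate evaluation the step size alpha must trade off improvement against estimation error, which is exactly the tension CPI and TRPO formalize.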

- (-) Approximately Optimal Approximate Reinforcement Learning
- (-) Trust Region Policy Optimization
- Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes
- Politex: Regret bounds for policy iteration using expert prediction
- Exploration-Enhanced Politex

**Interactive Imitation Learning**

How to mimic a policy from expert demonstrations? A pure supervised learning approach (“behavior cloning”) may suffer from distribution shift. Interactive methods, such as DAgger, are more robust against it.
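
A minimal sketch of the DAgger-style interaction loop, with a hypothetical chain environment and a tabular "learner" standing in for supervised training: roll out the current policy, query the expert on the states the learner actually visits, aggregate, and retrain.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states, goal = 10, 9

def expert(s):
    return 1 if s < goal else 0            # hypothetical expert: move right to the goal

def rollout(policy, horizon=20):
    """Run the learner's policy and record the states it actually visits."""
    s, visited = 0, []
    for _ in range(horizon):
        visited.append(s)
        a = policy[s]
        s = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    return visited

# DAgger loop: aggregate expert labels on the learner's own state distribution.
dataset = {}                               # state -> expert action
policy = rng.integers(2, size=n_states)    # arbitrary initial policy
for _ in range(10):
    for s in rollout(policy):
        dataset[s] = expert(s)             # query the expert on visited states
    for s, a in dataset.items():           # "train": tabular memorization here
        policy[s] = a
```

The key difference from behavior cloning is that labels are collected on the learner's induced state distribution, which is what removes the compounding distribution-shift error.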

- (-) A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
- (-) Reinforcement and Imitation Learning via Interactive No-Regret Learning
- Learning to Search Better than Your Teacher
- Convergence of Value Aggregation for Imitation Learning
- Accelerating Imitation Learning with Predictive Models
- Provably Efficient Imitation Learning from Observation Alone

**RL & Control**

In control theory, the most basic model of a dynamic environment is a linear system. How to do RL in continuous state space and what guarantees can be obtained?
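
The workhorse object in this line of work is the discrete-time LQR solution. A minimal sketch (hypothetical double-integrator numbers): iterate the discrete algebraic Riccati equation to a fixed point, then read off the optimal linear gain u = −Kx.

```python
import numpy as np

# Hypothetical double integrator: x' = A x + B u, cost x'Qx + u'Ru.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[0.1]])

# Fixed-point iteration on the discrete algebraic Riccati equation:
# P <- Q + A'P(A - BK) with K = (R + B'PB)^{-1} B'PA.
P = np.eye(2)
for _ in range(2000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal gain u = -K x

closed_loop = A - B @ K   # should be stable: spectral radius below 1
```

The sample-complexity papers above ask how well this K (or the system (A, B) itself) can be recovered from noisy trajectories rather than from known matrices.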

- Regret Bounds for the Adaptive Control of Linear Quadratic Systems
- On the Sample Complexity of the Linear Quadratic Regulator
- (-) Global Convergence of Policy Gradient Methods for Linearized Control Problems
- The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint
- Naive Exploration is Optimal for Online LQR
- The nonstochastic control problem

**Abstractions, Model-based RL, and Representation Selection**

State abstraction is the simplest form of approximation: it collapses a high-dimensional state representation into a more manageable one. Understanding abstractions forms the basis for understanding more complex function approximation schemes. However, designing a good abstraction is a very difficult engineering task; can we have the algorithm automatically select among candidate representations during learning?
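
A minimal sketch of a model-irrelevance (bisimulation-style) abstraction on a hypothetical 4-state MDP: states 1 and 2 have identical rewards and identical transition probabilities into the abstract clusters, so solving the 3-state abstract MDP recovers the ground values exactly.

```python
import numpy as np

gamma = 0.9
# Hypothetical ground MDP (single action) where states 1 and 2 are duplicates.
R = np.array([0.0, 1.0, 1.0, 2.0])
P = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.0]])

# Aggregation map: clusters of ground states, encoded as a 0/1 matrix phi.
clusters = [[0], [1, 2], [3]]
phi = np.zeros((4, 3))
for k, c in enumerate(clusters):
    phi[c, k] = 1.0

# Abstract model: average member rewards and cluster-to-cluster transitions
# (the averaging is exact here because member rows are identical).
R_abs = np.array([R[c].mean() for c in clusters])
P_abs = np.array([(P[c] @ phi).mean(axis=0) for c in clusters])

V_abs = np.linalg.solve(np.eye(3) - gamma * P_abs, R_abs)
V_ground = np.linalg.solve(np.eye(4) - gamma * P, R)
```

When the abstraction is only approximately model-irrelevant, the lifted value incurs an error that scales with the aggregation mismatch; quantifying and selecting among such candidate abstractions is the subject of the papers below.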

- (-) Kernel-based Reinforcement Learning
- Policy Error Bounds for Model-Based Reinforcement Learning with Factored Linear Models
- (*) Towards a Unified Theory of State Abstraction for MDPs
- Abstraction Selection in Model-based Reinforcement Learning
- Model selection in reinforcement learning
- Selecting Near-Optimal Approximate State Representations in Reinforcement Learning
- The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning
- Online model selection for reinforcement learning with function approximation

**Factored MDPs**

Factored MDPs are a tractable formulation of large-scale MDPs: the state is represented by factors, and the next value of each factor depends only on the current values of a small number of parent factors.
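
The parameter savings are easy to see in a tiny sketch (hypothetical DBN with two binary factors): the joint transition matrix has 4 × 4 entries, but the factored representation needs only 2 + 4 conditional probabilities.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical DBN: factor 0 depends only on itself, factor 1 on both factors.
p0 = {0: 0.9, 1: 0.2}                                    # P(x0' = 1 | x0)
p1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # P(x1' = 1 | x0, x1)

def step(x0, x1):
    """Sample the next state factor-by-factor from the DBN."""
    x0n = rng.random() < p0[x0]
    x1n = rng.random() < p1[(x0, x1)]
    return int(x0n), int(x1n)

# The joint transition P(s' | s) is the product of the factor conditionals,
# so the model has 2 + 4 free parameters instead of 4 * 3 for the flat MDP.
```

Sample-efficient learning in this model exploits exactly this parameter count; the regret bounds below scale with the size of the factor scopes rather than the joint state space.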

- Efficient Reinforcement Learning in Factored MDPs
- Efficient Solution Algorithms for Factored MDPs
- Near-optimal Reinforcement Learning in Factored MDPs
- Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting

**Stochastic Approximation**

Convergence analyses of Q-learning, TD, etc.
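
The canonical example is TD(0) viewed as a stochastic-approximation scheme with Robbins-Monro step sizes (Σ αₜ = ∞, Σ αₜ² < ∞). A minimal sketch on a hypothetical 2-state chain:

```python
import numpy as np

rng = np.random.default_rng(5)
gamma = 0.9

# Hypothetical Markov chain induced by a fixed policy.
P = np.array([[0.7, 0.3], [0.4, 0.6]])   # P[s, s']
r = np.array([0.0, 1.0])                 # reward at the current state

V = np.zeros(2)
s = 0
for t in range(1, 100001):
    s_next = 1 if rng.random() < P[s, 1] else 0
    alpha = 100.0 / (100.0 + t)          # Robbins-Monro: sum alpha = inf, sum alpha^2 < inf
    V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])   # TD(0) update
    s = s_next
```

The papers below make this convergence quantitative: finite-sample rates in place of the classic asymptotic ODE-style arguments.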

- Convergence of Stochastic Iterative Dynamic Programming Algorithms
- Finite-Sample Analysis of Proximal Gradient TD Algorithms
- Finite Sample Analyses for TD(0) with Function Approximation

**Concentration Inequalities in Dependent Processes**