(page reused and constantly updated over multiple semesters)

- The references are to help you choose the course project topic, not to put constraints on you. Feel free to choose paper(s) not listed here as long as they are related to RL and have significant theoretical components.
- The references are roughly organized into categories. You can choose 1 or more papers from the same category, or even across categories—as long as you see a strong connection and can write a coherent report based on the selected papers.
- The lists can be incomplete and not representative. You should use them as seed papers and track the citations to read more.
**We will do some of the topics in class.** I will mark them with (*). Your report needs to be significantly different from what’s done in class. If you are unsure which parts/aspects of the papers will be covered, talk to me.

**Some papers have been popular choices in past semesters.** I will mark them with (-): you are encouraged to choose other papers. If you have to choose one of them, you will be held to higher standards. **F20:** You will need to include a novel extension of this work in your report, based on which your report will be mainly evaluated.

**Approximate Dynamic Programming**

Approximate Dynamic Programming (ADP) concerns obtaining approximate solutions to large planning problems, often with the help of sampling and function approximation. Many ADP methods can be considered prototypes of popular value-based RL algorithms used today, so it is important to understand their behaviors and guarantees.

- (*) Performance bounds in L_p norm for approximate value iteration
- Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
- Error propagation for approximate policy and value iteration
- Modelling transition dynamics in MDPs with RKHS embeddings
- SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation
- (*) Information-Theoretic Considerations in Batch Reinforcement Learning
- Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison
- Batch Value-function Approximation with Only Realizability
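
As a concrete illustration of the sampling-plus-regression template underlying many of these papers, here is a minimal sketch of fitted Q-iteration on a toy 2-state MDP. The MDP and all numbers are made up for illustration; the "function class" here is just a table, so the regression step reduces to per-cell averaging.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
# Toy 2-state, 2-action MDP; all numbers are made up for illustration.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # P[a][s, s'] for action 0
     np.array([[0.1, 0.9], [0.7, 0.3]])]   # ... for action 1
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]

def sample_transition(s, a):
    return R[s, a], rng.choice(2, p=P[a][s])

# Fitted Q-iteration with a tabular "function class": each iteration fits
# Q(s, a) to sampled Bellman targets r + gamma * max_a' Q(s', a').
Q = np.zeros((2, 2))
for _ in range(50):
    targets, counts = np.zeros_like(Q), np.zeros_like(Q)
    for _ in range(2000):
        s, a = rng.integers(2), rng.integers(2)
        r, s_next = sample_transition(s, a)
        targets[s, a] += r + gamma * Q[s_next].max()
        counts[s, a] += 1
    Q = targets / np.maximum(counts, 1)   # regression = per-cell averaging
print(Q)   # approaches Q* up to sampling noise
```

With a richer function class the averaging step becomes a genuine regression problem, and the error-propagation questions studied in the papers above become non-trivial.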

**Linear Programming for MDPs**

Apart from value/policy iteration, Linear Programming (LP) is another standard method for solving MDPs. The LP formulation also reveals many interesting properties of MDPs (e.g., the dual formulation has occupancy measure as its decision variables).

- The Linear Programming Approach to Approximate Dynamic Programming
- On constraint sampling in the linear programming approach to approximate dynamic programming
- A Linearly Relaxed Approximate Linear Program for Markov Decision Processes
- Scalable Bilinear π Learning Using State and Action Features
- Efficient Planning in Large MDPs with Weak Linear Function Approximation
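
To make the primal LP concrete, here is a sketch that solves a toy 2-state MDP with `scipy.optimize.linprog` (the MDP numbers are hypothetical): minimize the weighted sum of values subject to the Bellman inequality constraints, one per state-action pair.

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-state, 2-action MDP; all numbers are made up for illustration.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[s, a, s']
              [[0.2, 0.8], [0.7, 0.3]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[s, a]
gamma, n_s, n_a = 0.9, 2, 2

# Primal LP:  min_v  sum_s mu(s) v(s)
# s.t.  v(s) >= R(s,a) + gamma * sum_s' P(s'|s,a) v(s')   for all (s, a),
# rewritten below in linprog's  A_ub @ v <= b_ub  form.
A_ub, b_ub = [], []
for s in range(n_s):
    for a in range(n_a):
        row = gamma * P[s, a].copy()
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[s, a])
mu = np.ones(n_s) / n_s   # any strictly positive state weighting works
res = linprog(c=mu, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_s)
v_star = res.x            # the optimal value function V*
print(v_star)
```

The dual variables of the constraints correspond to a (discounted) occupancy measure, which is the object exploited by several of the papers above.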

**LSTD and LSPI**

These methods directly minimize the projected Bellman error for linear value-function approximation, and have characteristics that differ from those of other TD methods that learn from bootstrapped targets.

- Finite-sample analysis of least-squares policy iteration
- LSTD with Random Projections
- The fixed points of off-policy TD
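
As a reference point for these papers, here is a minimal LSTD(0) sketch on a toy 3-state chain with one-hot features (all numbers hypothetical). It accumulates A = Σ φ(φ − γφ′)ᵀ and b = Σ φr and solves Aw = b, whose solution is the fixed point of the projected Bellman equation; with one-hot features this recovers V^π of the empirical model.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
# Markov chain under a fixed policy; all numbers are made up for illustration.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([0.0, 0.0, 1.0])
phi = np.eye(3)   # one-hot features, so LSTD recovers V^pi exactly

# LSTD(0): solve A w = b, with A and b accumulated from on-policy transitions.
A, b = np.zeros((3, 3)), np.zeros(3)
s = 0
for _ in range(20000):
    s_next = rng.choice(3, p=P[s])
    f, f_next = phi[s], phi[s_next]
    A += np.outer(f, f - gamma * f_next)
    b += f * r[s]
    s = s_next
w = np.linalg.solve(A, b)
print(w)   # approximates V^pi
```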

**Off-policy Evaluation**

How can we estimate the performance of a policy using data collected from a different policy? This question has important implications for safety and real-world applications of RL.

*OPE in Bandits*

- Effective Evaluation Using Logged Bandit Feedback from Multiple Loggers
- Balanced Policy Evaluation and Learning
- Optimal and Adaptive Off-Policy Evaluation in Contextual Bandits
- Adaptive Estimator Selection for Off-Policy Evaluation
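
Many of the bandit OPE papers above build on the basic inverse propensity scoring (IPS) estimator, which can be sketched on synthetic data as follows (the logging/target policies and reward means are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 100000, 3
true_reward = np.array([0.2, 0.5, 0.8])   # hypothetical mean rewards

# A uniform logging policy generates the data.
mu = np.full(n_actions, 1.0 / n_actions)
a_log = rng.choice(n_actions, size=n, p=mu)
r_log = rng.binomial(1, true_reward[a_log])

# Target policy to evaluate offline.
pi = np.array([0.1, 0.2, 0.7])

# Inverse propensity scoring: reweight logged rewards by pi(a) / mu(a).
ips = np.mean(pi[a_log] / mu[a_log] * r_log)
print(ips, pi @ true_reward)   # estimate vs. ground truth
```

The refinements above (doubly robust, balanced, adaptive estimators) mainly attack the variance of these importance weights.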

*Importance Sampling for RL*

- (*) Eligibility traces for off-policy policy evaluation
- (*) Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
- Data-efficient off-policy policy evaluation for reinforcement learning
- Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation
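
The estimators studied in these papers refine the basic trajectory-wise importance sampling estimator, sketched below on a toy episodic MDP (all numbers hypothetical): weight each trajectory's return by the product of per-step probability ratios.

```python
import numpy as np

rng = np.random.default_rng(0)
H, n_ep = 3, 50000
# Tiny episodic MDP; all numbers are made up for illustration.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],   # P[s, a, s']
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[0.0, 1.0], [1.0, 0.0]])    # deterministic reward R[s, a]
pi_b = np.array([[0.5, 0.5], [0.5, 0.5]]) # behavior policy pi_b[s, a]
pi_e = np.array([[0.2, 0.8], [0.7, 0.3]]) # evaluation policy pi_e[s, a]

def episode(policy):
    s, traj = 0, []
    for _ in range(H):
        a = rng.choice(2, p=policy[s])
        traj.append((s, a, R[s, a]))
        s = rng.choice(2, p=P[s, a])
    return traj

# Trajectory-wise IS: rho = prod_t pi_e(a_t|s_t)/pi_b(a_t|s_t);
# mean(rho * return) is unbiased for the value of pi_e.
est = 0.0
for _ in range(n_ep):
    traj = episode(pi_b)
    rho = np.prod([pi_e[s, a] / pi_b[s, a] for s, a, _ in traj])
    est += rho * sum(r for _, _, r in traj)
est /= n_ep
print(est)
```

Since the weight is a product over time steps, its variance can grow exponentially in the horizon, which is exactly the "curse of horizon" addressed by the marginalized IS papers below.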

*Marginalized Importance Sampling and Beyond*

- Breaking the Curse of Horizon: Infinite-Horizon Off-Policy Estimation
- Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
- Off-Policy Policy Gradient with State Distribution Correction
- (*) Minimax Weight and Q-Function Learning for Off-Policy Evaluation
- Reinforcement Learning via Fenchel-Rockafellar Duality
- Accountable Off-Policy Evaluation With Kernel Bellman Statistics
- (*) Minimax Confidence Interval for Off-Policy Evaluation and Policy Optimization

*Connection between IS and PG*

- On a connection between importance sampling and the likelihood ratio policy gradient
- From Importance Sampling to Doubly Robust Policy Gradient

**RL and Causality**

In batch RL we often assume that the logged actions are functions of the states, which is not necessarily true in some application scenarios. If the actions in the data were determined based on hidden factors, standard algorithms may suffer from the confounding effect, and we may seek tools from causal inference to mitigate it.

- Confounding-Robust Policy Improvement
- Off-Policy Evaluation in Partially Observable Environments
- Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders

**PAC-RL (tabular)**

PAC-MDP is a mathematical framework for studying sample complexity in RL, often with an emphasis on the challenge of *exploration*. PAC bounds for RL have been refined and simplified over time, and new topics beyond classical tabular RL are being explored.

- Near-optimal regret bounds for reinforcement learning
- Sample complexity of episodic fixed-horizon reinforcement learning
- Minimax regret bounds for reinforcement learning
- Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
- Policy Certificates: Towards Accountable Reinforcement Learning
- Optimistic posterior sampling for reinforcement learning: worst-case regret bounds
- Regret Bounds for Reinforcement Learning via Markov Chain Concentration
- Is Q-learning Provably Efficient?
- A Unifying View of Optimism in Episodic Reinforcement Learning
- Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

*Lower bounds*

- On Lower Bounds for Regret in Reinforcement Learning
- Open Problem: The Dependence of Sample Complexity Lower Bounds on Planning Horizon

*Spectral methods for exploration in POMDPs*

*Exploiting side information or structures in easy problems*

- Problem Dependent Reinforcement Learning Bounds Which Can Identify Bandit Structure in MDPs
- When Simple Exploration is Sample Efficient: Identifying Sufficient Conditions for Random Exploration to Yield PAC RL Algorithms
- PAC Reinforcement Learning with an Imperfect Model
- Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds

*Concurrent & Continuous PAC-RL*

- Concurrent PAC RL
- PAC Optimal Exploration in Continuous Space Markov Decision Processes
- Improving PAC Exploration Using the Median Of Means
- Lipschitz Continuity in Model-based Reinforcement Learning (also see this addendum)

*Multi-task & Life-long Learning and Learning across Population*

- Sample complexity of multi-task reinforcement learning
- PAC-inspired Option Discovery in Lifelong Reinforcement Learning
- Markov Decision Processes with Continuous Side Information

*PAC Hierarchical RL*

**PAC-RL with Function Approximation**

Most PAC-MDP literature focuses on the tabular case, while empirical RL algorithms all rely on function approximation. What is a mathematical framework for understanding exploration in RL with function approximation?

*Bellman rank*

- (*) Contextual Decision Processes with Low Bellman Rank are PAC-Learnable
- On Oracle-Efficient PAC Reinforcement Learning with Rich Observations
- Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches
- Provably Efficient RL with Rich Observations via Latent State Decoding

*Eluder dimension*

- Model-based reinforcement learning and the eluder dimension
- Efficient exploration and value function generalization in deterministic systems

*Low-rank/Linear MDPs*

- Provably efficient reinforcement learning with linear function approximation
- FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs
- Sample complexity of reinforcement learning using linearly combined model ensembles

*Lower bound*

**Inverse RL**

In MDPs we start with a reward function. Where does it come from? In fact, it can be very difficult to specify a good reward function in practice. One way to resolve this problem is inverse RL, where we derive a reward function algorithmically from expert demonstrations.

- (-) Algorithms for Inverse Reinforcement Learning
- (-) Apprenticeship learning via inverse reinforcement learning
- Repeated Inverse Reinforcement Learning
- Learning Safe Policies with Expert Guidance

**Reward Shaping**

Here is a hard trade-off: a reward function that faithfully describes the ultimate evaluation criteria is often sparse and difficult to work with, while a non-sparse reward function that provides a lot of learning signal may result in undesired behaviors. How can we transform a sparse reward function into a more learnable one without sacrificing fidelity to the ultimate goal?
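
One classical answer is potential-based shaping, F(s, a, s') = γΦ(s') − Φ(s), which shifts every policy's return by a quantity that depends only on the start and end states and therefore preserves optimal policies. A minimal numerical check of the telescoping identity (the potential, trajectory, and rewards are random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Potential-based shaping: F(s, a, s') = gamma * Phi(s') - Phi(s).
# Along any trajectory the shaped terms telescope, so every policy's
# return shifts by the same quantity and optimal policies are preserved.
Phi = rng.normal(size=5)           # an arbitrary potential over 5 states
states = rng.integers(5, size=21)  # some state trajectory s_0, ..., s_20
rewards = rng.normal(size=20)      # original rewards r_0, ..., r_19

G = sum(gamma**t * rewards[t] for t in range(20))
F = [gamma * Phi[states[t + 1]] - Phi[states[t]] for t in range(20)]
G_shaped = sum(gamma**t * (rewards[t] + F[t]) for t in range(20))

# The gap is gamma^20 * Phi(s_20) - Phi(s_0), independent of the rewards.
print(G_shaped - G)
```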

**On Planning Horizon**

In practice, the horizon / discount factor used in an RL algorithm is usually smaller than the one in the problem definition. When there is such a mismatch between the two notions of horizon, how lossy is the policy that we get?

- Biasing Approximate Dynamic Programming with a Lower Discount Factor
- The Dependence of Effective Planning Horizon on Model Accuracy
- On Structural Properties of MDPs that Bound Loss due to Shallow Planning
- Truncated Approximate Dynamic Programming with Task-Dependent Terminal Value
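
A small numerical experiment in the spirit of these papers, on a randomly generated MDP (purely illustrative): plan greedily with a lower discount factor, then measure the loss of the resulting policy under the true discount factor.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma_true = 20, 4, 0.99
# Randomly generated MDP; purely for illustration.
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # P[s, a, s']
R = rng.random((n_s, n_a))

def greedy_policy(gamma):
    V = np.zeros(n_s)
    for _ in range(3000):                 # value iteration at this discount
        V = (R + gamma * P @ V).max(axis=1)
    return (R + gamma * P @ V).argmax(axis=1)

def evaluate(pi, gamma):
    P_pi, R_pi = P[np.arange(n_s), pi], R[np.arange(n_s), pi]
    return np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)

V_opt = evaluate(greedy_policy(gamma_true), gamma_true)
for g in [0.5, 0.9, 0.99]:
    loss = (V_opt - evaluate(greedy_policy(g), gamma_true)).max()
    print(g, loss)   # loss vanishes as the planning gamma matches the true one
```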

**Robustness in MDPs**

Conservative approaches to MDPs: compute a policy that is robust to uncertainty in the MDP's parameters, or, when the MDP is fully known, compute a policy whose return is worst-case optimal w.r.t. the randomness of the trajectory.

- Robust Control of Markov Decision Processes with Uncertain Transition Matrices
- RAAM: The Benefits of Robustness in Approximating Aggregated MDPs in Reinforcement Learning
- Risk-sensitive and robust decision-making: a CVaR optimization approach
- Tight Bayesian Ambiguity Sets for Robust MDPs

**Policy Gradient and Policy Improvement Methods**

How to design an RL algorithm that keeps improving the policy monotonically? This motivation inspired the original CPI algorithm, and later its variants such as TRPO.

- (-) Approximately Optimal Approximate Reinforcement Learning
- (-) Trust Region Policy Optimization
- Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes
- Politex: Regret bounds for policy iteration using expert prediction

**Interactive Imitation Learning**

How can we mimic a policy from expert demonstrations? A pure supervised-learning approach (“behavior cloning”) may suffer from distribution shift. Interactive methods, such as DAgger, are more robust to it.

- (-) A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
- (-) Reinforcement and Imitation Learning via Interactive No-Regret Learning
- Learning to Search Better than Your Teacher
- Convergence of Value Aggregation for Imitation Learning
- Accelerating Imitation Learning with Predictive Models
- Provably Efficient Imitation Learning from Observation Alone
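
The interactive loop can be sketched as below: roll out the current learner, query the expert on the states the learner itself visits, aggregate the labels, and refit. Everything here (the chain environment, the expert, the per-state majority-vote learner) is a toy stand-in chosen for illustration, not the setup of any paper above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, H = 10, 15

def expert(s):
    return 1 if s < 9 else 0   # head right toward state 9, then hover

def rollout(policy):
    s, visited = 0, []
    for _ in range(H):
        visited.append(s)
        s = min(max(s + (1 if policy(s) == 1 else -1), 0), n_states - 1)
    return visited

# DAgger: roll out the *current learner*, have the expert label the states
# the learner actually visits, aggregate all labels, and refit the learner.
data = {s: [] for s in range(n_states)}
def learner(s):
    return max(set(data[s]), key=data[s].count) if data[s] else int(rng.integers(2))

for _ in range(10):            # DAgger iterations
    for s in rollout(learner):
        data[s].append(expert(s))

agreement = np.mean([learner(s) == expert(s) for s in rollout(learner)])
print(agreement)   # the learner matches the expert on its own state distribution
```

The key contrast with behavior cloning is the training distribution: labels are collected on the learner's induced state distribution, not the expert's.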

**Sampling-based Planning Algorithms**

Solving large MDPs by sampling.

- A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes
- PEGASUS: A policy search method for large MDPs and POMDPs
- Sample complexity of policy search with known dynamics
- Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning
- Dual Policy Iteration

**RL & Control**

In control theory, the most basic model of a dynamic environment is a linear system. How can we do RL in continuous state spaces, and what guarantees can be obtained?

- Regret Bounds for the Adaptive Control of Linear Quadratic Systems
- On the Sample Complexity of the Linear Quadratic Regulator
- (-) Global Convergence of Policy Gradient Methods for Linearized Control Problems
- The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint
- Naive Exploration is Optimal for Online LQR

**Abstractions, Model-based RL, and Representation Selection**

State abstraction is the simplest form of approximation: it collapses a high-dimensional state representation into a more manageable one. Understanding abstractions forms the basis for understanding more complex function approximation schemes. However, designing a good abstraction is a very difficult engineering task; can we have the algorithm automatically select among candidate representations during learning?

- Kernel-based Reinforcement Learning
- Policy Error Bounds for Model-Based Reinforcement Learning with Factored Linear Models
- (*) Towards a Unified Theory of State Abstraction for MDPs
- Abstraction Selection in Model-based Reinforcement Learning
- Model selection in reinforcement learning
- Selecting Near-Optimal Approximate State Representations in Reinforcement Learning
- The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning

**Factored MDPs**

A factored MDP is a tractable formulation of large-scale MDPs: the state is represented by factors, and the future value of each factor depends only on the current values of a small number of parent factors.

- Efficient Reinforcement Learning in Factored MDPs
- Efficient Solution Algorithms for Factored MDPs
- Near-optimal Reinforcement Learning in Factored MDPs
- Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting

**POMDPs and PSRs**

When the Markov assumption does not hold and the world is partially observable, how can we recover the *state* from observations and plan based on the *belief*?

- Point-based value iteration: An anytime algorithm for POMDPs
- Perseus: Randomized Point-based Value Iteration for POMDPs
- Closing the learning–planning loop with predictive state representations
- Hilbert Space Embeddings of Hidden Markov Models
- Low-Rank Spectral Learning with Weighted Loss Functions
- Completing State Representations using Spectral Learning
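
The starting point for the point-based methods above is the Bayes-filter belief update, b'(s') ∝ O(o | s') Σ_s T(s' | s, a) b(s). A minimal sketch on a toy 2-state POMDP (all numbers hypothetical):

```python
import numpy as np

# Toy 2-state POMDP; all numbers are made up for illustration.
T = np.array([[[0.9, 0.1], [0.4, 0.6]],   # T[s, a, s']
              [[0.2, 0.8], [0.5, 0.5]]])
O = np.array([[0.8, 0.2],                 # O[s', o]: P(observation | state)
              [0.3, 0.7]])

def belief_update(b, a, o):
    pred = b @ T[:, a, :]      # predict: sum_s T(s'|s,a) b(s)
    post = O[:, o] * pred      # correct: weight by P(o|s')
    return post / post.sum()   # normalize

b = belief_update(np.array([0.5, 0.5]), a=0, o=1)
print(b)
```

Point-based value iteration then plans over a finite set of such belief points rather than the whole simplex.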

**Stochastic Approximation**

Convergence analyses of Q-learning, TD, etc.

- Convergence of Stochastic Iterative Dynamic Programming Algorithms
- Finite-Sample Analysis of Proximal Gradient TD Algorithms
- Finite Sample Analyses for TD(0) with Function Approximation
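
As a concrete instance of the stochastic approximation schemes analyzed in these papers, here is tabular TD(0) with a Robbins-Monro style step-size schedule on a toy 2-state chain (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
# Markov chain under a fixed policy; all numbers are made up for illustration.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
r = np.array([1.0, 0.0])

# Tabular TD(0) as stochastic approximation: with Robbins-Monro step sizes
# (sum alpha = inf, sum alpha^2 < inf), V converges to V^pi almost surely.
V, visits, s = np.zeros(2), np.zeros(2), 0
for _ in range(100000):
    s_next = rng.choice(2, p=P[s])
    visits[s] += 1
    alpha = 1.0 / visits[s] ** 0.6        # a polynomial Robbins-Monro schedule
    V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])
    s = s_next
print(V)   # approximates V^pi = (I - gamma * P)^{-1} r
```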

**Concentration Inequalities in Dependent Processes**