A Paradox on Sampling Trajectories with Approximate Belief States

As mentioned on X, I got interested in a POMDP question more than a year ago, and discovered many interesting results that went against my initial intuition. After procrastinating forever, I eventually found time to write things down,1 and decided to share one of the counterintuitive results here as a teaser for the paper. The answer can be found in the paper, and I will post the link here after it appears on arXiv. (Update 11/26: paper link)

  1. The style of the paper does not quite fit the usual standards of publication (I had an uphill battle before and do not want to try again), and the main purpose of writing is to offload the ideas from my mind so that they do not keep coming up and occupying my thought space forever… For that, I had to strike a trade-off between the rigor of the paper and the time spent on it. As a result, some proofs in the paper are not fully fleshed out, but I have expanded the key analyses to the extent that I am reasonably confident I am not missing anything major.

Read More →

In-Sample Moments "Generalize" under Overfitting

As a hitchhiker in learning theory, I constantly bump into a weird kind of problem, where an overfitted hypothesis still predicts certain population quantities accurately. One specific instance concerns the 2nd moment of the prediction, which will be today’s topic.1 After being haunted by the issue for years,2 I finally figured out a way to nicely characterize the phenomenon.

  1. A simpler example is when we predict $\mathbb{E}[Y] \approx \mathbb{E}_n[\hat{f}]$, which is always accurate provided that $\mathcal{F}$ is closed under offset, as that implies $\mathbb{E}_n[\hat{f}] = \mathbb{E}_n[Y]$ (a one-line derivation is sketched after these footnotes).

  2. Some years ago when Alekh, Sham, and I discussed RL at MSR (that was before we started working on the book), Sham once made a point along the lines of: “if you fit TD from on-policy trajectories, and you totally overfit, then it just recovers Monte-Carlo?” That observation is very much in the spirit of this post, and a toy numerical check of it is sketched below. And guess what, we are working on something related (and very exciting!) and that’s what got me into this rabbit hole again…
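To see why the claim in footnote 1 holds, here is a one-line derivation under the standard squared-loss ERM setup (my notation, which may differ from the post's): let $\hat{f} = \arg\min_{f \in \mathcal{F}} \mathbb{E}_n[(f(X) - Y)^2]$. If $\mathcal{F}$ is closed under offset, then $\hat{f} + c \in \mathcal{F}$ for every constant $c$, so the empirical loss of $\hat{f} + c$ must be minimized at $c = 0$:

$$0 = \frac{d}{dc}\,\mathbb{E}_n\big[(\hat{f}(X) + c - Y)^2\big]\Big|_{c=0} = 2\,\mathbb{E}_n[\hat{f}(X) - Y] \quad\Longrightarrow\quad \mathbb{E}_n[\hat{f}] = \mathbb{E}_n[Y] \approx \mathbb{E}[Y],$$

with the last step being the law of large numbers applied to $\mathbb{E}_n[Y]$.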
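And here is a minimal numerical sketch of the TD/Monte-Carlo point in footnote 2 (my own toy construction, not from the post; the trajectory, learning rate, and discount are made up): if every timestep of a trajectory gets its own value parameter, i.e., TD overfits completely with no generalization across states, then the TD(0) fixed point is exactly the discounted Monte-Carlo return from each step.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.9, 10
rewards = rng.normal(size=T)   # rewards along one on-policy trajectory

# "Totally overfit" value function: one free parameter per timestep,
# so no two data points share a parameter.
V = np.zeros(T + 1)            # V[T] = 0 at termination
for _ in range(5000):          # run TD(0) updates until convergence
    for t in range(T):
        V[t] += 0.1 * (rewards[t] + gamma * V[t + 1] - V[t])

# Monte-Carlo returns along the same trajectory, computed backwards.
G = np.zeros(T + 1)
for t in reversed(range(T)):
    G[t] = rewards[t] + gamma * G[t + 1]

print(np.abs(V[:T] - G[:T]).max())   # ~0: TD fixed point equals the MC returns
```

The point is that the TD fixed-point equations $V_t = r_t + \gamma V_{t+1}$ form a triangular system whose unique solution is the Monte-Carlo return; generalization across states is what makes TD and Monte-Carlo differ.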

Read More →

Is Density vs. Feature Coverage That Different?

Coverage is a fundamental concept in offline RL: it measures how much information the data distribution contains about another policy. The most basic form of coverage is measured by the magnitude of density ratios, but it is well known that this can be relaxed to feature coverage under linear reward/value functions. Having a lenient definition of coverage can be beneficial in policy evaluation and learning: for evaluation, a target policy may be poorly covered by the data in terms of density ratios but well covered in terms of features, and whether the analysis takes advantage of the latter determines whether a guarantee can be established. For policy learning, pessimistic approaches allow us to compete with the best-covered policy, and a lenient notion of coverage (such as feature coverage) effectively enlarges the space of policies we search over.
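To make this concrete, here is a toy numerical sketch (my own construction, not from the post; the states, features, and distributions are made up): a target policy whose density ratio against the data distribution is infinite, yet whose feature coverage, measured by $\mathbb{E}_\pi[\phi]^\top \Sigma^{-1} \mathbb{E}_\pi[\phi]$ with $\Sigma = \mathbb{E}_\mu[\phi \phi^\top]$, is perfect.

```python
import numpy as np

n_states = 200
states = np.arange(n_states)

# 2-d one-hot features that depend only on the parity of the state index,
# so many distinct states share the same feature vector.
phi = np.eye(2)[states % 2]               # shape (n_states, 2)

# Data distribution mu: uniform over the even states in [0, 100).
mu = np.zeros(n_states)
mu[0:100:2] = 1.0
mu /= mu.sum()

# Target policy's state distribution pi: uniform over the even states in [100, 200).
pi = np.zeros(n_states)
pi[100:200:2] = 1.0
pi /= pi.sum()

# Density-ratio coverage: max_s pi(s) / mu(s). Infinite here, because pi
# visits states that mu never does.
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = np.where(pi > 0, pi / mu, 0.0)
print("density-ratio coverage:", ratio.max())      # inf

# Feature coverage: E_pi[phi]^T Sigma^+ E_pi[phi], with Sigma = E_mu[phi phi^T].
Sigma = phi.T @ (mu[:, None] * phi)
mean_phi_pi = pi @ phi
print("feature coverage:", mean_phi_pi @ np.linalg.pinv(Sigma) @ mean_phi_pi)  # 1.0
```

Every state that pi visits is unseen in the data, so density-ratio-based analyses give up entirely, while a linear function of these features is pinned down on pi's states by the data, which is exactly the leniency described above.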

Read More →

Tweet Archive

I occasionally mention useful/fun facts about or related to RL theory on Twitter. I often find myself referring to these tweets in conversations and having a hard time finding them, so I decided to gather the links and put them here.