Typo in Algorithm 1, Line 5: the learner should act greedily with respect to (R_t + c_t), not c_t alone. NJ thanks Pengfei Yu for pointing this out as part of a CS598 F20 final project report.