joint work with Andrey Bernstein, Yue Chen, Marcello Colombino, Emiliano Dall'Anese, Prashant Mehta
The talk will survey new results in the area of “quasi-stochastic approximation,” in which all of the processes under consideration are deterministic, much like quasi-Monte Carlo methods used for variance reduction in simulation. Variance reduction can be substantial, but only if the algorithm's parameters are chosen with attention to certain constraints. These results yield a new class of reinforcement learning algorithms for deterministic state space models. Specifically, we propose a class of algorithms for policy evaluation that use a different policy designed to introduce “exploration.”
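To make the deterministic-exploration idea concrete, here is a minimal sketch (an illustration under my own assumptions, not the speaker's implementation) comparing a standard stochastic-approximation estimator with a quasi-stochastic variant in which the i.i.d. samples are replaced by a deterministic, equidistributed probing sequence. The target function g and the probing frequency are illustrative choices.

```python
import numpy as np

# Hedged sketch: standard SA vs. quasi-SA for estimating theta* = E[g(U)],
# U ~ Uniform(0, 1), with the illustrative choice g(u) = u**2 (theta* = 1/3).
# In the quasi-SA recursion the random samples are replaced by the
# deterministic, equidistributed sequence u_n = (omega * n) mod 1.

def g(u):
    return u ** 2

rng = np.random.default_rng(0)
n_iter = 100_000
theta_sa = 0.0          # standard stochastic approximation iterate
theta_qsa = 0.0         # quasi-stochastic approximation iterate
omega = np.sqrt(2.0)    # irrational frequency => equidistributed probing

for n in range(1, n_iter + 1):
    a_n = 1.0 / n                       # diminishing step size

    u_rand = rng.uniform()              # random exploration
    theta_sa += a_n * (g(u_rand) - theta_sa)

    u_det = (omega * n) % 1.0           # deterministic exploration
    theta_qsa += a_n * (g(u_det) - theta_qsa)

print(f"target 1/3           = {1/3:.6f}")
print(f"standard SA estimate = {theta_sa:.6f}")
print(f"quasi-SA estimate    = {theta_qsa:.6f}")
```

In toy runs of this kind the deterministic probing typically lands much closer to the target for the same number of iterations, which is the sort of variance reduction the abstract refers to; as noted above, the benefit depends on choosing the algorithm's parameters, here the step size and probing frequency, appropriately.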
joint work with Rupam Mahmood, Richard Sutton
We introduce a new class of TD(lambda) algorithms for policy evaluation, based on the idea of randomized stopping times and generalized Bellman equations. The key innovation is a dynamic scheme to judiciously set the lambda-parameters in the eligibility trace iterates, which is especially useful for balancing the bias-variance tradeoff in off-policy learning. We prove weak convergence and almost sure convergence results for several gradient-type algorithms, including saddle-point and mirror-descent TD variants, by using ergodic theory for weak Feller Markov chains and mean-ODE-based proof methods.
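As a rough illustration of where a dynamically chosen lambda enters off-policy eligibility-trace iterates, here is a hedged Python sketch. The per-step rule lambda_t = min(1, 1/rho_t) is a placeholder heuristic used only to show the structure of the updates; it is not the randomized-stopping-time scheme developed in the talk. The feature map phi, the transition format, and the importance ratios rho are assumptions.

```python
import numpy as np

# Hedged sketch: off-policy TD(lambda) for policy evaluation with linear
# features and a per-step lambda.  The rule lambda_t = min(1, 1 / rho_t) is a
# placeholder that keeps the eligibility trace bounded; the talk's actual
# scheme sets lambda via randomized stopping times / generalized Bellman
# equations.

def td_lambda_off_policy(transitions, phi, d, gamma=0.95, alpha=0.01):
    """transitions: iterable of (s, r, s_next, rho), where rho is the
    importance-sampling ratio pi(a|s) / behavior(a|s); phi maps a state to
    a length-d feature vector.  Returns the learned weight vector w."""
    w = np.zeros(d)
    z = np.zeros(d)                                        # eligibility trace
    for s, r, s_next, rho in transitions:
        lam = min(1.0, 1.0 / max(rho, 1e-8))               # dynamic lambda (placeholder rule)
        delta = r + gamma * phi(s_next) @ w - phi(s) @ w   # TD error
        z = rho * (gamma * lam * z + phi(s))               # importance-weighted trace
        w = w + alpha * delta * z                          # weight update
    return w

# Tiny synthetic usage: a 2-state chain with one-hot features (made-up data).
phi = lambda s: np.eye(2)[s]
transitions = [(0, 1.0, 1, 0.8), (1, 0.0, 0, 1.2)] * 2000
w = td_lambda_off_policy(transitions, phi, d=2)
print("estimated state values:", w)
```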
joint work with Kaiqing Zhang, Hao Zhu, Tamer Başar
We propose a new policy gradient (PG) variant for infinite-horizon problems that uses a random rollout horizon to estimate the Q function. Doing so yields unbiased gradient estimates and provides an alternative route to recovering existing convergence results. We then modify the proposed PG method by introducing a periodically enlarged stepsize rule; under the correlated negative curvature condition, which holds when the reward is strictly positive or strictly negative, the modified method converges to second-order stationary points, i.e., approximate local maxima. Intriguingly, this substantiates reward reshaping from the perspective of nonconvex optimization.
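To illustrate the random-horizon idea, here is a hedged sketch of an unbiased Q estimate obtained by truncating an undiscounted rollout at a geometrically distributed horizon, plugged into a single-sample gradient estimate. The interfaces env.step, env.reset, policy_sample, and grad_log_pi are hypothetical placeholders, and the sketch omits the stepsize schedule and other refinements described above.

```python
import numpy as np

# Hedged sketch of the random-rollout-horizon trick for a discounted
# objective with factor gamma.  If T ~ Geometric(1 - gamma) on {0, 1, 2, ...}
# then P(T >= t) = gamma**t, so the *undiscounted* return truncated at T,
#     Q_hat = r_0 + r_1 + ... + r_T,
# satisfies E[Q_hat] = sum_t gamma**t E[r_t] = Q(s, a), i.e., it is unbiased.
# The environment/policy interfaces below are illustrative assumptions.

def unbiased_q_estimate(env, s, a, policy_sample, gamma, rng):
    """Roll out from (s, a) for a random horizon T ~ Geom(1 - gamma)."""
    T = rng.geometric(1.0 - gamma) - 1      # numpy's geometric starts at 1
    q_hat, state, action = 0.0, s, a
    for _ in range(T + 1):
        state, reward, done = env.step(state, action)   # assumed interface
        q_hat += reward                     # no discount factor: T handles it
        if done:
            break
        action = policy_sample(state)
    return q_hat

def pg_estimate(env, theta, policy_sample, grad_log_pi, gamma, rng):
    """Single-sample policy gradient estimate Q_hat * grad log pi(a | s)."""
    s = env.reset()                         # assumed interface
    a = policy_sample(s)
    q_hat = unbiased_q_estimate(env, s, a, policy_sample, gamma, rng)
    return q_hat * grad_log_pi(theta, s, a)
```

One simplification worth flagging: a full estimator would also draw the starting state-action pair along a rollout so that it follows the discounted visitation measure; the sketch starts from the initial state only to keep the example short.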