Mon.1 11:00–12:15 | H 0110 | STO

Stochastic Approximation and Reinforcement Learning (1/3)

Chair: Alec Koppel | Organizer: Alec Koppel
11:00

Sean Meyn

joint work with Andrey Bernstein, Yue Chen, Marcello Colombino, Emiliano Dall'Anese, Prashant Mehta

Quasi-Stochastic Approximation and Off-Policy Reinforcement Learning

The talk will survey new results in the area of “quasi-stochastic approximation”, in which all of the processes under consideration are deterministic, much like quasi-Monte Carlo methods for variance reduction in simulation. Variance reduction can be substantial, but only if the parameters in the algorithm are chosen with attention to constraints. These results yield a new class of reinforcement learning algorithms for deterministic state-space models. Specifically, we propose a class of algorithms for policy evaluation in which data are generated by a different policy, designed to introduce “exploration”.
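
As a minimal illustration of the quasi-stochastic idea, and not the authors' algorithm, the sketch below replaces the i.i.d. noise of a standard stochastic-approximation recursion with a deterministic sinusoidal probing signal and averages a function of that signal. The gain rule, step count, and target function are illustrative assumptions; note that the gain must be taken large enough, echoing the parameter constraints mentioned above.

import numpy as np

def qsa_average(f, n_steps=20_000, gain=2.0, dt=0.1):
    """Quasi-stochastic approximation sketch: average f over the deterministic
    probing signal xi(t) = sin(t) via theta_{n+1} = theta_n + a_n (f(xi_n) - theta_n)."""
    theta = 0.0
    for n in range(1, n_steps + 1):
        xi = np.sin(n * dt)             # deterministic "exploration" signal
        a_n = gain / n                  # vanishing gain; must be chosen large enough
        theta += a_n * (f(xi) - theta)  # deterministic analogue of an SA step
    return theta

# Example: the time average of sin(t)^2 is 1/2.
print(qsa_average(lambda x: x ** 2))    # prints a value close to 0.5

Because the driving signal is deterministic, repeated runs produce the same trajectory; the variance-reduction claim concerns how quickly such averages converge compared with i.i.d. sampling.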

11:25

Huizhen Yu

joint work with Rupam Mahmood, Richard Sutton

On Generalized Bellman Equations and Temporal-Difference (TD) Learning for Policy Evaluation in MDPs

We introduce a new class of TD(λ) algorithms for policy evaluation, based on the idea of randomized stopping times and generalized Bellman equations. The key innovation is a dynamic scheme to judiciously set the λ-parameters in eligibility-trace iterates, which is especially useful for balancing the bias-variance tradeoff in off-policy learning. We prove weak convergence and almost-sure convergence results for several gradient-type algorithms, including saddle-point and mirror-descent TD variants, using ergodic theory for weak Feller Markov chains and mean-ODE-based proof methods.
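
A minimal tabular sketch of off-policy TD(λ) with a dynamically chosen λ follows. The specific rule, shrinking λ whenever the decayed, importance-weighted eligibility trace would exceed a bound, is an illustrative stand-in for the paper's scheme; the data format, trace bound, step size, and placement of the importance weights are assumptions.

import numpy as np

def td_lambda_dynamic(transitions, n_states, alpha=0.05, gamma=0.9, trace_cap=1.0):
    """Tabular off-policy TD with a per-step lambda.

    transitions: iterable of (s, r, s_next, rho), where rho = pi(a|s)/mu(a|s)
    is the importance-sampling ratio of the target policy pi over the
    behaviour policy mu for the action actually taken.
    """
    V = np.zeros(n_states)                    # value estimates
    e = np.zeros(n_states)                    # eligibility trace
    for s, r, s_next, rho in transitions:
        # Dynamic lambda (illustrative): shrink lambda when the decayed,
        # importance-weighted trace would exceed trace_cap, to control variance.
        scale = rho * gamma * np.abs(e).max()
        lam = 1.0 if scale <= trace_cap else trace_cap / scale
        e *= gamma * lam                      # decay the trace
        e[s] += 1.0                           # accumulate at the current state
        e *= rho                              # one standard placement of the weights
        delta = r + gamma * V[s_next] - V[s]  # TD error
        V += alpha * delta * e
    return V

Choosing λ adaptively in this way trades bias for variance step by step, in the spirit of the generalized Bellman equations described above.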

11:50

Alec Koppel

joint work with Kaiqing Zhang, Hao Zhu, Tamer Başar

Global Convergence of Policy Gradient Methods: A Nonconvex Optimization Perspective

We propose a new policy gradient (PG) variant for infinite-horizon problems that uses a random rollout horizon to estimate the Q-function. Doing so yields unbiased gradient estimates and provides an alternative route to recovering existing convergence results. We then modify the proposed PG method by introducing a periodically enlarged stepsize rule which, under the correlated negative curvature condition (satisfied when the reward is strictly positive or strictly negative), converges to second-order stationary points, i.e., approximate local maxima. Intriguingly, this substantiates reward reshaping from a nonconvex optimization perspective.
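
The random-rollout-horizon device can be sketched as follows, under assumptions not taken from the paper (a toy chain environment and a trivial policy with a hypothetical step/sample interface): drawing the horizon T from a geometric distribution with parameter 1 − γ makes the truncated, undiscounted return an unbiased estimate of the discounted Q-value, since P(T ≥ t) = γ^t.

import numpy as np

def unbiased_q_estimate(env, policy, s0, a0, gamma=0.99, rng=None):
    """Return an unbiased sample of Q(s0, a0) using a random rollout horizon.

    With P(T = k) = (1 - gamma) * gamma**k we have P(T >= t) = gamma**t, so
    E[sum_{t=0}^{T} r_t] = sum_t gamma**t r_t, the discounted Q-value.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = rng.geometric(1.0 - gamma) - 1   # numpy's geometric is supported on {1, 2, ...}
    s, a, q_hat = s0, a0, 0.0
    for _ in range(T + 1):
        s, r, done = env.step(s, a)      # hypothetical transition interface
        q_hat += r                       # undiscounted sum over the random horizon
        if done:
            break
        a = policy.sample(s)             # hypothetical policy interface
    return q_hat

# Toy deterministic chain used only to exercise the estimator (illustrative).
class ChainEnv:
    def step(self, s, a):                # the single action moves right
        s_next = min(s + 1, 5)
        return s_next, 1.0, s_next == 5  # reward 1 per step until state 5

class RightPolicy:
    def sample(self, s):
        return 0

samples = [unbiased_q_estimate(ChainEnv(), RightPolicy(), 0, 0, gamma=0.9)
           for _ in range(20_000)]
print(np.mean(samples))                  # approx sum_{t=0}^{4} 0.9**t = 4.0951

The random horizon removes the truncation bias of a fixed-length rollout, which is what underlies the unbiased-gradient claim above.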