joint work with Benjamin Recht
We study the sample complexity of popular model-based and model-free algorithms on the Linear Quadratic Regulator (LQR). For policy evaluation, we show that a simple model-based plugin method requires asymptotically fewer samples than least-squares temporal difference (LSTD) learning to reach similar performance; the sample complexity gap between the two methods is at least a factor of the state dimension. For simple problem instances, nominal (certainty equivalence) control requires several factors of state and input dimension fewer samples than the policy gradient method to reach similar performance.
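As a rough illustration of the model-based plugin evaluator (a minimal sketch on a hypothetical toy instance, not the experiments from the talk): roll out the closed-loop system under a fixed policy, estimate the closed-loop matrix by least squares, and plug it into the Lyapunov equation for the value matrix P.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical 3-state, 1-input LQR instance; all numbers are for illustration only.
rng = np.random.default_rng(0)
n, d = 3, 1
A = 0.9 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
B = rng.standard_normal((n, d))
Q, R = np.eye(n), np.eye(d)
K = -0.1 * rng.standard_normal((d, n))   # fixed policy u = K x to evaluate
L = A + B @ K                            # closed-loop dynamics
cost = Q + K.T @ R @ K

# Roll out one trajectory x_{t+1} = L x_t + w_t under the policy.
T = 500
X = np.zeros((T + 1, n))
X[0] = rng.standard_normal(n)
for t in range(T):
    X[t + 1] = L @ X[t] + 0.1 * rng.standard_normal(n)

# Model-based plugin: least-squares estimate of the closed-loop matrix,
# then solve P = L' P L + (Q + K' R K) with the estimate plugged in.
M, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)   # X[1:] ~ X[:-1] @ L'
L_hat = M.T
P_plugin = solve_discrete_lyapunov(L_hat.T, cost)

# Ground truth for comparison (LSTD would instead regress on features of x x').
P_true = solve_discrete_lyapunov(L.T, cost)
print("plugin error:", np.linalg.norm(P_plugin - P_true))
```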
joint work with Belhal Karimi, Błażej Miasojedow, Eric Moulines
In this talk, we analyze a general stochastic approximation (SA) scheme for minimizing a non-convex, smooth objective function. We consider an update procedure whose drift term depends on a state-dependent Markov chain and whose mean field is not necessarily of gradient type; the scheme is therefore biased. Importantly, we provide the first non-asymptotic convergence rate in this dynamical setting. We illustrate the setting with the policy-gradient method for average-reward maximization, highlighting the bias-variance tradeoff in its performance.
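For concreteness, here is a minimal sketch (my own toy example, not the talk's) of the kind of biased SA scheme with state-dependent Markovian noise that the average-reward policy-gradient illustration falls into: a tabular softmax policy updated online with a truncated eligibility trace, so the state-action process is a Markov chain whose kernel depends on the current iterate, and the truncation introduces bias.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; transition and reward tables are random.
rng = np.random.default_rng(0)
nS, nA = 2, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
Rw = rng.uniform(size=(nS, nA))                  # reward r(s, a)

def policy(theta, s):
    # softmax policy pi_theta(a|s) proportional to exp(theta[s, a])
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

theta = np.zeros((nS, nA))
eta = 0.0                      # running estimate of the average reward
trace = np.zeros_like(theta)
lam = 0.9                      # trace decay: smaller lam -> more bias, less variance
s = 0
for k in range(1, 20001):
    gamma = 0.1 / np.sqrt(k)   # step size gamma_k
    p = policy(theta, s)
    a = rng.choice(nA, p=p)
    r = Rw[s, a]
    s_next = rng.choice(nS, p=P[s, a])

    # score function grad_theta log pi_theta(a|s) for the softmax policy
    score = np.zeros_like(theta)
    score[s] = -p
    score[s, a] += 1.0

    trace = lam * trace + score          # truncated eligibility trace (source of bias)
    theta += gamma * (r - eta) * trace   # biased SA / policy-gradient update
    eta += gamma * (r - eta)             # average-reward tracking
    s = s_next

print("estimated average reward:", eta)
```

The samples (s_k, a_k) are drawn from a Markov chain whose transition kernel depends on theta through the policy, which is exactly the state-dependent, non-i.i.d. noise structure the analysis covers.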