Communication overhead is a key bottleneck that hinders the perfect scalability of distributed optimization algorithms. Various recent works have proposed quantization or sparsification techniques to reduce the amount of data that needs to be communicated. We analyze Stochastic Gradient Descent (SGD) with compression of all exchanged messages (e.g., sparsification or quantization) and show that these schemes converge at the same rate as vanilla SGD when equipped with an error-compensation technique, i.e., keeping track of the accumulated compression errors in memory.
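To make the error-compensation mechanism concrete, here is a minimal NumPy sketch of compressed SGD with error feedback. The top-k compressor, step size, and toy quadratic objective are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np

def top_k(v, k):
    """Compressor: keep the k largest-magnitude entries of v, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_sgd(grad, x0, lr=0.1, k=2, steps=200):
    """SGD where only a compressed message is applied, with error compensation:
    whatever the compressor drops is kept in memory and re-injected next step."""
    x, e = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        p = lr * grad(x) + e    # add back the accumulated compression error
        delta = top_k(p, k)     # the compressed message that is communicated
        e = p - delta           # remember what the compressor dropped
        x = x - delta
    return x

# Toy usage: minimize ||x||^2 with noisy gradients; x approaches zero even
# though only 2 of 10 coordinates are transmitted per step.
rng = np.random.default_rng(0)
grad = lambda x: 2 * x + 0.01 * rng.standard_normal(x.shape)
print(ef_sgd(grad, np.ones(10)))
```

The key invariant is that nothing the compressor discards is permanently lost: the residual `e` is carried forward and eventually transmitted, which is what allows the scheme to match the rate of vanilla SGD.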
joint work with Albert Berahas and Majid Jahani
We present two sampled quasi-Newton methods for deep learning: sampled LBFGS (S-LBFGS) and sampled LSR1 (S-LSR1). In contrast to the classical variants of these methods, which sequentially build Hessian or inverse-Hessian approximations as the optimization progresses, our proposed methods sample points randomly around the current iterate at every iteration to construct these approximations. As a result, the approximations make use of more reliable (recent and local) information and do not depend on past iterates, whose information could be significantly stale.
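Below is a minimal sketch of the sampling idea behind S-LBFGS: at each iterate, curvature pairs are built from freshly sampled points around the current point and fed into the standard L-BFGS two-loop recursion. The sampling radius, helper names, and toy quadratic are assumptions for illustration only.

```python
import numpy as np

def sample_curvature_pairs(w, grad, m=5, radius=0.1, rng=None):
    """Sample m points near w; return pairs s_i = w_i - w, y_i = grad(w_i) - grad(w)."""
    rng = rng or np.random.default_rng(0)
    g = grad(w)
    pairs = []
    for _ in range(m):
        s = radius * rng.standard_normal(w.shape)
        y = grad(w + s) - g
        if s @ y > 1e-10:           # keep only pairs with positive curvature
            pairs.append((s, y))
    return pairs

def two_loop(g, pairs):
    """Standard L-BFGS two-loop recursion applied to freshly sampled pairs."""
    q, alphas = g.copy(), []
    for s, y in reversed(pairs):
        a = (s @ q) / (s @ y)
        alphas.append(a)
        q -= a * y
    s, y = pairs[-1]
    q *= (s @ y) / (y @ y)          # initial Hessian scaling
    for (s, y), a in zip(pairs, reversed(alphas)):
        b = (y @ q) / (s @ y)
        q += (a - b) * s
    return q                        # approximates inverse-Hessian times g

# Toy usage: one full S-LBFGS step on the quadratic f(w) = 2 * ||w||^2.
grad = lambda w: 4 * w
w = np.ones(10)
w = w - two_loop(grad(w), sample_curvature_pairs(w, grad))
print(w)                            # ~0: recovers the Newton step on this quadratic
```

Because the pairs are regenerated from scratch around the current iterate, the curvature model reflects the local landscape rather than a history of possibly stale iterates.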
joint work with Mahmoud Assran, Mike Rabbat, and Nicolas Ballas
Distributed data-parallel algorithms aim to accelerate the training of deep neural networks by parallelizing the computation of large mini-batch gradient updates across multiple nodes. In this work, we study Stochastic Gradient Push (SGP), which combines the PushSum gossip protocol with stochastic gradient updates. We prove that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD, that all nodes achieve consensus, and that SGP achieves a linear speedup with respect to the number of nodes. Finally, we empirically validate the performance of SGP.
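The following is a minimal single-machine simulation of the SGP update, assuming a small directed graph with a column-stochastic mixing matrix and toy quadratic losses; the topology, constants, and variable names are illustrative assumptions. Each node takes a local SGD step on its PushSum numerator, gossips, and evaluates gradients at the de-biased ratio.

```python
import numpy as np

n, d, lr, steps = 4, 3, 0.1, 300
rng = np.random.default_rng(0)

# Column-stochastic mixing matrix for a directed graph: every node sends half
# of its mass to itself and half to its out-neighbor; node 0 instead splits
# its mass three ways, so P is not doubly stochastic and the PushSum weights
# are genuinely needed to de-bias the averages.
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = P[(i + 1) % n, i] = 0.5
P[:, 0] = 0.0
P[[0, 1, 2], 0] = 1.0 / 3.0

targets = rng.standard_normal((n, d))   # node i's local optimum
x = np.zeros((n, d))                    # PushSum numerators
w = np.ones(n)                          # PushSum weights

for _ in range(steps):
    z = x / w[:, None]                  # de-biased parameter estimates
    g = (z - targets) + 0.01 * rng.standard_normal((n, d))  # local stochastic grads
    x = P @ (x - lr * g)                # local SGD step, then PushSum gossip
    w = P @ w

print(x / w[:, None])                   # every row approaches the consensus point,
print(targets.mean(axis=0))             # the minimizer of the average local loss
```

Only the ratio z = x / w is meaningful as a parameter estimate; mixing the numerators and weights with the same column-stochastic matrix is what lets all nodes agree on the network-wide average despite the directed, non-symmetric communication.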