Stochastic Distributed Learning with Gradient Quantization and Double Variance Reduction


We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g. quantize or sparsify) the gradients, thereby introducing additional variance $ \omega\ge $ 1 that might slow down convergence. For strongly convex functions with condition number kappa distributed among $n$ machines, we (i) give a scheme that converges in $O((\kappa+\kappa \omega / n+\omega) \log(1/\epsilon))$ steps to a neighborhood of the optimal solution. For objective functions with a finite-sum structure, each worker having less than m components, we (ii) present novel variance reduced schemes that converge in $O((\kappa+\kappa \omega / n+\omega+m)log(1/\epsilon))$ steps to arbitrary accuracy $\epsilon>0$.

Optimization Methods and Software