Konstantin Mishchenko

Postdoctoral Researcher

Inria Sierra


I’m a postdoc at Inria Sierra working with Alexandre d’Aspremont and Francis Bach. I received my PhD in computer science from KAUST, where I worked under the supervision of Peter Richtárik on optimization theory and its applications in machine learning. In 2020, I interned at Google Brain hosted by Nicolas Le Roux and Courtney Paquette. Prior to that, I obtained my double degree MSc diploma from École Normale Supérieure Paris-Saclay and Paris-Dauphine, and a BSc from Moscow Institute of Physics and Technology.

My hobbies include squash, ultimate frisbee, and bouldering.

  • Optimization
  • Deep learning
  • Federated and distributed learning

  • PhD in Computer Science, 2021

    KAUST

  • MSc in Data Science, 2017

    École normale supérieure Paris-Saclay and Paris-Dauphine

  • BSc in Computer Science and Physics, 2016

    Moscow Institute of Physics and Technology


Inria Sierra
Dec 2021 – Present Paris, France

Research directions:

  • Adaptive algorithms
  • Second-order algorithms
  • Distributed training

Recent Posts

New paper: Asynchronous SGD with arbitrary delays
My first ever optimization project was an ICML paper about an asynchronous gradient method. At the time, I was quite confused by the fact that no matter what I did, asynchronous gradient descent still converged. Five years later, I can finally give an answer: Asynchronous SGD doesn’t care about the delays, which we proved in our new paper (https://arxiv.org/abs/2206.07638). For a short summary, you can read my Twitter thread about the paper or check my slides.
I'm attending the SICO conference, 12-15 June

In a few days, I am travelling to Autrans, Vercors near Grenoble for the SICO conference dedicated to the 60th birthday of Anatoli Juditsky. The conference will feature a number of speakers working on optimization and statistics. As I did my master’s thesis at the University of Grenoble, I’m really happy to go there again after having been away for almost 5 years.

On the last day of the conference, I will give a talk about a new paper on Asynchronous SGD. The work that I will present is also going to appear online quite soon.

2 papers accepted to ICML

Two of my papers got accepted for presentation at ICML:

The first of these two papers was a first-time submission and the latter was a resubmission. Earlier, we opted in to publicly release the NeurIPS 2021 reviews for the Prox RR paper, so the ICML reviewers could see (if they searched) that our work had previously been rejected. Nevertheless, it was recommended for acceptance.
Although I’m happy about my papers, I feel there is still a lot of change required to fix the reviewing process. One thing I’m personally waiting for is for every conference to use OpenReview instead of CMT. OpenReview gives authors the opportunity to write individual responses to the reviewers and supports LaTeX in the editor, both of which are amazing.
If your paper did not get accepted, don’t take it as strong evidence that your work is not appreciated; rejection often happens to high-quality work. A good example is the recent revelation by Mark Schmidt on Twitter that their famous SAG paper was rejected from ICML 2012.

Recent Papers

(2022). Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays.

PDF Cite Slides arXiv

(2022). ProxSkip: Yes! Local Gradient Steps Provably Lead to Communication Acceleration! Finally!

PDF Cite Video arXiv

(2021). IntSGD: Adaptive Floatless Compression of Stochastic Gradients.

PDF Cite Code Poster Slides arXiv ICLR

(2021). Proximal and Federated Random Reshuffling.

PDF Cite Code Slides Video arXiv

(2020). Dualize, Split, Randomize: Fast Nonsmooth Optimization Algorithms.

PDF Cite Poster arXiv

(2019). First Analysis of Local GD on Heterogeneous Data.

PDF Cite Slides arXiv NeurIPS

(2019). MISO is Making a Comeback With Better Proofs and Rates.

PDF Cite arXiv

(2019). DAve-QN: A Distributed Averaged Quasi-Newton Method with Local Superlinear Convergence Rate.


(2019). Revisiting Stochastic Extragradient.

PDF Cite Slides arXiv AISTATS

(2019). Stochastic Distributed Learning with Gradient Quantization and Variance Reduction.

PDF Cite arXiv

(2019). Distributed Learning with Compressed Gradient Differences.

PDF Cite arXiv

(2018). SEGA: Variance Reduction via Gradient Sketching. In Advances in Neural Information Processing Systems, 2018.

PDF Cite arXiv NIPS

(2018). A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning. International Conference on Machine Learning.


(2018). A Distributed Flexible Delay-tolerant Proximal Gradient Algorithm.

PDF Cite arXiv SIAM