# Research directions in optimization

A student reached out asking for advice on research directions in optimization, so I wrote a long response with pointers to interesting papers. I thought it’d be worth sharing it here too:

- Adaptive optimization. There has been a lot going on in the last year; below are some papers I personally found interesting. First of all, there is this paper by Li and Lan on Nesterov acceleration of adaptive gradient descent: https://arxiv.org/abs/2310.10082 Check Corollary 1 for a simple description of their method. There is one thing I don’t like about it: the amount by which the stepsize can increase at each iteration shrinks as t grows. That being said, I don’t know if this restriction can be lifted; perhaps it’s the best we can get.

Yura Malitsky and I also did some work on adaptive gradient descent, making the stepsizes a bit larger (roughly a sqrt(2) improvement over our previous result): https://arxiv.org/abs/2308.02261 We still don’t know if that’s the best we can do or if a tighter analysis can give us better methods.
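To make the flavor of these adaptive stepsizes concrete, here is a rough sketch of adaptive gradient descent in the spirit of this line of work: the stepsize is the minimum of a growth cap and an inverse estimate of the local Lipschitz constant. The constants and the initialization below are illustrative choices of mine, not the exact ones from the paper.

```python
# Rough sketch of adaptive gradient descent (AdGD-style); constants
# and initialization are illustrative, not the paper's exact choices.

def grad(x):
    # Toy objective f(x) = 0.5 * x**2, so the gradient is x.
    return x

def adaptive_gd(x0, n_iters=100):
    x_prev = x0
    lam_prev = 1e-10           # tiny initial stepsize
    x = x0 - lam_prev * grad(x0)
    theta = 1e9                # effectively no cap on the first increase
    for _ in range(n_iters):
        g, g_prev = grad(x), grad(x_prev)
        # Inverse local Lipschitz estimate: 1 / (2 L_k) with
        # L_k = |g - g_prev| / |x - x_prev|.
        diff = abs(g - g_prev)
        local = abs(x - x_prev) / (2 * diff) if diff > 0 else float("inf")
        lam = min((1 + theta) ** 0.5 * lam_prev, local)
        theta = lam / lam_prev
        x_prev, x = x, x - lam * g
        lam_prev = lam
    return x

print(adaptive_gd(5.0))  # approaches the minimizer 0
```

The point of the sketch is the interplay between the two terms in the `min`: the method never needs to know the global Lipschitz constant, but the growth cap limits how fast the stepsize can recover after the local estimate forces it down.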

I should also mention that there is a continued push in the literature on the Polyak stepsize, see for instance these two papers: https://arxiv.org/abs/2407.04358 (a stepsize very similar to Polyak’s) https://arxiv.org/abs/2406.04142 (Polyak stepsize with momentum)
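For reference, the classical Polyak stepsize sets gamma_k = (f(x_k) - f*) / ||grad f(x_k)||^2, which requires knowing the optimal value f*; relaxing that requirement is much of what the papers above are about. A minimal one-dimensional sketch (the toy problem is my own choice):

```python
# Gradient descent with the Polyak stepsize; needs the optimal value f_star.
def polyak_gd(f, grad_f, x0, f_star, n_iters=50):
    x = x0
    for _ in range(n_iters):
        g = grad_f(x)
        if g == 0:
            break
        gamma = (f(x) - f_star) / g ** 2  # the Polyak stepsize
        x -= gamma * g
    return x

# Toy problem: f(x) = 0.5 * (x - 3)**2, minimized at x = 3 with f_star = 0.
f = lambda x: 0.5 * (x - 3) ** 2
grad_f = lambda x: x - 3
print(polyak_gd(f, grad_f, 10.0, 0.0))  # approaches the minimizer 3
```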

Adagrad-like methods can still be studied; I believe it’s an underexplored direction. I wish there were more papers studying the importance of coordinate-wise stepsizes. One paper on the topic I really liked is this study of when Adam is more useful than SGD: https://arxiv.org/abs/2402.19449 There is also some research on new practical methods, for instance, acceleration of DoG is interesting: https://arxiv.org/abs/2404.00666 And I also enjoyed reading this paper by Rodomanov et al. on line-search-inspired stochastic methods: https://arxiv.org/abs/2402.03210
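To illustrate what coordinate-wise stepsizes mean, here is a minimal sketch of diagonal Adagrad, where each coordinate is scaled by the square root of its own accumulated squared gradients; the toy objective and constants are my own choices:

```python
import math

# Diagonal Adagrad: each coordinate gets its own stepsize
# eta / sqrt(accumulated squared gradients of that coordinate).
def adagrad(grad_f, x0, eta=1.0, eps=1e-8, n_iters=100):
    x = list(x0)
    accum = [0.0] * len(x)  # per-coordinate sum of squared gradients
    for _ in range(n_iters):
        g = grad_f(x)
        for i in range(len(x)):
            accum[i] += g[i] ** 2
            x[i] -= eta / (math.sqrt(accum[i]) + eps) * g[i]
    return x

# Badly scaled quadratic f(x) = 50*x[0]**2 + 0.5*x[1]**2: a single global
# stepsize must be small for x[0], while Adagrad adapts per coordinate.
grad_f2 = lambda x: [100.0 * x[0], 1.0 * x[1]]
print(adagrad(grad_f2, [1.0, 1.0]))  # both coordinates approach 0
```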

I also like the direction of getting better assumptions for optimization theory and studying their implications. A good example is the gradient-clipping literature: https://arxiv.org/abs/1905.11881 ((L0, L1)-smoothness) https://arxiv.org/abs/2305.01588 (the same assumption revisited) https://arxiv.org/abs/2406.04443 (on heavy-tailed noise) We need to bridge optimization assumptions with what we know about neural networks, so it also helps to read about the properties of neural networks themselves, for example: https://arxiv.org/abs/2405.14813 (on the scales of layers and how their type affects Lipschitz constants)
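For concreteness, here is a minimal sketch of the clipped gradient step that this literature analyzes: the gradient is rescaled so its Euclidean norm never exceeds a threshold before the step is taken. The stepsize and threshold below are illustrative.

```python
import math

# One step of clipped gradient descent: rescale the gradient so its
# Euclidean norm is at most `clip` before taking a step of size eta.
def clipped_gd_step(x, g, eta=0.1, clip=1.0):
    norm = math.sqrt(sum(gi * gi for gi in g))
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    return [xi - eta * scale * gi for xi, gi in zip(x, g)]

# ||g|| = 50, so the gradient is rescaled to norm clip = 1 and the
# resulting step has norm eta * clip = 0.1.
print(clipped_gd_step([0.0, 0.0], [30.0, 40.0]))
```

The intuition behind (L0, L1)-smoothness is that the local smoothness constant is allowed to grow with the gradient norm, which is exactly the regime where capping the step like this helps.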

These days, people are using deep networks of all scales for their tasks, and they have discovered a lot of tricks that haven’t been studied thoroughly in the optimization literature: quantization and the Straight-Through Estimator (https://arxiv.org/abs/1903.05662), low-rank techniques such as LoRA, learning-rate warm-up, etc. You should expose yourself to those tricks to get a better understanding of what the current theory is lacking.
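As one example of such a trick, here is a minimal pure-Python sketch of the Straight-Through Estimator for a sign quantizer: the forward pass binarizes the weight, while the backward pass passes the gradient through as if the quantizer were the identity. The clipped variant and the constants are illustrative.

```python
# Straight-Through Estimator for a sign quantizer: the forward pass
# binarizes the weight, while the backward pass treats the quantizer
# as the identity (here with the common clipped variant).

def quantize_forward(w):
    return 1.0 if w >= 0 else -1.0  # binarize to {-1, +1}

def ste_backward(grad_out, w, clip=1.0):
    # Pass the incoming gradient through unchanged, but zero it out
    # where |w| exceeds the clipping threshold.
    return grad_out if abs(w) <= clip else 0.0

w = 0.3
print(quantize_forward(w))   # the forward pass sees the quantized weight
print(ste_backward(2.0, w))  # the backward pass sees the identity gradient
```

The mystery from the theory side is that the "gradient" computed this way is not the gradient of any function the forward pass evaluates, yet training with it works.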

If you’re considering choosing optimization as the topic for your PhD, here are some extra thoughts. Right now there is less activity than about five years ago: most of the low-hanging fruit seems to have been picked, and the remaining questions are quite challenging. So if you’re looking for a field where it is easy to get publications, this might not be the best choice. However, it’s still a good field for producing meaningful theory. It also matters who you work with: finding a good advisor often affects one’s satisfaction to a larger degree than the topic itself, so make your decision carefully.

As my last word of advice, I definitely encourage testing new methods on neural networks (and preferably not on CIFAR10/CIFAR100, because they give misleading results), at least on something like nanoGPT (https://github.com/karpathy/nanoGPT). When I was a PhD student, I did a lot of theoretical research testing my methods on logistic regression; that was useful for understanding the theory, but it also gave me the wrong impression about what works and what doesn’t. If you can, do both: understand the theory as much as you can, but also learn its limits and failure modes.