The focus on optimization is a major trend in modern machine learning. In turn, a number of optimization methods have recently been developed, motivated by machine learning applications. However, most optimization guarantees concern the training error, ignoring performance at test time, which is the real goal in machine learning. In this talk, we take steps to fill this gap in the context of least squares learning. We analyze the learning (test) performance of accelerated and stochastic gradient methods, and in particular we discuss the influence of different learning assumptions.
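As a toy illustration of this setup (not from the talk itself), the sketch below runs plain gradient descent on a synthetic least squares problem and tracks the training error alongside the test error; the dimensions, noise level, and step-size rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 50

# Synthetic least squares data: linear signal plus Gaussian label noise.
w_star = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
y_tr = X_tr @ w_star + 0.5 * rng.normal(size=n_train)
X_te = rng.normal(size=(n_test, d))
y_te = X_te @ w_star + 0.5 * rng.normal(size=n_test)

# Step size 1/L, where L = sigma_max(X)^2 / n is the smoothness
# constant of the objective (1/2n) * ||Xw - y||^2.
lr = n_train / np.linalg.norm(X_tr, 2) ** 2
w = np.zeros(d)
for t in range(500):
    grad = X_tr.T @ (X_tr @ w - y_tr) / n_train
    w -= lr * grad
    if (t + 1) % 100 == 0:
        tr = np.mean((X_tr @ w - y_tr) ** 2)
        te = np.mean((X_te @ w - y_te) ** 2)
        print(f"iter {t + 1:4d}  train MSE {tr:.4f}  test MSE {te:.4f}")
```

The training error keeps decreasing with more iterations, while the test error bottoms out near the noise level; how the gap between the two behaves under different learning assumptions is the kind of question the abstract refers to.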
Many neural nets are trained in a regime where the number of model parameters exceeds the size of the training dataset. Due to their over-parameterized nature, these models have the capacity to (over)fit any set of labels, including pure noise. Despite this high fitting capacity, somewhat paradoxically, models trained via first-order methods continue to predict well on yet-unseen test data. In this talk I will discuss some results aimed at demystifying such phenomena by demonstrating that gradient methods (1) reach global optima, (2) are robust to corruption, and (3) generalize to new test data.
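A minimal linear sketch of this paradox (an illustration under assumptions, not the speaker's experiments): with more parameters than samples, the minimum-norm least squares solution, which is what gradient descent from zero initialization converges to, interpolates any labels, including pure noise, yet it can still generalize when the labels carry signal. The dimensions and noise level below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 400, 500                              # over-parameterized: d > n
w_star = rng.normal(size=d) / np.sqrt(d)     # ground-truth predictor
X = rng.normal(size=(n, d))
X_te = rng.normal(size=(2000, d))

labels = {
    "noisy signal": X @ w_star + 0.1 * rng.normal(size=n),
    "pure noise":   rng.normal(size=n),
}
for name, y in labels.items():
    # Minimum-norm interpolant: the limit of gradient descent started at 0.
    w = np.linalg.pinv(X) @ y
    train_mse = np.mean((X @ w - y) ** 2)
    test_mse = np.mean((X_te @ w - X_te @ w_star) ** 2)
    print(f"{name:12s}  train MSE {train_mse:.1e}  test MSE {test_mse:.3f}")
```

Both label sets are fit essentially exactly (train MSE near zero), but only the noisy-signal fit predicts well out of sample; the pure-noise fit has large test error, separating fitting capacity from generalization.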
Classical statistics teaches the dangers of overfitting. Yet, in modern practice, deep networks that interpolate the data (achieve near-zero training loss) show excellent test performance. In this talk I will show how classical and modern models can be reconciled within a single "double descent" risk curve, which extends the usual U-shaped bias-variance trade-off curve beyond the point of interpolation. This understanding of model performance delineates the limits of classical analyses and opens new lines of inquiry into the generalization and optimization properties of these models.
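The double descent shape can be reproduced in a simple least squares experiment (an illustrative sketch with assumed dimensions, not the talk's figure): fit a minimum-norm linear model on the first d of D available features and sweep d past the sample size n. The test risk rises toward the interpolation threshold d = n, spikes there, and descends again in the over-parameterized regime.

```python
import numpy as np

rng = np.random.default_rng(2)
n, D, sigma = 40, 200, 0.5
w_star = rng.normal(size=D) / np.sqrt(D)     # ground truth over all D features

Z_tr = rng.normal(size=(n, D))
y_tr = Z_tr @ w_star + sigma * rng.normal(size=n)
Z_te = rng.normal(size=(2000, D))
y_te = Z_te @ w_star                         # clean targets for the test risk

for d in [10, 20, 35, 40, 45, 80, 200]:      # model size sweeps past d = n
    # Minimum-norm least squares fit using only the first d features.
    w = np.linalg.pinv(Z_tr[:, :d]) @ y_tr
    risk = np.mean((Z_te[:, :d] @ w - y_te) ** 2)
    print(f"d = {d:3d}  test risk {risk:.3f}")
```

The printed risks trace both halves of the curve: the classical U shape for d < n, a peak near d = n, and a second descent beyond it, where the min-norm solution among all interpolants becomes increasingly well behaved.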