Provable surrogate gradient-based optimization for unsupervised learning
Gradient-based optimization lies at the core of modern machine learning and deep learning, with (stochastic) gradient descent algorithms being employed as the main workhorse. The unprecedented success of deep learning over the last decade has arguably been tied with the popularity and mysterious success of these algorithms --- particularly for supervised learning.
This success is despite the fact that objective functions in deep learning are extremely non-linear and non-convex.
The past few years have witnessed great theoretical advances in analyzing optimization and inductive biases of gradient descent for supervised learning. However, the majority of existing work only applies to settings such as classification and regression. In contrast, the role of gradient descent in the unsupervised setting has gained far less attention. In this work, we make concrete contributions to the understanding of gradient-based optimization in unsupervised learning.
We start with dictionary learning, an unsupervised feature learning mechanism widely used in signal processing and machine learning. The primary goal of dictionary learning is to learn sparse, linear representations of the data by minimizing the reconstruction loss. In this problem, the objective function is coupled with an intractable sparse coding step due to the latent representations. Therefore, the gradient of the loss with respect to the model parameters can not be obtained exactly but is only a noisy estimate of the true gradient. However, gradient-based alternating minimization for dictionary learning works surprisingly well in practice while theoretical understanding of the success has lagged behind. We will refer to this method as "surrogate" gradient-based optimization.
In Chapter 1 and Chapter 2, we introduce two surrogate gradient descent algorithms for sparse coding. The first algorithm learns a double-sparsity model where the dictionary is the product of a fixed, known basis and a learnable sparse component. The second algorithm provably learns a dictionary from samples with missing entries. In each case, we provide a spectral initialization subroutine that gives a coarse estimate of the true dictionary. Then, starting from this estimate, we prove that the surrogate descent algorithm linearly converges to the true dictionary. We analyze the algorithm and demonstrate superior sample complexity and computational complexity bounds over existing provable approaches.
While sparse coding is still widely used, its computational cost is prohibitive for high-dimensional data. Autoencoders have instead emerged as an efficient and flexible alternative for feature learning using neural networks. In Chapter 3, we build upon our theory of surrogate gradient developed in the previous chapters to provide a series of results for autoencoder learning. For several generative models of data, we prove that when trained with gradient descent, two-layer weight-tied autoencoders can successfully recover the ground-truth parameters of the corresponding models. Our analysis establishes theoretical evidence that shallow autoencoder modules can indeed be powerful feature learning mechanisms for a variety of data models. In Chapter 4, we go beyond the local analysis in Chapter 3 and analyze the gradient dynamics of over-parameterized autoencoders. Under a few mild assumptions about the given training dataset, we rigorously prove the linear convergence of gradient descent for randomly initialized autoencoder networks. Our analysis mirrors the recent advances in the emerging theory of neural tangent kernels.
Chapter 5 considers a black-box optimization problem where the objective and constraints are specified as solutions to expensive PDE solvers. We pose this optimization as sampling from a Gibbs distribution with a black-box energy function and perform Langevin sampling by using surrogate gradients of the black-box functions learned by deep neural networks. We prove the convergence of the surrogate Langevin dynamics when the target distribution is log-concave and smooth. Finally, in Chapter 6, we lay out several potential directions that merge two lines of research in gradient flow analysis and Langevin dynamics, as well as inverse problems with generative priors.