Provable surrogate gradient-based optimization for unsupervised learning

dc.contributor.advisor Chinmay Hegde
dc.contributor.author Nguyen, Thanh
dc.contributor.department Electrical and Computer Engineering
dc.date 2020-06-26T19:54:08.000
dc.date.accessioned 2020-06-30T03:21:50Z
dc.date.available 2020-06-30T03:21:50Z
dc.date.copyright Fri May 01 00:00:00 UTC 2020
dc.date.embargo 2020-12-16
dc.date.issued 2020-01-01
dc.description.abstract <p>Gradient-based optimization lies at the core of modern machine learning and deep learning, with (stochastic) gradient descent algorithms being employed as the main workhorse. The unprecedented success of deep learning over the last decade has arguably been tied to the popularity and mysterious success of these algorithms --- particularly for supervised learning.</p> <p>This success is despite the fact that objective functions in deep learning are extremely non-linear and non-convex.</p> <p>The past few years have witnessed great theoretical advances in analyzing the optimization and inductive biases of gradient descent for supervised learning. However, the majority of existing work applies only to settings such as classification and regression. In contrast, the role of gradient descent in the unsupervised setting has received far less attention. In this work, we make concrete contributions to the understanding of gradient-based optimization in unsupervised learning.</p> <p>We start with dictionary learning, an unsupervised feature learning mechanism widely used in signal processing and machine learning. The primary goal of dictionary learning is to learn sparse, linear representations of the data by minimizing the reconstruction loss. In this problem, the objective function is coupled with an intractable sparse coding step due to the latent representations. Therefore, the gradient of the loss with respect to the model parameters cannot be obtained exactly; only a noisy estimate of the true gradient is available. Nevertheless, gradient-based alternating minimization for dictionary learning works surprisingly well in practice, while a theoretical understanding of this success has lagged behind. We refer to this method as "surrogate" gradient-based optimization.</p> <p>In Chapters 1 and 2, we introduce two surrogate gradient descent algorithms for sparse coding. 
The first algorithm learns a double-sparsity model in which the dictionary is the product of a fixed, known basis and a learnable sparse component. The second algorithm provably learns a dictionary from samples with missing entries. In each case, we provide a spectral initialization subroutine that gives a coarse estimate of the true dictionary. Starting from this estimate, we prove that the surrogate descent algorithm converges linearly to the true dictionary. We analyze each algorithm and establish sample- and computational-complexity bounds superior to those of existing provable approaches.</p> <p>While sparse coding is still widely used, its computational cost is prohibitive for high-dimensional data. Autoencoders have instead emerged as an efficient and flexible alternative for feature learning using neural networks. In Chapter 3, we build upon the surrogate-gradient theory developed in the previous chapters to provide a series of results for autoencoder learning. For several generative models of data, we prove that, when trained with gradient descent, two-layer weight-tied autoencoders can successfully recover the ground-truth parameters of the corresponding models. Our analysis establishes theoretical evidence that shallow autoencoder modules can indeed be powerful feature learning mechanisms for a variety of data models. In Chapter 4, we go beyond the local analysis of Chapter 3 and analyze the gradient dynamics of over-parameterized autoencoders. Under a few mild assumptions on the training dataset, we rigorously prove the linear convergence of gradient descent for randomly initialized autoencoder networks. Our analysis mirrors recent advances in the emerging theory of neural tangent kernels.</p> <p>Chapter 5 considers a black-box optimization problem in which the objective and constraints are specified as solutions to expensive PDE solvers. 
We pose this optimization as sampling from a Gibbs distribution with a black-box energy function, and perform Langevin sampling using surrogate gradients of the black-box functions learned by deep neural networks. We prove the convergence of the surrogate Langevin dynamics when the target distribution is log-concave and smooth. Finally, in Chapter 6, we lay out several potential directions that merge two lines of research, gradient-flow analysis and Langevin dynamics, as well as connections to inverse problems with generative priors.</p>
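As a rough illustration of the surrogate-gradient idea described in the abstract, the following toy sketch alternates a crude sparse-coding step (hard thresholding, standing in for the intractable exact sparse coding) with a gradient step on the reconstruction loss evaluated at the estimated codes. All dimensions, the step size, the noise level of the coarse initialization, and the thresholding estimator are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: samples y = A* x with k-sparse codes x and an unknown
# unit-norm dictionary A*. All sizes here are illustrative.
n, m, k, T = 32, 16, 3, 2000
A_true = rng.standard_normal((n, m))
A_true /= np.linalg.norm(A_true, axis=0)
X = np.zeros((m, T))
for t in range(T):
    supp = rng.choice(m, size=k, replace=False)
    X[supp, t] = rng.standard_normal(k)
Y = A_true @ X

def hard_threshold(V, k):
    """Keep the k largest-magnitude entries of each column (a crude sparse-coding step)."""
    out = np.zeros_like(V)
    idx = np.argsort(-np.abs(V), axis=0)[:k]
    np.put_along_axis(out, idx, np.take_along_axis(V, idx, axis=0), axis=0)
    return out

# Coarse initial estimate (standing in for a spectral initialization).
A = A_true + 0.05 * rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)
err0 = np.linalg.norm(A - A_true) / np.linalg.norm(A_true)

eta = 0.5 / T
for _ in range(50):
    X_hat = hard_threshold(A.T @ Y, k)   # estimated (not exact) codes
    grad = (A @ X_hat - Y) @ X_hat.T     # surrogate gradient of 0.5 * ||Y - A X_hat||_F^2
    A -= eta * grad
    A /= np.linalg.norm(A, axis=0)       # keep columns on the unit sphere

err = np.linalg.norm(A - A_true) / np.linalg.norm(A_true)
```

Because the codes `X_hat` are only estimates, `grad` is a biased, noisy surrogate for the true gradient; the point of the analysis sketched in the abstract is that, from a sufficiently good initialization, such surrogate steps still contract toward the true dictionary.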
dc.format.mimetype application/pdf
dc.identifier archive/lib.dr.iastate.edu/etd/17923/
dc.identifier.articleid 8930
dc.identifier.contextkey 18242497
dc.identifier.doi https://doi.org/10.31274/etd-20200624-102
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath etd/17923
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/32106
dc.language.iso en
dc.source.bitstream archive/lib.dr.iastate.edu/etd/17923/Nguyen_iastate_0097E_18698.pdf|||Fri Jan 14 21:31:09 UTC 2022
dc.subject.keywords Gradient Descent
dc.subject.keywords Machine Learning
dc.subject.keywords Optimization
dc.subject.keywords Provable Methods
dc.subject.keywords Unsupervised Learning
dc.title Provable surrogate gradient-based optimization for unsupervised learning
dc.type article
dc.type.genre thesis
dspace.entity.type Publication
relation.isOrgUnitOfPublication a75a044c-d11e-44cd-af4f-dab1d83339ff
thesis.degree.discipline Electrical Engineering (Communications and Signal Processing)
thesis.degree.level thesis
thesis.degree.name Doctor of Philosophy
File
Original bundle
Name: Nguyen_iastate_0097E_18698.pdf
Size: 1.85 MB
Format: Adobe Portable Document Format