Hardware-aware design, search, and optimization of deep neural networks

dc.contributor.advisor Somani, Arun
dc.contributor.advisor Tyagi, Akhilesh
dc.contributor.advisor Duwe, Henry
dc.contributor.advisor Fernandez-Baca, David
dc.contributor.advisor Roy, Vivekananda
dc.contributor.author Chitty-Venkata, Sai Subra
dc.contributor.department Department of Electrical and Computer Engineering
dc.date.accessioned 2023-08-25T19:17:29Z
dc.date.available 2023-08-25T19:17:29Z
dc.date.issued 2023-08
dc.date.updated 2023-08-25T19:17:29Z
dc.description.abstract Deep Learning has achieved remarkable progress in the last decade owing to its powerful automatic representation capability for a variety of tasks, such as Image Recognition, Speech Recognition, and Machine Translation. This success is closely tied to network design, which is crucial to feature representation and has led to many innovative architectures, such as the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Graph Neural Network (GNN), and Transformer. A wide range of hardware platforms is available to accelerate Deep Neural Networks (DNNs), from general-purpose hardware such as CPUs to special-purpose devices such as the Tensor Processing Unit (TPU). High-performance computing systems such as GPUs effectively reduce the computation time of DNNs. With the slowing of Moore's law, research on Domain-Specific Hardware, which excels at its assigned task, has gained significance. Consequently, no single platform works well in all scenarios; the choice depends on the application and environment.
Neural Architecture Search (NAS), a subset of Automatic Machine Learning (AutoML), automates the design of a neural network architecture for a given task and dataset without significant human intervention, saving the researcher's manual effort and computation time. Hardware-aware Neural Architecture Search (HW-NAS) is a class of problems whose goal is to search for networks that are not only accurate on the given dataset but also hardware-efficient in terms of latency, energy, model size, etc. The searched models outperform manually designed networks in several respects, such as model performance and inference latency on the actual hardware. NAS and HW-NAS have been very successful in finding efficient models that achieve state-of-the-art performance on many tasks, such as Image Classification, Object Detection, and Machine Translation.
Pruning and Quantization are two important techniques for designing lightweight, memory-efficient, and hardware-friendly models for inference on a variety of devices such as CPUs, GPUs, ASICs, and FPGAs. These methods have successfully compressed large networks into smaller models with negligible loss in accuracy or task performance. Neural Network Pruning removes redundant or unimportant parameters (weights, nodes, neurons, or filters) whose removal does not significantly degrade model performance, thereby reducing the size and computational complexity of a model. Network Quantization converts high-precision model weights/parameters (32-bit floating point) to low precision (8-bit or 4-bit integer). Quantization has attracted much attention in academia and industry because inference can be performed at low precision with a negligible drop in accuracy, whereas training is performed at high precision.
Weight (element-wise) pruning shrinks a DNN model significantly but introduces considerable sparsity in the weight matrices. The uniform systolic arrays in the TPU and the Tensor Cores in the Volta and Turing GPU architectures are not explicitly designed to accelerate such sparse matrices, so the speedup from weight pruning is negligible even when 90% of the parameters are removed. Several node pruning methods have since been developed to resolve these sparsity bottlenecks. However, these methods do not consider the underlying hardware dimension (array size, number of CPU cores) or Tensor Core precision, leading to suboptimal performance. We develop the Hardware Dimension Aware Pruning (HDAP) method for array-based accelerators, multi-core CPUs, and Tensor Core-enabled GPUs by taking the underlying dimension of the system into account. Networks node-pruned with HDAP achieved average speedups of 3.2x and 4.2x, whereas the baseline method attained average speedups of only 1.5x and 1.6x, on the Turing Tensor Core GPU and the Eyeriss architecture, respectively.
Hardware systems are also prone to soft errors or permanent faults caused by external conditions or internal scaling. Much prior work addresses systolic array implementations and their reliability concerns, but their fault tolerance with respect to DNN inference is not yet fully understood through a fault model. In our work, we first present a fault model, i.e., the different sequences in which faults can occur on a systolic array, and co-design a Fault-based and Array-size-based Pruning (FPAP) algorithm that bypasses the faults and removes internal redundancy at the same time for efficient inference.
The Tensor Cores in the Nvidia Ampere 100 (A100) GPU support (1) 2:4 fine-grained sparse pruning, where 2 out of every 4 elements are pruned, and (2) traditional dense multiplication, to achieve a good accuracy-performance trade-off. The A100 Tensor Core also takes advantage of 1-bit, 4-bit, and 8-bit multiplication to speed up inference. Hence, finding the right matrix type (dense or 2:4 sparse) along with the precision for each layer becomes a combinatorial problem. NAS can alleviate this problem by automating the architecture design process instead of relying on brute-force search. In this work, we propose (i) Mixed Sparse and Precision Search (MSPS), a NAS framework that searches for efficient sparse and mixed-precision quantized models within a predefined search space and a fixed backbone network (e.g., ResNet50), and (ii) Architecture, Sparse and Precision Search (ASPS), which jointly searches for the kernel size, the number of filters, and the sparse-precision combination of each layer. We illustrate the effectiveness of our methods on the A100 Tensor Cores of Nvidia GPUs by searching for efficient sparse mixed-precision networks based on ResNet50, achieving better accuracy-latency trade-offs than manually designed uniform sparse Int8 networks.
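The 2:4 fine-grained sparsity pattern described in the abstract (2 of every 4 weights pruned, as supported by A100 Sparse Tensor Cores) can be illustrated with a short sketch. This is a minimal, generic example, not the dissertation's MSPS/ASPS implementation; the function name prune_2_4 and the use of NumPy are assumptions made only for demonstration.

    # Minimal sketch of 2:4 fine-grained structured pruning: in every contiguous
    # group of 4 weights, the 2 smallest-magnitude values are zeroed, giving a
    # fixed 50% sparsity pattern (illustrative only; not the proposed method).
    import numpy as np

    def prune_2_4(weights: np.ndarray) -> np.ndarray:
        """Zero the 2 smallest-magnitude entries in every group of 4 (last axis)."""
        assert weights.shape[-1] % 4 == 0, "last dimension must be a multiple of 4"
        groups = weights.reshape(-1, 4)                   # view weights as groups of 4
        drop_idx = np.argsort(np.abs(groups), axis=1)[:, :2]  # 2 smallest per group
        mask = np.ones_like(groups, dtype=bool)
        np.put_along_axis(mask, drop_idx, False, axis=1)  # keep the 2 largest per group
        return (groups * mask).reshape(weights.shape)

    w = np.random.randn(8, 16).astype(np.float32)
    w_sparse = prune_2_4(w)
    print("sparsity:", np.mean(w_sparse == 0))            # exactly 0.5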
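Likewise, the FP32-to-INT8 conversion mentioned in the abstract can be sketched as symmetric per-tensor post-training quantization. This is a generic illustration of the idea, not the precision-search method proposed in the work; the names quantize_int8/dequantize and the single per-tensor scale are assumptions for the example.

    # Minimal sketch of symmetric per-tensor quantization: FP32 weights are mapped
    # to INT8 with one scale factor and dequantized to estimate the rounding error.
    import numpy as np

    def quantize_int8(weights: np.ndarray):
        """Symmetric per-tensor quantization of FP32 weights to INT8."""
        scale = np.max(np.abs(weights)) / 127.0           # map the largest magnitude to 127
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(64, 64).astype(np.float32)
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    print("max abs quantization error:", np.max(np.abs(w - w_hat)))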
dc.format.mimetype PDF
dc.identifier.doi https://doi.org/10.31274/td-20240329-380
dc.identifier.orcid 0000-0002-3027-1915
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/kv7kJx5v
dc.language.iso en
dc.language.rfc3066 en
dc.subject.disciplines Computer engineering en_US
dc.subject.keywords Deep Learning en_US
dc.subject.keywords Hardware Accelerators en_US
dc.subject.keywords Neural Architecture Search en_US
dc.subject.keywords Quantization en_US
dc.title Hardware-aware design, search, and optimization of deep neural networks
dc.type dissertation en_US
dc.type.genre dissertation en_US
dspace.entity.type Publication
relation.isOrgUnitOfPublication a75a044c-d11e-44cd-af4f-dab1d83339ff
thesis.degree.discipline Computer engineering en_US
thesis.degree.grantor Iowa State University en_US
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy en_US
File (original bundle): ChittyVenkata_iastate_0097E_21000.pdf (4.13 MB, Adobe Portable Document Format)