Techniques for enhancing adversarial robustness and robust fairness in deep neural networks

Date
2024-12
Authors
Fakorede, Olukorede Joseph
Major Professor
Tian, Jin
Committee Member
Quinn, Christopher
Lutz, Robyn
Li, Qi
Jannesari, Ali
Abstract
Deep Neural Networks (DNNs) are a class of Artificial Neural Networks that have shown remarkable success in various fields, such as Computer Vision, Reinforcement Learning, and Natural Language Processing. However, the applicability of DNNs in safety-critical domains has often been questioned due to observations that DNNs are brittle and fail to perform well when fed imperceptibly perturbed inputs, referred to as adversarial examples. Addressing this vulnerability requires improving the robustness of DNNs. Of the several methods proposed for this purpose, Adversarial Training (AT), which involves training DNNs on adversarial examples, has proven the most effective. Nevertheless, the generalization performance of AT methods remains inferior to that of natural training, and their fairness has been questioned because the robustness gains of adversarially trained DNN models vary substantially across classes.

In this thesis, we explore novel approaches to formulating Adversarial Training of DNNs, aiming to achieve better robustness. First, we show that incorporating Hypersphere Embedding, together with regularization terms that adequately capture angularly discriminative information, into Adversarial Training improves the robustness of DNNs.

Second, motivated by the observation that the robustness obtained by AT methods exhibits significant inter-class variance, we design an instance-wise re-weighting method that assigns a unique importance to the loss of each adversarial training example. Our approach serves the dual purpose of reducing the inter-class variance inherent in robust models obtained through adversarial training and improving overall robustness.

Third, we question the use of a uniform perturbation budget for crafting adversarial training examples. We show theoretically that, under a uniform budget, different adversarial training examples incur different increases in loss, and we argue that AT methods may therefore benefit from customizing adversarial perturbations to individual training examples. We propose two instance-wise methods for weighting the perturbation budget of each training example.

Finally, drawing inspiration from the traditional standard deviation statistic, we propose a measure that may be utilized in any supervised machine learning setting. We show that, when used as a regularization term, the proposed standard-deviation-inspired measure effectively improves the robustness of AT methods against strong white-box attacks and improves their generalization to adversarial attacks not seen during training. Furthermore, we show that the measure may be used as a metric for generating adversarial attacks on DNN models.
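
The adversarial training the thesis builds on pairs an inner attack, which crafts a worst-case perturbation of each input, with an outer update on the resulting adversarial examples. As a point of reference, the following is a minimal sketch of standard PGD-based adversarial training in PyTorch; the function names, the L-infinity budget eps, the step size alpha, and the number of attack steps are illustrative assumptions rather than the configuration used in the thesis.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, num_steps=10):
    # Craft L-infinity bounded adversarial examples with projected gradient descent (PGD).
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0).detach()
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back onto the eps-ball around the clean input.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    # One mini-batch update of standard adversarial training: attack, then train on the attack.
    model.eval()                      # freeze batch-norm statistics while crafting the attack
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

The sketches that follow vary individual ingredients of this loop: the classification head, the weight given to each example's loss, the budget eps used to craft its adversarial example, and the training objective itself.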
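
For the first contribution, Hypersphere Embedding normalizes both the feature vector and the classifier weights onto the unit sphere, so that the logits become scaled cosine similarities and the loss depends only on angles. The sketch below shows a minimal hypersphere-embedding classification head; the scale factor and the network producing the features are assumptions, and the thesis's angular regularization terms are not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HypersphereHead(nn.Module):
    # Classification head for hypersphere embedding: logits are scaled cosine similarities
    # between L2-normalized features and L2-normalized class weight vectors.
    def __init__(self, feature_dim, num_classes, scale=15.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feature_dim))
        self.scale = scale

    def forward(self, features):
        features = F.normalize(features, dim=1)        # project features onto the unit sphere
        weights = F.normalize(self.weight, dim=1)      # project class weights onto the unit sphere
        return self.scale * features @ weights.t()     # cosine logits, scaled for stable training

Cross-entropy on these cosine logits depends only on the angles between features and class weights, which is what makes angularly discriminative regularization meaningful.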
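
For the instance-wise re-weighting, the thesis assigns a unique importance to the loss of each adversarial training example; the precise weighting rule is developed in the corresponding chapter. The sketch below is only a hypothetical illustration of the pattern: it up-weights examples whose true-class margin under attack is small or negative, and the margin-to-weight mapping and the temperature parameter are assumptions introduced here for illustration.

import torch
import torch.nn.functional as F

def margin_based_weights(logits_adv, y, temperature=1.0):
    # Smaller (or negative) true-class margin under attack => harder example => larger weight.
    true_logit = logits_adv.gather(1, y.unsqueeze(1)).squeeze(1)
    others = logits_adv.clone()
    others[torch.arange(logits_adv.size(0)), y] = float('-inf')   # mask out the true class
    margin = true_logit - others.max(dim=1).values
    weights = torch.softmax(-margin / temperature, dim=0)          # normalize over the batch
    return weights * len(y)                                        # keep the mean weight at 1

def reweighted_adversarial_loss(model, x_adv, y):
    # Weighted cross-entropy over adversarial examples; the weights are treated as constants.
    logits = model(x_adv)
    per_example = F.cross_entropy(logits, y, reduction='none')
    weights = margin_based_weights(logits.detach(), y)
    return (weights * per_example).mean()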
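
For the instance-wise perturbation budgets, the thesis argues that a uniform budget makes different examples incur very different loss increases and proposes two weighting methods for per-example budgets. The sketch below illustrates the mechanics only: a hypothetical rule that scales a base L-infinity budget by the model's confidence on each example, fed into a PGD attack that accepts per-example budgets; the scaling rule, base_eps, and the (lo, hi) range are assumptions, not the thesis's schemes.

import torch
import torch.nn.functional as F

def instance_wise_budgets(model, x, y, base_eps=8/255, lo=0.5, hi=1.5):
    # Map each example's true-class confidence to a multiplier in [lo, hi] of the base budget.
    with torch.no_grad():
        p_true = F.softmax(model(x), dim=1).gather(1, y.unsqueeze(1)).squeeze(1)
    return base_eps * (lo + (hi - lo) * p_true)                    # one budget per example

def pgd_with_budgets(model, x, y, eps, alpha=2/255, num_steps=10):
    # PGD in which eps is a per-example tensor broadcast over the channel/spatial dimensions.
    eps = eps.view(-1, 1, 1, 1)
    x_adv = (x + torch.empty_like(x).uniform_(-1.0, 1.0) * eps).clamp(0.0, 1.0).detach()
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()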
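
Finally, the standard-deviation-inspired measure is added to the training objective as a regularization term; its actual definition, and its use as an attack metric, are given in the final chapter. The sketch below only shows how a dispersion-style term of this kind attaches to the adversarial loss, using the standard deviation of the non-true-class logits as a stand-in statistic; both the stand-in and the weight lambda_reg are assumptions for illustration.

import torch
import torch.nn.functional as F

def dispersion_term(logits, y):
    # Stand-in dispersion statistic: std of the logits over the non-true classes, batch-averaged.
    mask = torch.ones_like(logits, dtype=torch.bool)
    mask[torch.arange(logits.size(0)), y] = False                  # drop the true-class logit
    others = logits[mask].view(logits.size(0), -1)
    return others.std(dim=1).mean()

def regularized_adversarial_loss(model, x_adv, y, lambda_reg=0.1):
    # Adversarial cross-entropy plus a weighted dispersion-style regularizer.
    logits = model(x_adv)
    return F.cross_entropy(logits, y) + lambda_reg * dispersion_term(logits, y)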
Type
dissertation