Refactoring programs to improve the performance of deep learning for vulnerability detection
Date
2021-12
Authors
Steenhoek, Benjamin Jeremiah
Major Professor
Advisor
Le, Wei
Gao, Hongyang
Cohen, Myra
Committee Member
Abstract
Software vulnerabilities allow attackers to take down important services and steal users’ private data. Many new vulnerabilities are reported each year, showing that vulnerabilities are prevalent in software programs. Therefore, it is critically important for developers to detect vulnerabilities before releasing their software. Recently, deep learning models have been successfully trained to detect vulnerabilities by learning to classify vulnerable and non-vulnerable code from open-source projects on GitHub. However, the existing datasets suffer from limited and imbalanced data, and both factors hurt the models’ performance.
We implemented a framework for automatically applying refactoring as a data augmentation technique to increase the diversity of program datasets and address data imbalance. Our refactoring framework can be tuned for different models and datasets. We evaluated our approach by using it to train state-of-the-art deep learning models. Our results show that naively refactoring programs does not significantly improve model performance. We found that some refactorings decrease model performance because they introduce tokens that are outside of the model’s vocabulary, and that naive applications of refactoring do not produce sufficiently diverse programs. Our method can be tuned to improve model performance beyond that of state-of-the-art methods by producing diverse programs and targeting imbalanced data. We also found that our method can be applied to the majority of programs in practice. Based on our results, we believe that refactoring is a useful data augmentation technique that will benefit further research and applications of deep learning for vulnerability detection.
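As a rough illustration of the idea described in the abstract, the sketch below balances a toy dataset by adding renamed copies of the minority class. It is not the thesis's framework: the abstract does not say which refactorings were used, so identifier renaming is an assumed stand-in, and the names `rename_identifier`, `augment_minority`, and the `renames` table are hypothetical. Replacement names are drawn from a fixed table, on the assumption that this keeps new tokens inside a model's vocabulary.

```python
import random
import re


def rename_identifier(code, old, new):
    """Rename every whole-word occurrence of `old` to `new`.

    A regex is a crude stand-in for a parser-based refactoring tool:
    it would also rename matches inside string literals and comments.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def augment_minority(samples, renames, seed=0, max_attempts=1000):
    """Add renamed copies of minority-class programs until classes balance.

    `samples` is a list of (code, label) pairs with labels 0 or 1.
    `renames` (hypothetical) maps an identifier to allowed replacements.
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for code, label in samples:
        by_label[label].append(code)

    minority = min(by_label, key=lambda k: len(by_label[k]))
    augmented = list(samples)
    if not by_label[minority]:
        return augmented  # nothing to refactor

    deficit = len(by_label[1 - minority]) - len(by_label[minority])
    seen = {code for code, _ in samples}

    for _ in range(max_attempts):
        if deficit <= 0:
            break
        code = rng.choice(by_label[minority])
        present = [n for n in renames
                   if re.search(rf"\b{re.escape(n)}\b", code)]
        if not present:
            continue
        old = rng.choice(present)
        new = rng.choice(renames[old])
        # Skip names already used in the program: renaming onto an
        # existing identifier could change the program's meaning.
        if re.search(rf"\b{re.escape(new)}\b", code):
            continue
        variant = rename_identifier(code, old, new)
        if variant in seen:
            continue  # keep only programs not already in the dataset
        seen.add(variant)
        augmented.append((variant, minority))
        deficit -= 1
    return augmented


samples = [
    ("void f(char *src) { char buf[8]; strcpy(buf, src); }", 1),
    ("int add(int a, int b) { return a + b; }", 0),
    ("int sq(int x) { return x * x; }", 0),
    ("void g(void) { return; }", 0),
]
renames = {"buf": ["tmp", "data"], "src": ["input", "str"]}
balanced = augment_minority(samples, renames)
```

Here the single vulnerable sample is expanded until both classes have the same number of programs. Rejecting duplicate variants is a simple nod to the abstract's observation that naive refactoring does not produce sufficiently diverse programs.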
Type
thesis