Analyzing redundancy in code-trained language models

Date
2024-12
Authors
Sharma, Arushi
Major Professor
Jannesari, Ali
Committee Member
Quinn, Christopher J
Li, Yang
Abstract
Code-trained language models have proven highly effective for various code intelligence tasks, but they can be challenging to train and deploy due to computational bottlenecks and memory constraints. Implementing effective strategies to address these issues requires a better understanding of these 'black box' models. In this thesis, I perform a neuron-level analysis of code-trained language models on three software engineering downstream tasks and one high-performance computing downstream task. I identify important neurons within latent representations by eliminating neurons that are highly similar to one another or irrelevant to the given task. This approach reveals which neurons and layers can be eliminated (redundancy analysis) and where important code properties are located within the network (concept analysis). I find that over 95% of the neurons can be eliminated without significant loss in accuracy on these code intelligence tasks. I also identify several compositions of neurons that can make predictions at baseline accuracy. Additionally, I explore the traceability and distribution of human-recognizable concepts within latent representations, and I demonstrate the effectiveness of the redundancy approach by building an efficient transfer learning pipeline.
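The abstract does not include the method itself, but as a rough illustration of the kind of correlation-based redundancy filtering it describes, the sketch below greedily drops neurons whose activations nearly duplicate an already-kept neuron, then checks a linear probe's accuracy on the survivors. The function name `eliminate_redundant_neurons`, the 0.95 threshold, and the toy activations are illustrative assumptions, not the thesis's actual pipeline.

```python
# Hypothetical sketch: correlation-based neuron elimination plus a linear
# probe over extracted hidden states of shape (num_samples, num_neurons).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def eliminate_redundant_neurons(reps: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Greedily drop neurons whose |correlation| with any already-kept
    neuron exceeds `threshold`; return the indices of kept neurons."""
    corr = np.abs(np.corrcoef(reps, rowvar=False))  # (neurons, neurons)
    kept: list[int] = []
    for i in range(reps.shape[1]):
        if all(corr[i, j] < threshold for j in kept):
            kept.append(i)
    return np.array(kept)

# Toy data standing in for model activations and task labels: the second
# half of the neurons is a near-copy of the first half, so it is redundant.
rng = np.random.default_rng(0)
reps = rng.normal(size=(500, 64))
reps[:, 32:] = reps[:, :32] + 0.01 * rng.normal(size=(500, 32))
labels = (reps[:, 0] > 0).astype(int)

kept = eliminate_redundant_neurons(reps)
X_tr, X_te, y_tr, y_te = train_test_split(reps[:, kept], labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"kept {kept.size}/{reps.shape[1]} neurons, probe accuracy {probe.score(X_te, y_te):.2f}")
```

On the toy data, roughly half the neurons are eliminated while the probe retains its accuracy, mirroring the abstract's claim that most neurons can be removed without significant loss on the downstream task.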
Type
thesis