Redundancy and concept analysis for code models on the token tagging task
Date
2024-08
Authors
Hu, Zefu
Major Professor (Advisor)
Quinn, Chris J
Committee Members
Jannesari, Ali
Huang, Xiaoqiu
Abstract
Code language models have driven substantial performance gains across a variety of code intelligence tasks.
However, research on the interpretability of these models lags far behind, limiting both the reliability and the efficiency of inference.
One obstacle is redundant layers and neurons. Identifying redundant neurons helps us better understand the models and guides research on compact and efficient models. Focusing on the token tagging task and seven (code) models: BERT, RoBERTa (RBa), CodeBERT (CB), GraphCodeBERT (GCB), UniXCoder (UC), CodeGPT-Python (CGP), and CodeGPT-Java (CGJ), we leverage redundancy and concept analyses at the layer level and the fine-grained neuron level to show that not all layers and neurons are necessary to encode the relevant concepts.
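The kind of neuron-level redundancy analysis described above can be illustrated with a minimal sketch (this is not the thesis's actual pipeline; the synthetic data, the 0.9 correlation threshold, and the least-squares probe are all illustrative assumptions): build an activation matrix with many correlated neurons, greedily keep one representative per highly correlated group, and check that a linear probe on the reduced set matches the full set.

```python
# Hypothetical sketch of correlation-based neuron redundancy analysis on
# synthetic activations; all names and thresholds are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": 500 tokens x 64 neurons, where every neuron is a
# noisy copy of one of 4 informative factors (i.e., highly redundant).
n_tokens, n_factors, n_neurons = 500, 4, 64
base = rng.normal(size=(n_tokens, n_factors))
mix = rng.integers(0, n_factors, size=n_neurons)
acts = base[:, mix] + 0.05 * rng.normal(size=(n_tokens, n_neurons))
labels = (base[:, 0] > 0).astype(float)  # a simple binary "tag" to probe for

# Greedily keep one representative neuron per group of highly correlated neurons.
corr = np.abs(np.corrcoef(acts, rowvar=False))
kept = []
for i in range(n_neurons):
    if all(corr[i, j] < 0.9 for j in kept):
        kept.append(i)

def probe_accuracy(X, y, split=350):
    """Linear least-squares probe trained on the first `split` rows,
    thresholded at 0.5 and evaluated on the remaining rows."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(Xb[:split], y[:split], rcond=None)
    preds = (Xb[split:] @ w > 0.5).astype(float)
    return float((preds == y[split:]).mean())

acc_full = probe_accuracy(acts, labels)
acc_reduced = probe_accuracy(acts[:, kept], labels)
print(f"kept {len(kept)}/{n_neurons} neurons, "
      f"full acc {acc_full:.2f}, reduced acc {acc_reduced:.2f}")
```

On this synthetic data the greedy pass keeps roughly one neuron per underlying factor, and the probe on the reduced set matches the full-network probe, mirroring the "most neurons are removable" finding at toy scale.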
In the analysis, we study how much general and task-specific redundancy the models exhibit at the layer level and at the more fine-grained neuron level. We find that over 95% of neurons are redundant and that removing them has no serious negative impact on accuracy. We also identify several compositions of neurons that can make predictions with the same accuracy as the entire network. Through concept analysis, we explore the traceability and distribution of human-recognizable concepts, determining which neurons respond to specific code properties; for example, for the token tagging task we find neurons that respond to "number" and "string" properties as well as the higher-level "text" property. Our insights can guide future research on compact and efficient code models.
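The concept-neuron search can be sketched as ranking neurons by how strongly their activations separate tokens that have a given property (e.g. numeric literals) from tokens that do not. Below is a hypothetical illustration on synthetic activations; the planted neuron indices, effect size, and standardized-mean-difference score are assumptions for the sketch, not the thesis's method.

```python
# Hypothetical concept-analysis sketch: score each neuron by how strongly its
# mean activation differs between tokens with a property ("number") and the
# rest. The data and planted concept neurons are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_tokens, n_neurons = 400, 32
is_number = rng.random(n_tokens) < 0.3  # which tokens are numeric literals

acts = rng.normal(size=(n_tokens, n_neurons))
# Plant two neurons (3 and 17) that "respond" to the number property.
for concept_neuron in (3, 17):
    acts[is_number, concept_neuron] += 2.0

# Effect size per neuron: standardized mean difference between the two groups.
mu_pos = acts[is_number].mean(axis=0)
mu_neg = acts[~is_number].mean(axis=0)
scores = np.abs(mu_pos - mu_neg) / acts.std(axis=0)

# The top-scoring neurons are the candidate "number" concept neurons.
top = np.argsort(scores)[-2:]
print(sorted(top.tolist()))
```

The ranking recovers exactly the planted neurons, which is the toy analogue of tracing a human-recognizable property like "number" or "string" back to specific neurons.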
Type
thesis