Redundancy and concept analysis for code models on the token tagging task
Date
2024-08
Authors
Hu, Zefu
Major Professor (Advisor)
Quinn, Chris J
Committee Members
Jannesari, Ali
Huang, Xiaoqiu
Abstract
Code language models have driven substantial performance gains across a variety of code intelligence tasks.
However, research on the interpretability of these models lags far behind, limiting both the reliability and the efficiency of inference.
One obstacle is redundant layers and neurons. Identifying redundant neurons helps us better understand the models and guides research on compact and efficient models. Focusing on the token tagging task and seven (code) models: BERT, RoBERTa (RBa), CodeBERT (CB), GraphCodeBERT (GCB), UniXCoder (UC), CodeGPT-Python (CGP), and CodeGPT-Java (CGJ), we leverage redundancy and concept analyses at the layer level and the fine-grained neuron level to show that not all layers and neurons are necessary to encode the relevant concepts.
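The kind of neuron-level redundancy analysis described above can be illustrated with a minimal sketch (this is not the thesis's actual pipeline; the synthetic data, the 0.9 correlation threshold, and the least-squares probe are all illustrative assumptions): build an activation matrix with many correlated neurons, greedily keep one representative per highly correlated group, and check that a linear probe on the reduced set matches the full set.

```python
# Hypothetical sketch of correlation-based neuron redundancy analysis on
# synthetic activations; all names and thresholds are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": 500 tokens x 64 neurons, where every neuron is a
# noisy copy of one of 4 informative factors (i.e., highly redundant).
n_tokens, n_factors, n_neurons = 500, 4, 64
base = rng.normal(size=(n_tokens, n_factors))
mix = rng.integers(0, n_factors, size=n_neurons)
acts = base[:, mix] + 0.05 * rng.normal(size=(n_tokens, n_neurons))
labels = (base[:, 0] > 0).astype(float)  # a simple binary "tag" to probe for

# Greedily keep one representative neuron per group of highly correlated neurons.
corr = np.abs(np.corrcoef(acts, rowvar=False))
kept = []
for i in range(n_neurons):
    if all(corr[i, j] < 0.9 for j in kept):
        kept.append(i)

def probe_accuracy(X, y, split=350):
    """Linear least-squares probe trained on the first `split` rows,
    thresholded at 0.5 and evaluated on the remaining rows."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # add a bias column
    w, *_ = np.linalg.lstsq(Xb[:split], y[:split], rcond=None)
    preds = (Xb[split:] @ w > 0.5).astype(float)
    return float((preds == y[split:]).mean())

acc_full = probe_accuracy(acts, labels)
acc_reduced = probe_accuracy(acts[:, kept], labels)
print(f"kept {len(kept)}/{n_neurons} neurons, "
      f"full acc {acc_full:.2f}, reduced acc {acc_reduced:.2f}")
```

On this synthetic data the greedy pass keeps roughly one neuron per underlying factor, and the probe on the reduced set matches the full-network probe, mirroring the "most neurons are removable" finding at toy scale.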
In the analysis, we study how much general and task-specific redundancy the models exhibit at the layer level and at the more fine-grained neuron level. We find that over 95% of neurons are redundant and that removing them has no serious negative impact on accuracy. We also identify several compositions of neurons that can make predictions with the same accuracy as the entire network. Through concept analysis, we explore the traceability and distribution of human-recognizable concepts, determining which neurons respond to specific code properties; for example, for the token tagging task we find neurons that respond to "number" and "string" properties as well as the higher-level "text" property. Our insights can guide future research on compact and efficient code models.
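The concept-neuron search can be sketched as ranking neurons by how strongly their activations separate tokens that have a given property (e.g. numeric literals) from tokens that do not. Below is a hypothetical illustration on synthetic activations; the planted neuron indices, effect size, and standardized-mean-difference score are assumptions for the sketch, not the thesis's method.

```python
# Hypothetical concept-analysis sketch: score each neuron by how strongly its
# mean activation differs between tokens with a property ("number") and the
# rest. The data and planted concept neurons are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_tokens, n_neurons = 400, 32
is_number = rng.random(n_tokens) < 0.3  # which tokens are numeric literals

acts = rng.normal(size=(n_tokens, n_neurons))
# Plant two neurons (3 and 17) that "respond" to the number property.
for concept_neuron in (3, 17):
    acts[is_number, concept_neuron] += 2.0

# Effect size per neuron: standardized mean difference between the two groups.
mu_pos = acts[is_number].mean(axis=0)
mu_neg = acts[~is_number].mean(axis=0)
scores = np.abs(mu_pos - mu_neg) / acts.std(axis=0)

# The top-scoring neurons are the candidate "number" concept neurons.
top = np.argsort(scores)[-2:]
print(sorted(top.tolist()))
```

The ranking recovers exactly the planted neurons, which is the toy analogue of tracing a human-recognizable property like "number" or "string" back to specific neurons.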
Type
thesis