Redundancy and concept analysis for code models on the token tagging task

dc.contributor.advisor Quinn, Chris J
dc.contributor.advisor Jannesari, Ali
dc.contributor.advisor Huang, Xiaoqiu
dc.contributor.author Hu, Zefu
dc.contributor.department Department of Computer Science
dc.date.accessioned 2024-10-15T22:26:19Z
dc.date.available 2024-10-15T22:26:19Z
dc.date.embargo 2026-10-15T00:00:00Z
dc.date.issued 2024-08
dc.date.updated 2024-10-15T22:26:21Z
dc.description.abstract Advances in code language models have driven substantial performance gains across code intelligence tasks. However, research on the interpretability of these models lags far behind, which limits the reliability and efficiency of inference. One obstacle is redundant layers and neurons. Identifying redundant neurons helps us better understand the models and guides research on compact and efficient models. Focusing on the token tagging task and seven (code) models, BERT, RoBERTa (RBa), CodeBERT (CB), GraphCodeBERT (GCB), UniXCoder (UC), CodeGPT-Python (CGP), and CodeGPT-Java (CGJ), we use redundancy and concept analyses at the layer level and at the finer-grained neuron level to show that not all layers and neurons are necessary to encode the relevant concepts. In the redundancy analysis, we study how much general and task-specific redundancy the models exhibit at both levels. We find that over 95% of neurons are redundant and that removing them does not seriously reduce accuracy. We also identify several combinations of neurons that make predictions with the same accuracy as the entire network. Through concept analysis, we explore the traceability and distribution of human-recognizable concepts. We identify neurons that respond to specific code properties; for example, for the token tagging task we find neurons that respond to the "number" and "string" properties and to the higher-level "text" property. Our insights can guide future research on compact and efficient code models.
dc.format.mimetype PDF
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/GvqXQpqw
dc.language.iso en
dc.language.rfc3066 en
dc.subject.disciplines Computer science en_US
dc.subject.keywords Machine learning en_US
dc.subject.keywords Software engineering en_US
dc.title Redundancy and concept analysis for code models on the token tagging task
dc.type thesis en_US
dc.type.genre thesis en_US
dspace.entity.type Publication
relation.isOrgUnitOfPublication f7be4eb9-d1d0-4081-859b-b15cee251456
thesis.degree.discipline Computer science en_US
thesis.degree.grantor Iowa State University en_US
thesis.degree.level thesis
thesis.degree.name Master of Science en_US