Redundancy and concept analysis for code models on the token tagging task
dc.contributor.advisor | Quinn, Chris J | |
dc.contributor.advisor | Jannesari, Ali | |
dc.contributor.advisor | Huang, Xiaoqiu | |
dc.contributor.author | Hu, Zefu | |
dc.contributor.department | Department of Computer Science | |
dc.date.accessioned | 2024-10-15T22:26:19Z | |
dc.date.available | 2024-10-15T22:26:19Z | |
dc.date.embargo | 2026-10-15T00:00:00Z | |
dc.date.issued | 2024-08 | |
dc.date.updated | 2024-10-15T22:26:21Z | |
dc.description.abstract | Advances in code language models have driven substantial performance gains across code intelligence tasks. Research on the interpretability of these models, however, lags far behind, which limits the reliability and efficiency of inference. One obstacle is redundant layers and neurons. Identifying redundant neurons helps us better understand the models and guides research on compact and efficient models. Focusing on the token tagging task and seven (code) models, BERT, RoBERTa (RBa), CodeBERT (CB), GraphCodeBERT (GCB), UniXCoder (UC), CodeGPT-Python (CGP), and CodeGPT-Java (CGJ), we use redundancy and concept analyses at both the layer level and the finer-grained neuron level to show that not all layers and neurons are necessary to encode the relevant concepts. In the analysis, we study how much general and task-specific redundancy the models exhibit at each of these levels. We find that over 95% of neurons are redundant and that removing them does not seriously degrade accuracy. We also identify several subsets of neurons that make predictions with the same accuracy as the entire network. Through concept analysis, we explore the traceability and distribution of human-recognizable concepts. We identify neurons that respond to specific code properties; for example, we find neurons that respond to the "number", "string", and higher-level "text" properties in the token tagging task. Our insights guide future research on compact and efficient code models. | |
dc.format.mimetype | ||
dc.identifier.uri | https://dr.lib.iastate.edu/handle/20.500.12876/GvqXQpqw | |
dc.language.iso | en | |
dc.language.rfc3066 | en | |
dc.subject.disciplines | Computer science | en_US |
dc.subject.keywords | Machine learning | en_US |
dc.subject.keywords | Software engineering | en_US |
dc.title | Redundancy and concept analysis for code models on the token tagging task | |
dc.type | thesis | en_US |
dc.type.genre | thesis | en_US |
dspace.entity.type | Publication | |
relation.isOrgUnitOfPublication | f7be4eb9-d1d0-4081-859b-b15cee251456 | |
thesis.degree.discipline | Computer science | en_US |
thesis.degree.grantor | Iowa State University | en_US |
thesis.degree.level | thesis | en_US |
thesis.degree.name | Master of Science | en_US |