Redundancy and concept analysis for code models on the token tagging task

Hu, Zefu

dc.contributor.advisor	Quinn, Chris J
dc.contributor.advisor	Jannesari, Ali
dc.contributor.advisor	Huang, Xiaoqiu
dc.contributor.author	Hu, Zefu
dc.contributor.department	Department of Computer Science
dc.date.accessioned	2024-10-15T22:26:19Z
dc.date.available	2024-10-15T22:26:19Z
dc.date.embargo	2026-10-15T00:00:00Z
dc.date.issued	2024-08
dc.date.updated	2024-10-15T22:26:21Z
dc.description.abstract	Advances in code language models have driven substantial performance gains across code intelligence tasks. However, research on the interpretability of these models lags far behind, which limits the reliability and efficiency of inference. One obstacle is redundant layers and neurons: identifying them helps us understand the models better and guides research on compact, efficient models. Focusing on the token tagging task and seven (code) models: BERT, RoBERTa (RBa), CodeBERT (CB), GraphCodeBERT (GCB), UniXCoder (UC), CodeGPT-Python (CGP), and CodeGPT-Java (CGJ), we use redundancy and concept analyses at the layer level and the finer-grained neuron level to show that not all layers and neurons are necessary to encode human-recognizable code concepts. In the redundancy analysis, we study how much general and task-specific redundancy the models exhibit at both levels. We find that over 95% of neurons are redundant and that removing them does not seriously reduce accuracy. We also identify several subsets of neurons that make predictions with the same accuracy as the entire network. Through concept analysis, we explore the traceability and distribution of human-recognizable concepts. We identify neurons that respond to specific code properties, for example the "number" and "string" properties and the higher-level "text" property in the token tagging task. These insights can guide future research on compact and efficient code models.
dc.format.mimetype	PDF
dc.identifier.uri	https://dr.lib.iastate.edu/handle/20.500.12876/GvqXQpqw
dc.language.iso	en
dc.language.rfc3066	en
dc.subject.disciplines	Computer science	en_US
dc.subject.keywords	Machine learning	en_US
dc.subject.keywords	Software engineering	en_US
dc.title	Redundancy and concept analysis for code models on the token tagging task
dc.type	thesis	en_US
dc.type.genre	thesis	en_US
dspace.entity.type	Publication
relation.isOrgUnitOfPublication	f7be4eb9-d1d0-4081-859b-b15cee251456
thesis.degree.discipline	Computer science	en_US
thesis.degree.grantor	Iowa State University	en_US
thesis.degree.level	thesis
thesis.degree.name	Master of Science	en_US

Collections

Theses and Dissertations