Syntax errors identification from compiler error messages using ML techniques

Thumbnail Image
Agrawal, Shubham
Major Professor
Jin Tian
Wei Le
Committee Member
Journal Title
Journal ISSN
Volume Title
Research Projects
Organizational Units
Organizational Unit
Computer Science

Computer Science—the theory, representation, processing, communication and use of information—is fundamentally transforming every aspect of human endeavor. The Department of Computer Science at Iowa State University advances computational and information sciences through; 1. educational and research programs within and beyond the university; 2. active engagement to help define national and international research, and 3. educational agendas, and sustained commitment to graduating leaders for academia, industry and government.

The Computer Science Department was officially established in 1969, with Robert Stewart serving as the founding Department Chair. Faculty were composed of joint appointments with Mathematics, Statistics, and Electrical Engineering. In 1969, the building which now houses the Computer Science department, then simply called the Computer Science building, was completed. Later it was named Atanasoff Hall. Throughout the 1980s to present, the department expanded and developed its teaching and research agendas to cover many areas of computing.

Dates of Existence

Related Units

Journal Issue
Is Version Of

Compiler error messages facilitate software development and debugging by providing cause and location of the error but due to various compiler bugs and inconsistencies it often fails its purpose and negatively affect performance of both novice and experienced programmers. An errant semicolon or brace can result in many errors reported throughout the program. This study tries to statistically analyze open source code base to predict real errors from different type of compiler error messages. It also tries to auto-fix these errors.

At the high level, this study handles two cases (1) when one error is present in code, (2) when two different errors are present in the code. We start with collecting different type of random error messages for both the cases by random error generation in C projects. We developed different models using document clustering, probabilistic topic modeling and multi-label classification algorithms for training and predicting real errors using collected error messages for both the cases.

Our empirical evaluation on open-source projects has shown that our model correctly predicts the real error in almost 95% cases, when only one error exists in program. In case of two errors, model correctly predicts at least one error in almost 91% cases and both the errors in almost 39% cases.

Subject Categories
Fri Jan 01 00:00:00 UTC 2016