Abstraction, aggregation and recursion for generating accurate and simple classifiers

dc.contributor.advisor Vasant Honavar
dc.contributor.author Kang, Dae-Ki
dc.contributor.department Computer Science
dc.date 2018-08-24T19:24:26.000
dc.date.accessioned 2020-06-30T08:10:50Z
dc.date.available 2020-06-30T08:10:50Z
dc.date.copyright Sun Jan 01 00:00:00 UTC 2006
dc.date.issued 2006-01-01
dc.description.abstract <p>An important goal of inductive learning is to generate accurate and compact classifiers from data. In a typical inductive learning scenario, instances in a data set are simply represented as ordered tuples of attribute values. In our research, we explore three methodologies to improve the accuracy and compactness of the classifiers: abstraction, aggregation, and recursion;Firstly, abstraction is aimed at the design and analysis of algorithms that generate and deal with taxonomies for the construction of compact and robust classifiers. In many applications of the data-driven knowledge discovery process, taxonomies have been shown to be useful in constructing compact, robust, and comprehensible classifiers. However, in many application domains, human-designed taxonomies are unavailable. We introduce algorithms for automated construction of taxonomies inductively from both structured (such as UCI Repository) and unstructured (such as text and biological sequences) data. We introduce AVT-Learner, an algorithm for automated construction of attribute value taxonomies (AVT) from data, and Word Taxonomy Learner (WTL), an algorithm for automated construction of word taxonomy from text and sequence data. We describe experiments on the UCI data sets and compare the performance of AVT-NBL (an AVT-guided Naive Bayes Learner) with that of the standard Naive Bayes Learner (NBL). Our results show that the AVTs generated by AVT-Learner are compeitive with human-generated AVTs (in cases where such AVTs are available). AVT-NBL using AVTs generated by AVT-Learner achieves classification accuracies that are comparable to or higher than those obtained by NBL; and the resulting classifiers are significantly more compact than those generated by NBL. Similarly, our experimental results of WTL and WTNBL on protein localization sequences and Reuters newswire text categorization data sets show that the proposed algorithms can generate Naive Bayes classifiers that are more compact and often more accurate than those produced by standard Naive Bayes learner for the Multinomial Model;Secondly, we apply aggregation to construct features as a multiset of values for the intrusion detection task. For this task, we propose a bag of system calls representation for system call traces and describe misuse and anomaly detection results on the University of New Mexico (UNM) and MIT Lincoln Lab (MIT LL) system call sequences with the proposed representation. With the feature representation as input, we compare the performance of several machine learning techniques for misuse detection and show experimental results on anomaly detection. The results show that standard machine learning and clustering techniques using the simple bag of system calls representation based on the system call traces generated by the operating system's kernel is effective and often performs better than approaches that use foreign contiguous sequences in detecting intrusive behaviors of compromised processes;Finally, we construct a set of classifiers by recursive application of the Naive Bayes learning algorithms. Naive Bayes (NB) classifier relies on the assumption that the instances in each class can be described by a single generative model. This assumption can be restrictive in many real world classification tasks. We describe recursive Naive Bayes learner (RNBL), which relaxes this assumption by constructing a tree of Naive Bayes classifiers for sequence classification, where each individual NB classifier in the tree is based on an event model (one model for each class at each node in the tree). In our experiments on protein sequences, Reuters newswire documents and UC-Irvine benchmark data sets, we observe that RNBL substantially outperforms NB classifier. Furthermore, our experiments on the protein sequences and the text documents show that RNBL outperforms C4.5 decision tree learner (using tests on sequence composition statistics as the splitting criterion) and yields accuracies that are comparable to those of support vector machines (SVM) using similar information.</p>
dc.format.mimetype application/pdf
dc.identifier archive/lib.dr.iastate.edu/rtd/1882/
dc.identifier.articleid 2881
dc.identifier.contextkey 6131401
dc.identifier.doi https://doi.org/10.31274/rtd-180813-12353
dc.identifier.s3bucket isulib-bepress-aws-west
dc.identifier.submissionpath rtd/1882
dc.identifier.uri https://dr.lib.iastate.edu/handle/20.500.12876/72761
dc.language.iso en
dc.source.bitstream archive/lib.dr.iastate.edu/rtd/1882/3243838.PDF|||Fri Jan 14 21:46:52 UTC 2022
dc.subject.disciplines Computer Sciences
dc.subject.keywords Computer science
dc.title Abstraction, aggregation and recursion for generating accurate and simple classifiers
dc.type article
dc.type.genre dissertation
dspace.entity.type Publication
relation.isOrgUnitOfPublication f7be4eb9-d1d0-4081-859b-b15cee251456
thesis.degree.level dissertation
thesis.degree.name Doctor of Philosophy
File
Original bundle
Now showing 1 - 1 of 1
Name:
3243838.PDF
Size:
1015.73 KB
Format:
Adobe Portable Document Format
Description: