Investigating the use of Bayesian networks for small dataset problems

Date
2018-01-01
Authors
Macallister, Anastacia
Major Professor
Eliot Winer
Abstract

Benefits associated with machine learning are extensive. Industry is increasingly beginning to recognize the wealth of information stored in the data it is collecting. To sort through and analyze all of this data, specialized tools are required to arrive at actionable strategies. Often this is done with supervised machine learning algorithms. While these algorithms can be extremely powerful data analysis tools, they require considerable understanding, expertise, and a significant amount of data to use. Selecting the appropriate data analysis method is important to deriving valid strategies from the collected data. In addition, a defining characteristic of machine learning is the need for large amounts of data to train a system's behavior. Large quantities of data, thousands to millions of data points, ensure that automated machine learning algorithms have enough information on the range of situations they may encounter. However, many real-world events simply do not occur with enough frequency to accumulate a large enough dataset in a reasonable amount of time. Examples include low-volume manufacturing, medical procedures, and disaster events. As a result, these application areas, and others like them, are unable to harness the power of traditional machine learning approaches. This is unfortunate, because these underserved areas involve exactly the kind of complex, interdependent processes that could benefit from the enhanced understanding and modeling capability machine learning models can provide.

If there were a way to take the limited data available from these types of applications and apply machine learning strategies, valuable information could be gained. However, because machine learning approaches differ in nature, care must be taken when selecting a strategy. Approaches like linear classification are simple and widely used for less demanding machine learning applications; however, assumptions of constant distributions and the constraint of a linear combination of terms mean the method is not well suited to more complex processes. Decision trees are another machine learning approach popular in many domains. They are easy-to-understand, graph-like structures, lending themselves well to applications in medicine. However, decision trees suffer from overfitting and have difficulty handling the noisy, incomplete data often found in real-world applications. Support vector machines (SVMs) are another widely popular machine learning approach. They are inherently a binary classification method that projects complex multivariate problems into an n-dimensional space using a kernel. While SVMs are a popular and powerful tool, an n-dimensional kernel can be complex and challenging to work with, which makes the small dataset issue difficult to investigate. Neural networks are among the most powerful machine learning tools. They are made up of neurons and synapses that activate based on certain inputs, much like the human nervous system. While very powerful, trained neural networks are very challenging to fully understand and dissect; as a result, they are not a good option for investigating a machine learning algorithm for small datasets. Bayesian Networks (BNs) are another widely used machine learning approach. They combine expert knowledge, in the form of a network structure and prior probability distributions, with Bayesian statistics. The easily understandable network structure, paired with flexible Bayesian statistical methods, lends itself well to investigating the behaviors associated with training machine learning models on small datasets.
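As an illustration of the ingredients a BN combines, the sketch below estimates a prior and a set of conditional probabilities by counting over a hypothetical handful of (experience, outcome) observations, then applies Bayes' rule in a single-parent network. The variable names and data are invented for illustration, not drawn from the dissertation's datasets; with so few points, the counted priors are easily biased, which is exactly the small-data behavior under study.

```python
from collections import Counter

# Hypothetical toy data: (worker experience, assembly outcome) pairs.
data = [
    ("novice", "slow"), ("novice", "slow"), ("novice", "fast"),
    ("expert", "fast"), ("expert", "fast"), ("expert", "slow"),
    ("novice", "slow"), ("expert", "fast"),
]

def train(pairs):
    """Estimate P(outcome) and P(experience | outcome) by counting."""
    prior_counts = Counter(o for _, o in pairs)
    cond_counts = {}
    for e, o in pairs:
        cond_counts.setdefault(o, Counter())[e] += 1
    n = len(pairs)
    priors = {o: c / n for o, c in prior_counts.items()}
    likelihoods = {o: {e: c / sum(cnt.values()) for e, c in cnt.items()}
                   for o, cnt in cond_counts.items()}
    return priors, likelihoods

def posterior(priors, likelihoods, evidence):
    """P(outcome | evidence) via Bayes' rule with normalization."""
    scores = {o: priors[o] * likelihoods[o].get(evidence, 0.0)
              for o in priors}
    z = sum(scores.values())
    return {o: s / z for o, s in scores.items()}

priors, likes = train(data)
post = posterior(priors, likes, "novice")  # P(outcome | experience=novice)
```

Counting over eight points already skews the conditionals; a full BN generalizes this to a directed graph of such conditional tables, one per node.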

As a result, Bayesian Networks were selected as the method for investigating how to apply machine learning's predictive abilities to small dataset problems. This dissertation explores three research issues: 1) whether small quantities of data can be used to construct a BN that accurately predicts outcomes of a complex process, 2) whether prior probabilities in a BN can be accurately modeled and/or modified for small datasets, and 3) whether likelihoods can be accurately calculated for small datasets. Addressing these issues requires identifying where in the BN training process breakdowns due to data size occur; understanding where and how these breakdowns occur will allow strategies to be developed to address them.

The first part of this research developed Bayesian Networks (BNs) using only small amounts of data. Networks were constructed to predict assembly accuracy and completion time for workers conducting assembly operations, and to model the suitability of a buyer's car choice. The goal was to identify areas where using small datasets to train a BN may encounter issues. Data for the project came from two sources: a study using augmented reality guided work instructions, and a popular machine learning database. The first dataset, containing data from 75 participants, was analyzed for trends to construct a Bayesian Network. The second contained about 1,700 data points; a subset of around 40 points was used to train the network and the remainder to test it. For the first dataset, results indicated the network could predict assembly time with around seventy percent accuracy but achieved only thirty-eight percent accuracy on error count. While these results were encouraging, further analysis demonstrated that the network was biased by priors heavily influenced by the number of data points in each category. In an attempt to solve this issue, particle swarm optimization (PSO) was explored as a means of tuning network parameters to increase accuracy. However, results indicated that for a network with a larger number of parameters, like the car choice network, there is not sufficient data to use this method, or that the method needs to be adapted to include more robust metrics. The results suggested that for more complex problems, a method of data simulation or generation should be explored to increase the training set size.
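The PSO tuning loop described above can be sketched generically. In the sketch below, the fitness function is a stand-in quadratic with an assumed optimum at (0.7, 0.3); in the dissertation's setting it would instead score a candidate vector of BN parameters by (one minus) validation accuracy. All constants here (swarm size, inertia, acceleration coefficients) are conventional illustrative choices, not values from the dissertation.

```python
import random

random.seed(0)

def fitness(params):
    # Stand-in objective: in the BN-tuning setting this would be
    # (1 - validation accuracy) of a network whose parameters are
    # set to `params`. A quadratic bowl is assumed for illustration.
    return (params[0] - 0.7) ** 2 + (params[1] - 0.3) ** 2

def pso(dim=2, particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimizer over [0, 1]^dim starts."""
    pos = [[random.random() for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in pos]              # per-particle best positions
    pbest_f = [fitness(p) for p in pos]
    gbest = pbest[pbest_f.index(min(pbest_f))][:]  # swarm-wide best
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < fitness(gbest):
                    gbest = pos[i][:]
    return gbest

best = pso()  # converges near the assumed optimum (0.7, 0.3)
```

When the fitness evaluation is a full train-and-validate cycle on a few dozen points, its noise is what makes plain PSO struggle on higher-parameter networks like the car choice BN.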

Data generation work explored the feasibility of using Kriging and Radial Basis Function (RBF) models to generate data for four different Bayesian Networks. The networks aimed to predict completion time for workers conducting assembly operations, the number of errors an assembly worker made, a buyer's car choice, and the income level of an adult. Data for the project was collected from a human-subjects study that used augmented reality guided work instructions and from the UCI machine learning database. Small amounts of data from each of these datasets were used to train the different BNs. Each training dataset was fitted with a Kriging and a Radial Basis Function model. Once created, the models were randomly sampled to produce a larger dataset for training. The four networks were then tested under multiple conditions, including the use of PSO to tune network parameters. The first set of results examined how varying the proportion of generated to original training data impacted network accuracy. Results showed that in some cases generated data could increase the accuracy of the trained networks, and that the ratio of original to generated data could also affect classification accuracy. From here, larger amounts of data were generated: networks trained using ten thousand, one hundred thousand, and one million data points were tested. Results showed that, depending on the dataset, increasing amounts of data did help increase accuracy for more complex network structures.
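A minimal sketch of the surrogate-sampling idea: fit a Gaussian-RBF interpolator to a hypothetical five-point dataset, then sample it at random points to build a larger synthetic training set. The data, kernel width, and one-dimensional setting are invented for illustration; the dissertation's models are fitted to the real (multivariate) small datasets.

```python
import math
import random

random.seed(1)

# Hypothetical small training set: x = task complexity, y = completion time.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2.0, 2.6, 3.9, 5.1, 7.2]

def solve(A, b):
    """Naive Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rbf_fit(xs, ys, eps=2.0):
    """Exact Gaussian-RBF interpolation of the small dataset."""
    def phi(r):
        return math.exp(-(eps * r) ** 2)
    A = [[phi(abs(xi - xj)) for xj in xs] for xi in xs]
    w = solve(A, ys)
    return lambda x: sum(wi * phi(abs(x - xi)) for wi, xi in zip(w, xs))

model = rbf_fit(xs, ys)
# Randomly sample the surrogate to build a larger synthetic training set.
synthetic = [(x, model(x)) for x in (random.random() for _ in range(100))]
```

The interpolator passes exactly through the five original points, and sampling it produces plausible in-between training pairs; a Kriging model plays the same role but additionally provides a variance estimate at each sampled point.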

Results from tuning network parameters using PSO showed that it can help produce accurate networks that improve on the baseline performance of networks trained on the original data alone. However, the PSO method generally increased the standard deviation of accuracy results and lowered the median accuracy. This suggests that alternate PSO formulations, taking into account more information about the parameters, are necessary to see further accuracy enhancements. Overall, the exploratory results presented in this dissertation demonstrate the feasibility of using meta-model-generated data and PSO to increase the accuracy of BNs trained on small sample sets. Further development of this method will help underserved areas with access to only small datasets make use of the powerful predictive analytics of machine learning. Moving forward, future work will continue to refine the data generation methods and investigate alternate prior optimization formulations.

Type
article
Rights Statement
Copyright 2018