Three essays on applications of machine learning in problems with high dimensional data
Abstract
The amount of data that businesses collect from the internet is massive. Researchers and analysts can now track various data features generated from log files, such as customers’ behavior history, product descriptions, and aggregate-level data. In an ideal scenario, such data could be represented in a spreadsheet, with each column representing one dimension. In practice, the number of data dimensions can be staggering, making data processing difficult. With high-dimensional data, the number of features can exceed the number of observations, a scenario that traditional econometric methods struggle to handle. My dissertation addresses this data issue by applying machine learning techniques, including LASSO (least absolute shrinkage and selection operator), decision trees, and neural networks, to help decision makers perform descriptive, predictive, and prescriptive analytics based on high-dimensional data.
My dissertation comprises three essays. The first essay applies tree-based machine learning models (random forest and gradient boosting decision trees) and free-text information to predict house prices and to understand how certain factors affect those prices. In the second essay, I propose a LASSO method for high-dimensional data and use daily hotel prices to understand the competition patterns among hotels in a given area. In the third essay, a word embedding and neural network model is applied to real estate data to extract free-text information more efficiently, which leads to more accurate predictions of house prices.
In these essays, I apply and extend a variety of analytic tools, including supervised learning, unsupervised learning, statistics, and econometric methods. These essays contribute to the applied econometrics and business analytics literature. They can help researchers and analysts appreciate both traditional econometrics and predictive analytics tools, and make data-driven business decisions.