Research and Design of Machine Learning-based Product Classification System

Date
2024-05
Authors
Wang, Jinghao
Major Professor
Townsend, Anthony
Abstract
With the rapid development of emerging technologies such as big data and cloud computing, and the swift expansion of global e-commerce platforms, the Internet now holds a massive and still growing volume of product data. Within this volume, it is increasingly important to find the products one needs, extract the relevant information accurately, and classify and manage the products effectively. In this paper, we therefore apply machine learning methods to analyze product data statistically, discover the underlying patterns, and use those patterns to predict the classes of unseen product data. Product classification plays a significant role in today's e-commerce sector, yet traditional manual classification can no longer keep pace with the rapidly increasing quantity and variety of products. This research aims to design a machine learning-based product classification system: by analyzing product data features, it establishes a classification model intended to improve the accuracy and efficiency of product classification. The research method comprises data collection, feature engineering, and model construction. The performance of different machine learning algorithms is compared, and the best-performing algorithm is selected for system implementation. Experimental results show that the system performs well in terms of both classification accuracy and efficiency, indicating strong potential for application and adoption. Future research directions include optimizing model performance, expanding system functions, and applying the system to actual e-commerce platforms. The main research content and results are as follows:
1. Data cleaning is performed on the dataset, including removing duplicate values, handling missing values, and addressing outliers, in order to ensure the consistency, completeness, uniqueness, and overall quality of the data. The data is then preprocessed through tokenization of both Chinese and English text, feature vectorization, dimensionality reduction, and feature selection, which together transform the data into a format suitable for modeling. Tokenization breaks sentences or paragraphs into individual words, so that the computer can treat each word as the smallest unit of meaning. Optimizations such as removing stop words and incorporating language resources improve tokenization accuracy. Because classifiers operate on numerical input, the tokenized text must be converted into feature vectors. The large number of distinct words produced by tokenization leads to high-dimensional vectors, so dimensionality reduction techniques are applied to reduce the dimensionality of the feature vectors significantly. In addition, forward feature selection and backward feature elimination are used to remove irrelevant and redundant features.
2. This study explores the Random Forest algorithm, an extension of the Bagging algorithm that uses decision trees as its base learners. The generation process and combination strategies of the Random Forest algorithm are introduced and analyzed, and the algorithm is compared with traditional decision tree algorithms. Furthermore, the feature selection method of the Random Forest algorithm is improved by using the Gini index for feature selection and by specifying the size of the feature subset. These improvements enhance the classification performance of the model.
3. During the experimental phase, the effectiveness of the treatment of the data imbalance problem was validated on experimental data. Comparative experiments were conducted between the decision tree algorithm and the random forest algorithm, and between the random forest algorithm with and without the feature selection improvement. First, the holdout method was used to randomly draw 10% of the dataset as the validation set, with the remaining 90% serving as the training set. Then grid search, combined with cross-validation and model evaluation methods, was used to tune the hyperparameters of the decision tree and random forest algorithms. Finally, performance evaluation metrics were used to assess the models, and the experimental results were analyzed.
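As a rough illustration of three ideas named in the abstract (Gini-based split scoring, a fixed-size random feature subset, and a 10%/90% holdout split), the following minimal Python sketch uses only the standard library. It is not the author's implementation: the function names, the seed values, and the toy labels are all assumptions made for this example, and the Gini measure shown is the usual Gini impurity, 1 minus the sum of squared class proportions.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_of_split(left_labels, right_labels):
    """Size-weighted Gini impurity of the two child nodes of a candidate split."""
    n = len(left_labels) + len(right_labels)
    if n == 0:
        return 0.0
    return (len(left_labels) / n) * gini_impurity(left_labels) + (
        len(right_labels) / n
    ) * gini_impurity(right_labels)


def random_feature_subset(n_features, subset_size, rng):
    """Pick `subset_size` distinct feature indices to consider at one tree node."""
    return sorted(rng.sample(range(n_features), subset_size))


def holdout_split(rows, validation_fraction=0.1, seed=0):
    """Shuffle a copy of `rows` and hold out a fraction as the validation set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Toy usage with hypothetical product categories.
mixed = ["book", "book", "toy", "toy"]
print(gini_impurity(mixed))                            # 0.5: two equally likely classes
print(gini_of_split(["book", "book"], ["toy", "toy"]))  # 0.0: a perfectly pure split

rng = random.Random(42)
print(random_feature_subset(n_features=10, subset_size=3, rng=rng))

train, validation = holdout_split(list(range(20)), validation_fraction=0.1)
print(len(train), len(validation))                     # 18 2
```

A lower weighted impurity marks a better candidate split, and limiting each node to a small random subset of features is what decorrelates the trees in a random forest. In practice a library implementation with grid search over these settings would replace this hand-written sketch.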
Type
creative component
Copyright
2024