Data science for mapping dynamic soil properties: Sustainable crop production with big data

Ferhatoglu, Caner

File

Ferhatoglu_iastate_0097E_20884.pdf (7.71 MB)

Date

2023-05

Authors

Ferhatoglu, Caner

Advisor

Miller, Bradley

McDaniel, Marshall

Manu, Andrew

Mallarino, Antonio

Quinn, Christopher

Abstract

Digital soil mapping (DSM) refers to the creation of soil maps using statistical learning algorithms (e.g., machine learning [ML], deep learning, and geostatistical interpolation methods), environmental covariates, and georeferenced soil samples. These algorithms build predictive models based on the relationships between measured soil data and the environmental predictors that co-vary with them. Since the 1990s, DSM has proven its value to the soil science community by making it possible to model the spatial variation of soil properties and classes with high accuracy and fine resolution. Data science has been at the core of DSM, offering various methods to clean, pre-process, and spatially model soil data. With the rapid growth of remote sensing and soil datasets, the usefulness of data science methods for making sense of these massive datasets has become more pronounced. Despite the challenges these massive datasets bring, they also offer opportunities to make better maps of dynamic soil properties (e.g., nitrate nitrogen, soil-test phosphorus and potassium, and soil pH) and less dynamic ones (e.g., soil particle size fractions) to support soil fertility management. Data science methods provide useful tools for handling large input datasets, which can improve the accuracy and efficacy of DSM.
In this dissertation, Chapter 2 investigated feature selection (FS) algorithms for identifying the optimal subset of environmental covariates to improve the robustness and accuracy of DSM. Chapter 3 explored the impact of data scaling methods on the performance of ML algorithms (i.e., linear, non-linear, and tree-based ML algorithms). To balance the trade-off between cost and mapping accuracy, Chapter 4 developed a value system for selecting the ideal soil sampling density based on a user's priorities.
Chapter 2 compared the effectiveness of six FS methods from four categories (i.e., filter, wrapper, embedded, and hybrid) in improving the robustness and accuracy of DSM for five heavily managed soil properties (i.e., nitrate, soil-test phosphorus and potassium, soil organic matter, and buffer pH) at the field scale. The covariate stack without FS was the control treatment. The full covariate stack included 1,049 potential environmental covariates gathered from time-series aerial and satellite imagery and from digital terrain attributes, from which the FS algorithms chose the relevant covariate subset. The performance of the resultant models was measured by cross-validation (CV), a new robustness ratio (RR) metric, and independent validation (IV) with Lin's concordance correlation coefficient (CCC). Considering RR facilitated the identification of optimal FS and ML combinations that outperformed models built from the full covariate stack. Wrapper and embedded FS strategies produced the optimal models more frequently than the hybrid and filter FS strategies. Models created from covariate stacks reduced by FS methods were more robust and exhibited better prediction performance.
The unstructured nature of the datasets used in DSM is likely to lead to inferior performance of ML algorithms, given that most ML algorithms are sensitive to the attribute scale of input variables (i.e., covariates and target soil properties). Chapter 3 attempted to address this issue by comparing the effectiveness of various data scaling methods applied to covariates and target soil properties (i.e., covariate and target scaling strategies) under three types of ML algorithms (i.e., linear, non-linear, and tree-based ML algorithms). Covariate and target scaling strategies improved the prediction performance of most ML algorithms.
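As a minimal sketch of what covariate scaling and target scaling can look like in practice, the snippet below standardizes one covariate with statistics fitted on training samples only and log-transforms a target soil property, then back-transforms it. The specific scalers (z-score and log1p), the function names, and the toy values are assumptions made for illustration; the abstract does not state which scaling methods the dissertation compared.

```python
import math
from statistics import mean, pstdev

def fit_zscore(train_column):
    """Estimate mean and standard deviation from training samples only."""
    mu = mean(train_column)
    sd = pstdev(train_column)
    # A constant covariate has zero spread; fall back to 1.0 to avoid dividing by zero.
    return mu, (sd if sd > 0 else 1.0)

def apply_zscore(column, mu, sd):
    """Covariate scaling: center and rescale using the training statistics."""
    return [(v - mu) / sd for v in column]

def scale_target(values):
    """Target scaling: log1p compresses right-skewed, non-negative soil test values."""
    return [math.log1p(v) for v in values]

def unscale_target(values):
    """Back-transform predictions to the original units of the soil property."""
    return [math.expm1(v) for v in values]

# Hypothetical toy values, not data from the dissertation.
train_elevation = [300.0, 310.0, 320.0, 330.0]   # covariate at sampled locations
new_elevation = [315.0, 340.0]                   # covariate at locations to be mapped
train_soil_test_p = [12.0, 18.0, 30.0, 55.0]     # target soil property

mu, sd = fit_zscore(train_elevation)
train_scaled = apply_zscore(train_elevation, mu, sd)
new_scaled = apply_zscore(new_elevation, mu, sd)  # reuse training statistics

y_scaled = scale_target(train_soil_test_p)        # a model would be fitted on this
y_restored = unscale_target(y_scaled)             # predictions are mapped back like this
```

Fitting the scaler on the training samples and reusing it for new locations keeps validation data from leaking into the model, which matters when performance is judged by cross-validation and independent validation.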
The performance gains from data scaling were less pronounced for linear and tree-based ML algorithms than for non-linear algorithms. The effect of data scaling on prediction performance was negligible for tree-based ML algorithms. The covariate scaling strategy often produced higher prediction performance. Data scaling affected the covariate importance and map patterns of different ML algorithms differently. When employing optimal data scaling methods and strategies, non-linear ML algorithms demonstrated performance comparable to, or even better than, that of tree-based ML algorithms, regardless of whether data scaling was used. These findings suggest that data scaling with non-linear ML algorithms led to the best predictive performance for some soil properties (i.e., soil-test phosphorus and soil organic matter), exceeding the performance of tree-based ML algorithms (with baseline models), which are deemed the most predictive ML algorithms in DSM.
Chapter 4 proposed a new method to formalize the selection of sampling quantity or density for mapping soil properties. A common challenge in making DSM operational for soil fertility management is the cost of soil sampling and analysis: the accuracy of soil maps tends to improve with increasing sampling density, but the use of resources needs to be balanced. A new index called the Optimal Sample Size Index (OSSI) was created to balance sampling cost against map accuracy. OSSI was based on the weighted sum of multiple evaluation metrics (including the accuracy of maps and the relative sampling cost of mapping, introduced in this study) and the best-worst scaling (BWS) method. For all soil properties and study regions, the optimal OSSI score identified models with map accuracy scores (i.e., CV root-mean-squared error [RMSE] and IV RMSE) comparable to those of models built from all training samples while reducing the relative sampling cost by 50 to 92%.
The optimal OSSI scores led to soil maps with patterns similar to, though slightly less detailed than, those of the maps built from all available training samples. OSSI is a promising index to inform future sampling and mapping efforts because it can help reduce the soil sampling cost while retaining sufficient map accuracy and map patterns. Therefore, OSSI can help operationalize DSM for monitoring soil fertility management.
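The idea of trading map accuracy against sampling cost through a weighted sum can be illustrated with a small sketch. This is not the dissertation's OSSI formula, which the abstract does not give: the min-max normalization, the function names, the weights (which in the study would come from best-worst scaling), and all numbers below are hypothetical assumptions chosen only to show the mechanics.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant metric contributes 0 for every candidate."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def weighted_index(candidates, weights):
    """Weighted sum of normalized metrics per candidate sample size.

    candidates: {sample_size: {metric_name: value}}, where lower is better
    for every metric. Returns {sample_size: score}; lower scores are better.
    """
    sizes = sorted(candidates)
    total_weight = sum(weights.values())
    scores = {n: 0.0 for n in sizes}
    for metric, weight in weights.items():
        normalized = min_max([candidates[n][metric] for n in sizes])
        for n, value in zip(sizes, normalized):
            scores[n] += (weight / total_weight) * value
    return scores

# Hypothetical metrics for four candidate sample sizes (not results from the study).
candidates = {
    25:  {"cv_rmse": 9.0, "iv_rmse": 10.0, "relative_cost": 0.125},
    50:  {"cv_rmse": 6.5, "iv_rmse": 7.0,  "relative_cost": 0.25},
    100: {"cv_rmse": 6.0, "iv_rmse": 6.4,  "relative_cost": 0.5},
    200: {"cv_rmse": 5.8, "iv_rmse": 6.2,  "relative_cost": 1.0},
}
# Hypothetical priority weights standing in for a user's stated preferences.
weights = {"cv_rmse": 0.3, "iv_rmse": 0.4, "relative_cost": 0.3}

scores = weighted_index(candidates, weights)
best_size = min(scores, key=scores.get)
print(best_size, round(scores[best_size], 3))
```

Changing the weights shifts the selected sample size, which is how an index of this kind can reflect whether a user prioritizes lower cost or higher map accuracy.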

Academic or Administrative Unit

Department of Agronomy

Type

dissertation