Volume 55, Issue 7 p. 5715-5737
Research Article
Open Access

Using Machine Learning for Prediction of Saturated Hydraulic Conductivity and Its Sensitivity to Soil Structural Perturbations

Samuel N. Araya

Corresponding Author

Life and Environmental Sciences Department, University of California, Merced, CA, USA

Correspondence to: S. N. Araya

Teamrat A. Ghezzehei

Life and Environmental Sciences Department, University of California, Merced, CA, USA

First published: 27 June 2019
Citations: 110

Abstract

Saturated hydraulic conductivity (Ks) is a fundamental soil property that regulates the fate of water in soils. Its measurement, however, is cumbersome and instead pedotransfer functions (PTFs) are routinely used to estimate it. Despite much progress over the years, the performance of current generic PTFs estimating Ks remains poor. Using machine learning, high-performance computing, and a large database of over 18,000 soils, we developed new PTFs to predict Ks. We compared the performances of four machine learning algorithms and different predictor sets. We evaluated the relative importance of soil properties in explaining Ks. PTF models based on boosted regression tree algorithm produced the best models with root-mean-squared log-transformed error in ranges of 0.4 to 0.3 (log10(cm/day)). The 10th percentile particle diameter (d10) was found to be the most important predictor followed by clay content, bulk density (ρb), and organic carbon content (C). The sensitivity of Ks to soil structure was investigated using ρb and C as proxies for soil structure. An inverse relationship was observed between ρb and Ks, with the highest sensitivity at around 1.8 g/cm3 for most textural classes. Soil C showed a complex relationship with Ks with an overall positive relation for fine-textured and midtextured soils but an inverse relation for coarse-textured soils. This study sought to maximize the extraction of information from a large database to develop generic machine learning-based PTFs for estimating Ks. Models developed here have been made publicly available and can be readily used to predict Ks.

Key Points

  • High accuracy machine learning-based pedotransfer function models are developed to predict saturated hydraulic conductivity
  • Variable importance measures are used to identify and rank soil properties that are most important for predictions
  • Bulk density and organic carbon content are used as proxies to describe the effect of soil structural alterations in predictions

1 Introduction

Hydraulic conductivity of water-saturated soils (Ks) is one of the most important soil characteristics that determine the rate of infiltration, runoff generation, and deep drainage. Its importance is particularly elevated during precipitation, snowmelt, flooding, and/or irrigation events. Ks regulates the amount of plant-available water, overland flow and transport, erosion, groundwater recharge, and the extent and duration of water inundation. The magnitude of soil hydraulic conductivity primarily depends on the size, distribution, and connectivity of pores (Alaoui et al., 2011; Bittelli et al., 2015; Nielsen et al., 2018). Thus, the first-order classification of soil hydraulic conductivity is typically based on soil texture, ranging from >5 m/day for sandy soils to <0.01 m/day for clay-textured soils (e.g., Rawls et al., 1982; Soil Science Division Staff, 2017). In addition to texture, hydraulic conductivity is influenced by soil structure, which itself is the result of several factors. Structure determines the presence and connectivity of large pores, including macropores, cracks, and interaggregate pore spaces (Beven & Germann, 1982).

Unlike texture, soil structure is prone to substantial alteration in a relatively short time, which could have consequential effects on hydraulic conductivity and associated hydrologic processes (Assouline & Or, 2013). Common structure altering processes that have strong bearing on hydraulic conductivity include burrowing by roots and soil fauna, aggregation, compaction, and wetting/drying cycles (Brooks et al., 2004; Chivenge et al., 2007; Ghezzehei & Or, 2000; Kuncoro et al., 2014; Or et al., 2000). These processes may occur on seasonal cycles or over several decades as part of soil development. More drastic changes to soil structure such as tillage and cracking can increase hydraulic conductivity by several orders of magnitude, albeit for a short period of time (de Almeida et al., 2018; Jorda et al., 2015).

It is often impractical to measure hydraulic conductivity with adequate spatial density and frequency because soil hydraulic properties vary considerably across landscapes, often within short distances. The lack of adequate information that captures the spatial heterogeneity and temporal dynamics of soil hydraulic conductivity and related processes is often identified as a critical shortcoming in land surface models that simulate processes across large regions and long periods. In this regard, pedotransfer functions (PTFs), models that predict soil hydraulic properties from other more easily obtained soil and land characteristics, are valuable tools (Padarian et al., 2018; Van Looy et al., 2017). PTFs that consider some soil structural variable (e.g., soil bulk density) can be particularly useful in modeling changes to soil hydraulic properties arising from alterations in soil structure. Several studies have shown that including structural variables improves PTF predictions. For example, in a study using 487 data points mined from the literature, Jorda et al. (2015) found bulk density (ρb) and land use (a variable that most directly impacts soil structure) to be the most important predictors of hydraulic conductivity. Nguyen et al. (2015) found that prediction of soil moisture for their study sites was improved when soils were grouped by soil structural criteria. Other studies found that including soil structural variables in terms of fractal parameters improved predictions of soil hydraulic properties (G. Huang & Zhang, 2005; Mohammadi et al., 2013).

The need for parameterization and inclusion of new soil structural variables in PTFs is widely recognized (Patil & Singh, 2016; Van Looy et al., 2017; Vereecken et al., 2010). However, soil structure remains poorly represented in PTFs; the lack of universally applicable and quantitative measures of soil structure remains a key challenge (Diaz-Zorita et al., 2002; Ghezzehei, 2011). The relationships between soil structural variables (such as bulk density, aggregate stability, aggregate size distribution, and organic matter concentration) and hydraulic properties are very complex and highly nonlinear. It is extremely difficult to model these relationships accurately using physically based models or traditional statistical methods. This opens an opportunity to revisit these challenges using data-driven methods such as machine learning (ML) techniques, which excel in such situations (Shen et al., 2018). This study was motivated by the growing availability of large databases of soil hydraulic properties and the current progress in ML tools.

The overarching aim of this work is to develop ML-based PTFs (ML-PTFs) for predicting Ks and to advance our quantitative knowledge of how soil structural indicators control hydraulic conductivity. The specific objectives of this study are to (1) develop robust ML-PTFs, (2) identify important soil variables that control Ks, and (3) analyze the effect of soil structural alteration on Ks. Because of limited data availability, we used only bulk density (ρb) and total organic carbon content (C) as indicators of soil structure. These two variables are routinely characterized and reported in soil surveys and would make our models consistent with the objective of PTFs “to translate data we have to data we need” (Bouma, 1989).

2 Background

2.1 Indicators of Soil Structure

Indicators of soil structure that can have a direct or indirect effect on Ks are summarized in Table 1. Conceptual and mechanistic understanding of how several of these factors influence hydraulic conductivity has been the subject of numerous studies over the past decades. However, much of this knowledge remains qualitative or constrained to only a narrow range of soils. Although there has been considerable progress in linking topology and morphology of pore space (e.g., acquired via X-ray computed tomography) with hydraulic properties, the majority of this work involves advanced computation that is comparable to direct measurement in terms of the required effort. Perhaps the most glaring challenge is that only a few of these properties are characterized routinely. Moreover, generalizable quantitative indicators of soil structure that can be directly linked with hydraulic conductivity are very few. Of the listed parameters, ρb and organic matter are the two most common indicators of soil structure used in predicting hydraulic properties (e.g., Arya & Paris, 1981; Gupta & Larson, 1979; Nemes et al., 2005; Vereecken et al., 1989).

Table 1. Quantitative Soil Structure Metrics and Their Significance to Hydraulic Conductivity

| Soil variable | Significance and mechanism | Example of studies |
| --- | --- | --- |
| Aggregation (size distribution and stability) | Indicates structure of macropores and mesopores. Aggregate strength, type, and compactness of aggregates. | Koekkoek and Booltink (1999) |
| Bulk density | Indicates packing compaction. Influences total porosity, pore size distribution, and connectivity. | Schaap et al. (2001) |
| Clay type and metal oxides | Dominate properties that affect aggregation and water retention. Type and concentration of clay influence structure through aggregation, swell, and shrink behavior, etc. | Rajkai and Varallyay (1992) |
| Fractal dimensions | Quantification of the heterogeneity, tortuosity, and connectivity of soil pore/solid space. | Bayat et al. (2013) and Huang and Zhang (2005) |
| Mechanical properties and shrink-swell parameters (coefficient of linear extensibility) | Indicate dynamic properties of structure. | Baumer (1992), McKenzie et al. (1991), and Watt et al. (1998) |
| Organic matter (organic matter/carbon content) | Influences compaction, bulk density, aggregation, porosity, and pore architecture. | Dexter et al. (2008), Nemes et al. (2005), and Rawls et al. (2003) |
| Particle surface area | Indicates particle and pore sizes. | Bayat et al. (2013) and Watt et al. (1998) |
| Penetration resistance | Indicates compactness and porosity. | Bayat and Ebrahim Zadeh (2018), Lipiec et al. (2009), and Watt et al. (1998) |
| Porosity metrics (mercury porosimetry or imaging methods: porosity, connectivity, pore-size distribution, and pore geometry) | Indicate actual pore architecture. | Otalvaro et al. (2016) and Romero and Simms (2008) |
| Water retention characteristics (characteristic water retention, retention curve fitting parameters, and S-index) | Indicates the size distribution of pores. | Dexter (2004), Koekkoek and Booltink (1999), and Rawls et al. (1982) |

2.1.1 Bulk Density

ρb is routinely measured during soil characterization and is used to estimate total porosity. Within a given textural class, variation in bulk density can be directly attributed to the degree of compactness (Hakansson & Lipiec, 2000) or aggregation (Aksakal et al., 2019). Therefore, ρb has been an essential variable in both physically based and empirical models of hydraulic conductivity (Assouline & Or, 2013, and references therein). It is well recognized that compaction of a given soil (hence, an increase in ρb) leads to a reduction in saturated hydraulic conductivity (Assouline, 2006). However, the interactions of bulk density with other soil physical-chemical characteristics (such as texture and organic matter content) in influencing hydraulic conductivity are too complex to be captured by classical regression analyses or physically based models. For example, Bogie et al. (2018) observed a significant decrease in infiltration rate (a proxy of hydraulic conductivity) along with a decrease in ρb following a decade of organic matter input to a sandy soil.

2.1.2 Organic Carbon Content

Soil organic matter content is another routinely measured soil property that has a less direct, yet important, control on soil structure. Organic matter content affects soil structure largely because of its influence on soil aggregation, aggregate stability, and associated porosity (Haynes & Beare, 1996; Hudson, 1994; Huntington, 2007). Generally, an increase in organic matter increases soil aggregate formation and aggregate stability and thus Ks (Hudson, 1994; Saxton & Rawls, 2006). Studies show that organic matter effects on soil hydraulic properties are similar to those of clay and high clay content reduces the effects of increased organic matter (Rawls et al., 2003; Saxton & Rawls, 2006). On the other hand, Dexter et al. (2008) found that organic matter substantially influences soil physical behavior (i.e., matrix and structural porosity) only when the clay content is above a threshold relative to C.

2.2 ML Algorithms

Numerous ML algorithms exist for multivariate regression modeling. In this study, we compared four popular ML algorithms: k-nearest neighbors (KNN), support vector regression (SVR), random forest (RF), and boosted regression tree (BRT).

Many studies have used these ML algorithms in different problems related to soil hydraulic properties. Several studies have used KNN-type algorithms to predict soil hydraulic properties (e.g., Botula et al., 2013; Nemes et al., 2006, 2008); Elshorbagy et al. (2010) identify KNN as an attractive modeling technique for hydrology applications. Many studies have used the SVR algorithm to model soil hydraulic properties (e.g., Angelaki et al., 2018; Kaingo et al., 2018; Kotlar et al., 2019; Mady & Shein, 2018; Singh et al., 2019). Recently, some studies have found that SVR models predicted soil hydraulic properties more accurately than artificial neural network models (Khlosi et al., 2016; Twarakavi et al., 2009; Zhang et al., 2018). Both the RF and BRT algorithms use an ensemble of regression trees as their base learners, and several studies have highlighted the power of these algorithms in predicting soil properties in general and hydraulic properties in particular. Hengl et al. (2017), for example, used RF and BRT among an ensemble of other models to build a global soil map. Chaney et al. (2019) similarly employed RF to build a map of predicted soil properties over the United States. Recently, Szabó et al. (2019) developed PTFs based on RF and BRT to map soil hydraulic properties across a watershed. Koestel and Jorda (2014) showed that the RF algorithm can be used to accurately model soil preferential solute transport. Jorda et al. (2015) used BRT models to predict Ks and explore important variables that control it.

2.2.1 K-Nearest Neighbors (KNN)

KNN is one of the simplest algorithms with respect to its underlying principle and, often, computational demand. Predictions for a new instance are made based on the average of the values of its k nearest (i.e., most similar) neighbors in the training data. Nearest neighbors are commonly identified by Euclidean distances in the predictor parameter space. The number of neighbors, k, is the only parameter to tune during the training of KNN models.
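The averaging rule described above can be illustrated with a minimal sketch. This is not the kknn implementation used in the study; the function name `knn_predict` and the toy numbers are made up for illustration, and the predictors are assumed to be already centered and scaled.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a target as the plain average over the k nearest training rows."""
    # Euclidean distance from the query point to every training point
    distances = [(math.dist(row, query), y) for row, y in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy, made-up data: two scaled predictors and a log-Ks-like target
train_X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]
prediction = knn_predict(train_X, train_y, (0.0, 0.1), k=3)
```

With k = 3, the distant fourth point is ignored and the prediction is the mean of the three nearby targets. A larger k smooths predictions, while a smaller k follows the training data more closely.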

2.2.2 Support Vector Regression (SVR)

SVR is an adaptation of the support vector machine for regression problems (Cortes & Vapnik, 1995; Drucker et al., 1997). The support vector machine is a generalization of the “maximal margin classifier.” The algorithm first maps the input variables into a high-dimensional space using a fixed mapping function (a kernel function). The algorithm then constructs hyperplanes, which are used for classification or, in the case of SVR, for regression. In this study, we use the Radial Basis Function kernel, which is one of the most commonly used kernels in SVR. Advantages of SVR include that it does not suffer from the problem of local minima and that it has few parameters to tune when training the model.

2.2.3 Random Forest (RF)

RF models are popular and relatively simple to train and tune (Hastie et al., 2009). They apply ensemble techniques by averaging a large number of individual decision tree-based models. Tree models are “grown” by searching, at each node, for the predictor and split that result in the smallest model error. The individual trees in an RF ensemble are built on bootstrapped training samples, and only a small group of predictor variables is considered at each split; this ensures that trees are decorrelated from each other (Breiman, 2001; James et al., 2013).

2.2.4 Boosted Regression Trees (BRT)

BRT, another form of decision tree model ensemble, enhances the model using the gradient boosting technique. The gradient boosting algorithm constructs additive regression models by sequentially fitting “simple base learner” functions (i.e., decision trees) to current pseudo-residuals at each iteration (Friedman, 2002). These pseudo-residuals are the negative gradient of the loss function being minimized. BRT models have shown considerable success and often outperform other ML algorithms (Elith et al., 2008; Natekin & Knoll, 2013). BRT models are also particularly adept at handling less-than-clean data (Friedman, 2001), which makes them attractive in our work, where the training data are compiled from various sources and different measurement methods and are therefore prone to some inconsistencies.
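The sequential fitting described above can be sketched for the simplest case: squared-error loss, for which the pseudo-residuals are the ordinary residuals, and one-split trees (stumps) on a single predictor. This is an illustrative toy, not the gbm implementation used in the study; the function names, the learning rate, the number of trees, and the data are all assumptions.

```python
def fit_stump(x, residuals):
    """Find the single split on x that minimizes the squared error of the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Additive model: start from the mean, then add shrunken stumps one at a time."""
    base = sum(y) / len(y)
    fitted = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared-error loss the pseudo-residuals are simply y minus the current fit
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        threshold, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((threshold, left_mean, right_mean))
        fitted = [fi + learning_rate * (left_mean if xi <= threshold else right_mean)
                  for xi, fi in zip(x, fitted)]
    return base, stumps, fitted

def mse(y, fitted):
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / len(y)

# Toy, made-up data with a step between x = 3 and x = 4
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
base, stumps, fitted = boost(x, y)
```

Each stump corrects only a fraction (the learning rate) of the remaining error, so the training error falls gradually as trees are added. In practice the number of trees, tree depth, and learning rate are tuning parameters.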

Tree-based models, both RF and BRT, have the advantage of being able to rank the relative importance of predictor variables. For a tree-based model, the approximate relative influence ($\hat{J}_j^2$) of a predictor variable $x_j$ is calculated by equation 1.
$$\hat{J}_j^2 = \sum_{\text{splits on } x_j} I_t^2 \tag{1}$$
where $I_t^2$ is the empirical improvement by splitting on predictor $x_j$ at that point. For tree ensemble models, relative importance is given by averaging the relative influence of variable $x_j$ across all trees of the model (Ridgeway, 2012).
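As a rough illustration of this bookkeeping, the sketch below sums the split improvements credited to each predictor within a tree and averages them across trees. The data structure (a list of `(predictor, improvement)` pairs per tree), the function name, and the rescaling of the result to sum to 100 are assumptions made for this example, not details taken from the study.

```python
from collections import defaultdict

def relative_influence(trees):
    """trees: list of trees, each given as a list of (predictor_name, improvement) splits."""
    totals = defaultdict(float)
    for splits in trees:
        for name, improvement in splits:
            # Credit the improvement of each split to the predictor it split on
            totals[name] += improvement
    # Average over all trees of the ensemble
    averaged = {name: total / len(trees) for name, total in totals.items()}
    # Assumed presentation choice: rescale so the influences sum to 100
    scale = sum(averaged.values())
    return {name: 100.0 * value / scale for name, value in averaged.items()}

# Toy, made-up ensemble of two small trees
trees = [
    [("d10", 4.0), ("clay", 1.0)],
    [("d10", 2.0), ("bulk_density", 1.0)],
]
influence = relative_influence(trees)
```

A predictor that is never used for splitting receives no credit and is absent from the result, which corresponds to zero influence.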

3 Methods

The methods section is organized into four sections. The first section describes the training data and data preprocessing for the ML training. The second section describes the model training and testing procedures. The third section describes the predictor variable importance analysis procedure. The fourth section describes the methods used to test the response of Ks to perturbations on soil structural variables.

3.1 Data Preparation

The data used for training and testing PTFs were derived from the USKSAT database (Pachepsky & Park, 2015). The database contains Ks along with textural and structural information for over 27,000 U.S. soils compiled from 45 data sets. For over 95% of the soils, Ks was measured using a constant head method on samples of approximately 5.5-cm length and 3-cm internal diameter. In addition to the USKSAT database, we also acquired a subset of the USKSAT soils directly from the Florida Soil Characterization Data, hereafter FLSOIL (University of Florida, n.d.). FLSOIL contains data for over 8,000 soils, which are also part of USKSAT but have additional variables of soil water content at 11 pressure heads. We used the FLSOIL subset to build separate models that utilize water retention data in order to evaluate the effect of water retention variables in estimating Ks.

3.1.1 Data Cleaning

From the USKSAT soils, we selected a subset with only 11 variables (see variables in Table 2). To prepare this subset for the ML procedure, we removed soils that either had missing data in one or more of the variables or contained values that met one of our exclusion criteria shown in equations 2a-2d.
urn:x-wiley:00431397:media:wrcr24057:wrcr24057-math-0004(2a)
urn:x-wiley:00431397:media:wrcr24057:wrcr24057-math-0005(2b)
urn:x-wiley:00431397:media:wrcr24057:wrcr24057-math-0006(2c)
urn:x-wiley:00431397:media:wrcr24057:wrcr24057-math-0007(2d)
Table 2. Summary of Cleaned USKSAT Data Variables

| Type | Variable (abbreviation) | Unit | Min. | Q1^a | Median | Mean | Q3^a | Max. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Measured | Saturated hydraulic conductivity (Ks) | loge(cm/hr) | -7.5 | 0.68 | 2.6 | 1.9 | 3.4 | 6.7 |
| Measured | Bulk density (ρb) | g/cm3 | 0.02 | 1.5 | 1.6 | 1.5 | 1.6 | 2.6 |
| Measured | Organic carbon content (C) | loge(%) | -4.6 | -2.3 | -1.5 | -1.4 | -0.4 | 2.9 |
| Measured | Clay fraction (Cl) | % | 0 | 1.3 | 3.1 | 8.7 | 12.7 | 93.4 |
| Measured | Silt fraction (Si) | % | 0 | 2 | 3.8 | 5.6 | 6.5 | 94.5 |
| Measured | Sand fraction (Sa) | % | 0.2 | 79.8 | 92.1 | 85.7 | 96.2 | 99.9 |
| Measured | Very coarse sand fraction (VCOS) | % | 0 | 0 | 0 | 0.3 | 0.2 | 19.6 |
| Measured | Coarse sand fraction (COS) | % | 0 | 0.6 | 2.1 | 3.8 | 5.1 | 44.6 |
| Measured | Medium sand fraction (MS) | % | 0 | 7.6 | 16.8 | 20.9 | 30.1 | 77.7 |
| Measured | Fine sand fraction (FS) | % | 0.1 | 35.0 | 49.6 | 50.3 | 65.6 | 97.4 |
| Measured | Very fine sand fraction (VFS) | % | 0 | 3.8 | 8.2 | 10.4 | 14.8 | 56.4 |
| Calculated | 10th percentile particle size (d10) | μm | 0.02 | 0.7 | 56.02 | 52.0 | 100.1 | 253.4 |
| Calculated | 50th percentile particle size (d50) | μm | 0.17 | 133.9 | 156.4 | 167.1 | 193.3 | 534.1 |
| Calculated | 60th percentile particle size (d60) | μm | 0.3 | 159.0 | 180.2 | 199.5 | 237.3 | 656.5 |
| Calculated | Coefficient of uniformity (CU) | loge(−) | 0.47 | 1.0 | 1.2 | 2.8 | 5.33 | 8.7 |
| Calculated | Complexed organic carbon (CX) | % | 0 | 0.1 | 0.1 | 0.2 | 0.3 | 5.19 |

  • a Q1 and Q3 are the first and third quartiles, respectively.

The exclusion criteria ensured that all soil texture fractions add up to 100% (within a 5% margin to account for possible significant digit and rounding issues). The criteria also ensured that there are no outlier bulk densities (equation 2c) and, in the FLSOIL database, that water retention does not increase with water tension. The resulting “cleaned” USKSAT database contained 18,644 soils. A summary of the cleaned USKSAT database is shown in Table 2.
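Two of the consistency checks described above lend themselves to a short sketch. The function names are hypothetical, and reading the "5% margin" as ±5 percentage points around 100 is an assumption; the bulk density criterion is not sketched because its thresholds are not stated in the text.

```python
def texture_sum_ok(clay, silt, sand, margin=5.0):
    """True if the three texture fractions (in %) add up to 100 within the margin."""
    # Assumption: the margin is interpreted as +/- 5 percentage points
    return abs(clay + silt + sand - 100.0) <= margin

def retention_is_consistent(water_contents):
    """True if water content never increases as tension increases.

    water_contents must be ordered from the lowest to the highest tension.
    """
    return all(wetter >= drier
               for wetter, drier in zip(water_contents, water_contents[1:]))

# Toy, made-up records
keep_texture = texture_sum_ok(12.7, 6.5, 79.8)
keep_retention = retention_is_consistent([0.40, 0.35, 0.35, 0.10])
```

A soil would be retained only if every applicable check passes and none of its variables is missing.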

From the FLSOIL database, we selected additional variables of volumetric water contents (θ) at 11 pressure heads (h), that is, 3.5 to 1,500 cm H2O. Measurements of Ks, ρb, and θ in the FLSOIL database were made in replicates of either two or three. We used the arithmetic means of these variables. The resulting cleaned FLSOIL database contained 5,985 soils.

The distribution of USKSAT soils across the U.S. Department of Agriculture textural classes and a summary of Ks by textural class are shown in Figure 1.

Figure 1. (a) Distribution of the cleaned USKSAT soils in U.S. Department of Agriculture textural classes and (b) ranges of Ks and percent database by textural class.

3.1.2 Computed Secondary Soil Variables

3.1.2.1 Particle Size Distribution

Particle size distribution has a strong influence on hydraulic conductivity, and variables of particle size distribution, particularly the 10th percentile particle size, have been used in several semiempirical models (e.g., Carrier, 2003, and references within). We calculated the 10th, 50th, and 60th percentile particle sizes (d10, d50, and d60, respectively) from soil textural fraction data. For this, we constructed a cumulative particle size distribution by linear interpolation between the seven texture size limits (2, 50, 100, 250, 500, 1,000, and 2,000 μm), with a very small diameter of 0.01 μm as the minimum size. We then calculated the d10, d50, and d60 particle sizes from the fitted distribution. The coefficient of uniformity of the particle size distribution was calculated as CU = d60/d10 (Skaggs et al., 2001).
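The percentile calculation can be sketched as follows. The text does not state whether interpolation is linear in diameter or in log diameter; this sketch assumes linear in diameter, and the function name and the example fractions are made up for illustration.

```python
# Size limits in micrometers: the assumed 0.01 minimum, then the upper limits of
# clay, silt, very fine, fine, medium, coarse, and very coarse sand
SIZE_LIMITS_UM = [0.01, 2, 50, 100, 250, 500, 1000, 2000]

def percentile_diameter(fractions, percent):
    """Interpolate the particle diameter at a cumulative percentage.

    fractions: clay, silt, VFS, FS, MS, COS, VCOS in %, summing to about 100.
    """
    cumulative = [0.0]
    for fraction in fractions:
        cumulative.append(cumulative[-1] + fraction)
    for i in range(1, len(cumulative)):
        low, high = cumulative[i - 1], cumulative[i]
        if low <= percent <= high and high > low:
            # Assumption: interpolation is linear in diameter within each size class
            weight = (percent - low) / (high - low)
            return SIZE_LIMITS_UM[i - 1] + weight * (SIZE_LIMITS_UM[i] - SIZE_LIMITS_UM[i - 1])
    raise ValueError("percent lies outside the cumulative distribution")

# Toy, made-up sandy soil
fractions = [5.0, 5.0, 10.0, 50.0, 20.0, 8.0, 2.0]
d10 = percentile_diameter(fractions, 10)
d50 = percentile_diameter(fractions, 50)
d60 = percentile_diameter(fractions, 60)
cu = d60 / d10
```

For this toy soil, 10% of the mass is finer than the silt limit, so d10 falls exactly on 50 μm, while d50 and d60 fall inside the fine sand class.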

3.1.2.2 Complexed Organic Carbon

The concept of complexed organic carbon (CX) was introduced by Dexter et al. (2008) to better describe the influence of organic matter on soil physical behavior. CX is the proportion of C that forms complexes with the clay fraction, and it is calculated with the assumption that 1 g of C is complexed with n g of clay mass. Thus, for sufficiently high clay content (Cl > n C) all the C can be complexed. CX is computed as
$$CX = \begin{cases} C, & \text{if } C < Cl/n \\ Cl/n, & \text{if } C \ge Cl/n \end{cases} \tag{3}$$

Dexter et al. (2008) found that n = 10 best described the physical behavior of their study soils, which were from French and Polish databases. We used the same ratio to calculate CX in our study.
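Under the stated assumption that 1 g of C complexes with n g of clay, CX is the smaller of C and Cl/n. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
def complexed_organic_carbon(carbon, clay, n=10.0):
    """Complexed organic carbon (%), given organic carbon and clay contents in %."""
    # All C is complexed when there is enough clay (clay > n * carbon);
    # otherwise the clay content limits the complexed amount to clay / n
    return min(carbon, clay / n)

cx_clay_rich = complexed_organic_carbon(carbon=1.0, clay=30.0)
cx_clay_poor = complexed_organic_carbon(carbon=2.0, clay=10.0)
```

In the clay-rich example all of the carbon is complexed, whereas in the clay-poor example only half of it is.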

3.2 Model Building

The overall procedure of building the ML models is illustrated in Figure 2. The computationally demanding steps of model training and testing were run using a high-performance computing cluster. The caret R package (Kuhn, 2017) was used to handle training and tuning procedures. The ML algorithms were implemented using the following R packages: KNN from the kknn package (Hechenbichler & Schliep, 2004), SVR from the kernlab package (Karatzoglou et al., 2004), RF from the randomForest package (Liaw & Wiener, 2015), and BRT from the gbm package (Ridgeway, 2017).

Figure 2. Flow chart of the model building process.

3.2.1 Data Preprocessing

Data preprocessing prior to model training included the following. The Ks and C values were log transformed in order to make data more normally distributed (as inspected visually from density plot and Q-Q plot). Zero values of C were replaced with a small number of 0.001 prior to the log transformation. The USKSAT and FLSOIL databases were then split 75-25% into training and testing data sets, respectively. Prior to modeling, all variables excluding Ks were centered to the variable's mean and scaled by the variable's standard deviation in the training data (equation 4):
$$x' = \frac{x - \bar{x}}{\sigma_x} \tag{4}$$

where $x'$ is the centered and scaled value of variable $x$; $\bar{x}$ and $\sigma_x$ are the respective arithmetic mean and standard deviation of the variable in the training data set. The testing data sets are centered and scaled using the same mean and standard deviation values of the training set. No strong correlation (R2 > 0.8) was detected among pairs of predictor variables except d50~d60 (R2 = 0.97) and Sand~Clay (R2 = 0.82).
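The preprocessing steps above can be sketched as follows. The key point is that the scaling statistics come from the training data only and are then reused for the test data. The function names and numbers are illustrative; the natural logarithm and the sample standard deviation (n − 1 denominator) are assumptions.

```python
import math
import statistics

def log_carbon(carbon_percent, floor=0.001):
    """Log-transform organic carbon, replacing zero values with a small number."""
    return math.log(carbon_percent if carbon_percent > 0 else floor)

def fit_scaler(train_values):
    """Return the mean and standard deviation of the training values."""
    # Assumption: sample standard deviation (n - 1 in the denominator)
    return statistics.mean(train_values), statistics.stdev(train_values)

def apply_scaler(values, mean, sd):
    """Center and scale values using previously fitted statistics."""
    return [(value - mean) / sd for value in values]

# Toy, made-up bulk densities (g/cm3)
train_bulk_density = [1.2, 1.5, 1.8]
test_bulk_density = [1.8]
mean, sd = fit_scaler(train_bulk_density)
scaled_train = apply_scaler(train_bulk_density, mean, sd)
scaled_test = apply_scaler(test_bulk_density, mean, sd)  # reuses training statistics
```

Reusing the training statistics keeps information about the test set out of the fitted model.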

3.2.2 Predictor Set Hierarchy

It is desirable to select the minimum subset of the predictors needed to construct a model without a substantial reduction in prediction accuracy. The selection of such subset of predictors—feature selection—is done with the objectives of (a) improving prediction accuracy; (b) reducing model complexity, which makes interpretation of the effects of predictors easier giving us a better understanding of the underlying processes; and (c) reducing the amount of input variables needed to use the model.

In addition to testing multiple ML algorithms, we also built and analyzed multiple models with different sets of predictors. Throughout this manuscript, we refer to individual models by the ML algorithm used and the set of input predictors it takes, that is, its predictor set hierarchy. We distinguish model hierarchy by appending a two-part numeric code separated by a hyphen where the first number denotes the number of textural variables the model uses (3, 7, or 10) and the second number denotes the number of structure related variables used (0, 1, or 2). The list of models by predictor set hierarchy is shown in Table 3. The lowest hierarchy model takes only the three textural size fractions. The three highest hierarchy models required variables of water retention and were trained on only the FLSOIL database. For these models only, we include a third numeric code, which represents the number of water retention variables included (either 1, 2, or 11).

Table 3. Predictor Set Hierarchy Codes and List of the Input Variables

| Hierarchy ID | Input variables | Database |
| --- | --- | --- |
| 3-0 | Cl, Si, Sa | USKSAT |
| 3-1 | Cl, Si, Sa, ρb | USKSAT |
| 3-2 | Cl, Si, Sa, ρb, C | USKSAT |
| 7-0 | Cl, Si, VFS, FS, MS, COS, VCOS | USKSAT |
| 7-1 | Cl, Si, VFS, FS, MS, COS, VCOS, ρb | USKSAT |
| 7-2 | Cl, Si, VFS, FS, MS, COS, VCOS, ρb, C | USKSAT |
| 10-2 | Cl, Si, VFS, FS, MS, COS, VCOS, d10, d50, CU, ρb, C | USKSAT |
| 7-2-1 | Cl, Si, VFS, FS, MS, COS, VCOS, ρb, C, θ(330) | FLSOIL |
| 7-2-2 | Cl, Si, VFS, FS, MS, COS, VCOS, ρb, C, θ(330), θ(15,000) | FLSOIL |
| 7-2-11 | Cl, Si, VFS, FS, MS, COS, VCOS, ρb, C, θ(3.5), θ(20), θ(30), θ(45), θ(60), θ(80), θ(100), θ(150), θ(200), θ(330), θ(15,000) | FLSOIL |

  • Abbreviations: ρb, bulk density; C, organic carbon content; Cl, clay fraction; COS, coarse sand fraction; CU, coefficient of uniformity; d10, 10th percentile particle size; d50, 50th percentile particle size; FS, fine sand fraction; MS, medium sand fraction; Sa, sand fraction; Si, silt fraction; VFS, very fine sand fraction; VCOS, very coarse sand fraction.

3.2.3 Model Training

The selection of optimal model parameters in the model training process, that is, model tuning, was done using k-fold cross validation. We used five-times repeated, tenfold cross validation to select optimum model parameters using a comprehensive grid search. Cross validation estimates the test error rate by holding out a subset of the training data (i.e., a validation set) from the fitting process and then applying the fitted model to predict the validation subset. In k-fold cross validation, the training data are randomly divided into k approximately equal subsets, and model fitting is repeated k times, each time treating a different subset as the validation set. This process allows the calculation of a validation set error rate, which estimates the test error rate (James et al., 2013).
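The resampling scheme can be sketched as an index generator. This only illustrates how the folds are formed; in the study the resampling was handled by the caret package, and the function name, seed, and sample size here are arbitrary.

```python
import random

def repeated_kfold(n_samples, k=10, repeats=5, seed=0):
    """Yield (training_indices, validation_indices) for repeated k-fold cross validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a new random partition for every repeat
        folds = [indices[i::k] for i in range(k)]  # k approximately equal subsets
        for held_out in range(k):
            validation = folds[held_out]
            training = [index
                        for fold_number, fold in enumerate(folds)
                        if fold_number != held_out
                        for index in fold]
            yield training, validation

splits = list(repeated_kfold(n_samples=23, k=10, repeats=5))
```

For each candidate parameter combination in the grid, a model would be fitted on every training subset and scored on the matching validation subset, and the combination with the lowest average validation error would be selected.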

3.2.4 Model Assessment

The final performance of models was assessed on the separate hold-out test data set that was not used in the model training. The performance of models is measured in terms of root mean squared log-transformed error (RMSLE), mean log-transformed error (MLE), and the coefficient of determination (R2) determined as follows:
$$\mathrm{RMSLE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log_{10}\hat{K}_{s,i} - \log_{10}K_{s,i}\right)^{2}} \tag{5}$$
$$\mathrm{MLE} = \frac{1}{N}\sum_{i=1}^{N}\left(\log_{10}\hat{K}_{s,i} - \log_{10}K_{s,i}\right) \tag{6}$$
$$R^{2} = 1 - \frac{\sum_{i=1}^{N}\left(\log_{10}K_{s,i} - \log_{10}\hat{K}_{s,i}\right)^{2}}{\sum_{i=1}^{N}\left(\log_{10}K_{s,i} - \overline{\log_{10}K_{s}}\right)^{2}} \tag{7}$$
where $N$ is the number of observations, $K_s$ is the measured value, $\hat{K}_s$ is the predicted value, and $\overline{\log_{10}K_s}$ is the mean of the (log-transformed) measured values.

The RMSLE indicates the average deviation of predictions from the measured values, with smaller values indicating better performance. The MLE measures systematic bias; positive or negative values indicate the average tendency of the predicted values to be larger or smaller than the measured values, respectively. The R2 indicates correspondence between predicted and measured data, with higher values indicating stronger correspondence.
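The three measures can be computed as in the sketch below. Two details are assumptions rather than statements from the text: the base-10 logarithm (chosen to match the log10(cm/day) units reported in the abstract) and the computation of R2 on log-transformed values as one minus the ratio of residual to total sum of squares. The function name and data are made up.

```python
import math

def evaluate(measured, predicted):
    """Return (RMSLE, MLE, R2) for paired measured and predicted Ks values."""
    log_measured = [math.log10(value) for value in measured]
    log_predicted = [math.log10(value) for value in predicted]
    n = len(log_measured)
    # Error defined as predicted minus measured, so a positive MLE means overprediction
    errors = [p - m for p, m in zip(log_predicted, log_measured)]
    squared_error = sum(error ** 2 for error in errors)
    rmsle = math.sqrt(squared_error / n)
    mle = sum(errors) / n
    mean_measured = sum(log_measured) / n
    total_variation = sum((m - mean_measured) ** 2 for m in log_measured)
    r2 = 1.0 - squared_error / total_variation
    return rmsle, mle, r2

# Toy, made-up values: the first prediction is one order of magnitude too high
measured = [1.0, 10.0, 100.0]
predicted = [10.0, 10.0, 100.0]
rmsle, mle, r2 = evaluate(measured, predicted)
```

Because the errors are taken on a log scale, an RMSLE of 0.3 to 0.4 corresponds to typical deviations of roughly a factor of 2 to 2.5 in Ks.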

3.2.5 Comparison With Other PTFs

Further evaluation of our models was done by comparing them with 10 other PTF models frequently cited in the literature (Abdelbaki et al., 2009; Ghanbarian et al., 2017). The required predictors for each of the alternative models tested are given in Table 4. The Nemes et al. (2005) model is the only one of the alternative models that uses organic matter content. We converted C to organic matter as OM = C × 1.724. This conventional conversion ratio is not accurate for all soils (Pribyl, 2010), but we considered it good enough for our purpose after not observing meaningful differences in the overall performance when using different ratios ranging between 1.5 and 2. Porosity values required by the Brakensiek et al. (1984) and Saxton et al. (1986) models were calculated from ρb as ϕ = 1 − (ρb/2.65). The Rosetta-3 model was run using the Python code made available by Zhang and Schaap (2017); we implemented the rest of the models in R. The R code to calculate all the alternative PTFs we tested in this study is available online at https://github.com/saraya209/soil_ksat.
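The two conversions used to prepare inputs for the alternative PTFs are simple enough to state as code. The function and constant names are hypothetical; the values 1.724 and 2.65 g/cm3 are those given in the text.

```python
OM_PER_CARBON = 1.724          # conventional organic carbon to organic matter ratio
PARTICLE_DENSITY_G_CM3 = 2.65  # particle density implied by the porosity formula

def organic_matter(carbon_percent, ratio=OM_PER_CARBON):
    """Convert organic carbon content (%) to organic matter content (%)."""
    return carbon_percent * ratio

def porosity(bulk_density_g_cm3):
    """Total porosity (dimensionless) from bulk density (g/cm3)."""
    return 1.0 - bulk_density_g_cm3 / PARTICLE_DENSITY_G_CM3

om = organic_matter(1.0)
phi = porosity(1.325)
```

Passing a different `ratio` (for example, values between 1.5 and 2) reproduces the kind of sensitivity check on the conversion factor mentioned above.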

Table 4. Some Proposed PTFs Frequently Used in the Literature to Estimate Ks and Their Required Input Variables

| Reference | Model name | Required input variables |
| --- | --- | --- |
| Zhang and Schaap (2017) | Rosetta-3 (H3) | Cl, Si, Sa, ρb |
| Ghanbarian et al. (2015) | SHC2 | Cl, Si, Sa, ρb, L, D^a |
| Nemes et al. (2005) | | Cl, Sa, ρb, C, OM^b |
| Campbell and Shiozawa (1994) | | Cl, Si |
| Dane and Puckett (1994) | | Cl |
| Jabro (1992) | | Cl, Si, ρb |
| Saxton et al. (1986) | | Cl, Sa, ϕ |
| Puckett et al. (1985) | | Cl |
| Brakensiek et al. (1984) | | Cl, Sa, ϕ^c |
| Cosby et al. (1984) | | Cl, Si |

  • Abbreviations: ρb, bulk density; C, organic carbon content; Cl, clay fraction; PTFs, pedotransfer functions; Sa, sand fraction; Si, silt fraction.
  • a L = sample length (cm) and D = sample internal diameter (cm).
  • b OM = organic matter content (%).
  • c ϕ = porosity (-).

3.3 Predictor Variable Importance

The predictor variable importance is the statistical significance of each predictor variable with respect to its effect on the generated model. For the tree-based models, RF and BRT, variable importance is calculated internally within the model algorithm (equation 1). For the rest of the ML models, we calculated the predictor variable importance by the recursive feature elimination method, which recursively removes predictors before training a model and evaluates the change in model performance. To account for possible bias in variable subset selection (Ambroise & McLachlan, 2002; Hastie et al., 2009), we wrapped the entire sequence of modeling steps in a separate layer of 10-fold cross-validation.
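The elimination loop can be sketched in a few lines of Python. This is a simplified backward-elimination variant written for illustration, not the procedure used in the study: `score_fn` is a hypothetical callable that would train a model on a predictor subset and return its error, and the outer 10-fold cross-validation layer is omitted. The toy scorer and its weights are invented so that the example runs on its own.

```python
def recursive_feature_elimination(features, score_fn):
    """Rank predictors by backward elimination.

    score_fn(subset) must return an error (lower is better) for a model
    trained with only the predictors in `subset`. Returns the predictors
    in the order they were dropped, least important first.
    """
    remaining = list(features)
    dropped = []
    while len(remaining) > 1:
        # Error obtained when each remaining predictor is left out in turn.
        trial_error = {
            f: score_fn([g for g in remaining if g != f]) for f in remaining
        }
        # The predictor whose removal hurts least is the least important.
        least_useful = min(remaining, key=lambda f: trial_error[f])
        remaining.remove(least_useful)
        dropped.append(least_useful)
    dropped.append(remaining[0])
    return dropped


# Hypothetical stand-in for "train a model and measure its error":
# each predictor removes a fixed share of the error when present.
TOY_WEIGHTS = {"clay": 0.5, "rho_b": 0.3, "C": 0.1}


def toy_score(subset):
    return 1.0 - sum(TOY_WEIGHTS[f] for f in subset)


ranking = recursive_feature_elimination(["clay", "rho_b", "C"], toy_score)
print(ranking)
```

In practice each call to `score_fn` retrains a model, so the cost grows quickly with the number of predictors.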

3.4 Sensitivity of Ks to Structural Perturbations

We used the ML-PTFs developed in this study to test the sensitivity of Ks to perturbations of soil structural variables (i.e., ρb and C). The sensitivity of Ks was analyzed by using our best performing model (BRT-7-2) to predict the marginal effect of varying one of the variables while keeping the others constant (Hastie et al., 2009; Hochachka et al., 2007; James et al., 2013). The interactive effect of ρb and C perturbations on Ks was similarly analyzed by varying both structural variables together. Because the effects of ρb and C on the morphology and topology of soils are likely to depend on soil texture, the partial dependence relationships were analyzed separately for each textural class.

To construct partial dependence relationships, we randomly selected 100 soils, with replacement, from each textural class of the cleaned USKSAT data set. The BRT-7-2 model was used to predict the Ks of each sample while incrementally perturbing one of the two structural variables. To analyze the effect of ρb, the ρb value of each soil was incrementally varied from 0.5 to 2 g/cm3 while keeping the other variables of that soil constant.
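The ρb perturbation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the trained BRT-7-2 model, the dictionary keys and the number of grid points are invented, and only the 0.5 to 2 g/cm3 range and the sampling of 100 soils with replacement come from the text.

```python
import random


def linspace(start, stop, count):
    """Evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


def sample_with_replacement(soils, k, rng):
    """Draw k soils with replacement from one textural class."""
    return rng.choices(soils, k=k)


def perturb_variable(soil, name, grid, predict):
    """Predict over a grid of values for one variable, others held constant."""
    curve = []
    for value in grid:
        perturbed = dict(soil)  # copy so the original soil is untouched
        perturbed[name] = value
        curve.append((value, predict(perturbed)))
    return curve


def toy_predict(soil):
    """Hypothetical model returning log10(Ks); not the fitted BRT model."""
    return 3.0 - 2.0 * soil["rho_b"] + 0.01 * soil["sand"]


rng = random.Random(0)
soils = [{"rho_b": 1.4, "sand": 80.0}, {"rho_b": 1.6, "sand": 60.0}]
sampled = sample_with_replacement(soils, 100, rng)
rho_grid = linspace(0.5, 2.0, 4)
curve = perturb_variable(sampled[0], "rho_b", rho_grid, toy_predict)
print(curve)
```

Repeating `perturb_variable` for every sampled soil gives one response curve per soil, which is what the per-texture panels summarize.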

Partial dependence of Ks on C was determined similarly by incrementally varying the C of each soil from 0.03% to 10%. When perturbing the values of C, however, we also changed the values of ρb according to a linear correlation equation that we developed between ρb and log(C) for each textural class based on the USKSAT data set (Figure S1 in the supporting information). We did this to account for the observed relationship between C and ρb. For each incremental change in C, we set ρb to a normal random variate drawn using the mean and variance of the linear correlation fit.
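A minimal Python sketch of this coupled perturbation is given below. The fit coefficients, dictionary keys, and `toy_predict` are hypothetical, and the logarithm base is assumed to be 10 because the text does not state it; the actual per-texture fits are those shown in Figure S1.

```python
import math
import random


def sample_rho_b(c_percent, intercept, slope, residual_sd, rng):
    """Draw rho_b from a normal distribution centered on the linear fit.

    Assumed fit form: rho_b = intercept + slope * log10(C).
    """
    mean_rho_b = intercept + slope * math.log10(c_percent)
    return rng.gauss(mean_rho_b, residual_sd)


def perturb_carbon(soil, c_grid, fit, predict, rng):
    """Vary C over a grid and update rho_b to stay consistent with C."""
    curve = []
    for c_percent in c_grid:
        perturbed = dict(soil)  # leave the original soil unchanged
        perturbed["C"] = c_percent
        perturbed["rho_b"] = sample_rho_b(
            c_percent, fit["intercept"], fit["slope"], fit["residual_sd"], rng
        )
        curve.append((c_percent, predict(perturbed)))
    return curve


def toy_predict(soil):
    """Hypothetical model returning log10(Ks); not the fitted BRT model."""
    return 3.0 - 2.0 * soil["rho_b"] + 0.1 * math.log10(soil["C"])


# Hypothetical fit for one textural class.
fit = {"intercept": 1.5, "slope": -0.2, "residual_sd": 0.05}
soil = {"C": 1.0, "rho_b": 1.4}
c_grid = [0.03, 0.1, 1.0, 10.0]
curve = perturb_carbon(soil, c_grid, fit, toy_predict, random.Random(1))
print(curve)
```

Drawing ρb from a distribution rather than using the fitted mean keeps some of the scatter seen in the data, at the cost of noisier individual curves.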

We also analyzed the interactive effect of both structural variables on Ks by perturbing both ρb and C simultaneously for each of the 100 sampled soils. In this analysis, ρb and C were varied independently.
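Varying the two structural variables independently amounts to predicting over every combination of ρb and C values. The following Python sketch shows one way to build such a grid; `toy_predict`, the dictionary keys, and the grid values are hypothetical and stand in for the trained model and the perturbation steps used in the study.

```python
import math
from itertools import product


def interaction_surface(soil, rho_grid, c_grid, predict):
    """Predict Ks for every (rho_b, C) pair, other variables held constant."""
    surface = {}
    for rho_b, c_percent in product(rho_grid, c_grid):
        perturbed = dict(soil, rho_b=rho_b, C=c_percent)  # copy with overrides
        surface[(rho_b, c_percent)] = predict(perturbed)
    return surface


def toy_predict(soil):
    """Hypothetical model returning log10(Ks); not the fitted BRT model."""
    return 3.0 - 2.0 * soil["rho_b"] + 0.1 * math.log10(soil["C"])


soil = {"rho_b": 1.4, "C": 1.0, "sand": 80.0}
surface = interaction_surface(soil, [0.5, 1.0, 1.5, 2.0], [0.1, 1.0, 10.0], toy_predict)
print(len(surface))
```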

4 Results and Discussion

4.1 Model Performances

The model performance analyses are organized by the ML algorithm and the predictor set hierarchies.

The model performances, in terms of RMSLE, for all combinations of learning algorithms and predictor set hierarchies are shown in Figure 3. These performance tests were conducted on the test set, which includes 4,661 soils (25% of the cleaned USKSAT data set). The BRT algorithm consistently outperformed the other learning algorithms across all predictor set hierarchies, with the RF algorithm close behind. Performance for all ML algorithms generally increased with the number of predictors used. The one exception was the KNN algorithm, whose performance decreased when ρb was included (from KNN-3-0 to KNN-3-1). Including sand subclass fractions led to a large improvement in model performance across all learning algorithms. Using the seven textural size fractions instead of only three improved performance by a larger proportion than including the ρb or C variables with only three textural size fractions. The final tuning parameters (coefficients) of the BRT models are given in Table S1.

Figure 3. Model performance in terms of root mean squared log-transformed error by machine learning algorithm type and number of predictor variables used (see Table 3). BRT-7-2 and BRT-10-2 are the best models with the lowest prediction error.

4.2 Comparison With Other PTFs

One-to-one scatter plot comparisons of four of our models with the alternative models we tested are shown in Figure 4. The predicted and measured values are assigned to the x axis and y axis, respectively, as recommended by Piñeiro et al. (2008). Our models outperformed all 10 alternatives. The revised version of the popular PTF, Rosetta-3 (Zhang & Schaap, 2017), showed the best performance among all alternative models we tested. Based on the MLE and the 1:1 plot, the Rosetta-3 model tends to slightly overestimate Ks, particularly at lower magnitudes. The BRT-3-1 model is equivalent to the Rosetta-3 model we tested in terms of the predictor variables required to run the model (i.e., Sa, Si, Cl, and ρb). In terms of RMSLE and MLE, as well as the data clouds on the 1:1 line, all hierarchies of our models showed good performance across the full range of Ks values. The distribution of the residuals appears similar for all hierarchies of our models, with increased performance (tighter distribution around the 1:1 line) for larger magnitudes of Ks.

Figure 4. Comparisons between measured and model-predicted saturated hydraulic conductivities for the testing data set (n = 4,661). Top row panels show 1:1 comparisons of predictions made using four different input hierarchy boosted regression tree models we developed. The remaining ten panels show predictions made using other commonly used pedotransfer function models. Note that the number of samples is 4,540 for the evaluation of the Jabro (1992) model (samples with either zero silt or zero clay content were removed); 4,562 for the Nemes et al. (2005) model (samples with 0-cm/day prediction were removed); and 4,650 for the evaluation of the Rosetta-3 model (samples with NA prediction were removed). The color scale denotes the density of the points estimated by 2-D kernel density estimation of the values.

The distribution of residuals of the best performing BRT-7-2 model is shown in Figure 5. The spread of the residuals is larger for predictions of smaller magnitude. Most of the predictions fall within an order of magnitude of the measured values, and only very few predictions are off by more than two orders of magnitude. The performance within each textural class also showed that all hierarchies of our models performed better than the alternative models we tested. Figure 6 shows the performances of four selected hierarchies of our model and Rosetta-3 within the soil textural classes. Performance statistics of our best model (BRT-7-2) by textural class are given in Table 5. The RMSLE for the model ranged from 0.147 for the silt group to 0.653 for the clay loam group. R2 ranged from 0.497 for the sandy clay group to 0.996 for the silty clay loam group.

Figure 5. Residuals versus predicted values for the best performing BRT-7-2 model (n = 4,661).
Figure 6. Model performances on USKSAT test data set by textural class.
Table 5. Best Model (BRT-7-2) Performance Metrics by Textural Classes

| Texture class | Count of soils | RMSLE | R2 | MLE |
| --- | --- | --- | --- | --- |
| Clay | 83 | 0.504 | 0.634 | 0.028 |
| Silty clay | 0 | - | - | - |
| Sandy clay | 89 | 0.598 | 0.497 | -0.044 |
| Silty clay loam | 3 | 0.369 | 0.996 | -0.184 |
| Clay loam | 25 | 0.653 | 0.529 | -0.247 |
| Sandy clay loam | 532 | 0.483 | 0.673 | 0.025 |
| Silt | 6 | 0.147 | 0.968 | -0.076 |
| Silt loam | 9 | 0.485 | 0.741 | 0.208 |
| Loam | 20 | 0.516 | 0.732 | 0.049 |
| Sandy loam | 500 | 0.391 | 0.763 | 0.017 |
| Loamy sand | 443 | 0.296 | 0.844 | 0.001 |
| Sand | 2,951 | 0.181 | 0.875 | 0.0001 |
| Overall | 4,661 | 0.295 | 0.900 | 0.004 |

  • Note. Bold numbers indicate the largest and smallest values within each metric.
  • Abbreviations: MLE, mean log-transformed error; RMSLE, root mean squared log-transformed error.

4.3 Predictor Variable Importance

The two best performing ML algorithms, RF and BRT, showed a similar ranking of variable importance. The relative variable importance rankings for the best performing models are shown in Figure 7.

Figure 7. Relative importance ranking of the top eight predictors for three different hierarchy models.

For the best performing model (BRT-7-2), the most important predictor was clay mass percent, followed by ρb and C. The dominance of clay content as the most important variable, even though 73% of the training data was classified as sand or loamy sand with <15% clay, highlights the disproportionate importance of the fine particles to Ks. It is also notable that the two structural indicators (ρb and C) were ranked as the second and third most important variables. When the variable d10 is included (BRT-10-2), it overtakes clay and becomes overwhelmingly the most important predictor. However, including d10 did not improve the model performance in terms of RMSLE, which suggests that the ML algorithm is able to “learn” the importance of d10 from the raw textural size data from which the d10 parameter is calculated. Including CX did not lead to model improvement, and the CX variable was ranked as the least important. Models that include the CX variable are hence not included in this paper.

4.3.1 Importance of Water Retention Variable

The analysis of the water retention variable was done using models trained only on the FLSOIL database, a much smaller database that nevertheless includes water retention values. Models trained only on the FLSOIL database had lower performance than those trained using the entire USKSAT database. To compare the variable importance of water retention on a relative basis, we trained models of similar hierarchies on the FLSOIL database. In terms of variable importance, water retention at field capacity (θFC) was the second most important variable, preceded only by clay content (Figure 7). The addition of a single water retention variable (i.e., θFC) led to a relatively large improvement in model performance, with an RMSLE drop of 13% from 0.49 to 0.42 (Figure 8). Water content at field capacity is a strong indicator of soil structure, and its importance in predicting Ks in our models highlights the importance of structure to Ks.

Figure 8. Performance of boosted regression tree models trained with FLSOIL database only. White bars represent models that include water retention variables (see Table 3).

4.4 Prediction Interval

Providing uncertainty estimates in PTF predictions is important to assess the reliability of estimates (Schaap & Leij, 1998). Uncertainty estimates are also essential information in most applications, such as use in land surface models (Baroni et al., 2017; Chaney et al., 2019; Folberth et al., 2016; Van Looy et al., 2017). Prediction intervals can be estimated by building an ensemble of models. The RF algorithm is an ensemble of regression trees, and prediction intervals can easily be calculated from the variance of the ensemble trees. Although the BRT models slightly outperformed the RF models, the possibility of producing a prediction interval may make the RF models a more appealing choice in circumstances where knowledge of the prediction uncertainty is essential, such as simulations of long-term trends in soil processes using land surface models. Figure 9 demonstrates the prediction intervals from the best performing RF model (RF-7-2). The figure shows, for each textural class, histograms of the deviations from measured values of 500 individual tree predictions for 100 randomly selected soils. The 75th percentile prediction intervals for almost all soils fall within an order of magnitude of the mean prediction.
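As a rough illustration of how an interval can be read off an ensemble, the Python sketch below summarizes a list of per-tree predictions for a single soil. It uses empirical quantiles and interprets the interval as the central 75% of the tree predictions; both choices are assumptions made for this example and may not match how the intervals in the study were computed. The simulated tree predictions are invented.

```python
import math
import random


def quantile(sorted_values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = int(math.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def ensemble_interval(tree_predictions, coverage=0.75):
    """Return (mean, lower, upper) for the central `coverage` share of trees."""
    ordered = sorted(tree_predictions)
    tail = (1.0 - coverage) / 2.0
    mean = sum(ordered) / len(ordered)
    return mean, quantile(ordered, tail), quantile(ordered, 1.0 - tail)


# Hypothetical log10(Ks) predictions from 500 trees for one soil.
rng = random.Random(42)
tree_predictions = [rng.gauss(2.0, 0.3) for _ in range(500)]
mean, low, high = ensemble_interval(tree_predictions)
print(mean, low, high)
```

Because the predictions are in log10 units, an interval narrower than 1 unit on each side of the mean corresponds to staying within an order of magnitude of the mean prediction.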

Figure 9. Histogram of Ks prediction deviations from the measured values for a subset of 500 individual regression trees that make up the RF-7-2 model. Predictions are made on 100 soils selected randomly with replacement from each textural class of the USKSAT test data set.

4.5 Sensitivity of Ks to Structural Perturbations

4.5.1 Bulk Density

Figure 10 shows the change of Ks prediction with ρb. Ks decreased with an increase in ρb, and a more uniform pattern is apparent when soils are grouped by textural class. The pattern of Ks change appears to follow an inverted S-curve. To enable a more quantitative description of the sensitivity, we fitted a Ks~ρb logistic curve within each texture (equation 8),
[Equation 8: logistic function of ρb with midpoint ρb,0 and slope kρ] (8)
where ρb,0 is the midpoint (point of inflection) of the curve and kρ is the slope of the curve at ρb = ρb,0. The complete parameters of the logistic curve fits are given in Table S2. On average across all textural classes, the maximum change in Ks occurs at ρb,0 ≈ 1.8 g/cm3. The steepness of the curve averaged kρ ≈ −2 and ranged from kρ = −1.1 for silt loam soils to the steepest rate of kρ = −13.28 for sandy clay soils. The trend of silty clay loam was not well approximated by a logistic curve, with its inflection point falling outside the range; this is likely because the number of records in this class was too small to generalize to a trend for the texture class.
Figure 10. Predicted Ks changes across ρb for 100 randomly selected soils from soil textural groups. Black trend lines show logistic curve fit.

4.5.2 Organic Carbon Content

The relation between C content and Ks was difficult to discern and only became apparent when the relationships were plotted separately for each soil textural class (Figure 11). Nemes et al. (2005) have also noted this lack of a generalizable explanatory trend between C and Ks.

Figure 11. Predicted Ks changes across C for 100 randomly selected soils from soil textural groups. The black trend lines are logistic curve fits.

The Ks across all the textural classes except for the two coarsest classes (loamy sand and sand) increased with an increase in C content. This trend is consistent with C being associated with structure development (e.g., aggregation and formation of biopores and macropores), which increases the overall permeability of soils. For sand and loamy sand soils, the apparent slowdown and even reversal of the trends at higher C contents (Figure 11) may be due to the inherently high proportion of large pores in soils of these textural groups. This trend may also suggest that an increase in C content is associated with the shrinking of larger pores in these coarse-textured soils by increased aggregation. A similar effect of C in reducing the Ks of sandy soils while increasing that of finer textured soils was observed by Nemes et al. (2004, 2005); for soils with 50% sand and clay content between ~25% and 45%, they reported lower Ks predictions for soils with higher C (C = 5% compared to those with 1% ≤ C ≤ 3%), which led them to conclude that the relationship between C and Ks is very complex.

To enable quantitative analysis of the sensitivity, we fitted Ks~C logistic curves to each texture class individually as
[Equation 9: logistic function of C with midpoint C0 and slope kC] (9)
where C0 is the midpoint (point of inflection) of the curve and kC is the slope of the curve at C = C0. The complete parameters of the logistic curve fits are given in Table S3. The trend of Ks change with C generally appears weaker than that with ρb. For the loamy sand and sand soil groups, however, the relationship is notably different, with a decrease in Ks at C ≥ 3%.
To visualize the combined effect of ρb and C changes on Ks, we plotted a two-dimensional heatmap of Ks shown in Figure 12. The heatmap shows normalized Ks predictions across ρb and C changes of 100 randomly selected soils for each textural class. The predicted Ks were normalized to range from 0 to 1 for each soil using equation 10.
$$K_s^{*} = \frac{K_s - K_{s,\min}}{K_{s,\max} - K_{s,\min}} \tag{10}$$
where $K_s^{*}$ is the normalized Ks and $K_{s,\min}$ and $K_{s,\max}$ are the minimum and maximum Ks values of the soil.
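The per-soil normalization described above is a standard min-max rescaling, which can be sketched in Python as follows. The function name and sample values are hypothetical, and returning zeros for a soil whose predictions do not vary is a choice made for this example only.

```python
def normalize_min_max(ks_values):
    """Rescale one soil's predicted Ks values to the range 0 to 1."""
    ks_min = min(ks_values)
    ks_max = max(ks_values)
    if ks_max == ks_min:
        # A flat response carries no contrast; map it to zeros.
        return [0.0 for _ in ks_values]
    return [(ks - ks_min) / (ks_max - ks_min) for ks in ks_values]


# Hypothetical predictions for one soil across a perturbation grid.
normalized = normalize_min_max([12.0, 30.0, 48.0])
print(normalized)
```

Normalizing within each soil puts soils with very different absolute Ks on a common scale, so the heatmap reflects the shape of the response rather than its magnitude.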
Figure 12. Heatmap of predicted Ks values (scaled log_e(cm/hr)) across ρb and C changes for 100 randomly selected soils from soil textural groups. Ks values in the heatmap have been smoothed with LOESS for clarity of display.

For all the classes except loamy sand and sand, the highest and lowest Ks are in the top-left and bottom-right corners of the heatmap (Figure 12), respectively. The trends were markedly different for the coarse textures, for which the lowest Ks are in the top-right corner. These observations suggest that the effects of C on soil structure and ultimately Ks are not masked by changes in ρb.

5 Conclusions

We developed ML-based PTFs that were trained on a large soil database (USKSAT). We tested four popular ML algorithms (KNN, SVR, RF, and BRT) with a range of predictor set hierarchies. The BRT models outperformed the other ML algorithms, closely followed by the RF models. The best performing BRT model has a prediction accuracy for log10(Ks) of RMSLE = 0.295. This RMSLE is about 50% lower than the reported accuracy of the revised version of the popular Rosetta model (Rosetta-3), RMSLE = 0.6 (Zhang & Schaap, 2017). The accuracy achieved by our models in predicting Ks is far higher than that of any other PTF model we are aware of.

In terms of relative importance, d10 was by far the most important predictor of Ks, followed by clay content. However, removing d10 did not reduce model performance, suggesting that the algorithms were able to learn the effect of d10 from the raw textural data. Following these two textural variables, the most important variables were ρb and C content, which highlights the importance of these structural variables to Ks. The potential impact of structural perturbations on Ks was illustrated by the functional relationships between the structural variables and Ks. We observed that the effects of structural perturbation on Ks varied with textural class. Generally, Ks decreased with an increase in ρb, with maximum sensitivity at ρb ≈ 1.8 g/cm3. The effect of C perturbation on Ks was more complex. For all textural classes except loamy sand and sand, increasing C led to an increase in Ks, whereas for these two coarsest classes, increasing C reduced Ks. These trends may suggest that C-induced aggregation increases the relative proportion of large pores (inter-aggregate pores) in fine- and medium-textured soils, whereas in sandy soils, aggregation increases the proportion of fine intra-aggregate pores. These functional relationships demonstrate the potential for such models to be incorporated in land-surface and soil-systems models and used to account for the sensitivity of infiltration and water flow to soil structural alterations and disturbances (e.g., organic matter accumulation or tillage).

Nomenclature

Acronyms

  • BRT: boosted regression trees
  • KNN: k-nearest neighbors
  • ML: machine learning
  • ML-PTF: machine learning-based pedotransfer function
  • PTF: pedotransfer function
  • RF: random forest
  • SVR: support vector regression

Variables

  • ρb: bulk density
  • CX: complexed organic carbon
  • C: organic carbon
  • Ks: saturated hydraulic conductivity
  • MLE: mean log-transformed error
  • RMSLE: root-mean-squared log-transformed error

Acknowledgment

We thank Yakov A. Pachepsky (Environmental Microbial and Food Safety Lab, USDA) for making the USKSAT data available to us. We also thank Attila Nemes (Norwegian Institute for Bioeconomy Research) for his thorough review of the manuscript and insightful comments. We gratefully acknowledge computing time on the Multi-Environment Computer for Exploration and Discovery (MERCED) cluster at UC Merced, which was funded by National Science Foundation Grant No. ACI-1429783. The raw and cleaned versions of the data sets used (USKSAT and FLSOIL), the R scripts used to generate and analyze models, and the final PTF models presented in this paper are available online at https://github.com/saraya209/soil_ksat.